Measuring happiness

  1. Comparing outcomes
  2. Subjective well-being
  3. A brief history of measuring happiness
  4. Validity and reliability
  5. Comparing individuals
  6. Well-being adjusted life years (WALYs)
  7. The problem with health metrics (QALYs and DALYs)
Validity and reliability

How confident can we be that subjective well-being (SWB) measures are accurate? Do they succeed in measuring what they set out to measure?

In the social sciences, the accuracy of a measure is usually assessed in terms of its validity and reliability. 

Validity refers to whether the measure captures the underlying concept that it purports to measure. Suppose I try to measure your height by weighing you on a set of bathroom scales. The scales might be a valid measure of weight but it’s clear, I hope, they are not a valid measure of height. 

Reliability is about whether the measure gives consistent results in identical circumstances (i.e. it has a high signal-to-noise ratio). If my scales produce a random number every time I step on them, they are not reliable. 

Reliability is necessary but not sufficient for validity. If you used a normal, non-broken set of scales to measure your height, they would give you the same reading each time, and so be reliable (assuming your weight doesn’t fluctuate), but they still wouldn’t be valid. The reliability and validity of SWB scales have been covered at great length in OECD (2013) and elsewhere. The following sections provide a summary of the key points.

Assessing reliability

Reliability can be assessed in two ways: 

  1. Internal consistency - whether the items within a multi-item scale correlate with one another, or whether different scales of the same measure correlate.
  2. Test-retest reliability - where the same question is given to the same respondent more than once, at different times. Note that if the quantity in question genuinely does change between measurements, we would expect test-retest reliability to be low even for a good measure. (A rough sketch of both checks follows this list.)
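
To make these two checks concrete, here is a minimal sketch of how each might be computed, using simulated data. Everything in it (the item names, the noise levels, the sample size) is invented for illustration; the point is only the mechanics of Cronbach’s alpha for internal consistency and a simple correlation for test-retest reliability.

    # Minimal sketch of the two reliability checks described above.
    # All data are simulated; item names and noise levels are invented.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    latent = rng.normal(size=200)  # the "true" well-being of 200 respondents

    # Hypothetical five-item scale: each item is the latent score plus noise.
    items = pd.DataFrame(
        {f"item_{i}": latent + rng.normal(scale=0.8, size=200) for i in range(1, 6)}
    )

    def cronbach_alpha(df: pd.DataFrame) -> float:
        """Internal consistency: Cronbach's alpha for a set of scale items."""
        k = df.shape[1]
        item_variances = df.var(axis=0, ddof=1)
        total_variance = df.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

    # Test-retest reliability: the same single-item question answered by the
    # same respondents at two different times, then correlated.
    wave_1 = latent + rng.normal(scale=0.7, size=200)
    wave_2 = latent + rng.normal(scale=0.7, size=200)
    test_retest_r = np.corrcoef(wave_1, wave_2)[0, 1]

    print(f"Cronbach's alpha: {cronbach_alpha(items):.2f}")
    print(f"Test-retest correlation: {test_retest_r:.2f}")

With the noise levels chosen here, both statistics come out in roughly the ranges quoted below, but nothing hangs on the particular numbers.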

Regarding life evaluations, quoting OECD (2013, p47):
Bjørnskov (2010), for example, finds a correlation of 0.75 between the average Cantril Ladder measure of life evaluation from the Gallup World Poll and life satisfaction as measured in the World Values Survey for a sample of over 90 countries. [...] Test-retest results for single-item life evaluation measures tend to yield correlations of between 0.5 and 0.7 for time periods of 1 day to 2 weeks (Krueger and Schkade, 2008). Michalos and Kahlke (2010) report that a single-item measure of life satisfaction had a correlation of 0.65 for a one-year period and of 0.65 for a two-year period.
And regarding affect/experience measures:
There is less information available on the reliability of measures of affect and eudaimonic well-being than is the case for measures of life evaluation. However, the available information is largely consistent with the picture for life satisfaction. In terms of internal consistency reliability, Diener et al. (2009) report [...] the positive, negative and affect balance subscales of their Scale of Positive and Negative Experience (SPANE) have alphas of 0.84, 0.88, and 0.88 respectively. [...] In the case of test-retest reliability, [...] Krueger and Schkade (2008) report test-retest scores of 0.5 and 0.7 for a range of different measures of affect over a 2-week period.
The authors of OECD (2013) conclude that life evaluation and affect measures exhibit sufficient correlation, by the standards of social science, to be deemed acceptably reliable.

Assessing validity

Validity, by contrast, is somewhat harder to test than reliability for SWB measures: because the underlying phenomena are subjective, there is no objective way to demonstrate that the measures succeed (if you could measure something subjective objectively, it would not be subjective). Nevertheless, there are various ways to assess validity, all of which ultimately rely on whether the measures conform to our expectations about the concept we intend to capture.

The first is face validity - do respondents judge the questions to be an appropriate way to measure the concept of interest? If not, it’s likely the measures aren’t valid. In the case of SWB measures, this seems fairly obvious: asking people whether they felt sad yesterday is a sensible way to assess whether they felt sad yesterday. Participants aren’t generally asked about face validity directly, but it can be tested by (a) response speed and (b) non-response rates: if people take a long time to answer, or don’t answer at all, that suggests they don’t understand the question. Median response times for single-item SWB questions are around 30 seconds, suggesting the questions are not conceptually difficult (ONS, 2011). Quoting OECD (2013, p49): “in a large analysis by Smith (2013) covering three datasets [...] and over 400,000 observations, item-specific non-response rates for life evaluation and affect were found to be similar to those for [the straightforward] measures of educational attainment, marital and labour force status”, which, again, supports the face validity of the questions.
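
The two proxies just mentioned (response speed and item non-response) are straightforward to compute from survey data. The sketch below shows the basic calculation on a made-up extract; the column names and values are hypothetical.

    # Hypothetical sketch of the two face-validity proxies discussed above:
    # median response time and item non-response rate for an SWB question.
    import pandas as pd

    # Simulated survey extract; None marks a respondent who skipped the item.
    survey = pd.DataFrame({
        "life_satisfaction": [7, 8, None, 6, 9, 5, None, 7],
        "ls_response_seconds": [24, 31, None, 28, 35, 22, None, 30],
    })

    median_response_time = survey["ls_response_seconds"].median()
    non_response_rate = survey["life_satisfaction"].isna().mean()

    print(f"Median response time: {median_response_time:.0f} seconds")
    print(f"Item non-response rate: {non_response_rate:.1%}")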

The second is convergent validity - does the item correlate with other proxy measures for the same concept? Kahneman and Krueger (2006) list the following as correlates of both high life satisfaction and happiness: smiling frequency; smiling with the eyes (“unfakeable smile”); rating of one’s happiness made by friends; frequent verbal expressions of positive emotions; happiness of close relatives; self-reported health. In addition, OECD (2013) states:
Diener (2011), summarising the research in this area, notes that life satisfaction predicts suicidal ideation (r=0.44) and that low life satisfaction scores predicted suicide 20 years later in an epidemiological survey from Finland (after controlling for other risk factors).

Such items allow us to assess the measures from the perspective of falsifiability: if we expect that (say) those with low life satisfaction will commit suicide more often, but our measure of life satisfaction found that those with high LS committed suicide more often, that would suggest the measure lacked validity. As it stands, the results support the validity of both the experience and evaluation measures of SWB.
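
In practice, a convergent-validity check amounts to little more than a correlation matrix between the SWB item and its proxies. The sketch below uses simulated data and invented variable names purely to show the shape of the exercise; a valid measure should correlate positively with each proxy.

    # Hypothetical convergent-validity check: correlate a life-satisfaction
    # measure with proxy measures of the same underlying concept.
    # All variables and data are simulated for illustration.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(1)
    true_wellbeing = rng.normal(size=500)

    def noisy(scale: float) -> np.ndarray:
        """A proxy = the underlying well-being plus measurement noise."""
        return true_wellbeing + rng.normal(scale=scale, size=500)

    df = pd.DataFrame({
        "life_satisfaction": noisy(0.6),
        "friend_rated_happiness": noisy(0.9),
        "smiling_frequency": noisy(1.0),
        "self_reported_health": noisy(1.1),
    })

    # Each proxy should show a clearly positive correlation with the measure.
    print(df.corr()["life_satisfaction"].round(2))
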
The third is construct validity - while convergent validity assesses how closely the measure correlates with other proxy measures of the same concept, construct validity concerns itself with whether the measure performs in the way we expect it to. From OECD (2013, p51):

Measures of SWB broadly show the expected relationship with other individual, social and economic determinants. Among individuals, higher incomes are associated with higher levels of life satisfaction and affect, and wealthier countries have higher average levels of both types of subjective well-being than poorer countries (Sacks, Stevenson and Wolfers, 2010). At the individual level, health status, social contact, education and being in a stable relationship with a partner are all associated with higher levels of life satisfaction (Dolan, Peasgood and White, 2008), while unemployment has a large negative impact on life satisfaction (Winkelmann and Winkelmann, 1998). Kahneman and Krueger (2006) report that intimate relations, socialising, relaxing, eating and praying are associated with higher levels of positive affect; conversely, commuting, working, childcare and housework are associated with low levels of net positive affect. Boarini et al. (2012) find that affect measures have the same broad set of drivers as measures of life satisfaction, although the relative importance of some factors changes.
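
Checks of construct validity like those in the passage above are usually run as regressions of SWB on its expected determinants, asking whether the estimated coefficients carry the expected signs. A minimal sketch, assuming simulated data and the statsmodels library; the variable names and coefficient values are invented.

    # Minimal construct-validity sketch: regress life satisfaction on
    # determinants it is expected to relate to (income, unemployment, health).
    # Data are simulated, so the coefficients carry the signs we built in.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(2)
    n = 1_000
    df = pd.DataFrame({
        "log_income": rng.normal(10, 1, n),
        "unemployed": rng.binomial(1, 0.07, n),
        "good_health": rng.binomial(1, 0.8, n),
    })
    df["life_satisfaction"] = (
        3 + 0.4 * df["log_income"] - 1.5 * df["unemployed"]
        + 0.8 * df["good_health"] + rng.normal(scale=1.2, size=n)
    )

    model = smf.ols(
        "life_satisfaction ~ log_income + unemployed + good_health", data=df
    ).fit()
    print(model.params.round(2))  # expect positive, negative, positive

With real survey data, the interest lies in whether those signs (and rough magnitudes) match expectations like the ones listed in the quotation above.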

Major life events, such as unemployment, marriage, divorce and widowhood, are shown to result in long-term, substantial changes to SWB, just as one would expect. The time-series in figure 1, from Clark, Diener, Georgellis and Lucas (2007), displays the LS-impact of such events for males (controlling for other variables) before, during and after they occur (the y-axis records the change in LS on a 0-10 scale; the results are similar for females). Note that the time series sometimes shows anticipation of the event: we can see, for example, a decrease in LS leading up to a divorce, whereas widowhood is barely anticipated and comes as a huge shock.
Figure 1. The dynamic effects of life and labour market events on life satisfaction (male) (Clark, Diener, Georgellis and Lucas 2007). Y-axis: change in life satisfaction on a 0-10 scale.
Figure 2, from Clark, Fleche, Layard, Powdthavee and Ward (2017, p100), shows a similar time-series, this time for disability, drawn from three different datasets. Individuals seem to adapt partially, rather than fully, to disability.[1] This is what we might suppose would happen: becoming disabled is very bad, but being disabled is somewhat less bad as one’s lifestyle and mindset adjust. It’s worth noting here that one major potential objection to the use of SWB measures is that people do not really adapt to changes in circumstances; they simply change how they use their scales. However, if such scale re-norming did take place, we would expect to see apparent adaptation to all conditions. Yet we do not see this: the LS scores in figure 1 above show people adapt to some things and not others. Further, Oswald and Powdthavee (2008) find there is less adaptation to severe disability than to mild or moderate disability, suggesting scale norming is not occurring and that the SWB scores reflect reality.
Figure 2. Adaptation to disability in different country datasets (Clark et al. 2017, p100)
As mentioned before, if the SWB measures had produced counter-intuitive results (the ‘wrong answers’), that could have led us to conclude they were not valid. Yet the results above seem to match our ‘folk psychological’ expectations.

The Easterlin Paradox

One finding that might, at least at first, seem counterintuitive is the relationship between SWB and income. While there is little disagreement that richer people within a given country report higher SWB (on both experience and evaluation measures), and that richer countries report higher SWB than poorer ones, there is less consensus over whether SWB increases over time as countries become wealthier. This is the so-called ‘Easterlin Paradox’, displayed in figure 3 below, from Clark et al. (2018, p203). A critical response to SWB measures could be made as follows: “the Easterlin Paradox shows increasing overall economic prosperity doesn’t increase SWB. But it’s obvious that increasing overall economic prosperity should raise SWB. Therefore, the SWB measures must be wrong”.

Such a response would be too quick. First, the debate still rages over whether the Easterlin Paradox holds: Stevenson and Wolfers (2008) argue that it does not; Easterlin et al. (2016) reply that it does.

Second, as Clark (2016) notes, a large body of research finds that individual SWB depends not just on the individual’s own income, but also on their income relative to that of the reference group they compare themselves to. Thus, if I am wealthier than you, I should expect to have higher SWB. However, if my income rises but the income of those I compare myself to rises as well, these effects cancel out, leaving my SWB unchanged. Hence the Easterlin paradox can be explained in large part by the phenomenon of social comparison: we judge our lives against those of others (a toy illustration of this cancellation follows figure 3 below).
Figure 3. Change in subjective well-being and GDP/head over time (Clark et al. 2018, p203)
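
The cancellation argument above can be made concrete with a toy model in which SWB depends on the logarithm of own income minus the logarithm of reference-group income. The functional form and coefficients below are invented purely for illustration and are not drawn from the literature.

    # Toy illustration of the social-comparison explanation: SWB depends on
    # own income and on income relative to a reference group.
    # Functional form and coefficients are invented for illustration only.
    import math

    def toy_swb(own_income: float, reference_income: float,
                a: float = 0.5, b: float = 0.5) -> float:
        return a * math.log(own_income) - b * math.log(reference_income)

    baseline = toy_swb(own_income=50_000, reference_income=50_000)

    # My income doubles, but so does everyone else's: the two terms move
    # together and SWB is unchanged.
    print(round(toy_swb(100_000, 100_000) - baseline, 6))  # 0.0

    # My income doubles while the reference group's stays flat: SWB rises.
    print(round(toy_swb(100_000, 50_000) - baseline, 3))   # ~0.347
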
In a particularly insightful study by Solnick and Hemenway (2005), individuals were asked to choose between different states of the world, as follows.

A: Your current yearly income is $50,000; others earn $25,000

B: Your current yearly income is $100,000; others earn $200,000

Absolute income is higher in B than in A, while relative income is higher in A than in B. Individuals express a marked preference for A, highlighting the importance of relative income. Hence, with further analysis, the Easterlin paradox is no longer as counter-intuitive as it first seemed.

Overall, the evaluation and experience SWB measures seem both reliable and valid.
Endnotes

​[1] In a meta-analysis, Luhmann et al. (2012) compare the rates of adaptation on evaluative and experience measures of SWB, finding some differences.
