Validity and reliability
How confident can we be that subjective well-being (SWB) measures are accurate? Do they succeed in measuring what they set out to measure?
In the social sciences, the accuracy of a measure is usually assessed in terms of its validity and reliability.
Validity refers to whether the measure captures the underlying concept that it purports to measure. Suppose I try to measure your height by weighing you on a set of bathroom scales. The scales might be a valid measure of weight but it’s clear, I hope, they are not a valid measure of height.
Reliability is about whether the measure gives consistent results in identical circumstances (i.e. it has a high signal-to-noise ratio). If my scales produce a random number every time I step on them, they are not reliable.
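The signal-to-noise idea can be made concrete with a small simulation. The sketch below is illustrative only and rests on an assumption not made in the text: a classical measurement model in which each observed score is a stable "true" score plus independent random noise. Under that assumption, the correlation between two administrations of the same measure approaches signal variance divided by total variance. The function names and parameter values are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def expected_reliability(signal_sd, noise_sd):
    """Share of observed variance that is signal (assumed classical model)."""
    return signal_sd ** 2 / (signal_sd ** 2 + noise_sd ** 2)


def simulate_test_retest(n, signal_sd, noise_sd, seed=0):
    """Correlate two noisy measurements of the same simulated true scores."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(0, signal_sd) for _ in range(n)]
    # Each administration adds fresh, independent noise to the same true score.
    first = [t + rng.gauss(0, noise_sd) for t in true_scores]
    second = [t + rng.gauss(0, noise_sd) for t in true_scores]
    return pearson(first, second)


if __name__ == "__main__":
    # A fairly precise instrument versus one that is mostly noise.
    print(simulate_test_retest(20_000, signal_sd=2.0, noise_sd=1.0))
    print(simulate_test_retest(20_000, signal_sd=2.0, noise_sd=20.0))
```

With little noise the two administrations correlate strongly; with scales that are close to random, the correlation falls towards zero, which is the sense in which such scales are unreliable.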
Reliability is necessary but not sufficient for validity. If you used a normal, non-broken set of scales to measure your height, they would give you the same score each time and so be reliable (assuming your weight doesn’t fluctuate), but they still wouldn’t be valid. The reliability and validity of SWB scales have been covered at great length in OECD (2013) and elsewhere. The following sections provide a summary of the key points.
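Reliability can be assessed in two ways: internal consistency, i.e. whether different items or measures intended to capture the same concept agree with one another; and test-retest reliability, i.e. whether the same measure gives similar results when administered to the same respondents at different times.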
Regarding life evaluations, quoting (OECD 2013, p47):
Bjørnskov (2010), for example, finds a correlation of 0.75 between the average Cantril Ladder measure of life evaluation from the Gallup World Poll and life satisfaction as measured in the World Values Survey for a sample of over 90 countries. [...] Test-retest results for single item life evaluation measure tend to yield correlations of between 0.5 and 0.7 for time period of 1 day to 2 weeks (Krueger and Schkade, 2008). Michalos and Kahlke (2010) report that a single-item measure of life satisfaction had a correlation of 0.65 for a one year period and of 0.65 for a two-year period.
And regarding affect/experience measures:
There is less information available on the reliability of measure of affect and eudaimonic well-being than is the case for measures of life evaluation. However, the available information is largely consistent with the picture for life satisfaction. In terms of internal consistency reliability, Diener et al. (2009) report [...] the positive, negative and affective balance subscale of their Scale of Positive and Negative Experience (SPANE) have alphas of 0.84, 0.88, and 0.88 respectively. [...] In the case of test-retest reliability, [...] Krueger and Schkade (2008) report test-retest scores of 0.5 and 0.7 for a range of different measures of affect over a 2-week period.
The authors of OECD (2013) conclude the life evaluation and affect measures exhibit sufficient correlation, by the standards of social science, to be deemed acceptably reliable.
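For readers unfamiliar with the "alphas" quoted above, the following is a minimal sketch of Cronbach's alpha, the standard internal-consistency statistic for multi-item scales. It uses the textbook formula with sample variances. The function name and the toy response data are hypothetical and are not taken from Diener et al. (2009).

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of totals)
    """
    k = len(responses[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item across respondents.
    item_variances = [variance([row[i] for row in responses]) for i in range(k)]
    # Variance of each respondent's summed score.
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


if __name__ == "__main__":
    # Hypothetical answers from four respondents to a three-item scale.
    toy_data = [
        [4, 5, 4],
        [2, 2, 3],
        [5, 5, 5],
        [1, 2, 1],
    ]
    print(round(cronbach_alpha(toy_data), 3))
```

Alpha approaches 1 when the items move together across respondents and falls when they do not, which is why values in the 0.8s are read as good internal consistency.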
Validity, by contrast, is somewhat harder to test than reliability for SWB measures because the underlying phenomena are subjective, hence there is no objective way to demonstrate success. If you could measure something subjective objectively, it would not be subjective. Nevertheless, there are various ways to assess validity. All of these ultimately rely on whether the measures conform to our expectations about the item we are intending to measure.
The first is face validity - do respondents judge the questions as an appropriate way to measure the concept of interest? If not, it’s likely the measures aren’t valid. In the case of SWB measures, it’s somewhat obvious this is the case, e.g. that asking people whether they felt sad yesterday is a good way to assess whether they felt sad yesterday. Participants aren’t generally asked about face validity directly, but it can be tested by (a) response speed and (b) non-response rates: if people take a long time to answer, or don’t answer at all, that suggests they don’t understand the question. Median response times for SWB questions are around 30 seconds for single item measures, suggesting the questions are not conceptually difficult (ONS, 2011). Quoting from (OECD 2013, p49): “in a large analysis by Smith (2013) covering three datasets [...] and over 400,000 observations, item-specific non-response rates for life evaluation and affect were found to be similar for those for [the straightforward] measures of educational attainment, marital and labour force status” which, again, supports the face validity of the questions.
The second is convergent validity - does the item correlate with other proxy measures for the same concept? Kahneman and Krueger (2006) list the following as correlates of both high life satisfaction and happiness: smiling frequency; smiling with the eyes (“unfakeable smile”); rating of one’s happiness made by friends; frequent verbal expressions of positive emotions; happiness of close relatives; self-reported health. In addition, OECD (2013) states:
Diener (2011), summarising the research in this area, notes that life satisfaction predicts suicidal ideation (r=0.44) and that low life satisfaction scores predicted suicide 20 years later in a later epidemiological survey from Finland (after controlling for other risk factors).
Such items allow us to assess the measures from the perspective of falsifiability: if we expected that (say) those with low life satisfaction would commit suicide more often, but our measure of life satisfaction found that those with high LS commit suicide more often, that would suggest the measure lacked validity. As it stands, the results support the validity of the experience and evaluation measures of SWB.
The third is construct validity - while convergent validity assesses how closely the measure correlates with other proxy measures of the same concept, construct validity concerns itself with whether the measure performs in the way we expect it to. From OECD (2013, p51):
Measures of SWB broadly show the expected relationship with other individual, social and economic determinants. Among individuals, higher incomes are associated with higher levels of life satisfaction and affect, and wealthier countries have higher average levels of both types of subjective well-being than poorer countries (Sacks, Stevenson and Wolfers, 2010). At the individual level, health status, social contact and education and being in a stable relationship with a partner are all associated with higher levels of life satisfaction (Dolan, Peasgood and White, 2008), while unemployment has a large negative impact on life satisfaction (Winkelmann and Winkelmann, 1998). Kahneman and Krueger (2006) report intimate relations, socialising, relaxing, eating and praying are associated with higher levels of positive affect; conversely, commuting, working and childcare and housework are associated with low level of net positive affect. Boarini et al. (2012) find that affect measures have the same broad set of drivers as measures of life satisfaction, although the relative importance of some factors changes.
Major life events, such as unemployment, marriage, divorce and widowhood, are shown to result in long-term, substantial changes to SWB, just as one would expect them to. The time-series in figure 1, from Clark, Diener, Georgellis and Lucas (2007), displays the LS-impact of such events for males (controlling for other variables) before, during and after they occur (the y-axis records the change in LS on a 0-10 scale; the results are similar for females). Note that the time series sometimes shows anticipation of the event. We can see, for example, a decrease in LS leading up to a divorce, whereas widowhood is barely anticipated and comes as a huge shock.
Figure 2 from Clark, Fleche, Layard, Powdthavee, Ward (2017, p100) shows a similar time-series, this time for disability from three different data-sets. Individuals seem to partially, rather than fully, adapt to disability. This is what we might suppose would happen: becoming disabled is very bad, but being disabled is somewhat less bad as one’s lifestyle and mindset adjust. It’s worth noting here that one major potential objection to the use of SWB measures is that people do not really adapt to changes in circumstances; they simply change how they use their scales. However, if scale re-norming did take place, we would expect to see adaptation to all conditions. Yet we do not see this: the LS scores in figure 1 above show people adapt to some things and not others. Further, Oswald and Powdthavee (2008) find there is less adaptation to severe disability than to mild or moderate disability, suggesting scale norming is not occurring and that the SWB scores are reflecting reality.
As mentioned before, if the SWB measures had produced counter-intuitive results (the ‘wrong answers’) that could lead us to conclude they were not valid. Yet, the above seems to match our ‘folk psychological’ expectations.
The Easterlin Paradox
One finding that might, at least at first, seem counterintuitive is the relationship between SWB and income. While there is little disagreement that richer people within a given country report higher SWB (both on experience and evaluation measures), and richer countries report higher SWB, there is less consensus over whether SWB increases over time as countries become wealthier. This is the so-called ‘Easterlin Paradox’, displayed in figure 3 below (Clark et al. 2018, p203). A critical response to SWB measures could be made as follows: “the Easterlin Paradox shows increasing overall economic prosperity doesn’t increase SWB. But it’s obvious increasing overall economic prosperity should raise SWB. Therefore, the SWB measures must be wrong”.
Such a response would be too quick. First, the debate still rages over whether the Easterlin Paradox holds: Stevenson and Wolfers (2008) argue it does not, to which Easterlin et al. (2016) reply.
Second, as Clark (2016) notes, a large body of research finds individual SWB depends not just on the individual’s own income, but also on their income relative to that of the reference group they compare themselves to. Thus, if I am wealthier than you, I should expect to have higher SWB. However, if my income rises but the income of those I compare myself to also rises, these effects cancel out, leaving my SWB unchanged. Hence the Easterlin paradox can be explained in large part by the phenomenon of social comparison: we judge our lives against those of others.
In a particularly insightful study by Solnick and Hemenway (2005), individuals were asked to choose between different states of the world, as follows.
A: Your current yearly income is $50,000; others earn $25,000
B: Your current yearly income is $100,000; others earn $200,000
Absolute income is higher in B than in A, while relative income is higher in A than in B. Individuals express a marked preference for A, highlighting the importance of relative income. Hence, with further analysis, the Easterlin paradox is no longer as counter-intuitive as it first seemed.
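The social comparison argument can be illustrated with a toy model. The sketch below is not taken from Clark (2016) or Solnick and Hemenway (2005); it assumes, purely for illustration, that well-being is a weighted mix of log absolute income and log income relative to a reference group. The function names and the `relative_weight` parameter are hypothetical.

```python
import math


def toy_swb(own_income, reference_income, relative_weight):
    """Toy well-being score: weighted mix of absolute and relative log income.

    relative_weight = 0 means only absolute income matters;
    relative_weight = 1 means only income relative to the reference group matters.
    """
    absolute_term = math.log(own_income)
    relative_term = math.log(own_income / reference_income)
    return (1 - relative_weight) * absolute_term + relative_weight * relative_term


def preferred_world(relative_weight):
    """Which of the two hypothetical worlds scores higher under the toy model."""
    world_a = toy_swb(50_000, 25_000, relative_weight)    # own 50k, others 25k
    world_b = toy_swb(100_000, 200_000, relative_weight)  # own 100k, others 200k
    return "A" if world_a > world_b else "B"


if __name__ == "__main__":
    for weight in (0.0, 0.2, 0.5, 1.0):
        print(weight, preferred_world(weight))
    # With purely relative concerns, doubling everyone's income changes nothing.
    print(toy_swb(30_000, 30_000, 1.0) == toy_swb(60_000, 60_000, 1.0))
```

In this toy model, world A is preferred whenever the weight on relative income exceeds one third, and when only relative income matters a rise in everyone's income leaves the score unchanged, mirroring the cancelling-out described above.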
Overall, the evaluation and experience SWB measures seem both reliable and valid.
 In a meta-analysis, Luhmann et al. (2012) compare the rates of adaptation on evaluative and experience measures of SWB, finding some differences.