Results of the second inter-rater reliability analysis

A) Introduction

This is a write-up of the second inter-rater reliability analysis conducted as part of the MHIN screening. The rationale for this analysis has been provided elsewhere.
The screening process is described here, including a (preliminary) decision rule indicating when an intervention is screened “in” or “out”.


As in the first round (results here), six mental health interventions were screened by eight raters (though as of 4 April, only six raters had submitted their ratings).

B) Key Findings and inferences (preliminary):

In general:
  • Agreement was lower in the second round than in the first round of ratings (around 0.56). A sketch of how such an agreement figure can be computed follows this list.
  • For three interventions, at least 42% of raters could not make a judgment about cost-effectiveness because the necessary data were not provided or were deemed insufficient. Overall, 20 of 34 ratings (ca. 59%) included a cost-effectiveness estimate, a lower proportion than in the first round.
  • “Abwenzi Pa Za Umoyo: Integrating the MESH MH model in Malawi” was rated “in”, assuming a threshold of 6, by all six raters who made a judgment about cost-effectiveness.
  • For all other interventions, there was (more or less) disagreement about screening the intervention “in” or “out”, assuming a threshold of 6.
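
As a side note on how an agreement figure like the 0.56 above can be arrived at, here is a minimal sketch in Python. It assumes average pairwise agreement on the “in”/“out” judgments, weighted by the number of usable ratings per intervention; both the statistic and the example numbers are illustrative assumptions, not the exact procedure behind the figures reported here.

from itertools import combinations

def proportion_of_agreement(judgments):
    # Average pairwise agreement across raters for a single intervention.
    # `judgments` holds one "in"/"out" call per rater; raters who could
    # not make a judgment are simply left out.
    pairs = list(combinations(judgments, 2))
    if not pairs:
        return None  # fewer than two usable ratings
    return sum(a == b for a, b in pairs) / len(pairs)

def weighted_agreement(per_intervention):
    # Weight each intervention's agreement by its number of usable ratings.
    # `per_intervention` is a list of (n_usable_ratings, agreement) tuples.
    total = sum(n for n, _ in per_intervention)
    return sum(n * a for n, a in per_intervention) / total

# Hypothetical example: six raters, four "in" and two "out".
print(proportion_of_agreement(["in", "in", "in", "in", "out", "out"]))  # ~0.47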

Individual rating behaviour:
  • We are still waiting for the last ratings, as they will change the means and standard deviations against which we compare. It does not currently make sense to conduct this analysis, but some trends are visible: some raters apparently tend to score consistently differently from the rest, and it is not yet clear what to do about that.

C) Discussion

  • The lower proportion of agreement obtained in round II may have been influenced by the fact that fewer of the interventions were known to raters. In the first round, most raters had probably already heard of the “Friendship Bench” and “Strongminds”, which were then screened “in” by everyone. However, these were also just very clearly promising interventions, so this familiarity bias may not have been too important here; instead, the second round may simply have contained lower-quality interventions overall.
 
  • Even encouraging raters to estimate costs and effectiveness under uncertainty did not help in obtaining more cost-effectiveness estimates. This suggests that the information available via the MHIN is simply insufficient for many projects. These data could, however, potentially be obtained through further research in many cases. This (to me) underscores the need to rely more on intuitive scores when (almost) none of the raters were able to estimate cost-effectiveness. We have already touched on this.
 
  • Experimenting with the threshold gives interesting results. I have done this because, for seven of the ten interventions for which we have CE-estimates (first and second round taken together), the average estimate lies between 5.5 and 7.7 - very much around our proposed threshold of 6. It seems reasonable that overall agreement would improve if we either lowered the threshold slightly (to 5) or raised it to greater than 8 (which would exclude interventions rated 2 on costs and 4 on effectiveness, or 4 and 2) or to greater than 9 (which would also exclude those rated 3 on both).

  • In fact:
→ altering the threshold to 5 (i.e., the CE-estimate of an intervention needs to be bigger than 5 to be screened “in”) gives a weighted proportion of agreement of 0.91 (round 1) and 0.75 (round 2). Out of 12 interventions, we would clearly screen “in” 7 or 8.
→ altering the threshold to 8 (i.e., the CE-estimate of an intervention needs to be bigger than 8 to be screened “in”) gives a weighted proportion of agreement of 0.89 (round 1) and 0.65 (round 2). Out of 12 interventions, we would clearly screen “in” 3.
→ altering the threshold to 9 (i.e., the CE-estimate of an intervention needs to be bigger than 9 to be screened “in”) gives a weighted proportion of agreement of 0.89 (round 1) and 0.65 (round 2). Out of 12 interventions, we would clearly screen “in” 3.

These data show that the choice of threshold matters greatly both for inter-rater agreement and for our overall sensitivity. A sketch of this re-screening exercise is given below.
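
To make the re-screening concrete, here is a minimal sketch assuming the CE-estimate is the product of a cost score and an effectiveness score - the interpretation implied by a threshold of >8 excluding the 2/4 and 4/2 combinations and >9 also excluding 3/3. The example ratings are invented, so the printed decisions are only illustrative.

def screened_in(cost, effectiveness, threshold):
    # Screen "in" if the CE-estimate exceeds the threshold.
    # Assumes CE-estimate = cost score * effectiveness score.
    return cost * effectiveness > threshold

# Invented (cost, effectiveness) ratings for a single intervention.
ratings = [(2, 4), (3, 3), (4, 3), (2, 5)]

for threshold in (5, 6, 8, 9):
    decisions = ["in" if screened_in(c, e, threshold) else "out"
                 for c, e in ratings]
    print(threshold, decisions)

# At thresholds 5 and 6 all four ratings say "in"; at 8 the 2x4 rating flips
# to "out"; at 9 the 3x3 rating flips as well, changing the level of agreement.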


  • On the other hand, raters were probably aware of the threshold and rated accordingly. Is this something we want to encourage at all? Maybe it would actually be better to rate all of the remaining interventions without a preset threshold, using instead a range (say, 5-9), and then decide afterwards which threshold within this range makes sense.
  • Interestingly (and mostly unsurprisingly), setting the threshold to 8 or 9 did not “kick out” former clear-ins, but mainly resolved disagreement. The same three clear-ins indeed remained clear-ins, whereas the interventions with disagreement were now all screened out, mostly with reasonable agreement.

D) Recommendations (in a nutshell, up for debate):

  • The available data should be sufficient to justify going ahead, probably with some follow-up training(?) for two or three raters.
  • Especially if we alter the threshold as described above, inter-rater reliability seems high enough overall to proceed; disagreement appears to stem largely from insufficient information rather than from poor judgment on the part of our raters.
  • Set a threshold range for orientation instead of a clear-cut threshold, and define the final threshold only after obtaining the ratings for all interventions.
  • Create a second way of screening an intervention “in”: e.g., if two of our raters in the next round are unable to estimate CE, look at the intuitive score instead. The threshold here would need to be defined, but six or greater seems obvious. A sketch of such a rule is given below.
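
A minimal sketch of what that two-path rule could look like, assuming the fallback kicks in once two or more raters cannot estimate CE and using six or greater as the intuitive-score threshold. The function, the aggregation by mean, and the example values are hypothetical choices, not an agreed procedure.

def screen_intervention(ce_estimates, intuitive_scores,
                        ce_threshold=6, intuitive_threshold=6):
    # Path 1: if fewer than two raters failed to estimate CE, decide on
    # whether the mean CE-estimate exceeds `ce_threshold`.
    # Path 2: otherwise fall back to the mean intuitive score, screened "in"
    # at `intuitive_threshold` or greater.
    # `ce_estimates` may contain None for raters unable to judge.
    missing = sum(e is None for e in ce_estimates)
    available = [e for e in ce_estimates if e is not None]
    if available and missing < 2:
        return sum(available) / len(available) > ce_threshold
    return sum(intuitive_scores) / len(intuitive_scores) >= intuitive_threshold

# Hypothetical example: three raters could not estimate CE,
# so the intuitive scores decide (mean 6.5 >= 6, screened "in").
print(screen_intervention([None, None, None, 7], [6, 7, 5, 8]))  # True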
