Results of the first inter-rater reliability analysis

A) Introduction

This is a write-up of the first inter-rater reliability analysis conducted as part of the MHIN screening. The rationale for this analysis has been provided elsewhere, along with some ideas which mostly could not be implemented given the data available from the first screening round. The screening process is described here, including the decision rule that determines whether an intervention is screened “in” or “out”.

The calculations for this analysis can be found in this document on the sheet “Analysis”.
In total, the analysis is based on 58 attempted screenings of 9 interventions (6.44 attempted screenings per intervention).
Nine raters each contributed at least two attempted screenings.

B) Key findings and inferences


  1. In general terms:
 
  • Out of 58 attempted screenings, raters stated in 15 cases (25.86%) that the available information was not sufficient to screen an intervention “in” or “out”. This may have been due to a lack of clarity about the costs or the benefits of the intervention (or both).
    • It might be reasonable to put an intervention on a separate list (separate from those which can clearly be screened “in” or “out”) if not enough information is available to assess its cost-effectiveness (CE). This might be the case, for example, because the intervention is a future project.
 
  • For the eight interventions which were screened more than once, there was either very strong agreement between raters or close to no agreement.
    • This suggests a broad distinction between rather clear and not-so-clear interventions. If we are meant to focus on only a couple of interventions, we might focus on the “clear ins” first. However, it is noteworthy that the two interventions clearly rated “in” (see below) were probably not new to most raters.
 
  • The interventions “Friendship Bench” and “StrongMinds” were clearly rated “in”, whereas “Rising Sun” was clearly rated “out”.
 
  • Fleiss’ Kappa could not be calculated, as this measure assumes that the number of raters is the same for every intervention. That is not the case here, and even if it were, the usable data would still differ because the number of “failed screenings” (where no estimate could be made) varies between interventions. Instead, a weighted average of the proportion of agreement was calculated, which came to 0.76 on a scale from 0 (no agreement) to 1 (total agreement). This suggests strong agreement overall, though note that, unlike Fleiss’ Kappa, this measure is not corrected for chance agreement (a sketch of the calculation follows this list).
 
  • Overall, especially when assessing individual rating behaviour, the quantity of data appears insufficient to make strong claims about inter-rater reliability. More data would be needed to gain higher confidence in the outcome of this analysis. This becomes especially evident when looking at the standard deviations of the cost x effectiveness scores.
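
As an illustration, the sketch below shows one plausible way to compute such a weighted average of the proportion of agreement: pairwise agreement per intervention, weighted by the number of valid ratings. The actual calculation lives in the spreadsheet; this Python version and the example ratings are assumptions for illustration only, not the real screening data.

    from math import comb

    # Hypothetical "in"/"out" screening decisions per intervention
    # (illustrative only, not the actual data).
    ratings = {
        "Friendship Bench": ["in", "in", "in", "in", "in"],
        "Rising Sun": ["out", "out", "out", "out"],
        "Intervention X": ["in", "out", "in", "out", "in"],
    }

    def pairwise_agreement(labels):
        # Proportion of rater pairs that agree on a single intervention.
        n = len(labels)
        if n < 2:
            return None  # agreement is undefined with fewer than two ratings
        agreeing = sum(comb(labels.count(c), 2) for c in set(labels))
        return agreeing / comb(n, 2)

    # Average the per-intervention agreement, weighting each intervention
    # by its number of valid ratings.
    weighted_sum, total_weight = 0.0, 0
    for labels in ratings.values():
        p = pairwise_agreement(labels)
        if p is not None:
            weighted_sum += p * len(labels)
            total_weight += len(labels)

    print(f"Weighted average proportion of agreement: {weighted_sum / total_weight:.2f}")

This handles a varying number of raters per intervention, which is exactly the situation that ruled out Fleiss’ Kappa above.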


  2. With regard to individual rating patterns:

  • Looking at the estimates of the cost x benefit score, which determines whether an intervention is screened “in” or “out”, most raters did not consistently deviate from the mean score by at least 1 SD. Only one rater (Eemaan) had more ratings deviating by at least 1 SD than ratings within 1 SD of the average (3 compared to 2). Most raters had either no deviations of at least 1 SD or only one. This indicates that individual assessments of costs and benefits were roughly similar. It needs to be emphasized that not much data is available here: most raters had only around 4-6 ratings in total that led to a screening “in” or “out” (a sketch of this deviation check follows this list).
    • We might still want to tell Eemaan that she tended to deviate from the overall average, to raise awareness for the next round.
  • It would have been interesting to also look at the intuitive 1-10 score. However, this could not be calculated because too many ratings were invalid (see below).
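
A minimal sketch of the deviation check described above, assuming scores keyed by intervention and rater; all names and values are made up for illustration, and the flagging rule (at least 1 SD from the per-intervention mean) mirrors the description above:

    import statistics as st

    # Hypothetical cost x effectiveness scores,
    # keyed as {intervention: {rater: score}} (illustrative only).
    scores = {
        "Friendship Bench": {"A": 8, "B": 7, "C": 9, "D": 10},
        "Rising Sun": {"A": 2, "B": 3, "D": 6},
        "Intervention X": {"B": 5, "C": 4, "D": 8},
    }

    counts = {}  # rater -> (total ratings, ratings at least 1 SD from the mean)
    for by_rater in scores.items():
        pass  # placeholder removed below; see loop
    for by_rater in scores.values():
        values = list(by_rater.values())
        if len(values) < 2:
            continue  # SD is undefined for a single rating
        mean, sd = st.mean(values), st.stdev(values)
        for rater, score in by_rater.items():
            total, flagged = counts.get(rater, (0, 0))
            deviates = sd > 0 and abs(score - mean) >= sd
            counts[rater] = (total + 1, flagged + deviates)

    for rater, (total, flagged) in sorted(counts.items()):
        print(f"{rater}: {flagged}/{total} ratings at least 1 SD from the mean")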

C) Most important things for raters to consider to improve future inter-rater reliability

  • Data need to be entered in the correct format and consistently. For example, please always include the name of the rater, spelled consistently and without stray trailing spaces (“Tim” vs. “Tim “). Such inconsistencies are extremely hard to spot when running the analysis (see the validation sketch after this list).
    • We might consider restricting the cell format in certain cases to avoid this in the future.
    • We should agree on a standard way of stating that costs/effectiveness cannot be estimated, e.g. “NE”.
    • Unfortunately, when asking for a score between 1 and 10, we need a single definitive score. Many raters have indicated ranges here, which are hard to evaluate. Alternatively, we could add an x% confidence interval to reflect differences in uncertainty.
  • Whenever possible, raters should try to make an estimate of both the costs and the effectiveness (because otherwise the intervention can be screened neither in nor out).
    • We might, however, also change the criterion for screening in or out. We could say, for example, that if costs/effectiveness cannot be estimated, the subjective assessment of CE (1-10) can lead to a conditional screening “in” (which would be a third option), e.g. if the score is above 5. That way we would make sure not to miss interventions for which adequate data are not available but which might nonetheless be very promising.
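
To make these suggestions concrete, here is a minimal sketch of how the entry rules and the proposed conditional screening could look. The “NE” sentinel, the function names, and the thresholds are illustrative assumptions, not an agreed specification:

    NOT_ESTIMABLE = "NE"  # proposed sentinel for "cannot be estimated"

    def clean_rater(name):
        # Normalize rater names so "Tim" and "Tim " count as the same rater.
        return str(name).strip()

    def parse_score(cell):
        # Parse a 1-10 score cell; "NE" means not estimable, and ranges
        # like "4-6" or free text raise an error instead of being guessed at.
        text = str(cell).strip()
        if text.upper() == NOT_ESTIMABLE:
            return None
        score = float(text)  # raises ValueError for ranges or free text
        if not 1 <= score <= 10:
            raise ValueError(f"Score {score} is outside the 1-10 range")
        return score

    def screen(ce_score, subjective_ce, threshold=5):
        # ce_score: combined cost x effectiveness score, or None if it could
        # not be estimated. The in/out cut-off is a placeholder here, since
        # the actual decision rule is defined elsewhere.
        if ce_score is not None:
            return "in" if ce_score > threshold else "out"
        # Proposed third option: conditionally screen "in" on the subjective
        # 1-10 assessment alone, e.g. if it is above 5.
        if subjective_ce is not None and subjective_ce > 5:
            return "conditionally in"
        return "insufficient information"

    print(screen(ce_score=None, subjective_ce=parse_score("7")))  # conditionally in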

D) Please note:
  • I am not an expert in this field, neither in inter-rater reliability analysis nor in Excel. Double-checking my calculations and inferences would certainly be helpful, if possible.
  • This analysis does not take into account qualitative aspects, such as those stated in the column for general feedback. These should be analyzed separately.

Two aspects nonetheless stand out in this regard:

  • Looking at the data, I had the impression that the question of whom we define as beneficiaries is particularly controversial.
  • Some further clarity on what we mean by “could be funded” / “could be funded as a new organization” might be helpful (though not decisive for the screening). For example, I spoke with Florian, and he was not sure whether this question pertains to the current state or potentially to the future (e.g. if the MHI is a research project testing an intervention which could later be scaled up, but clear evidence has not yet been provided).
