
Quality of evidence

When assessing the quality of the evidence for an intervention or charity, we use the following broad ratings (discussed in detail in the Ratings section):

  • High
  • Moderate
  • Low
  • Very low

Our assessment is based on the widely used GRADE (Grading of Recommendations, Assessment, Development and Evaluation) framework,[1] with a few minor adjustments to make it a better fit for the charity evaluation context (see the Changes to GRADE section). The overall quality of evidence is a combined rating based on the following six criteria (which we detail in the Criteria section):

  • Study design
  • Risk of bias
  • Imprecision
  • Inconsistency
  • Indirectness
  • Publication bias

Note: this page describes our approach as of 2023, which may not match our older reports. We see this methodology as a work in progress, and welcome feedback on it.

Ratings

GRADE does not provide a mechanistic rating,[2] but rather a method for making ratings in a systematic and transparent way. The quality of the evidence is evaluated holistically, and the weight assigned to each criterion may differ depending on the context. As such, we don’t use strict rules or cutoffs when assessing the criteria. Reasonable people may disagree on the overall rating, but the goal is to make the justifications for the decision clear. We see this approach as a work in progress, and we expect we will continue to refine our process and criteria as we go.
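
To make this kind of record concrete, here is a minimal, hypothetical sketch (in Python; not HLI’s actual tooling) of how a criterion-by-criterion assessment with written justifications might be stored. The overall rating is entered by the assessor rather than computed, reflecting that GRADE is holistic rather than mechanistic; all names and example content are illustrative assumptions.

```python
# Illustrative sketch only: a structure for recording GRADE-style judgements
# so the justification behind each criterion, and the overall call, is explicit.
from dataclasses import dataclass, field

RATINGS = ("high", "moderate", "low", "very low")

@dataclass
class CriterionJudgement:
    rating: str          # one of RATINGS
    justification: str   # written reasoning, so disagreements can be traced

@dataclass
class EvidenceAssessment:
    intervention: str
    criteria: dict = field(default_factory=dict)  # criterion name -> CriterionJudgement
    overall: str = ""        # assessor's holistic judgement, not computed
    overall_notes: str = ""  # why the criteria were weighted as they were

# Hypothetical usage:
assessment = EvidenceAssessment(intervention="Example psychotherapy programme")
assessment.criteria["risk of bias"] = CriterionJudgement(
    rating="moderate",
    justification="Several trials did not report how attrition was handled.",
)
assessment.overall = "moderate"
assessment.overall_notes = "Risk of bias and indirectness carry the most weight here."
```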

High

In line with GRADE, a high rating is the default for an evidence base composed of randomised controlled trials (RCTs). A high rating meets the following criteria (these are discussed in detail in the Criteria section below):

  • Study design: The evidence base includes multiple high-quality RCTs.
  • Risk of bias: The majority of the RCTs show little risk of bias (RoB).[3]
  • Imprecision: The RCTs have high statistical power to detect significant differences, with confidence intervals that are sufficiently narrow that the statistical imprecision has negligible impacts on decision making.[4] This typically means that the average study has a large sample size.
  • Inconsistency: The estimated effects are broadly consistent across the RCTs, although there may be some minor variation.
  • Indirectness: The RCTs study the intervention directly as it is implemented in the real world or in highly relevant contexts.
  • Publication bias: The evidence base appears to have small or non-existent publication bias.

Moderate

The level of evidence is considered moderate if it deviates from high in several (e.g., three or more) of the following ways:

  • Study design: The evidence base consists of well-designed – but non-randomised – controlled trials or pre-post studies.
  • Risk of bias: The risk of bias and confounding factors[5] may be moderate, but not high.
  • Imprecision: The studies have only moderate statistical power to detect significant differences, with confidence intervals that may introduce some uncertainty into decision making. This typically means having smaller sample sizes.
  • Inconsistency: The estimated effects are broadly consistent across the studies, although there may be some moderate variation.
  • Indirectness: The studies are only moderately relevant to the context in which the intervention is implemented.
  • Publication bias: The evidence base appears to have moderate publication bias.


Note: As recommended by GRADE, we assess the overall quality holistically. We take into account both the number and severity of deviations to determine the overall quality rating.

Low

In line with GRADE, observational studies start with a low rating, but RCTs can also receive this rating if they fall short on the criteria more severely. The level of evidence is considered low if it deviates from high in several (e.g., three or more) of the following ways:

  • Study design: The evidence base consists of observational studies, such as cross-sectional, case-control, or cohort studies.[6]
  • Risk of bias: The risk of bias and confounding factors are high.
  • Imprecision: The studies have low statistical power to detect significant differences, with confidence intervals that introduce uncertainty into decision making. This typically means having small sample sizes.
  • Inconsistency: The estimated effects are inconsistent across studies.
  • Indirectness: The studies have very low relevance to the context in which the intervention is implemented.
  • Publication bias: The evidence base appears to have high publication bias.

Very low

This level of evidence represents findings from individual case studies, anecdotal reports, expert opinions, or narrative reviews. At this level, the evidence has not been rigorously assessed or controlled, and the results are highly prone to bias and confounding factors. The intervention’s effectiveness is uncertain, and the outcomes may not be reliable or generalisable to broader populations.

Criteria

Here we expand on the criteria we have adapted from GRADE to evaluate an evidence base. We describe how our criteria differ from GRADE in the Changes to GRADE section below.

Study design

The study design is a fundamental element in assessing the quality of evidence, as it largely determines our ability to make conclusions about causality (e.g., A causes B). In general, we assume that:

  • RCTs are the gold standard for establishing causal effects.
  • Natural experiments[7] can also provide strong evidence of causal effects, but the strength can vary depending on the circumstances.[8]
  • Non-randomised controlled trials or pre-post designs can typically only provide suggestive evidence of causal effects.
  • Observational studies provide very weak evidence of causal effects.

For an evidence base made up of RCTs, we expect the quality of evidence to be high by default. For an evidence base made up of observational studies, we expect the quality of evidence to be low by default.

Risk of bias (RoB)

Risk of bias (RoB) refers to limitations in the study design or implementation that might bias the estimated effects of individual studies.[9] The risk of bias is higher in RCTs where:

  • Participants are aware of the research question and the experimental conditions.
  • Researchers are not blind to the condition participants are assigned to, or they have the ability to influence outcomes.
  • There is sizeable attrition (i.e., participants dropping out over the course of the study).[10]
  • There is sizeable missing outcome data.
  • Some measures or outcomes are not reported.
  • Outcomes are measured with scales that are not valid or reliable.[11]
  • Few robustness checks are conducted, or the results do not hold up to reasonable alterations to the analysis (e.g., different modelling specifications).

In general, we treat studies as guilty of bias until proven innocent. If studies don’t report how they dealt with a methodological concern, we assume risk of bias is present in that dimension.

Imprecision 

Imprecision refers to how precisely effects are estimated (e.g., the width of the 95% confidence interval). In general, sample size is the primary factor influencing imprecision, and therefore the factor we focus on most.[12] All else equal, studies with larger sample sizes produce more precise estimates, and therefore provide stronger evidence of the estimated effect size. They also provide greater statistical power, which makes it easier to conclude that the effect is not 0.

What is a sufficient sample size? The answer can differ depending on the topic, but generally an evidence base should have more than 1,000 data points (across all included studies) to provide strong levels of evidence.[13]
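
As a rough illustration of the relationship between expected effect size, sample size, and power described above, here is a minimal sketch using statsmodels (the effect sizes and thresholds are assumptions for illustration, not figures from our analyses):

```python
# Minimal power sketch: smaller expected effects require much larger samples
# to reach the same power, which is why we prefer large evidence bases.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for effect_size in (0.2, 0.3, 0.5):  # assumed standardised effects (Cohen's d)
    n_per_group = analysis.solve_power(
        effect_size=effect_size, alpha=0.05, power=0.80, ratio=1.0
    )
    print(f"d = {effect_size}: about {n_per_group:.0f} participants per group")
```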

Inconsistency

Inconsistency refers to the variability of effects across studies. Consistent results suggest that the effect is replicable (e.g., not a fluke finding) and robust (e.g., it does not depend on specific circumstances). The quality of evidence is strengthened if multiple high-quality RCTs report similarly sized effects.

The exact number of studies needed depends on the overall quality of the evidence, but in many cases at least three studies are required. Ideally, we would prefer to have 10 RCTs or more, as this is roughly the number where it is possible to assess for publication bias or to perform moderator analyses to examine factors that might account for the variability. In practice, we often have fewer than this number, so assessing inconsistency is more tentative.

Sometimes results are inconsistent for explainable reasons, such as using different demographic groups, dosages, or follow-up timeframes. For example, an intervention might work better for males than females. Sometimes seemingly inconsistent results become consistent after controlling for these differences. Any remaining inconsistency is called unexplained heterogeneity: this is what we want to be small.

There are various statistical methods to assess heterogeneity, such as I² and τ² (Harrer, 2022). But each method has limitations, so we use these in combination with a subjective evaluation of how much the effects differ across studies, based on the point estimates and the overlap in confidence intervals.
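
For concreteness, here is a minimal sketch of how Q, I², and τ² (using the DerSimonian-Laird estimator) are typically computed from a set of study effect sizes and standard errors; the numbers are hypothetical and not taken from any of our meta-analyses:

```python
# Heterogeneity sketch with made-up data: Q is the weighted sum of squared
# deviations from the fixed-effect mean, I^2 the share of variation beyond
# chance, and tau^2 the DerSimonian-Laird between-study variance estimate.
import numpy as np

effects = np.array([0.30, 0.45, 0.10, 0.55])  # hypothetical effect sizes
se = np.array([0.10, 0.12, 0.15, 0.20])       # hypothetical standard errors

w = 1.0 / se**2                              # inverse-variance weights
pooled = np.sum(w * effects) / np.sum(w)     # fixed-effect pooled estimate
q = np.sum(w * (effects - pooled) ** 2)
df = len(effects) - 1
i_squared = max(0.0, (q - df) / q)
c = np.sum(w) - np.sum(w**2) / np.sum(w)
tau_squared = max(0.0, (q - df) / c)

print(f"Q = {q:.2f}, I^2 = {100 * i_squared:.0f}%, tau^2 = {tau_squared:.3f}")
```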

Indirectness

Indirectness refers to the relevance of the evidence to the real-world context. The strength of evidence is higher when the study context closely matches the context in which the charity operates. Ideally, the charity programme is studied directly as it is implemented in the real world. Unfortunately, this is rarely the case.

To assess indirectness, we explore if the characteristics that differ between the study and the charity context appear to predict differences in the effects. Characteristics that commonly differ include:

  1. Population demographics: This includes age, gender, mental health diagnosis, etc.
  2. Context: This includes the social and environmental context.[14]
  3. Intervention characteristics: This includes the type of intervention delivered, the dosage,[15] and the quality of delivery.[16]
  4. Outcomes: The outcomes we have evidence for may differ from the outcomes we view as most important. For example, we often have measures of mental health but prefer to have measures of life satisfaction or happiness. If this is the case, we try to explore whether the proxy outcomes we have evidence for tend to give smaller or larger effects than our preferred outcomes, and adjust our analysis accordingly.
  5. Comparison group: This includes whether the comparison group in the studies reflects the typical standard of care where the intervention is implemented.

Publication bias 

Publication bias is a systematic error in the publication of research findings that occurs when the outcome of a study influences whether or not it is published. In social science, one of the most common forms of publication bias is the tendency for studies with large, positive, or statistically significant results to be more likely to be published than those with small, negative, or non-significant results. As a result, the true effect is typically smaller than the effect estimated from the published evidence.

We use several methods to assess whether publication bias is present in a body of evidence. We conclude that publication bias is small or non-existent when we see the following:

  • A funnel plot showing no asymmetry in effect sizes
  • Related tests based on Egger’s regression showing no significant statistical evidence of small-study effects (a minimal sketch of this test follows the list)
  • A p-curve showing no left-skew and no hump in significance levels across studies around p = 0.05
  • Similar effect sizes from studies that are published, preregistered, and unpublished
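
As an illustration of the second item, here is a minimal sketch of Egger’s regression test with hypothetical data (not one of our analyses): each study’s standardised effect (effect divided by its standard error) is regressed on its precision (one over the standard error), and an intercept close to zero with a non-significant p-value is consistent with little funnel-plot asymmetry.

```python
# Egger's regression sketch with hypothetical data.
import numpy as np
import statsmodels.api as sm

effects = np.array([0.30, 0.45, 0.10, 0.55, 0.25, 0.40])  # hypothetical
se = np.array([0.10, 0.12, 0.15, 0.20, 0.08, 0.18])       # hypothetical

z = effects / se                 # standardised effects
precision = 1.0 / se
X = sm.add_constant(precision)   # intercept + precision
fit = sm.OLS(z, X).fit()

intercept, intercept_p = fit.params[0], fit.pvalues[0]
print(f"Egger intercept = {intercept:.2f} (p = {intercept_p:.3f})")
```

In practice, a test like this would be paired with a visual inspection of the funnel plot and a p-curve, as listed above.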

Intuition check

This is not a formal part of our criteria, but we do have greater confidence in evidence that fits with sensible expectations. For example, we have relatively more confidence in evidence that shows a dose-response relationship or a decay over time when there are strong reasons to expect it to do so. There are many ways to conduct intuition checks, and these will often involve subjective judgements and intuitions about how the world works. If evidence does not pass our intuition checks, it typically leads us to double-check other criteria to ensure that all limitations in the evidence base have been accounted for.

Changes to GRADE

Although GRADE is widely used, it was originally developed for use in health science. We have made some minor changes to the GRADE framework in order to adapt it to the charity evaluation context:


GRADE domain | HLI criteria | Note on similarity
Study design | Study design | Very similar.
Risk of bias | Risk of bias | Very similar.
Imprecision | Imprecision | Very similar, but we use different criteria for decision thresholds.[17]
Inconsistency | Inconsistency | Very similar.
Indirectness | Indirectness | Very similar, but we use additional criteria to assess the relevance of sociocultural context.
Publication bias | Publication bias | Very similar.
(no equivalent domain) | Intuition check | GRADE describes studies behaving intuitively as a general reason to increase confidence, but it doesn’t have its own domain.
Factors that can increase the quality of the evidence | (not used) | GRADE uses several criteria to increase the quality of evidence (e.g., large magnitude of effect). We are sceptical that these criteria can be readily applied to social science, so we don’t use them formally.[18]

Endnotes

  • 1
    See this article for a brief overview.
  • 2
    For example, it does not rely on providing numerical scores and then simply summing them up.
  • 3
    For example, participant demographics are similar in the treatment and control groups at the start of the study, there’s very little dropout of participants between the initial collection of evidence and any follow-up, and subjective wellbeing or affective mental health outcomes are mostly measured with the most valid and reliable scales.
  • 4
    For example, the 95% confidence interval does not cross 0.
  • 5
    A confounding variable is a third variable that is related to both the independent and dependent variables in a research study, making it appear as if there is a cause-and-effect relationship when, in fact, there isn’t. For example, in a study examining the impact of microfinance loans on poverty reduction, education could be a confounding variable if it independently influences both the likelihood of receiving a loan and income. Confounding variables are primarily a concern in non-randomised studies. Failure to account for confounders can lead to biased or misleading results.
  • 6
    While these study designs can help identify associations or correlations, they are very limited for establishing causation.
  • 7
    We are intentionally avoiding the term ‘quasi-experimental’, which has different meanings in the fields of economics and psychology.
  • 8
    Factors influencing the strength of evidence include the relevance of the context, the clarity and precision with which exposure to treatment is measured, and the extent of confounding variables.
  • 9
    This is distinct from risk of bias for the set of studies included in the meta-analysis, which is captured in our “Publication bias” criterion.
  • 10
    However, the risk of bias is reduced if an intention-to-treat analysis is used. This involves analysing the data with all participants included, regardless of whether they completed the treatment.
  • 11
    Many outcomes we use are self-reported. Participants sometimes inflate their ratings to ‘help’ the experimenter (i.e., experimenter demand). Because of this, we prefer it if follow-up surveys are conducted by an independent surveyor that is clearly unrelated to the intervention.
  • 12
    Other factors can also affect imprecision, such as the reliability of measures and heterogeneity in the sample.
  • 13
    Statistical power is what ultimately matters. The smaller the expected effect size, the larger the sample size would need to be for there to be sufficient power to detect a statistically significant effect when there is one. However, power is a function of sample size and effect size, and we often aren’t sure what the exact effect size would be, leaving us unsure of the power as well. This is why we always prefer large samples.
  • 14
    For example, Miguel and Kremer (2004) conducted an RCT on the impact of deworming pills during an El Niño year, so the prevalence of intestinal worms was much higher than usual, which inflated the effect.
  • 15
    For example, the average cash transfer in our meta-analysis was $200 per household, but GiveDirectly cash transfers are $1,000 per household (McGuire et al., 2020).
  • 16
    Charities operating at scale sometimes have lower delivery quality than interventions provided during RCTs.
  • 17
    For example, GRADE recommends using ‘clinical decision thresholds’ based on prevalence of side effects. While this is useful in medical research, the decision thresholds we use vary by context.
  • 18
    For example, large effects are very rare in social science, and often a sign of publication bias, so we are hesitant to increase our rating on this basis. The other criteria are dose-response gradient and the effect of plausible residual confounding.