If you read our last article "How to Pick the Good Scientific Literature Critics?", you now know how to choose an article according to those criteria. You can now sit back, relax, and start reading. Those results look good, don't they? But can you really trust them? This is where the concept of level of evidence comes into play. Broadly, it represents the level of trust you can put in the conclusions of a study. On that matter, it's better to be safe than sorry.
At Bia, with the intent of analysing our content in the best possible way, we designed an algorithm to determine the level of evidence of studies, based on several existing tools. This was no easy process, since an impressive number of tools exist to assess level of evidence, and they are not all built the same. We aimed for the most rigour with the least trouble, and chose two methods that we use either combined or separately. Let us show you.
The Evidence Based Practice Tool of Winona State University offers a useful global classification based on the type of paper encountered. It is, however, a little simplistic, and we wanted fewer than seven levels for our scale to make it easier for our readers. The scale also calls for "well-designed" studies, which must be interpreted with caution.
| LoE | Description |
| --- | --- |
| I | Evidence from a systematic review or meta-analysis of all relevant RCTs, evidence-based clinical practice guidelines based on systematic reviews of RCTs, or three or more good-quality RCTs with similar results |
| II | Evidence obtained from at least one well-designed RCT |
| III | Evidence obtained from well-designed controlled trials without randomization (i.e. quasi-experimental) |
| IV | Evidence from well-designed case-control or cohort studies |
| V | Evidence from systematic reviews of descriptive and qualitative studies (meta-synthesis) |
| VI | Evidence from a single descriptive or qualitative study |
| VII | Evidence from the opinion of authorities and/or reports of expert committees |
"Well-designed" sounds a little imprecise, doesn't it? How do you judge the design quality of a scientific paper? This is where we needed a second tool, developed specifically for this purpose: the GRADE system. You have probably heard of it before, since it is very popular in Quebec universities. Broadly, it looks like this:
| Study design | Initial quality | Downgrade if | Upgrade if |
| --- | --- | --- | --- |
| Randomised controlled trial (RCT) | High (++++) | Risk of bias: −1 serious, −2 very serious | Large effect size: +1 large, +2 very large |
| | Moderate (+++) | Inconsistency: −1 serious, −2 very serious | Dose response: +1 evidence of a gradient |
| Observational study | Low (++) | Indirectness: −1 serious, −2 very serious | Confounding component: +1 identified |
| | Very low (+) | Imprecision: −1 serious, −2 very serious | |
| | | Publication bias: −1 likely, −2 very likely | |
Let's start by defining a little vocabulary to facilitate the use of this system. There are two entry points to this algorithm: RCTs enter at the superior level, meaning we assume they have a high level of evidence because of their methodology. Observational or descriptive studies enter at a lower level, owing to their non-randomised participants, the absence of a control group, and the fact that they are sometimes retrospective.
This is how the GRADE system defines its level of evidence:
High (++++): high quality; high confidence that the true effect is similar to the estimated effect.
Moderate (+++): moderate quality; moderate confidence that the true effect is similar to the estimated effect. The real effect is probably close to the estimate, but it might be different. New research could have an important impact and modify the estimated effect.
Low (++): low quality; low confidence in the estimated effect. The real effect could be very different from the estimate. New research will very likely have an important impact and could modify the estimated effect.
Very Low (+): very low quality; very low confidence in the estimate. The real effect is probably substantially different from the estimated effect. The estimated effect is uncertain.
Studies then undergo evaluation according to several criteria, as shown in the table, which can modify their final attributed level of evidence. Briefly, the following will be evaluated: risk of bias, inconsistency, indirectness, imprecision, and publication bias (each of which can downgrade the level), as well as a large effect size, a dose-response gradient, and an identified confounding component (each of which can upgrade it).
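The entry points and adjustments described above can be sketched in code. This is a minimal illustration of the logic, not Bia's actual implementation; the function name and the way adjustments are passed in are our own assumptions.

```python
# Illustrative sketch of the GRADE up/down grading logic (not Bia's actual code).
# RCTs enter the scale at High; observational studies enter at Low. Downgrade
# and upgrade points from the criteria table then move the study along the scale.

GRADE_LEVELS = ["Very Low (+)", "Low (++)", "Moderate (+++)", "High (++++)"]

def grade(study_design, downgrades=0, upgrades=0):
    """Return a GRADE level of evidence.

    study_design: "RCT" enters at High (++++); anything else enters at Low (++).
    downgrades: total points lost (e.g. 1 for serious risk of bias, 2 for very serious).
    upgrades: total points gained (e.g. 1 for a dose-response gradient).
    """
    start = 3 if study_design == "RCT" else 1   # index into GRADE_LEVELS
    score = start - downgrades + upgrades
    score = max(0, min(score, len(GRADE_LEVELS) - 1))  # clamp to the scale
    return GRADE_LEVELS[score]
```

For example, an RCT with very serious risk of bias (`grade("RCT", downgrades=2)`) lands at Low (++), while an observational study with a large effect size (`grade("Observational", upgrades=1)`) moves up to Moderate (+++).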
All these items lead to a final grade for the study. When both of those tools are put together, it looks like this:
| Description | GRADE level |
| --- | --- |
| Evidence from a systematic review or meta-analysis of all relevant RCTs, evidence-based clinical practice guidelines based on systematic reviews of RCTs, or three or more good-quality RCTs with similar results | High (++++) |
| Evidence obtained from at least one well-designed RCT | High (++++) |
| Evidence obtained from well-designed controlled trials without randomization (i.e. quasi-experimental) | Moderate (+++) |
| Evidence from well-designed case-control or cohort studies | Low (++) |
| Evidence from systematic reviews of descriptive and qualitative studies (meta-synthesis) | Low (++) |
| Evidence from a single descriptive or qualitative study | Very Low (+) |
| Evidence from the opinion of authorities and/or reports of expert committees | Very Low (+) |
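The combined table above is essentially a lookup from the seven Winona State LoE levels to a GRADE level. A sketch of that mapping (illustrative only; the dictionary and function names are our own, not Bia's code):

```python
# Mapping of the seven Winona State LoE levels to the GRADE levels
# shown in the combined table (illustrative sketch).

LOE_TO_GRADE = {
    "I": "High (++++)",       # systematic review / meta-analysis of RCTs
    "II": "High (++++)",      # at least one well-designed RCT
    "III": "Moderate (+++)",  # controlled trial without randomization
    "IV": "Low (++)",         # case-control or cohort study
    "V": "Low (++)",          # meta-synthesis of descriptive/qualitative studies
    "VI": "Very Low (+)",     # single descriptive or qualitative study
    "VII": "Very Low (+)",    # expert opinion / committee reports
}

def combined_level(loe):
    """Return the GRADE level for a Winona State LoE roman numeral."""
    return LOE_TO_GRADE[loe]
```

So a quasi-experimental study (level III) would start at Moderate (+++) before any GRADE adjustments are considered.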
The GRADE system is used when a study's methodological quality needs to be justified in order to grant it a level of evidence. Some papers won't need the GRADE system to be categorised.
KEEP IN MIND (because we love to make you work on your clinical reasoning too)
Even if this system looks like a simple algorithm of yes/no questions, it isn't. Grading level of evidence requires judgment, and consistency between raters cannot be guaranteed. The GRADE system comes with a user manual of more than nine parts and a disclaimer stating that "empirical evidence supporting criteria is limited - attempts to show systematic difference between studies meeting or not the criteria have shown inconsistent results" and that it is "emphasizing simplicity and parsimony over completeness". That being said, even though we put great effort into rating the studies we use in our content as rigorously as possible, this does not relieve you of the responsibility to be critical of the information you read and of how you integrate these notions into your daily practice!