
Levels Of Evidence

Levels of evidence are proposed hierarchies of research types that intend to rank the strength and reliability of research findings based on study design. However, the idea that research can be ranked based on study design alone is controversial. In evidence-based practice, levels of evidence hierarchies were developed to help clinicians and researchers quickly assess the relative confidence they can place in study results when making clinical decisions. However, these levels should be viewed as guidelines rather than absolutes, as they do not replace the need for critical appraisal of study quality, methodology, or relevance to the patient population in question.

Common Levels of Evidence Hierarchies

Several organizations and methodologists have proposed evidence hierarchies to help clinicians and researchers estimate the relative trustworthiness of research designs. These hierarchies typically rank study types based on their ability to minimize bias and control confounding variables.

Among the most frequently cited frameworks are:

  • Oxford Centre for Evidence-Based Medicine (OCEBM)
  • GRADE (Grading of Recommendations, Assessment, Development, and Evaluations)
  • U.S. Preventive Services Task Force
  • Shekelle et al. (1999) Levels of Evidence

Shekelle et al. (1999) Hierarchy (often cited):

  • IA – Evidence from meta-analysis of randomized controlled trials
  • IB – Evidence from at least one randomized controlled trial
  • IIA – Evidence from at least one controlled study without randomization
  • IIB – Evidence from at least one other type of quasi-experimental study
  • III – Evidence from non-experimental descriptive studies, such as comparative studies, correlation studies, or case-control studies
  • IV – Evidence from expert committee reports, opinions, or respected clinical experience

(Shekelle et al., 1999)
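
For readers who catalogue studies programmatically, the Shekelle levels map naturally to a simple lookup table. The sketch below is a hypothetical illustration, not an official schema: the level codes and descriptions come from the hierarchy above, while the dictionary and function names are assumptions.

```python
# Shekelle et al. (1999) levels of evidence as a lookup table.
# The codes and descriptions follow the hierarchy above; the names
# SHEKELLE_LEVELS and describe_level are hypothetical illustrations.
SHEKELLE_LEVELS = {
    "IA": "Meta-analysis of randomized controlled trials",
    "IB": "At least one randomized controlled trial",
    "IIA": "At least one controlled study without randomization",
    "IIB": "At least one other type of quasi-experimental study",
    "III": "Non-experimental descriptive studies (comparative, correlation, case-control)",
    "IV": "Expert committee reports, opinions, or respected clinical experience",
}

def describe_level(code: str) -> str:
    """Return the description for a Shekelle level code, e.g., 'IB'."""
    return SHEKELLE_LEVELS.get(code.upper(), "Unknown level code")

print(describe_level("IIa"))  # At least one controlled study without randomization
```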

Frequently Asked Questions (FAQ)

What is level II evidence?

  • Although many different hierarchies exist, this typically refers to randomized controlled trials (RCTs).

What is level III evidence?

  • Although many different hierarchies exist, this typically refers to observational research.

What is level IV, V, VI, and VII research?

  • Because the various evidence hierarchies are not identical and diverge further at each lower level, these labels cannot be interpreted without knowing which hierarchy is being referenced.

What is the purpose of levels of evidence?

  • They help clinicians estimate the relative confidence in study findings by summarizing how well the research design controls bias and error.

Are randomized controlled trials always the best evidence?

  • Not necessarily. While RCTs provide strong internal validity, they may be poorly executed or ill-suited to some research questions, such as those involving long-term harms or complex interventions.

How should levels of evidence be applied in practice?

  • Only as guidelines. Clinicians must still appraise each study’s actual methods, relevance, consistency of findings, and risk of bias.

What is a better way to think about evidence quality?

  • Think of evidence quality as a combination of study design, study execution, replication, and how well the results match the clinical question, rather than design alone.

Issues with the Levels of Evidence Hierarchies

In theory, levels of evidence provide a shortcut for determining the "quality" of evidence. However, if five licensed professionals were asked what "quality" refers to, few could answer with a value that can be objectively measured. The term "quality" is subjective unless explicitly defined. For example, someone may feel that a Chevrolet is higher quality than a Toyota, but without objective comparison metrics (e.g., mechanical failure rates), this remains an opinion. "Quality" must be linked to an objective measure (e.g., error rate, reproducibility, or risk of bias), and the difficulty of doing so may be why levels of evidence hierarchies are flawed beyond repair.

The only way to quantify bias or error would be through studies that compare study designs by assessing the actual accuracy of their conclusions, a process known in meta-science as evidence synthesis or meta-research, which is still in development. This issue becomes more complex when hierarchies attempt to compare study types directly. If level-2 research is "better" than level-3 research, then by how much? Does one level-2 study outweigh five level-3 studies? Ten? These seemingly simple questions reveal the weak assumptions underlying most evidence hierarchies, as the sketch below illustrates.
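
To see why "by how much" matters, consider a hypothetical weighted vote count. In the sketch below, all numbers and the weighting scheme are assumptions made up for illustration; no published hierarchy specifies such weights, which is exactly the problem.

```python
# Hypothetical illustration of the "by how much" problem: the verdict of a
# weighted vote count flips depending on an assumed, unjustified weight
# for a level-2 study relative to a level-3 study.
level2_favoring_a = 1   # one level-2 study favors intervention A
level3_favoring_b = 5   # five level-3 studies favor intervention B

for weight in (2.0, 4.0, 6.0):  # assumed worth of a level-2 study, in level-3 "units"
    score_a = level2_favoring_a * weight
    score_b = level3_favoring_b * 1.0
    verdict = "A" if score_a > score_b else ("B" if score_b > score_a else "tie")
    print(f"weight={weight}: A={score_a}, B={score_b} -> favors {verdict}")

# weight=2.0 and 4.0 favor B; weight=6.0 favors A. Because no hierarchy
# justifies any particular weight, the "correct" conclusion is undefined.
```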

It is worth noting that evidence hierarchies, such as those from the Oxford Centre for Evidence-Based Medicine or GRADE, are heuristic tools, not absolute or universally valid ranking systems. Many methodologists share the views above, which are reflected in several common criticisms of these hierarchies.

Traditional evidence hierarchies have been criticized for oversimplifying complex methodological issues. These hierarchies often:

  • Overlook the quality of individual studies within each level.
  • Overvalue randomized controlled trials (RCTs), even when RCTs are inappropriate for the research question due to risk, the complexity of the phenomenon being observed, or the type of information required to answer the question.
  • Lead to the dismissal of large portions of the available research, a form of unintentional cherry-picking that can hinder the development of more accurate conclusions.
  • Place meta-analysis (MA) at the top of the pyramid, overlooking the fact that MA is a method of review (not original data collection) that produces a different type of data and can introduce additional bias and error. This has potentially contributed to the proliferation of poorly designed MAs. For more on this topic, see "The Mass Production of Redundant, Misleading, and Conflicted Systematic Reviews and Meta-analyses."

Brookbush Institute's Contribution: A Better Level of Evidence Hierarchy

This section is from the article: Is There a Single Best Approach to Physical Rehabilitation?

Importantly, study design is not synonymous with methodological rigor. A randomized controlled trial may be well-designed or poorly executed. The same applies to observational or cohort studies. Additionally, different study designs are suited to different questions. RCTs may be optimal for testing the efficacy of acute interventions, but they are not well-suited for modeling longitudinal outcomes, cost-effectiveness, or rare adverse events. No single design universally outperforms others in all contexts.

Controls such as peer review, statistical analysis, blinding, and independent replication reduce error and bias (e.g., selection bias, availability heuristic, personal allegiance). However, because these controls apply across multiple study designs, study designs cannot be ranked without additional context. It is also reasonable, when comparing peer-reviewed publications, to consider that data derived from a group of individuals is likely more reliable than data from a single case. Published case studies are likely more reliable than objective clinical measures, which in turn are more reliable than expert opinion; non-expert opinion in an established scientific field is generally considered irrelevant.

The simplified hierarchy below does not imply that all studies are equally valuable or that expert opinion should be disregarded. Rather, it acknowledges that data derived from larger, more controlled, and replicated studies tend to be more reliable than single cases or anecdotes. This more pragmatic and logically defensible hierarchy is proposed based on the number of controls present and the scale of evidence, rather than generalizations about study design types.

A More Defensible Hierarchy

  1. More research studies are generally better than one study.
  2. Research is better than a single case.
  3. Objective outcome measures in clinical practice are more reliable than expert opinion.
  4. Expert opinion is generally considered superior to non-expert opinion.
  5. Non-expert opinion should not guide clinical decision-making.

The Brookbush Institute takes the concept of levels of evidence further by systematically reviewing all available peer-reviewed, published research on a given topic, rather than restricting conclusions to arbitrary evidence hierarchies. Conclusions emerge directly from available data, supported by a vote-counting rubric that captures trends across studies, combined with systematic tracking of patient outcomes and Bayesian updating principles to refine conclusions over time. This approach helps avoid biases introduced by ignoring lower-level evidence, increases precision in estimating the effectiveness of interventions, and ultimately improves clinical decision-making.
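
The Institute's full rubric and outcome-tracking system are beyond the scope of this glossary entry, but the general principle of Bayesian updating can be illustrated with a standard Beta-Binomial model. All numbers in the sketch below are assumptions for demonstration only.

```python
# Illustrative Beta-Binomial updating of the estimated probability that an
# intervention produces improvement. The prior, batch sizes, and outcome
# counts are all assumptions; only the updating principle is the point.
def update_beta(alpha: float, beta: float, successes: int, failures: int):
    """Conjugate update: Beta(alpha, beta) -> Beta(alpha + s, beta + f)."""
    return alpha + successes, beta + failures

alpha, beta = 1.0, 1.0  # uniform prior: no initial preference
outcome_batches = [(8, 2), (6, 4), (9, 1)]  # (improved, not improved) per period

for improved, not_improved in outcome_batches:
    alpha, beta = update_beta(alpha, beta, improved, not_improved)
    posterior_mean = alpha / (alpha + beta)
    print(f"tracked {improved + not_improved} outcomes -> "
          f"estimated success rate: {posterior_mean:.2f}")

# Each batch of tracked outcomes refines the estimate rather than
# replacing it, which is the essence of Bayesian updating.
```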

Comparison Rubric

(The goal is to use available research to determine the most likely trend; a rough sketch of the rubric as a decision function follows the list.)

  • A is better than B in all studies → Choose A
  • A is better than B in most studies, and additional studies show similar results between A and B → Choose A
  • A is better than B in some studies, and most studies show similar results between A and B → Choose A (with reservations)
  • Some studies show A is better, some show similar results, and some show B is better → Results are likely similar (unless there is a clear moderator variable such as age, sex, or injury status that explains the divergence)
  • A and B show similar results in the gross majority of studies → Results are likely similar.
  • Some studies favor A, others favor B → Unless the number of studies overwhelmingly supports one side, results are likely similar.
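
As noted above, the rubric can be expressed as a rough decision function over study counts. The rubric itself does not define numeric thresholds for "most" or "overwhelmingly"; the cutoffs in the sketch below are assumptions chosen only to make the example concrete.

```python
# A rough encoding of the comparison rubric as a decision function.
# The thresholds for "most" and "overwhelmingly" are illustrative
# assumptions; the rubric itself does not quantify them.
def compare(favor_a: int, similar: int, favor_b: int) -> str:
    """Apply the vote-counting rubric to counts of studies favoring A,
    showing similar results, or favoring B."""
    total = favor_a + similar + favor_b
    if total == 0:
        return "no studies available"
    lead = "A" if favor_a >= favor_b else "B"
    wins, losses = max(favor_a, favor_b), min(favor_a, favor_b)
    if wins == total:
        return f"choose {lead}: better in all studies"
    if losses == 0 and wins > similar:
        return f"choose {lead}: better in most studies, remainder similar"
    if losses == 0 and wins > 0:
        return f"choose {lead} with reservations: some studies favor {lead}, most similar"
    if losses > 0 and wins >= 4 * losses:  # assumed cutoff for "overwhelmingly"
        return f"choose {lead}: divergent studies overwhelmingly favor {lead}"
    return "results likely similar (check moderators, e.g., age, sex, injury status)"

print(compare(favor_a=5, similar=2, favor_b=0))  # choose A: better in most studies
print(compare(favor_a=3, similar=4, favor_b=2))  # results likely similar
```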

Bibliography

  • Phillips, B., Ball, C., Sackett, D., Badenoch, D., Straus, S., Haynes, B., ... & Howick, J. (2009). Oxford Centre for Evidence-Based Medicine: Levels of evidence (March 2009). Oxford Centre for Evidence-Based Medicine.
  • Guyatt, G. H., Oxman, A. D., Vist, G. E., Kunz, R., Falck-Ytter, Y., Alonso-Coello, P., & Schünemann, H. J. (2008). GRADE: An emerging consensus on rating quality of evidence and strength of recommendations. BMJ, 336(7650), 924-926.
  • U.S. Preventive Services Task Force. (2008). Grade definitions. Agency for Healthcare Research and Quality. Retrieved from https://www.uspreventiveservicestaskforce.org/uspstf/grade-definitions
  • Shekelle, P. G., Woolf, S. H., Eccles, M., & Grimshaw, J. (1999). Developing clinical guidelines. Western Journal of Medicine, 170(6), 348.
  • Brookbush Institute. (2024, September 29). Is there a single best approach to physical rehabilitation? Brookbush Institute. Retrieved June 29, 2025, from https://brookbushinstitute.com/articles/there-is-one-best-approach-in-physical-rehabilitation
