Title: Rethinking Levels of Evidence: Why the Evidence Pyramid Fails and What Should Replace It
Abstract
Background: Levels-of-evidence pyramids are widely taught as hierarchies of “quality,” yet the metrics by which they are measured are rarely defined, and the research testing the accuracy of rankings is rarely discussed. Historically, these hierarchies were introduced as pragmatic tools to help guideline panels and busy clinicians prioritize evidence under time and resource constraints, not as universal truth meters. Over time, they have been reinterpreted as rigid rankings that place randomized trials and systematic reviews at the top and routinely devalue observational research, case series, and practice-based data that may be more directly relevant to clinical care.
Objective: To examine conceptual and empirical problems with conventional levels-of-evidence hierarchies and to propose a more defensible, user-focused framework for judging evidence in rehabilitation, human performance, and related fields.
Discussion: We review the historical development of evidence pyramids and show that major evidence-based medicine groups originally framed them as heuristic shortcuts for organizing recommendations, rather than as validated measures of accuracy. Meta-epidemiological comparisons between randomized and well-conducted observational studies indicate no consistent differences in effect-size accuracy when they address the same question in similar populations; differences in bias are better explained by specific features such as allocation concealment, blinding, outcome type, and handling of missing data than by design label alone. We argue that systematic reviews and meta-analyses are secondary analyses that reorganize primary data and add new opportunities for bias and error. Their explosive growth, methodological variability, redundancy, and frequent discordant conclusions contradict the notion that they should sit at the apex of an evidence hierarchy. We further contend that study design should align with the research question and ethical constraints, and that observational designs are often the only viable or optimal choice for causal, harm-related, and complex, multimodal questions. Finally, we distinguish between researchers (who may use design hierarchies as aspirational guides when planning new studies) and users of research (clinicians, educators, and reviewers), whose central task is to integrate all relevant, credible data rather than exclude observations that sit on “lower” tiers.
Conclusions: Design labels and traditional evidence pyramids are poor surrogates for accuracy. A more defensible framework for users distinguishes analytic comparative studies from uncontrolled designs, structured clinical data from unsystematic impressions, evidence-constrained expert opinion from non-expert claims, and primary studies from reviews. Within this framework, evidence is ordered conditionally by controls for bias, scale and replication, directness to the clinical question, and synthesis quality. Adopting this approach can reduce cherry-picking, restore appropriate weight to observational evidence, and improve decision-making in rehabilitation and human performance.
Why the Levels-of-Evidence Pyramids Fail and What Should Replace Them
Introduction: When a Heuristic Becomes Dogma
Ask ten licensed professionals to explain how the different “levels of evidence” are ranked, and you are likely to get ten different answers. One might say “risk of bias.” Another might say “internal validity.” A third might say “strength” or “rigor” without being able to specify how that rigor is quantified. The most common answer may be “quality,” a particularly subjective term. If you press a little harder and ask, “What statistic, metric, or objectively measurable quantity are you referring to: error rate, reproducibility, effect-size accuracy?” the conversation usually stalls. The pyramid is treated as self-explanatory, even though few, if any, can clearly state what is being measured along its vertical axis.
Despite this ambiguity, most evidence-based practice courses introduce the same visual: a pyramid with expert opinion and case reports at the bottom, observational studies in the middle, randomized controlled trials (RCTs) above them, and systematic reviews or meta-analyses at the very top. Library guides and teaching materials describe this as a hierarchy of “strength” or “quality” of evidence, and some explicitly define it as a ranking of studies according to the probability of bias. For example, the Simmons University Nursing levels-of-evidence guide states, “Levels are ranked on risk of bias – level one being the least bias, level eight being the most biased” (1). The Concordia University Wisconsin Social Work evidence-based practice guide similarly notes, “Higher levels of evidence have less risk of bias” (2). The Oxford Centre for Evidence-Based Medicine (OCEBM) presents levels of evidence that place systematic reviews of randomized trials at level 1 and expert opinion at level 5, and many derivative pyramids adopt the same basic ordering (3).
Originally, these hierarchies were introduced as pragmatic tools. Groups such as OCEBM developed levels-of-evidence tables to help guideline panels and journal editors prioritize studies when time and resources were limited (3). To our knowledge, they were never validated as instruments for measuring error rates across designs and were not intended to serve as universal truth meters. The problem is the shift from heuristic to dogma. Introductory research courses often imply that study design categories are a direct proxy for “how true” a result is, so anything below a chosen level is dismissed as “low quality,” regardless of how the study was conducted or how much data it provides. In practice, this leads to rigid schemes in which large bodies of observational research are routinely down-ranked and ignored, while systematic reviews and meta-analyses are treated as the pinnacle of evidence, even though they do not generate new primary data and can amplify the biases of their inputs. The most problematic use of this logic is the dismissal of any study that is not a meta-analysis, combined with treating a meta-analysis that fails to reject the null hypothesis as proof that the intervention does not work. (We return to this fallacy in the sections on systematic reviews and meta-analyses below.) In fields such as rehabilitation and disability, where blinding is often impossible, interventions are complex, and long-term practice-based outcomes may be more informative than short-term experimental trials, this structure tends to devalue some of the evidence that is most clinically relevant.
Quick Summary
- Where Levels-of-Evidence Hierarchies Came From: Levels-of-evidence hierarchies were created as pragmatic tools to help guideline panels and busy clinicians grade the strength of recommendations when under time and resource constraints, not as a calibrated ranking of the “truth.” Over time, they have been misused as universal evidence hierarchies, despite the existence of dozens of incompatible pyramids and the absence of an empirically validated ranking.
- What Levels-of-Evidence Hierarchies Claim to Measure But Don't: Hierarchies routinely invoke vague terms such as “quality,” “strength,” or “risk of bias” without defining them in measurable terms or mapping design labels to actual error rates. Meta-epidemiological studies show no consistent difference in accuracy between well-conducted observational studies and randomized trials, and differences in bias and overestimation are generally driven by specific methodological features (e.g., allocation concealment, blinding, outcome type).
- Systematic Reviews versus Original Research: Systematic reviews (SRs) and meta-analyses (MAs) are secondary analyses that synthesize and reorganize primary data. They do not generate new primary evidence; they generate a different type of data. If Netflix is a library of original works, then SRs are like “Rotten Tomatoes.” Furthermore, the review adds layers of analysis that are susceptible to additional bias and error. Given the research documenting their explosive growth, methodological variability, redundancy, and frequently discordant conclusions, SRs and MAs should not be placed at the apex of an evidence hierarchy. Based on this research, SRs and MAs may be the most deserving of “downgrading.”
- Research Design Must Match the Research Question: The study design should be selected to align with the research question and minimize the risk of bias while maximizing accuracy, given the researchers' constraints. Research should not be designed to match an arbitrary position on a hierarchy. Observational designs are often the only viable or appropriate option for causal, harm-related, and complex multimodal questions. Discarding them because they sit “below” RCTs leads to incomplete, biased, and less accurate conclusions.
- Two Different Audiences, Two Different Interpretations: For researchers, levels of evidence can serve as aspirational design guidance when planning new studies and seeking funding. For readers and reviewers, however, the central task is to integrate all relevant, credible data. For users to ignore data because it is on a lower tier of evidence conflates design ideals with interpretation rules and encourages systematic cherry-picking.
- A More Defensible “Levels of Evidence” Framework: Empirical comparisons show no consistent accuracy gap between randomized and well-conducted observational studies, so the traditional level-2/level-3 split is not defensible as a reliability boundary. A more coherent framework distinguishes: (1) analytic studies with comparison groups versus uncontrolled designs, (2) structured clinical data versus unsystematic impressions, (3) expert versus non-expert opinion, and (4) primary studies versus reviews, and then orders evidence conditionally by controls for bias, scale, replication, directness, and synthesis quality.
New "Levels of Evidence"
- More Research: All else equal, multiple independent, well-controlled studies that trend in the same direction provide more reliable estimates than a single study.
- Original Research: Peer-reviewed original research generally deserves more weight than unsystematic clinical anecdotes.
- Systematic Review: Well-conducted systematic reviews and meta-analyses can provide an efficient overview of a topic, but they are secondary analyses that introduce the potential for additional bias and error. Furthermore, as discussed above, the production of redundant and incongruent systematic reviews should warrant skepticism regarding their reliability.
- Structured Data: Structured, routinely collected outcome data are more informative than unstructured impressions or clinical notes not recorded systematically.
- Research-Supported Expert Opinion: Expert opinion constrained by a careful reading of the literature is more credible than lay opinion, but it remains subordinate to good data when the two conflict.
Rethinking Levels of Evidence: Why the Evidence Pyramid Fails and What Should Replace It
Where the Levels-of-Evidence Hierarchies (Pyramids) Came From
The first widely cited “levels of evidence” hierarchy likely emerged from the Canadian Task Force on the Periodic Health Examination in 1979. The Task Force was charged with organizing preventive care recommendations and needed a practical way to indicate the degree of confidence to place in different findings. Their report introduced three levels of evidence and five grades of recommendation strength, ranked largely by study design, to guide screening and preventive intervention recommendations (4). In the 1980s and 1990s, hierarchies similar to the Canadian Task Force recommendations were adopted both in early evidence-based medicine (EBM) texts and by guideline developers. For example, Sackett and colleagues’ article “Evidence-based medicine: what it is and what it isn’t” was written for practicing clinicians and emphasized the disciplined use of current best evidence, including distinctions among study types (5). At the same time, organizations such as the U.S. Preventive Services Task Force, the Australian National Health and Medical Research Council, and the Scottish Intercollegiate Guidelines Network used levels of evidence as tools to help panels grade the strength of recommendations. Later, groups such as the Oxford Centre for Evidence-Based Medicine (OCEBM) and the GRADE Working Group formalized and extended these ideas. OCEBM produced levels-of-evidence tables in the late 1990s to make “the process of finding appropriate evidence feasible and its results explicit,” and has revised them as EBM methods evolved (3). The GRADE group, starting in 2000, developed a separate framework for rating the certainty of evidence and the strength of recommendations for specific outcomes in guidelines, with a strong emphasis on transparency and usability for decision-making bodies (6). Taken together, this history demonstrates that hierarchies were originally conceived as pragmatic tools to aid in assessing the strength of evidence-based clinical recommendations, not as universal, empirically calibrated measures of “truth.”
Over time, the simple idea of ranking designs on a vertical scale proliferated into a large family of pyramids and level schemes. Traditional pyramids contrast expert opinion and case reports at the bottom with randomized trials and systematic reviews at the top. Variants such as the “6S” model add layers for synopses, summaries, and systems (7). Murad and colleagues proposed a “new evidence pyramid” that moves systematic reviews and meta-analyses off the apex and portrays them as an overlay on primary studies, while also adding qualitative and observational evidence in a more nuanced structure (8). Beyond these well-known examples, Blunt’s doctoral work has cataloged more than 80 distinct hierarchies for grading medical evidence, differing in the number of levels, the designs they include, and how they position systematic reviews, RCTs, and observational studies (9). The existence of dozens of non-identical hierarchies, all presented as “the” evidence ladder, is itself evidence that no single empirically validated ranking has been established.
The original intended use of these hierarchies was modest. Early levels-of-evidence tables and the OCEBM documents describe their system as “one approach” to organizing evidence for different question types and as a shortcut to make evidence-finding feasible and its results explicit for busy clinicians and guideline developers (3). Similarly, the GRADE Working Group defines GRADE as a transparent, structured system for rating the certainty of a body of evidence and the strength of recommendations in guidelines, applied to bodies of evidence after systematic review and risk-of-bias assessment (6). In practice, however, many educational materials and library guides present pyramids as general-purpose hierarchies of “strength of evidence,” encouraging readers to treat systematic reviews and meta-analyses of randomized trials as inherently superior and to dismiss observational studies, case series, and practice-based data as weak evidence. This shift from a heuristic for prioritizing evidence under time pressure to a gatekeeping device that determines which designs “count” is a misinterpretation of the intent of these hierarchies. It leads to fallacious interpretations of data and reflects a failure by many educators to help students develop a factual understanding of research and how to interpret findings.
What Levels-of-Evidence Hierarchies Claim to Measure But Don't
"Quality” is rarely defined.
Most levels-of-evidence hierarchies claim that higher levels represent “higher quality” or “stronger” evidence. However, “quality” is rarely defined in terms of measurable quantities such as error rates, reproducibility, or empirically estimated risk of bias. As mentioned above, the Simmons University Nursing levels-of-evidence guide explicitly states, “Levels are ranked on risk of bias – level one being the least bias, level eight being the most biased” (1). Also from above, the Concordia University Wisconsin Social Work evidence-based practice guide asserts that “higher levels of evidence have less risk of bias” (2). These statements imply a quantitative gradient but provide no empirical mapping of the study design to the magnitude of bias. Without an objectively measurable quantity, “quality” remains a subjective judgment. It is analogous to asserting that one car manufacturer produces “higher quality” vehicles than another without reporting breakdown rates, repair costs, or safety statistics. Preference is presented as fact, but the underlying quantity is undefined.
Design Does Not Equal Rigor
A central problem with design-based hierarchies is the assumption that study design is a direct proxy for methodological rigor. In reality, randomized trials vary widely in their execution, as do other experimental, observational, and case-series designs. Meta-epidemiological research has demonstrated that specific features of trial conduct, not the abstract label “randomized controlled trial,” are associated with differences in effect estimates.
Schulz et al. assessed the methodological quality of 250 controlled trials across 33 meta-analyses and found that inadequate or unclear allocation concealment was associated with substantially larger estimated treatment effects than in trials reporting adequate concealment (10). Wood et al., in a meta-epidemiological study of 1,346 trials across 146 meta-analyses, found that trials with inadequate or unclear allocation concealment and lack of blinding produced exaggerated estimates of benefit, particularly for subjectively assessed outcomes, while objective outcomes were less affected (11). Savović et al. (2012) combined data from seven meta-epidemiological datasets (1,973 trials) and reached a similar conclusion: reported deficiencies in sequence generation, allocation concealment, and blinding were associated with larger apparent treatment effects and greater between-trial heterogeneity, again with the largest distortions for subjective outcomes (12). Page et al. synthesized 24 meta-epidemiological studies and confirmed that, on average, inadequate or unclear sequence generation and allocation concealment lead to modest but systematic exaggeration of intervention effects (ratios of odds ratios around 0.90–0.93, indicating larger apparent benefits in high-risk trials), with more pronounced bias for subjective outcomes (13). A further study by Savović et al. (2018), which linked formal risk-of-bias assessments to trial results, reinforced this pattern: differences in effect estimates are better explained by specific domains such as allocation concealment, blinding, and incomplete outcome data than by the generic label “randomized trial” (14).
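To make the ratio-of-odds-ratios metric used in these studies concrete, the short worked example below illustrates how a ratio of roughly 0.90 translates into an exaggerated apparent benefit; the individual odds ratios are hypothetical and chosen only for illustration.

```latex
% Ratio of odds ratios (ROR): the pooled effect in trials with a methodological
% deficiency divided by the pooled effect in trials without it
% (for a harmful outcome, benefit corresponds to OR < 1).
\[
\mathrm{ROR} \;=\; \frac{\mathrm{OR}_{\text{inadequate concealment}}}{\mathrm{OR}_{\text{adequate concealment}}}
\]
% Illustrative (hypothetical) numbers: if adequately concealed trials estimate
% OR = 0.80 (a 20% reduction in the odds of a poor outcome) and ROR = 0.90, then
\[
\mathrm{OR}_{\text{inadequate}} \;=\; 0.90 \times 0.80 \;=\; 0.72,
\]
% i.e., trials at high risk of bias suggest a 28% reduction in the odds rather
% than 20%, an exaggeration of the apparent benefit.
```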
In summary, these studies demonstrate that RCTs, compared with one another, do not yield uniformly consistent effect estimates; their results are substantially influenced by allocation concealment methods, blinding, and whether outcomes are subjective or objective. In other words, the RCT label alone does not guarantee that a study is less vulnerable to bias or methodological inconsistency. The same within-design variability is likely true of observational and case-control studies; a direct comparison of accuracy across levels of the pyramid is addressed in the next section.
Levels Do Not Predict Differences in Accuracy
Design-based hierarchies also imply a stronger claim: not only are randomized trials assumed to be more rigorous on average, but observational and other “lower-level” designs are assumed to be systematically less accurate, typically by overestimating treatment effects. Some authors have tried to justify the hierarchies by arguing that lower levels of evidence tend to overestimate treatment effects. This is a reasonable hypothesis a priori. With fewer design controls, we might expect more opportunities for bias that favor finding an effect, which, on average, would inflate treatment effects and increase the likelihood of crossing a conventional threshold for statistical significance. If this were true in a general, law-like way, we would observe a clear empirical gradient: for similar questions in similar populations, observational studies would consistently yield larger effect estimates than randomized trials, and case-series designs would exaggerate effects even further.
This claim is testable, and it has been tested. Benson et al. compared treatment-effect estimates from 136 reports (19 clinical topics) across observational studies and randomized controlled trials that evaluated the same interventions. They found that, in most cases, estimates from observational studies and randomized trials were similar, and only 2 of 19 comparisons produced observational estimates that lay outside the 95% confidence interval of the randomized-trial summary estimate (15). Concato et al. performed a similar comparison across five clinical questions, matching case–control and cohort studies to randomized trials on the same topic. They concluded that well-designed observational studies (with either a cohort or a case–control design) did not systematically overestimate the magnitude of treatment effects relative to randomized controlled trials addressing the same question (16).
More recent meta-epidemiological work has broadened these comparisons. Anglemyer and colleagues reviewed methodological studies that directly compared effect estimates from observational studies and randomized trials addressing the same questions. Across 14 reviews, 11 (79%) reported no significant difference between designs, and pooled ratios of effect estimates were close to 1 (for example, pooled ratio of ratios 1.08, 95% CI 0.96–1.22), with substantial heterogeneity across topics but no consistent pattern of large overestimation by observational studies (17). Toews et al. reported similar findings, again pooling ratios of ratios from methodological reviews and concluding that there was no difference or only very small average differences between effect estimates from randomized trials and observational studies (overall ratio of ratios 1.08, 95% CI 1.01–1.15), with modest deviations in some subgroups (for example, pharmaceutical interventions and meta-analyses with high heterogeneity). They emphasized that discrepancies were better explained by differences in populations, interventions, comparators, outcomes, and analytical choices than by study design label alone (18).
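For readers unfamiliar with the pooled “ratio of ratios” reported in these comparisons, the brief sketch below shows how the metric is read; the interpretation is ours, applied to the pooled values quoted above.

```latex
% Ratio of ratios (RoR): the pooled effect estimate from observational studies
% divided by the pooled estimate from randomized trials addressing the same question.
\[
\mathrm{RoR} \;=\; \frac{\widehat{\mathrm{OR}}_{\text{observational}}}{\widehat{\mathrm{OR}}_{\text{randomized}}}
\]
% RoR = 1 means the two designs agree on average. The pooled values of 1.08
% (95% CI 0.96–1.22 in Anglemyer et al.; 1.01–1.15 in Toews et al.) correspond to
% average differences of only about 8% on the ratio scale, with confidence
% intervals at or very near the null.
```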
Taken together, these studies show no consistent difference in accuracy between well-conducted observational studies and randomized trials addressing the same question. Other factors may significantly affect estimation accuracy, including the population studied, the interventions compared, and the outcome measures used. However, the levels-of-evidence hierarchies are stratified by study design rather than by these more influential factors.
Systematic Reviews versus Original Research
Levels-of-evidence pyramids typically place systematic reviews and meta-analyses at the apex, above RCTs and all other original research. These visual representations imply that systematic reviews (SRs) and meta-analyses (MAs) are “more accurate.” However, this construction of the levels-of-evidence pyramid has three primary issues. First, SRs and MAs are reviews of data; they are not original data. For a simple analogy, they are closer to the type of data on “Rotten Tomatoes” (reviews of original works) and fundamentally different from Netflix (original works). The second issue concerns the appropriateness of using SR processes, particularly MAs, to review a body of evidence. Many SRs and MAs include groups of studies that are not well suited to this type of analysis, or exclude certain types of research altogether (an overemphasis on RCTs). Often this is the result of poor sorting and grouping of heterogeneous studies, leading to comparisons of “apples to oranges.” The third issue is the accuracy of published SRs and MAs. If SRs and MAs were the most accurate representation of the truth, we would expect them to be rare, carefully curated syntheses of mature bodies of research and would expect multiple reviews of the same content to yield congruent conclusions. Rosner captured many of these ideas in his paper “Evidence-based medicine: Revisiting the pyramid of priorities.” He notes that “the canonical pyramid of EBM excludes numerous sources of research information, such as basic research, epidemiology, and health services research... Compounding the issue is that poor systematic reviews, which comprise a significant portion of EBM, are prone to subjective bias in their inclusion criteria and methodological scoring, shown to skew outcomes” (19). SRs and MAs reorganize and transform data from primary studies; they do not automatically convert a heterogeneous, imperfect evidence base into a single, high-accuracy estimate. It should have been noted early in the development of these pyramids that review of the original data by additional authors adds layers of potential bias and error rather than removing them.
The Scale of the Problem
If SRs and MAs were reliably the most accurate representation of the truth, we would expect them to be rare, carefully curated syntheses of the body of evidence, with consistent methodologies for reducing the risk of bias and error; however, this is not the case. Page et al. conducted a sampling analysis comparing 2014 to 2004 and estimated a 300% increase in SR publications during that period, an increase not matched by a comparable increase in primary research. Furthermore, there was wide variability in core methodological features and reporting: 70% of SRs assessed risk of bias, but only 15% incorporated that assessment into the analysis (20). Ou et al. reviewed 11 quantitative studies of overlapping systematic reviews on the same topic and reported that 68% of systematic reviews exhibited overlap, with up to 76 overlapping reviews on a single topic; only 36% of overlapping reviews cited previous reviews, and only 9% reported protocol registration (21). Additionally, a well-known publication by Ioannidis (“The mass production of redundant, misleading, and conflicted systematic reviews and meta-analyses”) investigated the proliferation of PubMed-indexed articles between January 1, 1986, and December 4, 2015, and reported that 266,782 items were tagged as “systematic reviews” and 58,611 as “meta-analyses.” Annual publications between 1991 and 2014 increased by 2,728% for SRs and 2,635% for MAs, compared with a 153% increase for all PubMed-indexed items over the same period. Further, for many topics, MAs were highly redundant, with some topics exceeding 20 MAs; in one field, antidepressants for depression, 185 MAs were published between 2007 and 2014. Many of these MAs were authored by industry employees or investigators with industry ties, potentially aligning them with sponsored interests and introducing a serious risk of bias (22). Together, these studies demonstrate a proliferation of redundant SRs and MAs with concerning methodological shortcomings.
Additionally, if SRs and MAs were the most accurate sources of data, we would expect the findings of multiple MAs on the same topic to be consistent with one another and with the body of research. Unfortunately, this is not the case. Jadad et al. noted that as systematic reviews proliferated, it had become “common to find more than 1 systematic review addressing the same or a very similar therapeutic question” and that “conflicts among reviews are now emerging,” and they proposed a decision algorithm to help decision-makers select among discordant reviews (23). Lucenteforte et al. investigated interventions for myocardial infarction in Clinical Evidence and found that 36 of 153 systematic reviews (23.5%) were “multiple” reviews that formed 16 clusters addressing the same PICO question; among the 10 clusters with a shared composite outcome, complete agreement on statistically significant differences between interventions was present in only 7, and although reviews agreed on which treatment (or control) was superior in 14 of 16 clusters, they showed substantial variation in study and outcome selection, subgroup analyses, and reporting of major outcomes (24). Further, Ioannidis expounds on his analysis of the 185 MAs on antidepressants, noting that “Antidepressants offer a case study of the confusing effects of having redundant meta-analyses with different conclusions…paroxetine ranked anywhere from first to tenth best, and sertraline ranked anywhere from second to tenth best” (22). In summary, the proliferation of SRs and MAs has been matched by an increase in conflicting conclusions, undermining the claim that they are the most accurate form of evidence.
In conclusion, modern SRs and MAs likely deserve less trust than original research, and Ioannidis may have been correct when he stated, “Currently, there is massive production of unnecessary, misleading, and conflicted systematic reviews and meta-analyses. Instead of promoting evidence-based medicine and health care, these instruments often serve mostly as easily produced publishable units or marketing tools” (22).
A Better Recommendation
Likely, the appropriate response to the proliferation of SRs and MAs and the variability in their accuracy is not only to reconsider their position in a ranking of evidence, but also to reconsider how they are produced. A publication by Elliott et al. proposed a solution that the Brookbush Institute arrived at years later, following our concerted effort to improve the accuracy, consistency, and efficiency of producing systematic reviews and the courses that build on them. Elliott and colleagues proposed “living systematic reviews”: online, continuously updated evidence syntheses that are revised as new studies are published and as methods improve (25). In addition to continuous improvement, online publications could update systematic reviews in response to recommendations from the professional community. For example, a large systematic review may include a sub-section of studies that could be sorted by “experience” and compared to one another. A separate synthesis could be conducted by a community member and incorporated into the larger systematic review to provide additional nuance to the conclusions. The ability of professionals to continuously contribute to the same review should result in far less redundancy and an interactive increase in accuracy and refinement of conclusions over time.
In conclusion, evidence hierarchies should not regard SRs and MAs as more accurate than original research. SRs and MAs are tools for organizing and summarizing data. A well-conducted systematic review of robust trials and observational studies can provide a clearer picture of a topic than any single study, just as a carefully constructed Rotten Tomatoes aggregate score can give a better sense of the average quality of a movie than one review can. But a biased or incomplete subset of research, poorly analyzed data, and the use of tools that are inappropriate for the type of data can add “noise” and make the signal harder to detect. In short, levels-of-evidence hierarchies misclassify systematic reviews and meta-analyses by placing them at the apex of a design-based pyramid. Reviews and meta-analyses, at the very least, belong in a separate category and, in practice, based on the SRs and MAs published over the last three decades, may need to be considered less reliable than original research.
Research Design Must Match the Question
Evidence hierarchies have implied a false premise or, at the very least, encouraged judgments that contradict a basic principle of research: the study design should be selected to align with the research question and constraints, not to satisfy a pre-specified rank on a hierarchy. Further, using evidence-level hierarchies to dismiss research (often to reduce the number of studies to be considered in a review) is a common and troubling practice. It leads to published summary conclusions, reviews, and secondary sources that include only a fraction of the available evidence, resulting in less nuanced and often inaccurate information.
Observational Research Is Not a Lesser Alternative
Observational research is not a fallback for researchers unwilling to perform RCTs; for many research questions, including questions about causal factors, it is the ideal methodology. RCTs are a powerful tool for some question types, but they are neither universally feasible nor universally the best methodology. Bosdriesz et al. stated that “the RCT is the best study design to evaluate the intended effect of an intervention, because the randomization procedure breaks the link between the allocation of the intervention and patient prognosis. If randomization of the intervention or exposure is not possible, one must rely on observational analytic studies, but these studies often suffer from bias and confounding. If the study focuses on unintended effects of interventions (i.e., effects of an intervention that are not intended or foreseen), observational analytic studies are the most suitable study designs, provided that there is no link between the allocation of the intervention and the unintended effect” (26). Further, Dekkers et al. developed guidance for performing SRs and MAs of observational etiologic studies and noted that, for many causal questions, observational evidence is the only realistic option and can be highly informative when appropriate confounding control and sensitivity analyses are used (27). For example, physical medicine often involves treatment plans that include multiple interventions, adapted over time to patient response and delivered in settings that are difficult to standardize. In this context, well-designed cohort studies, registries, interrupted time-series analyses, and single-case experimental designs can yield robust evidence linking outcomes to patterns of intervention planning that would be extremely difficult or cumbersome to study in conventional blinded RCTs. Additionally, Vandenbroucke notes an important function of observational research: “observational designs are indispensable for questions about environmental and lifestyle exposures, such as smoking or air pollution, where randomized assignment would be unethical or impossible” (28). This is an important point that should be taught in every research methods course: a researcher cannot knowingly expose an individual to a variable that has significant potential to cause harm. For research questions that aim to determine the association between a variable and the risk of harm, observational research is often the only viable methodology. This includes attempts to link changes in biomechanical alignment, muscle activity, or joint stiffness with pain, dysfunction, or injury; once prior observational research suggests such an association, deliberately inducing the suspected risk factor in an experiment becomes ethically untenable. The importance of observational research for many research questions suggests that it should not be considered a lesser form of evidence; it is a different form of evidence.
Blinding and Control Groups Are Not Always Necessary
Additional factors used to “rank” RCTs in levels-of-evidence hierarchies include blinding and control groups; however, these schemes rarely consider whether blinding or a control group is necessary, effective, or may itself introduce confounding variables. Blinding can reduce the risk of bias introduced by those administering the intervention; however, it is not always practical. Manual therapy, exercise programs, dry needling, and complex multimodal rehabilitation may make true blinding impossible, and “sham conditions” may themselves introduce confounding variables. For example, sham mobilization or sham needling may influence outcomes through motion and other neurophysiological responses. Control groups may also be unnecessary, depending on the study’s intentions. Consider a study that aims to determine which of two interventions is more effective, where prior research has already demonstrated statistically significant effects for both. If the intent is to identify the more effective intervention, a no-treatment control group will not yield additional clinically relevant information; adding one is not harmful, but it is unnecessary. Note that blinding and control groups themselves are not the issue. The issue is dismissing potentially accurate and relevant data because a “levels of evidence” pyramid suggests that only RCTs with blinding and a control group are the “highest level of evidence.”
Clinicians and Reviewers Must Work with the Available Evidence
Perhaps the most concerning development resulting from levels-of-evidence pyramids is the rationalization of a form of cherry-picking in which relevant information is excluded rather than integrated into the most accurate possible conclusions. Several organizations have noted this trend and proposed measures to reverse it. Johnston et al., in a publication noting the difficulty of synthesizing rehabilitation research, state that “Evidence synthesis in disability and rehabilitation can be improved by: explicating criteria for evaluating nonrandomized evidence, including the regression discontinuity, interrupted time series, and single-subject designs, as well as state-of-the-art methods of analysis of observational studies...” (29). This position is consistent with the broader GRADE framework, which allows well-conducted observational studies to be rated as higher-certainty evidence when randomized trials are infeasible, unethical, or grossly unrepresentative of clinical reality (6). Marko et al., in a publication on comparative effectiveness research (CER), also note the limitations of hierarchical models of evidence and favor the application of a strength-of-evidence model in which observational research fills gaps in randomized clinical trial data and is particularly valuable for investigating effectiveness, harms, prognosis, and infrequent outcomes, as well as in circumstances where randomization is not possible (30). These publications imply that all relevant available evidence should be included to yield the most accurate conclusions.
Conclusion
If these points are considered together, they suggest that the primary question should be closer to, “Does this research study provide useful data?” It should not be, “Where does this research study fall on a generic hierarchy?” If research reflects what is occurring in the natural world, then studies that investigate the same intervention in similar populations, using different but appropriate designs, should, on average, converge on compatible answers. This is precisely what comparative meta-epidemiological work tends to show: across multiple clinical topics, well-conducted observational studies, cohort designs, and randomized trials often yield similar treatment-effect estimates when they address the same question in comparable populations. A case series, a cohort study, and a randomized trial of a 6-month progressive resistance-training program will all tend to show improvements in strength, endurance, and hypertrophy; the design shapes how precisely and how causally we can interpret those gains, but it does not place us in a different universe of “truth.” Design labels, by themselves, are a poor surrogate for accuracy. Put another way, individuals synthesizing data to develop conclusions about practice do not get to dictate to the body of research what they want to see; they must use all relevant data available. Anything short of this practice is cherry-picking and risks bias or, at the very least, leads to conclusions that omit nuance or detail that may be necessary for optimal outcomes.
Two Different Audiences, Two Different Interpretations
Levels-of-evidence hierarchies are presented as if all audiences should adopt them in the same way; however, there are at least two distinct audiences that should take different perspectives on research. There is a fundamental difference between researchers who design and conduct studies and those who read previously published research, including clinicians, educators, and researchers conducting reviews. This can be summarized as the difference between creating new data and synthesizing existing evidence.
Researchers (Creators)
When a researcher seeks to test a hypothesis with original research, certain design features are preferable. For many questions, it is reasonable to expect randomization and allocation concealment, the construction of matched comparison groups, the addition of a control group, blinding, and pre-registration. Each of these design choices can help reduce vulnerability to specific biases and errors in hypothesis testing. These design features are often explicitly or implicitly described in levels-of-evidence hierarchies, although much more work could be done to label, describe, and explain the effect of each feature on vulnerability to bias. For researchers, the levels-of-evidence hierarchy can be viewed as aspirational or forward-looking: the researcher can aim to include as many ideal design features as the question and available resources permit. Levels-of-evidence pyramids can also guide decisions on funding and publication when resources or reviewer capacity are limited. Implicit in this decision-making, however, must be the understanding that the levels of evidence are guidelines, not laws. As mentioned above, observational research can be more methodologically sound than experimental designs, and for some research questions, very few of the features mentioned above will be possible even though the study answers an important question (for example, retrospective studies that use existing patient data).
Readers and Reviewers (Users)
Clinicians, educators, and those conducting systematic reviews do not determine what evidence exists. Most fitness, human performance, and physical rehabilitation research includes a mixture of small RCTs, prospective observational studies, retrospective studies of existing data, and quasi-experimental designs. For these users of research, the central task should be integration, not design selection. As mentioned above, if research reflects what is occurring in the natural world, then studies that investigate the same intervention in similar populations, using different but appropriate designs, should, on average, converge on compatible answers. The goal of readers should be to develop a conclusion that accounts for as much of the data as possible, with the nuance and detail necessary to accurately reflect the available evidence. A rigid design-based pyramid is dangerous in this context because it encourages discarding “lower-level” research that, in fitness, human performance, and physical rehabilitation, is critical to developing accurate and practical recommendations. This may include data on prevalence, long-term effects, rare adverse events, and effectiveness in real-world settings, all of which are likely to be derived from observational or quasi-experimental designs. The point is that users of research must develop the best possible picture from all relevant, credible data. Users do not get to dictate to the data what they want to see; ignoring large parts of the literature because the study labels do not sit at the top of a generic pyramid is itself a form of bias.
The Root of Misuse
Too many educators, texts, and courses present levels-of-evidence hierarchies as tools for users rather than guidance for creators, without mentioning the pitfalls of dictating to existing data what one expects to see. Particularly egregious is when students are shown a pyramid that places systematic reviews at the top and case reports at the bottom and are then told to “start at the top of the pyramid” when searching for evidence. The implication is that anything below a chosen level can be safely ignored. This turns design ideals into interpretation rules. It suggests that because blinded RCTs are a worthy goal for new efficacy studies, only RCTs and their syntheses deserve serious attention when making clinical decisions. It downplays the value of observational, quasi-experimental, and practice-based data, even in situations where RCTs are infeasible, unethical, or unrepresentative of real-world care. In conclusion, the conflation of creation and interpretation has turned a pragmatic heuristic into a dogmatic law that reduces accuracy in practice.
A More Defensible “Levels of Evidence” Framework
Comparisons of experimental (often labeled “level 2”) and observational (often labeled “level 3”) studies do not show consistent differences in the accuracy or consistency of treatment-effect estimates when they address the same question in similar populations. As summarized earlier, Benson et al., Concato et al., Anglemyer et al., and Toews et al. (15–18) report that well-conducted cohort and case-control studies tend to yield treatment-effect estimates similar to those from randomized trials on the same topic. This implies that the usual dividing line between experimental and observational studies does not correspond to a difference in reliability or accuracy.
A more defensible distinction might be between peer-reviewed analytic studies with a defined comparison group (whether randomized or observational) and designs that lack systematic comparison and analysis. The experimental and observational studies placed in most hierarchies share key characteristics: defined eligibility criteria, explicit definitions of exposure or intervention, prespecified outcomes, and the presence of a comparison group or reference condition. In contrast, case reports and case series lack a concurrent comparison group and often have flexible or implicit entry criteria and outcome definitions. They can detect signals, generate hypotheses, and document rare events, but they are not designed to estimate effect sizes with internal control.
Clinical practice data occupy an intermediate position. Structured, routinely collected outcomes, such as registry data, electronic health record extracts with predefined fields, or standardized follow-up measures, can provide large samples, long-term follow-up, and real-world context, even when they are not embedded in randomized designs. When definitions and measurement protocols are explicit, these data can support robust observational analyses and inform future experimental design. In contrast, unsystematic clinical impressions, undocumented outcomes, and informal “what I see in the clinic” narratives lack the transparency and reproducibility required to constitute data; they are closer to expert opinion than to evidence.
Expert opinion, when informed by a careful review of the literature and explicit reasoning, can help interpret data, prioritize hypotheses, and guide practice in areas where research is sparse or absent. However, it should be given less weight than systematically produced or collected data, and it should not override consistent research findings without strong justification. Non-expert opinion, including marketing claims, casual anecdotes, and statements made without engagement with the evidence, does not meet any reasonable threshold for evidence in an established scientific field and should not be included in an evidentiary hierarchy at all.
Systematic reviews and meta-analyses are a separate category. They do not generate original data; they reorganize and summarize results from the types of studies described above. When conducted carefully, they can provide an efficient means of forming an initial impression of a body of evidence, especially for readers who are not topic experts. At the same time, every review step (search strategy, inclusion and exclusion criteria, risk-of-bias judgments, data extraction, and analytic choices) adds another layer in which error and bias can enter. Reviews should therefore be treated as secondary analyses with their own methods and limitations, not as a higher “level” of evidence than the studies they synthesize.
New "Levels of Evidence" (for Users)
If these distinctions are taken seriously, very little of the traditional design pyramid remains defensible. The only “levels” that survive are broad and conditional:
- More Research: All else equal, multiple independent, well-controlled studies that trend in the same direction provide more reliable estimates than a single study.
- Original Research: Peer-reviewed original research generally deserves more weight than unsystematic clinical anecdotes.
- Systematic Review: Well-conducted systematic reviews and meta-analyses can provide an efficient overview of a topic, but they are secondary analyses that introduce the potential for additional bias and error. Furthermore, as discussed above, the production of redundant and incongruent systematic reviews should warrant skepticism regarding their reliability.
- Structured Data: Structured, routinely collected outcome data are more informative than unstructured impressions or clinical notes not recorded systematically.
- Research-Supported Expert Opinion: Expert opinion constrained by a careful reading of the literature is more credible than lay opinion, but it remains subordinate to good data when the two conflict.
Each of these statements depends on the qualifiers “all else equal” and “when methods are sound.” None of them implies that a randomized label, a blinded design, or a systematic review badge automatically confers superiority. The goal is not to abolish all ordering, but to restrict it to distinctions that can be justified and measured.
Controls and Risk of Bias: How Vulnerable Is the Evidence?
Within peer-reviewed original research, the primary question should not be “What is the design label?” but “Which sources of bias were controlled, how much information is available, and how closely does it match the question at hand?” Design and analysis features matter more than generic categories. Key controls and the problems they address include:
- Comparison or control group: Helps distinguish intervention effects from natural history, regression to the mean, and secular trends.
- Randomization of intervention/exposure: Reduces confounding by balancing both known and unknown potential confounding factors across groups.
- Allocation concealment: Prevents foreknowledge of upcoming assignments and reduces selection bias at enrollment.
- Blinding of participants and treating personnel (when feasible): Reduces performance bias, expectation effects, and behavior changes driven by knowledge of group assignment.
- Blinding of outcome assessors: Reduces detection and measurement bias, especially for subjective or judgment-based outcomes.
- Use of objective outcomes and validated measurement tools: Decreases measurement error and misclassification and limits the influence of expectations on outcome scoring.
- Pre-registration and prespecified primary outcomes: Limits outcome switching, data dredging, and selective reporting driven by the results.
- Adequate sample size and appropriate power: Reduces random error and instability of effect estimates, making results less sensitive to chance imbalances.
- Appropriate handling of missing data and loss to follow-up: Reduces attrition bias and preserves comparability between groups (e.g., through intention-to-treat analysis).
- Explicit control for confounding in observational studies (design and analysis): Addresses systematic differences between exposed and unexposed groups through matching, restriction, stratification, or multivariable adjustment.
- Sufficient scale and replication across independent studies: Allows more precise estimates and evaluation of consistency in effect direction and approximate magnitude, provided that methods are not systematically biased in the same direction.
- Directness and applicability of population, intervention, comparator, and outcomes: Reduces threats to external validity by ensuring that the evidence addresses the actual clinical or practical question, rather than a narrow surrogate or atypical setting.
Meta-epidemiological studies show that failures in these controls, not merely the presence or absence of a “randomized trial” label, are associated with larger and more heterogeneous effect estimates, even within RCTs (10-14). In practice, a well-conducted cohort study with clear comparators, robust control of confounding, and objective outcomes may be more trustworthy than a small, unblinded RCT with poor allocation concealment and subjective endpoints. Design labels cannot substitute for an explicit assessment of which biases were controlled and which remain.
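As a purely illustrative sketch rather than a validated appraisal instrument, the short Python snippet below shows how a user of research might record which of the controls listed above a study actually reports, so that appraisal is driven by explicit bias controls rather than by the design label; all field and function names here are our own invention.

```python
from dataclasses import dataclass, fields

@dataclass
class BiasControlChecklist:
    """Which bias controls a study reports (illustrative sketch, not a validated tool)."""
    comparison_group: bool = False          # concurrent comparator or reference condition
    randomized_allocation: bool = False     # random assignment of intervention/exposure
    allocation_concealment: bool = False    # assignments hidden until enrollment
    participant_blinding: bool = False      # participants/treating personnel blinded
    assessor_blinding: bool = False         # outcome assessors blinded
    objective_outcomes: bool = False        # objective or validated outcome measures
    preregistered: bool = False             # protocol and primary outcomes prespecified
    adequate_sample: bool = False           # powered for the primary outcome
    missing_data_handled: bool = False      # e.g., intention-to-treat, sensitivity analyses
    confounding_controlled: bool = False    # matching, stratification, or adjustment
    direct_population_and_outcomes: bool = False  # matches the actual clinical question

    def controls_present(self) -> list[str]:
        """List the controls the study actually reports."""
        return [f.name for f in fields(self) if getattr(self, f.name)]

# A hypothetical well-conducted cohort study: no randomization or blinding,
# but strong confounding control, objective outcomes, and direct applicability.
cohort = BiasControlChecklist(
    comparison_group=True, objective_outcomes=True, preregistered=True,
    adequate_sample=True, missing_data_handled=True,
    confounding_controlled=True, direct_population_and_outcomes=True,
)
print(cohort.controls_present())
```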
In summary, accuracy is a function of both control and the scale of information. Larger samples, more events, and multiple independent studies generally support more precise and robust conclusions, as long as they do not share the same systematic bias. At the same time, a tightly controlled RCT conducted in a narrowly selected population, with constrained protocols and surrogate outcomes, may be internally valid yet poorly representative of real-world practice. In rehabilitation, human performance, and disability, external validity is often where RCTs fall short and observational or registry data excel, because practice-based cohorts and quasi-experimental designs capture the actual combinations of interventions, comorbidities, and environments that clinicians face daily. An “ideal” trial that answers a different question is not superior evidence.
Evaluating Systematic Reviews
Systematic reviews and meta-analyses require their own set of criteria. Once we accept that they are not inherently “higher-level” evidence, the relevant questions become: How complete was the search? How transparent and appropriate were the inclusion and exclusion criteria? Was the risk of bias assessed and incorporated into the synthesis? How was heterogeneity handled? Were small-study effects, publication bias, and selective reporting considered? For meta-analyses, model choice (e.g., fixed vs. random effects), assumptions about between-study variance, and the conduct of sensitivity analyses all influence the estimates. A meta-analysis that pools incompatible studies, ignores a high risk of bias, or omits critical sensitivity analyses may degrade the signal rather than clarify it. The label “systematic review” should prompt scrutiny of methods, not automatic promotion to the top of a pyramid.
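To make the influence of model choice concrete, the minimal Python sketch below pools hypothetical log odds ratios with an inverse-variance fixed-effect model and a DerSimonian-Laird random-effects model. It is only an illustration of why assumptions about between-study variance change the pooled estimate; it is not a substitute for established meta-analysis software, and the input data are invented.

```python
import numpy as np

def pool_fixed(y, v):
    """Inverse-variance fixed-effect pooled estimate and its variance."""
    w = 1.0 / v
    est = np.sum(w * y) / np.sum(w)
    return est, 1.0 / np.sum(w)

def pool_random(y, v):
    """DerSimonian-Laird random-effects pooled estimate, variance, and tau^2."""
    w = 1.0 / v
    fixed, _ = pool_fixed(y, v)
    q = np.sum(w * (y - fixed) ** 2)              # Cochran's Q
    df = len(y) - 1
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - df) / c)                 # between-study variance estimate
    w_star = 1.0 / (v + tau2)
    est = np.sum(w_star * y) / np.sum(w_star)
    return est, 1.0 / np.sum(w_star), tau2

# Hypothetical log odds ratios and within-study variances for five trials.
y = np.array([-0.40, -0.10, -0.55, 0.05, -0.30])
v = np.array([0.04, 0.02, 0.09, 0.03, 0.05])

fe, fe_var = pool_fixed(y, v)
re, re_var, tau2 = pool_random(y, v)
print(f"Fixed effect:   OR = {np.exp(fe):.2f} (log-OR {fe:.2f})")
print(f"Random effects: OR = {np.exp(re):.2f} (log-OR {re:.2f}), tau^2 = {tau2:.3f}")
# With heterogeneous trials, the random-effects model gives smaller studies more
# relative weight and usually wider uncertainty; which model is appropriate
# depends on assumptions the review must state and defend.
```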
The idea of living systematic reviews offers one practical way to improve synthesis. Rather than producing a single publication that captures only a portion of the research available at the time of publication, living reviews maintain a continuously updated overview of a topic that can be improved iteratively. When implemented well, they can reduce redundancy, prevent the proliferation of conflicting reviews on the same question, and provide a clearer link between new primary studies and evolving conclusions. In practice, a living systematic review could also enable continuous editing and improvement by the authors, crowdsourcing of recommendations, and potentially additional analyses by specialists interested in the same topic.
Even so, living reviews remain tools. Their value depends on how comprehensively they search, how they sort and appraise studies across designs, and how honestly they represent uncertainty. They do not replace the need to understand the underlying data; they provide an organized interface for it.
Final Thoughts: Comparative Research
Finally, when the purpose of using research is to guide practical decisions, such as selecting among interventions in a constrained session or program, the most valuable evidence is comparative. Head-to-head trials and comparative observational studies that pit one plausible intervention against another yield direct evidence about relative efficacy. They inform questions such as “Which of these options, for this population, is more likely to produce better outcomes per unit of time, cost, or burden?” in a way that single-arm trials and mechanism studies cannot. In this decision-theoretic context, a well-conducted comparative cohort study can be more informative for practice than a series of placebo-controlled RCTs that never directly compare the two (or more) interventions most likely to produce the best outcomes. Any implied hierarchy of evidence for users of research seeking to optimize recommendations for clients and patients should emphasize the importance of comparative research in refining practice.
Bibliography:
- Simmons University Library. (2025). Nursing – Evidence-Based Practice: Levels of Evidence. Simmons University. Retrieved December 3, 2025, from https://simmons.libguides.com/c.php?g=1033284&p=7490072
- Concordia University Wisconsin Library. (n.d.). Social Work Guide: Evidence-based practice – Types of studies. Concordia University Wisconsin. Retrieved December 3, 2025 - https://cuw.libguides.com/social_work_guide/ebp
- OCEBM Levels of Evidence Working Group. (2012). Oxford Centre for Evidence-Based Medicine 2011 Levels of Evidence. Oxford Centre for Evidence-Based Medicine.
- Hill, N., Frappier-Davignon, L., & Morrison, B. (1979). The periodic health examination. Can Med Assoc J, 121, 1193-1254.
- Sackett, D. L., Rosenberg, W. M., Gray, J. M., Haynes, R. B., & Richardson, W. S. (1996). Evidence based medicine: what it is and what it isn't. BMJ, 312(7023), 71-72.
- GRADE Working Group. (2024). GRADE home. Retrieved December 3, 2025, from https://www.gradeworkinggroup.org
- DiCenso, A., Bayley, L., & Haynes, R. B. (2009). Accessing pre-appraised evidence: fine-tuning the 5S model into a 6S model. Evidence-based nursing, 12(4), 99-101.
- Murad, M. H., Asi, N., Alsawas, M., & Alahdab, F. (2016). New evidence pyramid. Evidence-Based Medicine, 21(4), 125–127. https://doi.org/10.1136/ebmed-2016-110401
- Blunt, C. J. (2015). Hierarchies of evidence in evidence-based medicine (Doctoral dissertation). London School of Economics and Political Science.
- Schulz, K. F., Chalmers, I., Hayes, R. J., & Altman, D. G. (1995). Empirical evidence of bias: dimensions of methodological quality associated with estimates of treatment effects in controlled trials. JAMA, 273(5), 408-412.
- Wood, L., Egger, M., Gluud, L. L., Schulz, K. F., Jüni, P., Altman, D. G., ... & Sterne, J. A. (2008). Empirical evidence of bias in treatment effect estimates in controlled trials with different interventions and outcomes: meta-epidemiological study. BMJ, 336(7644), 601-605.
- Savović, J., Jones, H. E., Altman, D. G., Harris, R. J., Jüni, P., Pildal, J., ... & Sterne, J. A. C. (2012). Influence of reported study design characteristics on intervention effect estimates from randomized, controlled trials: Combined analysis of meta-epidemiological studies. Annals of Internal Medicine, 157(6), 429–438.
- Page, M. J., Higgins, J. P., Clayton, G., Sterne, J. A., Hróbjartsson, A., & Savović, J. (2016). Empirical evidence of study design biases in randomized trials: systematic review of meta-epidemiological studies. PLoS ONE, 11(7), e0159267.
- Savović, J., Turner, R. M., Mawdsley, D., Jones, H. E., Beynon, R., Higgins, J. P. T., & Sterne, J. A. C. (2018). Association between risk-of-bias assessments and results of randomized trials in Cochrane reviews: The ROBES meta-epidemiologic study. American Journal of Epidemiology, 187(5), 1113–1122.
- Benson, K., & Hartz, A. J. (2000). A comparison of observational studies and randomized, controlled trials. New England Journal of Medicine, 342(25), 1878–1886.
- Concato, J., Shah, N., & Horwitz, R. I. (2000). Randomized, controlled trials, observational studies, and the hierarchy of research designs. New England Journal of Medicine, 342(25), 1887–1892.
- Anglemyer, A., Horvath, H. T., & Bero, L. (2014). Healthcare outcomes assessed with observational study designs compared with those assessed in randomized trials. Cochrane Database of Systematic Reviews, (4).
- Toews, I., Anglemyer, A., Nyirenda, J. L., Alsaid, D., Balduzzi, S., Grummich, K., ... & Bero, L. (2024). Healthcare outcomes assessed with observational study designs compared with those assessed in randomized trials: a meta-epidemiological study. Cochrane Database of Systematic Reviews, (1).
- Rosner, A. L. (2012). Evidence-based medicine: revisiting the pyramid of priorities. Journal of Bodywork and Movement Therapies, 16(1), 42-49.
- Page, M. J., Shamseer, L., Altman, D. G., Tetzlaff, J., Sampson, M., Tricco, A. C., ... & Moher, D. (2016). Epidemiology and reporting characteristics of systematic reviews of biomedical research: a cross-sectional study. PLoS Medicine, 13(5), e1002028.
- Ou, S., Luo, J., & Jiang, Q. (2025). Overlapping Systematic Reviews on the Same Topic: A Systematic Literature Review of Quantitative Research. Journal of Evaluation in Clinical Practice, 31(4), e70148.
- Ioannidis, J. P. (2016). The mass production of redundant, misleading, and conflicted systematic reviews and meta‐analyses. The Milbank Quarterly, 94(3), 485-514.
- Jadad, A. R., Cook, D. J., & Browman, G. P. (1997). A guide to interpreting discordant systematic reviews. CMAJ, 156(10), 1411-1416.
- Lucenteforte, E., Moja, L., Pecoraro, V., Conti, A. A., Conti, A., Crudeli, E., ... & Virgili, G. (2015). Discordances originated by multiple meta-analyses on interventions for myocardial infarction: a systematic review. Journal of Clinical Epidemiology, 68(3), 246-256.
- Elliott, J. H., Synnot, A., Turner, T., Simmonds, M., Akl, E. A., McDonald, S., ... & Pearson, L. (2017). Living systematic review: 1. Introduction—the why, what, when, and how. Journal of Clinical Epidemiology, 91, 23-30.
- Bosdriesz, J. R., Stel, V. S., van Diepen, M., Meuleman, Y., Dekker, F. W., Zoccali, C., & Jager, K. J. (2020). Evidence‐based medicine—when observational studies are better than randomized controlled trials. Nephrology, 25(10), 737-743.
- Dekkers, O. M., Vandenbroucke, J. P., Cevallos, M., Renehan, A. G., Altman, D. G., & Egger, M. (2019). COSMOS-E: guidance on conducting systematic reviews and meta-analyses of observational studies of etiology. PLoS Medicine, 16(2), e1002742.
- Vandenbroucke, J. P. (2004). When are observational studies as credible as randomised trials? The Lancet, 363(9422), 1728-1731.
- Johnston, M. V., & Dijkers, M. P. (2012). Toward improved evidence standards and methods for rehabilitation: recommendations and challenges. Archives of Physical Medicine and Rehabilitation, 93(8), S185-S199.
- Marko, N. F., & Weil, R. J. (2010). The role of observational investigations in comparative effectiveness research. Value in Health, 13(8), 989-997.



