Posted on April 27, 2018
This is the twenty-eighth blog in a series of 36 blogs based on a list of ‘Key Concepts’. Each blog will explain one Key Concept that we need to understand to be able to assess treatment claims.
Surely you have come across the innocent-looking ‘p’ while reviewing literature or perhaps conducting your own research. In my experience, most people are familiar with p-values but few can explain what they mean. Because of this confusion, as well as rampant misuse, ‘p’ has become controversial and has even been banned from some scientific journals entirely. In this post, we will discuss why using p-values to report results can be problematic and explore alternatives better suited to convey information about confidence in our study findings.
When we describe our research findings, the most important number we report is our point estimate (e.g. a difference in risk between two groups). This value is our ‘best guess’ of the true difference. However, because we conduct studies using a sample of the larger population of interest, our point estimate for our sample inevitably differs from the unknowable true difference between the groups. That is, the true difference may be larger or smaller than our estimate. This can be due to bias (discussed elsewhere), random error based on our sample selection (‘chance’), or a combination of both.
In research, we use statistical tests to assess how compatible the difference we observed is with chance alone. Typically, we calculate a p-value. ‘P’ stands for probability and refers to the probability of observing a difference as large as, or more extreme than, the one we observed in our study, assuming that there is in fact no true difference (i.e. assuming the null hypothesis is true).
In our prevention trial, we test the difference in risk for developing disease X between two groups (one intervention and one placebo group) in a randomized controlled trial and find the risk in the intervention group is lower than the risk in the placebo group. Our statistical test of the difference in risk yields a small p-value of p=0.001.
This means it is quite unlikely that a difference this large would have been observed if there were no true difference between the comparison groups (the null hypothesis). In other words, we can be quite confident that this difference in risk is real and that our treatment lowers the risk for disease X. But in truth, we will never know for sure. Even with a small p-value, there remains a possibility that we incorrectly reject the null hypothesis when it is actually true (a ‘false positive’). This is what we call a type I error.
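To make the calculation concrete, here is a minimal sketch of one common way to obtain such a p-value: a two-sided z-test for a difference between two proportions. The event counts are hypothetical (they are not data from any real trial), the function name is my own, and the normal approximation is an assumption that only suits reasonably large groups.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(events_a, n_a, events_b, n_b):
    """Two-sided p-value for a difference in risk (pooled z-test, normal approximation)."""
    risk_a = events_a / n_a
    risk_b = events_b / n_b
    # Under the null hypothesis both groups share one underlying risk, so pool them.
    pooled_risk = (events_a + events_b) / (n_a + n_b)
    standard_error = sqrt(pooled_risk * (1 - pooled_risk) * (1 / n_a + 1 / n_b))
    z = (risk_a - risk_b) / standard_error
    # Probability of a difference at least this extreme, in either direction.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return risk_a - risk_b, z, p_value


# Hypothetical counts: 30 of 1000 develop disease X with the intervention, 60 of 1000 with placebo.
risk_difference, z, p_value = two_proportion_p_value(30, 1000, 60, 1000)
print(f"risk difference = {risk_difference:.3f}, z = {z:.2f}, p = {p_value:.4f}")
```

With these made-up counts the p-value comes out at roughly 0.001, in the same range as the example trial. Note that the function also returns the risk difference itself, which the p-value alone does not reveal.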
How much type I error is acceptable? This is where the concept of statistical hypothesis testing comes into play. In a hypothesis test, the p-value is tested against a pre-specified cut-off (‘significance level’) that specifies how much type I error we are willing to tolerate (often 0.05, or 5%). The concept is quite simple: if p<0.05, the results are deemed “statistically significant”. If not, then the results are “not statistically significant”. But why exactly 0.05? This threshold is entirely arbitrary. As Rosendaal puts it, “There is no logic to it. There is no mathematics or biology that supports a cut-off value of 5%.” Unfortunately, in modern-day research, there is a great deal of pressure to obtain “statistically significant” results of hypothesis tests based on this arbitrary cut-off.
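A small simulation can show what the significance level means in practice. The sketch below repeatedly simulates trials in which there is truly no difference between groups and counts how often a test still comes out “significant” at 0.05. Everything here is an illustrative assumption: the outcome is a normally distributed measurement with a known standard deviation of 1, the group size and number of trials are arbitrary, and the function name is hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def simulate_false_positive_rate(n_trials=5000, n_per_group=30, alpha=0.05, seed=1):
    """Share of simulated null trials (no true difference) that reach p < alpha."""
    rng = random.Random(seed)
    standard_normal = NormalDist()
    # Standard error of the difference in means when each group has a known SD of 1.
    standard_error = sqrt(2 / n_per_group)
    false_positives = 0
    for _ in range(n_trials):
        # Both groups are drawn from the same distribution: the null hypothesis is true.
        group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        z = (fmean(group_a) - fmean(group_b)) / standard_error
        p_value = 2 * (1 - standard_normal.cdf(abs(z)))
        if p_value < alpha:
            false_positives += 1
    return false_positives / n_trials


false_positive_rate = simulate_false_positive_rate()
print(f"Share of null trials declared significant: {false_positive_rate:.3f}")
```

The share should land close to the chosen cut-off: the significance level is simply the false positive rate we agree to accept when the null hypothesis is true.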
Notice that up until this point, nothing has been said about the actual point estimate/effect size for our example trial! This is exactly the problem with p-values and significance testing; we have put emphasis on the fact that we are relatively certain our result is not due to chance alone, but we actually have no idea if the result is in any way useful or clinically relevant!
A “statistically significant” result does not necessarily indicate an important result. Even a trivially small effect (with no clinical relevance) may be deemed “significant” by virtue of a small p-value. This is not uncommon in large trials or in trials testing many hypotheses (on average, 1 in 20 tests of a true null hypothesis will be significant just by chance at the 0.05 significance level).
At the other end of the spectrum, it is also possible to have a large point estimate of an effect with a non-significant p-value (e.g. p=0.10). This is especially true with small sample sizes or with a large study testing small stratified subgroups. Unfortunately, non-significant p-values are often confused with “no effect” and potentially meaningful results of underpowered studies are simply discarded.
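The two situations described above can be reproduced with a few lines of arithmetic. This is a sketch using hypothetical counts and a pooled two-proportion z-test (a normal approximation that is only rough for the small trial); the function name is my own.

```python
from math import sqrt
from statistics import NormalDist


def risk_difference_test(events_a, n_a, events_b, n_b):
    """Return (risk difference, two-sided p-value) from a pooled two-proportion z-test."""
    risk_a = events_a / n_a
    risk_b = events_b / n_b
    pooled_risk = (events_a + events_b) / (n_a + n_b)
    standard_error = sqrt(pooled_risk * (1 - pooled_risk) * (1 / n_a + 1 / n_b))
    z = (risk_a - risk_b) / standard_error
    return risk_a - risk_b, 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical huge trial: risk of 10.0% versus 10.2%, one million people per group.
large_trial_diff, large_trial_p = risk_difference_test(100_000, 1_000_000, 102_000, 1_000_000)

# Hypothetical small trial: risk of 20% versus 45%, twenty people per group.
small_trial_diff, small_trial_p = risk_difference_test(4, 20, 9, 20)

print(f"large trial: risk difference = {large_trial_diff:+.3f}, p = {large_trial_p:.2g}")
print(f"small trial: risk difference = {small_trial_diff:+.3f}, p = {small_trial_p:.2g}")
```

The huge trial yields a very small p-value for a difference of 0.2 percentage points, while the small trial yields a non-significant p-value for a difference of 25 percentage points. Judged by p-values alone, we would draw exactly the wrong conclusion about which finding might matter.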
By presenting our results with only p-values and/or making a statement about “statistical significance”, we are omitting the most important information: our point estimate. By now you are surely thinking, wouldn’t it be nice to have an alternative method to report uncertainty in the context of the actual effect size and direction? Fortunately, we have another option!
If we want to convey the uncertainty about our point estimate, we are much better served using a confidence interval (CI). A CI is a range of values around our point estimate within which values of repeated similar experiments are likely to lie. For differences (such as a risk difference) the interval is usually symmetrical, with our point estimate at its center; for ratios (such as an odds ratio) it is typically symmetrical only on the logarithmic scale. The width of a CI represents the margin of error and is calculated using the spread of our data, the sample size, and a sampling distribution, the same ingredients used to calculate p-values. The important distinction is that the CI provides more context than a p-value because it includes the direction of the effect (e.g. whether a treatment increases or decreases risk of death) and is reported in the same units as the point estimate, while also indicating the uncertainty in our estimation.
The level (90%, 95%, 99%, etc.) of confidence chosen for the CI is entirely arbitrary. 95% is conventionally used in medical research since this number corresponds to our familiar significance level of 0.05. What does this percentage mean? A common misinterpretation is that there is a 95% probability that the true value lies within the particular interval we calculated. Instead, a 95% CI means that if you performed the same experiment over and over with different samples of the population of interest, and calculated a CI each time, about 95% of those intervals would contain the true value (assuming all assumptions needed to correctly compute the CI hold).
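This repeated-experiment reading of a 95% CI can be checked by simulation. In the sketch below we pretend to know the truth (a population mean of 10 with a standard deviation of 2, both invented for illustration), run many experiments, and count how often the interval from each experiment contains the true mean. Treating the standard deviation as known is a simplifying assumption, and the function name is hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def simulate_coverage(true_mean=10.0, sd=2.0, n=25, n_experiments=4000, level=0.95, seed=7):
    """Share of simulated experiments whose CI contains the true mean."""
    rng = random.Random(seed)
    # Critical value for the chosen confidence level (about 1.96 for 95%).
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    margin = z_crit * sd / sqrt(n)  # SD treated as known, for simplicity
    covered = 0
    for _ in range(n_experiments):
        sample_mean = fmean(rng.gauss(true_mean, sd) for _ in range(n))
        if sample_mean - margin <= true_mean <= sample_mean + margin:
            covered += 1
    return covered / n_experiments


coverage = simulate_coverage()
print(f"Share of intervals containing the true mean: {coverage:.3f}")
```

The share should come out close to 0.95. Any single interval either contains the true value or it does not; the 95% describes how the procedure performs across many repetitions.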
The width of the CI indicates the precision of our point estimate. For example, a point estimate of a difference of 5.5 units may have a 95% CI of 3.5 to 7.5 (a width of 4 units). A narrower interval spanning two units (e.g. 95% CI, 4.5 to 6.5) indicates a more precise estimate of the same effect size.
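One practical driver of CI width is sample size. As a rough sketch, the code below computes a Wald-type 95% CI for a risk difference in two hypothetical trials with identical risks (20% versus 30%) but different group sizes. The counts and the function name are made up for illustration, and the Wald interval is just one of several available methods.

```python
from math import sqrt
from statistics import NormalDist


def risk_difference_ci(events_a, n_a, events_b, n_b, level=0.95):
    """Wald confidence interval for a risk difference: (estimate, lower, upper)."""
    risk_a = events_a / n_a
    risk_b = events_b / n_b
    standard_error = sqrt(risk_a * (1 - risk_a) / n_a + risk_b * (1 - risk_b) / n_b)
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    estimate = risk_a - risk_b
    return estimate, estimate - z_crit * standard_error, estimate + z_crit * standard_error


# Same risks in both hypothetical trials; only the group size differs.
small_estimate, small_lower, small_upper = risk_difference_ci(20, 100, 30, 100)
large_estimate, large_lower, large_upper = risk_difference_ci(80, 400, 120, 400)

small_width = small_upper - small_lower
large_width = large_upper - large_lower
print(f"100 per group: {small_estimate:+.2f} (95% CI {small_lower:+.3f} to {small_upper:+.3f})")
print(f"400 per group: {large_estimate:+.2f} (95% CI {large_lower:+.3f} to {large_upper:+.3f})")
print(f"width ratio: {small_width / large_width:.2f}")
```

The point estimate is the same in both trials, but quadrupling the sample size halves the width of the interval, because the standard error shrinks with the square root of the sample size.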
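For example, let’s suppose a particular treatment reduced the odds of death compared to placebo with an odds ratio of 0.5 and a 95% CI of 0.2 to 0.8. This means that, in our sample, the treatment reduced the odds of death by 50% compared to placebo, and that our data are compatible with a true reduction in odds of anywhere between 20% (odds ratio of 0.8) and 80% (odds ratio of 0.2). Note that an odds ratio approximates a ratio of risks only when the outcome is rare.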
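For readers curious where such an interval comes from, here is a sketch of one standard approach: computing the CI on the log odds scale (often called the Woolf method) and transforming back. The 2x2 counts are hypothetical and chosen only to give an odds ratio close to 0.5; the function name is my own.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def odds_ratio_ci(deaths_treated, survivors_treated, deaths_placebo, survivors_placebo, level=0.95):
    """Odds ratio with a CI computed on the log scale: (odds ratio, lower, upper)."""
    odds_ratio = (deaths_treated * survivors_placebo) / (survivors_treated * deaths_placebo)
    # Standard error of the log odds ratio from the four cell counts.
    se_log_or = sqrt(
        1 / deaths_treated + 1 / survivors_treated + 1 / deaths_placebo + 1 / survivors_placebo
    )
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    lower = exp(log(odds_ratio) - z_crit * se_log_or)
    upper = exp(log(odds_ratio) + z_crit * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical trial: 30 of 200 die with treatment, 52 of 200 die with placebo.
odds_ratio, lower, upper = odds_ratio_ci(30, 170, 52, 148)
print(f"odds ratio = {odds_ratio:.2f} (95% CI {lower:.2f} to {upper:.2f})")
print(f"reduction in odds: {1 - odds_ratio:.0%} (from {1 - upper:.0%} to {1 - lower:.0%})")
```

With these counts the interval runs from roughly 0.30 to 0.83. Because it is built on the log scale, it is not symmetrical around the odds ratio itself: the upper limit lies further from the estimate than the lower limit.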
It is important to note that a confidence interval is not a uniform distribution of probability: the values closest to the point estimate are more compatible with our data, and in that sense more likely to be true, than the values at the outer ends of the interval.
For those who insist on statistical hypothesis testing, confidence intervals even provide you with that information. If your CI does not contain the null hypothesis value (e.g. 0 for a risk difference, 1 for a relative risk), then your result is “statistically significant” (at the significance level corresponding to the CI, e.g. 0.05 for a 95% CI). If the null hypothesis value does lie within the interval, the result is “not statistically significant”, but it is important to remember that this dichotomous thinking can be problematic for the reasons mentioned earlier.
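The link between a CI and a hypothesis test can be sketched in a few lines. Given an estimate and its standard error (both hypothetical here), the code computes a 95% CI and a two-sided p-value from the same normal approximation and checks that “CI excludes the null value” and “p < 0.05” agree. The function name and the two standard errors are assumptions for illustration.

```python
from statistics import NormalDist


def ci_and_p_value(estimate, standard_error, null_value=0.0, level=0.95):
    """Return (lower, upper, p) for an estimate, using a normal approximation."""
    normal = NormalDist()
    z_crit = normal.inv_cdf(0.5 + level / 2)
    lower = estimate - z_crit * standard_error
    upper = estimate + z_crit * standard_error
    z = (estimate - null_value) / standard_error
    p_value = 2 * (1 - normal.cdf(abs(z)))
    return lower, upper, p_value


# Same hypothetical point estimate (a difference of 5.5), estimated precisely and imprecisely.
precise = ci_and_p_value(5.5, standard_error=1.0)
imprecise = ci_and_p_value(5.5, standard_error=3.5)

for label, (lower, upper, p_value) in {"precise": precise, "imprecise": imprecise}.items():
    excludes_null = not (lower <= 0.0 <= upper)
    print(f"{label}: 95% CI {lower:.2f} to {upper:.2f}, p = {p_value:.3g}, "
          f"CI excludes 0: {excludes_null}, p < 0.05: {p_value < 0.05}")
```

Both routes give the same verdict on “significance”, but only the interval shows that the two studies found the same effect size and differ in precision alone.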
A great additional resource you can look at is an animated slide presentation, prepared by Steven Woloshin, which shows how the Cochrane logo was developed, and what it tells us.
In summary, p-values can be very misleading, especially when they are presented in the context of statistical hypothesis testing without corresponding point estimates and confidence intervals. Their use detracts from potentially interesting results that do not meet the significance threshold due to factors such as few outcome events. Failing to publish results because they are “not statistically significant” (which is not the same as finding “no association”) leads to damaging publication bias.
Instead of relying on uninformative p-values, I encourage you to report results using point estimates and their more informative confidence intervals and to be skeptical of research findings and claims that do not provide this information.