# data wrangling, numerical computing, visualization
import pandas as pd
import numpy as np
import seaborn as sns
# statistical models
import statsmodels
import statsmodels.api as sm
# existing tools to check our implementation against
from statsmodels.stats.multitest import fdrcorrection
from statsmodels.stats.multitest import multipletests
# random number generation
from numpy.random import default_rng
In this post we explain and re-implement two multiple testing corrections, Bonferroni and Benjamini-Hochberg, for the sake of understanding how they work.
Preliminaries
We prepare the libraries we’ll need.
Simulating a motivating example
The motivation behind multiple testing correction is that if you test enough hypotheses, you will eventually find a result just by chance. The relevance of this concern increases with the number of tests - some experiments test dozens or even hundreds of hypotheses.
We will develop a motivating example to work with. Initially, we will simulate p values from comparisons of populations with the same parameters, so we know the null hypothesis of a t test is true: the populations are not different.
We can set this up nicely in Python using a function utilizing numpy and a list comprehension:
# seed the results for reproducibility
rng = default_rng(seed = 1)

# write a simulation function
def simulation(mu1, sd1, mu2, sd2, draws = 100, n = 1000):
    # generate the two populations
    pop1 = rng.normal(loc = mu1, scale = sd1, size = draws)
    pop2 = rng.normal(loc = mu2, scale = sd2, size = draws)

    # this returns three things, we only need the middle one
    tstat, pval, degf = sm.stats.ttest_ind(pop1, pop2)

    return pval
This function generates two populations and runs a t test comparing them. We can make the populations the same by passing the same mean and standard deviation. Now we can test it:
# set parameters for our populations
mu1 = 5
sd1 = 2.5

mu2 = 5
sd2 = 2.5

# get a single p value
simulation(mu1, sd1, mu2, sd2)
0.9993893617043748
Recall that we’ve “simulated under the null” - we’ve specified that the parameters are the same for both distributions. Thus, any positives are false positives. We can now repeat this a bunch of times to get a distribution of p values. We’re using an \(\alpha\) (alpha) of 0.05 (more on this later), so we expect around 5% of the tests to come back as false positives.
# set a seed
rng = default_rng(seed = 1)

# run a thousand simulations
pvals = [simulation(mu1, sd1, mu2, sd2) for i in range(0, 1001)]

# get the ones below alpha
sig_vals = [p for p in pvals if p <= 0.05]

# calculate the percent that are "significant"
sig_percent = (len(sig_vals)/len(pvals))*100

# percent significant hits
round(sig_percent, 4)
6.2937
We can see that we have a small but non-trivial number of hits flagged as significant. If we think about the definition of a p value, though, this makes sense.
p value: The probability of seeing evidence against the null that is this extreme or more, given the null is true [1] - and given our modeling assumptions are met.
If we’re operating at \(\alpha = 0.05\), we’re accepting that we will wrongly reject a true null about 5% of the time. In experiments where we’re testing a lot of hypotheses, though, this adds up to a lot of mistakes.
For example, consider a differential expression experiment involving 20,000 genes:
# 5 percent wrong in this case...
0.05 * 20000
1000.0
…we’re talking about being wrong a thousand times. This is why multiple testing correction exists. We will consider two types with one example each.
Controlling Family-wise Error Rate (FWER) with the Bonferroni method
Bonferroni is perhaps the most common method for controlling the family-wise error rate, which is the probability of making at least one false finding [2]. In this method we simply establish a new significance threshold by dividing \(\alpha\) by the number of tests.
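To get a feel for why this helps, here is a minimal sketch (my own aside, assuming the tests are independent, in which case the FWER is \(1 - (1 - \alpha)^m\)) showing how quickly the chance of at least one false positive grows with the number of tests \(m\), alongside the Bonferroni-adjusted threshold \(\alpha / m\):
# a minimal sketch, assuming independent tests: FWER = 1 - (1 - alpha)^m
alpha = 0.05
for m in [1, 10, 100, 20000]:
    fwer = 1 - (1 - alpha)**m
    bonferroni_alpha = alpha / m
    print(f"m = {m:>5}: FWER approx. {fwer:.3f}, Bonferroni threshold = {bonferroni_alpha:.2e}")
With even a hundred independent tests, the chance of at least one false positive is already close to 1, which is exactly what dividing \(\alpha\) by the number of tests is meant to rein in.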
# a function to apply the correction
def calc_bonferroni_threshold(vector, alpha = 0.05):
    # divide alpha by the number of tests, i.e., the p values in the vector
    return alpha/len(vector)

# compute a new threshold
bonferroni_threshold = calc_bonferroni_threshold(pvals)

# new threshold
bonferroni_threshold
4.995004995004995e-05
# see which ones are significant now
sig_adj_pvals = [p for p in pvals if p <= bonferroni_threshold]

# calculate the percent
sig_percent_adj = len(sig_adj_pvals)/len(pvals)

# inspect
round(sig_percent_adj, 4)
0.0
Let’s see if we also get 0 with the statsmodels version. Rather than returning a threshold, multipletests returns an array of True/False values telling us whether to reject each null hypothesis, along with the corrected p values (and a couple of other items). If our code agrees with it, we should see a vector with no True values in it.
# call the function
check = multipletests(pvals, method = "bonferroni", alpha = 0.05)

# the 0th index contains the booleans
check_trues = [i for i in check[0] if i == True]

len(check_trues)/len(pvals)
0.0
We’ve taken care of the false positives. As you may have intuited, this is a pretty strict correction (dividing by 20,000 in our other example), and it comes at the cost of statistical power (the ability of a test to find an effect when it’s really there). There is no free lunch, especially where statistical power is concerned. Sometimes a more permissive tradeoff can be made.
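To see the tradeoff concretely, here is a rough sketch (my own illustration, reusing the simulation function from above with a real difference in means) that compares how many p values fall below the nominal \(\alpha\) versus the Bonferroni-adjusted threshold; far fewer typically survive the stricter cut-off:
# a rough sketch of the power cost: simulate p values where the means really differ
pvals_alt = [simulation(5, 1, 5.5, 1) for i in range(0, 1000)]

# hits at the nominal alpha vs the Bonferroni-adjusted threshold
naive_hits = sum(p <= 0.05 for p in pvals_alt)
bonferroni_hits = sum(p <= 0.05/len(pvals_alt) for p in pvals_alt)
naive_hits, bonferroni_hits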
False Discovery Rate (FDR)
While Bonferroni is tried and true, it’s something of a blunt instrument, and there are some situations where another common method, Benjamini-Hochberg [3], often abbreviated “BH”, is preferable. Benjamini-Hochberg controls the false discovery rate [3], which is the expected proportion of the hypotheses we reject (our “discoveries”) that are actually false positives. This is useful in situations where the strictness of Bonferroni might be limiting (more on comparing the two later). The method involves:
- Ranking the p values from lowest to highest.
- Computing a critical value for each one.
- Comparing the p value to the critical value.
- The highest p value that is lower than its corresponding critical value becomes the new threshold - it and all p values smaller than it are considered significant.
The formula for the critical value is:
\[ c = \left(\frac{r}{n}\right)\alpha \]
where \(r\) is the rank of the p value and \(n\) is the total number of tests.
Note you will often see \(\alpha\) denoted \(q\) - I’m just trying to keep things simple by calling it what it is (not exactly a time-honored tradition in statistics).
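Before writing the full function, a quick worked example may help (a toy illustration of my own, using five made-up p values rather than our simulated ones):
# a toy illustration with five made-up p values (not from our simulation)
toy_pvals = sorted([0.001, 0.008, 0.039, 0.041, 0.52])
m = len(toy_pvals)
alpha = 0.05

# compute and compare the BH critical value for each rank
for rank, p in enumerate(toy_pvals, start = 1):
    critical = (rank/m)*alpha
    print(f"rank {rank}: p = {p:.3f}, critical value = {critical:.3f}, below: {p <= critical}")
Here the largest p value that falls below its critical value is 0.008 (rank 2, critical value 0.02), so 0.008 becomes the threshold and only the two smallest p values are called significant.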
This is relatively straightforward to implement in Python (there is a built-in function in R, as well as a function in statsmodels for Python, but we will re-implement it for the sake of learning and transparency). We define the function below:
# function to compute the new BH threshold
def calc_bh_threshold(vector, alpha = 0.05):
    # the number of hypotheses
    m = len(vector)

    # q is just the naming convention for alpha in FDR methods
    q = alpha

    # sort the pvalues
    sorted_pvals = sorted(vector)

    # collect a critical value for each p value
    critical_vals = []
    for i in range(0, len(sorted_pvals)):
        # the rank is the index + 1 as Python is zero-indexed
        rank = i + 1

        # the formula for the BH critical value
        critical_val = (rank/m)*q
        critical_vals.append(critical_val)

    # organize our results
    df_res = pd.DataFrame({
        "pvalue": sorted_pvals,
        "critical_value": critical_vals
    })

    # get the values where the pvalue is lower than the critical value
    df_res_p_less_crit = df_res.query("pvalue <= critical_value").\
        sort_values(by = "pvalue", ascending = False)

    # if none meet the criteria, return 0 so no comparison to it will be significant
    if len(df_res_p_less_crit) == 0:
        return 0
    else:
        # the largest of these is the new threshold
        return df_res_p_less_crit["pvalue"].max()

# compute the threshold for our null-simulation p values
bh_threshold = calc_bh_threshold(pvals)
bh_threshold
0
Now we can apply this method to correct for multiple testing on our p values (some of which we know are false positives).
# organize the p values
df_res = pd.DataFrame({"pval": pvals})

# apply a test - are they lower than the new threshold?
df_res["bh_reject"] = df_res.pval.apply(lambda x: x <= bh_threshold)

# find where we reject the null (a false positive, since we simulated under the null)
sig_adj_bh = df_res.query("bh_reject == True")

# get the new percent
len(sig_adj_bh)/len(pvals)
0.0
We see we no longer get significant p values. Just to make sure we’re correct, we can simulate a case where the means actually are different and see if our implementation agrees with the official one.
# here, the means are different
mu1 = 5
sd1 = 1

mu2 = 5.5
sd2 = 1

# create another simulated vector of p values
pvals = [simulation(mu1, sd1, mu2, sd2) for i in range(0, 1000)]

# organize it
df_check = pd.DataFrame({"pval": pvals})

# add some ids to identify the samples by for better comparison
df_check["sample_id"] = ["sample" + str(i+1) for i in range(0, len(pvals))]

# get a new threshold
bh_threshold_check = calc_bh_threshold(df_check["pval"])
bh_threshold_check
0.04645320875671015
We’ve got our new threshold; now we can compare the p values to it.
# compare the values in the pval column to the new threshold and see how many we reject
df_check["bh_reject_custom"] = df_check.pval.apply(lambda x: x <= bh_threshold_check)

# a count of reject true/false
df_check.bh_reject_custom.value_counts()
True 931
False 69
Name: bh_reject_custom, dtype: int64
Now we compare that to the statsmodels implementation:
# apply the statsmodels function to the p values
df_check["bh_reject_default"] = fdrcorrection(df_check["pval"])[0]

# a table of reject true/false
df_check.bh_reject_default.value_counts()
True 931
False 69
Name: bh_reject_default, dtype: int64
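As an aside (not part of the original comparison), fdrcorrection also returns the BH-adjusted p values as its second element; a small sketch, where the column name pval_adjusted is simply my own choice:
# an aside: the second element returned by fdrcorrection holds the BH-adjusted p values
df_check["pval_adjusted"] = fdrcorrection(df_check["pval"])[1]

# rejecting where the adjusted p value is at or below alpha should line up
# with the boolean output we used above
(df_check["pval_adjusted"] <= 0.05).eq(df_check["bh_reject_default"]).all()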
Let’s assess the agreement by making a column based on their comparison:
# do they agree?
df_check["agreement"] = \
    df_check["bh_reject_custom"] == df_check["bh_reject_default"]

# see where they disagree
[i for i in df_check["agreement"] if i == False]
[]
No disagreement. Just to be absolutely safe, we check the IDs of the rejected p values.
# a subset where the custom implementation rejects
df_check_custom_agree = df_check.query("bh_reject_custom == True")

# a subset where the default rejects
df_check_default_agree = df_check.query("bh_reject_default == True")

# compare the ids
agreement = df_check_custom_agree.sample_id == \
    df_check_default_agree.sample_id

# observe the concordance
agreement.value_counts()
True 931
Name: sample_id, dtype: int64
When to use these methods?
This will depend on the particulars of the experiment in question and must be evaluated on a case-by-case basis. In rough terms, however, it depends on whether you need to be absolutely sure about a particular result or whether you're willing to accept a few false positives. For example, if you're conducting a follow-up study to confirm a result before committing a large amount of time and money to a line of research, Bonferroni makes more sense - making even a single error is costly, and that is exactly what the FWER measures (the probability of even a single mistake). If you're doing a preliminary experiment to nominate targets for further study, then a few false positives might be fine. You might nominate 10-15 targets to be followed up on in the lab, and if only a few pan out, that's ok - they become the new focus. This is common in differential expression experiments [1].