# data wrangling, numerical computing, visualization
import pandas as pd
import numpy as np
import seaborn as sns
# statistical models
import statsmodels
import statsmodels.api as sm
# existing tools to check our implementation against
from statsmodels.stats.multitest import fdrcorrection
from statsmodels.stats.multitest import multipletests
# random number generation
from numpy.random import default_rng
In this post we explain and re-implement two multiple testing corrections, Bonferroni and Benjamini-Hochberg, for the sake of understanding how they work.
Preliminaries
We prepare the libraries we’ll need.
Simulating a motivating example
The motivation behind multiple testing correction is that if you test enough hypotheses, you will eventually find a result just by chance. The relevance of this concern increases with the number of tests - some experiments test dozens or even hundreds of hypotheses.
We will develop a motivating example to work with. Initially, we will simulate p values from comparisons of populations with the same parameters, so we know the null hypothesis of a t test is true: the populations are not different.
We can set this up nicely in Python using a function utilizing numpy and a list comprehension:
# seed the results for reproducibility
rng = default_rng(seed = 1)

# write a simulation function
def simulation(mu1, sd1, mu2, sd2, draws = 100, n = 1000):
    # generate the two populations
    pop1 = rng.normal(loc = mu1, scale = sd1, size = draws)
    pop2 = rng.normal(loc = mu2, scale = sd2, size = draws)

    # this returns three things, we only need the middle one
    tstat, pval, degf = sm.stats.ttest_ind(pop1, pop2)

    return pval
This function generates two populations and runs a t test comparing them. We can make the populations the same by passing the same mean and standard deviation. Now we can test it:
# set parameters for our populations
mu1 = 5
sd1 = 2.5

mu2 = 5
sd2 = 2.5

# get a single p value
simulation(mu1, sd1, mu2, sd2)
0.9993893617043748
Recall that we’ve “simulated under the null” - we’ve specified that the parameters are the same for both distributions. Thus, any positives are false positives. We can now repeat this a bunch of times to get a distribution of p values. We’re using an \(\alpha\) (alpha) of 0.05 (more on this later), so we expect around 5% of the tests to come back as false positives.
# set a seed
rng = default_rng(seed = 1)

# run a thousand simulations
pvals = [simulation(mu1, sd1, mu2, sd2) for i in range(0, 1001)]

# get the ones below alpha
sig_vals = [p for p in pvals if p <= 0.05]

# calculate the percent that are "significant"
sig_percent = (len(sig_vals)/len(pvals))*100

# percent significant hits
round(sig_percent, 4)
6.2937
We can see that we have a small but non-trivial number of hits flagged as significant. If we think about the definition of a p value, though, this makes sense.
p value: The probability of seeing evidence against the null that is this extreme or more, given the null is true [1] - and given our modeling assumptions are met.
If we’re operating at \(\alpha = 0.05\), we’re accepting that we will wrongly reject a true null about 5% of the time. In experiments where we’re testing a lot of hypotheses, though, this adds up to a lot of mistakes.
For example, consider a differential expression experiment involving 20,000 genes:
# 5 percent wrong in this case...
0.05 * 20000
1000.0
…we’re talking about being wrong a thousand times. This is why multiple testing correction exists. We will consider two types with one example each.
Controlling Family-wise Error Rate (FWER) with the Bonferroni method
Bonferroni is perhaps the most common method for controlling the family-wise error rate, which is the probability of making at least one false finding [2]. In this method we simply establish a new significance threshold by dividing \(\alpha\) by the number of tests.
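To get a feel for why this helps, here is a minimal sketch (my own aside, assuming the tests are independent, in which case the FWER is \(1 - (1 - \alpha)^m\)) showing how quickly the chance of at least one false positive grows with the number of tests \(m\), alongside the Bonferroni-adjusted threshold \(\alpha / m\):
# a minimal sketch, assuming independent tests: FWER = 1 - (1 - alpha)^m
alpha = 0.05
for m in [1, 10, 100, 20000]:
    fwer = 1 - (1 - alpha)**m
    bonferroni_alpha = alpha / m
    print(f"m = {m:>5}: FWER approx. {fwer:.3f}, Bonferroni threshold = {bonferroni_alpha:.2e}")
With even a hundred independent tests, the chance of at least one false positive is already close to 1, which is exactly what dividing \(\alpha\) by the number of tests is meant to rein in.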
# a function to apply the correction
def calc_bonferroni_threshold(vector, alpha = 0.05):
    # divide alpha by the number of tests, i.e., the p values in the vector
    return alpha/len(vector)

# compute a new threshold
bonferroni_threshold = calc_bonferroni_threshold(pvals)

# new threshold
bonferroni_threshold
4.995004995004995e-05
# see which ones are significant now
sig_adj_pvals = [p for p in pvals if p <= bonferroni_threshold]

# calculate the percent
sig_percent_adj = len(sig_adj_pvals)/len(pvals)

# inspect
round(sig_percent_adj, 4)
0.0
Let’s see if we also get 0 with the statsmodels version. Rather than returning a threshold, multipletests returns an array of True/False values telling us whether to reject each null hypothesis, along with the corrected p values (and a couple of other items). If our code agrees with it, we should see a vector with no True values in it.
# call the function
check = multipletests(pvals, method = "bonferroni", alpha = 0.05)

# the 0th index contains the booleans
check_trues = [i for i in check[0] if i == True]

len(check_trues)/len(pvals)
0.0
We’ve taken care of the false positives. As you may have intuited, this is a pretty strict correction (dividing by 20,000 in our other example), and it comes at the cost of statistical power (the ability of a test to find an effect when it’s really there). There is no free lunch, especially where statistical power is concerned. Sometimes a more permissive tradeoff can be made.
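To see the tradeoff concretely, here is a rough sketch (my own illustration, reusing the simulation function from above with a real difference in means) that compares how many p values fall below the nominal \(\alpha\) versus the Bonferroni-adjusted threshold; far fewer typically survive the stricter cut-off:
# a rough sketch of the power cost: simulate p values where the means really differ
pvals_alt = [simulation(5, 1, 5.5, 1) for i in range(0, 1000)]

# hits at the nominal alpha vs the Bonferroni-adjusted threshold
naive_hits = sum(p <= 0.05 for p in pvals_alt)
bonferroni_hits = sum(p <= 0.05/len(pvals_alt) for p in pvals_alt)
naive_hits, bonferroni_hits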
False Discovery Rate (FDR)
While Bonferroni is tried and true, it’s something of a blunt instrument, and there are some situations where another common method, Benjamini-Hochberg [3], often abbreviated “BH”, is preferable. Benjamini-Hochberg controls the false discovery rate [3], which is the expected proportion of the hypotheses we reject (our “discoveries”) that are actually false positives. This is useful in situations where the strictness of Bonferroni might be limiting (more on comparing the two later). The method involves:
- Ranking the p values from lowest to highest.
- Computing a critical value for each one.
- Comparing the p value to the critical value.
- The highest p value that is lower than its corresponding critical value becomes the new threshold - it and all p values smaller than it are considered significant.
The formula for the critical value is:
\[ c = \left(\frac{r}{n}\right)\alpha \]
where \(r\) is the rank of the p value and \(n\) is the total number of tests.
Note you will often see \(\alpha\) denoted \(q\) - I’m just trying to keep things simple by calling it what it is (not exactly a time-honored tradition in statistics).
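Before writing the full function, a quick worked example may help (a toy illustration of my own, using five made-up p values rather than our simulated ones):
# a toy illustration with five made-up p values (not from our simulation)
toy_pvals = sorted([0.001, 0.008, 0.039, 0.041, 0.52])
m = len(toy_pvals)
alpha = 0.05

# compute and compare the BH critical value for each rank
for rank, p in enumerate(toy_pvals, start = 1):
    critical = (rank/m)*alpha
    print(f"rank {rank}: p = {p:.3f}, critical value = {critical:.3f}, below: {p <= critical}")
Here the largest p value that falls below its critical value is 0.008 (rank 2, critical value 0.02), so 0.008 becomes the threshold and only the two smallest p values are called significant.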
This is relatively straightforward to implement in Python (there is a built-in function in R, as well as a function in statsmodels for Python, but we will re-implement it for the sake of learning and transparency). We define the function below:
# function to compute the new BH threshold
def calc_bh_threshold(vector, alpha = 0.05):
    # the number of hypotheses
    m = len(vector)

    # q is just the naming convention for alpha in FDR methods
    q = alpha

    # sort the pvalues
    sorted_pvals = sorted(vector)

    # collect a critical value for each p value
    critical_vals = []
    for i in range(0, len(sorted_pvals)):
        # the rank is the index + 1 as Python is zero-indexed
        rank = i + 1

        # the formula for the BH critical value
        critical_val = (rank/m)*q
        critical_vals.append(critical_val)

    # organize our results
    df_res = pd.DataFrame({
        "pvalue": sorted_pvals,
        "critical_value": critical_vals
    })

    # get the values where the pvalue is lower than the critical value
    df_res_p_less_crit = df_res.query("pvalue <= critical_value").\
        sort_values(by = "pvalue", ascending = False)

    # if none meet the criteria, return 0 so no comparison to it will be significant
    if len(df_res_p_less_crit) == 0:
        return 0
    else:
        # the largest of these is the new threshold
        return df_res_p_less_crit["pvalue"].max()

# compute the threshold for our null-simulation p values
bh_threshold = calc_bh_threshold(pvals)
bh_threshold
0
Now we can apply this method to correct for multiple testing on our p values (some of which we know are false positives).
# organize the p values
df_res = pd.DataFrame({"pval": pvals})

# apply a test - are they lower than the new threshold?
df_res["bh_reject"] = df_res.pval.apply(lambda x: x <= bh_threshold)

# find where we reject the null (a false positive, since we simulated under the null)
sig_adj_bh = df_res.query("bh_reject == True")

# get the new percent
len(sig_adj_bh)/len(pvals)
0.0
We see we no longer get significant p values. Just to make sure we’re correct, we can simulate a case where the means actually are different and see if our implementation agrees with the official one.
# here, the means are different
mu1 = 5
sd1 = 1

mu2 = 5.5
sd2 = 1

# create another simulated vector of p values
pvals = [simulation(mu1, sd1, mu2, sd2) for i in range(0, 1000)]

# organize it
df_check = pd.DataFrame({"pval": pvals})

# add some ids to identify the samples by for better comparison
df_check["sample_id"] = ["sample" + str(i+1) for i in range(0, len(pvals))]

# get a new threshold
bh_threshold_check = calc_bh_threshold(df_check["pval"])
bh_threshold_check
0.04645320875671015
We’ve got our new threshold; now we can compare the p values to it.
# compare the values in the pval column to the new threshold and see how many we reject
df_check["bh_reject_custom"] = df_check.pval.apply(lambda x: x <= bh_threshold_check)

# a count of reject true/false
df_check.bh_reject_custom.value_counts()
True 931
False 69
Name: bh_reject_custom, dtype: int64
Now we compare that to the statsmodels implementation:
# apply the statsmodels function to the p values
df_check["bh_reject_default"] = fdrcorrection(df_check["pval"])[0]

# a table of reject true/false
df_check.bh_reject_default.value_counts()
True 931
False 69
Name: bh_reject_default, dtype: int64
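As an aside (not part of the original comparison), fdrcorrection also returns the BH-adjusted p values as its second element; a small sketch, where the column name pval_adjusted is simply my own choice:
# an aside: the second element returned by fdrcorrection holds the BH-adjusted p values
df_check["pval_adjusted"] = fdrcorrection(df_check["pval"])[1]

# rejecting where the adjusted p value is at or below alpha should line up
# with the boolean output we used above
(df_check["pval_adjusted"] <= 0.05).eq(df_check["bh_reject_default"]).all()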
Let’s assess the agreement by making a column based on their comparison:
# do they agree?
df_check["agreement"] = \
    df_check["bh_reject_custom"] == df_check["bh_reject_default"]

# see where they disagree
[i for i in df_check["agreement"] if i == False]
[]
No disagreement. Just to be absolutely safe, we check the IDs of the rejected p values.
# a subset where the custom implementation rejects
df_check_custom_agree = df_check.query("bh_reject_custom == True")

# a subset where the default rejects
df_check_default_agree = df_check.query("bh_reject_default == True")

# compare the ids
agreement = df_check_custom_agree.sample_id == \
    df_check_default_agree.sample_id

# observe the concordance
agreement.value_counts()
True 931
Name: sample_id, dtype: int64
When to use these methods?
This will depend on the particulars of the experiment in question and must be evaluated on a case-by-case basis. In rough terms, however, it depends on whether you need to be absolutely sure about a particular result or whether you're willing to accept a few false positives. For example, if you're conducting a follow-up study to confirm a result before committing a large amount of time and money to a line of research, Bonferroni makes more sense - making even a single error is costly, and that is exactly what the FWER measures (the probability of even a single mistake). If you're doing a preliminary experiment to nominate targets for further study, then a few false positives might be fine. You might nominate 10-15 targets to be followed up on in the lab, and if only a few pan out, that's ok - they become the new focus. This is common in differential expression experiments [1].