A Case Study in Silent Data Corruption in an RNA-Seq Experiment

How a subtle bug and a misleading error message can transform your RNA-Seq data.
R
Bioinformatics
Statistics
Differential Expression
Tools
Author

Thadryan

Published

April 27, 2020

During a recent differential gene expression analysis, several issues converged in such a way that the code ran fine from top to bottom while silently compromising the analysis and producing bogus results. Essentially, it was a combination of a buggy error message in a package I was using, a bad row in an input file I was given, and some surprising behavior in R (and, of course, my initial carelessness in not noticing sooner). I caught these issues with some included QC functions.

It seemed plausible that these issues could join forces to trip people up now and then, so I figured I would document it in case people wanted to keep an eye out for it.

If you’re not interested in this sort of analysis but use R, there is still a short takeaway summarized here:

(x <- c("1", "10", "100"))
[1] "1"   "10"  "100"
(x <- as.factor(x))
[1] 1   10  100
Levels: 1 10 100
(x <- as.integer(x))
[1] 1 2 3

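Whoops! Those numbers have been totally changed. Calling as.integer() on a factor returns the factor's internal level codes (1, 2, 3), not the numbers that the labels spell out.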
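For anyone who wants to guard against this, here is a minimal sketch of the usual workaround: convert the factor to character before converting it to a number. The `counts` data frame and its column names are made up for illustration and are not taken from the analysis itself.

```r
# Character values, as they might arrive from a messy input file
x <- c("1", "10", "100")
f <- as.factor(x)

# Pitfall: as.integer() on a factor returns the internal level codes
codes <- as.integer(f)

# Safer: go through character first so the labels are parsed as numbers
values <- as.integer(as.character(f))

# Equivalent idiom: convert the levels once, then index them by the factor
values_alt <- as.integer(levels(f))[f]

# Hypothetical count table with one column that was read in as a factor
counts <- data.frame(
  sample_a = c(1L, 10L, 100L),
  sample_b = factor(c("1", "10", "100"))
)

# List any columns that are not numeric before doing math on them
non_numeric <- names(counts)[!vapply(counts, is.numeric, logical(1))]
if (length(non_numeric) > 0) {
  message("Non-numeric columns: ", paste(non_numeric, collapse = ", "))
}
```

A check like the last one is cheap to run right after reading in a count matrix, and it turns a silent conversion into something you actually see.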

The repo for the analysis is here, and the PDF is below:

NOTE: It appears more recent versions of DESeq2 don’t do this!

NOTE: Edited for clarity, 2022-12-13