Jul 25, 2017

RNA-seq Power calculation (FAQ)

I spend a lot of time answering questions from researchers working with genomic data. If I put a lot of effort into an answer, I try to keep it in my file of 'Frequently Asked Questions' - even though this stuff does change fairly rapidly. Last week I got the common question: "How do I calculate Power for an RNA-seq experiment?  So here is my FAQ answer. I have summarized from the work of many wise statisticians, with great reliance on the RnaSeqSampleSize R package by Shilin Zhao and the nice people at Vanderbilt who build a Shiny website interface to it.


>> I’m considering including an RNA-Seq experiment in a grant proposal. Do you have any advice on how to calculate power for human specimens? I’m proposing to take FACS sorted lymphocytes from disease patients and two control groups. I believe other people analyze 10-20 individuals per group for similar types of experiments.
>> It would be great if you have language that I can use in the grant proposal to justify the cohort size. Also, we can use that number to calculate the budget for your services. Thanks!
>> Ken

Hi Ken,

Power calculations require that you make some assumptions about the experiment.  Ideally, you have done some sort of pilot experiment first, so you have an estimate of the total number of expressed genes (RPKM>1), fold change, variability between samples within each treatment, and how many genes are going to be differentially expressed.   The variability of your samples is probably the single most important issue - humans tend to vary a lot in gene expression, cultured cell lines not so much. You can reduce variability somewhat by choosing a uniform patient group - age, gender, body mass index, ethnicity, diet, current and previous drug use, etc.

Have a look at this web page for an example of an RNA-seq  power calculator.

I plugged in the following data:   FDR=0.05, ratio of reads between groups=1, total number of relevant genes 10,000 (ie. you will remove about half of all genes due to low overall expression prior to differential expression testing).  Expected number of DE genes=500, fold change for DE genes=2, read count (RPKM) for DE genes =10, dispersion (Standard Dev) 0.5.  With these somewhat reasonable values, you get sample size of 45.   So, to get a smaller sample size, you can play with all of the parameters. 

The estimated Sample Size:
"We are planning a RNA sequencing experiment to identify differential gene expression between two groups. Prior data indicates that the minimum average read counts among the prognostic genes in the control group is 10, the maximum dispersion is 0.5, and the ratio of the geometric mean of normalization factors is 1. Suppose that the total number of genes for testing is 10000 and the top 500 genes are prognostic. If the desired minimum fold change is 2, we will need to study 45 subjects in each group to be able to reject the null hypothesis that the population means of the two groups are equal with probability (power) 0.9 using exact test. The FDR associated with this test of this null hypothesis is 0.05."

To improve power (other than larger samples size or less variability among your patients), you can sequence deeper (which allows a more accurate and presumably less variable measure of expression for each gene), only look at the most highly expressed genes, or only look at genes that have large fold change. Again, it helps to have prior data to estimate these things.

When I do an actual RNA-seq data analysis, we can improve on the 'expected power' by cheating a bit on the estimate of variance (dispersion). We calculate a single variance estimate for ALL genes, then modify this variance for each individual gene (sort of a Bayesian approach). This allows for a lower variance than would happen if you just calculate StdDev for each gene in each treatment.  This rests on an assumption that MOST genes are not differentially expressed in your experiment, and the variance of all genes across all samples is a valid estimate of background genetic variance.

1 comment:

Anuj Agarwal said...

Hi Next-Gen Sequencing Team,

My name is Anuj Agarwal. I'm Founder of Feedspot.

I would like to personally congratulate you as your blog Next-Gen Sequencing has been selected by our panelist as one of the Top 75 Bioinformatics Blogs on the web.


I personally give you a high-five and want to thank you for your contribution to this world. This is the most comprehensive list of Top 75 Bioinformatics Blogs on the internet and I’m honored to have you as part of this!

Also, you have the honor of displaying the badge on your blog.