----------
>> I’m considering including an RNA-Seq experiment
in a grant proposal. Do you have any advice on how to calculate power for human
specimens? I’m proposing to take FACS sorted lymphocytes from disease patients and two control groups. I believe other people
analyze 10-20 individuals per group for similar types of experiments.
>>
>> It would be great if you have language that I
can use in the grant proposal to justify the cohort size. Also, we can use that
number to calculate the budget for your services. Thanks!
>>
>> Ken
Hi Ken,
Power calculations require that you make some assumptions about the
experiment. Ideally, you have already run a pilot experiment, so you have
estimates of the total number of expressed genes (RPKM > 1), the fold change,
the variability between samples within each treatment group, and the number of
genes you expect to be differentially expressed. The variability of your
samples is probably the single most important factor: humans tend to vary a lot
in gene expression, cultured cell lines much less so. You can reduce
variability somewhat by choosing a uniform patient group (matched for age,
gender, body mass index, ethnicity, diet, current and previous drug use, etc.).
Have a look at this web page for an example of an
RNA-seq power calculator.
I plugged in the following values: FDR = 0.05, ratio of reads between
groups = 1, total number of relevant genes = 10,000 (i.e., you will remove
about half of all genes due to low overall expression before differential
expression testing), expected number of DE genes = 500, fold change for DE
genes = 2, minimum average read count for DE genes = 10, dispersion (the
calculator's measure of between-sample variability) = 0.5, power = 0.9. With
these somewhat reasonable values, you get a sample size of 45 per group. To get
a smaller sample size, you can adjust any of these parameters.
The estimated sample size: 45
Description:
"We are planning a RNA sequencing experiment to
identify differential gene expression between two groups. Prior data indicates
that the minimum average read counts among the prognostic genes in the control
group is 10, the maximum dispersion is 0.5, and the ratio of the geometric mean
of normalization factors is 1. Suppose that the total number of genes for
testing is 10000 and the top 500 genes are prognostic. If the desired minimum
fold change is 2, we will need to study 45 subjects in each group to be able to
reject the null hypothesis that the population means of the two groups are
equal with probability (power) 0.9 using exact test. The FDR associated with
this test of this null hypothesis is 0.05."
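As a rough cross-check of the calculator's answer, here is a minimal Python sketch. It is not the calculator's method: the calculator uses an exact test, while this uses a common normal approximation for negative binomial counts, so the two numbers will only agree roughly. The function names are made up for illustration, and the FDR is converted to a per-gene significance level by assuming that 90% of the 500 DE genes are detected.

```python
import math
from statistics import NormalDist


def per_gene_alpha(n_genes, n_de, fdr, power):
    """Per-gene significance level that yields the target FDR on average.

    Assumes `power * n_de` true positives are found among `n_de` DE genes.
    """
    expected_true_hits = n_de * power
    return expected_true_hits * fdr / ((n_genes - n_de) * (1 - fdr))


def samples_per_group(mean_count, dispersion, fold_change, alpha, power):
    """Approximate samples per group for a two-group comparison.

    Normal approximation on the log scale: the variance of a gene's
    log count is taken as 1/mean_count (counting noise) + dispersion.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = z.inv_cdf(power)
    variance = 1 / mean_count + dispersion
    n = 2 * (z_alpha + z_beta) ** 2 * variance / math.log(fold_change) ** 2
    return math.ceil(n)


# Same inputs as entered into the web calculator above.
alpha = per_gene_alpha(n_genes=10_000, n_de=500, fdr=0.05, power=0.9)
n = samples_per_group(mean_count=10, dispersion=0.5, fold_change=2,
                      alpha=alpha, power=0.9)
print(f"per-gene alpha = {alpha:.5f}, samples per group = {n}")
```

With these inputs the approximation should land in the same range as the calculator's 45, which is a useful sanity check before quoting a number in the proposal.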
To improve power (other than by using a larger sample size or reducing
variability among your patients), you can sequence deeper (which gives a more
accurate and presumably less variable measure of expression for each gene),
look only at the most highly expressed genes, or look only at genes with a
large fold change. Again, it helps to have prior data to estimate these things.
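To get a feel for how much each of these levers might help, here is a small Python sketch that varies one parameter at a time. It uses a normal approximation for negative binomial counts, not the web calculator's exact test, so treat the numbers as relative comparisons only. The function name is hypothetical, and the per-gene significance level of 0.0025 is an assumption (roughly what an FDR of 0.05 works out to for 500 DE genes out of 10,000 at 90% power).

```python
import math
from statistics import NormalDist


def samples_per_group(mean_count, dispersion, fold_change,
                      alpha=0.0025, power=0.9):
    """Approximate samples per group (normal approximation, log scale)."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    variance = 1 / mean_count + dispersion  # counting noise + biological
    return math.ceil(2 * z_sum ** 2 * variance / math.log(fold_change) ** 2)


baseline = {"mean_count": 10, "dispersion": 0.5, "fold_change": 2}

# Each scenario changes exactly one parameter relative to the baseline.
scenarios = {
    "baseline": {},
    "10x deeper sequencing": {"mean_count": 100},
    "fold change of 3": {"fold_change": 3},
    "half the dispersion": {"dispersion": 0.25},
}

results = {}
for label, change in scenarios.items():
    results[label] = samples_per_group(**{**baseline, **change})
    print(f"{label:<24} samples per group = {results[label]}")
```

Under this approximation, deeper sequencing only shrinks the 1/mean_count term, so once counts are moderately high the dispersion term dominates and extra depth buys relatively little compared with lower variability or a larger fold change.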
When I do the actual RNA-seq data analysis, I can improve on the 'expected
power' by cheating a bit on the estimate of variance (dispersion). I calculate
a single variance estimate across ALL genes, then use it to adjust the variance
estimated for each individual gene (sort of a Bayesian approach). This allows
for a lower variance than you would get if you just calculated the StdDev for
each gene in each treatment. It rests on the assumption that MOST genes are not
differentially expressed in your experiment, so that the variance of all genes
across all samples is a valid estimate of background genetic variance.
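The idea of borrowing variance information across genes can be illustrated with a toy Python sketch. This is a simplified weighted average, not the actual algorithm of any particular RNA-seq package: the data are simulated, the function name is hypothetical, and the weight given to the shared estimate (`df_prior=10`) is an arbitrary assumption, whereas real tools estimate it from the data.

```python
import random
import statistics


def shrink_variances(gene_vars, df_gene, prior_var, df_prior):
    """Pull each gene's variance toward a shared value.

    Each result is a weighted average of the gene's own estimate
    (weight df_gene) and the shared estimate (weight df_prior).
    """
    return [
        (df_prior * prior_var + df_gene * v) / (df_prior + df_gene)
        for v in gene_vars
    ]


random.seed(1)
n_samples = 5    # samples per group; few samples give noisy variances
true_var = 0.25  # every simulated gene shares this true variance

# Per-gene sample variances of simulated log-expression values.
# All genes share one true variance, so the spread is sampling noise.
gene_vars = [
    statistics.variance(
        [random.gauss(0, true_var ** 0.5) for _ in range(n_samples)]
    )
    for _ in range(1000)
]

common_var = statistics.mean(gene_vars)  # single estimate from ALL genes
shrunk = shrink_variances(gene_vars, df_gene=n_samples - 1,
                          prior_var=common_var, df_prior=10)

print(f"spread of raw estimates:    {statistics.pstdev(gene_vars):.3f}")
print(f"spread of shrunk estimates: {statistics.pstdev(shrunk):.3f}")
```

The shrunk estimates keep the same overall average but scatter much less around it, which is why genes with an unluckily high variance estimate in a small experiment are penalized less in the test.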