Jul 25, 2017

RNA-seq Power calculation (FAQ)

I spend a lot of time answering questions from researchers working with genomic data. If I put a lot of effort into an answer, I try to keep it in my file of 'Frequently Asked Questions' - even though this stuff changes fairly rapidly. Last week I got the common question: "How do I calculate power for an RNA-seq experiment?" So here is my FAQ answer. I have summarized the work of many wise statisticians, with great reliance on the RnaSeqSampleSize R package by Shilin Zhao and the nice people at Vanderbilt who built a Shiny website interface to it.

----------

>> I’m considering including an RNA-Seq experiment in a grant proposal. Do you have any advice on how to calculate power for human specimens? I’m proposing to take FACS sorted lymphocytes from disease patients and two control groups. I believe other people analyze 10-20 individuals per group for similar types of experiments.
>>
>> It would be great if you have language that I can use in the grant proposal to justify the cohort size. Also, we can use that number to calculate the budget for your services. Thanks!
>>
>> Ken




Hi Ken,

Power calculations require that you make some assumptions about the experiment. Ideally, you have done some sort of pilot experiment first, so you have an estimate of the total number of expressed genes (RPKM>1), the fold change, the variability between samples within each treatment, and how many genes are going to be differentially expressed. The variability of your samples is probably the single most important issue - humans tend to vary a lot in gene expression; cultured cell lines, not so much. You can reduce variability somewhat by choosing a uniform patient group - age, gender, body mass index, ethnicity, diet, current and previous drug use, etc.

Have a look at this web page for an example of an RNA-seq power calculator.

I plugged in the following data: FDR=0.05, ratio of reads between groups=1, total number of relevant genes=10,000 (i.e. you will remove about half of all genes due to low overall expression prior to differential expression testing), expected number of DE genes=500, fold change for DE genes=2, read count (RPKM) for DE genes=10, and dispersion (standard deviation)=0.5. With these somewhat reasonable values, you get a sample size of 45 per group. So, to get a smaller sample size, you can play with any of these parameters.
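For a rough cross-check of numbers like these, you can compute a per-gene sample size by hand with the normal-approximation formula from Hart et al. (2013) - this is not the exact test the RnaSeqSampleSize package uses, so expect a slightly different answer. The FDR-to-per-test-alpha conversion below is my own crude simplification, for illustration only:

```python
import math
from statistics import NormalDist

def rnaseq_sample_size(fdr=0.05, n_genes=10_000, n_de=500,
                       fold_change=2.0, mean_count=10.0, dispersion=0.5,
                       power=0.9):
    """Per-group sample size via the normal approximation of
    Hart et al. 2013: n = 2*(z_a + z_b)^2 * (1/mu + disp) / ln(fc)^2.
    The FDR is converted to a crude pointwise alpha - an assumption,
    not what the exact-test web calculator does internally."""
    alpha = fdr * n_de / n_genes          # rough per-test alpha
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    n = (2 * (z_a + z_b) ** 2 * (1 / mean_count + dispersion)
         / math.log(fold_change) ** 2)
    return math.ceil(n)

# With the same inputs as above, this lands close to the calculator's 45.
print(rnaseq_sample_size())
```

Note that deepening coverage (raising `mean_count`) shrinks the `1/mu` term, which is exactly why sequencing deeper buys you a little power.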

The estimated Sample Size:
45
Description:
"We are planning a RNA sequencing experiment to identify differential gene expression between two groups. Prior data indicates that the minimum average read counts among the prognostic genes in the control group is 10, the maximum dispersion is 0.5, and the ratio of the geometric mean of normalization factors is 1. Suppose that the total number of genes for testing is 10000 and the top 500 genes are prognostic. If the desired minimum fold change is 2, we will need to study 45 subjects in each group to be able to reject the null hypothesis that the population means of the two groups are equal with probability (power) 0.9 using exact test. The FDR associated with this test of this null hypothesis is 0.05."

To improve power (other than a larger sample size or less variability among your patients), you can sequence deeper (which gives a more accurate and presumably less variable measure of expression for each gene), look only at the most highly expressed genes, or look only at genes with large fold changes. Again, it helps to have prior data to estimate these things.

When I do an actual RNA-seq data analysis, we can improve on the 'expected power' by cheating a bit on the estimate of variance (dispersion). We calculate a single variance estimate for ALL genes, then modify this variance for each individual gene (sort of a Bayesian approach). This allows a lower variance than you would get by just calculating the StdDev for each gene in each treatment. It rests on the assumption that MOST genes are not differentially expressed in your experiment, so the variance of all genes across all samples is a valid estimate of background genetic variance.
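Here is a toy Python sketch of that shrinkage idea - a simplified stand-in for the moderated-dispersion estimators in tools like edgeR and DESeq2, not their actual algorithm, and the weighting scheme is an assumption chosen for illustration:

```python
def shrink_dispersions(gene_disps, prior_weight=10.0, n_samples=5):
    """Pull each noisy per-gene dispersion estimate toward the common
    (all-gene) dispersion. prior_weight controls how strongly we trust
    the common value relative to the n_samples observations per gene."""
    common = sum(gene_disps) / len(gene_disps)
    return [
        (prior_weight * common + n_samples * d) / (prior_weight + n_samples)
        for d in gene_disps
    ]

# A gene with a wildly high per-gene estimate gets pulled back toward
# the common value, so one noisy gene no longer dominates its test.
print(shrink_dispersions([0.1, 0.2, 0.15, 2.0]))
```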


Feb 6, 2017

Oxford MinION 1D and 2D reads

We have been testing out the Oxford MinION DNA sequencing machine to see what it can contribute to larger de novo sequencing projects. Most of the data posted so far come from small genomes, where moderate coverage is more easily obtained. Recent publications claim improved sequence quality for the Oxford MinION.

We are working on a new de novo sequence for the little skate (Leucoraja erinacea), an interesting system for studying developmental biology. The skate has a BIG genome, estimated at 4 Gb (bigger than human), so this is going to be a difficult project. The existing skate genome in GenBank and on the SkateBase website is not in very good shape (3 million contigs with an N50 size of 665 bp).

We got a couple of preliminary Oxford MinION runs of skate DNA - not nearly enough coverage to make a dent in this project, just enough to have a look at the data. Oxford produces two kinds of data (in their own annoying FAST5 format, but I won't rant about that right now): single-pass 1D reads and double-pass 2D reads. [My bioinformatics programmer Yuhan Hao did this analysis.] Here is what our data look like.


So the 1D reads are really long - some more than 50 kb. The 2D reads are mostly 2-10 kb. SkateBase has a complete contig of the mitochondrial genome, so we were able to align the Oxford sequences to it as a reference. Coverage was low, but we do have some regions where both 1D and 2D reads match the reference. What we can see is that the 1D reads have a lot of SNPs versus the reference, while the 2D reads have very few - so it is clear that the 2D reads have been successfully error corrected. Strangely, both the 1D and 2D reads have a lot of insertion-deletion errors (several per hundred bases) compared to the reference, and in fact they do not match each other - so we consider these to all be novel, uncorrected errors.
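If you want to tally indel errors yourself, a quick way is to count insertion/deletion events in the SAM CIGAR strings from the alignment. A minimal Python sketch - the CIGAR string in the example is made up, not our actual skate data:

```python
import re

def indels_per_100bp(cigar):
    """Count insertion/deletion events per 100 aligned bases from a
    SAM-style CIGAR string. M/=/X operations count as aligned bases;
    each I or D run counts as one indel event regardless of length."""
    aligned = indels = 0
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        if op in "M=X":
            aligned += int(length)
        elif op in "ID":
            indels += 1
    return 100 * indels / aligned if aligned else 0.0

# A hypothetical noisy long-read alignment:
print(indels_per_100bp("120M1I80M2D95M1I100M"))
```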

We also ran a standard Illumina whole-genome shotgun sequencing run for the skate genome, which we aligned to the mitochondrial reference. With this data, we can see that a small number of the Oxford 2D SNPs are shared by hundreds of Illumina reads, while others are not. None of the indels are supported by our Illumina reads.
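The cross-check itself is simple: take the Oxford variant positions and ask how many Illumina reads carry the same allele. A sketch in Python - the positions and counts below are hypothetical, and the 10-read support threshold is an arbitrary choice:

```python
def support_counts(oxford_positions, illumina_variant_reads, min_reads=10):
    """Split Oxford-called variant positions into those supported by at
    least min_reads Illumina reads carrying the same variant allele,
    and those without such support. illumina_variant_reads maps
    position -> count of Illumina reads showing the variant."""
    supported = {p for p in oxford_positions
                 if illumina_variant_reads.get(p, 0) >= min_reads}
    return supported, set(oxford_positions) - supported

# Hypothetical mitochondrial positions and Illumina variant-read counts:
snps = [1204, 5881, 9330]
counts = {1204: 240, 9330: 3}
ok, unsupported = support_counts(snps, counts)
print(ok, unsupported)
```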


Other investigators have also reported poor-quality Oxford sequences. With more coverage, we may be able to use the Oxford reads as scaffolds for our de novo assembly. It may be possible to use Illumina reads for error correction and to mark all uncorrected regions of the Oxford sequences as low quality, but that is not the usual method for submitting draft genomes to GenBank.

Jan 17, 2017

GenomeWeb reports on updated Coffee Beetle genome

A nice review of my 2015 coffee beetle paper appeared in GenomeWeb today.
My genome had 163 million bases and 19,222 predicted protein-coding genes. I am very pleased to learn that a revised version of the draft genome sequence (from a group in Colombia) contains 160 million bases and 22,000 gene models. They also confirm the 12 horizontally transferred genes that I identified.

Coffee Pest, Plant Genomes Presented at PAG Conference

Researchers from the USDA's Agricultural Research Service, New York University, King Abdullah University of Science and Technology, and elsewhere published information on a 163 million base draft genome for the coffee berry borer in the journal Scientific Reports in 2015.
That genome assembly, produced with Illumina HiSeq 2000 reads, housed hundreds of small RNAs and an estimated 19,222 protein-coding genes, including enzymes, receptors, and transporters expected to contribute to coffee plant predation, pesticide response, and defense against potential pathogens. It also provided evidence of horizontal gene transfer involving not only mannanase, but several other bacterial genes as well.
At the annual Plant and Animal Genomes meeting here this week, National Center for Coffee Research (Cenicafe) scientist Lucio Navarro provided an update on efforts to sequence and interpret the coffee berry borer genome during a session on coffee genomics. For their own recent analyses, Navarro and his colleagues upgraded an earlier version of a coffee berry borer genome that had been generated by Roche 454 FLX sequencing, using Illumina short reads from male and female coffee berry borers to produce a consensus assembly spanning around 160 million bases. The assembly is believed to represent roughly 96 percent of the insect's genome.
In addition to producing a genome with improved contiguity levels, he reported, members of that team also combined 454 and Illumina reads to get consensus transcriptomes for the beetle. With these and other data, they identified almost 22,000 gene models, novel transposable element families, and their own evidence of horizontal gene transfer.