Jul 25, 2017

RNA-seq Power calculation (FAQ)

I spend a lot of time answering questions from researchers working with genomic data. If I put a lot of effort into an answer, I try to keep it in my file of 'Frequently Asked Questions' - even though this stuff does change fairly rapidly. Last week I got the common question: "How do I calculate power for an RNA-seq experiment?" So here is my FAQ answer. I have summarized the work of many wise statisticians, with great reliance on the RnaSeqSampleSize R package by Shilin Zhao and the nice people at Vanderbilt who built a Shiny web interface to it.

----------

>> I’m considering including an RNA-Seq experiment in a grant proposal. Do you have any advice on how to calculate power for human specimens? I’m proposing to take FACS sorted lymphocytes from disease patients and two control groups. I believe other people analyze 10-20 individuals per group for similar types of experiments.
>>
>> It would be great if you have language that I can use in the grant proposal to justify the cohort size. Also, we can use that number to calculate the budget for your services. Thanks!
>>
>> Ken




Hi Ken,

Power calculations require that you make some assumptions about the experiment.  Ideally, you have done some sort of pilot experiment first, so you have an estimate of the total number of expressed genes (RPKM>1), fold change, variability between samples within each treatment, and how many genes are going to be differentially expressed.   The variability of your samples is probably the single most important issue - humans tend to vary a lot in gene expression, cultured cell lines not so much. You can reduce variability somewhat by choosing a uniform patient group - age, gender, body mass index, ethnicity, diet, current and previous drug use, etc.

Have a look at this web page for an example of an RNA-seq  power calculator.

I plugged in the following data: FDR=0.05, ratio of reads between groups=1, total number of relevant genes=10,000 (i.e., you will remove about half of all genes due to low overall expression prior to differential expression testing), expected number of DE genes=500, fold change for DE genes=2, read count (RPKM) for DE genes=10, dispersion=0.5. With these somewhat reasonable values, you get a sample size of 45 per group. So, to get a smaller sample size, you can play with all of the parameters.

The estimated Sample Size:
45
Description:
"We are planning a RNA sequencing experiment to identify differential gene expression between two groups. Prior data indicates that the minimum average read counts among the prognostic genes in the control group is 10, the maximum dispersion is 0.5, and the ratio of the geometric mean of normalization factors is 1. Suppose that the total number of genes for testing is 10000 and the top 500 genes are prognostic. If the desired minimum fold change is 2, we will need to study 45 subjects in each group to be able to reject the null hypothesis that the population means of the two groups are equal with probability (power) 0.9 using exact test. The FDR associated with this test of this null hypothesis is 0.05."

To improve power (other than a larger sample size or less variability among your patients), you can sequence deeper (which allows a more accurate and presumably less variable measure of expression for each gene), only look at the most highly expressed genes, or only look at genes with large fold changes. Again, it helps to have prior data to estimate these things.

When I do an actual RNA-seq data analysis, we can improve on the 'expected power' by cheating a bit on the estimate of variance (dispersion). We calculate a single variance estimate for ALL genes, then modify this estimate for each individual gene (sort of a Bayesian approach). This allows for a lower variance than you would get by just calculating the standard deviation for each gene in each treatment. It rests on the assumption that MOST genes are not differentially expressed in your experiment, so the variance of all genes across all samples is a valid estimate of background genetic variance.
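
A toy illustration of that idea (just a cartoon of shrinking per-gene dispersion estimates toward a common value; real tools such as edgeR and DESeq2 use much more sophisticated empirical Bayes machinery, and the weight below is arbitrary):

# Cartoon of dispersion shrinkage: pull each gene's noisy dispersion estimate
# toward a single estimate made from ALL genes. Not any package's actual method.
import numpy as np

def shrink_dispersion(genewise_disp, weight=0.7):
    """weight = how strongly to trust the common (all-gene) estimate;
    0.7 is purely illustrative, not a recommended value."""
    common_disp = np.median(genewise_disp)            # one estimate for ALL genes
    return weight * common_disp + (1 - weight) * np.asarray(genewise_disp)

raw = np.array([0.05, 0.9, 0.4, 2.5, 0.3])            # toy per-gene dispersions
print(shrink_dispersion(raw))                         # extreme values move toward the median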


Feb 6, 2017

Oxford MinION 1D and 2D reads

We have been testing the Oxford MinION DNA sequencing machine to see what it can contribute to larger de novo sequencing projects. Most of the posted data so far come from small genomes, where moderate coverage is more easily obtained. Recent publications claim improved sequence quality for the Oxford MinION.

We are working on a new de novo sequence for the little skate (Leucoraja erinacea), an interesting system for studying developmental biology. The skate has a BIG genome, estimated at 4 Gb (bigger than human), so this is going to be a difficult project. The existing skate genome in GenBank and on the SkateBase website is not in very good shape (3 million contigs with an N50 size of 665 bp).

We got a couple of preliminary Oxford MinION runs from skate DNA - not nearly enough coverage to make a dent in this project, just enough to have a look at the data. Oxford produces two kinds of data (in their own annoying FAST5 format, but I won't rant about that right now): single-pass 1D reads and double-pass 2D reads. [My bioinformatics programmer Yuhan Hao did this analysis.] Here is what our data look like.


So the 1D reads are really long - some more than 50 kb. The 2D reads are mostly 2-10 kb. SkateBase has a complete contig of the mitochondrial genome, so we were able to align the Oxford sequences to it as a reference. Coverage was low, but we do have some regions where both 1D and 2D reads match the reference. What we can see is that the 1D reads have a lot of SNPs vs. the reference, while the 2D reads have very few SNPs, so it is clear that the 2D reads have been successfully error corrected. Strangely, both the 1D and 2D reads have a lot of insertion-deletion errors (several per hundred bases) compared to the reference, and in fact they do not match each other - so we consider these to all be novel, uncorrected errors.

We also ran a standard Illumina whole-genome shotgun sequencing run for the skate, and aligned those reads to the mitochondrial reference as well. With these data, we can see that a small number of the Oxford 2D SNPs are supported by hundreds of Illumina reads, while others are not. None of the indels are supported by our Illumina reads.


Other investigators have also reported poor-quality Oxford sequences. With more coverage, we may be able to use the Oxford reads as scaffolds for our de novo assembly. It may be possible to use Illumina reads for error correction and mark all uncorrected regions of the Oxford sequences as low quality, but that is not the usual method for submitting draft genomes to GenBank.

Jan 17, 2017

GenomeWeb reports on updated Coffee Beetle genome

A nice review of my 2015 Coffee Beetle paper in GenomeWeb today.
My genome had 163 million bases and 19,222 predicted protein-coding genes. I am very pleased to learn that a revised version of the draft genome sequence (from a group in Colombia) contains 160 million bases and 22,000 gene models. They also confirm the 12 horizontally transferred genes that I identified.

Coffee Pest, Plant Genomes Presented at PAG Conference

Researchers from the USDA's Agricultural Research Service, New York University, King Abdullah University of Science and Technology, and elsewhere published information on a 163 million base draft genome for the coffee berry borer in the journal Scientific Reports in 2015.
That genome assembly, produced with Illumina HiSeq 2000 reads, housed hundreds of small RNAs and an estimated 19,222 protein-coding genes, including enzymes, receptors, and transporters expected to contribute to coffee plant predation, pesticide response, and defense against potential pathogens. It also provided evidence of horizontal gene transfer involving not only mannanase, but several other bacterial genes as well.
At the annual Plant and Animal Genomes meeting here this week, National Center for Coffee Research (Cenicafe) scientist Lucio Navarro provided an update on efforts to sequence and interpret the coffee berry borer genome during a session on coffee genomics. For their own recent analyses, Navarro and his colleagues upgraded an earlier version of a coffee berry borer genome that had been generated by Roche 454 FLX sequencing, using Illumina short reads from male and female coffee berry borers to produce a consensus assembly spanning around 160 million bases. The assembly is believed to represent roughly 96 percent of the insect's genome.
In addition to producing a genome with improved contiguity levels, he reported, members of that team also combined 454 and Illumina reads to get consensus transcriptomes for the beetle. With these and other data, they identified almost 22,000 gene models, novel transposable element families, and their own evidence of horizontal gene transfer.

Dec 8, 2016

Finding differences between bacterial strains 100 bases at a time.

This work was conducted mostly by my research assistant Yuhan Hao.

We have recently been using Bowtie2 for sequence comparisons related to shotgun metagenomics, where we directly sequence samples that contain mixtures of DNA from different organisms, with no PCR. Bowtie2 alignments can be made very stringently (>90% identity) and computed very rapidly for large data files with hundreds of millions of sequence reads. This allows us to identify DNA fragments by species and by gene function, provided we have a well-annotated database that contains DNA sequences from similar microbes. I know that MetaPhlAn and other tools already do this, but I want to focus on the differences between bacteria (and viruses) at the strain level.

I have been playing around with the idea of constructing a database that will give strain-specific identification for human-associated microbes using Bowtie2 alignments with shotgun metagenomic data. There are plenty of sources for bacterial genome sequences, but the PATRIC database has a very good collection of human pathogen genomes. However, the database is very redundant, with dozens or even hundreds of different strains of the same species, and the whole thing is too large to build into a Bowtie2 index (>350 Gb). So I am looking at ways to eliminate redundancy from the database while retaining high sensitivity at the species level (the ability to identify any sequence fragment, not just specific marker genes) and the ability to make strain-specific alignments. In other words, I want to identify which genome fragments are shared among a group of strains within a single species and which fragments are unique to each strain.

Since our sequence reads are usually ~100 bases, we are looking at the similarity between bacterial strains when each genome is chopped into 100 bp pieces.

Bowtie2 can be restricted to perfect alignments (100 bases with zero mismatches) using the parameters --end-to-end and --score-min 'C,0,-1', and to roughly 99% identity with --score-min 'L,0,-0.06'.
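
If you want to reproduce the setup, the sketch below shows the fragment-generation step and one of the stringent alignments in Python. The genome file names and the pre-built Bowtie2 index name are placeholders I made up; the --score-min values are the ones quoted above.

# Chop a genome into non-overlapping 100 bp pseudo-reads, then align them to a
# reference strain with Bowtie2. File names and the "S_agalactiae_ref" index
# name are placeholders.
import subprocess

def chop_fasta(fasta_in, fasta_out, size=100):
    seqs, name = {}, None
    with open(fasta_in) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:].split()[0]
                seqs[name] = []
            elif name:
                seqs[name].append(line)
    with open(fasta_out, "w") as out:
        for name, parts in seqs.items():
            seq = "".join(parts)
            for i in range(0, len(seq) - size + 1, size):   # drop the short tail
                out.write(f">{name}_{i}\n{seq[i:i+size]}\n")

chop_fasta("S_agalactiae_strain2.fasta", "strain2_100bp.fasta")

# Perfect matches only (zero mismatches over the 100 bases)
subprocess.run(["bowtie2", "-f", "--end-to-end", "--score-min", "C,0,-1",
                "-x", "S_agalactiae_ref", "-U", "strain2_100bp.fasta",
                "-S", "strain2_perfect.sam",
                "--un", "strain2_perfect_unmatched.fasta"], check=True)

The fragments that land in the --un file are the "unmatched" fragments counted in the table below.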

Between two strains of Streptococcus agalactiae, if we limit the Bowtie2 alignments to perfect matches, half of the fragments align. So at the 100 base level, half of the genome is identical, and the other half has at least one variant. At 99% similarity (one mismatch per read), about 2/3 of the fragments align; the other 1/3 has more than one variant, or in some cases no similarity at all to the other genome. Yuhan Hao extended this experiment to another species (Strep. pyogenes), where we see almost no alignment of genome fragments to Strep. agalactiae at 100%, 99%, or ~97% identity (see the table below). I have previously used the 97% sequence identity threshold to separate bacterial species, but I thought it applied only to 16S rDNA sequences - here it seems to apply to almost every 100 base fragment in the whole genome.

So we can build a database with one reference genome per species and safely eliminate the 2/3 of fragments that are nearly identical between strains, and retain only those portions of the genome that are unique to each strain.  I'm going to think about how to apply this iteratively (without creating a huge computing task), so we add only the truly unique fragments for EACH strain, rather than just testing if a fragment differs from the Reference for that species. With such a database, stringent Bowtie2 alignments will identify each sequence read by species and by strain and have very low false matches across different species. 

We can visualize the 100 base alignments between two strains at the 100% identity level, at 99% identity, and at the least stringent Bowtie2 setting (--very-fast-local) to see where the strains differ. This makes a pretty picture (below), which reveals some fairly obvious biology: there are parts of the genome that are more conserved and parts that are more variable between strains. The top panel shows only the 100% perfect matches (grey bars), the second panel shows that we can add in some 99% matching reads (white bars with a single vertical color stripe to mark the mismatch base), and the lowest panel shows reads that are highly diverged (lots of colored mismatch bases and some reads that are clearly mis-aligned from elsewhere in the genome). So if we keep only the fragments that are less than 99% identical to the reference, we can build a database that lets Bowtie2 identify different strains with few multiple matches or incorrectly matched reads.

This table lists the number of UNMATCHED fragments after aligning all non-overlapping 100 bp fragments from the genome of each strain to the Strep. agalactiae genome that we arbitrarily chose as the reference. --score-min 'C,0,-1' corresponds to perfect matches only, 'L,0,-0.06' is approximately 99% identity (by my calculation), and 'L,0,-0.2' is about 97% identity. The lower the stringency of alignment, the fewer fragments end up in the "unmatched" file.

--score-min              S.ag Ref   S.ag st2   S.ag st3   S.pyo st1   S.pyo st2
# 100 bp fragments          20615      22213      21660       17869       17094
C,0,-1    (perfect)             0      11243      10571       17718       16934
L,0,-0.06 (~99% ident)          0       6605       5918       17533       16762
L,0,-0.1                        0       6534       5839       17533       16762
L,0,-0.12                       0       4812       4058       17371       16600
L,0,-0.15                       0       4767       4006       17369       16600
L,0,-0.17                       0       4756       3997       17369       16600
L,0,-0.2  (~97% ident)          0       4156       3419       17237       16441

May 27, 2016

Functional Metagenomics from shotgun sequences

A number of groups at our research center have recently become interested in metagenomic shotgun sequencing (MGS), which is simply taking samples that are presumed to contain some microbes, extracting the DNA, and sequencing all of it, shotgun style. This is seen as an improvement over metagenomics methods that amplify 16S ribosomal RNA genes using bacteria-specific PCR primers and then sequence these PCR products. The 16S approach has had a lot of success, in that essentially all of the important "microbiome" publications over the past 5 years have been based on the 16S method. The 16S approach has a number of advantages – the amount of sequencing effort per sample is quite small (1,000 to 10,000 sequences is usually considered adequate) and fairly robust computational methods have been developed to process the sequence data into abundance counts for taxonomic groups of bacteria.

There are a number of drawbacks to the 16S method. The data are highly biased by DNA extraction methods, PCR primers and conditions, DNA sequencing technology, and the computational methods used to clean, trim, cluster, and identify the sequences. What I mean by biased is that if you change any of these factors, you get a different set of taxa and abundances from the same samples. Even when these biases are carefully addressed, the accuracy of the taxonomic calls is not very good or reliable. It is simply not possible to identify all bacterial species (or strains) present in a DNA sample with high precision and accuracy from just ~400 bp of 16S sequence data. Many 16S microbiome studies report differences in bacterial abundance only at the genus or even the phylum level.

Two samples may be found to have reproducible differences in 16S sequence content, but the actual bacterial species or strains that differ are not confidently identified. Even more important, the low resolution of 16S studies may be missing a lot of important biology. Huge changes in environment can perhaps favor anaerobes over aerobes, but smaller changes in pH, nutrient abundance, or immune cell populations and function may cause a shift from one species in a genus to another. And this difference in species or strain may bring important changes in metabolite flux, immune system interaction, etc.

It has been proposed that bulk metagenomic shotgun sequencing (MGS) of all of the DNA in a biosample, rather than just PCR-amplified 16S sequences, would allow for more precise species and strain identification, and quantification of the actual microbial genes present. Some MGS methods also attempt to count genes in specific functional groups, such as Gene Ontology terms or KEGG pathways. Other people would like to discover completely novel genes that innovative bacteria have developed to do interesting metabolic things. This leads to a computationally hard problem. MGS data tend to be very large (200 million to 1 billion reads per sample), and databases of bacterial genes are incomplete. In fact, we probably have complete genome sequences for far less than 0.1 percent of all bacteria in the world. We might also like to identify DNA from archaea, viruses, and small eukaryotes in our samples.

Each fragment of DNA in an MGS data file has to be identified based on some type of inexact matching to a set of reference genes or genomes (a sequence alignment problem), which is computationally very demanding. In my experience, BLAST is the most sensitive tool for matching diverged DNA sequences, but depending on the size of the database, it takes from 0.1 to 10 CPU-seconds to align each sequence. 100,000 sequences takes at least overnight (on 32 CPUs, 128 GB RAM), if not all week; I have never tried a billion. Multiply that by a billion reads per sample and you can see that we have a serious compute challenge. We have at least 200 samples with FASTQ files queued up for analysis, more being sequenced, and more investigators preparing to start new studies. So we need a scalable solution; this cannot be solved just by brute-force BLAST searching on ever-bigger collections of computers.
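
To put a number on that, here is the arithmetic, using the 0.1-10 CPU-second per-read range above:

# Back-of-the-envelope scale of brute-force BLAST on one MGS sample
reads = 1_000_000_000
for sec_per_read in (0.1, 10):
    cpu_years = reads * sec_per_read / (3600 * 24 * 365)
    print(f"{sec_per_read:>4} s/read -> ~{cpu_years:,.0f} CPU-years per sample")
# roughly 3 CPU-years per sample at the optimistic end, over 300 at the pessimistic end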

As more investigators have become interested in MGS, computing methods for processing this data have popped up like weeds. It's really difficult to review/benchmark all of these tools, since there are no clear boundaries for what analysis results they should deliver or what methods they should be using. The omictools.com website lists 12 tools for "metagenomics gene prediction" and 5 for "metagenomics functional annotation", however those categories might be defined. The best benchmark paper I could find (Gardner et al. 2015) looks at about 14.

Should the data be primer- and quality-trimmed (YES!), human and other contaminants removed (YES!), duplicates removed (maybe not???), clustered (????), assembled into contigs or complete genomes? Then what sort of database should the fragments be aligned to? The PFAM library of protein motifs, a set of complete bacterial genomes, some set of candidate genes?

So far, the most successful tools focus on the taxonomy/abundance problem – choosing some subset of the sequence data and comparing it to some set of reference sequences. I have chosen MetaPhlAn <http://www.ncbi.nlm.nih.gov/pubmed/22688413> to process our data because it does well in benchmark studies, runs quickly on our data, and Curtis Huttenhower has a superb track record of producing excellent bioinformatics software. Before profiling, we removed primers and quality-trimmed with Trimmomatic, then removed human sequences by using Bowtie2 to align to the human reference genome (as much as 90% of the data in some samples). Using MetaPhlAn on the cleaned data files, we got species/abundance counts for our ~200 MGS samples in less than a week. [Note: when I say 'we', I actually mean that all the work was done by Hao Chen, my excellent Bioinformatics Programmer.]
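
The cleaning and profiling pipeline looks roughly like the sketch below. The index names, adapter file, and sample paths are placeholders, and the exact flags should be checked against each tool's own documentation rather than taken from me.

# Rough sketch of the trim -> remove human -> profile pipeline described above.
import subprocess

def run(cmd):
    print(" ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Adapter/quality trimming with Trimmomatic (paired-end)
run(["java", "-jar", "trimmomatic.jar", "PE",
     "sample_R1.fastq.gz", "sample_R2.fastq.gz",
     "trim_R1.fastq.gz", "unpaired_R1.fastq.gz",
     "trim_R2.fastq.gz", "unpaired_R2.fastq.gz",
     "ILLUMINACLIP:adapters.fa:2:30:10", "SLIDINGWINDOW:4:20", "MINLEN:50"])

# 2. Remove human reads: keep only pairs that do NOT align to the human genome
run(["bowtie2", "-x", "GRCh38_index",
     "-1", "trim_R1.fastq.gz", "-2", "trim_R2.fastq.gz",
     "--un-conc-gz", "nonhuman_R%.fastq.gz", "-S", "/dev/null"])

# 3. Taxonomic profiling of the cleaned reads (MetaPhlAn2 script name shown here)
run(["metaphlan2.py", "nonhuman_R1.fastq.gz,nonhuman_R2.fastq.gz",
     "--input_type", "fastq", "--bowtie2out", "sample.bt2.bz2",
     "-o", "sample_profile.txt"])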


  1. An intentionally fuzzy top-species abundance heatmap of pre-publication data for some MGS samples processed with MetaPhlAn. If we added info about which samples came from which patients, this might be considerably more interesting.


Moving on, our next objective is to identify microbial protein-coding genes and metabolic pathways. Here, the methods become much more challenging and difficult to benchmark. I basically don't like the idea of assembling reads into contigs. This adds a lot of compute time, introduces huge inconsistency among different samples (some will assemble better than others), and creates all kinds of bias for various species (high vs low GC genomes, repeats, etc.) and genes. Also, I'm having a hard time coming up with a solid data analysis plan that meets all of our objectives. Bacteria are diverse. A specific enzyme that fulfills a metabolic function (for example, in a MetaCyc pathway) could differ by 80% of its DNA sequence from one type of bacteria to another, but still do the job. Alternatively, there are plenty of multi-gene families of enzymes in bacteria with paralogs that differ in DNA sequence by 20% or less within the genome of a single organism, but perform different metabolic functions. And of course there are sequence variants in individual strains that inactivate an enzyme, or just modify one of its functions. There is no way we can compute such subtle stuff on billions of raw Illumina reads from a mixture of DNA fragments from unknown organisms (many of which have no reference data). So how do we produce a report that realistically describes the differences in gene/pathway metabolic capacity between different sets of MGS samples?

If I have to design the gene content identifier myself, I will probably translate all reads in 6 frames, then use hmmscan vs. the PFAM library of known protein functional domains. The upside is that I know how to do this, and it is reasonably sensitive for most types of proteins. The downside is that many proteins will receive a very general function (e.g., "kinase" or "7-TM domain"), which does not reveal exactly what metabolic functions they are involved in. And of course some 30-40% of proteins will come up with no known function – just like in every new genome that we sequence. Another way to do this would be to use BLASTx against the set of bacterial proteins from UniProt. I can speed this up a bit by taking the UniRef 90% identity clusters (which reduces the database size by about 25%). An alternate proposal, which is much more clumsy, slow, and ad hoc, would be to BLASTn against a large set of bacterial genomes: either all the complete microbial genomes in GenBank, or all the ones collected by the Human Microbiome Project, or some taxonomically filtered set (one genome per genus???). The point of searching against many complete genomes is that we have some chance of catching rare or unusual genes that are not well annotated as a COG or KEGG pathway member.
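
A minimal sketch of that first option (six-frame translation of the reads, then hmmscan against Pfam-A) might look like this; the input/output file names and the Pfam-A.hmm location are placeholders, and I am leaning on Biopython for the translation.

# Six-frame translation of reads followed by a Pfam domain search with HMMER.
import subprocess
from Bio.Seq import Seq

def six_frame_fasta(fastq_in, faa_out):
    """Translate every read in all six frames and write the peptides as FASTA."""
    with open(fastq_in) as fq, open(faa_out, "w") as out:
        while True:
            header = fq.readline().strip()
            if not header:
                break
            read = Seq(fq.readline().strip())
            fq.readline(); fq.readline()                  # skip the '+' and quality lines
            name = header[1:].split()[0]
            for strand, seq in (("+", read), ("-", read.reverse_complement())):
                for frame in range(3):
                    sub = seq[frame:]
                    sub = sub[:len(sub) - len(sub) % 3]   # trim to whole codons
                    out.write(f">{name}_{strand}{frame}\n{sub.translate()}\n")

six_frame_fasta("nonhuman_R1.fastq", "reads_6frame.faa")

# Search the translated reads against the Pfam-A profile HMM library
subprocess.run(["hmmscan", "--domtblout", "reads_vs_pfam.domtbl", "--cpu", "8",
                "Pfam-A.hmm", "reads_6frame.faa"], check=True)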

According to Gardner et al (Sci. Rep. 2015), the best tools for identifying gene content in MGS samples are the free public servers MG-RAST and the EBI Metagenomics portal. So we have submitted some test samples to these. However, the queue is at least 10 days for processing by MG-RAST, so this is maybe not going to satisfy my backlog of metagenomics investigators who are rapidly cranking out more MGS samples. We probably need a local tool that will be under our own control in terms of compute time and more amenable to tweaking to the goals of each project. 

At this point I'm asking blog readers for suggestions for a decent method or tool to identify gene content in big MGS FASTQ files - one that will be reasonably accurate in terms of protein/pathway function and not crush my servers. Some support from a review or benchmark paper (other than by the tool's own authors...) would be nice, too.