A number of groups at our research center have recently become
interested in metagenomic shotgun sequencing (MGS), which is simply taking
samples that are presumed to contain some microbes, extracting DNA and
sequencing all of it, shotgun style. This is seen as an improvement over
metagenomics methods that amplify 16S ribosomal RNA genes using bacteria-specific
PCR primers and then sequence these PCR products. The 16S approach
has had a lot of success, in that essentially all of the important
"microbiome" publications over the past 5 years have been based on the
16S method. The 16S approach has a number of advantages – the amount of
sequencing effort per sample is quite small (1,000 to 10,000 sequences is
usually considered adequate) and fairly robust computational methods have been
developed to process the sequence data into abundance counts for taxonomic
groups of bacteria.
There are a number of drawbacks for the 16S method. The data
is highly biased by DNA extraction methods, PCR primers and conditions, DNA
sequencing technology, and computational methods used to clean, trim, cluster,
and identify the sequences. What I mean by biased is that if you change any of these factors, then you get a different set of taxa and abundances from the samples. Even when these biases are carefully addressed, the
accuracy of the taxonomic calls is not very good or reliable. It is simply not
possible to identify with high precision and accuracy all bacterial species (or strains) present in a DNA sample with
just ~400 bp of 16S sequence data. Many 16S microbiome studies report differences
in bacterial abundance at the genus or even the phylum level.
Two samples may be found to have reproducible differences in 16S sequence content, but the actual bacterial species or
strains that differ are not confidently identified. Even more important, the low resolution of 16S
studies may be missing a lot of important biology. Huge changes in environment can
perhaps favor anaerobes over aerobes, but smaller changes in pH, nutrient
abundance, or immune cell populations and function may cause a shift from one species in a genus to another. And this
difference in species or strain may bring important changes in metabolite flux,
immune system interaction, etc.
It has been proposed that bulk metagenomic shotgun sequencing (MGS) of all of the DNA in a biosample, rather than just PCR-amplified 16S
sequences, would allow for more precise species and strain identification, and
quantification of the actual microbial genes present. Some MGS methods also
attempt to count genes in specific functional groups such as within the Gene
Ontology, or KEGG pathways. Other people would like to discover completely
novel genes that innovative bacteria have developed to do interesting metabolic
things. This leads to a computationally hard problem. MGS data tend to be very large (200 million to 1 billion reads per sample), and databases of bacterial
genes are incomplete. In fact, we
probably have complete genome sequences for far less than 0.1 percent of
all bacteria in the world. We might also like to identify DNA from archaea,
viruses, and small eukaryotes in our samples.
Each fragment of DNA in an MGS data file has to be
identified based on some type of inexact matching to some set of reference
genes or genomes (a sequence alignment problem), which is computationally very
demanding. In my experience, BLAST is the most sensitive tool for matching diverged
DNA sequences, but depending on the size of the database, it takes from 0.1 to 10
CPU seconds to align each sequence. Aligning 100,000 sequences takes
at least overnight (on 32 CPUs, 128 GB RAM), if not all week. I have never
tried a billion. Multiply that by a
billion reads per sample and you can see that we have a serious compute
challenge. We have at least 200 samples with FASTQ files queued up for analysis,
more being sequenced, and more investigators preparing to start new studies.
So we need a scalable solution; this is not a problem we can solve just by brute-force BLAST
searching on ever bigger collections of computers.
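Just to put that scale in perspective, here's a quick back-of-the-envelope calculation in Python (the per-read timings are my rough numbers from above, not a proper benchmark):

```python
# Rough estimate of CPU time for brute-force BLAST on one MGS sample.
# The per-read timings (0.1-10 CPU seconds) are anecdotal, not benchmarks.

reads_per_sample = 1_000_000_000      # ~1 billion reads per sample
seconds_per_read = (0.1, 10.0)        # best and worst case per alignment
cpus = 32                             # one of our typical compute nodes

for sec in seconds_per_read:
    cpu_hours = reads_per_sample * sec / 3600
    wall_days = cpu_hours / cpus / 24
    print(f"{sec:>4} s/read -> {cpu_hours:,.0f} CPU-hours "
          f"(~{wall_days:,.0f} days of wall-clock time on {cpus} CPUs)")
```

Even the optimistic end of that range works out to roughly a month of wall-clock time per sample on a single 32-CPU node, which is clearly not going to fly for hundreds of samples.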
As more investigators have become interested in MGS, computing
methods for processing this data have popped up like weeds. It's really
difficult to review/benchmark all of these tools, since there are no clear
boundaries for what analysis results they should deliver, and what methods they
should be using. The omictools.com website lists 12 tools for "metagenomics
gene prediction" and 5 for "metagenomics functional annotation"
however these categories might be defined. The best benchmark paper I could find (Gardner et al 2015) looks at about 14.
Should the data be primer- and quality-trimmed (YES!), human
and other contaminants removed (YES!), duplicates removed (maybe not???),
clustered (????), assembled into contigs or complete genomes? Then what sort of
database should the fragments be aligned to? The PFAM library of protein
motifs, a set of complete bacterial genomes, some set of candidate genes?
So far, the most successful tools focus on the
taxonomy/abundance problem – choosing some subset of the sequence data and
comparing it to some set of reference sequences. I have chosen MetaPhlAn <http://www.ncbi.nlm.nih.gov/pubmed/22688413> to
process our data because it does well in benchmark studies, runs quickly on our
data, and Curtis Huttenhower has a superb track record of producing excellent bioinformatics
software. Before running it, we removed primers and quality-trimmed the reads with
Trimmomatic, then removed human sequences by using Bowtie2 to align to the
human reference genome (as much as 90% of the data in some samples). Using
MetaPhlAn on the cleaned data files, we got species/abundance counts for our
~200 MGS samples in less than a week. [Note: when I say 'we', I actually mean
that all the work was done by Hao Chen, my excellent Bioinformatics Programmer.]
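For what it's worth, the skeleton of that cleanup-then-profile workflow looks roughly like the sketch below. This is not our actual production script: the file names, adapter file, Bowtie2 index, and thread count are placeholders, and the exact MetaPhlAn invocation depends on which version you have installed (shown here in MetaPhlAn2 style).

```python
# Sketch of the cleanup + MetaPhlAn steps for one paired-end sample.
# Paths, index names, and thread counts are placeholders.
import subprocess

sample = "sample01"
threads = "16"

# 1. Adapter/primer removal and quality trimming with Trimmomatic (paired-end mode).
subprocess.run([
    "java", "-jar", "trimmomatic.jar", "PE", "-threads", threads,
    f"{sample}_R1.fastq.gz", f"{sample}_R2.fastq.gz",
    f"{sample}_R1.trim.fastq.gz", f"{sample}_R1.unpaired.fastq.gz",
    f"{sample}_R2.trim.fastq.gz", f"{sample}_R2.unpaired.fastq.gz",
    "ILLUMINACLIP:adapters.fa:2:30:10",
    "LEADING:3", "TRAILING:3", "SLIDINGWINDOW:4:15", "MINLEN:36",
], check=True)

# 2. Remove human reads: align against the human reference with Bowtie2 and
#    keep only the read pairs that do NOT align concordantly (--un-conc).
subprocess.run([
    "bowtie2", "-p", threads, "-x", "GRCh38_index",
    "-1", f"{sample}_R1.trim.fastq.gz", "-2", f"{sample}_R2.trim.fastq.gz",
    "--un-conc", f"{sample}_nohuman_R%.fastq",
    "-S", "/dev/null",
], check=True)

# 3. Taxonomic profiling of the cleaned, non-human reads with MetaPhlAn.
subprocess.run([
    "metaphlan2.py",
    f"{sample}_nohuman_R1.fastq,{sample}_nohuman_R2.fastq",
    "--input_type", "fastq", "--nproc", threads,
    "-o", f"{sample}_profile.txt",
], check=True)
```

In practice this gets wrapped in a per-sample loop and handed to the cluster's job scheduler, but those three steps are essentially the whole pipeline described above.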
- An intentionally fuzzy top species abundance heatmap of
pre-publication data for some MGS samples processed with MetaPhlAn. If we added
info about which samples came from which patients, this might be considerably
more interesting.
Moving on, our next objective is to identify microbial protein-coding
genes and metabolic pathways. Here, the methods become much more challenging and difficult to
benchmark. I basically don’t like the idea of assembling reads into contigs.
This adds a lot of compute time, huge inconsistency among different samples
(some will assemble better than others), and creates all kinds of bias for
various species (high vs. low GC genomes, repeats, etc.) and genes.
Also, I'm having a hard time coming up with a solid data analysis plan that meets all of our objectives. Bacteria are diverse. A specific enzyme that fulfills a metabolic function (for example in a MetaCyc pathway) could differ by 80% of its DNA sequence from one type of bacteria to another, but still do the job. Alternatively, there are plenty of multi-gene families of enzymes in bacteria with paralogs that differ in DNA sequence by 20% or less within the genome of a single organism, but perform different metabolic functions. And of course there are sequence variants in individual strains that inactivate an enzyme, or just modify one of its functions. There is no way we can compute such subtle stuff on billions of raw Illumina reads from a mixture of DNA fragments from unknown organisms (many of which have no reference data). So how do we compute up a report that realistically describes the functional differences in gene/pathway metabolic capacity between different sets of MGS samples?
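To make the question concrete, the report I have in mind boils down to something like the toy sketch below. It assumes some upstream step (hmmscan, BLASTx, whatever) has already assigned a functional label to each read, which is exactly the part I don't yet have a good plan for; the two-column file format and the Pfam accession used as an example are made up for illustration.

```python
# Toy sketch: turn per-read functional assignments into per-sample profiles
# that can be compared between groups. The input format (read_id <TAB> function)
# is hypothetical.
import csv
from collections import Counter

def functional_profile(tsv_path):
    """Count reads per functional category, normalized to reads per million
    assigned reads so samples of different sequencing depth are comparable."""
    counts = Counter()
    with open(tsv_path) as fh:
        for read_id, function in csv.reader(fh, delimiter="\t"):
            counts[function] += 1
    total = sum(counts.values()) or 1
    return {fn: n * 1e6 / total for fn, n in counts.items()}

def mean_abundance(profiles, function):
    return sum(p.get(function, 0.0) for p in profiles) / len(profiles)

# Compare two groups of samples on one function of interest.
cases    = [functional_profile(p) for p in ["case1.tsv", "case2.tsv"]]
controls = [functional_profile(p) for p in ["ctrl1.tsv", "ctrl2.tsv"]]

fn = "PF00005"  # ABC transporter domain, used here only as a placeholder
print(f"{fn}: cases {mean_abundance(cases, fn):.1f} vs "
      f"controls {mean_abundance(controls, fn):.1f} reads per million")
```

The hard part, of course, is everything upstream of that table.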
If I have to design the gene content identifier myself, I
will probably translate all reads in 6 frames, then use hmmscan vs. the PFAM library of known protein functional domains. The upside is that I know how to
do this, and it is reasonably sensitive for most types of proteins. The downside
is that many proteins will receive a very general function (i.e.,
"kinase" or "7-TM domain"), which does not reveal exactly
what metabolic functions they are involved in. And of course some 30-40% of
proteins will come up with no known function – just like every new genome that
we sequence. Another way to do this would be to use BLASTx against the set of
bacterial proteins from UniProt. I can speed this up a bit by taking the UniRef 90% identity clusters (which reduces the database size by about 25%). An
alternate proposal, which is much more clumsy, slower, and ad hoc, would be to
BLASTn against a large set of bacterial genomes: either all the complete microbial genomes in GenBank, all the ones collected by the Human Microbiome Project,
or some taxonomically filtered set (one genome per genus???). The point of searching against many complete genomes is that we have some chance of catching rare or unusual genes that are not well annotated as a COG or KEGG pathway member.
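If I do end up rolling my own, a bare-bones version of the 6-frame-translation-plus-hmmscan idea might look something like the sketch below. It assumes Biopython and HMMER3 are installed and that Pfam-A has been downloaded and run through hmmpress; the file names are placeholders, and for billions of reads the translated FASTA would obviously have to be split into chunks and farmed out across the cluster.

```python
# Sketch: translate reads in all 6 frames, then search them against Pfam-A
# with hmmscan. File names are placeholders.
import subprocess
from Bio import SeqIO

with open("reads_6frames.faa", "w") as out:
    for rec in SeqIO.parse("sample_nohuman_R1.fastq", "fastq"):
        for strand_name, strand in (("fwd", rec.seq),
                                    ("rev", rec.seq.reverse_complement())):
            for frame in range(3):
                # Trim to a multiple of 3 so translate() is happy; stop codons
                # show up as '*' (swap them for 'X' if your hmmscan complains).
                sub = strand[frame:]
                sub = sub[: len(sub) - (len(sub) % 3)]
                out.write(f">{rec.id}_{strand_name}_f{frame}\n{sub.translate()}\n")

# Search the translated reads against Pfam-A (previously prepared with hmmpress).
subprocess.run([
    "hmmscan", "--cpu", "16", "--domtblout", "sample_pfam.domtbl",
    "Pfam-A.hmm", "reads_6frames.faa",
], check=True)
```

The read counting and domain-to-pathway rollup would then sit on top of the domtblout file, more or less as in the toy profile sketch above.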
According to Gardner et al. (Sci. Rep. 2015), the best tools
for identifying gene content in MGS samples are the free public servers MG-RAST
and the EBI Metagenomics portal. So we have submitted some test samples to
these. However, the queue is at least 10 days for processing by MG-RAST, so
this may not satisfy my backlog of metagenomics investigators, who are rapidly
cranking out more MGS samples. We probably need a local tool that is under our own control in terms of compute time and more amenable to tweaking for the goals of each project.
At this point I'm asking blog readers for suggestions for a
decent method or tool to identify gene content in big MGS FASTQ files that is reasonably
accurate in terms of protein/pathway function and won't crush my servers. Some support from a review or benchmark paper (other than one by the tool's own authors...) would also be nice.