I have been helping out several groups who want to do cheap, quick genome
projects on previously "unsequenced" eukaryotic organisms. In terms of the
genetic diversity of living things, at this point we have sampled very unevenly
across taxonomic domains. Insects are particularly underrepresented in the
whole genome database. According to the number of named species, insects
represent over 80% of animal species. The NCBI has over 10,000 whole genome
projects, but only 132 insects (and 36 of those are Drosophila species). So
there is certainly taxonomic room to knock out some more insect genomes and
discover interesting new stuff. A wide variety of valid arguments can be made
for the usefulness of sequencing any number of as yet overlooked organisms in
all taxonomic domains.
Many very useful experimental approaches require a draft genome as a reference:
gene expression studies, transcription factor binding and epigenetics
(ChIP-seq), and the basic evolutionary biology of important genes that are
present or absent in the genome, novel paralogs in important gene families, etc.
A few years ago, building a draft genome for a new organism was a major
undertaking that required substantial funding and a dedicated research team.
Today, the sequencing can be done fairly cheaply, but the bioinformatics work is
extremely open ended. Clearly you can work and work and work to make a very well
defined and annotated genome with maximum value for all possible users. But what
is the optimal set of most useful genomic information that we can produce with a
few person-weeks of time (information that would perhaps serve as preliminary
data to bring in showers of additional funding for follow-up studies)?
NOTE ONE: Collect both genomic and RNA sequence data. These two data types are
extremely complementary. Many earlier de novo genome projects built on
collections of existing EST sequence data, which was the poor man's approach to
draft genomes in the previous decade. The ESTs provided seeds for gene finding
on the genomic DNA, training of gene-finding algorithms, etc. Now we can get a
comprehensive genome-wide set of RNA-seq data for the cost of one lane on a
HiSeq machine and a sample prep kit. If you have the choice, get 100 bp
paired-end sequencing of the RNA; it will map better and end up giving more
value for the dollar. It might be possible to use paired-end RNA reads to bridge
DNA contigs into scaffolds, but as far as I know this is an untested area. It
would be very helpful to have a normalized RNA library to get more coverage of
poorly expressed genes. This is on my wish list for future Cheap-O Genome
projects.
NOTE TWO: More data is good. High total genome coverage is good, but long insert paired-end
DNA libraries build better genome contigs. This makes complete sense. Early
"shotgun" genome projects relied on sequencing the ends of clones
from libraries with various size inserts. It would be really nice to have 10 or
20 KB insert libraries, or "mate pair" sequences that come from the
junctions of large genomic fragments
that have been circularized, but these are generally not available when your
entire sequencing budget is in the single digit thousands of dollars. We were
able to get a 550 bp insert library for a recent project and it led to an
assembly with an N50 > 40 KB. Pretty good for two lanes of HiSeq data.
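For readers who have not computed an N50 before, here is a minimal Python sketch of the usual definition: sort the contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the assembly size. The function name and the example lengths are made up for illustration; they are not from our project.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running sum to half of the
        # total assembly length defines the N50.
        if running >= half_total:
            return length
    return 0


# Hypothetical contig lengths, in bp, purely for illustration.
example_lengths = [80_000, 40_000, 30_000, 20_000, 10_000]
print(n50(example_lengths))
```

Most assemblers report this number in their own summary output, so a sketch like this is mainly useful as a sanity check after filtering out the very short contigs.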
For our next cheap genome, we are using the Illumina TruSeq
Synthetic Long-Read kit (which is based on technology developed by Moleculo).
This is a really clever idea: first it breaks the genome into ~10 kb fragments
and sorts the fragments into the wells of a 384-well plate, with just a few
dozen to a few hundred fragments in each well. The fragments in each well are
clonally amplified (sort of like 454 technology), then sheared into the normal
size range for Illumina sequencing (300-500 bp) and tagged with barcode primers
at the ends. Then all the tagged fragments are pooled and sequenced normally on
a HiSeq machine. Illumina has a custom assembly app (built in BaseSpace) that
demultiplexes the data and does a separate de novo assembly on each barcode set,
so each assembly only has to handle the small number of 10 kb fragments from
one well. The final output is a set of "synthetic long reads" that really do
seem to be 10 kb long.
(From Illumina product literature)
NOTE THREE: I like the SOAPdenovo assembler (127-kmer) for Illumina DNA
sequence data. It did a good job for us on several
different species with only a moderate consumption of computing resources (an
overnight job on 32 processors with shared 128 GB of RAM). The final product is
a set of contigs in FASTA format, some quite big, and a lot of little ones.
Hopefully the sum of the contigs comes out to something similar to the expected
genome size of the organism. The quite new SOAPdenovo-Trans assembler for
RNA-seq also worked quite well for us, at least in comparison to Trinity, which
is a huge computer hog.
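A quick way to check whether the sum of the contigs lands near the expected genome size is to total the sequence lengths in the contig FASTA file. The sketch below uses only the Python standard library; the function names, the toy FASTA records, and the expected genome size are all hypothetical placeholders.

```python
import io


def contig_lengths(handle):
    """Yield (name, length) for each record in an open FASTA file."""
    name, length = None, 0
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, length
            # Keep only the first word of the header as the contig name.
            name = (line[1:].split() or ["unnamed"])[0]
            length = 0
        elif name is not None:
            length += len(line)
    if name is not None:
        yield name, length


def assembly_summary(handle, expected_genome_size):
    """Summarize an assembly relative to an expected genome size in bp."""
    lengths = [n for _, n in contig_lengths(handle)]
    total = sum(lengths)
    return {
        "contigs": len(lengths),
        "total_bp": total,
        "longest_bp": max(lengths, default=0),
        "fraction_of_expected": total / expected_genome_size,
    }


# Toy input standing in for a real contig file (illustration only).
toy_fasta = io.StringIO(">c1 toy contig\nACGT\nAC\n>c2\nGGG\n")
print(assembly_summary(toy_fasta, expected_genome_size=18))
```

For a real assembly, pass an open file handle for the contig FASTA instead of the toy `StringIO` object, along with your best estimate of the genome size.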
I visualize the bioinformatics work in two parts. First, find the genes in our data. Second, annotate the found genes and
the genome using reference data. [Annotation will be described in another blog post.]
Ok, so here is my gene finding workflow for the Cheap-O
Genome Project.
Gene Finding Workflow
- Assemble DNA reads into genome contigs with
SOAPdenovo assembler (127-kmer)
- De novo gene finding on the DNA contigs with
GENSCAN or GeneMark (I used GeneMark).
- Assemble RNA-seq reads into
"transcripts" with SOAPdenovo-Trans
- Map RNA-seq reads onto the DNA contigs with
TopHat
- Make another set of transcripts with Cufflinks
(using no annotation file)
- Use BLAT to map the de novo assembled
transcripts onto the DNA contigs
- Use the extremely useful psl_to_bed_best_score.pl script written by Dave Tang (https://gist.github.com/davetang/7314846) to convert the output of BLAT (in .psl format) into a .bed file, choosing only the best match for each query. Without this sorting and conversion, the BLAT results as a PSL file look like garbage in a genome browser.
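To show what the last step in the list above is doing, here is a rough Python sketch of the same idea: keep only the best-scoring BLAT hit for each query and write it out as a 12-column BED line. This is not a port of Dave Tang's Perl script. The scoring formula is my assumption, modeled on the UCSC-style PSL score (matches plus half of the repeat matches, minus mismatches and insert counts), and it assumes nucleotide (untranslated) BLAT output with a single strand character. The function names are hypothetical.

```python
def psl_score(f):
    """Score one PSL row (list of 21 string fields), UCSC-style (assumed)."""
    matches, mismatches, rep_matches = int(f[0]), int(f[1]), int(f[2])
    q_num_insert, t_num_insert = int(f[4]), int(f[6])
    return matches + rep_matches // 2 - mismatches - q_num_insert - t_num_insert


def best_hits(psl_lines):
    """Return {query name: (score, fields)} keeping the top hit per query."""
    best = {}
    for line in psl_lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 21 or not f[0].isdigit():
            continue  # skip the PSL header and any malformed lines
        score = psl_score(f)
        q_name = f[9]
        if q_name not in best or score > best[q_name][0]:
            best[q_name] = (score, f)
    return best


def psl_to_bed12(f, score):
    """Convert one PSL row to a BED12 line with one block per aligned block."""
    t_name, t_start, t_end = f[13], int(f[15]), int(f[16])
    block_sizes = f[18].rstrip(",").split(",")
    # BED block starts are relative to chromStart; PSL tStarts are absolute.
    block_starts = [str(int(s) - t_start) for s in f[20].rstrip(",").split(",")]
    bed_score = max(0, min(1000, score))  # BED scores must be in 0..1000
    return "\t".join([
        t_name, str(t_start), str(t_end), f[9], str(bed_score), f[8][0],
        str(t_start), str(t_end), "0", f[17],
        ",".join(block_sizes), ",".join(block_starts),
    ])


def convert(psl_path, bed_path):
    """Write the best hit per query from a PSL file to a BED file."""
    with open(psl_path) as psl:
        best = best_hits(psl)
    with open(bed_path, "w") as bed:
        for score, fields in best.values():
            bed.write(psl_to_bed12(fields, score) + "\n")
```

Calling `convert("transcripts.psl", "transcripts.bed")` (both file names are placeholders) would produce one BED line per assembled transcript, with the alignment blocks drawn as exons in the browser.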
OK, now assemble all 5 data sets into one nice visualization
using IGV or GBrowse. We have a genome track (the DNA contigs from SOAPdenovo
in FASTA format), an RNA track (the RNA-seq reads aligned to the draft genome
in BAM format), a gene prediction track (GeneMark GTF file), the Cufflinks
transcripts (transcripts.gtf), and the RNA assemblies (from SOAPdenovo-Trans)
as a BED file. For some genes, all of the data agree quite nicely. For other
genes, it's anyone's best guess.
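If you want to load the same five tracks repeatedly, IGV can be driven by a plain-text batch script. The Python sketch below only builds that script as a string; it uses the IGV batch commands `new`, `genome`, `load`, `snapshotDirectory`, `goto`, and `snapshot`. All file names except `transcripts.gtf` are placeholders I made up (`accepted_hits.bam` is the name TopHat normally gives its alignments), and the locus is hypothetical.

```python
def igv_batch_script(genome_fasta, tracks, loci, snapshot_dir):
    """Build an IGV batch script that loads tracks and snapshots each locus."""
    lines = ["new", f"genome {genome_fasta}"]
    lines += [f"load {track}" for track in tracks]
    lines.append(f"snapshotDirectory {snapshot_dir}")
    for locus in loci:
        lines.append(f"goto {locus}")
        # Name each image after its locus, avoiding ':' in file names.
        lines.append(f"snapshot {locus.replace(':', '_')}.png")
    return "\n".join(lines) + "\n"


# Placeholder file names standing in for the tracks described above.
script = igv_batch_script(
    genome_fasta="contigs.fa",
    tracks=[
        "accepted_hits.bam",     # RNA-seq reads aligned to the contigs
        "genemark.gtf",          # de novo gene predictions
        "transcripts.gtf",       # Cufflinks transcripts
        "soapdenovo_trans.bed",  # de novo transcripts mapped with BLAT
    ],
    loci=["contig1:1000-9000"],  # hypothetical region of interest
    snapshot_dir="igv_snapshots",
)
print(script)
```

Saving the printed text to a file and running it as a batch script from IGV should reproduce the view and save an image of each region, which is handy when you want to page through many contigs.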
Here are two IGV screenshots of examples from the same genome
contig. The first is a nice gene with plenty of RNA where
all 3 annotation methods agree.
The second is a messy region where no gene model makes much
sense, none of the methods agree at all, but there seems to be enough RNA (and
spliced alignments!) to suggest real transcription is happening. Time to add some reference data by homology modeling (in my next post on annotation).