May 8, 2009

Targeted Resequencing

Targeted Resequencing is one area of the DNA sequencing landscape that has not yet been revolutionized by Next-Gen technologies.

Targeted resequencing typically investigates a few genes (or a few dozen) across large populations. The largest portion of the effort involves lots of PCR to collect all the exons (or in some projects entire gene regions), then sequencing each amplification product while keeping track of which PCR product comes from which individual. Even a small project (10 genes, with 10 exons each, on 100 individuals) means 10,000 PCR reactions and 10,000 sequencing reactions, all while keeping accurate track of 10,000 different DNA fragments and avoiding cross-contamination.

The Next-Gen approach would amplify the genomic regions in larger chunks, combine all of the chunks from one individual together, then run the library prep protocol (fragment, attach linkers, etc). So how does this play out in reality?

I read a paper in Genome Biology yesterday (Harismendy et al http://genomebiology.com/2009/10/3/R32) about targeted sequencing. They looked at six genes, which were covered by 28 large PCR amplicons (all exons plus some introns) ranging in size from 3 to 14 kb, for a total of 266 kb of genomic DNA. These PCR products were then combined and used in the sample prep protocols for 454, ABI SOLiD, and Illumina GA sequencing. The same genes were also sequenced by standard Sanger methods using 273 short PCR reactions (88 kb).

Overall, the NG seq methods showed distinct bias favoring the ends of PCR products, and required very high coverage (34-fold, 110-fold and 101-fold for Roche 454, Illumina GA, and ABI SOLiD, respectively) to achieve a 10% false positive rate - false negative rates were much lower.

Let's talk about costs. Sanger sequencing costs from $3-10 per sample. I've got an Internet offer here for $4 per reaction, so let's use that for this study:

Sanger: $4 x 273 = $1092 per individual
Illumina: about $1000 per sample + about $300 per sample for the library prep kit = about $1300 per individual

So I think they are about the same.
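Here is the same back-of-the-envelope comparison as a few lines of Python, in case you want to plug in your own prices. This is only a sketch: the dollar figures are the ones quoted above, and the function and variable names are made up for illustration.

```python
# Rough per-individual cost comparison, using the figures quoted in this post.
# All names here are illustrative; swap in your own core facility's prices.

SANGER_COST_PER_REACTION = 4      # dollars, the Internet offer mentioned above
SANGER_REACTIONS = 273            # short PCR amplicons per individual (Harismendy et al)
ILLUMINA_RUN_COST = 1000          # dollars per sample, approximate
ILLUMINA_PREP_COST = 300          # dollars per sample for the library prep kit, approximate


def sanger_cost_per_individual(cost_per_reaction, n_reactions):
    """One Sanger sequencing reaction per PCR product."""
    return cost_per_reaction * n_reactions


def illumina_cost_per_individual(run_cost, prep_cost):
    """One sample per sequencing run plus one library prep, no multiplexing."""
    return run_cost + prep_cost


sanger = sanger_cost_per_individual(SANGER_COST_PER_REACTION, SANGER_REACTIONS)
illumina = illumina_cost_per_individual(ILLUMINA_RUN_COST, ILLUMINA_PREP_COST)
print(f"Sanger:   ${sanger} per individual")
print(f"Illumina: ${illumina} per individual")
```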

However, the Next Gen methods come out far ahead if you multiplex a group of individuals together in the same sequencing reaction. This is not possible with Sanger methods since the sequence is read from the average of a large number of molecules. Then the question becomes how deep can you multiplex while still producing enough reads from each individual research subject to achieve the depth of coverage needed? Our Illumina GAII currently produces about 2 million (usable) 35 bp reads per lane, but we are ramping up toward 5 million 50 bp reads with the latest upgrades.

2 M X 35 bp = 70 M bases
5 M X 50 bp = 250 M bases

So 250 kb X 100x coverage = 25 M bp per individual, which means a 250 M base lane could hold about 10 individuals at that depth.

So it looks like the current generation of NG machines does have a cost advantage over Sanger methods if you include 8-, 10-, or 12-fold multiplexing. Improved accuracy and reduced sampling bias (sample prep methods) could bring down the coverage requirements and increase the advantage of NG methods.
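The multiplexing arithmetic can be sketched in Python like this. It rests on some optimistic assumptions that are mine, not from the paper: reads are spread evenly across individuals and across the target region, the ~$1000 sequencing cost is shared equally by everyone in the pool, library prep is still paid per individual, and barcoding adds no extra cost. Function names are hypothetical.

```python
# Sketch of the multiplexing estimate, assuming perfectly even read distribution.

def bases_per_lane(n_reads, read_length_bp):
    """Total usable bases produced by one lane."""
    return n_reads * read_length_bp


def max_multiplex(lane_bases, target_bp, coverage):
    """How many individuals fit in one lane at the required depth (rounded down)."""
    return lane_bases // (target_bp * coverage)


def cost_per_individual(run_cost, prep_cost, n_multiplexed):
    """Sequencing cost is shared across the pool; library prep is per individual."""
    return run_cost / n_multiplexed + prep_cost


TARGET_BP = 250_000   # ~250 kb of target sequence
COVERAGE = 100        # ~100x, roughly what the paper needed for Illumina

current = bases_per_lane(2_000_000, 35)    # 70 M bases
upgraded = bases_per_lane(5_000_000, 50)   # 250 M bases

for label, lane in [("current", current), ("upgraded", upgraded)]:
    n = max_multiplex(lane, TARGET_BP, COVERAGE)
    cost = cost_per_individual(1000, 300, n)
    print(f"{label}: {n} individuals per lane, about ${cost:.0f} each")
```

In practice read distribution is never perfectly even (the end-of-amplicon bias mentioned above makes it worse), so the real multiplexing depth would be lower than this estimate.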

I'd really like to hear some other opinions about this issue. We are writing several grant proposals for projects like these and I need some convincing arguments.

—Stuart

Nov 19, 2008

Metagenomics of the effects of antibiotics on the human gut


Dethlefsen L, Huse S, Sogin ML, Relman DA
PLoS Biology Vol. 6, No. 11, e280 doi:10.1371/journal.pbio.0060280

A paper in PLoS Biology from the Relman lab investigates the effect of treatment with the antibiotic ciprofloxacin on the bacteria in the intestine. They collected over 7,000 full-length 16S rDNA sequences (1100-1400 bp) by Sanger sequencing and over 900,000 reads (~250 bp) from 454 sequencing of the V3 and the V6 regions.

There are many important results in this paper, but it is particularly relevant that 454 sequencing reveals more taxonomic variation with greater stability than traditional sequencing. In my own work, I have found that sequence variants that occur only once in the experiment cannot be used to differentiate samples. Deep sequencing reveals more taxa, and also reduces the frequency of singletons. A rare sequence variant (OTU) that occurs only once in the ~7000 full-length sequences occurs about 65 times in the 454 data set, providing more than enough "probability of detection" to be used for comparisons between samples. 
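To put a rough number on that "probability of detection", here is a small Python sketch. It treats sequencing as simple random sampling of reads, which is my simplification and not the paper's analysis. The 450,000-read depth is also my assumption (roughly half of the 900,000 reads, as if they were split evenly between V3 and V6); the function names are just for illustration.

```python
# Simple random-sampling model: each read independently comes from the rare
# variant with probability `frequency`. Illustration only, not the paper's method.

def expected_count(frequency, n_reads):
    """Expected number of reads from a variant at the given frequency."""
    return frequency * n_reads


def prob_at_least_one(frequency, n_reads):
    """Probability of seeing the variant at least once in n_reads."""
    return 1.0 - (1.0 - frequency) ** n_reads


RARE_FREQ = 1 / 7000        # a variant seen once in ~7000 full-length sequences
SANGER_DEPTH = 7000
PYRO_DEPTH = 450_000        # assumed: about half of the 900,000 reads, per region

for label, depth in [("Sanger", SANGER_DEPTH), ("454", PYRO_DEPTH)]:
    print(
        f"{label}: expected count {expected_count(RARE_FREQ, depth):.1f}, "
        f"P(seen at least once) = {prob_at_least_one(RARE_FREQ, depth):.3f}"
    )
```

Under this model a variant at that frequency would be missed in roughly a third of Sanger-scale experiments, while at 454 depth it is expected dozens of times, which is why it becomes usable for comparing samples.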


"This set of 7,208 sequences is among the largest datasets of full-length 16S rRNA sequences from the human microbiota (or any environment), the rarefaction curves for V6 and V3 tag pyrosequencing eventually rise higher and display more curvature toward the horizontal than the OTU0.01 curve. These features show that a single run of the [454] FLX sequencer targeting V6 or V3 tags from the human gut microbiota can reveal more taxa, and capture a larger proportion of the detectable taxa, than a more extensive effort directed toward full-length 16S rRNA clone sequencing."

[Figure 3 from Dethlefsen et al. (journal.pbio.0060280.g003)]





Nov 12, 2008

CisGenome: new software for ChIP-Seq

CisGenome - just published in Nov. Nature Biotechnology.
An integrated software system for analyzing ChIP-chip and ChIP-seq data.
Ji H, Jiang H, Ma W, Johnson DS, Myers RM, Wong WH.
Nat Biotechnol. 2008 Nov;26(11):1293-300.

A full-function integrated bioinformatics suite for ChIP-chip and ChIP-Seq, including peak-finding, FDR control for single samples, subtraction of a control lane, visualization and annotation of peaks on known genomes, and motif finding. Functional GUI on Windows and Mac. Wow.

Software website here:  CisGenome
http://www.biostat.jhsph.edu/~hji/cisgenome/index.htm

Abstract:
We present CisGenome, a software system for analyzing genome-wide chromatin immunoprecipitation (ChIP) data. CisGenome is designed to meet all basic needs of ChIP data analyses, including visualization, data normalization, peak detection, false discovery rate computation, gene-peak association, and sequence and motif analysis. In addition to implementing previously published ChIP–microarray (ChIP-chip) analysis methods, the software contains statistical methods designed specifically for ChIP sequencing (ChIP-seq) data obtained by coupling ChIP with massively parallel sequencing. The modular design of CisGenome enables it to support interactive analyses through a graphic user interface as well as customized batch-mode computation for advanced data mining. A built-in browser allows visualization of array images, signals, gene structure, conservation, and DNA sequence and motif information. We demonstrate the use of these tools by a comparative analysis of ChIP-chip and ChIP-seq data for the transcription factor NRSF/REST, a study of ChIP-seq analysis with or without a negative control sample, and an analysis of a new motif in Nanog- and Sox2-binding regions.

Oct 28, 2008

Gene-Boosted Assembly

Steven Salzberg describes a method for de novo assembly of a bacterial genome (Pseudomonas aeruginosa strain PAb1 = 6.2 Mb) from a set of 33 bp Solexa reads, using two closely related strains as reference sequences, and "boosting" assembly using predicted protein coding regions.

Salzberg SL, Sommer DD, Puiu D, Lee VT (2008) Gene-Boosted Assembly of a Novel Bacterial Genome from Very Short Reads. PLoS Comput Biol 4(9): e1000186. doi:10.1371/journal.pcbi.1000186

The AMOS assembler used in this project employs several different software modules and a considerable amount of hands-on effort. 

AMOScmp is a comparative alignment tool - it aligns short reads to a similar reference genome, and then builds contigs. This avoids the challenge of all-vs-all assembly for de novo genome sequencing projects. 

Minimus is a highly stringent assembler that uses Smith-Waterman alignments to identify overlaps between reads.

Contigs were then scanned for protein coding sequences using a combination of Glimmer and BLAST. The ABBA program uses protein coding information, especially at the ends of contigs and singletons, to close gaps.

Velvet was also used to independently assemble all the reads into contigs, then MUMmer was used to combine contigs and fill gaps.

==================

This method is not going to work for every de novo sequencing problem, but we are going to try something similar for some new Plasmodium and Trichomonas species. 

All software from the Salzberg lab at the Univ. of Maryland is freely available here:

and a page describing the Short Read Assembly methods here:




Oct 20, 2008

Public ChIP-Seq Data

Here are some ChIP-Seq data sets that have been published and are out there in the public domain.



NHLBI

Valouev et al, Sidow lab @ Stanford, 

Robertson et al, 2007, Nature Methods  4(8) 651-7.
Eland processed sequence reads and FindPeaks output for Stat1 and FoxA2 transcription factors