May 8, 2009

Targeted Resequencing

Targeted Resequencing is one area of the DNA sequencing landscape that has not yet been revolutionized by Next-Gen technologies.

Targeted resequencing typically investigates a few genes (or a few dozen) across large populations. The largest portion of the effort goes into PCR to collect all the exons (or, in some projects, entire gene regions), then sequencing each amplification product while keeping track of which PCR product comes from which individual. Even a small project (10 genes with 10 exons each, on 100 individuals) means 10,000 PCR reactions and 10,000 sequencing reactions, all while keeping accurate track of 10,000 different DNA fragments and avoiding cross-contamination.

The Next-Gen approach would amplify the genomic regions in larger chunks, pool all of the chunks from one individual, then run the library prep protocol (fragment, attach linkers, etc.). So how does this play out in reality?

I read a paper in Genome Biology yesterday (Harismendy et al.) about targeted sequencing. They looked at six genes, which were covered by 28 large PCR amplicons (all exons plus some introns) ranging in size from 3 to 14 kb, for a total of 266 kb of genomic DNA. These PCR products were then combined and used in the sample prep protocols for 454, ABI SOLiD, and Illumina GA sequencing. The same genes were also sequenced by standard Sanger methods using 273 short PCR reactions (88 kb).

Overall, the NG sequencing methods showed a distinct bias favoring the ends of PCR products and required very high coverage (34-fold, 110-fold and 101-fold for Roche 454, Illumina GA, and ABI SOLiD, respectively) to achieve a 10% false positive rate; false negative rates were much lower.

Let's talk about costs. Sanger sequencing costs anywhere from $3 to $10 per sample. I've got an Internet offer here for $4 per reaction, so let's use that for this study:

Sanger: $4 x 273 = $1092 per individual
Illumina is about $1000 per sample plus about $300 per sample for the library prep kit.

So I think they are about the same.
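
As a quick sanity check, the per-individual comparison can be written out in a few lines of Python. This is only a sketch using the rough prices quoted above; the variable names are made up for illustration, and the Illumina figure assumes one individual per sample with no multiplexing.

```python
# Rough per-individual cost comparison, using the prices quoted in the post.
# All variable names are illustrative; the prices are approximate.
SANGER_COST_PER_REACTION = 4   # USD, the quoted Internet offer
SANGER_REACTIONS = 273         # short PCR reactions in the Sanger arm
ILLUMINA_RUN_COST = 1000       # USD per sample (approximate)
ILLUMINA_LIBRARY_PREP = 300    # USD per sample for the library prep kit

sanger_total = SANGER_COST_PER_REACTION * SANGER_REACTIONS
illumina_total = ILLUMINA_RUN_COST + ILLUMINA_LIBRARY_PREP

print(f"Sanger:   ${sanger_total} per individual")
print(f"Illumina: ${illumina_total} per individual")
print(f"Illumina / Sanger: {illumina_total / sanger_total:.2f}")
```

Without multiplexing, Illumina comes out roughly 20% more expensive per individual at these prices, which is close enough to call even given how approximate the inputs are.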

However, the Next Gen methods come out far ahead if you multiplex a group of individuals together in the same sequencing reaction. This is not possible with Sanger methods, since the sequence is read from the average of a large number of molecules. The question then becomes: how deeply can you multiplex while still producing enough reads from each individual research subject to achieve the depth of coverage needed? Our Illumina GAII currently produces about 2 million (usable) 35 bp reads per lane, but we are ramping up toward 5 million 50 bp reads with the latest upgrades.

2 M X 35 bp = 70 M bases
5 M X 50 bp = 250 M bases

So 250 kb X 100x coverage = 25 M bp needed per individual
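
To turn those numbers into a multiplexing depth, divide the bases a lane produces by the bases each individual needs. The sketch below does that with a hypothetical helper, `individuals_per_lane`. It assumes reads are spread evenly across the pooled individuals and that no bases are lost to barcodes or unusable reads, so treat the result as an upper bound.

```python
def individuals_per_lane(reads_per_lane, read_length_bp, target_bp, coverage):
    """Whole number of individuals one lane can cover at the requested depth.

    Assumes perfectly even read distribution across individuals
    (an idealization; real pools will be less uniform).
    """
    lane_bases = reads_per_lane * read_length_bp
    bases_per_individual = target_bp * coverage
    return lane_bases // bases_per_individual


TARGET_BP = 250_000  # roughly 250 kb of amplicons per individual
COVERAGE = 100       # roughly 100x, as required in the paper

current = individuals_per_lane(2_000_000, 35, TARGET_BP, COVERAGE)
upgraded = individuals_per_lane(5_000_000, 50, TARGET_BP, COVERAGE)

print(f"Current lane (2 M x 35 bp):  {current} individuals")
print(f"Upgraded lane (5 M x 50 bp): {upgraded} individuals")
```

With these inputs the current lane covers 2 individuals and the upgraded lane covers 10, so the higher multiplexing levels depend on the upgraded read counts or on lower coverage requirements.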

So it looks like the current generation of NG machines does have a cost advantage over Sanger methods if you include 8, 10, or 12X multiplexing. Improved accuracy and reduced sampling bias (through better sample prep methods) could bring down the coverage requirements and increase the advantage of NG methods.

I'd really like to hear some other opinions about this issue. We are writing several grant proposals for projects like these and I need some convincing arguments.