Mar 8, 2013

Genotyping on MiSeq with overlapping paired-end reads

Our Illumina MiSeq can now do 2 x 250 bp paired-end reads. We are going to use an amplicon method for a simple genotyping assay to identify very rare mutations at a single SNP site. The idea is to amplify a short fragment (less than 300-400 bp) so that the forward and reverse reads overlap by 50 or 100 bp, rather than by just a few bp as in some of our other assays (such as 16S metagenomics), and to put our target SNP in the center of the overlap region. In this way, the SNP will be located in a high-quality region of both reads, and we can get the maximum accuracy in the genotype call.
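As a rough check on the design, the overlap can be worked out from the read length and the amplicon length. The sketch below is only illustrative: the function names are made up for this post, and it assumes the amplicon length counts only the bases actually sequenced by Read 1 and Read 2 (no adapters) and that both reads run their full length.

```python
def overlap_length(read_len, amplicon_len):
    """Number of amplicon bases covered by both reads of a pair.

    Returns 0 when the two reads do not reach each other.
    """
    # A read cannot cover more bases than the amplicon contains.
    effective = min(read_len, amplicon_len)
    return max(0, 2 * effective - amplicon_len)


def overlap_interval(read_len, amplicon_len):
    """Overlap region as a 0-based, half-open (start, end) interval in
    amplicon coordinates, or None if the reads do not overlap."""
    effective = min(read_len, amplicon_len)
    start = amplicon_len - effective  # first base seen by the reverse read
    end = effective                   # one past the last base seen by the forward read
    if end <= start:
        return None
    return (start, end)


if __name__ == "__main__":
    for amplicon in (300, 400, 450):
        print(amplicon, overlap_length(250, amplicon), overlap_interval(250, amplicon))
```

With 2 x 250 bp reads, a 400 bp amplicon gives a 100 bp overlap spanning amplicon positions 150 to 249, so a SNP at the center of the overlap sits at the midpoint of the amplicon.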

One problem with using the MiSeq for amplicon assays is that it is very sensitive to the base composition at each cycle. Illumina tech support insists that a spike-in of 50% PhiX DNA is necessary for any amplicon assay. (Our Genomics lab tried it with 40% added PhiX and got poor results.) However, with the latest upgrades, the MiSeq is now producing 15 million reads per run, so even with the 50% PhiX spike-in, it still produces over 7 million usable (high-quality) reads.


In the past, we have not used Illumina to test for rare mutations (ultra-deep sequencing) because the overall error rate is in the 0.3-0.5% range. Even if we restrict results to bases >Q30 at our target, we cannot confidently detect mutations at frequencies below the error rate of about 1 per thousand, since that is the expected rate of false positives. With overlapping high-quality reads, we can calculate a new Q-score based on both reads (assuming that they agree on the variant call). I am still thinking about the proper math for this (a joint probability of error), but it is something similar to the sum of the two Q-scores (that is, the product of the two error frequencies). This would allow us to find mutations in the one per ten thousand (or even one per hundred thousand) range with a low false positive rate. The PANDAseq program uses similar math to calculate Q-scores for overlapping regions, so we are going to use that for the first pass on this data.
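To make the "sum of the two Q-scores" idea concrete, here is a minimal sketch of two ways to combine the scores when both reads agree on a base. It is not taken from PANDAseq or any other tool; the function names are invented for this post. The first version simply multiplies the two error probabilities. The second is a slightly more careful posterior that assumes sequencing errors in the two reads are independent, that all four bases are equally likely beforehand, and that an error is equally likely to produce any of the three wrong bases.

```python
import math


def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10.0)


def error_to_phred(p):
    """Convert an error probability to a Phred quality score."""
    return -10.0 * math.log10(p)


def joint_q_simple(q1, q2):
    """Naive combination: both reads wrong, treated as independent events.

    The product of the error probabilities equals the sum of the Q-scores.
    """
    return error_to_phred(phred_to_error(q1) * phred_to_error(q2))


def joint_q_posterior(q1, q2):
    """Posterior Q-score that the agreed base is wrong, given agreement.

    Assumes independent errors, uniform base priors, and errors spread
    evenly over the three incorrect bases (assumptions of this sketch).
    """
    p1, p2 = phred_to_error(q1), phred_to_error(q2)
    # Both reads wrong AND showing the same wrong base.
    both_wrong_same_base = p1 * p2 / 3.0
    # Both reads correct.
    both_right = (1.0 - p1) * (1.0 - p2)
    return error_to_phred(both_wrong_same_base / (both_right + both_wrong_same_base))


if __name__ == "__main__":
    for q1, q2 in [(30, 30), (30, 20), (35, 38)]:
        print(q1, q2, round(joint_q_simple(q1, q2), 1), round(joint_q_posterior(q1, q2), 1))
```

Under the simple model, two agreeing Q30 calls combine to Q60, or about one error per million. Both models cover sequencing errors only: a PCR error made before cluster generation is present in the template itself, so it appears in both reads and is not reduced by the overlap.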

21 comments:

Povilas said...

It might be a silly question, but what about quality, that drops down towards the end of the read?

Stuart Brown said...

That's why we want to use long reads on a short amplicon, so they overlap a lot and we don't have to use the sequence at the ends of the read. I will post some overlapped and joined reads with adjusted Q-scores when we get them.

Anonymous said...

"Illumina tech support insists that a spike-in of 50% Phix DNA is necessary for any amplicon assay"

Where did you read this information? Is this statement true only for Nextera or every amplicon assay?

Stuart Brown said...

This was told to me directly on the phone by two different Illumina support specialists (MiSeq operations and Sample Prep).

James@cancer said...

PhiX is needed no more!
Illumina's latest RTA works equally well on a single amplicon with no PhiX as 20 amplicons and 50% PhiX. Our first test using the new RTA is running now!

Faevae said...

Our MiSeq passed this test running MCS 2.2 and RTA 1.17.
Very good quality for low-diversity libraries with only 2% PhiX in a 2x151 PE run.
Hurray for Illumina.
Unfortunately they do not plan to introduce the new RTA algorithm into the HiSeq RTA. So a control lane remains necessary...

Scott Yourstone said...

Great post! I would be very interested in what you come up with to update quality scores based on the overlapping reads.

We have been working on some of these exact problems with our MiSeq 16S amplicon sequencing. We developed a method of frameshifted primers that requires no PhiX, and it should be extendable to the HiSeq platform. We also have a new method which we call molecule tagging that can reduce amplicon error rates down to ~0.42 ept.

I am working on a website that explains these processes in greater detail and provides the necessary bioinformatic tools (https://sites.google.com/site/moleculetagtoolbox/description). It is still a work in progress and will be until our paper gets accepted (which should be very soon).

Stuart Brown said...

Thanks for that comment Scott. I will look at the website. We are running the MiSeq with the long overlaps this week. I will have data on quality scores next week.

magda assem said...

We had a high AR result for the FLT-ITD mutation in AML. Shall we proceed with direct sequencing or NGS, and will your technique with large overlaps be better for clarifying the sequence?

Anonymous said...

Hi Scott,

Is your molecule tagging like the Veritag method from Population Genetics (http://www.populationgenetics.com/technology/veritag/) or the SAFE-Seq method (http://www.pnas.org/content/108/23/9530) or the Duplex Sequencing approach (http://www.ncbi.nlm.nih.gov/pubmed/22853953)? It would be nice to see one of these ported to the MiSeq.

Scott Yourstone said...

Thanks for contributing those links! I had not seen them before. Yes, we are essentially doing the SafeSeq method on the MiSeq. It seems like the duplex tagging is a great idea too.

Anonymous said...

Dr. Brown:
You said, "Our Genomics lab tried it with 40% added Phix and got poor results." Would you give a little detail on why you think the results were poor?

Dr. Brendan Hodkinson said...

We just did a MiSeq 2x150 amplicon run with the new software and an extremely small PhiX spike-in and it has effectively doubled our yield! It's great to see the technology continually improving; we're now getting almost 10 times as many reads from each 2x150 MiSeq run as we did just a few months back.

Genohub said...

Hi Scott,
Curious to hear about what % PhiX you are using now for Amplicon-Seq. Have you tried a staggered primer approach?

Scott Yourstone said...

Hi Genohub,
Yes, we use the staggered primer approach, and because of that we don't need any PhiX. However, I have heard that with the most recent MiSeq upgrades only ~2% PhiX is needed to get high quality sequences.

Derek Lundberg said...

Hi,
The paper by Scott and me describing MiSeq paired-end sequencing methods, with frameshifting primers for great quality without PhiX and molecule tags for correcting errors/bias, will be coming out online in Nature Methods on Sept. 1st. Look for this, and also check out a paper by Jeremiah Faith / Jeff Gordon which describes some very similar techniques (they call theirs LEA-seq) and was published this July in the journal Science. Scott wrote some user-friendly software (with an optional GUI) that can process paired-end reads from either of these and related techniques.

Anonymous said...

Seems interesting-
PEAR: A fast and accurate Illumina Paired-End reAd mergeR

http://bioinformatics.oxfordjournals.org/content/early/2013/10/18/bioinformatics.btt593.short

Anonymous said...

Do you have experience in SNP discovery using MiSeq with polyploid organisms?

Yev Marusenko said...

Hi Scott Yourstone,

Since you use the staggered primer approach, how do you design the Index sequencing primer (and reverse complement Read 2), since there is a variable number of bases between the reverse primer and the linker?

Thanks for any insight.

Scott Yourstone said...

Hi Yev,

The frameshifted molecule-tagged primers include an Illumina adapter (which I think is the same thing as the index primer). So when those primers are designed they look something like:

[Illumina index primer] [molecule tag with frameshifts] [spacer] [target primer]

After all PCR steps the index can be added using the index primer.