Mar 8, 2013

Genotyping on MiSeq with overlapping paired-end reads

Our Illumina MiSeq can now do 2 x 250 bp paired-end reads. We are going to use an amplicon method for a simple genotyping assay to identify very rare mutations at a single SNP site. The idea is to amplify a short fragment (less than 300-400 bp) so that the forward and reverse reads overlap by 50 or 100 bp, not just by a few bp as in some of our other assays (such as 16S metagenomics), and to put our target SNP in the center of the overlap region. In this way, the SNP will be located in a high-quality region of both reads, and we can get the maximum accuracy in the genotype call.
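To make the overlap geometry concrete, here is a rough sketch of the arithmetic. It assumes both reads run their full length with no primer or quality trimming, and the function names (`overlap_bp`, `overlap_window`) are just made up for illustration.

```python
def overlap_bp(amplicon_len, read_len=250):
    """Bases covered by both reads of a pair.

    Assumes untrimmed reads of read_len starting at opposite ends of the
    amplicon. If the amplicon is shorter than one read, the whole amplicon
    is covered twice.
    """
    return max(0, min(amplicon_len, 2 * read_len - amplicon_len))


def overlap_window(amplicon_len, read_len=250):
    """0-based, half-open (start, end) of the overlap in amplicon coordinates.

    Returns None when the reads do not overlap at all.
    """
    start = max(0, amplicon_len - read_len)  # where the reverse read ends
    end = min(amplicon_len, read_len)        # where the forward read ends
    return (start, end) if start < end else None


if __name__ == "__main__":
    for length in (300, 400, 450):
        print(length, overlap_bp(length), overlap_window(length))
```

Under these assumptions the overlap window is always centered on the middle of the amplicon, so placing the SNP at the amplicon midpoint also places it at the center of the overlap.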

One problem with using the MiSeq for amplicon assays is that it is very sensitive to the base composition at each cycle. Illumina tech support insists that a spike-in of 50% PhiX DNA is necessary for any amplicon assay. (Our Genomics lab tried it with 40% added PhiX and got poor results.) However, with the latest upgrades, the MiSeq is now producing 15 million reads per run, so even with the 50% PhiX spike-in, it still produces over 7 million usable (high-quality) reads.

In the past, we have not used Illumina to test for rare mutations (ultra-deep sequencing) because the overall error rate is in the 0.3-0.5% range. Even if we restrict results to bases >Q30 at our target, we can't reliably detect mutations at frequencies below the error frequency of about 1 per thousand, since that would be the expected rate of false positives. With overlapping high-quality reads, we can calculate a new Q-score based on both reads (assuming that they both agree on a variant call). I am still thinking about the proper math for this (a joint probability of error), but it is something similar to the sum of the two Q-scores (the product of the two error frequencies, if the errors in the two reads are independent). This would allow us to find mutations in the one per ten thousand (or even one per hundred thousand) range with a low false positive rate. The PANDAseq program uses similar math to calculate Q-scores for overlapping regions, so we are going to use that for the first pass on this data.
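Here is a minimal sketch of the "sum of Q-scores" idea, using the standard Phred definition Q = -10 * log10(p). It assumes the errors in the two reads are independent, which is my working assumption and not something I have verified for MiSeq data. It is not PANDAseq's actual scoring code, and the function names are made up for illustration.

```python
import math


def phred_to_prob(q):
    """Error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)


def prob_to_phred(p):
    """Phred quality score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def combined_q(q_fwd, q_rev):
    """Approximate Q-score when both reads agree on the base call.

    Assumption: read errors are independent, so the chance that both
    calls are wrong is the product of the two error probabilities.
    In Phred space this is just the sum of the two Q-scores.
    """
    return prob_to_phred(phred_to_prob(q_fwd) * phred_to_prob(q_rev))


def expected_false_calls(n_pairs, q):
    """Expected number of erroneous calls at one site among n_pairs read pairs."""
    return n_pairs * phred_to_prob(q)


if __name__ == "__main__":
    q = combined_q(30, 30)
    print("combined Q:", round(q, 1))
    print("false calls per 100,000 pairs at Q30:", expected_false_calls(100_000, 30))
    print("false calls per 100,000 pairs combined:", expected_false_calls(100_000, q))
```

Under this approximation, two agreeing Q30 calls combine to about Q60, i.e. roughly one expected error per million read pairs instead of one per thousand. A more careful treatment would also account for the fact that two wrong calls only look like a variant if they agree on the same wrong base, and that some errors (PCR errors in particular) are shared by both reads and so are not independent at all.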