Our Illumina MiSeq can now do 2 x 250 bp paired-end reads. We are going to use an amplicon method for a simple genotyping assay to identify very rare mutations at a single SNP site. The idea is to amplify a short fragment (less than 300-400 bp) so that the forward and reverse reads overlap by 50 or 100 bp, rather than by just a few bp as in some of our other assays (such as 16S metagenomics), and to place the target SNP in the center of the overlap region. That way the SNP sits in a high-quality region of both reads, and we can get the maximum accuracy in the genotype call.
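To make the overlap geometry concrete, here is a minimal sketch of the arithmetic. It assumes that each read starts exactly at one end of the amplicon and runs for the full 250 cycles; primers, adapters and quality trimming are ignored. The function names are hypothetical, just for illustration.

```python
def overlap_region(amplicon_len, read_len=250):
    """Return the (start, end) span covered by BOTH reads, in 0-based,
    half-open forward-strand coordinates, or None if the reads do not overlap.

    Assumption: the forward read starts at position 0 and the reverse read
    starts at the last base of the amplicon.
    """
    fwd_end = min(read_len, amplicon_len)        # forward read covers [0, fwd_end)
    rev_start = max(0, amplicon_len - read_len)  # reverse read covers [rev_start, amplicon_len)
    if rev_start >= fwd_end:
        return None
    return rev_start, fwd_end


def overlap_length(amplicon_len, read_len=250):
    """Number of bases covered by both reads (0 if they do not overlap)."""
    region = overlap_region(amplicon_len, read_len)
    return 0 if region is None else region[1] - region[0]


if __name__ == "__main__":
    for amplicon_len in (300, 400, 450, 500):
        print(amplicon_len, overlap_region(amplicon_len), overlap_length(amplicon_len))
```

Under these assumptions a 400 bp amplicon gives a 100 bp overlap and a 450 bp amplicon gives 50 bp, with the center of the overlap always falling at the center of the amplicon.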
One problem with using the MiSeq for amplicon assays is that
the instrument is very sensitive to the base composition at each cycle.
Illumina tech support insists that a spike-in of 50% PhiX DNA is
necessary for any amplicon assay. (Our Genomics lab tried it with 40%
added PhiX and got poor results.) However, with the latest upgrades,
the MiSeq is now producing 15 million reads per run, so even with the 50%
PhiX spike-in, it still produces over 7 million usable (high-quality) reads.
In the past, we have not used Illumina to test for
rare mutations (ultra-deep sequencing) because the overall error rate is
in the 0.3-0.5% range. Even if we restrict results to bases >Q30 at
our target, we can't reliably detect mutations at frequencies below the
error rate of about 1 per thousand, since that is the expected
rate of false positives. With overlapping high-quality reads, we can
calculate a new Q-score based on both reads (assuming that they both
agree on a variant call). I am still thinking about the proper math for
this (a joint probability of error), but it is something similar to the
sum of the two Q-scores (the product of the two error probabilities). This
would allow us to find mutations in the one per ten thousand (or even
one per hundred thousand) range with a low false positive rate. The PANDAseq
program uses similar math to calculate Q-scores for overlapping
regions, so we are going to use that for the first pass on this data.
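Here is a rough sketch of the joint error calculation as I currently understand it. It is not PANDAseq's actual formula, and it rests on two assumptions: sequencing errors in the two reads are independent, and an erroneous call is equally likely to be any of the three wrong bases. It also only covers sequencing error; a PCR error in the template would show up in both reads and is not modeled here. The function names are hypothetical.

```python
import math


def phred_to_prob(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def prob_to_phred(p):
    """Convert an error probability to a Phred quality score."""
    return -10 * math.log10(p)


def joint_q_naive(q_fwd, q_rev):
    """Q-score for 'both reads are wrong', assuming independent errors.

    The error probabilities multiply, so the Q-scores simply add.
    """
    return prob_to_phred(phred_to_prob(q_fwd) * phred_to_prob(q_rev))


def joint_q_same_base(q_fwd, q_rev):
    """Q-score for 'both reads are wrong AND report the same wrong base'.

    Assumption: each error lands on one of the 3 wrong bases with equal
    probability, so the chance of agreeing on a wrong base is p1 * p2 / 3.
    """
    p = phred_to_prob(q_fwd) * phred_to_prob(q_rev) / 3
    return prob_to_phred(p)


def expected_false_positives(depth, error_prob):
    """Expected number of erroneous calls at one site for a given depth."""
    return depth * error_prob


if __name__ == "__main__":
    depth = 7_500_000  # 15 million reads with a 50% PhiX spike-in
    single = phred_to_prob(30)
    paired = phred_to_prob(joint_q_naive(30, 30))
    print("single read at Q30:", expected_false_positives(depth, single))
    print("two agreeing reads at Q30:", expected_false_positives(depth, paired))
```

With two agreeing Q30 calls the naive joint score is Q60, or about one error per million, which is what would put mutation frequencies of one per ten thousand within reach if the independence assumption holds.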