Jun 28, 2011

The False Discovery of Mutations by Sequencing

     I am amazed by the success reported in recent papers that find mutations by Next-Gen Sequencing (NGS) in rare genetic diseases and cancer. In our lab, the sequence data for SNPs is messy and difficult to interpret. The basic problem is that NGS data, particularly Illumina data in our case, contains a moderate level of sequencing errors. We get somewhere between 0.5% and 1% errors in our sequence reads from the GAII and HiSeq machines. This is not bad for many practical purposes (ChIPseq and RNAseq experiments have no trouble with this data), and this error level "is within specified operating parameters" according to Illumina Tech support. The errors are not random: they occur much more frequently at the ends of long (100 bp) reads. Some types of errors are systematic in all Illumina sequencing (A>T miscalls are most common), and other types of errors are common to a particular sample, run, or lane of sequence data. Also, when you are screening billions of bases looking for mutations, rare overlaps of errors will occur.
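To get a feel for how often errors can pile up at one position, here is a rough back-of-the-envelope sketch. It assumes errors are independent between reads and split evenly among the three wrong bases. Both are simplifications, since the errors described above are not random, so treat the result as a lower bound. The function names are made up for illustration.

```python
from math import comb


def prob_at_least_k_same_error(n_reads: int, k: int, per_base_error: float) -> float:
    """Binomial tail: chance that at least k of n_reads show the same wrong base.

    Assumption: errors are independent and each of the 3 wrong bases is
    equally likely, so one specific wrong base appears with probability
    per_base_error / 3 in each read.
    """
    p = per_base_error / 3.0
    below_k = sum(comb(n_reads, i) * p**i * (1 - p) ** (n_reads - i) for i in range(k))
    return 1.0 - below_k


def expected_false_positions(n_positions: int, n_reads: int, k: int, per_base_error: float) -> float:
    """Expected number of positions where k or more reads agree on one specific error."""
    return n_positions * prob_at_least_k_same_error(n_reads, k, per_base_error)
```

For example, `expected_false_positions(3_000_000_000, 30, 3, 0.01)` estimates how many positions in a genome-sized screen at 30x coverage would show three matching miscalls by chance alone under these assumptions.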
     So if sequence data contains errors, and the point of your experiment is to find mutations, then when you find a difference between your data and the reference genome (a variant), you had better make doubly sure that the difference is real. There is a lot of software designed to separate real mutations (SNPs) from sequencing errors. The basic idea is to first filter out bad, low quality bases using the built-in quality scores produced by the sequencer. Second, require that multiple reads show the same variant, and that the fraction of reads showing the variant makes sense in your experiment: 40-60% might be good for a heterozygous allele in a human germline sample, while 10% or less might make sense if you are screening for a rare variant in a sample from a mixed population of cells. Also, it is usually wise to filter out all common SNPs in the dbSNP database: we assume that these are not cancer causing, and they have a high likelihood of being present in healthy germline cells as well as tumor cells.
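The first two filtering steps can be sketched in a few lines of Python. This is only an illustration of the idea, not what CASAVA, MAQ, SAMtools, or GATK actually do. The `BaseCall` class, the function names, and the default thresholds (Phred quality 20, at least 3 supporting reads) are assumptions chosen for the example, and the dbSNP lookup is left out.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class BaseCall:
    base: str     # base observed in one read at the position of interest
    quality: int  # Phred-scaled base quality reported by the sequencer


def variant_support(calls: List[BaseCall], alt_base: str,
                    min_quality: int = 20) -> Tuple[int, int, float]:
    """Step 1: drop low quality bases, then count reads supporting alt_base."""
    good = [c for c in calls if c.quality >= min_quality]
    depth = len(good)
    alt_reads = sum(1 for c in good if c.base == alt_base)
    fraction = alt_reads / depth if depth else 0.0
    return alt_reads, depth, fraction


def passes_filters(calls: List[BaseCall], alt_base: str, min_quality: int = 20,
                   min_alt_reads: int = 3, min_fraction: float = 0.40,
                   max_fraction: float = 0.60) -> bool:
    """Step 2: require several supporting reads and a plausible allele fraction.

    The default 40-60% window matches a heterozygous germline allele; for a
    rare variant in a mixed cell population, pass a lower window instead.
    """
    alt_reads, _, fraction = variant_support(calls, alt_base, min_quality)
    return alt_reads >= min_alt_reads and min_fraction <= fraction <= max_fraction
```

Note that variant reads with poor quality scores simply vanish from the count, so a position supported only by low quality bases never passes.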
     We have used the SNP calling tools in the Illumina CASAVA software, the MAQ software package, similar tools in SAMtools, and recently the GATK toolkit. In all cases, it is possible to tweak parameters to get a stringent set of predicted mutations, filtering out low quality bases, low frequency mutations, and SNPs that are near other types of genomic problems such as insertion/deletion sites, repetitive sequence, etc. Using their own tools, Illumina has published data showing a false positive detection rate of 2.89% (Illumina FP Rate). Under many experimental designs, validating 97% of your predicted mutations would be excellent.
     Unfortunately, our medical scientists don't want predicted SNPs vs. an arbitrary reference genome. They want to find mutations in cancer cells vs. the normal cells (germline or wild type) of the same patient. This is where all the tools seem to fall apart. When we run the same SNP detection tools on two NGS samples, and then look for the mutations that are unique to the tumor vs. the wild type (WT), we get a list of garbage, thousands of lines long. We get stupid positions with a 21% variant allele detected in tumor and 19% variant in WT. Or we get positions where a variant at 80% allele frequency is not called as a SNP in WT because 2 out of 80 reads have a one base deletion near that base. So the stringent settings on our SNP discovery software create FALSE NEGATIVES where we miss real SNPs in the WT genome, which then show up as tumor-specific mutations in our SNP discovery pipeline.
     Zuojian Tang is creating a post-SNP data filter that imposes a sanity check on the data based on allele frequencies. We are trying out various parameters, but something like a minimum of 40% variant in the tumor and less than 5% variant in the WT narrows the list of tumor-specific mutations down to a manageable number that could be validated by PCR or Sequenom.
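A minimal sketch of this kind of allele-frequency sanity check is shown below. It is not Zuojian's actual filter. The function names are hypothetical, the 40% and 5% cutoffs are the example values mentioned above, and the minimum depth of 20 reads in each sample is an extra assumption added so that fractions from very thin coverage are not trusted.

```python
def allele_fraction(alt_reads: int, total_reads: int) -> float:
    """Fraction of reads at a position that carry the variant allele."""
    return alt_reads / total_reads if total_reads else 0.0


def is_tumor_specific(tumor_alt: int, tumor_depth: int,
                      wt_alt: int, wt_depth: int,
                      min_tumor_fraction: float = 0.40,
                      max_wt_fraction: float = 0.05,
                      min_depth: int = 20) -> bool:
    """Keep a position only if the variant is frequent in tumor and nearly absent in WT.

    The comparison uses raw read counts from both samples, so a variant that
    the SNP caller skipped in WT still counts against the position.
    """
    if tumor_depth < min_depth or wt_depth < min_depth:
        return False  # assumption: too few reads to trust either fraction
    tumor_fraction = allele_fraction(tumor_alt, tumor_depth)
    wt_fraction = allele_fraction(wt_alt, wt_depth)
    return tumor_fraction >= min_tumor_fraction and wt_fraction < max_wt_fraction
```

With these defaults, a position at 21% in tumor and 19% in WT (`is_tumor_specific(21, 100, 19, 100)`) is rejected, and so is one with 64 of 80 variant reads in WT. A position with 45 of 100 variant reads in tumor and 1 of 100 in WT is kept.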

5 comments:

Douglas said...

Hi Stuart,

One thing is not clear to me: if you want to look for differences between the cancer genome(s) and the normal genome, it is at least conceptually valid to compare them separately to a common reference and then look for differences.

-Douglas

Stuart Brown said...

The point I was trying to make is as follows: if we compare both tumor and germline to a common reference genome, and focus on mutations found only in tumor, we get mostly false positives. When we look at the raw reads, almost all of these tumor-only mutations are present in germline, but not called by the SNP software.

Douglas said...

Thank you for the clarification. Do you think the root cause is insensitivity of the SNP calling algorithm or lack of overall sufficient coverage?

Stuart Brown said...

The exome data I was working with had very good coverage (nearly 100x), so that is not the main culprit. As I see it, the problem is created by very stringent SNP calling methods. In our effort to avoid false positives, we throw away questionable SNPs - some of which are real. So in any pair of samples, some SNPs will be called in one, but excluded in the other.

Fabien Campagne said...

Hi Stuart,

We have had similar experiences and find that default stringent variant calling filters can seriously increase the rate of false negatives. We prefer approaches that compare allele counts directly across groups of samples. We have implemented such an approach in Goby. This tool also provides a very fast algorithm for local read realignment around indels, which avoids alignment artifacts, yet does not filter out SNPs just because they are close to an indel. You can find the tool and various tutorials here: http://goby.campagnelab.org
The tool that compares allele frequencies or base counts for different alleles across groups of samples is called discover-sequence-variants.