I am amazed by the success reported in recent papers that find mutations by Next-Gen Sequencing (NGS) in rare genetic diseases and cancer. In our lab, the sequence data for SNPs is messy and difficult to interpret. The basic problem is that NGS data, particularly Illumina data in our case, contains a moderate level of sequencing errors. We get somewhere between 0.5% and 1% errors in our sequence reads from the GAII and HiSeq machines. This is not bad for many practical purposes (ChIPseq and RNAseq experiments have no trouble with this data), and this error level "is within specified operating parameters" according to Illumina tech support. The errors are not random: they occur much more frequently at the ends of long (100 bp) reads. Some types of errors are systematic in all Illumina sequencing (A>T miscalls are most common), and other types of errors are common to a particular sample, run, or lane of sequence data. Also, when you are screening billions of bases looking for mutations, positions where several erroneous reads happen to overlap will turn up by chance, however rare each such overlap is.
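To put a rough number on that last point, here is a back-of-the-envelope sketch. It assumes errors are independent and uniform across reads (a simple binomial model), which the paragraph above says is not actually true, so treat the result as a lower bound. The function names, the 30x depth, and the 3-read threshold are illustrative assumptions, not values from our pipeline.

```python
from math import comb


def prob_at_least_k_errors(depth: int, error_rate: float, k: int) -> float:
    """Probability that at least k of `depth` reads carry an error at one position.

    Assumes independent errors at a fixed per-base rate (binomial model).
    """
    p_fewer_than_k = sum(
        comb(depth, i) * error_rate**i * (1 - error_rate) ** (depth - i)
        for i in range(k)
    )
    return 1.0 - p_fewer_than_k


def expected_error_pileups(genome_size: int, depth: int, error_rate: float, k: int) -> float:
    """Expected number of positions with at least k erroneous reads."""
    return genome_size * prob_at_least_k_errors(depth, error_rate, k)


if __name__ == "__main__":
    # Hypothetical scenario: 3 billion bases, 30x coverage, 1% error rate,
    # counting positions where 3 or more reads contain an error.
    print(expected_error_pileups(3_000_000_000, 30, 0.01, 3))
```

Even under this optimistic model, a human-sized genome at these settings yields well over a million positions with three or more erroneous reads. Only some of those will agree on the same wrong base, but systematic errors push the count in the other direction.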
So if sequence data contains errors, and the point of your experiment is to find mutations, then when you find a difference between your data and the reference genome (a variant), you had better make doubly sure that the difference is real. There is a lot of software designed to separate real mutations (SNPs) from sequencing errors. The basic idea is first to filter out bad, low-quality bases using the built-in quality scores produced by the sequencer. Second, require that multiple reads show the same variant, and that the fraction of reads showing the variant makes sense in your experiment: 40-60% might be good for a heterozygous allele in a human germline sample, while 10% or less might make sense if you are screening for a rare variant in a sample from a mixed population of cells. Also, it is usually wise to filter out all common SNPs in the dbSNP database: we assume that these are not cancer causing, and they have a high likelihood of being present in healthy germline cells as well as tumor cells.
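The filtering steps just described can be sketched for a single genomic position as follows. This is a minimal illustration, not how CASAVA, MAQ, SAMtools, or GATK work internally. The function names, the input layout (a list of `(base, quality)` pairs per position), and the default thresholds for base quality and read count are all assumptions made for the example; only the 40-60% heterozygous range comes from the text above.

```python
from collections import Counter
from typing import Iterable, List, Optional, Set, Tuple


def candidate_variant(
    ref_base: str,
    observations: Iterable[Tuple[str, int]],
    min_base_quality: int = 20,   # assumed cutoff, tune per experiment
    min_variant_reads: int = 3,   # assumed cutoff, tune per experiment
    min_fraction: float = 0.40,   # heterozygous germline range from the text
    max_fraction: float = 0.60,
) -> Optional[Tuple[str, int, float]]:
    """Return (alt_base, alt_reads, alt_fraction) if the site passes, else None."""
    # Step 1: drop low-quality base calls.
    good = [base for base, qual in observations if qual >= min_base_quality]
    if not good:
        return None

    # Step 2: require several reads that agree on the same non-reference base.
    alt_counts = Counter(base for base in good if base != ref_base)
    if not alt_counts:
        return None
    alt_base, alt_reads = alt_counts.most_common(1)[0]
    if alt_reads < min_variant_reads:
        return None

    # Step 3: require an allele fraction that makes sense for the experiment.
    fraction = alt_reads / len(good)
    if not (min_fraction <= fraction <= max_fraction):
        return None
    return alt_base, alt_reads, fraction


def drop_known_snps(
    sites: Iterable[Tuple[str, int]], known_sites: Set[Tuple[str, int]]
) -> List[Tuple[str, int]]:
    """Remove (chromosome, position) sites already listed as common SNPs."""
    return [site for site in sites if site not in known_sites]
```

For a rare-variant screen in a mixed cell population, you would instead call it with something like `min_fraction=0.0, max_fraction=0.10`. The set of known sites would be loaded from dbSNP, which is not shown here.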
We have used the SNP calling tools in the Illumina CASAVA software, the MAQ software package, similar tools in SAMtools, and recently the GATK toolkit. In all cases, it is possible to tweak parameters to get a stringent set of predicted mutations by filtering out low-quality bases, low-frequency mutations, and SNPs that are near other types of genomic problems such as insertion/deletion sites, repetitive sequence, etc. Using their own tools, Illumina has published data showing a false positive detection rate of 2.89% (Illumina FP Rate). Under many experimental designs, validating 97% of your predicted mutations would be excellent.
Unfortunately, our medical scientists don't want predicted SNPs vs. an arbitrary reference genome. They want to find mutations in cancer cells vs. the normal cells (germline or wild type) of the same patient. This is where all the tools seem to fall apart. When we run the same SNP detection tools on two NGS samples, and then look for the mutations that are unique to the tumor vs. the wild type (WT), we get a list of garbage, thousands of lines long. We get stupid positions with a 21% variant allele detected in tumor and 19% variant in WT. Or we get positions where a variant at 80% allele frequency is not called as a SNP in WT because 2 out of 80 reads have a one-base deletion near that base. So the stringent settings on our SNP discovery software create FALSE NEGATIVES where we miss real SNPs in the WT genome, which then show up as tumor-specific mutations in our SNP discovery pipeline.
Zuojian Tang is creating a post-SNP data filter that imposes a sanity check on the data based on allele frequencies. We are trying out various parameters, but something like a minimum of 40% variant in the tumor and less than 5% variant in the WT narrows the list of tumor-specific mutations down to a manageable number that could be validated by PCR or Sequenom.
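As a rough sketch of what such a sanity check could look like (this is not Zuojian's actual filter), the function below compares raw variant allele fractions in the two samples directly, instead of trusting each sample's independent SNP call. The function name, the argument layout, and the `min_depth` guard are assumptions for illustration; the 40% and 5% thresholds are the example values mentioned above.

```python
def is_tumor_specific(
    tumor_variant_reads: int,
    tumor_depth: int,
    wt_variant_reads: int,
    wt_depth: int,
    min_tumor_fraction: float = 0.40,  # example threshold from the text
    max_wt_fraction: float = 0.05,     # example threshold from the text
    min_depth: int = 10,               # assumed guard against thin coverage
) -> bool:
    """Keep a site only if the variant is frequent in tumor and near absent in WT."""
    # Without enough reads in BOTH samples, an allele fraction means little,
    # and a low WT fraction may just reflect missing data.
    if tumor_depth < min_depth or wt_depth < min_depth:
        return False
    tumor_fraction = tumor_variant_reads / tumor_depth
    wt_fraction = wt_variant_reads / wt_depth
    return tumor_fraction >= min_tumor_fraction and wt_fraction < max_wt_fraction


if __name__ == "__main__":
    # 21% in tumor vs. 19% in WT: rejected.
    print(is_tumor_specific(21, 100, 19, 100))
    # 80% in both samples, even if the WT site was never "called": rejected.
    print(is_tumor_specific(80, 100, 64, 80))
    # 45% in tumor, about 1% in WT: kept for validation.
    print(is_tumor_specific(45, 100, 1, 90))
```

Working from read counts rather than from called SNPs is the design choice that matters here. A WT site that the caller rejected for an unrelated reason, such as a nearby deletion, still contributes its allele fraction to the comparison.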