Two interesting projects came through our informatics group last week, both in the 'data drop' mode, where the investigator asks for help analyzing data as it comes off the sequencers. I have noted many times before that our informatics effort is much greater on poorly designed and failed experiments.
Experiment #1 was a seemingly standard SNP detection project using exome sequences with 100 bp paired-end reads on Illumina HiSeq (Agilent SureSelect capture), the entire thing done by a private sequencing contractor. The contractor also supplied SNP calls using Illumina CASAVA software. Our job was simply to find overlaps between the SNP calls for various samples and controls, and to annotate the SNPs with genomic information (coding or non-coding, conservative mutations, biological pathways, etc.). However, we have an obsession with QC data, which the vendor was very reluctant to supply. It turns out that these sequencing reads have a 1.5% error rate, while our internal sequencing lab generates 0.5% error. We also see 10K novel SNPs in each sample with only minimal overlap across samples (a red flag for me). Once more QC data was extracted from the vendor, we saw a steep increase in error at the ends of reads. So we wish to trim all reads down by 10-25% and re-call SNPs, which means extracting more files from the vendor 3x (Illumina requires a LOT of runtime and intermediate files in order to run CASAVA for SNP calling).
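For anyone who wants to run the same sanity check, here is a minimal Python sketch of the cross-sample overlap comparison. It is not the code we ran; the function name `pairwise_snp_overlap`, the `(chrom, pos, alt)` tuple representation, and the toy calls are all assumptions made up for illustration.

```python
from itertools import combinations

def pairwise_snp_overlap(calls):
    """Jaccard overlap of SNP call sets for every pair of samples.

    calls: dict mapping sample name -> set of (chrom, pos, alt) tuples.
    Returns a dict mapping (sample_a, sample_b) -> shared / union.
    """
    overlaps = {}
    for a, b in combinations(sorted(calls), 2):
        shared = calls[a] & calls[b]
        union = calls[a] | calls[b]
        overlaps[(a, b)] = len(shared) / len(union) if union else 0.0
    return overlaps

# Made-up toy calls, NOT data from this project.
toy_calls = {
    "sample_A": {("chr1", 1000, "A"), ("chr2", 500, "T")},
    "sample_B": {("chr1", 1000, "A"), ("chr3", 42, "G")},
    "control": {("chr2", 500, "T")},
}

for pair, jaccard in pairwise_snp_overlap(toy_calls).items():
    print(pair, round(jaccard, 2))
```

Thousands of novel calls per sample with pairwise overlaps near zero is the kind of pattern that made me suspicious here, since real variants in related samples tend to be shared far more often than sequencing errors are.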
Meanwhile, Experiment #2 is an RNAseq project where the investigator is interested in alternative splicing. We analyzed one earlier data set with 50 bp reads with only moderate success. It seems that very deep coverage is needed to get valid data for alt-splicing, especially when levels of a poorly expressed isoform are suspected to change by a small amount due to biological treatment. The investigator saw some published results suggesting that paired-end RNAseq data would provide more information about splicing isoforms. So, WITHOUT a bioinformatics consult, they sent an existing sample (created for 50 bp single-end sequencing) to the lab for 100 bp paired-end sequencing. This data came out of our pipeline with more than 20% error and a strange mix of incorrectly oriented read pairs (facing outward rather than inward). After a few days of head scratching and escalating levels of Illumina bioinformatics tech support, we have an explanation. A 225 bp library fragment contains 130 bp of primers and adapters. Thus the insert has an average size of about 95 bp. Some are shorter! Thus, our 100-cycle reads go off the far end of most sequences, adding 5 or more bases of adapter sequence where the alignment software is expecting genomic sequence. In addition, the paired ends overlap by more than 100%, so the start of one read is inside the end of the other. Thus they map in the opposite orientation, with an insert size of 5-10 bp. Our best effort to analyze this data will involve chopping all reads back to 36 bp and repeating the paired-end analysis. So that was 3 days of bioinformatics analysis time not so well spent on forensic QC.
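The arithmetic behind that explanation is simple enough to check before a library ever goes on the machine. A small sketch in Python, using the numbers above; the function name `readthrough` is hypothetical, and it assumes a single fixed fragment length rather than a real size distribution.

```python
def readthrough(fragment_len, adapter_len, read_len):
    """Return (insert_len, genomic_bases, adapter_bases) for one read.

    fragment_len: total library fragment size, adapters included
    adapter_len:  combined length of primers and adapters
    read_len:     number of sequencing cycles per read
    """
    insert_len = fragment_len - adapter_len
    # Any cycles beyond the end of the insert read into the adapter.
    adapter_bases = max(0, read_len - insert_len)
    genomic_bases = read_len - adapter_bases
    return insert_len, genomic_bases, adapter_bases

# 225 bp fragment with 130 bp of primers/adapters, 100-cycle reads
print(readthrough(225, 130, 100))  # (95, 95, 5)

# Same library with reads chopped back to 36 bp
print(readthrough(225, 130, 36))   # (95, 36, 0)
```

With a 95 bp insert, each 100-cycle read picks up 5 adapter bases, and more for every insert shorter than average. At 36 bp the reads stay inside the insert with room to spare.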
Now we are looking back to Experiment #1 and wondering about insert sizes in that library. What if that library's insert size was about 110 or 120 bp (perhaps with a sizeable tail of much smaller fragments), and a fraction of the reads also run off into the adapter, adding mismatched bases at the ends of alignments, and thus jacking up the overall error rate?
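To get a feel for how much a tail of short fragments could matter, here is a rough Python sketch. The insert sizes below are invented for illustration (we do not have the real distribution for that library), and the function name `adapter_contamination` is hypothetical.

```python
def adapter_contamination(insert_sizes, read_len):
    """Return (fraction of reads hitting adapter, fraction of bases that are adapter).

    insert_sizes: iterable of insert lengths in bp, one per read
    read_len:     number of sequencing cycles per read
    """
    sizes = list(insert_sizes)
    # Adapter bases in each read: the cycles left over past the end of the insert.
    adapter_per_read = [max(0, read_len - size) for size in sizes]
    reads_hit = sum(1 for n in adapter_per_read if n > 0) / len(sizes)
    base_fraction = sum(adapter_per_read) / (len(sizes) * read_len)
    return reads_hit, base_fraction

# Invented insert sizes: mostly 110-150 bp, with one short fragment in the tail.
toy_inserts = [80, 100, 120, 150]
reads_hit, base_fraction = adapter_contamination(toy_inserts, read_len=100)
print(f"{reads_hit:.0%} of reads reach the adapter; "
      f"{base_fraction:.1%} of all bases are adapter")
```

Treat the base fraction as a rough upper bound on the extra error, since some adapter bases will match the reference by chance and some aligners clip read ends instead of counting mismatches.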
Two conclusions: 1) talk to bioinformatics BEFORE you build your sequencing libraries
2) if you want something done right, do it yourself.