Jun 6, 2011

Involve Bioinformatics in design of every experiment... please

Two interesting projects came through our informatics group last week, both in the 'data drop' mode where the investigator asks for help analyzing data as it comes off the sequencers. I have noted many times before that our informatics effort is much greater on poorly designed and failed experiments.

Experiment #1 was a seemingly standard SNP detection project using exome sequences with 100 bp paired-end reads on Illumina HiSeq (Agilent SureSelect capture), with the entire thing done by a private sequencing contractor. The contractor also supplied SNP calls made with Illumina CASAVA software. Our job was simply to find overlaps between the SNP calls for various samples and controls, and to annotate the SNPs with genomic information (coding or non-coding, conservative mutations, biological pathways, etc.). However, we have an obsession with QC data, which the vendor was very reluctant to supply. It turns out that these sequencing reads have a 1.5% error rate, while our internal sequencing lab generates 0.5% error. We also see 10K novel SNPs in each sample with only minimal overlap across samples (a red flag for me). More QC data was extracted from the vendor, and now we see a steep increase in error at the ends of reads. So we wish to trim all reads by 10-25% and re-call SNPs, which has meant extracting more files from the vendor three times over (Illumina requires a LOT of runtime and intermediate files in order to run CASAVA for SNP calling).
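A minimal sketch of the cross-sample overlap check described above, in Python. This is an illustration, not the code we ran: the function name `pairwise_snp_overlap` and the representation of a SNP as a `(chromosome, position, alternate base)` tuple are assumptions, and parsing the vendor's CASAVA output into those tuples is left out.

```python
from itertools import combinations


def pairwise_snp_overlap(calls):
    """Return the Jaccard overlap of SNP calls for every pair of samples.

    `calls` maps a sample name to a set of (chrom, pos, alt) tuples.
    The tuple layout is an assumption made for this sketch.
    """
    overlap = {}
    for a, b in combinations(sorted(calls), 2):
        shared = calls[a] & calls[b]
        union = calls[a] | calls[b]
        # Jaccard index: shared calls divided by all calls seen in either sample.
        overlap[(a, b)] = len(shared) / len(union) if union else 0.0
    return overlap


if __name__ == "__main__":
    # Toy data only, invented for the example.
    toy_calls = {
        "sample1": {("chr1", 100, "A"), ("chr2", 200, "T")},
        "sample2": {("chr1", 100, "A"), ("chr3", 5, "G")},
    }
    for pair, score in pairwise_snp_overlap(toy_calls).items():
        print(pair, round(score, 3))
```

If most novel SNPs are real, related samples should share a fair number of them; overlap values near zero across every pair point toward sequencing or alignment error instead.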

Meanwhile, Experiment #2 is an RNAseq project where the investigator is interested in alternative splicing. We analyzed one earlier data set with 50 bp reads with only moderate success. It seems that very deep coverage is needed to get valid data for alt-splicing, especially when levels of a poorly expressed isoform are suspected to change by a small amount due to biological treatment. The investigator saw some published results suggesting that paired-end RNAseq data would provide more information about splicing isoforms. So, WITHOUT a bioinformatics consult, they sent an existing sample (created for 50 bp single-end sequencing) to the lab for 100 bp paired-end sequencing. This data came out of our pipeline with more than 20% error and a strange mix of incorrectly oriented read pairs (facing outward rather than inward). After a few days of head scratching and escalating levels of Illumina bioinformatics tech support, we have an explanation. A 225 bp library fragment contains 130 bp of primers and adapters, so the insert has an average size of about 95 bp. Some are shorter! Our 100-cycle reads therefore run off the far end of most inserts, adding 5 or more bases of adapter sequence where the alignment software is expecting genomic sequence. In addition, the paired reads overlap by more than 100%, so the start of one read is inside the end of the other. They therefore map in the opposite orientation, with an apparent insert size of 5-10 bp. Our best effort to analyze this data will involve chopping all reads back to 36 bp and repeating the paired-end analysis. So that was 3 days of bioinformatics analysis time not so well spent on forensic QC.
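The arithmetic behind that explanation is simple enough to check before a library ever goes on the machine. A minimal sketch in Python, using the numbers from this experiment; the function name `paired_end_geometry` and its return fields are made up for the illustration, and it assumes a single fixed insert length rather than a real size distribution.

```python
def paired_end_geometry(fragment_len, adapter_len, read_len):
    """Work out how far paired-end reads run into adapter and into each other.

    fragment_len: full library fragment length in bp (insert plus adapters)
    adapter_len:  total primer and adapter sequence in the fragment, in bp
    read_len:     number of sequencing cycles per read
    """
    insert_len = fragment_len - adapter_len
    # Bases of each read that actually come from the insert.
    genomic_bases = min(read_len, insert_len)
    # Anything beyond the insert is adapter sequence.
    adapter_bases = read_len - genomic_bases
    # Insert bases covered by both reads of the pair.
    mate_overlap = max(0, 2 * genomic_bases - insert_len)
    return {
        "insert_len": insert_len,
        "adapter_bases_per_read": adapter_bases,
        "mate_overlap": mate_overlap,
    }


if __name__ == "__main__":
    # 225 bp fragment, 130 bp of primers and adapters, 100-cycle reads.
    print(paired_end_geometry(225, 130, 100))
    # Same library with reads chopped back to 36 bp.
    print(paired_end_geometry(225, 130, 36))
```

With 100-cycle reads each read picks up 5 adapter bases and the two mates cover the entire 95 bp insert; at 36 bp there is no adapter read-through and no overlap between mates, which is why trimming to 36 bp is the fallback.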

Now we are looking back to Experiment #1 and wondering about insert sizes in that library. What if that library's insert size was about 110 or 120 bp (perhaps with a sizeable tail of much smaller fragments), and a fraction of the reads also run off into the adapter, adding mismatched bases at the ends of alignments, and thus jacking up the overall error rate?
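One way to put a number on that suspicion, sketched in Python under some loud assumptions: that a list of per-fragment insert sizes is available (for example from a library QC trace or from properly paired alignments), and that every base past the end of the insert is adapter. The function name `adapter_read_through_stats` and the toy sizes are hypothetical.

```python
def adapter_read_through_stats(insert_sizes, read_len):
    """Estimate how much of a run reads into adapter sequence.

    Returns a tuple of:
      - the fraction of fragments whose insert is shorter than the read, and
      - the share of all sequenced bases that fall in adapter.
    """
    if not insert_sizes:
        raise ValueError("insert_sizes must not be empty")
    short = [size for size in insert_sizes if size < read_len]
    adapter_bases = sum(read_len - size for size in short)
    total_bases = len(insert_sizes) * read_len
    return len(short) / len(insert_sizes), adapter_bases / total_bases


if __name__ == "__main__":
    # Toy insert sizes centred near 110-120 bp with a tail of short fragments.
    toy_sizes = [120, 110, 90, 80, 130]
    frac_reads, frac_bases = adapter_read_through_stats(toy_sizes, read_len=100)
    print(frac_reads, frac_bases)
```

The second number is only a rough ceiling on what read-through could add to the apparent error rate, since some adapter bases match the reference by chance and some aligners clip read ends instead of reporting mismatches.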

Two conclusions: 1) talk to bioinformatics BEFORE you build your sequencing libraries
2) if you want something done right, do it yourself.

6 comments:

Anonymous said...

Please see http://www.intrepidbio.com founded by bioinformatician at the University of Louisville.

Anonymous said...

Isn't higher error rate at the end of reads a known issue with Illumina...

tgenbio said...

I'm curious about your RNAseq pipeline. Do you mind sharing steps in this analysis?

mmarchin said...

I agree with the premise and I hear it all the time. In reality of course, it's not always true that talking to a bioinformatician can prevent some of these types of problems. But at least you would be able to have some input on experiment design.

Do you run FastQC?

Douglas said...

Hi Stuart,

How much do you charge for your services in those two cases?

One key question is whether and how much next-gen users are willing to pay for bioinformatic support. Oftentimes a user pays less than $2000 to run a lane. Will the user be willing to pay between $500 and $1000 to have the data analyzed and interpreted by an experienced bioinformatician?

Anonymous said...

I don't know how many problems could actually be prevented by talking to the bioinformatics people before the sampling/sequencing starts, but I'm pretty sure it would be a *lot* better than coming to them after you have the data and need someone to "rescue" the study...