Aug 22, 2013

Changes in software can change results

I gave a talk this week at the NGx conference in Providence RI.  Thanks to MaryAnn Brown for inviting me and Chris Dwan for chairing the session.  I am trying to share the slides from my talk as a "Google Slides" document at this URL:

https://docs.google.com/presentation/d/1UKrs2Cz8KJEXALooCbvfHKEo3IaT2fZ9-uI4nWUg_VA/edit?usp=sharing

Let me know if the file is accessible and how it looks. I have lots of ppt files from my bioinformatics course lectures, and  I want to know if this is a good way to share them.

The conference session was about analysis pipelines for sequencing data, so I decided to look at the changes that we have made in RNA-seq software over the past couple of years. We started out using the software provided by Illumina (ELAND alignment and CASAVA gene expression counting), then added edgeR for normalization and differential expression. In the past year or so we have been using the Tophat/Cufflinks/Cuffdiff package for a lot of projects. Recently we have been testing out the STAR aligner.

 I just wanted to see how much difference would show up in the final results (lists of differentially expressed genes) if we re-analyzed the same data set with each different package. This turns out to be quite difficult. Just between STAR and Cufflinks, different novel isoforms are created for each gene in each sample. Even when "merged," the lists of differentially expressed genes was very difficult to compare in a quantitative way. However, it is possible to focus on a set of "known genes" using a reference GTF file and not calling any new transcripts, and this allows for direct comparison between STAR and Tophat both at the level of reads per gene (FPKM) and at the level of p-values for differentially expressed genes. Overall, the expression levels are very highly correlated, and the p-values correlate at about 0.95.

Comparisons with the Illumina software kit are even more difficult. I learned that BAM is not actually a standard file format. Each alignment software has its own style of BAM file, and some are not compatible with each other. The official specification of the BAM format includes many "optional" fields and tags, and Cufflinks requires an "XS" tag for reads that are split by an intron, while ELAND/CASVA does not write this tag when aligning reads to its splice-junction database.  The best I could do was compare final gene lists ranked by p-value when analyzed by CASAVA/edgeR vs Tophat/Cuffdiff (all computed on the same FASTQ data files).  I got a 62% overlap between the top 1000 differentially regulated genes - which is kind of terrible. This suggests that a lot of the biological conclusions made in past RNA-seq experiments would be substantially different if we just changed the analysis software used on the exact same sequence data. Ouch!





1 comment:

Gabe Rudy said...

Thanks for the write-up Stuart.

At least it's comforting that STAR/TopHat are providing reasonable concordance.

It will be interesting to see what conclusions we can make about the impact of method choices on analysis as the SEQC papers come out with their massive benchmark sets.

As I mentioned, I've helped on one sub-project on SEQC comparing concordance of the same samples with trimmed read lengths of different sides and at our preliminary findings are that top DE genes are stable regardless of read length.