Aug 22, 2013

Changes in software can change results

I gave a talk this week at the NGx conference in Providence RI.  Thanks to MaryAnn Brown for inviting me and Chris Dwan for chairing the session.  I am trying to share the slides from my talk as a "Google Slides" document at this URL:

https://docs.google.com/presentation/d/1UKrs2Cz8KJEXALooCbvfHKEo3IaT2fZ9-uI4nWUg_VA/edit?usp=sharing

Let me know if the file is accessible and how it looks. I have lots of ppt files from my bioinformatics course lectures, and  I want to know if this is a good way to share them.

The conference session was about analysis pipelines for sequencing data, so I decided to look at the changes that we have made in RNA-seq software over the past couple of years. We started out using the software provided by Illumina (ELAND alignment and CASAVA gene expression counting), then added edgeR for normalization and differential expression. In the past year or so we have been using the Tophat/Cufflinks/Cuffdiff package for a lot of projects. Recently we have been testing out the STAR aligner.

 I just wanted to see how much difference would show up in the final results (lists of differentially expressed genes) if we re-analyzed the same data set with each different package. This turns out to be quite difficult. Just between STAR and Cufflinks, different novel isoforms are created for each gene in each sample. Even when "merged," the lists of differentially expressed genes was very difficult to compare in a quantitative way. However, it is possible to focus on a set of "known genes" using a reference GTF file and not calling any new transcripts, and this allows for direct comparison between STAR and Tophat both at the level of reads per gene (FPKM) and at the level of p-values for differentially expressed genes. Overall, the expression levels are very highly correlated, and the p-values correlate at about 0.95.

Comparisons with the Illumina software kit are even more difficult. I learned that BAM is not actually a standard file format. Each alignment software has its own style of BAM file, and some are not compatible with each other. The official specification of the BAM format includes many "optional" fields and tags, and Cufflinks requires an "XS" tag for reads that are split by an intron, while ELAND/CASVA does not write this tag when aligning reads to its splice-junction database.  The best I could do was compare final gene lists ranked by p-value when analyzed by CASAVA/edgeR vs Tophat/Cuffdiff (all computed on the same FASTQ data files).  I got a 62% overlap between the top 1000 differentially regulated genes - which is kind of terrible. This suggests that a lot of the biological conclusions made in past RNA-seq experiments would be substantially different if we just changed the analysis software used on the exact same sequence data. Ouch!





3 comments:

Anonymous said...

Thanks for the write-up Stuart.

At least it's comforting that STAR/TopHat are providing reasonable concordance.

It will be interesting to see what conclusions we can make about the impact of method choices on analysis as the SEQC papers come out with their massive benchmark sets.

As I mentioned, I've helped on one sub-project on SEQC comparing concordance of the same samples with trimmed read lengths of different sides and at our preliminary findings are that top DE genes are stable regardless of read length.

Anacyte Laboratories said...

Thanks for sharing this blog, I need such type of information that got this post. For more information please visit our site

RNA Sequencing

Single Cell RNA Sequencing

Protein Stabilization

Ric Clayton said...

I really want to thank Dr Emu for saving my marriage. My wife really treated me badly and left home for almost 3 month this got me sick and confused. Then I told my friend about how my wife has changed towards me. Then she told me to contact Dr Emu that he will help me bring back my wife and change her back to a good woman. I never believed in all this but I gave it a try. Dr Emu casted a spell of return of love on her, and my wife came back home for forgiveness and today we are happy again. If you are going through any relationship stress or you want back your Ex or Divorce husband you can contact his whats app +2347012841542 or email emutemple@gmail.com website: Https://emutemple.wordpress.com/