Jan 20, 2013

It's hard to make NGS simple

Our Informatics group has been running training seminars for some postdocs and upper-level grad students to teach computer skills for NGS data analysis. This was at their request, not my idea. What the students really wanted was the ability to run the common NGS software on their own, rather than depend completely on the bioinformatics group to do all of the work for them. Our HPC director, Efstratios Efstathiadis, and I put together some lectures on basic Unix command-line skills and tried to build some simple workflows to demonstrate common NGS applications such as alignment, visualization, ChIP-seq, and RNA-seq.

The first tutorial went pretty well. We created an Amazon Cloud virtual machine and made accounts for all of our students (it had to be on the Cloud since we still have no local servers - see hurricane story in my previous blog post). We installed BWA, SAMtools, and the necessary supporting modules. We uploaded a very small FASTQ file as a practice data set, along with a reference genome. Then in class, we taught about a dozen basic Unix commands for the complete novice (pwd, ls, cd, mkdir, man, rm, cp, mv, head, more, ...). Then we had the students align the sample data with BWA, convert the output to sorted, indexed BAM files with SAMtools, then download the .BAM and .BAI files and visualize the final data on their own computers with IGV. This took about 2.5 hours for a class of 10 students, and all participants generally felt it was a solid success.
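For readers who want to see the shape of that first workflow, here is a minimal Python sketch that assembles the three steps (index the reference, align, sort and index the BAM). It is not the exact set of commands used in class: it assumes a current BWA with the `bwa mem` subcommand and SAMtools 1.x with `samtools sort -o`, and the file names `chr1.fa` and `sample.fastq` are placeholders. By default it only prints the commands, so it can be read without either tool installed.

```python
import shlex
import subprocess


def build_alignment_commands(reference, fastq, sample):
    """Return shell commands for: index reference -> align -> sorted BAM -> BAM index.

    Assumes `bwa mem` and SAMtools 1.x syntax; older releases use different options.
    """
    sorted_bam = f"{sample}.sorted.bam"
    ref = shlex.quote(reference)
    fq = shlex.quote(fastq)
    bam = shlex.quote(sorted_bam)
    return [
        # One-time step: build the BWA index for the reference genome.
        f"bwa index {ref}",
        # Align the reads and pipe the SAM output straight into a coordinate sort.
        f"bwa mem {ref} {fq} | samtools sort -o {bam} -",
        # Create the .bai index that IGV expects next to the BAM file.
        f"samtools index {bam}",
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(cmd)
        if not dry_run:
            subprocess.run(cmd, shell=True, check=True)


if __name__ == "__main__":
    # Placeholder file names, not the files from the tutorial.
    run_pipeline(build_alignment_commands("chr1.fa", "sample.fastq", "sample"))
```

Calling `run_pipeline(..., dry_run=False)` would actually launch the jobs, which is the step that gets expensive once a whole class does it at the same time.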

The second tutorial was supposed to be a bit closer to a real use case for a ChIP-seq experiment. We created a somewhat larger practice data set that contained data just from human chromosome 1, with 2 FASTQ files: a ChIP sample and an input control (about 1 million reads per file). I gave a little lecture on the basics of ChIP-seq technology, how peak calling software works, and what to look out for in terms of QC of the data. Then we had the students align both FASTQ files and convert the output to BAM format with an index. The next steps were to run MACS to find peaks and produce a BED file, then download the BAM, BAI, and BED files to local laptops, visualize them in IGV, and load the BED file into Galaxy to compare the peaks with the locations of known gene start sites.
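The peak-calling step can be sketched the same way. The post only says "MACS", so the version is an assumption here: this example builds a `macs2 callpeak` command with a ChIP sample, an input control, BAM input, and a genome-size setting. The file names and the `chr1_chip` output prefix are hypothetical, and the `hs` genome-size preset describes the whole human genome, so it may need adjusting for a chromosome-1-only data set.

```python
def build_macs2_command(chip_bam, control_bam, name, genome_size="hs"):
    """Return the argument list for a MACS2 peak-calling run (assumes MACS2 is installed)."""
    return [
        "macs2", "callpeak",
        "-t", chip_bam,      # treatment: the ChIP sample
        "-c", control_bam,   # control: the input sample
        "-f", "BAM",         # input file format
        "-g", genome_size,   # effective genome size ("hs" is the human preset)
        "-n", name,          # prefix for the output files
    ]


if __name__ == "__main__":
    # Hypothetical file names; print the command instead of running it.
    print(" ".join(build_macs2_command("chip.sorted.bam", "input.sorted.bam", "chr1_chip")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to launch the job.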

This tutorial turned out to be overly ambitious. The amount of CPU churning required for 10 students to run 2 BWA jobs each on 1-million-read FASTQ files was much more than our standard Amazon EC2 VM could support. We should have created all of the output files in advance and shared them with the students. Then they could have started each job, killed it, and moved on. Also, downloading all the final data from the EC2 instance was a hassle. We should have just passed around a USB drive with the final BAM files. This class went way beyond our 2.5 hours, so we never found the time to show the students how to load all gene Transcription Start Sites into Galaxy and overlap them with the BED file to annotate ChIP-seq peaks with respect to promoters (one of my favorite tricks).
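The promoter annotation trick is done in Galaxy, but the underlying idea is a simple interval lookup. As a rough illustration only (not the Galaxy workflow itself), this Python sketch reports which transcription start sites fall within a fixed window of each peak. The 1000 bp window, the gene names, and the coordinates are all made up for the example, and strand is ignored.

```python
import bisect
from collections import defaultdict


def index_tss(tss_records):
    """Group (chrom, position, gene) records by chromosome, sorted by position."""
    grouped = defaultdict(list)
    for chrom, pos, gene in tss_records:
        grouped[chrom].append((pos, gene))
    index = {}
    for chrom, sites in grouped.items():
        sites.sort()
        # Parallel lists: positions for binary search, genes for reporting.
        index[chrom] = ([p for p, _ in sites], [g for _, g in sites])
    return index


def annotate_peaks(peaks, tss_index, window=1000):
    """Return [(peak, genes)] where each gene has a TSS within `window` bp of the peak.

    Peaks are BED-style (chrom, start, end) with a half-open [start, end) interval.
    """
    annotated = []
    for chrom, start, end in peaks:
        positions, genes = tss_index.get(chrom, ([], []))
        lo = bisect.bisect_left(positions, start - window)
        hi = bisect.bisect_left(positions, end + window)
        annotated.append(((chrom, start, end), genes[lo:hi]))
    return annotated


if __name__ == "__main__":
    # Hypothetical TSS positions and peaks, for illustration only.
    tss = [("chr1", 1000, "GENE_A"), ("chr1", 50000, "GENE_B"), ("chr2", 500, "GENE_C")]
    peaks = [("chr1", 1500, 1800), ("chr1", 20000, 20300), ("chr2", 100, 400)]
    for peak, genes in annotate_peaks(peaks, index_tss(tss)):
        print(peak, genes)
```

Sorting the start sites once and using binary search keeps the lookup fast even with every gene in the genome loaded.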

I'm supposed to follow up with an RNA-seq tutorial in a couple of weeks, and I doubt that I can make TopHat/Cufflinks simple enough to run smoothly in a classroom setting. Overall, I have learned that it is darn hard to make even routine NGS tasks simple and bulletproof. I am leaning more toward some type of Galaxy on the Cloud solution for lab scientists who want to take some control over their data analysis tasks.


Oliver Hofmann / @fiamh said...


If you need some ideas, we've been playing with this for a bit (NGS/Galaxy). Course notes and all data are at http://scriptogr.am/ohofmann/about.

Cheers, Oliver

AndrewL said...

Hi Stuart. We've been through the same process, for starting students in a bioinf masters course and for general life sciences bench scientists moving to next-gen. We ended up writing some tutorials using Galaxy, and putting a lot of effort into building a general platform for life scientists to do genomics (based on galaxy-on-the-cloud; happily we have Enis Afgan, developer of Cloudman, working with us to build that platform). We'd be very happy for others to use the tutorials and platform! Links to both the tutes and the Galaxy-on-the-cloud instance they run on are at http://genome.edu.au

Also, have you seen cistrome.org? Has a very good ChIP-seq tute (again based on Galaxy) at http://cistrome.org/ap/u/cistrome/p/demonstration

We'd also be extremely happy for any feedback on these.


HelixCode said...

Dear Brown,
Would it be possible for you to upload the video of the lectures that you gave at the students' request, so that we can also learn what was taught? I am asking because I am deeply interested in NGS data analysis, and I am learning not only from books and PDFs but also by watching video tutorials on Vimeo and YouTube.

Anonymous said...

If you are in need of free assistance:
We are a team of next-gen analysis experts who offer the scientific community a free bioinformatics analysis service when it is required.
This is our website, where we explain what we do: http://www.nextgenintelligence.com/

NGI staff
