Dec 4, 2012

Juggling Elephants

We got hit by a hurricane.

Unfortunately, that is not some cute metaphor for large amounts of NGS data, but a very real and literal hurricane. NYU Med Center is located on 1st Ave in Manhattan (between 32nd and roughly 26th Streets, depending on which buildings you are counting). It overlooks the scenic FDR Drive, the East River, and some nice factories and warehouses in Long Island City, Queens. During Hurricane Sandy (a couple of days before Halloween), the East River hit us with a 15-foot (4.5 meter) storm surge, considerably higher than anything previously recorded here in NYC. It flooded the subway system and the train and highway tunnels, and the associated winds knocked out electrical power for many millions of people.

The damage to NYU Med buildings was truly overwhelming. All of our hospitals (Tisch, Rusk, Bellevue) had to be evacuated, and power, backup generators, and all associated building services were knocked out. Our Informatics Center kept our computing cluster in a data center located in the sub-basement of the main Medical Sciences Building. For a couple of days this room was flooded to the ceiling (and halfway up the floor above) with water from the East River mixed with fuel oil and the contents of the adjacent mouse breeding lab and Gamma Knife medical suite. Needless to say, those computers (700 nodes) were a total loss - bagged and hauled out as toxic waste. Thanks to foresightful and heroic efforts by our Computing Directors, we managed to save the data backup (200 Terabytes of sequencing data on an Isilon cluster). Our Next-Gen Sequencing lab was not directly damaged by the storm, but it is in a building that still has no power, AC, or running water, and the building is still undergoing assessment for asbestos cleanup and determination of what structural repairs will be needed (4 weeks after the storm).

So that little vignette is a preface to a discussion of outsourcing our sequencing and computing in order to maintain Next-Gen sequencing services for NYU research scientists. A number of labs stepped up to offer us help. We are doing HiSeq and MiSeq runs at the New York Genome Center and Memorial Sloan Kettering (thanks guys). We moved our data storage to a data center in New Jersey, and we are borrowing some computing power at the NYU Center for Genomics and Systems Biology (thanks guys) and renting some power on the Amazon cloud. Our NGS lab director is moving DNA/RNA samples and prepared sequencing libraries around town by taxi. My challenge is trying to organize the flow of data from the labs back to the investigators (and to keep an archive in our data storage). This represents something like 500 GB of data per HiSeq run, and we are getting two or more of these per week. The time for data transfer is a significant obstacle - either by FTP over the open Internet (and through one or two firewalls) or by copy onto USB drive (and then copy from USB to our local computers, then copy again to our archive and to remote data processing machines, which may have their own FTP servers). This is why it is starting to feel like we are juggling elephants.
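With that many copy steps, the one thing we cannot afford is silent corruption along the way. A minimal sketch of the kind of checksum verification that helps here (illustrative only, not our actual transfer scripts - the function names are invented):

```python
import hashlib

def file_md5(path, chunk_size=1 << 20):
    """Hash a file in 1 MB chunks, so even files from a 500 GB
    run folder never need to fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_copy(src, dst):
    """True if the copy at dst is byte-identical to the original at src."""
    return file_md5(src) == file_md5(dst)
```

Run after every hop (FTP, USB, archive), a check like this turns a multi-hop copy chain from an act of faith into something auditable.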

Having run our own NGS lab for over 3 years, we are extremely aware of all of the different types of errors that can occur in the sequencing process (sample prep, pooling of libraries, recording incorrect barcodes in sample sheets, machine mechanics and fluidics, computational glitches). Therefore we are rather obsessive about QC checking of the data. Maybe we are just working the kinks out of the system, but it seems like almost every run has some problem that requires re-processing the primary data (from the Run folder with its .BCL and other files, not just the final FASTQ). I hope we are not wearing out the patience of our generous collaborators.
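Many of the sample-sheet problems we catch come down to barcodes that are too similar to demultiplex cleanly. As a rough illustration (not our actual QC pipeline - thresholds and function names are invented), a pairwise check of barcode distances might look like:

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length barcodes."""
    return sum(x != y for x, y in zip(a, b))

def check_barcodes(barcodes, min_dist=3):
    """Flag any pair of sample barcodes closer than min_dist mismatches.
    Such pairs risk reads being assigned to the wrong sample when the
    demultiplexer tolerates a sequencing error in the index read."""
    clashes = []
    for i, a in enumerate(barcodes):
        for b in barcodes[i + 1:]:
            if len(a) == len(b) and hamming(a, b) < min_dist:
                clashes.append((a, b))
    return clashes
```

Running a check like this on the sample sheet before the run starts is a lot cheaper than re-demultiplexing from the .BCL files afterward.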

Someone may be able to run a fully outsourced NGS lab and bioinformatics computing support service, but for us, it has not been easy.

Oct 16, 2012

Mutation linked to relapse in childhood ALL

Julia Meyer is a Ph.D. student in the lab of Bill Carroll in the NYU Cancer Institute, and I have the good luck to sit on her thesis advisory committee. Julia has been using RNA-seq to look for mutations that are specific to relapse of childhood acute lymphoblastic leukemia (ALL). ALL can often be cured in children, but years after remission about 20% of patients relapse, and prognosis for relapsed ALL is very poor. 

Julia studied pairs of RNA samples from 10 patients taken at original diagnosis and again after relapse (Illumina RNA-seq). The data analysis was very difficult since she initially found millions of variants, and even after extensive stringent filtering and matching between diagnosis and relapse, there were many false positives. Eventually she narrowed it down to just 20 non-synonymous mutations that were specific to the relapse samples. Two patients harbored (different) relapse-specific mutations in the same gene, NT5C2, which codes for the Cytosolic 5’-nucleotidase II. Full exon sequencing of NT5C2 was completed in 61 additional relapse specimens (using 454 amplicon protocol), identifying 5 additional mutations which were also confirmed as relapse specific. 
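The core of that filtering step is a comparison of the diagnosis and relapse call sets, with the extra requirement that the diagnosis sample had enough coverage at a site to trust the absence of a variant there. A toy sketch of the logic (not Julia's actual pipeline - the data representation and threshold are invented for illustration):

```python
def relapse_specific(dx_calls, rel_calls, dx_depth, min_depth=20):
    """Variants called at relapse but not at diagnosis.
    Each call is a (chrom, pos, ref, alt) tuple; dx_depth maps
    (chrom, pos) -> read depth in the diagnosis sample. A site is
    kept only if diagnosis coverage was deep enough that the
    variant's absence there is believable."""
    candidates = sorted(set(rel_calls) - set(dx_calls))
    return [(c, p, r, a) for (c, p, r, a) in candidates
            if dx_depth.get((c, p), 0) >= min_depth]
```

The coverage guard is the part that matters: without it, every poorly covered region of the diagnosis sample generates spurious "relapse-specific" calls, which is exactly the false-positive problem described above.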

Conclusions: Mutations in NT5C2 are associated with the outgrowth of drug resistant cells in childhood ALL.

This work was published as an ASCO abstract at the 2012 ASCO annual meeting.

As a member of the thesis committee, I got a view of some really interesting followup studies. The NT5C2 gene product is a purine nucleotidase. Structural modeling of the relapse-associated mutations in the encoded protein suggests alteration of enzyme subunit association/dissociation. Julia has found that cells transfected with the mutant version of NT5C2 are RESISTANT to 6-mercaptopurine, which is one of the drugs used for long term maintenance chemotherapy of ALL. She also found very low levels of the mutant allele in some diagnosis samples (early stage disease). The obvious implication is that under long term drug treatment, a clone of tumor cells with activating mutations in NT5C2 expands and creates a drug resistant relapse. 

Wow. This is the first molecular model for the cause of relapse of ALL. It could lead directly to diagnostics and therapy. 

Oct 15, 2012

My Next-Generation DNA Sequencing Informatics book has gone live on the Cold Spring Harbor Laboratory Press website (pre-orders).

It has the following chapters (see below). I am going to leak a few pages as a teaser, so I need to know which chapter is most interesting to a random selection of people. Votes will be counted in the comment section for this post.

1. Introduction to DNA Sequencing
Stuart Brown
2. History of Sequencing Informatics
Stuart Brown
3. Visualization of Next-Generation Sequencing Data
Phillip Ross Smith, Kranti Konganti, and Stuart Brown
4. DNA Sequence Alignment
Efstratios Efstathiadis
5. Genome Assembly Using Generalized de Bruijn Digraphs
D. Frank Hsu
6. De Novo Assembly of Bacterial Genomes from Short Sequence Reads
Silvia Argimón and Stuart Brown
7. Genome Annotation
Steven Shen
8. Using NGS to Detect Sequence Variants
Jinhua Wang, Zuojian Tang, and Stuart Brown
9. ChIP-seq
Zuojian Tang, Christina Schweikert, D. Frank Hsu, and Stuart Brown
10. RNA Sequencing with NGS
Stuart Brown, Jeremy Goecks, and James Taylor
11. Metagenomics
Alexander Alekseyenko and Stuart Brown
12. High-Performance Computing in DNA Sequencing Informatics
Efstratios Efstathiadis

Aug 9, 2012

New Blog on Next Generation Sequencing

I have started writing a new blog "channel" about Next Generation Sequencing for a website called BiteSizeBio. The first article went live today:

NGS- A Revolution in Technology

These articles will be more basic and broad than the internal lab stuff (and random personal observations) that I post here. I plan to write about one per month, and I will probably post links here when they go live.

Aug 8, 2012

I gave a short presentation at the NYU CTSI Translational Research in Progress seminar this week about our sequencing work on strains of Streptococcus mutans associated with severe tooth decay in children. We sequenced and did de novo assembly on each of 20 strains, then did an in silico subtraction to find unique genomic elements associated with health or disease.

The details are in this PDF file:

Aug 1, 2012

I led a workshop on 7/31/2012 about NGS for Drug Development at this Hanson Wade conference in Boston:

NGS Bioinformatics for Drug Developers

I made a huge PPT slide deck to fill 2 hours, but in the workshop we ended up in a multi-way discussion for most of the time, so I never got to show half the slides.  I am posting them on
so I don't feel like the effort was wasted.

NGS drug dev PPT slides

Jun 8, 2012

MiSeq replaces 454 in Microbiome projects

One of my Bioinformatics students, Laura Cox, is working on a Human Microbiome Project study with Martin Blaser. Yesterday, she presented a lab report to a standing room only crowd about her sequencing work on bacterial populations using the Illumina MiSeq machine. Up till now, HMP work was about the only sequencing that we consistently ran on the 454 machine. Laura showed that with the MiSeq paired-end 150 bp sequencing protocol, it was possible to sequence 16S amplicons (in the V4 region) from both ends and stitch them together using the ea-utils FASTQ-join tool [Erik Aronesty (2011), ea-utils: "Command-line tools for processing biological sequencing data"] to get about 260 bp reads on each amplicon. Laura used a custom multiplex scheme to get 192 different samples into one MiSeq run, which, after demultiplexing, gave about 2,000 paired-end reads per sample. 
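FASTQ-join itself uses a more careful alignment and quality-aware scoring scheme, but the basic idea of merging overlapping paired ends is easy to sketch (toy code, invented parameters, for illustration only):

```python
def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    comp = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}
    return "".join(comp[b] for b in reversed(seq))

def stitch(r1, r2, min_overlap=10, max_mismatch_frac=0.1):
    """Merge a forward read with its mate when their 3' ends overlap.
    r2 is given in sequencer orientation, so it is reverse-complemented
    first; the longest overlap passing the mismatch threshold wins.
    Returns the merged sequence, or None if no acceptable overlap."""
    r2 = revcomp(r2)
    for olen in range(min(len(r1), len(r2)), min_overlap - 1, -1):
        tail, head = r1[-olen:], r2[:olen]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_frac * olen:
            return r1 + r2[olen:]
    return None
```

With 2x150 bp reads over a ~260 bp V4 amplicon, the two reads overlap by roughly 40 bp in the middle, which is plenty for an unambiguous join.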

She also demonstrated that the resulting sequence data could be processed with QIIME to get reasonable taxonomy information, build phylogenetic trees, and apply all the cute tools to calculate diversity and compare groups of samples by PCA and UniFrac.

The economics of MiSeq are persuasive. It is giving amplicon data at roughly 1/40th the cost of 454. As our HMP protocols shift over to MiSeq, this will be the last year that we keep the 454 machine in the Genomics Core Lab. 

Mar 14, 2012

Collaborative work on Exome SNPs

I have noticed that a fair number of people who actually work with Next-Gen Sequence data read this blog, so perhaps we can use it for a collaborative project.

I want to write a paper about uneven coverage in exome sequencing leading to incorrect SNP calls. Our data is from tumor-normal pairs, and we see a lot of false negatives - failure to detect a SNP in a sample due to low coverage at that spot. Exome capture methods seem to have more than their fair share of low coverage spots (even with an average coverage over 100x), and these low coverage spots do differ somewhat from sample to sample. I'd like some other people to share data with us and/or do some similar analysis on other data sets so that we can make a stronger paper.
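Finding the low-coverage spots is itself simple: a linear scan over per-base depths along each capture target (e.g. from the output of a depth-reporting tool). This sketch is illustrative, not our production code:

```python
def low_coverage_runs(depths, min_depth=10):
    """Given per-base read depths along an exome capture target,
    return (start, end) half-open intervals where depth < min_depth.
    A SNP falling inside one of these intervals cannot be confidently
    called absent - it is a potential false negative, not a reference call."""
    runs, start = [], None
    for i, d in enumerate(depths):
        if d < min_depth and start is None:
            start = i
        elif d >= min_depth and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(depths)))
    return runs
```

Intersecting these intervals across the tumor and normal of each pair (and across samples) is what would let us say how reproducible the dropout regions are, which is the question the paper needs to answer.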

Feb 20, 2012

$18,000 Cancer Genome

Illumina has announced an updated and improved cancer genome sequencing service for $18,000. This will provide genome sequencing of the tumor at 80x and of normal tissue from the same patient at 40x. This is comparable to the coverage offered by Complete Genomics for a similar service (at $12K). Illumina also offers a novel sample prep method (in partnership with the Broad Institute) for very small samples and FFPE.

Perhaps the most interesting thing about the Illumina service is the bioinformatics support, which will include a new variant detection algorithm that looks at both the tumor and normal together, in order to reduce the false positives. The standard approach, available from other software, does variant detection separately for each sample, then tries to subtract the variants found in the normal from those found in the tumor. This method works very poorly, since many variants cannot be called accurately in tumor samples, which contain various amounts of normal tissue mixed in as well as tumor genomic heterogeneity. Many other existing variants are simply not called in the normal sample (i.e. false negatives) due to poor coverage, poor quality, nearby insertions/deletions, or any other feature that fails stringent variant detection software. We have been working on this same approach to the problem, but Illumina brings a much bigger team with access to a LOT more data. Illumina will also provide custom annotation of discovered variants provided by a team of human bioinformaticians (rather than just running the data through a static annotation software pipeline).
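A toy illustration of the difference (invented thresholds, nothing to do with Illumina's actual algorithm): subtraction trusts the mere absence of a call in the normal, whereas a joint test demands positive evidence of reference in the normal before labeling a site somatic.

```python
def call_somatic(tumor_alt, tumor_depth, normal_alt, normal_depth,
                 min_tumor_vaf=0.1, min_normal_depth=15, max_normal_vaf=0.03):
    """Joint tumor/normal decision at a single site, given alt-allele
    read counts and total depths. Naive subtraction would report
    'somatic' whenever the variant was simply not CALLED in the normal;
    here the normal must be deep enough, with a variant allele fraction
    near zero, before the absence is believed."""
    if tumor_depth == 0 or tumor_alt / tumor_depth < min_tumor_vaf:
        return "not_called"          # too little variant signal in tumor
    if normal_depth < min_normal_depth:
        return "ambiguous"           # cannot trust absence in a shallow normal
    if normal_alt / normal_depth > max_normal_vaf:
        return "germline"            # variant is present in the normal too
    return "somatic"
```

The "ambiguous" category is the key design point: the sites that subtraction silently misclassifies are exactly the ones a joint model can flag for manual review instead.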

I think this is a more realistic milestone for clinical sequencing than the mythical $1000 genome. Cancer patients are one of the few (common) clinical scenarios where whole genome sequencing could really pay off with actionable discoveries - allowing genetic information to be used to choose targeted drugs and other interventions. A simple genome sequence (at whatever coverage) of a healthy person does not provide much medically actionable data today. Furthermore, the informatics that can currently be applied to a single, cheaply acquired genome sequence ranges from relatively inexpensive but simplistic one-size-fits-all software pipelines to the equally mythical $100,000 interpretation (presumably provided by a dedicated project team of expert informaticians and medical geneticists).