May 28, 2015

New 'Next-Gen Seq 2' book is at the printer

The second edition of the Next-Generation Sequencing Informatics book (that I edit) is at the printer and available for pre-order at Cold Spring Harbor Press and Amazon. We think it will ship on June 30th, maybe a bit sooner.

[James Hadfield at CoreGenomics blog has posted a review: ]

We have added new chapters on the latest sequencing technology, QC, de novo transcript assembly, and proteogenomics, plus lots of updates and expansion in areas such as RNA-seq and ChIP-seq. It has a beautiful cover and it's not too expensive.

Here is the official publication blurb:

Next-generation DNA sequencing (NGS) technology has revolutionized biomedical research, making genome and RNA sequencing an affordable and frequently used tool for a wide variety of research applications including variant (mutation) discovery, gene expression, transcription factor analysis, metagenomics, and epigenetics. Bioinformatics methods to support DNA sequencing have become and remain a critical bottleneck for many researchers and organizations wishing to make use of NGS technology. Next-Generation DNA Sequencing Bioinformatics, Second edition, provides a thorough, plain-language introduction to the necessary informatics methods and tools for analyzing NGS data, as did the first edition, and provides detailed descriptions of algorithms, strengths and weaknesses of specific tools, pitfalls, and alternative methods. Four new chapters in this edition cover: experimental design, sample preparation, and quality assessment of NGS data; public databases for DNA sequencing data; de novo transcript assembly; proteogenomics; and emerging sequencing technologies. The remaining chapters from the first edition have been updated with the latest information. This book also provides extensive reference to best-practice bioinformatics methods for NGS applications and tutorials for common workflows. The second edition of Next-Generation DNA Sequencing Bioinformatics addresses the informatics needs of students, laboratory scientists, and computing specialists who wish to take advantage of the explosion of research opportunities offered by new DNA sequencing technologies.

and the Table of Contents:

1) Introduction to DNA Sequencing
Stuart M. Brown
2) Quality Control and Data Processing
Stuart M. Brown
3) History of Sequencing Informatics
Stuart M. Brown
4) Public Sequence Databases
Stuart M. Brown
5) Visualization of Next-Generation Sequencing Data
Philip Ross Smith, Kranti Konganti, and Stuart M. Brown
6) DNA Sequence Alignment
Efstratios Efstathiadis
7) Genome Assembly Using Generalized de Bruijn Digraphs
D. Frank Hsu
8) De Novo Assembly of Bacterial Genomes from Short Sequence Reads
Silvia Argimón and Stuart M. Brown
9) De Novo Transcriptome Assembly
Lisa Cohen, Steven Shen, and Efstratios Efstathiadis
10) Genome Annotation
Steven Shen and Stuart M. Brown
11) Using NGS to Detect Genome Sequence Variants
Jinhua Wang
12) ChIP-seq
Stuart M. Brown, Zuojian Tang, Christina Schweikert, and D. Frank Hsu
13) RNA-seq with Next-Generation Sequencing
Stuart M. Brown and Jeremy Goecks
14) Metagenomics
Guillermo I. Perez-Perez, Miroslav Blumenberg, and Alexander V. Alekseyenko
15) Proteogenomics
Kelly V. Ruggles and David Fenyö
16) DNA Sequencing Technologies and Applications
Gerald A. Higgins and Brian D. Athey
17) Cloud-based Next-Generation Sequencing Informatics
Konstantinos Krampis, Efstratios Efstathiadis, and Stuart M. Brown

Feb 9, 2015

Password hell

This is not a Bioinformatics post, just an amusing technology catch-22 that I encountered this morning. At NYU we have automatic mandatory password updates for our accounts with IT. This includes email, login to my Windows desktop computer, and wireless devices on the secure WiFi network in our building. Since I am lazy about these things, I did not heed the warnings and follow the instructions in the "Password Update" email from our IT Department. Instead, at home on Sunday night, I got a message when I tried to log in to my email account saying that I should update my password, and a helpful little box appears where it is possible to type the old password and the new password, hit submit, and it's all good.

I made a new password, and checked my mail, but after about 5 min, I got knocked off the network and can't log back in. It's late, so I figure to deal with it at the office in the morning. At my desk, I can't log into my computer (uses the same network "kerberos" password), and my phone complains that it can't get on the local wireless network. I try new password, old password, and eventually get the helpful message that my account has been locked by the IT Dept, and I must call the helpdesk. It's 9 AM on Monday and the helpdesk picks up right away. Help Guy asks if I have any wireless devices that may be using the old password. I look at the offending iPhone, and shut off WiFi. Helpdesk says: "I still see wireless activity hitting your account with an invalid password." Back to my desk, where my desktop Mac is using WiFi and getting unhappy messages from the network. Shut down WiFi. Helpdesk still sees activity on my account. Think, think?? Into the drawer where I have a laptop that we use for teaching and public seminars; it is asleep, but somehow still hitting the wireless network with my old password. Turn off WiFi on that one, and finally the helpful helpdesk guy can unlock my account. Then I can go back to each device and rejoin the network with the new password. I guess I'm not the first idiot this has happened to. Moral of the story??? Follow instructions very carefully or your helpful technology tools will gang up against you.

Happy Ice Storm Day from New York

Sep 10, 2014

Introduction to Biostatistics and Bioinformatics course at NYUMC

We are giving a new course in the PhD program at NYU Med School (Sackler Institute) this semester called "Introduction to Biostatistics and Bioinformatics". It will have a mixture of lectures on Bioinformatics, Biostatistics, and Python programming. Hopefully we will be able to show the students the intersection of these topics as something like "Data Science for Biology".  Lecturers will be myself, David Fenyo, Judy Zhong, and Pamela Wu.

Course Overview
The goal for the Introduction to Biostatistics and Bioinformatics course is to provide an introduction to statistics and informatics methods for the analysis of data generated in biomedical research. Practical examples covering both small-scale lab experiments and high-throughput assays will be explored. The course covers a wide range of topics in a short time, so the focus will be on the basic concepts, and in the practical programming exercises the students explore these basic concepts and common pitfalls. An introduction to basic Python and R programming will be given throughout the course and many exercises will involve programming.

The lectures will be posted to YouTube each week. Here are our first ones from yesterday:
Intro Lecture/Data Visualization:
Python programming #1:

The course curriculum and links to lecture slides (PPT), readings, and various handouts and exercises are here:

Aug 20, 2014

Cheap genome projects

I have been helping out several  groups who want to do cheap, quick genome projects on previously "unsequenced" eukaryotic organisms. In terms of the genetic diversity of living things, at this point we have sampled very unevenly across taxonomic domains. Insects are particularly underrepresented in the whole genome database. According to the number of named species, insects represent over 80% of animal species. The NCBI has over 10,000 whole genome projects, but only 132 insects (and 36 of those are Drosophila species). So there is certainly taxonomic room to knock out some more insect genomes and discover interesting new stuff. A wide variety of valid arguments can be made for the usefulness of sequencing any number of as yet overlooked organisms in all taxonomic domains.

There are a lot of very useful experimental approaches that require a draft genome as a reference:  gene expression, transcription factors and epigenetics (ChIP-seq), and just the basic evolutionary biology of important genes that are present or absent in the genome, novel paralogs in important gene families, etc.  A few years ago, building a draft genome for a new organism was a major undertaking that required substantial funding and a dedicated research team. Today, the sequencing can be done fairly cheaply, but the bioinformatics work is extremely open ended. Clearly you can work and work and work to make a very well defined and annotated genome with maximum value for all possible users. But what is the optimal set of most useful genomic information that we can produce with a few person-weeks of time?  [that would perhaps serve as preliminary data to bring in showers of additional funding for follow-up studies]

NOTE ONE: Collect both genomic and RNA sequence data. These two data types are extremely complementary. Many earlier de novo genome projects built on collections of existing EST sequence data, which was the poor man's approach to draft genomes in the previous decade. The ESTs provided seeds for gene finding on the genomic DNA, training of gene-finding algorithms, etc. Now we can get a comprehensive genome-wide set of RNA-seq data for the cost of one lane on a HiSeq machine and a sample prep kit. If you have the choice, get 100 bp paired-end sequencing of the RNA; it will map better and end up giving more value for the dollar. It might be possible to use paired-end RNA to bridge DNA contigs into scaffolds - as far as I know, this is an untested area. It would be very helpful to have a normalized RNA library, to get more coverage of poorly expressed genes – this is on my wish list for future Cheap-O Genome projects.

NOTE TWO: More data is good. High total genome coverage is good, but long insert paired-end DNA libraries build better genome contigs. This makes complete sense. Early "shotgun" genome projects relied on sequencing the ends of clones from libraries with various size inserts. It would be really nice to have 10 or 20 KB insert libraries, or "mate pair" sequences that come from the junctions of large  genomic fragments that have been circularized, but these are generally not available when your entire sequencing budget is in the single digit thousands of dollars. We were able to get a 550 bp insert library for a recent project and it led to an assembly with an N50 > 40 KB. Pretty good for two lanes of HiSeq data. 
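For readers who have not met the N50 statistic quoted above: it is the contig length such that contigs of that length or longer hold at least half of the assembled bases. Here is a minimal sketch of the calculation; the contig lengths are made up for illustration and are not from our project.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 for empty input)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running sum to half of the
        # assembly defines the N50.
        if running >= half_total:
            return length
    return 0


# Hypothetical contig lengths in bp, purely for illustration.
example_lengths = [80_000, 45_000, 40_000, 12_000, 3_000, 500]
print(n50(example_lengths))  # 45000
```

A handful of long contigs dominates this number, which is why a longer-insert library can move the N50 so much without changing the total amount of sequence.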

For our next cheap genome, we are using the Illumina TruSeq Synthetic Long-Read kit (which is based on technology developed by Moleculo). This is a really clever idea: first it breaks the genome into ~10 KB fragments and sorts the fragments into wells of a 384-well plate. Just a few dozen to a few hundred fragments in each well. The fragments in each well are clonally amplified (sort of like 454 technology), then sheared into the normal size range for Illumina sequencing (300-500 bp) and tagged with barcode primers at the ends. Then all the tagged fragments are pooled and sequenced normally on a HiSeq machine. Illumina has a custom assembly app (built in BaseSpace) that demultiplexes the data and does separate de novo assembly on each barcode set – so it is just assembling the small number of 10 KB fragments from one well. The final output is a set of "synthetic long reads" that really do seem to be 10 kb long.

(From Illumina product literature)
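To make the demultiplexing step concrete, here is a toy sketch of the idea: reads are grouped by their well barcode so that each group can be handed to an assembler on its own. This is not Illumina's BaseSpace app, whose internals I have not seen; the barcodes, reads, and function name are all invented for illustration.

```python
from collections import defaultdict


def group_reads_by_barcode(reads):
    """Group (barcode, sequence) pairs into one list of sequences per barcode."""
    wells = defaultdict(list)
    for barcode, sequence in reads:
        wells[barcode].append(sequence)
    return dict(wells)


# Hypothetical pooled reads: (well barcode, read sequence).
pooled_reads = [
    ("AAGT", "ACGTACGTAC"),
    ("CCTA", "TTGACCATGA"),
    ("AAGT", "GTACGGATCC"),
    ("CCTA", "CATGAGGCTA"),
    ("AAGT", "ATCCGTTAAC"),
]

per_well = group_reads_by_barcode(pooled_reads)
for barcode, sequences in sorted(per_well.items()):
    # Each of these small read sets would be assembled independently.
    print(barcode, len(sequences))
```

The point of the grouping is that each assembly problem shrinks from a whole genome to the few fragments that landed in one well.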

NOTE THREE: I like the SOAPdenovo assembler (127-kmer) for Illumina DNA sequence data. It did a good job for us on several different species with only a moderate consumption of computing resources (an overnight job on 32 processors with shared 128 GB of RAM). The final product is a set of contigs in FASTA format, some quite big, and a lot of little ones. Hopefully the sum of the contigs comes out to something similar to the expected genome size of the organism.  The quite new SOAPdenovo-Trans assembler for RNA-seq also worked quite well for us – at least in comparison to Trinity which is a huge computer hog.
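A quick sanity check on the contig FASTA file is to add up the contig lengths and compare the total with the expected genome size. The sketch below does that with plain Python; the toy sequences and the expected size of 20 bp are placeholders, and the function names are my own rather than part of any assembler.

```python
import io


def contig_lengths(fasta_handle):
    """Yield (name, length) for each record in a FASTA stream."""
    name, length = None, 0
    for line in fasta_handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, length
            parts = line[1:].split()
            name, length = (parts[0] if parts else ""), 0
        else:
            length += len(line)
    if name is not None:
        yield name, length


def summarize(fasta_handle, expected_genome_size):
    """Return contig count, total bases, longest contig, and fraction of expected size."""
    lengths = [n for _, n in contig_lengths(fasta_handle)]
    total = sum(lengths)
    return {
        "contigs": len(lengths),
        "total_bp": total,
        "longest_bp": max(lengths, default=0),
        "fraction_of_expected": total / expected_genome_size,
    }


# Toy FASTA text standing in for the assembler output.
toy_fasta = ">contig1\nACGTACGT\nACGT\n>contig2 extra text\nGGGCCC\n"
report = summarize(io.StringIO(toy_fasta), expected_genome_size=20)
print(report)
```

For a real assembly you would pass an open file handle instead of the `io.StringIO` object. A fraction far below 1 suggests missing sequence or collapsed repeats, and a fraction well above 1 may point to contamination or unmerged haplotypes.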

I visualize the bioinformatics work in two parts. First, find the genes in our data. Second, annotate the found genes and the genome using reference data.  [Annotation will be described in another blog post.]

Ok, so here is my gene finding workflow for the Cheap-O Genome Project.
Gene Finding Workflow

  1. Assemble DNA reads into genome contigs with the SOAPdenovo assembler (127-kmer)
  2. De novo gene finding on the DNA contigs with GENSCAN or GeneMark (I used GeneMark)
  3. Assemble RNA-seq reads into "transcripts" with SOAPdenovo-Trans
  4. Map RNA-seq reads onto the DNA contigs with TopHat
  5. Make another set of transcripts with Cufflinks (using no annotation file)
  6. Use BLAT to map the de novo assembled transcripts onto the DNA contigs
  7. Use the extremely useful script written by Dave Tang to convert the output of BLAT (in .psl format) into a .bed file, choosing only the best match for each query. Without this sorting and conversion, the BLAT results as a PSL file look like garbage in a genome browser.
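To show what the conversion in step 7 involves, here is a rough sketch of my own. It is not Dave Tang's script. It keeps the PSL row with the most matching bases for each query and writes it as a BED12 line. It assumes the standard 21-column PSL layout from an untranslated (DNA vs. DNA) BLAT run, and the toy rows are invented, with placeholder values in the columns the sketch does not read.

```python
def best_hits_to_bed(psl_lines):
    """Keep the PSL row with the most matches per query; return sorted BED12 lines."""
    best = {}
    for line in psl_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip the optional psLayout header and anything that is not a data row.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        matches, query = int(fields[0]), fields[9]
        if query not in best or matches > best[query][0]:
            best[query] = (matches, fields)

    records = []
    for matches, f in best.values():
        strand, query, target = f[8], f[9], f[13]
        t_start, t_end = int(f[15]), int(f[16])
        sizes = f[18].rstrip(",")
        # BED block starts are relative to the start of the feature.
        starts = ",".join(str(int(s) - t_start) for s in f[20].rstrip(",").split(","))
        bed = [target, str(t_start), str(t_end), query, str(min(matches, 1000)),
               strand, str(t_start), str(t_end), "0", f[17], sizes, starts]
        records.append((target, t_start, "\t".join(bed)))
    return [line for _, _, line in sorted(records)]


def toy_row(matches, strand, query, target, t_start, t_end, sizes, t_starts):
    """Build a fake PSL row; columns the sketch ignores get placeholder values."""
    return "\t".join([
        str(matches), "0", "0", "0", "0", "0", "0", "0", strand,
        query, "100", "0", str(sum(sizes)),
        target, "5000", str(t_start), str(t_end), str(len(sizes)),
        "".join(f"{s}," for s in sizes),
        "0,",  # qStarts placeholder, not used here
        "".join(f"{s}," for s in t_starts),
    ])


toy_psl = [
    toy_row(90, "+", "tx1", "contig1", 100, 250, [50, 40], [100, 210]),
    toy_row(40, "+", "tx1", "contig2", 10, 50, [40], [10]),   # weaker hit, dropped
    toy_row(60, "-", "tx2", "contig1", 20, 80, [60], [20]),
]
bed_lines = best_hits_to_bed(toy_psl)
for bed_line in bed_lines:
    print(bed_line)
```

Ranking by raw match count is the simplest possible choice; a more careful version might penalize mismatches and gaps before picking the winner.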

OK, now assemble all 5 data sets into one nice visualization using IGV or GBrowse. We have a genome track (the DNA contigs from SOAPdenovo in FASTA format), an RNA track (the RNA-seq reads aligned to the draft genome in BAM format), a gene prediction track (GeneMark GTF file), the Cufflinks transcripts (transcripts.gtf), and the RNA assemblies (from SOAPdenovo-Trans) as a BED file.   For some genes, all of the data agree quite nicely.  For other genes, it's guess your best. 

Here are two IGV screenshots of examples from the same genome contig.  The first is a nice gene with plenty of RNA where all 3 annotation methods agree.

The second is a messy region where no gene model makes much sense, none of the methods agree at all, but there seems to be enough RNA (and spliced alignments!) to suggest real transcription is happening. Time to add some reference data by homology modeling (in my next post on annotation). 

Nov 16, 2013

Dr. Evan Eichler speaks about genomic structural variation

Yesterday I attended an excellent symposium on genomic structural variation organized by the Simons Foundation. The unifying theme from all of the speakers was the use of Pacific Biosciences long read technology to resolve large-scale duplicated sequences in the human genome. These long PacBio reads (5-10 kb) can be assembled across genetic regions with complex patterns of repeat structures, segmental duplications, inversions and deletions.

For me, the highlight of the afternoon was a talk by Evan Eichler from the University of Washington. Dr. Eichler presented both detailed sequencing data from specific loci and a grand overview of structural variation that synthesizes copy number variation, multi-gene families, the biology of autism and human evolution. His first point was that the reference genome is missing substantial sections of duplicated DNA, which has significant variation from person to person. Assembly software will tend to collapse multiple, nearly identical paralogous gene copies into one locus. Dr. Eichler's group has constructed more accurate sequences for regions with these complex patterns of segmental duplication using long PacBio reads. He has identified paralogous copies of genes, which actually exist as multi-gene families, and then created specific tags to track the copy number of various gene isoforms in different human genomes (such as from the 1000 Genomes Project). For example, the SRGAP2 locus has 4 isoforms, each of which may be repeated several times in the genomes of some people.
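One way to picture the "specific tags" idea is with k-mers that occur in exactly one paralog: counting how often each paralog's private k-mers appear in a person's reads gives a rough proxy for that copy's dosage. The sketch below is my own illustration of that general idea, not necessarily the method used by Dr. Eichler's group. The sequences and names are invented, and for simplicity it ignores reverse-complement matches and sequencing errors.

```python
from collections import Counter


def kmers(sequence, k):
    """Return the set of k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def paralog_specific_kmers(paralogs, k):
    """Map each paralog name to the k-mers found in that paralog and no other."""
    per_paralog = {name: kmers(seq, k) for name, seq in paralogs.items()}
    seen = Counter()
    for kmer_set in per_paralog.values():
        seen.update(kmer_set)
    return {name: {km for km in kmer_set if seen[km] == 1}
            for name, kmer_set in per_paralog.items()}


def count_tag_hits(reads, tags, k):
    """Count, per paralog, how many read k-mers match one of its private tags."""
    lookup = {km: name for name, kmer_set in tags.items() for km in kmer_set}
    hits = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            name = lookup.get(read[i:i + k])
            if name is not None:
                hits[name] += 1
    return hits


# Two hypothetical paralogs that differ at a single base.
toy_paralogs = {"copyA": "ACGTACGGTTCA", "copyB": "ACGTACGCTTCA"}
tags = paralog_specific_kmers(toy_paralogs, k=5)
hits = count_tag_hits(["TACGGTT"], tags, k=5)  # a read drawn from copyA
print(dict(hits))
```

In real data the hit counts would need to be normalized by sequencing depth and by the number of tags per paralog before they say anything about copy number.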

Second, he explained that these regions of frequent copy number variation are often the site of deletions in the genomes of people with autism.  These deletions and duplications may be quite large and typically include dozens of other genes besides the family of paralogs. In fact, the genome has hotspots of CNVs that are flanked by high-identity duplicated regions. In addition, some people may have additional duplications at hotspots, which create a predisposition for deletion or expansion events in their progeny.

Why do these deletions and duplications cause autism? Dr. Eichler suggested that brain development is a process that involves many genes, and it is particularly sensitive to gene dosage.

Dr. Eichler proposed a link to human evolution that is quite tantalizing. Many of the families of duplicated genes at the CNV hotspots are involved in brain development. These same genes are not duplicated in apes. A process of gene duplication and sequence variation allows for positive selection for new brain development phenotypes. So the gene duplication process which created expanded and more complex human brains may also make us susceptible to neurologically damaging CNV mutations.