Nov 16, 2013

Dr. Evan Eichler speaks about genomic structural variation

Yesterday I attended an excellent symposium on genomic structural variation organized by the Simons Foundation. The unifying theme from all of the speakers was the use of Pacific Biosciences long-read technology to resolve large-scale duplicated sequences in the human genome. These long PacBio reads (5-10 kb) can be assembled across genomic regions with complex patterns of repeat structures, segmental duplications, inversions, and deletions.

For me, the highlight of the afternoon was a talk by Evan Eichler from the University of Washington. Dr. Eichler presented both detailed sequencing data from specific loci and a grand overview of structural variation that synthesizes copy number variation, multi-gene families, the biology of autism, and human evolution. His first point was that the reference genome is missing substantial sections of duplicated DNA, which vary significantly from person to person. Assembly software tends to collapse multiple, nearly identical paralogous gene copies into one locus. Dr. Eichler's group has constructed more accurate sequences for regions with these complex patterns of segmental duplication using long PacBio reads. He has identified paralogous copies of genes, which actually exist as multi-gene families, and then created specific tags to track the copy number of various gene isoforms in different human genomes (such as those from the 1000 Genomes Project). For example, the SRGAP2 locus has 4 isoforms, each of which may be repeated several times in the genomes of some people.

Second, he explained that these regions of frequent copy number variation are often the site of deletions in the genomes of people with autism. These deletions and duplications may be quite large and typically include dozens of other genes besides the family of paralogs. In fact, the genome has hotspots of CNVs that are flanked by high-identity duplicated regions. In addition, some people carry extra duplications at these hotspots, which create a predisposition for deletion or expansion events in their progeny.

Why do these deletions and duplications cause autism? Dr. Eichler suggested that brain development is a process that involves many genes, and it is particularly sensitive to gene dosage.

Dr. Eichler proposed a link to human evolution that is quite tantalizing. Many of the families of duplicated genes at the CNV hotspots are involved in brain development. These same genes are not duplicated in apes. A process of gene duplication and sequence variation allows for positive selection for new brain development phenotypes. So the gene duplication process that created expanded and more complex human brains may also make us susceptible to neurologically damaging CNV mutations.

Oct 26, 2013

A bit more on genome annotation

One bit of follow-up on the pig genome annotation story. We usually visualize RNA-seq results in the IGV browser. It allows direct inspection of read alignments to your favorite genes and can also be helpful to spot sequence variations and splicing issues. However, IGV has a set of pre-loaded default genomes that also seem to be derived from RefSeq. So once again, working with data from the pig, there was no annotation for most of our genes of interest. This is fairly annoying, since it means that the only way to look at the annotation of a gene is to first look up the gene in UCSC and then copy the exact chromosome coordinates into IGV, including intron-exon borders.

It is possible to fix this by downloading the ENSEMBL gene annotations from the UCSC Table Browser to the local computer as a BED file (not too large), and then loading the BED file into IGV as another data track. This works nicely in terms of showing the genes and exons, but the gene labels still carry the ugly ENSEMBL names. Once again, the ensemblToGeneName track comes in handy, providing a table with the ENSEMBL name and the official gene symbol for about 20,000 genes. We were able to add the gene symbol to the BED file, but this has to be done carefully (in Perl or Awk), since making file edits in Excel seems to break the BED file (at least for me). Loading the edited BED file into IGV, I was then able to jump to genes by name and get screenshots of interesting regions that included a gene structure track with nice gene names.
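For the record, the renaming step only takes a few lines; here is a minimal Python sketch (the two-column mapping format and the gene name sitting in BED column 4 are assumptions about the files, so check your own downloads first):

```python
def load_name_map(lines):
    """Parse tab-separated 'ENSEMBL_id<TAB>symbol' lines (as in the
    ensemblToGeneName table) into a dict."""
    mapping = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            mapping[fields[0]] = fields[1]
    return mapping

def rename_bed(bed_lines, name_map):
    """Replace column 4 (the BED name field) with the gene symbol, if known.
    All other columns and the tab delimiters are left untouched."""
    out = []
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 4 and fields[3] in name_map:
            fields[3] = name_map[fields[3]]
        out.append("\t".join(fields))
    return out
```

Unlike a round trip through Excel, this never re-interprets the other columns, so the BED file stays valid.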

Sep 26, 2013

Pig Genome Problems

I have had a few issues with genome annotation in the past couple of weeks that highlight how dependent NGS methods are on these reference files. In a mouse RNA-seq project, we found that RNA reads from some olfactory receptor genes were getting low expression counts (using TopHat/Cufflinks) because RNA reads were not mapping within the known exons of the genes. When we looked at the actual alignment BAM files in IGV, it was clear that the exon boundaries in the reference (RefSeq) genes were wrong. Both the 3' and 5' ends of some genes were missing regions with lots of RNA alignments. It would be nice to make a new reference genome using our RNA-seq reads to create our own annotation, but it is actually quite difficult to get one RNA sample with all genes expressed at useful levels.

In a different RNA-seq project on pig (a model system for cardiology), we ran the standard TopHat/Cufflinks pipeline using the RefSeq gene annotation from the Illumina iGenomes project (a convenient way to get FASTA, GTF, and Bowtie index files for many genomes). When we looked at the alignment QC using CollectRnaSeqMetrics from the Picard toolkit, we found that only about 20% of the reads were mapped to exons. A closer look at the annotation file showed that the RefSeq genome for pig (Sus scrofa) only has about 4,000 genes. That just does not work for RNA-seq. ENSEMBL has a much more complete gene annotation for pig, but the ENSEMBL GTF file is not read nicely by TopHat/Cufflinks. Thanks to my very helpful friend Igor Dolgalev, I learned that the UCSC Genome Browser has a reformatted ENSEMBL GTF annotation for pig (and most other genomes), but it is missing gene names. The GTF from the UCSC Table Browser looks like this:
chr1 susScr3_ensGene stop_codon 268374986 268374988 0 - . gene_id "ENSSSCT00000005917"; transcript_id "ENSSSCT00000005917";
This GTF file has over 30,000 genes and 500,000 lines for individual exons.
UCSC also has a conversion table, susScr3.ensemblToGeneName, with names for about 20,000 of the ENSEMBL genes. I have not figured out a simple way to write these names back into the GTF file without breaking it, so for now I am matching the names into the final gene expression results.
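In principle, appending a gene_name attribute to the end of each line should leave the required tab-delimited fields untouched. A Python sketch of what I have in mind (not yet tested on the full 500,000-line file; it assumes the conversion table maps transcript IDs to symbols):

```python
import re

def add_gene_names(gtf_lines, name_map):
    """Append a gene_name attribute to each GTF line whose transcript_id
    has a known symbol; lines without a match are passed through as-is."""
    out = []
    for line in gtf_lines:
        m = re.search(r'transcript_id "([^"]+)"', line)
        symbol = name_map.get(m.group(1)) if m else None
        line = line.rstrip("\n")
        if symbol:
            # Extra attributes at the end of column 9 are legal GTF.
            line += ' gene_name "%s";' % symbol
        out.append(line)
    return out
```

Since only the attributes column grows, the first eight columns (and hence TopHat/Cufflinks parsing) should be unaffected.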
My only conclusion from these issues with reference genomes is that we depend on these things to be right, and when they are not, NGS data analysis can be badly messed up. There is no simple solution for providing a perfect reference for every genome for every purpose, so bioinformaticians need to stay on their toes and dig deeper when results from standard methods don't make sense.

Aug 22, 2013

Changes in software can change results

I gave a talk this week at the NGx conference in Providence RI.  Thanks to MaryAnn Brown for inviting me and Chris Dwan for chairing the session.  I am trying to share the slides from my talk as a "Google Slides" document at this URL:

Let me know if the file is accessible and how it looks. I have lots of ppt files from my bioinformatics course lectures, and I want to know if this is a good way to share them.

The conference session was about analysis pipelines for sequencing data, so I decided to look at the changes that we have made in RNA-seq software over the past couple of years. We started out using the software provided by Illumina (ELAND alignment and CASAVA gene expression counting), then added edgeR for normalization and differential expression. In the past year or so we have been using the TopHat/Cufflinks/Cuffdiff package for a lot of projects. Recently we have been testing out the STAR aligner.

I just wanted to see how much difference would show up in the final results (lists of differentially expressed genes) if we re-analyzed the same data set with each different package. This turns out to be quite difficult. Just between the STAR and TopHat pipelines, different novel isoforms are created for each gene in each sample. Even when "merged," the lists of differentially expressed genes were very difficult to compare in a quantitative way. However, it is possible to focus on a set of "known genes" using a reference GTF file and not call any new transcripts, and this allows for direct comparison between STAR and TopHat both at the level of reads per gene (FPKM) and at the level of p-values for differentially expressed genes. Overall, the expression levels are very highly correlated, and the p-values correlate at about 0.95.
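The correlation step itself needs nothing fancy; here is a pure-Python sketch of the Pearson calculation applied to two matched value lists (how the genes are matched up between pipelines, and whether you first log-transform the FPKM values, is left out here):

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length value lists,
    e.g. per-gene FPKM (or -log10 p-values) from two pipelines."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

For heavily skewed quantities like FPKM, correlating log values (or using a rank correlation) usually gives a fairer picture than raw values.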

Comparisons with the Illumina software kit are even more difficult. I learned that BAM is not actually a standard file format. Each aligner has its own style of BAM file, and some are not compatible with each other. The official specification of the BAM format includes many "optional" fields and tags; Cufflinks requires an "XS" tag for reads that are split by an intron, while ELAND/CASAVA does not write this tag when aligning reads to its splice-junction database. The best I could do was compare final gene lists ranked by p-value when analyzed by CASAVA/edgeR vs. TopHat/Cuffdiff (all computed on the same FASTQ data files). I got a 62% overlap between the top 1000 differentially regulated genes, which is kind of terrible. This suggests that a lot of the biological conclusions made in past RNA-seq experiments would be substantially different if we just changed the analysis software used on the exact same sequence data. Ouch!
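The overlap number comes from a trivial set calculation; a sketch, assuming each pipeline's gene list is already sorted by p-value (best first):

```python
def top_overlap_percent(ranked_a, ranked_b, n=1000):
    """Percent of the top-n genes (ranked by p-value) that two
    pipelines have in common."""
    shared = set(ranked_a[:n]) & set(ranked_b[:n])
    return 100.0 * len(shared) / n
```

One caveat worth noting: gene identifiers must be harmonized between the two pipelines first, or the overlap will be understated for bookkeeping reasons rather than biological ones.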

May 13, 2013

Individual chapters of the Next Generation DNA Sequencing book are now on sale as downloads from the Cold Spring Harbor Press website:

I am really quite enthusiastic about this. Very few (if any) other books sell individual chapters as downloads. Hopefully some people will be willing to spring for $7 for an interesting or relevant chapter and then decide they want to get the whole book.

Mar 8, 2013

Genotyping on MiSeq with overlapping paired-end reads

Our Illumina MiSeq can now do 2 x 250 bp paired-end reads. We are going to use an amplicon method for a simple genotyping assay to identify very rare mutations at a single SNP site. The idea is to amplify a short fragment (less than 300-400 bp) so the forward and reverse reads overlap by 50 or 100 bp, not just by a few bp as we have done in some other assays (such as 16S metagenomics), and put our target SNP in the center of the overlap region. In this way, the SNP will be located in a high-quality region of both reads, and we can get the maximum accuracy in the genotype call.

One problem with using the MiSeq for amplicon assays is that it is very sensitive to the base-pair composition at each cycle. Illumina tech support insists that a spike-in of 50% PhiX DNA is necessary for any amplicon assay. (Our Genomics lab tried it with 40% added PhiX and got poor results.) However, with the latest upgrades, the MiSeq is now producing 15 million reads per run, so with the 50% PhiX spike-in, it still produces over 7 million usable (high-quality) reads.

In the past, we have not used Illumina to test for rare mutations (ultra-deep sequencing) because the overall error rate is in the 0.3-0.5% range. Even if we restrict results to bases >Q30 at our target, we can't find mutations with certainty below the error frequency of about 1 per thousand, since that would be the expected rate of false positives. With overlapping high-quality reads, we can recalculate a new Q-score based on both reads (assuming that they both agree on a variant call). I am still thinking about the proper math for this (a joint probability of error), but it is something similar to the sum of the two Q-scores (product of the two error frequencies). This would allow us to find mutations in the one per ten thousand (or even hundred thousand) range with a low false positive rate. The PANDAseq program uses similar math to calculate Q-scores for overlapping regions, so we are going to use that for the first pass on this data.
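The back-of-the-envelope version of that math looks like this in Python (a sketch of the idea, not the exact PANDAseq formula, and it ignores the correction for the two reads agreeing on the same wrong base):

```python
import math

def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10.0)

def combined_q(q1, q2):
    """Combined Q-score for a base where two overlapping reads agree:
    treat the errors as independent, so the joint error probability is
    roughly the product of the two individual probabilities, which
    means the Q-scores roughly add."""
    p_joint = phred_to_error(q1) * phred_to_error(q2)
    return -10.0 * math.log10(p_joint)
```

Two Q30 bases that agree thus behave like a single Q60 base (one error per million), which is what pushes the detection limit from 1 per thousand toward 1 per hundred thousand.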

Feb 25, 2013

Martin Blaser speaks about the early-life Microbiome

I like working at NYU. Interacting with very smart people has its own benefits. Even if their smarts don't directly rub off, at least I can learn something from listening to their ideas. Martin Blaser is a remarkably accomplished scientist (Past President of the Infectious Disease Society, member of the Institute of Medicine of the National Academy of Sciences, chair of the NIH Advisory Board for Clinical Research, and author of about 500 publications) with wide-ranging interests in infectious disease. He has made substantial contributions to the study of Helicobacter pylori and its role in human diseases including ulcer and gastric cancer, as well as esophageal cancer, Crohn's disease, and inflammatory bowel disease. Dr. Blaser has been an important contributor to the Human Microbiome Project. He has recently been profiled in the New York Times and the New Yorker magazine. Today he gave the Dean's Lecture at NYU Medical Center, where he discussed recent research in his lab on the effects of antibiotics on the microbiome of mice and the implications for human infants and children.

Low levels of antibiotics are commonly used to supplement the feed of livestock animals, causing them to grow faster, convert food to body mass more efficiently, and carry higher levels of body fat. In a mouse system, the Blaser Lab has shown that steady low levels of antibiotics from birth to 28 weeks of age produce substantial changes in the microbe composition of the gut, as well as an increase in body fat. The antibiotic-altered microbiome, when combined with a high-fat diet, has a synergistic effect on body fat (higher than either treatment alone). A similar effect is observed when several short pulses of antibiotic are administered at doses similar to what is used to treat infections in human children. The altered microbiome in antibiotic-treated mice was also associated with immune suppression in the gut (lower levels of T-cells, lower levels of immune system molecules such as interleukins).

Connecting back to human epidemiology, Americans have become much more obese in the past 50 years. In 1962, about 14% of Americans were obese, but by 2010, 42% of Americans were obese. Blaser's point is that diet and exercise have not changed dramatically in the past generation (we ate almost as badly and were almost as lazy back in the 60's and 70's), but antibiotic use has changed dramatically. Typical children now receive 10-20 courses of antibiotic treatment. Coincidentally, the prevalence of H. pylori in the gut has decreased from over 50% down to just 5% of Americans. The immune system changes that were observed in the antibiotic-treated mice may correlate with the huge increases that are now observed in human asthma, celiac disease, inflammatory bowel disease, type 1 diabetes, and other forms of allergy and autoimmune disorders. Of course correlation does not prove causation, but the work of Blaser and others is beginning to identify mechanisms that connect changes in the microbiome to these disorders. In mice, infection with H. pylori has been shown to protect from respiratory allergic reactions.

Thanks to modern medicine, Americans now live longer, are much less likely to die of routine infections, and are taller, but we are much more obese, have more cardiac disease, and more autoimmune disease.  So, can we create a different equilibrium with our microbiome that retains the life saving anti-infection benefits of modern antibiotics, but restores the metabolic and immune balance from an earlier era?

Feb 17, 2013

Australian NGS Tutorial

These Galaxy-based NGS tutorials from the Australian "Super Science"  Genomics Virtual Lab are really excellent:

Very clear explanations, nice screen shots, and a hands-on demonstration for students to see for themselves why biological replicates are necessary to find differential expression with RNA-seq.

Jan 20, 2013

It's hard to make NGS simple

Our Informatics group has been running training seminars for some postdocs and upper level grad students to teach computer skills for NGS data analysis. This was at their request, not my idea. What the students really wanted was the ability to run the common NGS software on their own, rather than be completely dependent on the bioinformatics group to do all of the work for them.  Our HPC director, Efstratios Efstathiadis, and I made up some lectures on basic Unix command-line skills and tried to make some simple workflows to demonstrate common NGS applications such as alignment, visualization, ChIP-seq, and RNA-seq.

The first tutorial went pretty well. We created an Amazon Cloud virtual machine and made accounts for all of our students (it had to be on the Cloud since we still have no local servers; see the hurricane story in my previous blog post). We installed BWA, SAMtools, and the necessary supporting modules. We uploaded a very small FASTQ file as a practice data set and a reference genome. Then in class, we taught about a dozen basic Unix commands for the complete novice (pwd, ls, cd, mkdir, man, rm, cp, mv, head, more, ...). Then we had the students align the sample data with BWA, transform the output to sorted, indexed BAM files with SAMtools, then download the .BAM and .BAI files and visualize the final data on their own computers with IGV. This took about 2.5 hours for a class of 10 students and was generally felt to be a solid success by all participants.

The second tutorial was supposed to be a bit closer to a real use case for a ChIP-seq experiment. We created a somewhat larger practice data set that contained data just from human chromosome 1, with 2 FASTQ files: a ChIP sample and an input control (about 1 million reads per file). I gave a little lecture on the basics of ChIP-seq technology, how peak calling software works, and what to look out for in terms of QC of the data. Then we had the students align the two FASTQ files and transform them into sorted, indexed BAM files. Then run MACS to find peaks and produce a BED file as output. Then download the BAM, BAI, and BED files to local laptops and visualize them in IGV, as well as load the BED file into Galaxy and compare with the locations of known gene start sites.

This tutorial turned out to be overly ambitious. The amount of CPU churning required for 10 students to run 2 BWA jobs each on 1 million read FASTQ files was much more than our standard Amazon EC2 VM could support. We should have created all of the output files in advance and shared them with the students. Then they could start each job, kill the job, and move on. Also, downloading all the final data from the EC2 instance was a hassle. We should have just passed around a USB drive with the final BAM files. This class went way beyond our 2.5 hours, so we never found the time to show the students how to load all gene transcription start sites into Galaxy and overlap them with the BED file to annotate ChIP-seq peaks with respect to promoters (one of my favorite tricks).

I'm supposed to follow up with an RNA-seq tutorial in a couple of weeks, and I doubt that I can make TopHat/Cufflinks simple enough to run smoothly in a classroom setting. Overall, I have learned that it is darn hard to make even routine NGS tasks simple and bulletproof. I am leaning more toward some type of Galaxy on the Cloud solution for lab scientists who want to take some control over their data analysis tasks.

Jan 2, 2013

A nice surprise for the new year, our Next Generation DNA Sequencing Informatics book is printed and shipping. I found a big box of them on my desk this morning.

Cold Spring Harbor Press is soon going to put up a service to sell single chapters by PDF download.

It is apparently the best selling book on their website this month: