Next-Gen Sequencing

Some tips to optimize bacterial genome assembly

2018-10-29T16:10:00.000-04:00

I just finished a revised genome assembly for a collaborating lab. We do de novo sequencing of bacterial genomes all the time, sometimes in batches of 50 or 100 different isolates barcoded together on a high-yield HiSeq run. We typically aim for coverage in the 500x to 1000x range with paired-end 150 base reads.

This genome was assembled as part of a "pipeline" script using SPADES, but it did not come out very nicely - over 3000 contigs, most of them small with low coverage. A FastQC report on the raw data looked good (all bases with mean quality >Q30), except that the last 10 bases show a big drop in quality AND Illumina adapters are found, sometimes as much as 70 bases in from the 3' end, in quite a lot of the reads.

I also checked some of the largest contigs by a simple BLASTn at the NCBI website, and they matched quite well to one of our favorite bacterial species (e-value = 0.0; percent identity = 93%). HOWEVER, many of the smaller contigs match the HUMAN genome. This is a big red flag for me, and probably should be checked for every bacterial (or any non-human) de-novo genome assembly.

So here is my improved de novo assembly protocol:

1) To remove the human DNA, I went back to the raw FASTQ files, and aligned to the reference human genome with Bowtie2. The simple way to do this is to use the '--un' flag in the Bowtie2 command, which dumps all unaligned reads to a separate output file. There are other more sophisticated ways to do this (using SAMtools) that interrogate if one or both of a set of paired-end reads align concordantly to the genome, but I wanted to be certain to remove any and all human reads. So I aligned each of my paired end data files (R1 and R2) separately and collected the unmatched reads for each. I found 0.5% human reads in my raw data. This may not seem like a lot, but the assembler will make thousands of contigs from this small amount of contaminant DNA.

This filter will probably remove a few reads (genome regions) that contain simple sequence repeats longer than 25 bases (for example ATATATAT..., or CCCCCCCCC...) that are found in both human and my bacteria, but we know that assembly of repeats is not reliable anyway with 150 base Illumina reads.

2) Using the unmatched files from the human screen, I filtered for quality and removed Illumina adapters with Trimmomatic. This also catches the case where one read matches human and is removed, but its mate is not. These 'unpaired' reads are not included in the 'trimmed' output file.
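Because R1 and R2 were screened separately, the two unmatched files may no longer list the same reads in the same order. One way to put them back in sync before trimming is to keep only reads whose names appear in both files. This is a minimal Python sketch, not the script used for this assembly. It assumes plain four-line FASTQ records and read names that agree between R1 and R2 up to the first space or a trailing /1 or /2. The function names are made up for illustration, and it holds all of R2 in memory.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, str, str, str]  # header, sequence, '+' line, qualities


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield four-line FASTQ records from an iterable of lines."""
    block: List[str] = []
    for line in lines:
        block.append(line.rstrip("\n"))
        if len(block) == 4:
            yield (block[0], block[1], block[2], block[3])
            block = []


def read_id(header: str) -> str:
    """Read name without the leading '@', any comment, or a /1 or /2 suffix."""
    name = header[1:].split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def keep_pairs(r1: Iterable[Record], r2: Iterable[Record]) -> Tuple[List[Record], List[Record]]:
    """Return the R1 and R2 records whose mate survived, both in R1 order."""
    r2_by_id: Dict[str, Record] = {read_id(rec[0]): rec for rec in r2}
    kept1: List[Record] = []
    kept2: List[Record] = []
    for rec in r1:
        mate = r2_by_id.get(read_id(rec[0]))
        if mate is not None:
            kept1.append(rec)
            kept2.append(mate)
    return kept1, kept2
```

In practice the two inputs would come from open file handles, for example keep_pairs(read_fastq(open(r1_path)), read_fastq(open(r2_path))), where r1_path and r2_path are whatever names were given to the unmatched files.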

3) The trimmed output files are now ready for assembly with SPADES. I used kmer sizes 21, 33, 55, 77, 99 as recommended by the SPADES manual for 150 bp paired Illumina reads. I also included the --careful flag which is recommended for bacterial genomes with high coverage.

4) I got good results with this assembly - only about 600 contigs, with the largest one containing more than half of the expected genome (N50 > 1 Mb). This FASTA data file of contigs was small enough to use the NCBI web BLAST server for a single search. I was also able to Format the output as a text report showing just the top 10 matches for each contig (with no alignments).

5) In addition to contig length, SPADES also reports the depth of kmer coverage in the header line for each contig in the multi-sequence FASTA file. I observed that all contigs with coverage greater than 100x matched to the same bacterial genus, but contigs with low coverage matched to all sorts of different things - quite a few to E.coli. I conclude that the low coverage contigs are low abundance contaminants in our sequencing library, and should not be included in the published de novo genome. [I do not trust pre-filtering by Bowtie against E.coli, we might lose some highly conserved genes]

6) Somebody more diligent than myself (perhaps one of my students in need of a final project in the Intro Bioinformatics course) can write a Biopython script to filter sequences in a FASTA file by coverage as reported in the SPADES FASTA header. I did it with some careful work with awk and Excel. My final genome is 2.2 Mb, with just 39 high coverage contigs which all have highly significant best BLAST matches to the same bacterial genus and over 2000 genes predicted by GeneMarkS.
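For anyone who wants to skip the awk and Excel step, here is a minimal sketch of such a filter in plain Python (no Biopython needed). It assumes the contig headers look like NODE_1_length_1234_cov_56.78, with the coverage value following "_cov_", and it uses the 100x cutoff from step 5 as a default. The function names are made up for illustration, and the header pattern should be checked against your own SPADES output before trusting the result.

```python
import re
from typing import Iterable, Iterator, Tuple

# Assumed header layout: NODE_<n>_length_<bp>_cov_<kmer coverage>
COV_PATTERN = re.compile(r"_cov_([0-9]+(?:\.[0-9]+)?)")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def filter_by_coverage(records: Iterable[Tuple[str, str]],
                       min_cov: float = 100.0) -> Iterator[Tuple[str, str]]:
    """Keep only contigs whose header reports coverage above min_cov."""
    for header, seq in records:
        match = COV_PATTERN.search(header)
        if match and float(match.group(1)) > min_cov:
            yield header, seq
```

Contigs whose header has no recognizable coverage value are dropped rather than kept, which seemed the safer default for a contaminant screen.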

The takehome message:
1) filter out Human reads from raw sequence data that will be used for de-novo assemblies (I don't know how well this will work for mammals, vertebrates, etc).
2) filter out low coverage contigs from your final contig assembly FASTA file.


Genome Coverage from BAM file

2018-01-31T16:45:00.001-05:00

There are many excellent tools for analysis of Next Gen Sequencing data in the standard BAM alignment format so I was surprised how difficult it was for me to get a nice graph of genome coverage. This will be trivial for a lot of hard core bioinformatics coders, so just move along if you are bored/annoyed.

I needed to check the evenness of coverage across intervals of a bacterial genome that we were re-sequencing for various experimental reasons. I aligned my FASTQ to a reference genome from GenBank using Bowtie2. There are several nice tools in the SAMTools and BEDTools kits that produce either a base by base coverage count or a histogram of coverage showing how many bases are covered how deeply. I wanted a map at 1 Kb resolution. It took a while to figure out that I first need to make a BED file of intervals from my genome - with correct names for the 'chromosomes' that match the SAM header, and then use 'samtools bedcov' to get my intervals. Then a simple graph from Excel or R shows the coverage per interval along the genome.

Here are the steps (as much for me to remember as for usefulness to anyone else)

1) Create a Bowtie2 index of the reference genome (can be from GenBank, or can be a de novo assembly of contigs created locally from this FASTQ data). Reference_in must be FASTA format.
bt2_base is the name you will call the index.

bowtie2-build <reference_in> <bt2_base>

2) Align the FASTQ file(s) to the Reference. bt2-idx is the name of the index created in the previous step. There are a ton of options for stringency of alignment, format of input data, etc. I set this to use 16 CPU threads. I usually like to leave out the unaligned reads, which can reduce the output file size somewhat. [Of course, the unaligned reads are the goal when you use Bowtie as a filter to remove human from microbiome data or any other sort of contaminant screen.]

bowtie2 -p 16 --no-unal -x <bt2-idx> -1 <m1> -2 <m2> -S <sam>

3) It is annoying that Bowtie produces output in SAM format, and 99% of the time, the very first thing you have to do is convert to sorted BAM. Note that this older style of samtools sort puts its own .bam on the end, so if you are not careful you will get files named output.bam.bam. (samtools 1.x instead takes the output file name with -o.)

samtools view -bS output.sam | samtools sort - file_sorted

4) Create a 'genome' file for your reference genome. This is just a tab delimited file that lists the name and length of each chromosome (or of each individual sequence in a multi-FASTA file, such as contig names).
It is super easy to mess this up. The best way is to view the top of your output.sam file created by bowtie2. The lines that start with @SQ are your chromosome headers, and they very helpfully already show the length of each one. This is a bit of a pain if you have a genome with lots of contigs, but a little 'cut' and 'paste' in bash or Excel will get you there.

Here is what mine looked like:

@HD VN:1.0 SO:unsorted
@SQ SN:MKZW02000001.1 LN:5520555
@SQ SN:MKZW02000002.1 LN:248293
@PG ID:bowtie2 PN:bowtie2 VN:2.2.7 CL:"/local/apps/bowtie2/2.2.7/bowtie2-align-s --wrapper basic-0 -p 8 -x Kluy

and here is the genome file I made:

MKZW02000001.1 5520555
MKZW02000002.1 248293
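If the genome has many contigs, the cut and paste step can be scripted. This is a small Python sketch that turns the @SQ lines of a SAM header into the two-column genome file. It assumes a tab-delimited SAM header in which every @SQ line carries SN and LN tags; the function name is made up for illustration.

```python
from typing import Iterable, List


def genome_file_lines(sam_lines: Iterable[str]) -> List[str]:
    """Return 'name<TAB>length' strings, one per @SQ header line."""
    out: List[str] = []
    for line in sam_lines:
        if not line.startswith("@"):
            break  # the header is over; alignment records follow
        if line.startswith("@SQ"):
            fields = line.rstrip("\n").split("\t")[1:]
            tags = dict(field.split(":", 1) for field in fields)
            out.append(f"{tags['SN']}\t{tags['LN']}")
    return out
```

Feeding it the open output.sam file and writing the returned lines to genome.txt gives the same file as above. The header alone can also be printed with samtools view -H.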

5) Make a set of intervals with bedtools makewindows. I wanted 1 Kb intervals, so I use -w 1000.

The result is a simple BED file with one line for each 1Kb window of the genome.

bedtools makewindows -g genome.txt -w 1000 > genome_1k.bed

$ more genome_1k.bed

MKZW02000001.1 0 1000

MKZW02000001.1 1000 2000

MKZW02000001.1 2000 3000

MKZW02000001.1 3000 4000
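To make the window logic explicit, here is a rough Python equivalent of what makewindows is doing in this case. It is only an illustration of the BED intervals (0-based start, end not included), assuming the last window of each sequence is simply cut off at the sequence end. The function name is hypothetical.

```python
from typing import Iterable, List, Tuple


def make_windows(genome: Iterable[Tuple[str, int]], window: int) -> List[Tuple[str, int, int]]:
    """Tile each (name, length) sequence with fixed-size BED windows."""
    windows: List[Tuple[str, int, int]] = []
    for name, length in genome:
        for start in range(0, length, window):
            # the final window is truncated at the end of the sequence
            windows.append((name, start, min(start + window, length)))
    return windows
```

For example, a 2,500 base contig with a 1,000 base window gives three intervals, the last one only 500 bases long, which matters later when dividing summed coverage by window size.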

6) Use samtools bedcov to count the total number of bases in the BAM file that are located in each of the intervals ('sum of per base read depths per BED region'). This works much faster than any other coverage tool that I have tested.

samtools bedcov genome_1k.bed kv_sorted.bam > kv_1k.cov

7) Divide the sum of coverage by the window size (/1000 in my case), and plot the average coverage per window as a scatter plot, using the end of each interval as the X axis and the coverage as the Y.

A histogram will only work nicely if you have very few intervals. In my case, high and low coverage outlier intervals are easily visible.
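The division in step 7 can also be done outside of Excel. This is a minimal Python sketch that reads the bedcov output and returns the mean depth per window. It assumes a three-column BED file went in, so the summed depth is the fourth tab-delimited column, and it divides by the actual interval length (end minus start) rather than a fixed 1000 so that the short last window of each sequence is not underestimated. The function name is made up for illustration.

```python
from typing import Iterable, Iterator, Tuple


def mean_coverage(bedcov_lines: Iterable[str]) -> Iterator[Tuple[str, int, float]]:
    """Yield (chrom, window_end, mean_depth) for each line of bedcov output."""
    for line in bedcov_lines:
        if not line.strip():
            continue
        chrom, start, end, total = line.rstrip("\n").split("\t")[:4]
        length = int(end) - int(start)
        yield chrom, int(end), int(total) / length
```

The window end and mean depth are then the X and Y values for the scatter plot.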

Genome Annotation Challenges

2018-01-04T14:42:00.000-05:00

Public databases of genetic information have a fundamental garbage-in, garbage-out problem. A huge number of useful databases are populated by pulling information from other databases and adding new value by computational inferences, but automated linking of databases can propagate incorrect information. The curators of primary repositories such as GenBank make a substantial effort to publish only correct information, so they are very conservative about annotating genes with only verifiable information. NCBI also has a policy that the original depositor of any given entry (a gene, protein, genome, experimental dataset, etc.) is the author of its annotation and metadata, and no one else can alter it.

Staphylococcus aureus is an important human pathogen with perhaps the largest number of whole genome sequences in public repositories of any bacteria. NCBI has 8367 Staph genomes in its “Genomes” section (on Jan 1, 2018), and another ~40,000 in the SRA and Whole Genome Shotgun sections. However, GenBank has chosen strain NCTC 8325 as the Reference Genome for Staph, and put its genes in RefSeq.

This genome was sequenced, annotated, and submitted on 27-Jan-2006 by Gillaspy et al from the Oklahoma Health Sciences Center. As a result of this “Reference Genome” designation, an automatic lookup of a Staph gene in GenBank is likely to get the annotation from NCTC 8325. This particular Staph genome has 2,767 protein coding genes (plus 30 pseudogenes, 61 tRNA, and 16 rRNA genes). However, 1496 of these proteins are annotated with only “hypothetical protein” in their “gene product” or “description” field. This is very confusing, since many of these genes are 100% identical to proteins that have specific and well documented functions in other Staph strains.

Here is one example:

hypothetical protein SAOUHSC_00010 [Staphylococcus aureus subsp. aureus NCTC 8325]

NCBI Reference Sequence: YP_498618.1

FEATURES Location/Qualifiers
source 1..231
/organism="Staphylococcus aureus subsp. aureus NCTC 8325"
/strain="NCTC 8325"
/sub_species="aureus"
/db_xref="taxon:93061"
Protein 1..231
/product="hypothetical protein"

GenBank knows that this is not the correct annotation for this protein. In the “Region” sub-field of the record (which is very rarely used by automated annotation tools that take data from GenBank) an appropriate function, COG and a CDD conserved domain are noted:

Region 7..231
/region_name="AzlC"
/note="Predicted branched-chain amino acid permease (azaleucine resistance) [Amino acid transport and metabolism]; COG1296"
/db_xref="CDD:224215"

NCBI also links this gene to an “Identical Protein Group” where 3957 proteins are listed with 100% amino acid identity, which are annotated variously as: “azaleucine resistance protein AzlC”, “branched-chain amino acid ABC transporter permease”, “AzlC”, and “Inner membrane protein YgaZ”. A very conservative annotation bot might panic at this level of inconsistency and default to the lowest common denominator of “hypothetical protein”. However, a more sophisticated automaton might compare the protein sequence to PFAM or COG protein functional families and assign a common annotation to them all.

The incorrect “hypothetical” annotations for Staph genes in GenBank can be found downstream in many other databases, such as the Database of Essential Genes, AureoWiki, KEGG, UniProt, etc. which all upload their primary annotation from GenBank. So someone sequencing a new strain of Staph and using any of these resources to annotate predicted genes will probably end up assigning “hypothetical protein” for the AzlC gene and many hundreds of others, perpetuating the cycle of misinformation.

In a lot of other cases, it does not seem possible for an algorithm to resolve messy annotations that a human expert might be able to figure out. For example, Staph strain COL has many hypothetical genes such as SACOL1097. NCBI Identical Proteins also show only “hypothetical protein” annotations. However, a BLAST search shows 95% identity to nitrogen fixation protein NifR.

hypothetical protein SACOL1097 [Staphylococcus aureus subsp. aureus COL]

GenBank: AAW37977.1


LOCUS AAW37977 59 aa linear BCT 31-JAN-2014

DEFINITION hypothetical protein SACOL1097 [Staphylococcus aureus subsp. aureus COL].

ACCESSION AAW37977

VERSION AAW37977.1

DBLINK BioProject: PRJNA238

BioSample: SAMN02603996

DBSOURCE accession CP000046.1

SOURCE Staphylococcus aureus subsp. aureus COL

Contaminated Genomes

2017-12-22T13:25:00.000-05:00

This is a long post, so first a quick summary. Some genome sequences contain contaminants. These contaminants create many problems when we use a trusted resource like GenBank or UniProt to summarize the sequences in a taxonomic group. I have illustrated one typical example, but there are thousands (maybe tens of thousands) of others.

I have been obsessing over errors and contamination in our public sequence databases. This week I was trying to use UniProt as a set of reference sequences for fungi. Our goal is fairly simple: To find the fungal DNA in a metagenomic shotgun sequence sample - which is just a mixture of all the DNA present in a scraping from mouth, throat, or any other body site.

UniProt makes it quite easy to sort all their proteins by taxonomy, and to download a subset of the data clustered at 100% (combining all exact duplicate sequences), 90%, or 50% amino acid identity. One might expect that fungal genes should not match bacteria at more than 50% identity. But surprisingly there are quite a lot of 50% and 90% clusters that contain both bacterial and fungal sequences (about 3000 of the 90% fungal clusters also contain bacterial proteins).

The UniProt support staff provided some very useful help to build a query on their system that finds only those clusters of 90% identical proteins that contain fungal genes, but NO (NO!) bacterial genes. In case you like this sort of thing, here is the exact query:

uniprot:(taxonomy:"Fungi [4751]") NOT taxonomy:"Bacteria [2]" AND identity:0.9
[Note the careful use of quote marks, parentheses, and square brackets; this stuff is rather tricky]

So I downloaded this set of putative fungal proteins (UniProt very helpfully creates a single 'representative' UniRef sequence in FASTA format for each cluster). I tested the fungal proteins against all the gene coding sequences (CDS) from the E.coli genome using BLASTx. Once again, there are far too many high similarity matches.

One of the top matches is to a gene (Guanosine-3',5'-bis(diphosphate) 3'-pyrophosphohydrolase) from the fungus Beauveria bassiana that has 98% identity to E.coli. Since I am in an obsessive mood about this sort of thing, I decided that for this one example, I would collect some evidence to decide if we have strong sequence homology between bacteria and fungi for this gene, if Beauveria bassiana has a horizontal gene transfer, or if E. COLI CAN BE A CONTAMINANT IN GENOME SEQUENCES (!!!) [emphasis mine]

I put this Beauveria gene into a generic NCBI BLAST against all 'nr' proteins, and I got a very interesting result. There are exactly two matches to eukaryotes (Beauveria and a nematode), and 11,858 matches to bacteria, including lots of E.coli.

So I traced the Beauveria bassiana protein in UniProt back to its source as a whole genome shotgun sequence uploaded to GenBank on Nov 3, 2014 by the Institute of Plant Protection, Jilin Academy of Agricultural Sciences, Accession PRJNA178080, WGS ANFO00000000, Assembly GCA_000770705.1.

I downloaded the whole genome assembly and BLASTed it with the E.coli hydrolase gene from above. This very quickly pinpointed a contig 00271 (ANFO01000251.1 Beauveria bassiana D1-5 contig00271) that contains the matching sequence. The contig is 72,232 bases long. I then put this contig into NCBI BLAST against Bacteria. I get matches that correspond to lots of bacterial genes (POL I, RecG, iPGM, XanP, CpxA, GTP binding protein, GSI beta, and my ppGpp hydrolase) all with >90% identity and BLAST e-value 0.0.

Final answer: This is a contaminant. There was some E.coli DNA sequenced and assembled with the Beauveria DNA, and nobody checked before loading these sequences into GenBank.

My recommendation to GenBank and de novo genome sequencers everywhere is to check all predicted proteins from new genomes for matches to bacteria and human before loading them into a trusted database.

Gene expression analysis shows that an AHR2 mutant fish population from the Hudson River has a dramatically reduced response to dioxin

2017-08-24T12:00:00.001-04:00

Together with Ike Wirgin in the NYUMC Dept. of Environmental Medicine, we just published a paper on a gene expression study of a fish population in the Hudson River in Genome Biology and Evolution. "A Dramatic Difference in Global Gene Expression between TCDD-Treated Atlantic Tomcod Larvae from the Resistant Hudson River and a Nearby Sensitive Population"

Atlantic tomcod (Microgadus tomcod) is a fish that Ike has been studying for many years as an indicator of biological responses to toxic pollution of the Hudson River estuary, which contains two of the largest Superfund sites in the nation because of PCB and dioxin (TCDD) contamination. It was previously shown that tomcod from the Hudson River had extraordinarily high prevalence of liver tumors in the 1970's, exceeding 90% in older fish. But in 2006 they found that the Hudson River population of this fish had many fewer tumors and a 100-fold reduction of induction of the Cytochrome P450 pathway in response to dioxin and PCB exposure (Wirgin and Chambers 2006). In a 2011 Science paper, they reported a two amino acid deletion of the AHR2 gene in the Hudson River population that was absent, or nearly so, in all other tomcod populations. The aryl hydrocarbon receptor is responsible for induction of CYP genes in all vertebrates and in activation of most toxicities from these contaminants (Wirgin et al 2011).

Our goal for this project was to build a de novo sequence of the genome of the tomcod, annotate all the genes, and do a global analysis of gene expression (with RNAseq) to look at the genome-wide effects of the AHR2 mutation in Hudson River larvae as compared to wild type fish (collected from a clean location at Shinnecock Bay, on the South Shore of Long Island, in the Hamptons). All DNA and RNA sequencing was done at the NYUMC Genome Technology Center under the direction of Adriana Heguy.

From a bioinformatics point of view, this project was interesting because we decided to integrate both a genome assembly and multiple transcriptome assemblies to get the most complete set of full-length protein coding genes. For the transcriptome, we did de novo assembly with rnaSPAdes on many different RNAseq samples including embryo, juvenile, and adult liver. We made the genome assembly with SoapDenovo2, and then predicted gene coding regions with GLIMMER-HMM. We combined all of these different sets of transcript coding sequences with the EvidentialGene pipeline created by Don Gilbert. With the final merged set of transcripts, we used Salmon to (very quickly) quasi-map the reads in each RNAseq sample onto the transcriptome and quantify gene expression. Differential gene expression was computed with edgeR.

The results were extremely dramatic. At low doses of dioxin, the wild type larvae show a huge gene expression response, with about a thousand genes having large fold-changes (some key genes were validated by qRT-PCR). The mutant Hudson River larvae basically ignore these low doses, with almost no gene expression changes. At the highest dose (1 ppb), the Hudson River fish show some gene expression changes, but mostly not in the same genes as in the wild type fish. Even the negative control larvae (not treated with dioxin) show a large difference in gene expression between the two populations.

RNA-seq Power calculation (FAQ)

2017-07-25T11:48:00.000-04:00

I spend a lot of time answering questions from researchers working with genomic data. If I put a lot of effort into an answer, I try to keep it in my file of 'Frequently Asked Questions' - even though this stuff does change fairly rapidly. Last week I got the common question: "How do I calculate Power for an RNA-seq experiment?" So here is my FAQ answer. I have summarized from the work of many wise statisticians, with great reliance on the RnaSeqSampleSize R package by Shilin Zhao and the nice people at Vanderbilt who built a Shiny website interface to it.

----------

>> I’m considering including an RNA-Seq experiment in a grant proposal. Do you have any advice on how to calculate power for human specimens? I’m proposing to take FACS sorted lymphocytes from disease patients and two control groups. I believe other people analyze 10-20 individuals per group for similar types of experiments.

>> It would be great if you have language that I can use in the grant proposal to justify the cohort size. Also, we can use that number to calculate the budget for your services. Thanks!

>> Ken

Hi Ken,

Power calculations require that you make some assumptions about the experiment. Ideally, you have done some sort of pilot experiment first, so you have an estimate of the total number of expressed genes (RPKM>1), fold change, variability between samples within each treatment, and how many genes are going to be differentially expressed. The variability of your samples is probably the single most important issue - humans tend to vary a lot in gene expression, cultured cell lines not so much. You can reduce variability somewhat by choosing a uniform patient group - age, gender, body mass index, ethnicity, diet, current and previous drug use, etc.

Have a look at this web page for an example of an RNA-seq power calculator.

https://cqs.mc.vanderbilt.edu/shiny/RNAseqPS/

I plugged in the following data: FDR=0.05, ratio of reads between groups=1, total number of relevant genes 10,000 (ie. you will remove about half of all genes due to low overall expression prior to differential expression testing). Expected number of DE genes=500, fold change for DE genes=2, read count (RPKM) for DE genes =10, dispersion (Standard Dev) 0.5. With these somewhat reasonable values, you get sample size of 45. So, to get a smaller sample size, you can play with all of the parameters.

The estimated Sample Size:

Description:

"We are planning a RNA sequencing experiment to identify differential gene expression between two groups. Prior data indicates that the minimum average read counts among the prognostic genes in the control group is 10, the maximum dispersion is 0.5, and the ratio of the geometric mean of normalization factors is 1. Suppose that the total number of genes for testing is 10000 and the top 500 genes are prognostic. If the desired minimum fold change is 2, we will need to study 45 subjects in each group to be able to reject the null hypothesis that the population means of the two groups are equal with probability (power) 0.9 using exact test. The FDR associated with this test of this null hypothesis is 0.05."
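To get a feel for how the parameters trade off against each other, here is a rough back-of-the-envelope sketch in Python. It is NOT the exact-test method used by RnaSeqSampleSize. It swaps in a simple normal-approximation formula for negative binomial counts, n = 2 (z_alpha/2 + z_power)^2 (1/mean + dispersion) / ln(fold change)^2, and it assumes that the dispersion entered above can be treated as the squared coefficient of variation. It also makes no adjustment for testing 10,000 genes at an FDR of 0.05, so with the same inputs it should not be expected to reproduce the 45 per group reported by the website. The function name is made up for illustration.

```python
from math import log
from statistics import NormalDist


def samples_per_group(mean_count: float, dispersion: float, fold_change: float,
                      alpha: float = 0.05, power: float = 0.9) -> float:
    """Approximate samples per group for a single two-sided, two-group test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided significance level
    z_power = z.inv_cdf(power)
    # Poisson counting noise (1/mean) plus biological variability (dispersion)
    variance_term = 1.0 / mean_count + dispersion
    return 2 * (z_alpha + z_power) ** 2 * variance_term / log(fold_change) ** 2
```

Calling samples_per_group(10, 0.5, 2) uses the read count, dispersion, and fold change from the example above. Passing a smaller alpha (as any multiple-testing correction effectively does), a higher dispersion, or a smaller fold change pushes the number up, and a higher mean count pulls it down only until the dispersion term dominates.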

To improve power (other than larger samples size or less variability among your patients), you can sequence deeper (which allows a more accurate and presumably less variable measure of expression for each gene), only look at the most highly expressed genes, or only look at genes that have large fold change. Again, it helps to have prior data to estimate these things.

When I do an actual RNA-seq data analysis, we can improve on the 'expected power' by cheating a bit on the estimate of variance (dispersion). We calculate a single variance estimate for ALL genes, then modify this variance for each individual gene (sort of a Bayesian approach). This allows for a lower variance than would happen if you just calculate StdDev for each gene in each treatment. This rests on an assumption that MOST genes are not differentially expressed in your experiment, and the variance of all genes across all samples is a valid estimate of background genetic variance.

Oxford MinION 1D and 2D reads

2017-02-06T12:15:00.000-05:00

We have been testing out the Oxford MinION DNA sequencing machine to see what it can contribute to larger de novo sequencing projects. Most of the posted data so far come from small genomes where moderate coverage is more easily obtained. Recent publications claim improved sequence quality for Oxford MinION.

We are working on a new de novo sequence for the little skate shark (Leucoraja erinacea), an interesting system to study developmental biology. The skate has a BIG genome, estimated at 4 Gb (bigger than human), so this is going to be a difficult project. The existing skate genome in GenBank and the SkateBase website is not in very good shape (3 million contigs with N50 size of 665 bp).

We got a couple of preliminary Oxford MinION reads from skate DNA - not nearly enough coverage to make a dent in this project, just having a look at the data. Oxford produces two kinds of data (in their own annoying FAST5 format, but I won't rant about that right now), single pass 1D and double pass 2D. [My Bioinformatics programmer Yuhan Hao did this analysis.] Here is what our data looks like.

So the 1D reads are really long - some more than 50 kb. The 2D reads are mostly 2-10 kb. The SkateBase has a complete contig of the mitochondrial genome, so we were able to align the Oxford sequences to this as a reference. Coverage was low, but we do have some regions where both 1D and 2D reads match the reference. What we can see is that the 1D reads have a lot of SNPs vs the reference, while the 2D reads have very few SNPs- so it is clear that the 2D reads have been successfully error corrected. Strangely, both the 1D and 2D reads have a lot of insertion-deletion errors (several per hundred bases) compared to the reference, and in fact they do not match each other - so we consider these to all be novel, uncorrected errors.

We also ran a standard Illumina whole genome shotgun sequencing run for the skate genome, which we aligned to the mitochondrial reference. With this data, we can see a small number of Oxford 2D SNPs shared by hundreds of Illumina reads, others not. None of the indels are supported by our Illumina reads.

Other investigators have had poor quality Oxford sequences. With more coverage, we may be able to use the Oxford reads as scaffolds for our de novo assembly. It may be possible to use Illumina reads for error correction, and mark all uncorrected areas of the Oxford sequences as low quality, but that is not the usual method for submitting draft genomes to GenBank.

GenomeWeb reports on updated Coffee Beetle genome

2017-01-17T16:21:00.001-05:00

A nice review of my 2015 Coffee Beetle paper in GenomeWeb today.
My genome had 163 million bases and 19,222 predicted protein-coding genes. I am very pleased to learn that a revised version of the draft genome sequence (from a group in Colombia) contains 160 million bases and 22,000 gene models. They also confirm the 12 horizontally transferred genes that I identified.

Coffee Pest, Plant Genomes Presented at PAG Conference

Researchers from the USDA's Agricultural Research Service, New York University, King Abdullah University of Science and Technology, and elsewhere published information on a 163 million base draft genome for the coffee berry borer in the journal Scientific Reports in 2015.
That genome assembly, produced with Illumina HiSeq 2000 reads, housed hundreds of small RNAs and an estimated 19,222 protein-coding genes, including enzymes, receptors, and transporters expected to contribute to coffee plant predation, pesticide response, and defense against potential pathogens. It also provided evidence of horizontal gene transfer involving not only mannanase, but several other bacterial genes as well.
At the annual Plant and Animal Genomes meeting here this week, National Center for Coffee Research (Cenicafe) scientist Lucio Navarro provided an update on efforts to sequence and interpret the coffee berry borer genome during a session on coffee genomics. For their own recent analyses, Navarro and his colleagues upgraded an earlier version of a coffee berry borer genome that had been generated by Roche 454 FLX sequencing, using Illumina short reads from male and female coffee berry borers to produce a consensus assembly spanning around 160 million bases. The assembly is believed to represent roughly 96 percent of the insect's genome.
In addition to producing a genome with improved contiguity levels, he reported, members of that team also combined 454 and Illumina reads to get consensus transcriptomes for the beetle. With these and other data, they identified almost 22,000 gene models, novel transposable element families, and their own evidence of horizontal gene transfer.

Finding differences between bacterial strains 100 bases at a time.

2016-12-08T14:56:00.001-05:00

This work was conducted mostly by my research assistant Yuhan Hao.

We have recently been using Bowtie2 for sequence comparisons related to shotgun metagenomics, where we directly sequence samples that contain mixtures of DNA from different organisms, with no PCR. Bowtie2 alignments can be made very stringently (>90% identity), and computed very rapidly for large data files with hundreds of millions of sequence reads. This allows us to identify DNA fragments by species and by gene function, provided we have a well-annotated database that contains DNA sequences from similar microbes. I know that MetaPhlan and other tools already do this, but I want to focus on the difference between bacteria (and viruses) at the strain level.

I have been playing around with the idea of constructing a database that will give strain-specific identification for human-associated microbes using Bowtie2 alignments with shotgun metagenomic data. There are plenty of sources for bacterial genome sequences, but the PATRIC database has a very good collection of human pathogen genomes. However the database is very redundant - with dozens or even hundreds of different strains for the same species, and the whole thing is too large to build into a Bowtie database (>350 Gb). So I am looking at ways to eliminate redundancy from the database while retaining high sensitivity at the species level to identify any sequence fragment, not just specific marker genes, and the ability to make strain-specific alignments. So I want to identify which genome fragments have shared sequences among a group of strains within a single species and which fragments are unique to each strain.

Since our sequence reads are usually ~100 bases, we are looking at the similarity between bacterial strains when the genome is chopped into 100 bp pieces.
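Chopping a genome into pieces is simple enough to sketch in a few lines of Python. This is an illustration rather than the script Yuhan used. It assumes non-overlapping fragments, drops any leftover tail shorter than the fragment size, and the function names and fragment naming scheme are invented for the example.

```python
from typing import List


def chop(sequence: str, size: int = 100) -> List[str]:
    """Split a sequence into non-overlapping fragments of exactly `size` bases."""
    return [sequence[i:i + size] for i in range(0, len(sequence) - size + 1, size)]


def fragments_as_fasta(name: str, sequence: str, size: int = 100) -> str:
    """Format the fragments as FASTA, naming each one by its 0-based start."""
    lines = []
    for index, fragment in enumerate(chop(sequence, size)):
        lines.append(f">{name}_{index * size}")
        lines.append(fragment)
    return "\n".join(lines) + ("\n" if lines else "")
```

The resulting FASTA file of fragments can then be aligned to the other strain with Bowtie2 using the -f option for FASTA input.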

Bowtie2 can be limited to only perfect alignments (100 bases with zero mismatches) using the parameters --end-to-end and --score-min C,0,-1 and to something similar to 99% identity with --score-min L,0,-0.06
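The link between the score function and percent identity is just arithmetic, sketched here in Python for the linear (L) form only. It assumes end-to-end mode, where a perfect alignment scores 0, and a maximum mismatch penalty of 6 (the Bowtie2 default for --mp), and it ignores gaps, N bases, and the lower penalty Bowtie2 can charge at low-quality positions. The function names are made up for illustration.

```python
import math


def min_score(read_len: int, intercept: float, slope: float) -> float:
    """Minimum allowed alignment score for a linear function L,intercept,slope."""
    return intercept + slope * read_len


def max_mismatches(read_len: int, intercept: float, slope: float,
                   mismatch_penalty: int = 6) -> int:
    """Number of full-penalty mismatches that still pass the score threshold."""
    budget = -min_score(read_len, intercept, slope)
    # small epsilon guards against floating point values just under a whole number
    return max(0, math.floor(budget / mismatch_penalty + 1e-9))
```

With 100 base reads, a slope of -0.06 allows a score of -6, which is one mismatch (about 99% identity), and a slope of -0.2 allows -20, which is three mismatches (about 97% identity).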

Between two strains of Streptococcus agalactiae, if we limit the Bowtie2 alignments to perfect matches, half of the fragments align. So at the 100 base level, half of the genome is identical, and the other half has at least one variant. At 99% similarity (one mismatch per read), about 2/3 of the fragments align, and the other 1/3 has more than one variant, or in some cases no-similarity at all to the other genome. Yuhan Hao extended this experiment to another species (Strep pyogenes), where we see almost no alignment to Strep ag. of genome fragments at 100%, 99%, or ~97% identity (see the Table below). I have previously used the 97% sequence identity threshold to separate bacterial species, but I thought it was only for 16S rDNA sequences - here it seems to apply to almost every 100 base fragment in the whole genome.

So we can build a database with one reference genome per species and safely eliminate the 2/3 of fragments that are nearly identical between strains, and retain only those portions of the genome that are unique to each strain. I'm going to think about how to apply this iteratively (without creating a huge computing task), so we add only the truly unique fragments for EACH strain, rather than just testing if a fragment differs from the Reference for that species. With such a database, stringent Bowtie2 alignments will identify each sequence read by species and by strain and have very low false matches across different species.
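One way the iterative idea might be sketched in Python is shown below. This is only a toy: an exact string lookup stands in for a stringent Bowtie2 alignment, so it does not handle the 99% matches or reverse-strand matches, and the function and variable names are made up for illustration.

```python
def add_unique_fragments(database, strain_name, fragments):
    """Add fragments not yet in the database; return how many were added.

    database maps fragment sequence -> name of the first strain that had it.
    Assumption: exact matching is used as a stand-in for Bowtie2 alignment.
    """
    added = 0
    for fragment in fragments:
        if fragment not in database:
            database[fragment] = strain_name
            added += 1
    return added


if __name__ == "__main__":
    database = {}
    # Made-up 4-base "fragments", purely to show the bookkeeping.
    strains = {
        "reference": ["AAAA", "CCCC", "GGGG"],
        "strain2": ["AAAA", "CCCC", "TTTT"],
        "strain3": ["AAAA", "TTTT", "ACGT"],
    }
    for name, fragments in strains.items():
        n_new = add_unique_fragments(database, name, fragments)
        print(f"{name}: {n_new} new fragments added")
```

Because each strain is compared against everything already in the database, not just the Reference, a fragment shared by strain2 and strain3 is stored only once.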

We can visualize the 100 base alignments between two strains at the 100% identity level, 99% identity, and at the least stringent Bowtie2 setting (--very-fast-local) to see where the strains differ. This makes a pretty picture (below), which reveals some fairly obvious biology: There are parts of the genome that are more conserved and parts that are more variable between strains. The top panel shows only the 100% perfect matches (grey bars), the second panel shows that we can add in some 99% matching reads (white bars with a single vertical color stripe to mark the mismatch base), and the lowest panel shows reads that are highly diverged (lots of colored mismatch bases and some reads that are clearly mis-aligned from elsewhere on the genome). So if we choose only the reads that are less than 99% identical to the Reference, we can build a database that can use Bowtie2 to identify different strains and have few multiple matches or incorrectly matched reads.

This chart lists the number of UNMATCHED fragments after alignment of all non-overlapping 100 bp fragments from the genomes of each strain to the Strep. agalactiae genome that we arbitrarily chose as the Reference. C 0 -1 corresponds to perfect matches only, L 0 -0.06 is approximately 99% identity (by my calculation), and L 0 -0.2 is about 97% identity. The lower the stringency of alignment, the fewer fragments end up in the "unmatched" file.

S.ag Ref S.ag st2 S.ag st3 S.pyo st1 S.pyo st2

# 100b kmers 20615 22213 21660 17869 17094

C 0 -1 0 0 11243 10571 17718 16934 (perfect matches)

L 0 -0.06 0 0 6605 5918 17533 16762 (99% ident)

L 0 -0.1 0 0 6534 5839 17533 16762

L 0 -0.12 0 0 4812 4058 17371 16600

L 0 -0.15 0 0 4767 4006 17369 16600

L 0 -0.17 0 0 4756 3997 17369 16600

L 0 -0.2 0 0 4156 3419 17237 16441 (97% ident)

Functional Metagenomics from shotgun sequences

2016-05-27T12:13:00.002-04:00

A number of groups at our research center have recently become interested in metagenomic shotgun sequencing (MGS), which is simply taking samples that are presumed to contain some microbes, extracting DNA and sequencing all of it, shotgun style. This is seen as an improvement over metagenomics methods that amplify 16S ribosomal RNA genes using bacteria-specific PCR primers, and then sequence these PCR products. The 16S approach has had a lot of success, in that essentially all of the important "microbiome" publications over the past 5 years have been based on the 16S method. The 16S approach has a number of advantages – the amount of sequencing effort per sample is quite small (1,000 to 10,000 sequences is usually considered adequate) and fairly robust computational methods have been developed to process the sequence data into abundance counts for taxonomic groups of bacteria.

There are a number of drawbacks for the 16S method. The data is highly biased by DNA extraction methods, PCR primers and conditions, DNA sequencing technology, and computational methods used to clean, trim, cluster, and identify the sequences. What I mean by biased is that if you change any of these factors, then you get a different set of taxa and abundances from the samples. Even when these biases are carefully addressed, the accuracy of the taxonomic calls is not very good or reliable. It is simply not possible to identify with high precision and accuracy all bacterial species (or strains) present in a DNA sample with just ~400 bp of 16S sequence data. Many 16S microbiome studies report differences in bacterial abundance at the genus or even the phylum level.

Two samples may be discovered to have reproducible differences in 16S sequence content, but the actual bacterial species or strains that differ are not confidently identified. Even more important, the low resolution of 16S studies may be missing a lot of important biology. Huge changes in environment can perhaps favor anaerobes over aerobes, but smaller changes in pH, nutrient abundance, or immune cell populations and function may cause a shift from one species in a genus to another. And this difference in species or strain may bring important changes in metabolite flux, immune system interaction, etc.

It has been proposed that bulk metagenomic shotgun sequencing (MGS) of all of the DNA in a biosample, rather than just PCR amplified 16S sequences, would allow for more precise species and strain identification, and quantification of the actual microbial genes present. Some MGS methods also attempt to count genes in specific functional groups such as within the Gene Ontology, or KEGG pathways. Other people would like to discover completely novel genes that innovative bacteria have developed to do interesting metabolic things. This leads to a computationally hard problem. MGS data tend to be very large (200 million to 1 billion reads per sample), and databases of bacterial genes are incomplete. In fact, we probably have complete genome sequences for very much less than 0.1 percent of all bacteria in the world. We might also like to identify DNA from archaea, viruses, and small eukaryotes in our samples.

Each fragment of DNA in an MGS data file has to be identified based on some type of inexact matching to some set of reference genes or genomes (a sequence alignment problem), which is computationally very demanding. In my experience, BLAST is the most sensitive tool to match diverged DNA sequences, but depending on the size of the database, it takes from 0.1 to 10 CPU seconds for each sequence to be aligned by BLAST. 100,000 sequences takes at least overnight (on 32 CPUs, 128 GB RAM), if not all week. I have never tried a billion. Scale that up to a billion reads per sample and you can see that we have a serious compute challenge. We have at least 200 samples with FASTQ files queued up for analysis, more being sequenced, and more investigators are preparing to start new studies. So we need a scalable solution, because this problem cannot be solved just by brute force BLAST searching on ever bigger collections of computers.
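A back-of-the-envelope calculation shows the scale of the problem. This sketch uses the 0.1 to 10 CPU seconds per read and the 32 CPUs mentioned above, and assumes (optimistically) that the work splits perfectly across all CPUs with no overhead.

```python
SECONDS_PER_DAY = 86_400


def blast_wall_days(n_reads, cpu_seconds_per_read, n_cpus):
    """Wall-clock days to BLAST n_reads, assuming perfect parallel scaling."""
    total_cpu_seconds = n_reads * cpu_seconds_per_read
    return total_cpu_seconds / n_cpus / SECONDS_PER_DAY


if __name__ == "__main__":
    reads_per_sample = 1_000_000_000
    for seconds_per_read in (0.1, 10):
        days = blast_wall_days(reads_per_sample, seconds_per_read, 32)
        print(f"{seconds_per_read} CPU s/read: {days:,.0f} days "
              f"({days / 365:.1f} years) per sample on 32 CPUs")
```

Even at the fast end of the range, one sample of a billion reads would occupy the whole 32-CPU server for about five weeks, and we have 200 samples waiting.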

As more investigators have become interested in MGS, computing methods for processing this data have popped up like weeds. It's really difficult to review/benchmark all of these tools, since there are no clear boundaries for what analysis results they should deliver, and what methods they should be using. The omictools.com website lists 12 tools for "metagenomics gene prediction" and 5 for "metagenomics functional annotation" however these categories might be defined. The best benchmark paper I could find (Gardner et al 2015) looks at about 14.

Should the data be primer and quality trimmed (YES!), human and other contaminants removed (YES!), duplicates removed (maybe not???), clustered (????), assembled into contigs or complete genomes? Then what sort of database should the fragments be aligned to? The PFAM library of protein motifs, a set of complete bacterial genomes, some set of candidate genes?

So far, the most successful tools focus on the taxonomy/abundance problem – choosing some subset of the sequence data and comparing it to some set of reference sequences. I have chosen MetaPhlAn <http://www.ncbi.nlm.nih.gov/pubmed/22688413> to process our data because it does well in benchmark studies, runs quickly on our data, and Curtis Huttenhower has a superb track record of producing excellent bioinformatics software. In addition, we removed primers and quality trimmed using Trimmomatic, then removed human sequences by using Bowtie2 to align to the human reference genome (as much as 90% of the data in some samples). Using MetaPhlAn on the cleaned data files, we got species/abundance counts for our ~200 WGS samples in less than a week. [Note, when I say 'we', I actually mean that all the work was done by Hao Chen, my excellent Bioinformatics Programmer]

An intentionally fuzzy top species abundance heatmap of pre-publication data for some MGS samples processed with MetaPhlAn. If we added info about which samples came from which patients, this might be considerably more interesting.

Moving on, our next objective is to identify microbial protein coding genes and metabolic pathways. Here, the methods become much more challenging, and difficult to benchmark. I basically don’t like the idea of assembling reads into contigs. This adds a lot of compute time, huge inconsistency among different samples (some will assemble better than others), and creates all kinds of bias for various species (high vs low GC genomes, repeats, etc.) and genes. Also, I'm having a hard time coming up with a solid data analysis plan that meets all of our objectives. Bacteria are diverse. A specific enzyme that fulfills a metabolic function (for example in a MetaCyc pathway), could differ by 80% of its DNA sequence from one type of bacteria to another, but still do the job. Alternately, there are plenty of multi-gene families of enzymes in bacteria with paralogs that differ in DNA sequence by 20% or less within the genome of a single organism, but perform different metabolic functions. And of course there are sequence variants in individual strains that inactivate an enzyme, or just modify one of its functions. There is no way we can compute such subtle stuff on billions of raw Illumina reads from a mixture of DNA fragments from unknown organisms (many of which have no reference data). So how do we compute up a report that realistically describes the functional differences in gene/pathway metabolic capacity between different sets of MGS samples?

If I have to design the gene content identifier myself, I will probably translate all reads in 6 frames, then use hmmscan vs. the PFAM library of known protein functional domains. The upside is that I know how to do this, and it is reasonably sensitive for most types of proteins. The downside is that many proteins will receive a very general function (i.e., "kinase" or "7-TM domain"), which does not reveal exactly what metabolic functions they are involved in. And of course some 30-40% of proteins will come up with no known function – just like every new genome that we sequence. Another way to do this would be to use BLASTx against the set of bacterial proteins from UniProt. I can speed this up a bit by taking the UniRef 90% identity clusters (which reduces the database size by about 25%). An alternate proposal, which is much more clumsy, slower, and ad hoc, would be to BLASTn against a large set of bacterial genomes. Either all the complete microbial genomes in GenBank, or all the ones collected by the Human Microbiome Project, or some taxonomically filtered set (one genome per genus???). The point of searching against many complete genomes is that we have some chance of catching rare or unusual genes that are not well annotated as a COG, or KEGG pathway member.
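For readers who have not done it, six-frame translation of a read is simple to sketch in plain Python. This version assumes the standard genetic code (bacteria normally use the closely related translation table 11, which has the same codon assignments apart from alternative start codons), writes stop codons as "*", and writes any codon containing an ambiguous base as "X". The function names are my own; in practice an existing tool would be faster.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(dna):
    """Reverse complement of an upper- or lower-case DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become X."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translate(read):
    """Return peptides for frames +1, +2, +3 and -1, -2, -3 of a read."""
    forward = read.upper()
    reverse = reverse_complement(forward)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(forward[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


if __name__ == "__main__":
    for frame, peptide in six_frame_translate("ATGGCCTAA").items():
        print(frame, peptide)
```

A 100 base read gives six peptides of about 33 amino acids each, so the protein search has six times as many (short) queries as there are reads.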

According to Gardner et al (Sci. Rep. 2015), the best tools for identifying gene content in MGS samples are the free public servers MG-RAST and the EBI Metagenomics portal. So we have submitted some test samples to these. However, the queue is at least 10 days for processing by MG-RAST, so this is maybe not going to satisfy my backlog of metagenomics investigators who are rapidly cranking out more MGS samples. We probably need a local tool that will be under our own control in terms of compute time and more amenable to tweaking to the goals of each project.

At this point I'm asking blog readers for suggestions for a decent method or tool to identify gene content in big MGS FASTQ files which will be reasonably accurate in terms of protein/pathway function, and not crush my servers. Some support from a review or benchmark paper (other than by the tool's own authors...) would be nice also.

Irreproducible results

2016-03-02T10:53:00.001-05:00

I get frustrated by the lack of stability in genomics data collection and analysis, which of course leads to irreproducible results. I imagine (naively I'm sure) that in physics one measures a quant or a nanode of some particle and it stays measured that way for years and decades. I do accept the inevitable technology changes that lead us to measure similar things, such as gene expression, in different ways (Northern blots, microarrays, RNA-seq). However, my lab-based collaborators become very frustrated when the exact same data (such as RNA-seq FASTQ files) produce different results, such as changes in the list of top differentially expressed (DE) genes with different p-values, when analyzed with different software. This frustration grows even more severe when the different results come from a different version of the same software!

I was working through my RNA-seq tutorial with a group of students this week and they pointed out that my tutorial worksheet was wrong. Cufflinks did not produce any significant DE genes with our test data comparing two small RNA-seq data files. This was surprising to me, since I did the exact same workflow with the same data files last year and it worked out fine. So golly and darn it, I got hit with the irreproducible results bug. We keep past versions of software available on our computing server with an Environment Modules system, so I was able to quickly run a test of the exact same data files (aligned with the same Tophat version to the same reference genome) using different versions of Cufflinks. We have the following versions installed:
cufflinks/2.0.2 (July 2012)
cufflinks/2.1.1 (April 2013)
cufflinks/2.2.0 (March 2014)
cufflinks/2.2.1 (May 2014)

I just used the simple Cuffdiff workflow and looked at the gene_exp.diff output file for each software version. The results are quite different. Version 2.0.2 has 46 genes called "significant=yes" (multi-test adjusted q-value less than 0.05) with q-values running as low as 4.14E-10 (ok, one has a q-value of zero). Now this is not a great result from a biostatistics standpoint, since how can you expect to get significant p-values from RNA expression levels with two samples and no replicates? But it did make for an expedient exercise, since we could take the DE genes into DAVID and look for enriched biological functions and pathways.

Then in version 2.1.1 we have two "significant=yes" genes. In version 2.2.0 we have zero significant genes, and in version 2.2.1 also zero.

The top 10 genes, ranked by q-value also differ. There are no genes in common between the top 10 list for version 2.0.2 and 2.1.1, and only the top 2 genes are shared by version 2.2.0. Thankfully, there are no differences in top genes or q-values between 2.2.0 and 2.2.1 (versions released only 2 months apart). I'm sure that Cole Trapnell et al. are diligently improving the software, but the consequences for those of us trying to use the tools to make some sense out of biology experiments can be unsettling.
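The comparison itself is easy to script. Below is a small Python sketch that pulls the "significant=yes" genes and the top genes by q-value out of gene_exp.diff files and intersects them between versions. It assumes the files are tab-delimited with a header line containing the columns gene, q_value, and significant; the toy rows in the demo are invented just to show the mechanics and are not our results.

```python
import csv


def read_gene_exp_diff(lines):
    """Parse gene_exp.diff content (an open file or any iterable of lines)."""
    return list(csv.DictReader(lines, delimiter="\t"))


def significant_genes(rows):
    """Set of gene names flagged significant=yes."""
    return {row["gene"] for row in rows if row["significant"] == "yes"}


def top_genes(rows, n=10):
    """Gene names of the n rows with the smallest q-values."""
    ranked = sorted(rows, key=lambda row: float(row["q_value"]))
    return [row["gene"] for row in ranked[:n]]


if __name__ == "__main__":
    # Invented toy data standing in for two versions' output files.
    header = "gene\tq_value\tsignificant"
    version_a = [header, "geneA\t0.0001\tyes", "geneB\t0.03\tyes", "geneC\t0.4\tno"]
    version_b = [header, "geneA\t0.2\tno", "geneB\t0.04\tyes", "geneC\t0.9\tno"]
    rows_a = read_gene_exp_diff(version_a)
    rows_b = read_gene_exp_diff(version_b)
    shared = significant_genes(rows_a) & significant_genes(rows_b)
    print("significant in both:", sorted(shared))
    print("top 2 shared:", sorted(set(top_genes(rows_a, 2)) & set(top_genes(rows_b, 2))))
```

With real output, each version's file would be opened with open(path) and passed to read_gene_exp_diff in place of the toy lists.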

RNA-seq Workshop

2016-02-14T10:46:00.002-05:00

We ran a tutorial at NYU Med Center last week on the basics of RNA-seq data analysis. This tutorial was based on the use of our High-Performance Linux cluster, so we actually presented the class as 2 sessions: A 2-hour session on basic Linux commands (for the complete novice) plus writing and submitting Sun Grid Engine scripts. Then in the 2nd 2-hour session we focus on TopHat/Cufflinks data processing with some sample data files. So this tutorial has some parts that are rather specific to the NYUMC computing system, but it may be quite similar to what computing resources might be available at other schools and research centers.

The YouTube links to the Linux session (screencast):
https://youtu.be/M3RVfv6lUtc

and the RNA-seq session (screencast):
https://youtu.be/hksQlJLwKqo

A wiki website with the Powerpoint slides and a collection of other resources:
https://genome.med.nyu.edu/hpcf/wiki/RNA-seq_tutorial_2016

The Tardigrade Miscalculation

2016-01-28T12:06:00.001-05:00

There was a lot of publicity back in November about the genome sequence of the Tardigrade (Hypsibius dujardini), a small animal (0.05 – 1mm) that is somewhat similar to nematodes. These are fascinating little creatures that have been described as incredibly resistant to all manner of physical stress – high and low temperatures (reportedly from -272°C to +151°C), high pressure, complete vacuum (Tardigrades in Space = TARDIS {I kid you not}), ionizing radiation, and can survive without food or water for more than 10 years as kind of a dehydrated little lump.

Tippett Studio/Cosmos A Spacetime Odyssey

The reason the genome of the Tardigrade was such big news in November is that the group doing the bioinformatics analysis claimed that the genome contained 6,663 genes from bacteria, a full sixth of the genome, and twice as many horizontally transferred genes as have ever been seen in any other organism (Boothby et al, PNAS 112(52):15976-81. doi: 10.1073/pnas.1510461112. PMID: 26598659). This "weird science" observation was covered by National Geographic, Science News, Phys.org, Meta Science News, and of course the Univ. of North Carolina press site.

However, it seems quite clear now that this claim about horizontal DNA from bacteria (and maybe other phyla) in the genome of the Tardigrade was wrong. In fact, another group (Georgios Koutsovoulos, Sujai Kumar, Dominik R Laetsch, Lewis Stevens, Jennifer Daub, Claire Conlon, Habib Maroon, Fran Thomas, Aziz Aboobaker, Mark Blaxter) also working on the sequence of the exact same species has rapidly published a preprint manuscript on the bioRxiv preprint server "The genome of the tardigrade Hypsibius dujardini" that clearly refutes the claims of Boothby et al. and points out their mistakes in genome analysis: "Cross-comparison of the assemblies, using raw read and RNA-Seq data, confirmed that the overwhelming majority of the putative HGT candidates in the previous genome were predicted from scaffolds at very low coverage and were not transcribed."

It is quite easy to get contaminants when you are doing whole genome sequencing for a multicellular organism. You grind up your target species, extract DNA and put it into the sequencing machine. Any bacteria and other small organisms on the surface or in the gut come along for the ride and can contribute their DNA to the sequencing library. Surprisingly, a small amount of bacterial contaminating DNA (perhaps just 1%) can lead to a large number of bacterial contigs in the final genome assembly. I can think of a couple of reasons for this, based on the small size of bacterial genomes (~1 Mb), vs metazoan genomes (most >100 Mb). First, relative genome coverage of a contaminant bacterium will be much higher for each kb of sequence data, so the 1% of contaminating DNA may have deep coverage of a bacterial genome. Second, any two bacterial DNA fragments randomly selected from a library have a much higher chance to overlap (less complex genome), so they will assemble better.
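The coverage argument is easy to check with a few lines of Python. This sketch uses the round numbers from the paragraph above (1% contaminant, a ~1 Mb bacterial genome, a 100 Mb host genome); the 10 Gb total sequencing yield is an arbitrary figure I picked for illustration.

```python
def mean_coverage(total_bases, fraction_of_library, genome_size):
    """Average depth of coverage for one genome within a mixed library."""
    return total_bases * fraction_of_library / genome_size


if __name__ == "__main__":
    total_bases = 10_000_000_000   # assumed 10 Gb of sequence data
    host = mean_coverage(total_bases, 0.99, 100_000_000)
    contaminant = mean_coverage(total_bases, 0.01, 1_000_000)
    print(f"host genome coverage:        {host:.0f}x")
    print(f"contaminant genome coverage: {contaminant:.0f}x")
```

With these numbers the 1% of bacterial reads covers the contaminant genome about as deeply as the other 99% covers the host, which is plenty for the assembler to build clean bacterial contigs.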

There are a few QC steps that one can take on the raw data. There is a nice tool called Kraken (Wood DE, Salzberg SL Genome Biology 2014, 15:R46) that can quickly run through an entire FASTQ file (4 million reads per minute on a single core) and mark each read according to a set of reference genomes based on exact matching of 31 base k-mers. The Kraken team also make available a pre-built 4 GB database constructed from complete bacterial, archaeal, and viral genomes in RefSeq. DeconSeq is another good tool to find contaminants with an easy web interface. Of course, some legitimate reads from any target organism will share k-mer sized chunks with some bacteria, viruses, etc. (and some sequences from contaminating bacteria will not be in any database), so one has to make some tough choices about what to remove from the data before assembly.
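To show the general idea of exact k-mer matching, here is a toy Python sketch. It is not Kraken's actual algorithm (Kraken assigns k-mers to taxa through a taxonomy tree and uses a compact database); this version just counts which reference shares the most k-mers with a read, and the sequences and names in the demo are made up.

```python
def build_kmer_index(references, k):
    """Map every k-mer to the set of reference names that contain it."""
    index = {}
    for name, sequence in references.items():
        for i in range(len(sequence) - k + 1):
            index.setdefault(sequence[i:i + k], set()).add(name)
    return index


def classify_read(read, index, k):
    """Return the reference sharing the most exact k-mers, or None if no hits."""
    votes = {}
    for i in range(len(read) - k + 1):
        for name in index.get(read[i:i + k], ()):
            votes[name] = votes.get(name, 0) + 1
    return max(votes, key=votes.get) if votes else None


if __name__ == "__main__":
    # Tiny invented sequences; real use would be k=31 on whole genomes.
    references = {
        "bacterium": "ACGTACGTTTGACCA",
        "host": "GGGGCCCCAAAATTTT",
    }
    index = build_kmer_index(references, k=5)
    print(classify_read("CGTTTGAC", index, k=5))
    print(classify_read("NNNNNNNN", index, k=5))
```

Reads that return None here correspond to the hard cases mentioned above: sequences that match nothing in the database could be the target organism or an unsequenced contaminant.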

After assembly, there are some additional steps one can take to flag contaminants. It is extremely helpful (I would now say required) to have some RNA-seq data from the same organism. RNA-seq data is prepared using a poly-A protocol, so no bacterial RNA contaminants should be present. Any contigs (with predicted genes) that do not contain a reasonable amount of aligned RNA-seq reads are highly suspect. Any contig that has predicted genes only from a different species is clearly a red flag.

While the authors of the original paper have not (yet) published a retraction, the citation in PubMed does carry a link to the refuting article provided by author Sujai Kumar.

Rather than rant on about proper workflows for genome annotation (a best practices document does exist: Mark Yandell & Daniel Ence, Nature Reviews Genetics 13, 329-342 doi:10.1038/nrg3174) let me just say to the authors, the reviewers and the editors at PNAS that "EXTRAORDINARY CLAIMS REQUIRE EXTRAORDINARY EVIDENCE" (Carl Sagan). Or as said by Laplace: “The weight of evidence for an extraordinary claim must be proportioned to its strangeness.”

Cancer Moonshot in the Cloud

2016-01-20T11:25:00.000-05:00

I've been reading a bit about the "Cancer Moonshot" discussion at the Davos economics conference.

http://www.weforum.org/events/world-economic-forum-annual-meeting-2016/sessions/cancer-moonshot-a-call-to-action

Naturally I'm interested in the possible increase in funding for genomics and bioinformatics research, but also the discussion of 'big data' and sharing of genomics data are issues that I bump into all the time. It is almost impossible to overstate the number of hoops an ordinary scientist has to jump through to obtain access or to share human genomic data that has already been published. There is an entire system of "authorized access" that requires not only that scientists swear to handle genomic data securely and make no attempt to connect genomic data back to patient identities, but also that the University (or research institute) where they work must monitor and enforce these rules. I have had to deal with this system to upload human microbiome data (DNA sequences from bacteria found in or on the human body) that are contaminated with some human DNA. [But not with the coffee beetle genome!] Then I had to apply again for authorization to view my own data to make sure it had been loaded properly.

Why is cancer genomic data protected? Unfortunately, some annoyingly clever people such as Yaniv Erlich have shown that it is possible (fairly easy in his hands) to identify people by name and address just from some of their genome sequence. Patients who agree to participate in research are supposed to be guaranteed privacy - they wanted to share information with scientists about the genetic nature of their tumors, not to share their health care records with nosy neighbors, privacy hackers and identity thieves.

Why do we need thousands of cancer genomes? One key goal of cancer research is personalized medicine - matching up people with customized treatments based on the genetics of their cancer. Current technology is pretty good for DNA sequencing of tumors - for a single cancer patient we can come up with a list of somatic mutations (found only in the tumor) for a few thousand dollars worth of sequencing effort (and a poorly measured amount of bioinformatician and oncologist time). One of the biggest challenges right now is sorting through the list of mutations to figure out which ones are important drivers of cancer growth and disease severity - and should therefore be targeted by drugs or other therapy. Some mutations are well known to be bad actors, others are new mutations in genes that have been found to be mutated in other cancers, others are complete unknowns. Data is needed from (hundreds of) thousands of tumors together with records of treatment response and other medical outcomes in order to build strong predictive models that will reliably advise the doctor about the medical importance of each observed mutation. Another challenge is the heterogeneity of cells within a single person's cancer. As DNA sequencing technology improves, investigators have started to sequence small bits of tumors, or even single cells. They observe different mutations in different cells or sub-clones. Now a key question is whether the common resistance to drug treatment is a result of new mutations that occur during (or after) treatment, or whether the resistant cells already exist in the tumor, but are selected for growth by drug treatment. Overall, this means that precision cancer treatment may require a large number of different genome sequences from each patient, both during diagnosis and to monitor the course of treatment and post-treatment.

So cancer genomic research requires thousands of genomes (deeply sequenced for accuracy and control of artifacts), which means that each authorized investigator must download terabytes of data, and then come up with the data storage and compute power to run his or her clever analysis. In addition to the strictly administrative hurdles of applying for and maintaining an authorized access to cancer genomic data, there are the problems of data transfer, data storage, and big computing power. So the NIH (or other funding agency) has to pay once to generate the cancer genomic data, then again to store it and provide a high bandwidth web or FTP data sharing system, then again to administer the authorized access system, then again for each interested scientist to build a local computing system powerful enough to download, store, and analyze the data (and for University administrators to triple check that they are doing it properly, and again for the NIH administrators to check up on the University administrators to ensure they are doing their checking properly). This is an impressive amount of redundancy and wasted effort, even for the US Government.

There is an obvious solution to this problem: 'Use the Cloud, Luke'. A single Cloud computing system can store all cancer genomic data in a central location, together with a sufficiently massive amount of compute power so that authorized investigators can log in and run their analysis remotely. This technology already exists; Google, Amazon, Microsoft, IBM, Verizon, and at least a dozen other companies already have data centers large enough to handle the necessary data storage and compute tasks. It would be handy to build a whizz bang compute system with all kinds of custom software designed for cancer genomics, but that would take time (and government contractors). A better, faster, simpler system would just stick the genomic data in a central location and let researchers launch virtual machines with whatever software they want (or design for themselves). Amazon EC2 has this infrastructure already in place. It could be merged with the NIH authorized access system in a week-long hackathon. Cancer research funding agencies could award Cloud compute credits (or just let people budget for Cloud computing in the standard grant application).

Masters in Biomedical Informatics at NYU School of Medicine

2015-10-23T09:56:00.001-04:00

We are starting a new Masters program in Biomedical Informatics at NYU School of Medicine in 2016. We currently have about a dozen PhD students, but the Masters program is intended to serve a wider group with more diverse backgrounds.

Research Adventure with ENCODE Data

2015-09-04T11:11:00.000-04:00

At NYU, first-year PhD students in the Sackler Institute start their first semester with a week-long full-time "Research Adventure" workshop. I was asked (at short notice) to mentor a group of students for something in Bioinformatics. Since I had recently attended the 2015 ENCODE Users Meeting, I decided to make the workshop all about working with ENCODE data.

I included tutorials about access to ENCODE data, an Intro to Linux for complete computing novices (quite a few of our students), Genomic Intervals in the UCSC Genome Browser, use of BEDTools to compare genomic intervals for various factors, and a tutorial in R for data display. Later in the week we looked at gene expression with RNA-seq using TopHat and Cufflinks. The general plan for the 5-day workshop (for 6 students) was as follows:

Monday

9-11:00 am Lecture (2 hr): Introduction to Gene Regulation and Epigenetics

11-12:00 am Lecture (1 hr): Use of the High Performance Computing Cluster

12:00-2 pm Working Lunch with HPC System Manager (2 hr): Set up HPC account for each student, practice Linux commands, move files from laptop to HPC account

2-4 pm Exercise 1: Tutorials for Accessing ENCODE data through the ENCODE Portal, UCSC Genome Browser and ENSEMBL Browser

Tuesday

9-11:00 am Lecture & Demo: (2 hr): The UCSC Genome Browser, BED file format, and BEDTools software

11-12:00 am Exercise 2: BEDTools Tutorial

12-1:00 pm Lunch

1-3:00 pm Exercise 3: Use of ENCODE Data and BEDTools to compute the Intersection of DNAse hypersensitive sites with promoters of all RefSeq genes

Wednesday

9-10:30 am Lecture: Computing Gene Expression with RNA-Seq (1.5 hr)

10:30-12 am Exercise 4: Align ENCODE RNA-seq data to hg19 reference genome with TopHat

12-1:00 pm Lunch

1-4 pm Continue work on Exercise 4

Thursday

9-10:00 am Lecture (1 hr): Intro to data visualization with R

10-12:00 am Exercise 5: Try R Code School tutorial.

12-1:00 pm Lunch

1-2:00 pm Lecture (1 hr): Differential Gene Expression with Cufflinks

2-4:00 pm Planning for Research Project – choose ENCODE data for transcription factors, gene expression, and epigenetic markers. Literature search.

Friday

9-12:00 am Work on Research Project

12-1:00 pm Lunch

1-4:00 pm Work on Data analysis and prepare presentation

I had six students in our Research team: Elaine Fisher, Reuben Moncada, Shushan Sargsian, Beny Shapiro, Jong Shin, and Bo Xia. I have pasted images from their final presentation below (can't upload PowerPoint or PDF in this Blogger).

My overall impression of the week was that the students learned a huge amount of computing skills, but it was a bit bumpy when we got to the RNA-seq methods. They had really good success comparing various Transcription Factor binding sites to known genes (promoter region, TSS, 3'UTR, exons, introns, 5'UTR), and finding interactions between TF's by finding overlapping or nearby binding sites. We also found nice overlaps between ChIP-seq TF binding sites and DNAse sensitive sites, histone modification sites, and computationally predicted TF binding sites. Also, the students did a nice job of measuring overlapping vs. nearby binding sites (bedtools slop), and measuring the significance of intersections using bedtools shuffle to create a statistical model of random intersections as a control.
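The logic of the overlap, slop, and shuffle analysis can be sketched in a few lines of Python. This is not how BEDTools is implemented, just a toy version on a single chromosome with half-open (start, end) intervals; the interval coordinates, the chromosome length, and the function names are all invented for illustration.

```python
import random


def overlaps(a, b):
    """True if half-open intervals (start, end) share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def count_overlapping(query, targets):
    """Number of query intervals that overlap any target interval."""
    return sum(any(overlaps(q, t) for t in targets) for q in query)


def slop(intervals, pad, chrom_len):
    """Extend each interval by pad bases on both sides, clipped to the chromosome."""
    return [(max(0, s - pad), min(chrom_len, e + pad)) for s, e in intervals]


def shuffle_intervals(intervals, chrom_len, rng):
    """Move each interval to a random position, keeping its length."""
    shuffled = []
    for start, end in intervals:
        length = end - start
        new_start = rng.randint(0, chrom_len - length)
        shuffled.append((new_start, new_start + length))
    return shuffled


def empirical_p_value(query, targets, chrom_len, n_perm=1000, seed=0):
    """Observed overlap count and the fraction of shuffles doing at least as well."""
    rng = random.Random(seed)
    observed = count_overlapping(query, targets)
    as_extreme = sum(
        count_overlapping(shuffle_intervals(query, chrom_len, rng), targets) >= observed
        for _ in range(n_perm)
    )
    return observed, (as_extreme + 1) / (n_perm + 1)


if __name__ == "__main__":
    tf_sites = [(100, 120), (500, 520), (900, 920)]
    dhs_sites = [(90, 130), (495, 530), (2000, 2100)]
    observed, p = empirical_p_value(tf_sites, dhs_sites, chrom_len=100_000)
    print(f"{observed} of {len(tf_sites)} TF sites overlap a DHS, empirical p = {p:.4f}")
    print("nearby (within 1 kb):", count_overlapping(slop(tf_sites, 1000, 100_000), dhs_sites))
```

The shuffled intervals play the role of the random control: if real binding sites overlap hypersensitive sites much more often than randomly placed intervals of the same sizes do, the overlap is unlikely to be chance.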

FASTQ data download and alignment is slow and error prone (we had a lot of trouble making SGE scripts that would run correctly on our compute cluster). I should have shown TopHat just as a demo and used a small local FASTQ data file as an example rather than download and re-align ENCODE data. Using Cufflinks/Cuffdiff to compare gene expression from different cell lines was feasible with real ENCODE BAM files, but we would have needed to learn this earlier in the week and spend more time creating SGE scripts that run nicely with multithreading (to complete in a reasonable amount of time).

If I did this sort of tutorial again, I would figure out a way for the students to measure differential gene expression between cell lines from pre-computed ENCODE RNA-seq quantified data (wig files).

Coffee Berry Borer genome published

2015-07-31T11:05:00.000-04:00

Our paper on the de novo genome sequence and annotation of the Coffee Berry Borer (a beetle) is published today in Nature Scientific Reports. This was a really fun project, where I was pushed to do a lot more in-depth study of insect biology (such as antimicrobial and cytochrome P450 proteins). We also discovered that this beetle has captured a bunch of bacterial proteins into its genome (horizontal gene transfer) - which seems odd, but was actually previously reported for this insect and many others. Interestingly, most of these captured bacterial proteins provide starch digesting enzymes, which support the beetle's lifestyle of living entirely inside of the coffee bean and eating nothing but coffee! We are of course hoping that these genes can be used as some sort of target for control of the pest, which causes something like a billion $$ of annual damage worldwide to our beloved coffee.

http://www.nature.com/srep/2015/150731/srep12525/full/srep12525.html

http://www.nature.com/srep/2015/150731/srep12525/pdf/srep12525.pdf

2015-07-29T16:49:00.002-04:00

I am writing new lectures and organizing a lot of teaching material to teach 4 (!) classes this fall at two different universities (NYU and Fordham). I would like to keep the teaching materials in a nice easily accessible online location, and easily share with my students without a lot of hassle to sign them all up or whatever. I had a fairly good experience with Google Drive for a short course this Spring, so I'm trying it out now. Here is the master link to all of my 2015 teaching material:

https://drive.google.com/open?id=0BzalvBlHvt6LfldpaWxZQXVLcTZxUmpWZFdqSTBGeWl0MlJHeXBFQmhTTHBaX3JHNXowVDg

Stuff will appear, change, possibly disappear from this location as I keep sorting and rewriting, up to and during the classes. Most of the material is my own, some journal articles that I provide as readings to my students, and some shameless theft of good lectures, exercises, and tutorials from other folks smarter or better at explaining stuff than I am.

We are also planning to make Screencast type videos of most of the lectures, which get dumped on YouTube. I will try to find some sensible way of organizing them and sharing via this NGS blog.

2015-07-16T11:36:00.002-04:00

CSHL Press has made the RNA-seq chapter of my Next-Gen Seq book available free from their website: RNA Sequencing with Next-Generation Sequencing.

http://www.cshlpress.org/pdf/sample/2015/nextgen2/NGS2Chap13.pdf

New 'Next-Gen Seq 2' book is at the printer

2015-05-28T16:34:00.000-04:00

The second edition of the Next-Generation Sequencing Informatics book (that I edit) is at the printer and available for pre-order at Cold Spring Harbor Press and Amazon. We think it will ship on June 30th, maybe a bit sooner.

[James Hadfield at CoreGenomics blog has posted a review: http://core-genomics.blogspot.co.uk/2015/05/book-review-next-generation-dna.html ]

We have added new chapters on the latest sequencing technology, QC, de novo transcript assembly, proteogenomics and lots of updates and expansion in areas such as RNA-seq and ChIP-seq. It has a beautiful cover and it's not too expensive.

Here is the official publication blurb:

Next-generation DNA sequencing (NGS) technology has revolutionized biomedical research, making genome and RNA sequencing an affordable and frequently used tool for a wide variety of research applications including variant (mutation) discovery, gene expression, transcription factor analysis, metagenomics, and epigenetics. Bioinformatics methods to support DNA sequencing have become and remain a critical bottleneck for many researchers and organizations wishing to make use of NGS technology. Next-Generation DNA Sequencing Bioinformatics, Second edition, provides thorough, plain language introduction to the necessary informatics methods and tools for analyzing NGS data as did the first edition, and provides detailed descriptions of algorithms, strengths and weaknesses of specific tools, pitfalls and alternative methods. Four new chapters in this edition cover: experimental design, sample preparation, and quality assessment of NGS data; Public databases for DNA Sequencing data; De novo transcript assembly; proteogenomics; and emerging sequencing technologies. The remaining chapters from the first edition have been updated with the latest information. This book also provides extensive reference to best-practice bioinformatics methods for NGS applications and tutorials for common workflows. The second edition of Next-Generation DNA Sequencing Bioinformatics addresses the informatics needs of students, laboratory scientists, and computing specialists who wish to take advantage of the explosion of research opportunities offered by new DNA sequencing technologies.

and the Table of Contents:

1) Introduction to DNA Sequencing

Stuart M. Brown

2) Quality Control and Data Processing

Stuart M. Brown

3) History of Sequencing Informatics

Stuart M. Brown

4) Public Sequence Databases

Stuart M. Brown

5) Visualization of Next-Generation Sequencing Data

Philip Ross Smith, Kranti Konganti, and Stuart M. Brown

6) DNA Sequence Alignment

Efstratios Efstathiadis

7) Genome Assembly Using Generalized de Bruijn Digraphs

D. Frank Hsu

8) De Novo Assembly of Bacterial Genomes from Short Sequence Reads

Silvia Argimón and Stuart M. Brown

9) De Novo Transcriptome Assembly

Lisa Cohen, Steven Shen, and Efstratios Efstathiadis

10) Genome Annotation

Steven Shen and Stuart M. Brown

11) Using NGS to Detect Genome Sequence Variants

Jinhua Wang

12) ChIP-seq

Stuart M. Brown, Zuojian Tang, Christina Schweikert, and D. Frank Hsu

13) RNA-seq with Next-Generation Sequencing

Stuart M. Brown and Jeremy Goecks

14) Metagenomics

Guillermo I. Perez-Perez, Miroslav Blumenberg, and Alexander V. Alekseyenko

15) Proteogenomics

Kelly V. Ruggles and David Fenyö

16) DNA Sequencing Technologies and Applications

Gerald A. Higgins and Brian D. Athey

17) Cloud-based Next-Generation Sequencing Informatics

Konstantinos Krampis, Efstratios Efstathiadis, and Stuart M. Brown

Password hell

2015-02-09T10:03:00.002-05:00

This is not a Bioinformatics post, just an amusing technology catch-22 that I encountered this morning. At NYU we have automatic mandatory password updates for our accounts with IT. This includes email, login to my Windows desktop computer, and wireless devices on the secure WiFi network in our building. Since I am lazy about these things, I did not heed the warnings and follow the instructions in the "Password Update" email from our IT Department. Instead, at home on Sunday night, I got a message when I tried to log in to my email account saying that I should update my password, and a helpful little box appears where it is possible to type old password and new password, hit submit and it's all good.

I made a new password, and checked my mail, but after about 5 min, I got knocked off the network and can't log back in. It's late, so I figure to deal with it at the office in the morning. At my desk, I can't log into my computer (uses the same network "kerberos" password), and my phone complains that it can't get on the local wireless network. I try new password, old password, and eventually get the helpful message that my account has been locked by the IT Dept, and I must call the helpdesk. It's 9 AM on Monday and the helpdesk picks up right away. Help Guy asks if I have any wireless devices that may be using the old password. I look at the offending iPhone, and shut off WiFi. Helpdesk says: "I still see wireless activity hitting your account with an invalid password." Back to my desk, where my desktop Mac is using WiFi and getting unhappy messages from the network. Shut down WiFi. Helpdesk still sees activity on my account. Think, think?? Into the drawer where I have a laptop that we use for teaching and public seminars, it is asleep, but somehow still hitting the wireless network with my old password. Turn off WiFi on that one, and finally the helpful helpdesk guy can unlock my account. Then I can go back to each device and rejoin the network with the new password. I guess I'm not the first idiot this has happened to. Moral of the story??? Follow instructions very carefully or your helpful technology tools will gang up against you.

Happy Ice Storm Day from New York
-Stuart

Introduction to Biostatistics and Bioinformatics course at NYUMC

2014-09-10T15:14:00.000-04:00

We are giving a new course in the PhD program at NYU Med School (Sackler Institute) this semester called "Introduction to Biostatistics and Bioinformatics". It will have a mixture of lectures on Bioinformatics, Biostatistics, and Python programming. Hopefully we will be able to show the students the intersection of these topics as something like "Data Science for Biology". Lecturers will be myself, David Fenyo, Judy Zhong, and Pamela Wu.

Course Overview

The goal for the Introduction to Biostatistics and Bioinformatics course is to provide an introduction to statistics and informatics methods for the analysis of data generated in biomedical research. Practical examples covering both small-scale lab experiments and high-throughput assays will be explored. The course covers a wide range of topics in a short time so the focus will be on the basic concepts, and in the practical programming exercises the students explore these basic concepts and common pitfalls. An introduction to basic Python and R programming will be given throughout the course and many exercises will involve programming.

The lectures will be posted to YouTube each week. Here are our first ones from yesterday:
Intro Lecture/Data Visualization: http://youtu.be/YDUPzq7i49U
Python programming #1: http://youtu.be/r2N-thn7j4o

The course curriculum and links to lecture slides (PPT), readings, and various handouts and exercises are here: http://fenyolab.org/ibb2014

Cheap genome projects

2014-08-20T12:34:00.001-04:00

I have been helping out several groups who want to do cheap, quick genome projects on previously "unsequenced" eukaryotic organisms. In terms of the genetic diversity of living things, at this point we have sampled very unevenly across taxonomic domains. Insects are particularly underrepresented in the whole genome database. According to the number of named species, insects represent over 80% of animal species. The NCBI has over 10,000 whole genome projects, but only 132 insects (and 36 of those are Drosophila species). So there is certainly taxonomic room to knock out some more insect genomes and discover interesting new stuff. A wide variety of valid arguments can be made for the usefulness of sequencing any number of as yet overlooked organisms in all taxonomic domains.

There are a lot of very useful experimental approaches that require a draft genome as a reference: gene expression, transcription factors and epigenetics (ChIP-seq), and just the basic evolutionary biology of important genes that are present or absent in the genome, novel paralogs in important gene families, etc. A few years ago, building a draft genome for a new organism was a major undertaking that required substantial funding and a dedicated research team. Today, the sequencing can be done fairly cheaply, but the bioinformatics work is extremely open ended. Clearly you can work and work and work to make a very well defined and annotated genome with maximum value for all possible users. But what is the optimal set of most useful genomic information that we can produce with a few person-weeks of time? [that would perhaps serve as preliminary data to bring in showers of additional funding for follow-up studies]

NOTE ONE: Collect both genomic and RNA sequence data. These two data types are extremely complementary. Many earlier de novo genome projects built on collections of existing EST sequence data, which was the poor man's approach to draft genomes in the previous decade. The ESTs provided seeds for gene finding on the genomic DNA, training of gene-finding algorithms, etc. Now we can get a comprehensive genome wide set of RNA-seq data for the cost of one lane on a HiSeq machine and a sample prep kit. If you have the choice, get 100 bp paired-end sequencing of the RNA, it will map better and end up giving more value for the dollar. It might be possible to use paired-end RNA to bridge DNA contigs into scaffolds - as far as I know, this is an untested area. It would be very helpful to have a normalized RNA library, to get more coverage of poorly expressed genes – this is on my wish list for future Cheap-O Genome projects.

NOTE TWO: More data is good. High total genome coverage is good, but long insert paired-end DNA libraries build better genome contigs. This makes complete sense. Early "shotgun" genome projects relied on sequencing the ends of clones from libraries with various size inserts. It would be really nice to have 10 or 20 KB insert libraries, or "mate pair" sequences that come from the junctions of large genomic fragments that have been circularized, but these are generally not available when your entire sequencing budget is in the single digit thousands of dollars. We were able to get a 550 bp insert library for a recent project and it led to an assembly with an N50 > 40 KB. Pretty good for two lanes of HiSeq data.

For our next cheap genome, we are using the Illumina TruSeq Synthetic Long-Read kit (which is based on technology developed by Moleculo). This is a really clever idea: first it breaks the genome into ~10 KB fragments and sorts the fragments into wells of a 384 well plate. Just a few dozen to a few hundred fragments in each well. The fragments in each well are clonally amplified (sort of like 454 technology), then sheared into the normal size range for Illumina sequencing (300-500 bp) and tagged with barcode primers at the ends. Then all the tagged fragments are pooled and sequenced normally on a HiSeq machine. Illumina has a custom assembly app (built in BaseSpace) that demultiplexes the data and does separate de novo assembly on each barcode set – so it is just assembling the small number of 10 KB fragments from one well. The final output is a set of "synthetic long reads" that really do seem to be 10 kb long.

(From Illumina product literature)

NOTE THREE: I like the SOAPdenovo assembler (127-kmer) for Illumina DNA sequence data. It did a good job for us on several different species with only a moderate consumption of computing resources (an overnight job on 32 processors with shared 128 GB of RAM). The final product is a set of contigs in FASTA format, some quite big, and a lot of little ones. Hopefully the sum of the contigs comes out to something similar to the expected genome size of the organism. The quite new SOAPdenovo-Trans assembler for RNA-seq also worked quite well for us – at least in comparison to Trinity which is a huge computer hog.
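A quick sanity check on any assembler's output is to total the contig lengths (to compare against the expected genome size) and compute the N50. The sketch below is a generic illustration, not part of SOAPdenovo; the function names and the tiny in-line FASTA are made up for the example. N50 here means the length of the contig at which the running total, counted from the largest contig down, first reaches half of the assembly size.

```python
def read_fasta_lengths(lines):
    """Return the sequence length of each record in a FASTA file (given as an iterable of lines)."""
    lengths = []
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths

def n50(lengths):
    """Length of the contig at which the cumulative sum (largest first) reaches half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

# Toy example; for a real assembly pass an open file handle instead of a list of lines
toy_fasta = [">contig1", "ACGTACGT", "ACGT", ">contig2", "ACGTAC", ">contig3", "AC"]
contig_lengths = read_fasta_lengths(toy_fasta)
print("contigs:", len(contig_lengths))
print("total bases:", sum(contig_lengths))
print("N50:", n50(contig_lengths))
```

For a real contig file, `with open("contigs.fa") as fh: lengths = read_fasta_lengths(fh)` does the same thing (the filename is a placeholder).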

I visualize the bioinformatics work in two parts. First, find the genes in our data. Second, annotate the found genes and the genome using reference data. [Annotation will be described in another blog post.]

Ok, so here is my gene finding workflow for the Cheap-O Genome Project.

Gene Finding Workflow

Assemble DNA reads into genome contigs with SOAPdenovo assembler (127-kmer)
De novo gene finding on the DNA contigs with GENSCAN or GeneMark (I used GeneMark).
Assemble RNA-seq reads into "transcripts" with SOAPdenovo-Trans
Map RNA-seq reads onto the DNA contigs with TopHat
Make another set of transcripts with Cufflinks (using no annotation file)
Use BLAT to map the de novo assembled transcripts onto the DNA contigs
Use the extremely useful psl_to_bed_best_score.pl script written by Dave Tang (https://gist.github.com/davetang/7314846) to convert the output of BLAT (in .psl format) into a .bed file, choosing only the best match for each query. Without this sorting and conversion, the BLAT results as a PSL file look like garbage in a genome browser.
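To make the last step concrete, here is a simplified Python sketch of the same idea: keep only the best-scoring BLAT hit per query and write it as a 12-column BED line. This is not Dave Tang's script and does not reproduce its exact scoring. The score used here (matches minus mismatches) is an assumption for illustration, and the sketch assumes ordinary nucleotide BLAT output (21 tab-separated PSL columns, target coordinates on the plus strand).

```python
def psl_best_hits_to_bed(psl_lines):
    """Keep the highest-scoring PSL alignment for each query and convert it to BED12."""
    best = {}
    for line in psl_lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 21 or not f[0].isdigit():
            continue  # skip the PSL header and blank lines
        score = int(f[0]) - int(f[1])  # matches - mismatches (assumed scoring)
        qname = f[9]
        if qname not in best or score > best[qname][0]:
            best[qname] = (score, f)

    bed = []
    for qname, (score, f) in best.items():
        strand, tname = f[8], f[13]
        tstart, tend = int(f[15]), int(f[16])
        block_sizes = [int(x) for x in f[18].rstrip(",").split(",")]
        tstarts = [int(x) for x in f[20].rstrip(",").split(",")]
        # BED block starts are relative to the start of the feature
        block_starts = [t - tstart for t in tstarts]
        bed_score = max(0, min(score, 1000))  # BED scores must be 0-1000
        bed.append("\t".join([
            tname, str(tstart), str(tend), qname, str(bed_score), strand,
            str(tstart), str(tend), "0", f[17],
            ",".join(map(str, block_sizes)),
            ",".join(map(str, block_starts)),
        ]))
    return bed

# Two hypothetical hits for the same transcript; only the better one should survive
psl = [
    "\t".join(["90", "5", "0", "0", "0", "0", "1", "100", "+", "tx1", "100", "0", "95",
               "contig1", "5000", "1000", "1195", "2", "50,45,", "0,50,", "1000,1150,"]),
    "\t".join(["40", "10", "0", "0", "0", "0", "0", "0", "-", "tx1", "100", "0", "50",
               "contig2", "3000", "200", "250", "1", "50,", "50,", "200,"]),
]
bed_lines = psl_best_hits_to_bed(psl)
for row in bed_lines:
    print(row)
```

The block columns are what let a genome browser draw the spliced exon structure of each transcript instead of one solid bar.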

OK, now assemble all 5 data sets into one nice visualization using IGV or GBrowse. We have a genome track (the DNA contigs from SOAPdenovo in FASTA format), an RNA track (the RNA-seq reads aligned to the draft genome in BAM format), a gene prediction track (GeneMark GTF file), the Cufflinks transcripts (transcripts.gtf), and the RNA assemblies (from SOAPdenovo-Trans) as a BED file. For some genes, all of the data agree quite nicely. For other genes, it's guess your best.

Here are two IGV screenshots of examples from the same genome contig. The first is a nice gene with plenty of RNA where all 3 annotation methods agree.

The second is a messy region where no gene model makes much sense, none of the methods agree at all, but there seems to be enough RNA (and spliced alignments!) to suggest real transcription is happening. Time to add some reference data by homology modeling (in my next post on annotation).

Dr. Evan Eichler speaks about genomic structural variation

2013-11-16T17:56:00.000-05:00

Yesterday I attended an excellent symposium on genomic structural variation organized by the Simons Foundation. The unifying theme from all of the speakers was the use of Pacific Biosciences long read technology to resolve large-scale duplicated sequences in the human genome. These long PacBio reads (5-10 kb) can be assembled across genetic regions with complex patterns of repeat structures, segmental duplications, inversions and deletions.

For me, the highlight of the afternoon was a talk by Evan Eichler from the University of Washington. Dr. Eichler presented both detailed sequencing data from specific loci and a grand overview of structural variation that synthesizes copy number variation, multi-gene families, the biology of autism and human evolution. His first point was that the reference genome is missing substantial sections of duplicated DNA, which has significant variation from person to person. Assembly software will tend to collapse multiple, nearly identical paralogous gene copies into one locus. Dr. Eichler's group has constructed more accurate sequences for regions with these complex patterns of segmental duplication using long PacBio reads. He has identified paralogous copies of genes, which actually exist as multi-gene families, and then created specific tags to track the copy number of various gene isoforms in different human genomes (such as from the 1000 genomes project). For example the SRGAP2 locus has 4 isoforms, each of which may be repeated several times in the genomes of some people.

Second, he explained that these regions of frequent copy number variation are often the site of deletions in the genomes of people with autism. These deletions and duplications may be quite large and typically include dozens of other genes besides the family of paralogs. In fact, the genome has hotspots of CNVs that are flanked by high-identity duplicated regions. In addition, some people may have additional duplications at hotspots, which create a predisposition for deletion or expansion events in their progeny.

Why do these deletions and duplications cause autism? Dr. Eichler suggested that brain development is a process that involves many genes, and it is particularly sensitive to gene dosage.

Dr. Eichler proposed a link to human evolution that is quite tantalizing. Many of the families of duplicated genes at the CNV hotspots are involved in brain development. These same genes are not duplicated in apes. A process of gene duplication and sequence variation allows for positive selection for new brain development phenotypes. So the gene duplication process which created expanded and more complex human brains may also make us susceptible to neurologically damaging CNV mutations.

http://www.ncbi.nlm.nih.gov/pubmed/23892896?ordinalpos=1&itool=EntrezSystem2.PEntrez.Pubmed.Pubmed_ResultsPanel.Pubmed_RVDocSum

A bit more on genome annotation

2013-10-26T09:31:00.000-04:00

One bit of followup on the Pig genome annotation story. We usually visualize RNA-seq results in the IGV browser. It allows direct inspection of read alignments to your favorite genes and can also be helpful to spot sequence variations and splicing issues. However, IGV has a set of pre-loaded default genomes that also seem to be derived from RefSeq. So once again, working with data from the pig, there was no annotation for most of our genes of interest. This is fairly annoying since it means that the only way to look at the annotation of a gene is to first look up the gene in UCSC and then copy the exact chromosome coordinates to IGV, including intron-exon borders.

It is possible to fix this by downloading to the local computer the ENSEMBL gene annotations from UCSC Table Browser as a BED file (not too large), and then loading the BED file into IGV as another data track. This works nicely in terms of showing the genes and exons, but the gene labels still carry the ugly ENSEMBL names. Once again, the ensemblToGeneName track comes in handy, providing a table with the ENSEMBL name and the Official gene symbol for about 20,000 genes. We were able to add the gene symbol to the BED file, but this has to be done carefully (in Perl or Awk) since making file edits in Excel seems to break the BED file (at least for me). Loading the edited BED file into IGV, I was then able to jump to genes by name and get screen shots of interesting regions that included a gene structure track with nice gene names.
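For anyone who would rather do the renaming in Python than in Perl or Awk, here is a minimal sketch of the idea. It is not the script I used. It assumes the ensemblToGeneName table was saved as a two-column tab-separated file (ENSEMBL ID, then gene symbol) and that the ENSEMBL ID sits in column 4 (the name column) of the BED file; the function names and the example IDs and symbols are placeholders.

```python
def load_symbol_map(table_lines):
    """Build {ENSEMBL ID: gene symbol} from a two-column tab-separated table."""
    mapping = {}
    for line in table_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip the header comment and blank lines
        ens_id, symbol = line.rstrip("\n").split("\t")[:2]
        mapping[ens_id] = symbol
    return mapping

def rename_bed(bed_lines, mapping):
    """Replace the BED name column (column 4) with the gene symbol when one is known."""
    out = []
    for line in bed_lines:
        line = line.rstrip("\n")
        fields = line.split("\t")
        if line.startswith(("track", "browser", "#")) or len(fields) < 4:
            out.append(line)  # pass header lines through untouched
            continue
        fields[3] = mapping.get(fields[3], fields[3])  # keep the ENSEMBL ID if no symbol
        out.append("\t".join(fields))
    return out

# Placeholder records for illustration only
table = ["#name\tvalue", "ENSSSCT00000000001\tGENE1", "ENSSSCT00000000002\tGENE2"]
bed = [
    "track name=ensGene",
    "chr1\t100\t500\tENSSSCT00000000001\t0\t+",
    "chr1\t900\t1500\tENSSSCT00000000099\t0\t-",
]
renamed = rename_bed(bed, load_symbol_map(table))
for row in renamed:
    print(row)
```

Splitting and joining only on tabs leaves every other column byte-for-byte unchanged, which avoids the kind of damage a spreadsheet edit can introduce.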