Oct 29, 2018

Some tips to optimize bacterial genome assembly

I just finished a revised genome assembly for a collaborating lab.  We do de novo sequencing of bacterial genomes all the time, sometimes in batches of 50 or 100 different isolates barcoded together on a high-yield HiSeq run. We typically aim for coverage in the 500x to 1000x range with paired-end 150 base reads.

This genome was assembled as part of a "pipeline" script using SPADES, but it did not come out very nicely - over 3000 contigs, most of them small with low coverage.  A FastQC on the raw data looked good (all bases with mean quality  >Q30) except for the last 10 bases show a big drop in quality AND there are Illumina adapters found sometimes as much as 70 bases into the 3' end of quite a lot of the reads.

I also checked some of the largest contigs by a simple BLASTn at the NCBI website, and they matched quite well to one of our favorite bacterial species (e-value = 0.0;  percent identity = 93%). HOWEVER, many of the smaller contigs match the HUMAN genome. This is a big red flag for me, and probably should be checked for every bacterial (or any non-human) de-novo genome assembly.

So here is my improved de novo assembly protocol:

1) To remove the human DNA, I went back to the raw FASTQ files, and aligned to the reference human genome with Bowtie2. The simple way to do this is to use the '--un' flag in the Bowtie2 command, which dumps all unaligned reads to a separate output file.  There are other more sophisticated ways to do this (using SAMtools)  that interrogate if one or both of a set of paired-end reads align concordantly to the genome, but I wanted to be certain to remove any and all human reads. So I aligned each of my paired end data files (R1 and R2) separately and collected the unmatched reads for each.  I found 0.5% human reads in my raw data. This may not seem like a lot, but the assembler will make thousands of contigs from this small amount of contaminant DNA.

This filter will probably remove a few reads (genome regions) that contain simple sequence repeats longer than 25 bases (for example ATATATAT..., or CCCCCCCCC...) that are found in both human and my bacteria, but we know that assembly of repeats is not reliable anyway with 150 base Illumina reads.

2) Using the unmatched files from the human screen, I filtered for quality and removed Illumina adapters with Trimmomatic. This also catches the case where one read matches human and is removed, but the mate-pair is not. These 'unpaired' reads are not included in the 'trimmed' output file. 

3) The trimmed output files are now ready for assembly with SPADES.  I used kmer sizes 21, 33, 55, 77, 99 as recommended by the SPADES manual for 150 bp paired Illumina reads. I also included the --careful flag which is recommended for bacterial genomes with high coverage.

4) I got good results with this assembly - only about 600 contigs, with the largest one containing more than half of the expected genome (N50 > 1 Mb).  This FASTA data file of contigs was small enough to use the NCBI web BLAST server for a single search. I was also able to  Format the output as a text report showing just the top 10 matches for each contig (with no alignments). 

5) In addition to contig length, SPADES also reports the depth of kmer coverage in the header line for each contig in the multi-sequence FASTA file. I observed that all contigs with coverage greater than 100x matched to the same bacterial genus, but contigs with low coverage matched to all sorts of different things - quite a few to E.coli.  I conclude that the low coverage contigs are low abundance contaminants in our sequencing library, and should not be included in the published de novo genome.  [I do not trust pre-filtering by Bowtie against E.coli, we might lose some highly conserved genes]

6) Somebody more diligent than myself (perhaps one of my students in need of a final project in the Intro Bioinformatics course)  can write a Biopython script to filter sequences in a FASTA file by coverage as reported in the SPADES FASTA header.  I did it with some careful work with awk and Excel.  My final genome is 2.2 Mb, with just 39 high coverage contigs which all have highly significant best BLAST matches to the same bacterial genus and over 2000  genes predicted by GeneMarkS.

The takehome message:
1) filter out Human reads from raw sequence data that will be used for de-novo assemblies (I don't know how well this will work for mammals, vertebrates, etc). 
2) filter out low coverage contigs from your final contig assembly FASTA file.

Raw data:

Jan 31, 2018

Genome Coverage from BAM file

There are many excellent tools for analysis of Next Gen Sequencing data in the standard BAM alignment format so I was surprised how difficult it was for me to get a nice graph of genome coverage.  This will be trivial for a lot of hard core bioinformatics coders, so just move along if you are bored/annoyed.

I needed to check the evenness of coverage across intervals of a bacterial genome that we were re-sequencing for various experimental reasons. I aligned my FASTQ to a reference genome from GenBank using Bowtie2. There are several nice tools in the SAMTools and BEDTools kits that produce either a base by base coverage count or a histogram of coverage showing how many bases are covered how deeply.  I wanted a map at 1 Kb resolution. It took a while to figure out that I first need to make a BED file of intervals from my genome - with correct names for the 'chromosomes' that match the SAM header, and then use 'samtools bedcov' to get my intervals. Then a simple graph from Excel or R shows the coverage per interval along the genome.

Here are the steps (as much for me to remember as for usefulness to anyone else)

1) Create a Bowtie2 index of the reference genome (can be from GenBank, or can be a de novo assembly of contigs created locally from this FASTQ data).   Reference_in must be FASTA format.
bt2_base is the name you will call the index.


2) Align the FASTQ file(s) to the Reference. bt2-idx is the name of the index created in the previous step. There are a ton of options for stringency of alignment, format of input data, etc. I set this to use 16 CPU threads.  I usually like to leave out the unaligned reads, which can reduce the output file size somewhat. [Of course, the unalinged reads are the goal when you use Bowtie as a filter to remove human from microbiome data or any other sort of contaminant screen.] 

bowtie2 -p 16 
--no-unal -x -1 -2 -S

3) Is is annoying that Bowtie produces output in SAM format, and 99% of the time, the very first thing you have to do is convert to sorted BAM. Note that samtools sort puts its own .bam on the end, so if your are not careful you will get files named output.bam.bam

samtools view -bS output.sam | samtools sort - file_sorted

4) Create a 'genome' file for your reference genome. This is just a tab delimited file that names the chromosomes (or individudal sequences that may be in a multi-FASTA file, such as contig names).
It is super easy to mess this up. The best way is to view the top of your output.sam file created by bowtie2. The lines that start with @ are your chromosome headers, and they very helpfully already show the length of each one.  This is a bit of a pain if you  have a genome with lots of contigs, but a little 'cut' and 'paste' in bash or Excel will get you there.

Here is what mine looked like:

@HD     VN:1.0  SO:unsorted
@SQ     SN:MKZW02000001.1       LN:5520555
@SQ     SN:MKZW02000002.1       LN:248293
@PG     ID:bowtie2      PN:bowtie2      VN:2.2.7        CL:"/local/apps/bowtie2/2.2.7/bowtie2-align-s --wrapper basic-0 -p 8 -x Kluy

and here is the genome file I made:

MKZW02000001.1  5520555
MKZW02000002.1  248293

5) Make a set of intervals with bedtools makewindows. I wanted 1 Kb intervals, so I use -w 1000.
The result is a simple BED file with one line for each 1Kb window of the genome. 

bedtools makewindows -g genome.txt -w 1000 > genome_1k.bed

$ more genome_1k.bed

MKZW02000001.1  0       1000
MKZW02000001.1  1000    2000
MKZW02000001.1  2000    3000
MKZW02000001.1  3000    4000

6) Use samtools bedcov to count the total number of bases in the BAM file that are located in each of the intervals ('sum of per base read depths per BED region'). This works much faster than any other coverage tool that I have tested. 

samtools bedcov genome_1k.bed kv_sorted.bam > kv_1k.cov

7) Divide the sum of coverage by the window size (/1000 in my case), and plot the average coverage per window as a scatter plot, using the end of each interval as the X axis and the coverage as the Y. 
Histogram will only work nicely if you have very few intervals. In my case, high and low coverage outlier intervals are easily visible. 

Jan 4, 2018

Genome Annotation Challenges

Public databases of genetic information have a fundamental garbage-in>garbage-out problem. A huge number of useful databases are populated by pulling information from other databases and adding new value by computational inferences, but automated linking of databases can propagate incorrect information. The curators of primary repositories such as GenBank make an substantial effort to publish only correct information, so they are very conservative about annotating genes with only verifiable information. NCBI also has a policy that the original depositor of any given entry (a gene, protein, genome, experimental dataset, etc.) is the author of its annotation and metadata, and no one else can alter it. 

Staphylococcus aureus is an important human pathogen with perhaps the largest number of whole genome sequences in public repositories of any bacteria. NCBI has 8367 Staph genomes in its “Genomes” section (on Jan 1, 2018), and another ~40,000 in the SRA and Whole Genome Shotgun sections. However, GenBank has chosen strain NCTC 8325 as the Reference Genome for Staph, and put its genes in RefSeq. 

This genome was sequenced, annotated, and submitted on 27-Jan-2006 by Gillaspy et al from the Oklahoma Health Sciences Center. As a result of this “Reference Genome” designation, an automatic lookup of a Staph gene in GenBank is likely to get the annotation from NCTC 8325. This particular Staph genome has 2,767 protein coding genes (plus 30 pseudogenes, 61 tRNA, and 16 rRNA genes), however 1496 of these proteins are annotated with only “hypothetical protein” in their “gene product” or “description” field. This is very confusing, since many of these genes are 100% identical to proteins that have specific and well documented functions in other Staph strains. 

Here is one example:

hypothetical protein SAOUHSC_00010 [Staphylococcus aureus subsp. aureus NCTC 8325]
NCBI Reference Sequence: YP_498618.1
FEATURES             Location/Qualifiers      source          1..231                      /organism="Staphylococcus aureus subsp. aureus NCTC 8325"                      /strain="NCTC 8325"                      /sub_species="aureus"                      /db_xref="taxon:93061"      Protein         1..231                      /product="hypothetical protein"

GenBank knows that this is not the correct annotation for this protein. In the “Region” sub-field of the record (which is very rarely used by automated annotation tools that take data from GenBank) an appropriate function, COG and a CDD conserved domain are noted:

Region  7..231 /
/note="Predicted branched-chain amino acid permease (azaleucine resistance) 
[Amino acid transport and metabolism]; COG1296"                      

NCBI also links this gene to an “Identical Protein Group” where 3957 proteins are listed with 100% amino acid identity, which are annotated variously as: “azaleucine resistance protein AzlC”, “branched-chain amino acid ABC transporter permease”, “AzlC”, and “Inner membrane protein YgaZ”. A very conservative annotation bot might panic at this level of inconsistency and default to the lowest common denominator of “hypothetical protein”. However, a more sophisticated automaton might compare the protein sequence to PFAM or COG protein functional families and assign a common annotation to them all.

The incorrect “hypothetical” annotations for Staph genes in GenBank can be found downstream in many other databases, such as the Database of Essential Genes, AureoWiki, KEGG, UniProt, etc. which all upload their primary annotation from GenBank. So someone sequencing a new strain of Staph and using any of these resources to annotate predicted genes will probably end up assigning “hypothetical protein” for the AzlC gene and many hundreds of others, perpetuating the cycle of misinformation.

In a lot of other cases, it does not seem possible for an algorithm to resolve messy annotations that a human expert might  be able to figure out. For example, Staph strain COL has many hypothetical genes such as SACOL1097. NCBI Identical Proteins also show only “hypothetical protein” annotations.  However, a BLAST search shows 95% identity to nitrogen fixation protein NifR.  

hypothetical protein SACOL1097 [Staphylococcus aureus subsp. aureus COL]
GenBank: AAW37977.1
Identical Proteins FASTA Graphics
LOCUS       AAW37977                  59 aa            linear   BCT 31-JAN-2014
DEFINITION  hypothetical protein SACOL1097 [Staphylococcus aureus subsp. aureus  COL].
VERSION     AAW37977.1
DBLINK      BioProject: PRJNA238
            BioSample: SAMN02603996
DBSOURCE    accession CP000046.1
SOURCE      Staphylococcus aureus subsp. aureus COL

Dec 22, 2017

Contaminated Genomes

This is a long post, so first a quick summary. Some genome sequences contain contaminants. These contaminants create many problems when we use a trusted resource like GenBank or UniProt to summarize the sequences in a taxonomic group. I have illustrated one typical example, but there are thousands (maybe tens of thousands) of others.

I have been obsessing over errors and contamination in our public sequence databases. This week I was trying to use UniProt as a set of reference sequences for fungi. Our goal is fairly simple: To find the fungal DNA in a metagenomic shotgun sequence sample - which is just a mixture of all the DNA present in a scraping from mouth, throat, or any other body site.

UniProt makes it quite easy to sort all their proteins by taxonomy, and to download a subset of the data clustered at 100% (combining all exact duplicate sequences), 90%, or 50% amino acid identity. One might expect that fungal genes should not match bacteria at more than 50% identity. But surprisingly there are quite a lot of 50% and 90% clusters that contain both bacterial and fungal sequences (about 3000 of the 90% fungal clusters also contain bacterial proteins). 

The UniProt support staff provided some very useful help to build a query on their system that finds only those clusters of 90% identical proteins that contain fungal genes, but NO (NO!) bacterial genes. In case you like this sort of thing, here is the exact query:

uniprot:(taxonomy:"Fungi [4751]") NOT taxonomy:"Bacteria [2]" AND identity:0.9
[Note the careful use of quote marks parenthesis and square brackets, this stuff is rather tricky]

So I downloaded this set of putative fungal proteins (UniProt very helpfully creates a single 'representative' UniRef sequence in FASTA format for each cluster). I tested the fungal proteins against all the gene coding sequences (CDS) from the E.coli genome using BLASTx.  Once again, there are far too many high similarity matches.

One of the top matches is to a gene (Guanosine-3',5'-bis(diphosphate) 3'-pyrophosphohydrolase) from the fungus Beauveria bassiana that has 98% identity to E.coli. Since I am in an obsessive mood about this sort of thing, I decided that for this one example, I would collect some evidence to decide if we have strong sequence homology between bacteria and fungi for this gene, if Beauveria bassiana has a horizontal gene transfer, or if E. COLI CAN BE A CONTAMINANT IN GENOME SEQUENCES (!!!) [emphasis mine]

I put this Beauveria gene into a generic NCBI BLAST against all 'nr' proteins, and I got a very interesting result. There are exactly two matches to eukaryotes (Beauveria and a nematode), and 11,858 matches to bacteria, including lots of E.coli.

So I traced the Beauveria bassiana protein in UniProt back to its source as a whole genome shotgun sequence uploaded to GenBank on Nov 3, 2014 by the Institute of Plant Protection, Jilin Academy of Agricultural Sciences, Accession PRJNA178080, WGS ANFO00000000, Assembly GCA_000770705.1

I downloaded the whole genome assembly and BLASTed it with the E.coli hydrolase gene from above.  This very quickly pinpointed a contig  00271 (ANFO01000251.1 Beauveria bassiana D1-5 contig00271) that contains the matching sequence. The contig is 72,232 bases long. I then put this conting into NCBI BLAST against Bacteria. I get matches that correspond to lots of bacterial genes (POL I, RecG, iPGM, XanP, CpxA, GTP binding protein, GSI beta, and my ppGpp hydrolase) all with  >90% identity and BLAST e-value 0.0. 

Final answer: This is a contaminant. There was some E.coli DNA sequenced and assembled with the Beauveria DNA, and nobody checked before loading these sequences into GenBank.

My recommendation to GenBank and de novo genome sequencers everywhere is to check all predicted proteins from new genomes for matches to bacteria and human before loading them into a trusted database.

Aug 24, 2017

Gene expression analysis shows that an AHR2 mutant fish population from the Hudson River has a dramatically reduced response to dioxin

Together with Ike Wirgin in the NYUMC Dept. of EnvironmentalMedicine, we just published a paper on a gene expression study of a fish population in the Hudson River in Genome Biology and Evolution. "A Dramatic Difference inGlobal Gene Expression between TCDD-Treated Atlantic Tomcod Larvae from theResistant Hudson River and a Nearby Sensitive Population"

Atlantic tomcod (Microgadus tomcod) is a fish that Ike has been studying for many years as an indicator of biological responses to toxic pollution of the Hudson River estuary which contains two of the largest Superfund sites in the nation because of PCB and dioxin (TCDD)contamination. It was previously shown that tomcod from the Hudson River had extraordinarily high prevalence of liver tumors in the 1970's, exceeding 90% in older fish.  But in 2006 they found that the Hudson River population of this fish had many fewer tumors and a 100-fold reduction of induction of the Cytochrome P450 pathway in response to dioxin and PCB exposure (Wirgin and Chambers 2006). In a 2011 Science paper, they reported a two amino acid deletion of the AHR2 gene in the Hudson River population that was absent, or nearly so, in all other tomcod populations.  The aryl hydrocarbon receptor is responsible for induction of CYP genes in all vertebrates and in activation of most toxicities from these contaminants (Wirginet al 2011).

Our goal for this project was to build a de novo sequence of the genome of the tomcod, annotate all the genes, and do a global analysis of gene expression (with RNAseq) to look at the genome-wide effects of the AHR2 mutation in Hudson River larvae as compared to wild type fish (collected from a clean location at Shinnecock Bay, on the South Shore of Long Island, in the Hamptons). All DNA and RNA sequencing was done at the NYUMC Genome Technology Center under the direction of Adriana Heguy

From a bioinformatics point of view, this project was interesting because we decided to integrate both a genome assembly and multiple transcriptome assemblies to get the most complete set of full-length protein coding genes.  For the transcriptome, we did de novo assembly with rnaSPAdes on many different RNAseq samples including embryo, juvenile, and adult liver.  We made the genome assembly with SoapDenovo2, and then predicted gene coding regions with GLIMMER-HMM. We combined all of these different sets of transcript coding sequences with the EvidentialGene pipeline created by Don Gilbert. With the final merged set of transcripts, we used Salmon to (very quickly) quasi-map the reads in each RNAseq sample onto the transcriptome and quantify gene expression.  Differential gene expression was computed with edgeR.

The results were extremely dramatic. At low doses of dioxin, the wild type larvae show a huge gene expression response, with about a thousand genes having large fold-changes (some key genes were validated by qRT-PCR). The mutant Hudson River larvae basically ignore these low doses, with almost no gene expression changes. At the highest does (1 ppb), the Hudson River fish show some gene expression changes, but mostly not in the same genes as in the wild type fish.  Even the negative control larvae (not treated with dioxin) show a large difference in gene expression between the two populations.