Oct 29, 2018

Some tips to optimize bacterial genome assembly

I just finished a revised genome assembly for a collaborating lab. We do de novo sequencing of bacterial genomes all the time, sometimes in batches of 50 or 100 different isolates barcoded together on a high-yield HiSeq run. We typically aim for coverage in the 500x to 1000x range with paired-end 150-base reads.

This genome was assembled as part of a "pipeline" script using SPAdes, but it did not come out very nicely: over 3,000 contigs, most of them small with low coverage. A FastQC report on the raw data looked good (mean quality >Q30 at all positions), except that the last 10 bases show a big drop in quality AND Illumina adapters are found, sometimes as much as 70 bases into the 3' end, in quite a lot of the reads.

I also checked some of the largest contigs with a simple BLASTn search at the NCBI website, and they matched quite well to one of our favorite bacterial species (e-value = 0.0; percent identity = 93%). HOWEVER, many of the smaller contigs match the HUMAN genome. This is a big red flag for me, and probably should be checked for every bacterial (or any non-human) de novo genome assembly.

So here is my improved de novo assembly protocol:

1) To remove the human DNA, I went back to the raw FASTQ files and aligned them to the human reference genome with Bowtie2. The simple way to do this is to use the '--un' flag in the Bowtie2 command, which dumps all unaligned reads to a separate output file. There are more sophisticated ways to do this (using SAMtools) that check whether one or both reads of a pair align concordantly to the genome, but I wanted to be certain to remove any and all human reads. So I aligned each of my paired-end data files (R1 and R2) separately and collected the unmatched reads from each. I found 0.5% human reads in my raw data. This may not seem like a lot, but the assembler will make thousands of contigs from this small amount of contaminant DNA.
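The commands for this step look roughly like the following; the index name (GRCh38), the file names, and the thread count are placeholders for illustration, not the actual values from our run:

```shell
# Align each read file separately in unpaired (-U) mode against the
# human index. Reads that do NOT align are written by --un to a
# 'nonhuman' FASTQ; the SAM output itself is discarded to /dev/null.
bowtie2 -x GRCh38 -U sample_R1.fastq.gz \
        --un sample_R1.nonhuman.fastq -S /dev/null -p 8

bowtie2 -x GRCh38 -U sample_R2.fastq.gz \
        --un sample_R2.nonhuman.fastq -S /dev/null -p 8
```

Running R1 and R2 independently in unpaired mode is deliberately aggressive: any read that looks human is dropped regardless of what its mate does.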

This filter will probably remove a few reads (genome regions) that contain simple sequence repeats longer than 25 bases (for example ATATATAT..., or CCCCCCCCC...) that are found in both human and my bacteria, but we know that assembly of repeats is not reliable anyway with 150 base Illumina reads.

2) Using the unmatched files from the human screen, I filtered for quality and removed Illumina adapters with Trimmomatic. This also catches the case where one read matches human and is removed but its mate is not: such 'unpaired' reads are not included in the 'trimmed' (paired) output files.
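A sketch of the Trimmomatic call, assuming version 0.39 and the stock TruSeq3 adapter file; the quality and length thresholds here are illustrative, not the exact settings from our pipeline. One caveat: Trimmomatic's PE mode expects the two input files to contain the same reads in the same order, so after screening R1 and R2 against human separately, the two 'nonhuman' files may need to be re-paired before this step.

```shell
# PE mode emits four files: paired and unpaired survivors for R1 and R2.
# ILLUMINACLIP removes adapter sequence; SLIDINGWINDOW trims
# low-quality 3' tails; MINLEN drops reads that end up too short.
java -jar trimmomatic-0.39.jar PE -phred33 \
    sample_R1.nonhuman.fastq sample_R2.nonhuman.fastq \
    trimmed_R1.fastq unpaired_R1.fastq \
    trimmed_R2.fastq unpaired_R2.fastq \
    ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 SLIDINGWINDOW:4:20 MINLEN:50
```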

3) The trimmed output files are now ready for assembly with SPAdes. I used k-mer sizes 21, 33, 55, 77, 99, as recommended by the SPAdes manual for 150 bp paired Illumina reads. I also included the --careful flag, which is recommended for bacterial genomes with high coverage.
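The corresponding SPAdes invocation is short; the input and output names are placeholders:

```shell
# -k lists the k-mer sizes to try; --careful runs the built-in
# mismatch/indel correction pass on the resulting contigs.
spades.py -1 trimmed_R1.fastq -2 trimmed_R2.fastq \
          -k 21,33,55,77,99 --careful -o spades_output
```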

4) I got good results with this assembly: only about 600 contigs, with the largest one containing more than half of the expected genome (N50 > 1 Mb). This FASTA file of contigs was small enough to submit to the NCBI web BLAST server in a single search. I was also able to format the output as a text report showing just the top 10 matches for each contig (with no alignments).

5) In addition to contig length, SPAdes reports the depth of k-mer coverage in the header line for each contig in the multi-sequence FASTA file. I observed that all contigs with coverage greater than 100x matched the same bacterial genus, but contigs with low coverage matched all sorts of different things, quite a few to E. coli. I conclude that the low-coverage contigs are low-abundance contaminants in our sequencing library, and should not be included in the published de novo genome. [I do not trust pre-filtering with Bowtie2 against E. coli; we might lose some highly conserved genes.]

6) Somebody more diligent than myself (perhaps one of my students in need of a final project in the Intro Bioinformatics course) can write a Biopython script to filter sequences in a FASTA file by coverage as reported in the SPAdes FASTA header. I did it with some careful work with awk and Excel. My final genome is 2.2 Mb, with just 39 high-coverage contigs, all of which have highly significant best BLAST matches to the same bacterial genus, and over 2,000 genes predicted by GeneMarkS.
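For anyone who wants to skip the Excel step, the whole filter can be a single awk pass over the SPAdes contig headers, which look like >NODE_1_length_500000_cov_250.3 (coverage is the 6th underscore-separated field). A minimal sketch: the two-contig sample FASTA, the 100x cutoff, and the file names are just illustrations.

```shell
# Hypothetical two-contig FASTA in SPAdes header format, for testing.
cat > contigs.fasta <<'EOF'
>NODE_1_length_12_cov_250.3
ACGTACGTACGT
>NODE_2_length_8_cov_5.1
ACGTACGT
EOF

# Treat each '>'-delimited record as one contig; keep it only if the
# coverage field in its header meets the MIN_COV threshold.
awk -v MIN_COV=100 'BEGIN { RS = ">"; ORS = "" }
NR > 1 {
    split($1, f, "_")              # f[6] is the k-mer coverage value
    if (f[6] + 0 >= MIN_COV) print ">" $0
}' contigs.fasta > filtered.fasta

grep '^>' filtered.fasta
# → >NODE_1_length_12_cov_250.3
```

The low-coverage NODE_2 is dropped; only the high-coverage contig survives into filtered.fasta.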

The take-home message:
1) filter out human reads from raw sequence data that will be used for de novo assemblies (I don't know how well this will work for mammals, vertebrates, etc.).
2) filter out low coverage contigs from your final contig assembly FASTA file.

Raw data: