I needed to check the evenness of coverage across intervals of a bacterial genome that we were re-sequencing for various experimental reasons. I aligned my FASTQ to a reference genome from GenBank using Bowtie2. There are several nice tools in the SAMTools and BEDTools kits that produce either a base by base coverage count or a histogram of coverage showing how many bases are covered how deeply. I wanted a map at 1 Kb resolution. It took a while to figure out that I first need to make a BED file of intervals from my genome - with correct names for the 'chromosomes' that match the SAM header, and then use 'samtools bedcov' to get my intervals. Then a simple graph from Excel or R shows the coverage per interval along the genome.
Here are the steps (as much for me to remember as for usefulness to anyone else)
1) Create a Bowtie2 index of the reference genome (can be from GenBank, or can be a de novo assembly of contigs created locally from this FASTQ data). Reference_in must be FASTA format.
bt2_base is the name you will call the index.
2) Align the FASTQ file(s) to the Reference. bt2-idx is the name of the index created in the previous step. There are a ton of options for stringency of alignment, format of input data, etc. I set this to use 16 CPU threads. I usually like to leave out the unaligned reads, which can reduce the output file size somewhat. [Of course, the unalinged reads are the goal when you use Bowtie as a filter to remove human from microbiome data or any other sort of contaminant screen.]
bowtie2 -p 16 --no-unal -x
3) Is is annoying that Bowtie produces output in SAM format, and 99% of the time, the very first thing you have to do is convert to sorted BAM. Note that samtools sort puts its own .bam on the end, so if your are not careful you will get files named output.bam.bam
samtools view -bS output.sam | samtools sort - file_sorted
4) Create a 'genome' file for your reference genome. This is just a tab delimited file that names the chromosomes (or individudal sequences that may be in a multi-FASTA file, such as contig names).
It is super easy to mess this up. The best way is to view the top of your output.sam file created by bowtie2. The lines that start with @ are your chromosome headers, and they very helpfully already show the length of each one. This is a bit of a pain if you have a genome with lots of contigs, but a little 'cut' and 'paste' in bash or Excel will get you there.
Here is what mine looked like:
@HD VN:1.0 SO:unsorted
@SQ SN:MKZW02000001.1 LN:5520555
@SQ SN:MKZW02000002.1 LN:248293
@PG ID:bowtie2 PN:bowtie2 VN:2.2.7 CL:"/local/apps/bowtie2/2.2.7/bowtie2-align-s --wrapper basic-0 -p 8 -x Kluy
and here is the genome file I made:
5) Make a set of intervals with bedtools makewindows. I wanted 1 Kb intervals, so I use -w 1000.
The result is a simple BED file with one line for each 1Kb window of the genome.
bedtools makewindows -g genome.txt -w 1000 > genome_1k.bed
$ more genome_1k.bed
MKZW02000001.1 0 1000
MKZW02000001.1 1000 2000
MKZW02000001.1 2000 3000
MKZW02000001.1 3000 4000
6) Use samtools bedcov to count the total number of bases in the BAM file that are located in each of the intervals ('sum of per base read depths per BED region'). This works much faster than any other coverage tool that I have tested.
samtools bedcov genome_1k.bed kv_sorted.bam > kv_1k.cov
7) Divide the sum of coverage by the window size (/1000 in my case), and plot the average coverage per window as a scatter plot, using the end of each interval as the X axis and the coverage as the Y.
Histogram will only work nicely if you have very few intervals. In my case, high and low coverage outlier intervals are easily visible.