Aug 24, 2017

Gene expression analysis shows that an AHR2 mutant fish population from the Hudson River has a dramatically reduced response to dioxin

Together with Ike Wirgin in the NYUMC Dept. of EnvironmentalMedicine, we just published a paper on a gene expression study of a fish population in the Hudson River in Genome Biology and Evolution. "A Dramatic Difference inGlobal Gene Expression between TCDD-Treated Atlantic Tomcod Larvae from theResistant Hudson River and a Nearby Sensitive Population"

Atlantic tomcod (Microgadus tomcod) is a fish that Ike has been studying for many years as an indicator of biological responses to toxic pollution of the Hudson River estuary which contains two of the largest Superfund sites in the nation because of PCB and dioxin (TCDD)contamination. It was previously shown that tomcod from the Hudson River had extraordinarily high prevalence of liver tumors in the 1970's, exceeding 90% in older fish.  But in 2006 they found that the Hudson River population of this fish had many fewer tumors and a 100-fold reduction of induction of the Cytochrome P450 pathway in response to dioxin and PCB exposure (Wirgin and Chambers 2006). In a 2011 Science paper, they reported a two amino acid deletion of the AHR2 gene in the Hudson River population that was absent, or nearly so, in all other tomcod populations.  The aryl hydrocarbon receptor is responsible for induction of CYP genes in all vertebrates and in activation of most toxicities from these contaminants (Wirginet al 2011).

Our goal for this project was to build a de novo sequence of the genome of the tomcod, annotate all the genes, and do a global analysis of gene expression (with RNAseq) to look at the genome-wide effects of the AHR2 mutation in Hudson River larvae as compared to wild type fish (collected from a clean location at Shinnecock Bay, on the South Shore of Long Island, in the Hamptons). All DNA and RNA sequencing was done at the NYUMC Genome Technology Center under the direction of Adriana Heguy

From a bioinformatics point of view, this project was interesting because we decided to integrate both a genome assembly and multiple transcriptome assemblies to get the most complete set of full-length protein coding genes.  For the transcriptome, we did de novo assembly with rnaSPAdes on many different RNAseq samples including embryo, juvenile, and adult liver.  We made the genome assembly with SoapDenovo2, and then predicted gene coding regions with GLIMMER-HMM. We combined all of these different sets of transcript coding sequences with the EvidentialGene pipeline created by Don Gilbert. With the final merged set of transcripts, we used Salmon to (very quickly) quasi-map the reads in each RNAseq sample onto the transcriptome and quantify gene expression.  Differential gene expression was computed with edgeR.

The results were extremely dramatic. At low doses of dioxin, the wild type larvae show a huge gene expression response, with about a thousand genes having large fold-changes (some key genes were validated by qRT-PCR). The mutant Hudson River larvae basically ignore these low doses, with almost no gene expression changes. At the highest does (1 ppb), the Hudson River fish show some gene expression changes, but mostly not in the same genes as in the wild type fish.  Even the negative control larvae (not treated with dioxin) show a large difference in gene expression between the two populations.  

Jul 25, 2017

RNA-seq Power calculation (FAQ)

I spend a lot of time answering questions from researchers working with genomic data. If I put a lot of effort into an answer, I try to keep it in my file of 'Frequently Asked Questions' - even though this stuff does change fairly rapidly. Last week I got the common question: "How do I calculate Power for an RNA-seq experiment?  So here is my FAQ answer. I have summarized from the work of many wise statisticians, with great reliance on the RnaSeqSampleSize R package by Shilin Zhao and the nice people at Vanderbilt who build a Shiny website interface to it.


>> I’m considering including an RNA-Seq experiment in a grant proposal. Do you have any advice on how to calculate power for human specimens? I’m proposing to take FACS sorted lymphocytes from disease patients and two control groups. I believe other people analyze 10-20 individuals per group for similar types of experiments.
>> It would be great if you have language that I can use in the grant proposal to justify the cohort size. Also, we can use that number to calculate the budget for your services. Thanks!
>> Ken

Hi Ken,

Power calculations require that you make some assumptions about the experiment.  Ideally, you have done some sort of pilot experiment first, so you have an estimate of the total number of expressed genes (RPKM>1), fold change, variability between samples within each treatment, and how many genes are going to be differentially expressed.   The variability of your samples is probably the single most important issue - humans tend to vary a lot in gene expression, cultured cell lines not so much. You can reduce variability somewhat by choosing a uniform patient group - age, gender, body mass index, ethnicity, diet, current and previous drug use, etc.

Have a look at this web page for an example of an RNA-seq  power calculator.

I plugged in the following data:   FDR=0.05, ratio of reads between groups=1, total number of relevant genes 10,000 (ie. you will remove about half of all genes due to low overall expression prior to differential expression testing).  Expected number of DE genes=500, fold change for DE genes=2, read count (RPKM) for DE genes =10, dispersion (Standard Dev) 0.5.  With these somewhat reasonable values, you get sample size of 45.   So, to get a smaller sample size, you can play with all of the parameters. 

The estimated Sample Size:
"We are planning a RNA sequencing experiment to identify differential gene expression between two groups. Prior data indicates that the minimum average read counts among the prognostic genes in the control group is 10, the maximum dispersion is 0.5, and the ratio of the geometric mean of normalization factors is 1. Suppose that the total number of genes for testing is 10000 and the top 500 genes are prognostic. If the desired minimum fold change is 2, we will need to study 45 subjects in each group to be able to reject the null hypothesis that the population means of the two groups are equal with probability (power) 0.9 using exact test. The FDR associated with this test of this null hypothesis is 0.05."

To improve power (other than larger samples size or less variability among your patients), you can sequence deeper (which allows a more accurate and presumably less variable measure of expression for each gene), only look at the most highly expressed genes, or only look at genes that have large fold change. Again, it helps to have prior data to estimate these things.

When I do an actual RNA-seq data analysis, we can improve on the 'expected power' by cheating a bit on the estimate of variance (dispersion). We calculate a single variance estimate for ALL genes, then modify this variance for each individual gene (sort of a Bayesian approach). This allows for a lower variance than would happen if you just calculate StdDev for each gene in each treatment.  This rests on an assumption that MOST genes are not differentially expressed in your experiment, and the variance of all genes across all samples is a valid estimate of background genetic variance.

Feb 6, 2017

Oxford MinION 1D and 2D reads

We have been testing out the Oxford MinION DNA sequencing machine to see what it can contribute to larger de novo sequencing projects.  Most of the posted data so far come from small genomes where moderate coverage is more easily obtained.  Recent publications claim improved sequence quality for Oxfort MinION.

We are working on a new de novo sequence for the little skate shark ((Leucoraja erinacea), an interesting system to study developmental biology.  The skate has a BIG genome, estimated at 4 Gb (bigger than human), so this is going to be a difficult project. The existing skate genome in Genbank and the SkateBase website is not in very good shape (3 million contigs with  N50 size of 665 bp).

We got a couple of preliminary Oxford MinION reads from skate DNA - not nearly enough coverage to make a dent in this project, just having a look at the data. Oxford produces two kinds of data (in their own annoying FAST5 format, but I won't rant about that right now), single pass 1D and double pass 2D.  [My Bioinformatics programmer Yuhan Hao did this analysis.] Here is what our data looks like.

So the 1D reads are really long - some more than 50 kb. The 2D reads are mostly 2-10 kb.  The SkateBase has a complete contig of the mitochondrial genome, so we were able to align the Oxford sequences to this as a reference. Coverage was low, but we do have some regions where both 1D and 2D reads match the reference. What we can see is that the 1D reads have a lot of SNPs vs the reference, while the 2D reads have very few SNPs- so it is clear that the 2D reads have been successfully error corrected.  Strangely, both the 1D and 2D reads have a lot of insertion-deletion errors (several per hundred bases) compared to the reference, and in fact they do not match each other - so we consider these to all be novel, uncorrected errors.

We also ran a standard Illumina whole genome shotgun sequencing run for the skate genome, which we aligned to the mitochondrial reference. With this data, we can see a small number of Oxford 2D SNPs shared by hundreds of Illumina reads, others not. None of the indels are supported by our Illumina reads.

Other investigators have had poor quality Oxoford sequences. With more coverage, we may be able to use the Oxoford reads as scaffolds for our de novo assembly. It may be possible to use Illumina reads for error correction, and mark all uncorrected areas of the Oxford sequences as low quality, but that is not the usual method for submitting draft genomes to Genbank.

Jan 17, 2017

GenomeWeb reports on updated Coffee Beetle genome

A nice review of my 2015 Coffee Beetle paper in GenomeWeb today.
My genome had 163 million bases and 19,222 predicted protein-coding genes. I am very pleased to learn that a revised version of the draft genome sequence (from a group in Columbia) contains 160 million bases and 22,000 gene models. They also confirm the 12 horizontally transferred genes that I identified.

Coffee Pest, Plant Genomes Presented at PAG Conference

Researchers from the USDA's Agricultural Research Service, New York University, King Abdullah University of Science and Technology, and elsewhere published information on a 163 million base draft genome for the coffee berry borer in the journal Scientific Reports in 2015.
That genome assembly, produced with Illumina HiSeq 2000 reads, housed hundreds of small RNAs and an estimated 19,222 protein-coding genes, including enzymes, receptors, and transporters expected to contribute to coffee plant predation, pesticide response, and defense against potential pathogens. It also provided evidence of horizontal gene transfer involving not only mannanase, but several other bacterial genes as well.
At the annual Plant and Animal Genomes meeting here this week, National Center for Coffee Research (Cenicafe) scientist Lucio Navarro provided an update on efforts to sequence and interpret the coffee berry borer genome during a session on coffee genomics. For their own recent analyses, Navarro and his colleagues upgraded an earlier version of a coffee berry borer genome that had been generated by Roche 454 FLX sequencing, using Illumina short reads from male and female coffee berry borers to produce a consensus assembly spanning around 160 million bases. The assembly is believed to represent roughly 96 percent of the insect's genome.
In addition to producing a genome with improved contiguity levels, he reported, members of that team also combined 454 and Illumina reads to get consensus transcriptomes for the beetle. With these and other data, they identified almost 22,000 gene models, novel transposable element families, and their own evidence of horizontal gene transfer.

Dec 8, 2016

Finding differences between bacterial strains 100 bases at a time.

This work was conducted mostly by my research assistant Yuhan Hao.

We have recently been using Bowtie2 for sequence comparisons related to shotgun metagenomics, where we directly sequence samples that contain mixtures of DNA from different organisms, with no PCR. Bowtie2 alignments can be made very stringently (>90% identity), and computed very rapidly for large data files with hundreds of millions of sequence reads. This allows us to identify DNA fragments by species and by gene function, provided we have a well-annotated database that contains DNA sequences from similar microbes. I know that MetaPhlan and other tools already do this, but I want to focus on the difference between bacteria (and viruses) at the strain level.

I have been playing around with the idea of constructing a database that will give strain-specific identification for human-associated microbes using Bowtie2 alignments with shotgun metagenomic data. There are plenty of sources for bacterial genome sequences, but the PATRIC database has a very good collection of human pathogen genomes. However the database is very redundant - with dozens or even hundreds of different strains for the same species, and the whole thing is too large to build into a Bowtie database (>350 Gb). So I am looking at ways to eliminate redundancy from the database while retaining high sensitivity at the species level to identify any sequence fragment, not just specific marker genes,  and the ability to make strain-specific alignments. So I want to identify which genome fragments have shared sequences among a group of strains within a single species and which fragments are unique to each strain.

Since our sequence reads are usually ~100 bases, We are looking the similarity between bacterial strains when the genome is chopped into 100 bp pieces.

Bowtie2 can be limited to only perfect alignments (100 bases with zero mismatches) using the parameters  --end-to-end  and  --scoremin C, 0, -1  and to something similar to 99% identity with    --scoremin  ‘L,0, -0.06’ 

Between two strains of  Streptococcus agalactiae, if we limit the Bowtie2 alignments to perfect matches,  half of the fragments align. So at the 100 base level, half of the genome is identical, and the other half has at least one variant. At 99% similarity (one mismatch per read), about 2/3 of the fragments align, and the other 1/3 has more than one variant, or in some cases no-similarity at all to the other genome. Yuhan Hao extended this experiment to another species (Strep pyogenes), where we see almost no alignment to Strep ag.  of genome fragments at 100%, 99%, or ~97% identity (see the Table below).  I have previously used the 97% sequence identity threshold to separate bacterial species, but I thought it was only for 16S rDNA sequences - here it seems to apply to almost every 100 base fragment in the whole genome. 

So we can build a database with one reference genome per species and safely eliminate the 2/3 of fragments that are nearly identical between strains, and retain only those portions of the genome that are unique to each strain.  I'm going to think about how to apply this iteratively (without creating a huge computing task), so we add only the truly unique fragments for EACH strain, rather than just testing if a fragment differs from the Reference for that species. With such a database, stringent Bowtie2 alignments will identify each sequence read by species and by strain and have very low false matches across different species. 

We can visualize the 100 base alignments between two strains at the 100% identity level, 99% identity, and at the least stringent Bowtie2 setting (very fast local) to see where the strains differ. This makes a pretty picture (below), which reveals some fairly obvious biology: There are parts of the genome that are more conserved and parts that are more variable between strains. The top panel shows only the 100% perfect matches (grey bars), the second panel shows that we can add in some 99% matching reads (white bars with a single vertical color stripe to mark the mismatch base), and the lowest panel shows reads that are highly diverged (lots of colored mismatch bases and some reads that are clearly mis-aligned from elsewhere on the genome). So if we choose only the reads that differ by more than 99%, we can build a database that can use Bowtie2 to identify different strains and have few multiple matches or incorrectly matched reads. 

This chart lists the number of UNMATCHED fragments after alignment of all non-overlapping 100 bp fragments from the genomes of each strain to the Strep. agalactiae genome that we arbitrarily chose as the Reference.   C 0 -1 corresponds to perfect matches only, L 0 -0.06 is approximately 99% identity (by my calculation), and L 0, -0.2 is about 97% identity. The lower the stringency of alignment, the fewer fragments end up in the "unmatched" file. 

                       Ref st2 st3   S.pyo st1 S.pyo st2
# 100b kmers          20615      22213      21660      17869      17094
C 0 -1       0              0              11243      10571      17718      16934  (perfect matches)
L 0 -0.06  0              0              6605        5918        17533      16762  (99% ident)
L 0 -0.1    0              0              6534        5839        17533      16762
L 0 -0.12  0              0              4812        4058        17371      16600
L 0 -0.15  0              0              4767        4006        17369      16600
L 0 -0.17  0              0              4756        3997        17369      16600

L 0 -0.2    0              0              4156        3419        17237      16441 (97% ident)