tag:blogger.com,1999:blog-44572164023991275792024-03-13T11:13:15.337-04:00Next-Gen SequencingA working guide to the rapidly developing world of Next-Generation DNA sequencing, with an emphasis on bioinformaticsAnonymoushttp://www.blogger.com/profile/14602560263535951430noreply@blogger.comBlogger72125tag:blogger.com,1999:blog-4457216402399127579.post-52306798834374957452018-10-29T16:10:00.000-04:002018-10-29T16:10:48.139-04:00Some tips to optimize bacterial genome assemblyI just finished a revised genome assembly for a collaborating lab. We do de novo sequencing of bacterial genomes all the time, sometimes in batches of 50 or 100 different isolates barcoded together on a high-yield HiSeq run. We typically aim for coverage in the 500x to 1000x range with paired-end 150 base reads.<br />
<br />
This genome was assembled as part of a "pipeline" script using SPADES, but it did not come out very nicely - over 3000 contigs, most of them small with low coverage. A FastQC report on the raw data looked good (all bases with mean quality >Q30), except that the last 10 bases show a big drop in quality AND quite a lot of the reads contain Illumina adapters, sometimes starting as much as 70 bases in from the 3' end.<br />
<br />
I also checked some of the largest contigs by a simple BLASTn at the NCBI website, and they matched quite well to one of our favorite bacterial species (e-value = 0.0; percent identity = 93%). HOWEVER, many of the smaller contigs match the HUMAN genome. This is a big <span style="color: red;"><b>red flag</b></span> for me, and probably <b><u>should be checked for every bacterial (or any non-human) de-novo genome assembly.</u></b><br />
<br />
So here is my improved de novo assembly protocol:<br />
<br />
1) To remove the human DNA, I went back to the raw FASTQ files, and <b>aligned to the reference human genome with Bowtie2</b>. The simple way to do this is to use the '--un' flag in the Bowtie2 command, which dumps all unaligned reads to a separate output file. There are other more sophisticated ways to do this (using SAMtools) that interrogate if one or both of a set of paired-end reads align concordantly to the genome, but I wanted to be certain to remove any and all human reads. So I aligned each of my paired end data files (R1 and R2) separately and collected the unmatched reads for each. I found <span style="color: magenta;"><b>0.5% human reads </b></span>in my raw data. This may not seem like a lot, but the assembler will make thousands of contigs from this small amount of contaminant DNA. <br /><br />This filter will probably remove a few reads (genome regions) that contain simple sequence repeats longer than 25 bases (for example ATATATAT..., or CCCCCCCCC...) that are found in both human and my bacteria, but we know that assembly of repeats is not reliable anyway with 150 base Illumina reads.<br />
<br />
2) Using the unmatched files from the human screen, I filtered for quality and removed Illumina adapters with <a href="http://www.usadellab.org/cms/?page=trimmomatic" target="_blank">Trimmomatic</a>. This also catches the case where one read matches human and is removed, but its mate is not. These 'unpaired' reads are not included in the 'trimmed' output file. <br />
<br />
3) The trimmed output files are now ready for assembly with <a href="http://spades.bioinf.spbau.ru/release3.5.0/manual.html#sec3.4" target="_blank">SPADES</a>. I used kmer sizes 21, 33, 55, 77, 99 as recommended by the SPADES manual for 150 bp paired Illumina reads. I also included the --careful flag which is recommended for bacterial genomes with high coverage.<br />
<br />
4) I got good results with this assembly - only about 600 contigs, with the largest one containing more than half of the expected genome (N50 > 1 Mb). This FASTA data file of contigs was small enough to use the NCBI web BLAST server for a single search. I was also able to Format the output as a text report showing just the top 10 matches for each contig (with no alignments). <br />
<br />
5) In addition to contig length, SPADES also reports the depth of kmer coverage in the header line for each contig in the multi-sequence FASTA file. I observed that all contigs with coverage greater than 100x matched to the same bacterial genus, but contigs with low coverage matched to all sorts of different things - quite a few to E. coli. I conclude that the low coverage contigs are low abundance contaminants in our sequencing library, and should not be included in the published de novo genome. [I do not trust pre-filtering by Bowtie against E. coli; we might lose some highly conserved genes]<br />
<br />6) Somebody more diligent than myself (perhaps one of my students in need of a final project in the Intro Bioinformatics course) can write a Biopython script to filter sequences in a FASTA file by coverage as reported in the SPADES FASTA header. I did it with some careful work with awk and Excel. My final genome is 2.2 Mb, with just 39 high coverage contigs which all have highly significant best BLAST matches to the same bacterial genus and over 2000 genes predicted by GeneMarkS. <br />
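The coverage filter from steps 5 and 6 can be sketched in plain Python (no Biopython needed). This is only a minimal sketch: it assumes the contig headers look like NODE_1_length_1000_cov_123.45, and the function names and the 100x default cutoff are my own choices for illustration, so check them against your own contigs file.

```python
import re

# Assumption: headers look like NODE_1_length_1000_cov_123.45
# (older SPAdes versions may append more fields after the coverage value).
COV_PATTERN = re.compile(r"_cov_([0-9]+(?:\.[0-9]+)?)")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_coverage(lines, min_cov=100.0):
    """Keep contigs whose header reports kmer coverage of at least min_cov."""
    kept = []
    for header, seq in read_fasta(lines):
        match = COV_PATTERN.search(header)
        # Contigs without a recognisable coverage field are dropped.
        if match and float(match.group(1)) >= min_cov:
            kept.append((header, seq))
    return kept
```

To use it on a real file, open the contigs FASTA, pass the file handle to filter_by_coverage, and write each kept header (with a leading ">") and sequence to a new file.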
<br />
The takehome message:<br />
1) filter out Human reads from raw sequence data that will be used for de-novo assemblies (I don't know how well this will work for mammals, vertebrates, etc). <br />
2) filter out low coverage contigs from your final contig assembly FASTA file. <br />
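Because the human screen in step 1 was run on R1 and R2 separately, the two filtered files can end up listing different reads. If a downstream tool needs both files to contain the same reads in the same order, one possible way to re-synchronise them is sketched below. The function names are hypothetical, the read-name handling (dropping a trailing /1 or /2) is an assumption about how the FASTQ headers are written, and everything is held in memory, so this is only suitable for modest file sizes.

```python
def read_id(header):
    """Return the read name without the leading '@' and any /1 or /2 suffix."""
    name = header[1:].split()[0]
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def parse_fastq(lines):
    """Yield (read name, four-line record) from an iterable of FASTQ lines."""
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            yield read_id(record[0]), record
            record = []


def keep_intact_pairs(r1_lines, r2_lines):
    """Return the R1 and R2 records whose mate survived in the other file."""
    r1 = dict(parse_fastq(r1_lines))
    r2 = dict(parse_fastq(r2_lines))
    # Preserve the R1 order so both outputs stay in step.
    shared = [rid for rid in r1 if rid in r2]
    return [r1[rid] for rid in shared], [r2[rid] for rid in shared]
```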
<br />
<br />
Raw data:<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjX9zqv0E4fEnIJPpuyfXg98f2yBksJg03X-rY6y4AtnK7AT5aU12-qyehEQ-Z2FaMBD-a4EB4p_1ylnTpNPjkrubpM_3uDkGv35PfhlHMDqPcIgZsMYitzE__IOn3b3WAKGHYkxvXQSQT3/s1600/qual.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="600" data-original-width="800" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjX9zqv0E4fEnIJPpuyfXg98f2yBksJg03X-rY6y4AtnK7AT5aU12-qyehEQ-Z2FaMBD-a4EB4p_1ylnTpNPjkrubpM_3uDkGv35PfhlHMDqPcIgZsMYitzE__IOn3b3WAKGHYkxvXQSQT3/s640/qual.png" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiFj1Lttzct_T-fMuBplz8G8juREHCaJ_fIw0ck3qiYI47_oYmb5tvUhkMhWWqHFo2q67HD6zpOBy1uNOVJO8xIIoyUpXVq6ASHN3eeYYVHMlsZw37IKmfyaHavEK1IrlWvtU7XwKwZ3jId/s1600/Adapters.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="600" data-original-width="1110" height="344" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiFj1Lttzct_T-fMuBplz8G8juREHCaJ_fIw0ck3qiYI47_oYmb5tvUhkMhWWqHFo2q67HD6zpOBy1uNOVJO8xIIoyUpXVq6ASHN3eeYYVHMlsZw37IKmfyaHavEK1IrlWvtU7XwKwZ3jId/s640/Adapters.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
Anonymoushttp://www.blogger.com/profile/14602560263535951430noreply@blogger.com24tag:blogger.com,1999:blog-4457216402399127579.post-45522076710531775772018-01-31T16:45:00.001-05:002018-01-31T16:45:21.988-05:00Genome Coverage from BAM fileThere are many excellent tools for analysis of Next Gen Sequencing data in the standard BAM alignment format so I was surprised how difficult it was for me to get a nice graph of genome coverage. This will be trivial for a lot of hard core bioinformatics coders, so just move along if you are bored/annoyed.<br />
<br />
I needed to check the evenness of coverage across intervals of a bacterial genome that we were re-sequencing for various experimental reasons. I aligned my FASTQ to a reference genome from GenBank using Bowtie2. There are several nice tools in the SAMTools and BEDTools kits that produce either a base by base coverage count or a histogram of coverage showing how many bases are covered how deeply. I wanted a map at 1 Kb resolution. It took a while to figure out that I first need to make a BED file of intervals from my genome - with correct names for the 'chromosomes' that match the SAM header, and then use 'samtools bedcov' to get my intervals. Then a simple graph from Excel or R shows the coverage per interval along the genome.<br />
<br />
Here are the steps (as much for me to remember as for usefulness to anyone else)<br />
<br />
1) Create a Bowtie2 index of the reference genome (can be from GenBank, or can be a de novo assembly of contigs created locally from this FASTQ data). Reference_in must be FASTA format.<br />
bt2_base is the name you will call the index.<br />
<br />
bowtie2-build <span style="background-color: white; color: #666666; font-family: "Courier New", Courier;"> &lt;reference_in&gt; &lt;bt2_base&gt;</span><br />
<span style="background-color: white; color: #666666; font-family: "Courier New", Courier;"><br /></span>
<span style="background-color: white; color: #666666;"><span style="font-family: inherit;">2) Align the FASTQ file(s) to the Reference. bt2-idx is the name of the index created in the previous step. There are a ton of options for stringency of alignment, format of input data, etc. I set this to use 16 CPU threads. I usually like to leave out the unaligned reads, which can reduce the output file size somewhat. [Of course, the unaligned reads are the goal when you use Bowtie as a filter to remove human reads from microbiome data or any other sort of contaminant screen.] </span></span><br />
<span style="background-color: white; color: #666666; font-family: "Courier New", Courier;"><br /><a href="http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml" target="_blank">bowtie2 -p 16 </a></span><a href="http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml" target="_blank"><span style="background-color: white; color: #666666; font-family: "Courier New", Courier;">--no-unal </span><span style="background-color: white; color: #666666; font-family: "Courier New", Courier;">-x &lt;bt2-idx&gt; -1 &lt;m1&gt; -2 &lt;m2&gt; -S &lt;output.sam&gt;</span></a><br />
<span style="background-color: white; color: #666666; font-family: "Courier New", Courier;"><br /></span>
<span style="background-color: white; color: #666666; font-family: "Courier New", Courier;">3) It is annoying that Bowtie produces output in SAM format, and 99% of the time, the very first thing you have to do is convert to sorted BAM. Note that with the older samtools syntax shown here, samtools sort puts its own .bam on the end, so if you are not careful you will get files named output.bam.bam. In samtools 1.3 and later the output name is given in full with -o instead (samtools sort -o file_sorted.bam output.sam).</span><br />
<span style="background-color: white; color: #666666; font-family: "Courier New", Courier;"><br /></span>
<span style="background-color: #f8f8f8; color: #0000ee; font-family: Menlo, Monaco, Consolas, "Courier New", monospace; font-size: 13px; orphans: 3; text-decoration-line: underline; white-space: pre-wrap; widows: 3;">samtools view -bS output.sam | samtools sort - file_sorted</span><br />
<span style="background-color: white; color: #666666; font-family: "Courier New", Courier;"><br /></span>
<span style="background-color: white; color: #666666; font-family: "Courier New", Courier;"><br /></span>
4) Create a 'genome' file for your reference genome. This is just a tab delimited file that names the chromosomes (or individual sequences that may be in a multi-FASTA file, such as contig names) and gives the length of each one.<br />
It is super easy to mess this up. The best way is to view the top of your output.sam file created by bowtie2. The lines that start with @SQ are your chromosome headers, and they very helpfully already show the length of each one. This is a bit of a pain if you have a genome with lots of contigs, but a little 'cut' and 'paste' in bash or Excel will get you there.<br />
<br />
Here is what mine looked like:<br />
<br />
@HD VN:1.0 SO:unsorted<br />
@SQ SN:MKZW02000001.1 LN:5520555<br />
@SQ SN:MKZW02000002.1 LN:248293<br />
@PG ID:bowtie2 PN:bowtie2 VN:2.2.7 CL:"/local/apps/bowtie2/2.2.7/bowtie2-align-s --wrapper basic-0 -p 8 -x Kluy<br />
<br />
and here is the genome file I made:<br />
<br />
MKZW02000001.1 5520555<br />
MKZW02000002.1 248293<br />
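For genomes with many contigs, the cut-and-paste in step 4 could be replaced by a few lines of Python. This is a minimal sketch, assuming the SAM header is tab delimited with SN: and LN: fields on each @SQ line (as in the header shown above); the function name is just for illustration.

```python
def genome_file_from_sam_header(sam_lines):
    """Return tab-separated 'name, length' rows built from the @SQ header lines."""
    rows = []
    for line in sam_lines:
        if not line.startswith("@"):
            break  # the header is over and the alignments begin
        if line.startswith("@SQ"):
            # Turn 'SN:name' and 'LN:length' into a small dictionary.
            tags = line.rstrip("\n").split("\t")[1:]
            fields = dict(tag.split(":", 1) for tag in tags)
            rows.append(f"{fields['SN']}\t{fields['LN']}")
    return rows
```

Writing the returned rows to genome.txt, one per line, gives the file used in the next step.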
<div>
<br /></div>
<div>
<br /></div>
<div>
5) Make a set of intervals with<a href="http://quinlanlab.org/tutorials/bedtools/answers.html" target="_blank"> bedtools makewindows</a>. I wanted 1 Kb intervals, so I use -w 1000.</div>
<div>
The result is a simple BED file with one line for each 1Kb window of the genome. </div>
<div>
<br /></div>
<div>
bedtools makewindows -g genome.txt -w 1000 > genome_1k.bed</div>
<div>
<br /></div>
<div>
$ more genome_1k.bed</div>
<div>
<br /></div>
<div>
<div>
MKZW02000001.1 0 1000</div>
<div>
MKZW02000001.1 1000 2000</div>
<div>
MKZW02000001.1 2000 3000</div>
<div>
MKZW02000001.1 3000 4000</div>
</div>
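To make it clear what the windows file contains, here is a rough Python equivalent of the fixed-size windowing in step 5. It is a sketch for illustration only, not a replacement for bedtools, and the function name is hypothetical. Note that the last window of each sequence is cut off at the end of the sequence, so it is usually shorter than the others.

```python
def make_windows(genome_rows, window=1000):
    """Yield (name, start, end) intervals in BED style (0-based, end exclusive)."""
    for name, length in genome_rows:
        for start in range(0, length, window):
            # The final window stops at the end of the sequence.
            yield name, start, min(start + window, length)
```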
<div>
<br /></div>
<div>
6) Use samtools bedcov to count the total number of bases in the BAM file that are located in each of the intervals ('sum of per base read depths per BED region'). This works much faster than any other coverage tool that I have tested. </div>
<div>
<br /></div>
<div>
samtools bedcov genome_1k.bed kv_sorted.bam > kv_1k.cov</div>
<div>
<br /></div>
<div>
7) Divide the sum of coverage by the window size (/1000 in my case; the last window of each sequence is usually shorter, so divide that one by its actual length), and plot the average coverage per window as a scatter plot, using the end of each interval as the X axis and the coverage as the Y. </div>
<div>
A histogram will only work nicely if you have very few intervals. In my case, high and low coverage outlier intervals are easily visible. </div>
<div>
<br /></div>
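The division in step 7 can also be done in Python instead of Excel. This is a minimal sketch, assuming the bedcov output is tab delimited with the three BED columns followed by the summed depth; the function name is hypothetical. It divides by the actual length of each window, so a short final window is handled correctly.

```python
def mean_depth_per_window(bedcov_lines):
    """Return (name, window_end, mean_depth) for each line of bedcov output."""
    points = []
    for line in bedcov_lines:
        name, start, end, total = line.rstrip("\n").split("\t")[:4]
        start, end = int(start), int(end)
        # Sum of per-base depths divided by the number of bases in the window.
        points.append((name, end, int(total) / (end - start)))
    return points
```

The window_end and mean_depth values are the X and Y of the scatter plot.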
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHIX1nmYVfl2Ikf8sFswE4qZXEmk-j0EQL1TDvXAhuPekRBgcSfQg7xnDveTSHnA6rKZtse4qV9cm39p0ftn-Vi5lowIPeJb63bp-UzOL-3wFwk2JiVF84SzBVUZmAzItug8qsc_9iqhs8/s1600/VT15_cov.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="289" data-original-width="481" height="384" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHIX1nmYVfl2Ikf8sFswE4qZXEmk-j0EQL1TDvXAhuPekRBgcSfQg7xnDveTSHnA6rKZtse4qV9cm39p0ftn-Vi5lowIPeJb63bp-UzOL-3wFwk2JiVF84SzBVUZmAzItug8qsc_9iqhs8/s640/VT15_cov.jpg" width="640" /></a></div>
<br />Anonymoushttp://www.blogger.com/profile/14602560263535951430noreply@blogger.com6tag:blogger.com,1999:blog-4457216402399127579.post-23716815318654722932018-01-04T14:42:00.000-05:002018-01-04T14:42:28.338-05:00Genome Annotation Challenges<div class="MsoNormal">
Public databases of genetic information have a fundamental garbage-in, garbage-out
problem. A huge number of useful databases are populated by pulling information
from other databases and adding new value by computational inferences, but
automated linking of databases can propagate incorrect information. The
curators of primary repositories such as GenBank make a substantial effort to
publish only correct information, so they are very conservative about
annotating genes with only verifiable information. NCBI also has a policy that
the original depositor of any given entry (a gene, protein, genome, experimental
dataset, etc.) is the author of its annotation and metadata, and no one else
can alter it.<span style="mso-spacerun: yes;"> </span></div>
<div class="MsoNormal">
<br /></div>
<br />
<div class="MsoNormal">
<i style="mso-bidi-font-style: normal;"><span style="color: windowtext; text-decoration: none; text-underline: none;">Staphylococcus aureus</span></i><span style="color: windowtext; text-decoration: none; text-underline: none;"> is an
important </span>human pathogen with perhaps the largest number of whole
genome sequences in public repositories of any bacteria. NCBI has 8367 Staph genomes
in its “Genomes” section (on Jan 1, 2018), and another ~40,000 in the SRA and
Whole Genome Shotgun sections. However, GenBank has chosen strain <a href="https://www.ncbi.nlm.nih.gov/genome/?term=txid93061" target="_blank">NCTC 8325 as the Reference Genome</a> for Staph, and put its genes in RefSeq. </div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<img src="blob:https://www.blogger.com/2384e4b4-c64b-4d11-bba1-a3a7e6846a53" /></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
This genome was
sequenced, annotated, and submitted on 27-Jan-2006 by Gillaspy et al. from the Oklahoma
Health Sciences Center. As a result of this “Reference Genome” designation, an
automatic lookup of a Staph gene in GenBank is likely to get the annotation
from NCTC 8325. This particular Staph genome has 2,767 protein coding genes (plus
30 pseudogenes, 61 tRNA, and 16 rRNA genes); however, 1496 of these proteins are
annotated with only “hypothetical protein” in their “gene product” or
“description” field. This is very confusing, since many of these proteins are 100%
identical to proteins that have specific and well documented functions in other
Staph strains. </div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Here is one example:</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<b><span style="color: #222222; font-family: Arial;">hypothetical protein SAOUHSC_00010 [Staphylococcus aureus subsp. aureus NCTC 8325]<o:p></o:p></span></b></div>
<div class="MsoNormal">
<span style="color: #444444; font-family: Arial; font-size: 11pt;">NCBI Reference Sequence: YP_498618.1<o:p></o:p></span></div>
<div class="MsoNormal" style="margin: 0.1pt 0in;">
<span style="font-family: Courier; font-size: 8.5pt;">FEATURES Location/Qualifiers<br />source 1..231<br />/organism="Staphylococcus aureus subsp. aureus NCTC 8325"<br />/strain="NCTC 8325"<br />/sub_species="aureus"<br />/db_xref="taxon:<a href="https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=93061"><span style="color: #642a8f;">93061</span></a>"<br /><a href="https://www.ncbi.nlm.nih.gov/protein/YP_498618.1?from=1&to=231&sat=4&sat_key=169497019"><span style="color: #642a8f;">Protein</span></a> 1..231<br />/product="hypothetical protein"</span></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
GenBank knows that this is not the correct annotation for this protein. In the “Region” sub-field of the record (which is very rarely used by automated annotation tools that take data from GenBank) an appropriate function, COG and a CDD conserved domain are noted:</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
</div>
<span class="feature"><a href="https://www.ncbi.nlm.nih.gov/protein/YP_498618.1?from=7&to=231&sat=4&sat_key=169497019"><span style="color: #642a8f;">Region</span></a> 7..231</span><br />
<pre style="margin: 0.1pt 0in;"><span class="feature">/region_name="AzlC" </span></pre>
<pre style="margin: 0.1pt 0in;"><span class="feature">/note="Predicted branched-chain amino acid permease (azaleucine resistance) </span></pre>
<pre style="margin: 0.1pt 0in;"><span class="feature">[Amino acid transport and metabolism]; COG1296" </span></pre>
<pre style="margin: 0.1pt 0in;"><span class="feature">/db_xref="CDD:<a href="https://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=224215"><span style="color: #642a8f;">224215</span></a>"</span></pre>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
NCBI also links this gene to an “Identical Protein Group”
where 3957 proteins are listed with 100% amino acid identity, which are
annotated variously as: “azaleucine resistance protein AzlC”, “branched-chain
amino acid ABC transporter permease”, “AzlC”, and “Inner membrane protein YgaZ”.
A very conservative annotation bot might panic at this level of inconsistency
and default to the lowest common denominator of “hypothetical protein”.
However, a more sophisticated automaton might compare the protein sequence to Pfam
or COG protein functional families and assign a common annotation to them all.</div>
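As a toy illustration of how a less timid annotation bot could behave, the sketch below picks the most common informative product name from an identical protein group and only falls back to “hypothetical protein” when nothing better is available. The function name, the list of uninformative labels, and the example counts are all my own assumptions; a real tool would also want to check a protein family database as described above.

```python
from collections import Counter

# Assumed list of labels that carry no functional information.
UNINFORMATIVE = {
    "hypothetical protein",
    "uncharacterized protein",
    "putative uncharacterized protein",
}


def consensus_product(product_names):
    """Return the most frequent informative product name in the group."""
    informative = [
        name.strip()
        for name in product_names
        if name.strip().lower() not in UNINFORMATIVE
    ]
    if not informative:
        return "hypothetical protein"
    # most_common(1) returns the name with the highest count.
    return Counter(informative).most_common(1)[0][0]
```

With a made-up group containing one “hypothetical protein”, two “azaleucine resistance protein AzlC”, one “AzlC” and one “Inner membrane protein YgaZ”, the function returns “azaleucine resistance protein AzlC”.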
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
The incorrect “hypothetical” annotations for Staph genes in
GenBank can be found downstream in many other databases, such as the Database
of Essential Genes, AureoWiki, KEGG, UniProt, etc., which all import their
primary annotation from GenBank. So someone sequencing a new strain of Staph
and using any of these resources to annotate predicted genes will probably end
up assigning “hypothetical protein” for the AzlC gene and many hundreds of
others, perpetuating the cycle of misinformation.</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<br /></div>
<pre style="margin: 0.1pt 0in;"><span class="feature"><img height="468" src="blob:https://www.blogger.com/273b2ea2-64a0-4a5f-94a1-aae126214696" width="640" /></span></pre>
<pre style="margin: 0.1pt 0in;"><span class="feature">
</span></pre>
<pre style="margin: 0.1pt 0in;"><span class="feature"><img height="292" src="blob:https://www.blogger.com/15e59d20-ba4d-48ca-863f-a043ccc0ecf9" width="640" /></span></pre>
<pre style="margin: 0.1pt 0in;"><span class="feature">
</span></pre>
<pre style="margin: 0.1pt 0in;"><span class="feature"><img height="304" src="blob:https://www.blogger.com/573e37d4-6993-4ab7-bf9a-6c170ccc44bc" width="640" />
<img height="575" src="blob:https://www.blogger.com/63c91ccb-ec89-4dbc-9d50-9edf757671bc" width="640" /></span></pre>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
In a lot of other cases, it does not seem possible for an
algorithm to resolve messy annotations that a human expert might be able to figure out. For example,
Staph strain COL has many hypothetical genes such as SACOL1097. NCBI Identical
Proteins also shows only “hypothetical protein” annotations. However, a BLAST search shows 95%
identity to nitrogen fixation protein NifR.</div>
<div class="MsoNormal">
<span style="mso-spacerun: yes;"><br /></span></div>
<div class="MsoNormal">
<img height="314" src="blob:https://www.blogger.com/b98744b7-fa5a-4a73-bb8f-15bc88eaee46" width="320" /></div>
<div class="MsoNormal">
<span style="mso-spacerun: yes;"><br /></span></div>
<div class="MsoNormal">
hypothetical protein SACOL1097 [Staphylococcus aureus subsp.
aureus COL]</div>
<div class="MsoNormal">
GenBank: AAW37977.1</div>
<div class="MsoNormal">
Identical Proteins FASTA Graphics</div>
<div class="MsoNormal">
LOCUS AAW37977 59 aa linear BCT 31-JAN-2014</div>
<div class="MsoNormal">
DEFINITION hypothetical protein SACOL1097 [Staphylococcus aureus subsp. aureus COL].</div>
<div class="MsoNormal">
ACCESSION AAW37977</div>
<div class="MsoNormal">
VERSION AAW37977.1</div>
<div class="MsoNormal">
DBLINK BioProject: PRJNA238</div>
<div class="MsoNormal">
BioSample: SAMN02603996</div>
<div class="MsoNormal">
DBSOURCE accession CP000046.1</div>
<div class="MsoNormal">
SOURCE Staphylococcus aureus subsp. aureus COL
</div>
<div class="MsoNormal">
<span style="font-family: Cambria; font-size: 12.0pt; mso-ansi-language: EN-US; mso-ascii-theme-font: minor-latin; mso-bidi-font-family: "Times New Roman"; mso-bidi-theme-font: minor-bidi; mso-fareast-font-family: Cambria; mso-fareast-language: EN-US; mso-fareast-theme-font: minor-latin; mso-hansi-theme-font: minor-latin;"><br /></span></div>
<div class="MsoNormal">
<img height="366" src="blob:https://www.blogger.com/0befa5ab-eba0-428f-85a8-638e12a0f4a4" width="640" /></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<img height="195" src="blob:https://www.blogger.com/6e51577f-d2dc-4388-98cf-effb6cdeed55" width="640" /></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<br /></div>
Anonymoushttp://www.blogger.com/profile/14602560263535951430noreply@blogger.com4tag:blogger.com,1999:blog-4457216402399127579.post-475775924955826812017-12-22T13:25:00.000-05:002017-12-22T13:25:38.682-05:00Contaminated GenomesThis is a long post, so first a quick summary. <b>Some genome sequences contain contaminants.</b> These contaminants create many problems when we use a trusted resource like <b><u>GenBank</u></b> or <b><u>UniProt</u></b> to summarize the sequences in a taxonomic group. I have illustrated one typical example, but there are thousands (maybe tens of thousands) of others.<br />
<br />
I have been obsessing over errors and contamination in our public sequence databases. This week I was trying to use UniProt as a set of reference sequences for fungi. Our goal is fairly simple: to find the fungal DNA in a metagenomic shotgun sequencing sample, which is just a mixture of all the DNA present in a scraping from the mouth, throat, or any other body site.<br />
<br />
UniProt makes it quite easy to sort all their proteins by taxonomy, and to download a subset of the data clustered at 100% (combining all exact duplicate sequences), 90%, or 50% amino acid identity. One might expect that fungal genes should not match bacteria at more than 50% identity. But surprisingly there are quite a lot of 50% and 90% clusters that contain both bacterial and fungal sequences (about 3000 of the 90% fungal clusters also contain bacterial proteins). <br />
<br />
The UniProt support staff provided some very useful help to build a query on their system that finds only those clusters of 90% identical proteins that contain fungal genes, but NO (NO!) bacterial genes. In case you like this sort of thing, here is the exact query:<br />
<br />
<span style="font-family: "Calibri","sans-serif"; font-size: 11.0pt; mso-ansi-language: EN-US; mso-ascii-theme-font: minor-latin; mso-bidi-font-family: "Times New Roman"; mso-bidi-language: AR-SA; mso-bidi-theme-font: minor-bidi; mso-fareast-font-family: Calibri; mso-fareast-language: EN-US; mso-fareast-theme-font: minor-latin; mso-hansi-theme-font: minor-latin;"><span style="color: red;">uniprot:(taxonomy:"Fungi [4751]") NOT
taxonomy:"Bacteria [2]" AND identity:0.9</span></span><br />
[Note the careful use of quote marks, parentheses, and square brackets; this stuff is rather tricky]<br />
<span style="font-family: "Calibri","sans-serif"; font-size: 11.0pt; mso-ansi-language: EN-US; mso-ascii-theme-font: minor-latin; mso-bidi-font-family: "Times New Roman"; mso-bidi-language: AR-SA; mso-bidi-theme-font: minor-bidi; mso-fareast-font-family: Calibri; mso-fareast-language: EN-US; mso-fareast-theme-font: minor-latin; mso-hansi-theme-font: minor-latin;"><br /></span>
So I downloaded this set of putative fungal proteins (UniProt very helpfully creates a single 'representative' UniRef sequence in FASTA format for each cluster). I tested the fungal proteins against all the gene coding sequences (CDS) from the E.coli genome using BLASTx. Once again, there are far too many high similarity matches.<br />
<br />
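For anyone who wants to repeat this kind of screen, here is a minimal sketch in Python of how the high-similarity matches could be pulled out of a BLAST result table. It assumes the search was saved in the standard 12-column tabular format (BLAST+ <code>-outfmt 6</code>), and it assumes the E. coli CDS were the query and the fungal proteins were the subject. The function name, the 90% identity and 100-residue length cutoffs, and the example rows are all hypothetical choices for illustration, not values from my actual run.<br />

```python
import csv
import io

# Standard column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def high_identity_hits(handle, min_identity=90.0, min_length=100):
    """Yield hits at or above the identity and alignment-length cutoffs.

    The cutoffs are illustrative assumptions, not recommended values.
    """
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        if (float(hit["pident"]) >= min_identity
                and int(hit["length"]) >= min_length):
            yield hit


# Made-up example rows, for illustration only.
example = io.StringIO(
    "cds_0001\tfungal_prot_A\t98.2\t702\t12\t0\t1\t2106\t1\t702\t0.0\t1400\n"
    "cds_0002\tfungal_prot_B\t41.5\t220\t120\t5\t10\t669\t3\t221\t1e-30\t120\n"
)

# Collect the 'fungal' proteins that look suspiciously bacterial.
suspects = sorted({hit["sseqid"] for hit in high_identity_hits(example)})
print(suspects)
```

With a real result file, the <code>io.StringIO</code> stand-in would be replaced by an opened file handle. Each protein on the resulting list is a candidate for the kind of manual follow-up described next.<br />
<br />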
One of the top matches is to a gene (<span style="background-color: white; color: #222222; font-family: arial, sans-serif;"><span style="font-size: xx-small;">Guanosine-3',5'-bis(diphosphate) 3'-pyrophosphohydrolase)</span></span><span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 14.608px;"> </span>from the fungus <i>Beauveria bassiana</i> that has 98% identity to <i>E. coli</i>. Since I am in an obsessive mood about this sort of thing, I decided that for this one example I would collect some evidence to decide among three explanations: genuinely strong sequence homology between bacteria and fungi for this gene, a horizontal gene transfer into <i>Beauveria bassiana</i>, or the possibility that <b>E. COLI CAN BE A CONTAMINANT IN GENOME SEQUENCES (!!!)</b> [emphasis mine]<br />
<br />
I put this <i>Beauveria </i>gene into a generic NCBI BLAST against all 'nr' proteins, and I got a very interesting result. There are exactly two matches to eukaryotes (<i>Beauveria </i>and a nematode), and 11,858 matches to bacteria, including lots of E.coli.<br />
<br />
So I traced the <i>Beauveria bassiana </i>protein in UniProt back to its source as a whole genome shotgun sequence uploaded to GenBank on Nov 3, 2014 by the <b style="background-color: white; font-family: arial, helvetica, clean, sans-serif; font-size: 13px;">Institute of Plant Protection, Jilin Academy of Agricultural Sciences, Accession </b><span style="background-color: white; font-family: arial, helvetica, clean, sans-serif; font-size: 13px;">PRJNA178080, WGS </span><a class="RegularLink" href="https://www.ncbi.nlm.nih.gov/nuccore/ANFO00000000" style="background-color: #f4f4f4; color: #642a8f; font-family: arial, helvetica, clean, sans-serif; font-size: 11.7px; text-decoration-line: none;" title="GenBank WGS master accession">ANFO00000000</a>, Assembly <a class="RegularLink" href="https://www.ncbi.nlm.nih.gov/assembly/GCA_000770705.1" style="background-color: #f4f4f4; color: #642a8f; font-family: arial, helvetica, clean, sans-serif; font-size: 11.7px; text-decoration-line: none;" title="Genome assembly info">GCA_000770705.1</a>. <br />
<br />
I downloaded the whole genome assembly and BLASTed it with the E.coli hydrolase gene from above. This very quickly pinpointed contig 00271 (<span style="background-color: white; font-family: monospace, serif; font-size: 13px; white-space: pre-wrap;">ANFO01000251.1 Beauveria bassiana D1-5 contig00271</span>), which contains the matching sequence. The contig is 72,232 bases long. I then put this contig into NCBI BLAST against Bacteria. I got matches that correspond to lots of bacterial genes (POL I, RecG, iPGM, XanP, CpxA, GTP binding protein, GSI beta, and my ppGpp hydrolase), all with >90% identity and BLAST e-value 0.0. <br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHtWb2rGEwG1imIGluKE9_KrVi3LiWyGbsFEiy7pRIn6E4GnHEa91zo9TWoY69JKSUn5RWQTQGP7wIW_5stVD5Dx19hI3Xhl79FD1ZGRGrND52_j4C9Z4TCH8FBXz_Xpw2ynCNG0AojOFE/s1600/Contam_blast.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="259" data-original-width="713" height="232" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHtWb2rGEwG1imIGluKE9_KrVi3LiWyGbsFEiy7pRIn6E4GnHEa91zo9TWoY69JKSUn5RWQTQGP7wIW_5stVD5Dx19hI3Xhl79FD1ZGRGrND52_j4C9Z4TCH8FBXz_Xpw2ynCNG0AojOFE/s640/Contam_blast.JPG" width="640" /></a></div>
<br />
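Pulling one contig out of a downloaded assembly is easy to script. The sketch below is one way it might be done in plain Python, assuming the assembly is a multi-FASTA file in which each header line starts with the accession. The helper names and the tiny toy records are hypothetical; the real contig is 72,232 bases, not the handful shown here.<br />

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def find_contig(handle, accession):
    """Return (header, sequence) for the first record whose ID matches, else None."""
    for header, seq in read_fasta(handle):
        if header.split()[0] == accession:
            return header, seq
    return None


# Toy stand-in for the assembly file; the sequences are made up.
toy = io.StringIO(
    ">other_contig toy record\nACGTACGT\nACGT\n"
    ">ANFO01000251.1 toy record\nGGGCCC\nTTTAAA\nGG\n"
)

header, seq = find_contig(toy, "ANFO01000251.1")
print(header, len(seq))
```

The extracted sequence can then be saved to its own FASTA file and pasted into a BLAST search restricted to Bacteria, as I did by hand.<br />
<br />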
<br />
<br />
Final answer: This is a contaminant. There was some <i>E.coli </i>DNA sequenced and assembled with the <i>Beauveria </i>DNA, and nobody checked before loading these sequences into GenBank.<br />
<br />
My recommendation to GenBank and de novo genome sequencers everywhere is to check all predicted proteins from new genomes for matches to bacteria and human before loading them into a trusted database.<br />
<br />
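As a rough illustration of what such a check could look like, here is a hedged Python sketch that flags contigs where most predicted proteins have a high-identity best hit to bacteria or human. It assumes you already have a table of best hits with a taxonomic group attached to each one (how that table is produced is left open). The function name, the group labels, the 90% identity cutoff, and the 50% fraction cutoff are all hypothetical, and the toy rows are invented.<br />

```python
from collections import defaultdict


def flag_contigs(best_hits, min_identity=90.0, min_fraction=0.5,
                 contaminant_groups=("Bacteria", "Homo sapiens")):
    """Return {contig: fraction of proteins with a suspicious best hit}.

    best_hits is an iterable of (contig, protein, hit_group, pident) tuples.
    Only contigs at or above min_fraction are returned. All cutoffs are
    illustrative assumptions.
    """
    total = defaultdict(int)
    suspicious = defaultdict(int)
    for contig, _protein, group, pident in best_hits:
        total[contig] += 1
        if group in contaminant_groups and pident >= min_identity:
            suspicious[contig] += 1
    flagged = {}
    for contig, n in total.items():
        fraction = suspicious[contig] / n
        if fraction >= min_fraction:
            flagged[contig] = fraction
    return flagged


# Invented best-hit rows, for illustration only.
toy_hits = [
    ("contig_A", "p1", "Bacteria", 98.0),
    ("contig_A", "p2", "Bacteria", 95.5),
    ("contig_A", "p3", "Fungi", 88.0),
    ("contig_B", "p4", "Fungi", 99.0),
    ("contig_B", "p5", "Bacteria", 45.0),
]

flagged = flag_contigs(toy_hits)
print(flagged)
```

Working per contig rather than per protein is a design choice: a single bacterial-looking gene could be a real horizontal transfer, but a whole contig full of them, like the one above, points to contamination.<br />
<br />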
<br />
<br />
<br />Anonymoushttp://www.blogger.com/profile/14602560263535951430noreply@blogger.com4tag:blogger.com,1999:blog-4457216402399127579.post-60021177931434719772017-08-24T12:00:00.001-04:002017-08-24T12:01:49.804-04:00Gene expression analysis shows that an AHR2 mutant fish population from the Hudson River has a dramatically reduced response to dioxin<br />
<br />
<div class="MsoNormal">
Together with <a href="https://med.nyu.edu/faculty/isaac-i-wirgin" target="_blank">Ike Wirgin</a> in the NYUMC <a href="https://med.nyu.edu/environmentalmedicine/" target="_blank">Dept. of Environmental Medicine</a>, we just published a gene expression study of a fish population in the Hudson River. The paper appears in <span style="color: black; font-family: "segoe ui" , "sans-serif";">Genome Biology and Evolution: "<a href="https://academic.oup.com/gbe/article/doi/10.1093/gbe/evx159/4091608/A-Dramatic-Difference-in-Global-Gene-Expression?guestAccessKey=4a08b41a-4f00-459f-8d00-fbe2a10a712b" target="_blank">A Dramatic Difference in Global Gene Expression between TCDD-Treated Atlantic Tomcod Larvae from the Resistant Hudson River and a Nearby Sensitive Population</a>"</span></div>
<br />
<br />
<!--[if gte mso 9]><xml>
<w:LatentStyles DefLockedState="false" DefUnhideWhenUsed="true"
DefSemiHidden="true" DefQFormat="false" DefPriority="99"
LatentStyleCount="267">
<w:LsdException Locked="false" Priority="66" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 2 Accent 1"/>
<w:LsdException Locked="false" Priority="67" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 1 Accent 1"/>
<w:LsdException Locked="false" Priority="68" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 2 Accent 1"/>
<w:LsdException Locked="false" Priority="69" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 3 Accent 1"/>
<w:LsdException Locked="false" Priority="70" SemiHidden="false"
UnhideWhenUsed="false" Name="Dark List Accent 1"/>
<w:LsdException Locked="false" Priority="71" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Shading Accent 1"/>
<w:LsdException Locked="false" Priority="72" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful List Accent 1"/>
<w:LsdException Locked="false" Priority="73" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Grid Accent 1"/>
<w:LsdException Locked="false" Priority="60" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Shading Accent 2"/>
<w:LsdException Locked="false" Priority="61" SemiHidden="false"
UnhideWhenUsed="false" Name="Light List Accent 2"/>
<w:LsdException Locked="false" Priority="62" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Grid Accent 2"/>
<w:LsdException Locked="false" Priority="63" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 1 Accent 2"/>
<w:LsdException Locked="false" Priority="64" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 2 Accent 2"/>
<w:LsdException Locked="false" Priority="65" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 1 Accent 2"/>
<w:LsdException Locked="false" Priority="66" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 2 Accent 2"/>
<w:LsdException Locked="false" Priority="67" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 1 Accent 2"/>
<w:LsdException Locked="false" Priority="68" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 2 Accent 2"/>
<w:LsdException Locked="false" Priority="69" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 3 Accent 2"/>
<w:LsdException Locked="false" Priority="70" SemiHidden="false"
UnhideWhenUsed="false" Name="Dark List Accent 2"/>
<w:LsdException Locked="false" Priority="71" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Shading Accent 2"/>
<w:LsdException Locked="false" Priority="72" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful List Accent 2"/>
<w:LsdException Locked="false" Priority="73" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Grid Accent 2"/>
<w:LsdException Locked="false" Priority="60" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Shading Accent 3"/>
<w:LsdException Locked="false" Priority="61" SemiHidden="false"
UnhideWhenUsed="false" Name="Light List Accent 3"/>
<w:LsdException Locked="false" Priority="62" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Grid Accent 3"/>
<w:LsdException Locked="false" Priority="63" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 1 Accent 3"/>
<w:LsdException Locked="false" Priority="64" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 2 Accent 3"/>
<w:LsdException Locked="false" Priority="65" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 1 Accent 3"/>
<w:LsdException Locked="false" Priority="66" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 2 Accent 3"/>
<w:LsdException Locked="false" Priority="67" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 1 Accent 3"/>
<w:LsdException Locked="false" Priority="68" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 2 Accent 3"/>
<w:LsdException Locked="false" Priority="69" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 3 Accent 3"/>
<w:LsdException Locked="false" Priority="70" SemiHidden="false"
UnhideWhenUsed="false" Name="Dark List Accent 3"/>
<w:LsdException Locked="false" Priority="71" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Shading Accent 3"/>
<w:LsdException Locked="false" Priority="72" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful List Accent 3"/>
<w:LsdException Locked="false" Priority="73" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Grid Accent 3"/>
<w:LsdException Locked="false" Priority="60" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Shading Accent 4"/>
<w:LsdException Locked="false" Priority="61" SemiHidden="false"
UnhideWhenUsed="false" Name="Light List Accent 4"/>
<w:LsdException Locked="false" Priority="62" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Grid Accent 4"/>
<w:LsdException Locked="false" Priority="63" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 1 Accent 4"/>
<w:LsdException Locked="false" Priority="64" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 2 Accent 4"/>
<w:LsdException Locked="false" Priority="65" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 1 Accent 4"/>
<w:LsdException Locked="false" Priority="66" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 2 Accent 4"/>
<w:LsdException Locked="false" Priority="67" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 1 Accent 4"/>
<w:LsdException Locked="false" Priority="68" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 2 Accent 4"/>
<w:LsdException Locked="false" Priority="69" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 3 Accent 4"/>
<w:LsdException Locked="false" Priority="70" SemiHidden="false"
UnhideWhenUsed="false" Name="Dark List Accent 4"/>
<w:LsdException Locked="false" Priority="71" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Shading Accent 4"/>
<w:LsdException Locked="false" Priority="72" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful List Accent 4"/>
<w:LsdException Locked="false" Priority="73" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Grid Accent 4"/>
<w:LsdException Locked="false" Priority="60" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Shading Accent 5"/>
<w:LsdException Locked="false" Priority="61" SemiHidden="false"
UnhideWhenUsed="false" Name="Light List Accent 5"/>
<w:LsdException Locked="false" Priority="62" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Grid Accent 5"/>
<w:LsdException Locked="false" Priority="63" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 1 Accent 5"/>
<w:LsdException Locked="false" Priority="64" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 2 Accent 5"/>
<w:LsdException Locked="false" Priority="65" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 1 Accent 5"/>
<w:LsdException Locked="false" Priority="66" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 2 Accent 5"/>
<w:LsdException Locked="false" Priority="67" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 1 Accent 5"/>
<w:LsdException Locked="false" Priority="68" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 2 Accent 5"/>
<w:LsdException Locked="false" Priority="69" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 3 Accent 5"/>
<w:LsdException Locked="false" Priority="70" SemiHidden="false"
UnhideWhenUsed="false" Name="Dark List Accent 5"/>
<w:LsdException Locked="false" Priority="71" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Shading Accent 5"/>
<w:LsdException Locked="false" Priority="72" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful List Accent 5"/>
<w:LsdException Locked="false" Priority="73" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Grid Accent 5"/>
<w:LsdException Locked="false" Priority="60" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Shading Accent 6"/>
<w:LsdException Locked="false" Priority="61" SemiHidden="false"
UnhideWhenUsed="false" Name="Light List Accent 6"/>
<w:LsdException Locked="false" Priority="62" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Grid Accent 6"/>
<w:LsdException Locked="false" Priority="63" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 1 Accent 6"/>
<w:LsdException Locked="false" Priority="64" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 2 Accent 6"/>
<w:LsdException Locked="false" Priority="65" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 1 Accent 6"/>
<w:LsdException Locked="false" Priority="66" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 2 Accent 6"/>
<w:LsdException Locked="false" Priority="67" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 1 Accent 6"/>
<w:LsdException Locked="false" Priority="68" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 2 Accent 6"/>
<w:LsdException Locked="false" Priority="69" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 3 Accent 6"/>
<w:LsdException Locked="false" Priority="70" SemiHidden="false"
UnhideWhenUsed="false" Name="Dark List Accent 6"/>
<w:LsdException Locked="false" Priority="71" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Shading Accent 6"/>
<w:LsdException Locked="false" Priority="72" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful List Accent 6"/>
<w:LsdException Locked="false" Priority="73" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Grid Accent 6"/>
<w:LsdException Locked="false" Priority="19" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Subtle Emphasis"/>
<w:LsdException Locked="false" Priority="21" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Intense Emphasis"/>
<w:LsdException Locked="false" Priority="31" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Subtle Reference"/>
<w:LsdException Locked="false" Priority="32" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Intense Reference"/>
<w:LsdException Locked="false" Priority="33" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Book Title"/>
<w:LsdException Locked="false" Priority="37" Name="Bibliography"/>
<w:LsdException Locked="false" Priority="39" QFormat="true" Name="TOC Heading"/>
</w:LatentStyles>
</xml><![endif]--><!--[if gte mso 10]>
<![endif]--><span style="font-family: "calibri" , "sans-serif"; font-size: 11.0pt; line-height: 115%;">Atlantic
tomcod (<i style="mso-bidi-font-style: normal;">Microgadus tomcod</i>) is a fish
that Ike has been studying for many years as an indicator of biological
responses to toxic pollution of the Hudson River estuary, which contains two of
the largest <a href="https://www.riverkeeper.org/campaigns/stop-polluters/pcbs/" target="_blank">Superfund sites in the nation because of PCB and dioxin (TCDD) contamination</a>. It was previously shown that tomcod from the Hudson River had
an extraordinarily high prevalence of liver tumors in the 1970s, exceeding 90% in
older fish. But in 2006 they found that
the Hudson River population of this fish had far fewer tumors and a 100-fold
reduction in induction of the cytochrome P450 pathway in response to dioxin and
PCB exposure </span><span style="font-family: "calibri" , "sans-serif"; font-size: 12.0pt; line-height: 115%;">(<a href="http://news.nationalgeographic.com/news/2011/02/110217-hudson-river-pcb-fish-evolution-water/" target="_blank">Wirgin and Chambers 2006</a>)</span><span style="font-family: "calibri" , "sans-serif"; font-size: 11.0pt; line-height: 115%;">. In a 2011 <a href="https://www.ncbi.nlm.nih.gov/pubmed/21330491" target="_blank">Science paper</a>, they reported a two-amino-acid deletion in the AHR2 gene of the Hudson
River population that was absent, or nearly so, in all other tomcod populations. The aryl hydrocarbon receptor (AHR) is responsible
for induction of CYP genes in all vertebrates and for activation of most
toxicities from these contaminants </span><span style="font-family: "calibri" , "sans-serif"; font-size: 12.0pt; line-height: 115%;">(<a href="https://www.ncbi.nlm.nih.gov/pubmed/21330491" target="_blank">Wirgin et al. 2011</a>)</span><span style="font-family: "calibri" , "sans-serif"; font-size: 11.0pt; line-height: 115%;">.</span><br />
<br />
<br />
<div class="MsoNormal">
Our goal for this project was to build a <i style="mso-bidi-font-style: normal;">de novo</i> sequence of the genome of the
tomcod, annotate all the genes, and do a global analysis of gene expression
(with RNAseq) to look at the genome-wide effects of the AHR2 mutation in Hudson
River larvae as compared to wild type fish (collected from a clean location at
Shinnecock Bay, on the South Shore of Long Island, in the Hamptons). All DNA
and RNA sequencing was done at the <a href="https://med.nyu.edu/research/scientific-cores-shared-resources/genome-technology-center" target="_blank">NYUMC Genome Technology Center</a> under the
direction of <a href="https://med.nyu.edu/faculty/adriana-heguy" target="_blank">Adriana Heguy</a>. </div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
From a bioinformatics point of view, this project was interesting
because we decided to integrate both a genome assembly and multiple
transcriptome assemblies to get the most complete set of full-length protein
coding genes.<span style="mso-spacerun: yes;"> </span>For the transcriptome, we
did <i style="mso-bidi-font-style: normal;">de novo</i> assembly with <a href="http://cab.spbu.ru/software/rnaspades/" target="_blank">rnaSPAdes</a>
on many different RNAseq samples including embryo, juvenile, and adult
liver.<span style="mso-spacerun: yes;"> </span>We made the genome assembly with
<a href="https://github.com/aquaskyline/SOAPdenovo2" target="_blank">SOAPdenovo2</a>, and then predicted gene coding regions with <a href="https://ccb.jhu.edu/software/glimmerhmm/" target="_blank">GlimmerHMM</a>. We
combined all of these different sets of transcript coding sequences with the
<a href="http://arthropods.eugenes.org/genes2/about/EvidentialGene_trassembly_pipe.html" target="_blank">EvidentialGene</a> pipeline created by Don Gilbert. With the final merged set of
transcripts, we used Salmon to (very quickly) quasi-map the reads in each
RNAseq sample onto the transcriptome and quantify gene expression.<span style="mso-spacerun: yes;"> </span>Differential gene expression was computed
with <a href="https://bioconductor.org/packages/release/bioc/html/edgeR.html" target="_blank">edgeR</a>. </div>
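To make the quantification step concrete, here is a rough sketch of how the per-sample Salmon calls could be scripted. It only builds the command lines and prints them. The sample names, the FASTQ paths and the index directory name (tomcod_index) are made up for illustration and are not the ones we actually used; the index is assumed to have been built beforehand with salmon index from the merged transcript set.

```python
from pathlib import Path


def salmon_quant_command(index_dir, reads_1, reads_2, out_dir, threads=8):
    """Build (but do not run) a `salmon quant` command for one paired-end sample."""
    return [
        "salmon", "quant",
        "-i", str(index_dir),   # index built beforehand with `salmon index`
        "-l", "A",              # let Salmon infer the library type
        "-1", str(reads_1),     # forward reads
        "-2", str(reads_2),     # reverse reads
        "-p", str(threads),     # number of threads
        "-o", str(out_dir),     # one output directory per sample
    ]


# Hypothetical sample names; the real ones are not given in this post.
samples = ["embryo_rep1", "adult_liver_rep1"]
commands = [
    salmon_quant_command(
        "tomcod_index",
        Path("fastq") / f"{name}_R1.fastq.gz",
        Path("fastq") / f"{name}_R2.fastq.gz",
        Path("quant") / name,
    )
    for name in samples
]

for cmd in commands:
    print(" ".join(cmd))
```

Each command could then be handed to subprocess.run, and the resulting per-sample quantification tables loaded into R for edgeR.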
<br />
<span style="font-family: "calibri" , "sans-serif"; font-size: 11.0pt; line-height: 115%;"><span style="font-family: "calibri" , "sans-serif"; font-size: 11.0pt; line-height: 115%;">The results were extremely dramatic. At low
doses of dioxin, the wild type larvae show a huge gene expression response,
with about a thousand genes having large fold-changes (some key genes were
validated by qRT-PCR). The mutant Hudson River larvae basically ignore these
low doses, with almost no gene expression changes. At the highest dose (1 ppb),
the Hudson River fish show some gene expression changes, but mostly not in the
same genes as in the wild type fish. <span style="mso-spacerun: yes;"> </span>Even the negative control larvae (not treated
with dioxin) show a large difference in gene expression between the two
populations. </span> </span><br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipupbk-qzAo_8WyZ2ht-tmjeHqVDVW01R__30TN8aikPsyAx30jI_Y5yAKvykjk2pEsF58WrEOIH3rorpHBp1NVemqXBxmoyqUxP23tawNZKfyDw1TPwyqFarR4GSxP7axWyrMSsEEKrxa/s1600/blog.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1600" data-original-width="1200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipupbk-qzAo_8WyZ2ht-tmjeHqVDVW01R__30TN8aikPsyAx30jI_Y5yAKvykjk2pEsF58WrEOIH3rorpHBp1NVemqXBxmoyqUxP23tawNZKfyDw1TPwyqFarR4GSxP7axWyrMSsEEKrxa/s1600/blog.png" /></a></div>
Anonymoushttp://www.blogger.com/profile/14602560263535951430noreply@blogger.com11tag:blogger.com,1999:blog-4457216402399127579.post-50063193841094643442017-07-25T11:48:00.000-04:002017-07-25T11:48:09.067-04:00RNA-seq Power calculation (FAQ)I spend a lot of time answering questions from researchers working with genomic data. If I put a lot of effort into an answer, I try to keep it in my file of 'Frequently Asked Questions' - even though this stuff does change fairly rapidly. Last week I got the common question: "How do I calculate Power for an RNA-seq experiment?" So here is my FAQ answer. I have summarized from the work of many wise statisticians, with great reliance on the <a href="http://master.bioconductor.org/packages/devel/bioc/html/RnaSeqSampleSize.html" target="_blank">RnaSeqSampleSize</a> R package by Shilin Zhao and the nice people at Vanderbilt who built a <a href="https://cqs.mc.vanderbilt.edu/shiny/RNAseqPS/" target="_blank">Shiny website</a> interface to it. <br />
<br />
----------<br />
<br />
<div class="MsoPlainText">
>> I’m considering including an RNA-Seq experiment
in a grant proposal. Do you have any advice on how to calculate power for human
specimens? I’m proposing to take FACS sorted lymphocytes from disease patients and two control groups. I believe other people
analyze 10-20 individuals per group for similar types of experiments.</div>
<div class="MsoPlainText">
>> </div>
<div class="MsoPlainText">
>> It would be great if you have language that I
can use in the grant proposal to justify the cohort size. Also, we can use that
number to calculate the budget for your services. Thanks!</div>
<div class="MsoPlainText">
>> </div>
<div class="MsoPlainText">
>> Ken</div>
<br />
<br />
<br />
<br />
<div class="MsoPlainText">
Hi Ken,</div>
<div class="MsoPlainText">
<br /></div>
<div class="MsoPlainText">
Power calculations require that you make some assumptions
about the experiment.<span style="mso-spacerun: yes;"> </span>Ideally, you have
done some sort of pilot experiment first, so you have an estimate of the total
number of expressed genes (RPKM>1), fold change, variability between samples
within each treatment, and how many genes are going to be differentially
expressed.<span style="mso-spacerun: yes;"> </span>The variability of your
samples is probably the single most important issue - humans tend to vary a lot
in gene expression, cultured cell lines not so much. You can reduce variability
somewhat by choosing a uniform patient group - age, gender, body mass index,
ethnicity, diet, current and previous drug use, etc. </div>
<div class="MsoPlainText">
<br /></div>
<div class="MsoPlainText">
Have a look at this web page for an example of an
RNA-seq<span style="mso-spacerun: yes;"> </span>power calculator. </div>
<div class="MsoPlainText">
<a href="https://cqs.mc.vanderbilt.edu/shiny/RNAseqPS/">https://cqs.mc.vanderbilt.edu/shiny/RNAseqPS/</a></div>
<div class="MsoPlainText">
<br /></div>
<div class="MsoPlainText">
I plugged in the following values:<span style="mso-spacerun: yes;"> </span>FDR=0.05, ratio of reads between groups=1,
total number of relevant genes=10,000 (i.e., you will remove about half of all
genes due to low overall expression prior to differential expression
testing),<span style="mso-spacerun: yes;"> </span>expected number of DE
genes=500, fold change for DE genes=2, minimum average read count for DE genes=10, and
maximum dispersion=0.5.<span style="mso-spacerun: yes;"> </span>With
these somewhat reasonable values, you get a sample size of 45 per group.<span style="mso-spacerun: yes;"> </span>To get a smaller sample size, you can
adjust any of these parameters.<span style="mso-spacerun: yes;"> </span></div>
<div class="MsoPlainText">
<br /></div>
<div class="MsoPlainText">
The estimated Sample Size:</div>
<div class="MsoPlainText">
45</div>
<div class="MsoPlainText">
Description:</div>
<div class="MsoPlainText">
"We are planning a RNA sequencing experiment to
identify differential gene expression between two groups. Prior data indicates
that the minimum average read counts among the prognostic genes in the control
group is 10, the maximum dispersion is 0.5, and the ratio of the geometric mean
of normalization factors is 1. Suppose that the total number of genes for
testing is 10000 and the top 500 genes are prognostic. If the desired minimum
fold change is 2, we will need to study 45 subjects in each group to be able to
reject the null hypothesis that the population means of the two groups are
equal with probability (power) 0.9 using exact test. The FDR associated with
this test of this null hypothesis is 0.05."</div>
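To get a feel for how these parameters trade off, here is a back-of-the-envelope sketch in Python. It is NOT the exact-test method used by RnaSeqSampleSize. It uses a simpler normal approximation for a single gene under a negative binomial model, n = 2 (z_alpha + z_power)^2 (1/mean + dispersion) / ln(fold change)^2, and I am assuming the calculator's "dispersion" can be plugged in directly as the negative binomial dispersion. The function name and defaults are mine.

```python
import math
from statistics import NormalDist


def approx_samples_per_group(mean_count, dispersion, fold_change,
                             alpha=0.05, power=0.9):
    """Normal-approximation sample size per group for ONE gene under a
    negative binomial model, using a two-sided test at level alpha."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    # Poisson counting noise (1/mean) plus biological variability (dispersion).
    variance_term = 1.0 / mean_count + dispersion
    n = 2 * (z_alpha + z_power) ** 2 * variance_term / math.log(fold_change) ** 2
    return math.ceil(n)


# Same inputs as above, but with no multiple-testing adjustment.
n_unadjusted = approx_samples_per_group(10, 0.5, 2)
# A stricter per-gene alpha, as a stand-in for controlling FDR over many genes.
n_strict = approx_samples_per_group(10, 0.5, 2, alpha=0.001)
print(n_unadjusted, n_strict)
```

With a per-gene alpha of 0.05 this gives a noticeably smaller number than the calculator's 45, because it ignores the fact that 10,000 genes are being tested at once. Tightening alpha pushes the sample size up, and raising the mean count, lowering the dispersion, or raising the fold change pulls it down.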
<div class="MsoPlainText">
<br /></div>
<div class="MsoPlainText">
To improve power (other than a larger sample size or less
variability among your patients), you can sequence deeper (which allows a more
accurate and presumably less variable measure of expression for each gene),
only look at the most highly expressed genes, or only look at genes that have
large fold change. Again, it helps to have prior data to estimate these things.
</div>
<div class="MsoPlainText">
<br /></div>
<div class="MsoPlainText">
When I do an actual RNA-seq data analysis, we can improve
on the 'expected power' by cheating a bit on the estimate of variance
(dispersion). We calculate a single variance estimate for ALL genes, then
modify this variance for each individual gene (sort of a Bayesian approach).
This allows for a lower variance than would happen if you just calculate StdDev
for each gene in each treatment.<span style="mso-spacerun: yes;"> </span>This
rests on an assumption that MOST genes are not differentially expressed in your
experiment, and the variance of all genes across all samples is a valid
estimate of background genetic variance. </div>
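Here is a toy sketch of that "borrowing" idea, in Python. It is a simplification and not what edgeR actually computes: each gene's own dispersion estimate is averaged with the all-gene value, weighted by degrees of freedom. The weight prior_df and the example numbers are made up for illustration.

```python
from statistics import mean


def shrink_dispersions(gene_dispersions, gene_df, prior_df):
    """Pull each per-gene dispersion toward the all-gene average.

    prior_df controls how strongly the common value is trusted;
    gene_df is the residual degrees of freedom behind each gene's own estimate.
    """
    common = mean(gene_dispersions)
    return [
        (prior_df * common + gene_df * d) / (prior_df + gene_df)
        for d in gene_dispersions
    ]


# Made-up per-gene dispersions for illustration only.
raw = [0.05, 0.10, 0.90, 0.15]
shrunk = shrink_dispersions(raw, gene_df=4, prior_df=10)
for before, after in zip(raw, shrunk):
    print(f"{before:.2f} -> {after:.3f}")
```

The noisy outlier gene (0.90) gets pulled most strongly toward the common value, which is the effect that makes the test more stable when there are only a few samples per group.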
<div class="MsoPlainText">
<br /></div>
<div class="MsoPlainText">
<br /></div>
Anonymoushttp://www.blogger.com/profile/14602560263535951430noreply@blogger.com3tag:blogger.com,1999:blog-4457216402399127579.post-30282405467573072712017-02-06T12:15:00.000-05:002017-02-06T12:15:59.024-05:00Oxford MinION 1D and 2D readsWe have been testing out the <a href="https://nanoporetech.com/products/minion" target="_blank">Oxford MinION</a> DNA sequencing machine to see what it can contribute to larger <i>de novo</i> sequencing projects. Most of the posted data so far come from small genomes where moderate coverage is more easily obtained. <a href="https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1103-0" target="_blank">Recent publications</a> claim improved sequence quality for Oxford MinION.<br />
<br />
We are working on a new <i>de novo</i> sequence for the little skate shark <a href="http://eol.org/pages/217228/overview" style="font-family: Arial, Helvetica, Verdana, "Bitstream Vera Sans", sans-serif; font-size: 12px; margin: 0px; padding: 0px; text-align: justify; text-decoration: none;">(<i>Leucoraja erinacea</i>)</a>, an interesting system to study developmental biology. The skate has a BIG genome, estimated at 4 Gb (bigger than human), so this is going to be a difficult project. The existing skate genome in <a href="https://www.ncbi.nlm.nih.gov/assembly/GCA_000238235.1" target="_blank">GenBank</a> and the <a href="http://skatebase.org/" target="_blank">SkateBase</a> website is not in very good shape (3 million contigs with N50 size of 665 bp).<br />
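For anyone not familiar with N50: sort the contigs from longest to shortest and add up their lengths until you reach half of the total assembly size. The length of the contig that gets you there is the N50. A minimal Python sketch, with a made-up toy assembly:

```python
def n50(contig_lengths):
    """Return the N50: the length L such that contigs of length >= L
    contain at least half of the total assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contigs given")


# Toy assembly: 1500 bases in total, so the halfway point is 750 bases.
toy_assembly = [100, 200, 300, 400, 500]
print(n50(toy_assembly))  # 500 + 400 = 900 >= 750, so this prints 400
```

An N50 of 665 bp therefore means that half of the assembled bases sit in contigs of 665 bp or shorter, which is far too fragmented to recover most full-length genes.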
<br />
We got a couple of preliminary Oxford MinION reads from skate DNA - not nearly enough coverage to make a dent in this project, just having a look at the data. Oxford produces two kinds of data (in their own <a href="https://porecamp.github.io/2016/tutorials/PoreCamp2016-02-MinIONData.pdf" target="_blank">annoying FAST5 format,</a> but I won't rant about that right now), single pass 1D and double pass 2D. [My Bioinformatics programmer Yuhan Hao did this analysis.] Here is what our data looks like.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjT_HrG6SPvhBc-ECwDRQvHUIhUMWpFMSAJmjapxHVmt3ozcGKbwp0P50jnj_waMeo5tkx4b17fX_zwgRRRMq5p7ASdQMoXKQggjvqLd9BqG57FUa-568vLiPn5cxNhVSTnm_Ar2rEeVb6s/s1600/Slide1.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjT_HrG6SPvhBc-ECwDRQvHUIhUMWpFMSAJmjapxHVmt3ozcGKbwp0P50jnj_waMeo5tkx4b17fX_zwgRRRMq5p7ASdQMoXKQggjvqLd9BqG57FUa-568vLiPn5cxNhVSTnm_Ar2rEeVb6s/s640/Slide1.JPG" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: left;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhk4RKQtlxN6Ow9d-FlCy0U1_N1l8jjhpul52tOE69s718lE0J1CVzqMK5Qm-3VtEwQpdDAgdBNBjyMjgvJt6iCVS6WrdXmq-ubL6rUT3PCDs9YvySRwxV_GDRPJjgEmlsaL7AW6mTaqfi/s1600/Slide2.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhk4RKQtlxN6Ow9d-FlCy0U1_N1l8jjhpul52tOE69s718lE0J1CVzqMK5Qm-3VtEwQpdDAgdBNBjyMjgvJt6iCVS6WrdXmq-ubL6rUT3PCDs9YvySRwxV_GDRPJjgEmlsaL7AW6mTaqfi/s640/Slide2.JPG" width="640" /></a>So the 1D reads are really long - some more than 50 kb. The 2D reads are mostly 2-10 kb. SkateBase has a complete contig of the mitochondrial genome, so we were able to align the Oxford sequences to this as a reference. Coverage was low, but we do have some regions where both 1D and 2D reads match the reference. What we can see is that the 1D reads have a lot of SNPs vs the reference, while the 2D reads have very few SNPs, so it is clear that the 2D reads have been successfully error corrected. Strangely, both the 1D and 2D reads have a lot of insertion-deletion errors (several per hundred bases) compared to the reference, and the indels in the 1D reads do not match those in the 2D reads - so we consider all of them to be novel, uncorrected errors.</div>
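One simple way to put a number on "several indels per hundred bases" is to count the insertion and deletion operations in the CIGAR string of each aligned read. The sketch below is in Python and is not the script we used. The CIGAR string is invented for illustration, and note that mismatches (SNPs) cannot be counted this way unless the aligner writes =/X operations or an MD tag.

```python
import re

# One CIGAR operation: a length followed by a single operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def indel_summary(cigar):
    """Count insertion/deletion events and bases in one CIGAR string."""
    aligned = ins_events = ins_bases = del_events = del_bases = 0
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "M=X":          # aligned columns (match or mismatch)
            aligned += length
        elif op == "I":          # bases present in the read only
            ins_events += 1
            ins_bases += length
        elif op == "D":          # bases present in the reference only
            del_events += 1
            del_bases += length
    events = ins_events + del_events
    per_100 = 100 * events / aligned if aligned else 0.0
    return {
        "aligned": aligned,
        "insertions": ins_events,
        "deletions": del_events,
        "indel_bases": ins_bases + del_bases,
        "indels_per_100_aligned": per_100,
    }


# Hypothetical CIGAR string, not taken from the skate data.
print(indel_summary("40M2I30M1D30M"))
```

Running this over every read in the alignment and comparing the 1D and 2D rates would give the same picture as eyeballing the alignments, but with numbers attached.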
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgsVbxJlfRxlS11SZyfaguAfZjvz2Jah5hBsl-V0nzqn66iOOqUtMDuP1bCzo2DA-lbnJU6P_CCpG0NlokAGL2Tt96_ENO0KctOVA5peA_w7k15kn8Vu70coTQxGlcsz5u96oOoJ_DFEzlQ/s1600/Slide5.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgsVbxJlfRxlS11SZyfaguAfZjvz2Jah5hBsl-V0nzqn66iOOqUtMDuP1bCzo2DA-lbnJU6P_CCpG0NlokAGL2Tt96_ENO0KctOVA5peA_w7k15kn8Vu70coTQxGlcsz5u96oOoJ_DFEzlQ/s640/Slide5.JPG" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
We also ran a standard Illumina whole genome shotgun sequencing run for the skate genome, which we aligned to the mitochondrial reference. With this data, we can see that a small number of the Oxford 2D SNPs are supported by hundreds of Illumina reads, while others are not. None of the indels are supported by our Illumina reads.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjuz5x8TPN5ua-2HjsEAA6kcP6gackwrrov9y3nElSzLpDMGMbgAt5GFhsyWq2Faq3OEAHF_LURvyIFXLDmPmtl3DbCfWbJrPQYrwJoZwihy-z8OxztRr5pOkK6fPoBotb3Z-6XYjCczry9/s1600/Slide7.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjuz5x8TPN5ua-2HjsEAA6kcP6gackwrrov9y3nElSzLpDMGMbgAt5GFhsyWq2Faq3OEAHF_LURvyIFXLDmPmtl3DbCfWbJrPQYrwJoZwihy-z8OxztRr5pOkK6fPoBotb3Z-6XYjCczry9/s320/Slide7.JPG" width="320" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJKW3wx6Kh8o9p4ydaqkfkPLELrSN3jb2zHNEKQE4vGClrzrdoGTrF4F6KEYQXu2sTYimNAgF6y6wpGC1oaC3Gi3ybYj4LcimBbUKKsz-RzLlx3jltVzDQu-uUi3O0UcE5fTcjR2MzMVK6/s1600/Slide8.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJKW3wx6Kh8o9p4ydaqkfkPLELrSN3jb2zHNEKQE4vGClrzrdoGTrF4F6KEYQXu2sTYimNAgF6y6wpGC1oaC3Gi3ybYj4LcimBbUKKsz-RzLlx3jltVzDQu-uUi3O0UcE5fTcjR2MzMVK6/s320/Slide8.JPG" width="320" /></a></div>
Other investigators have had <a href="http://datadryad.org/resource/doi:10.5061/dryad.5p0c3" target="_blank">poor quality Oxford sequences</a>. With more coverage, we may be able to use the Oxford reads as scaffolds for our de novo assembly. It may be possible to use Illumina reads for error correction, and mark all uncorrected areas of the Oxford sequences as low quality, but that is not the usual method for submitting draft genomes to GenBank.<br />
<br />Anonymoushttp://www.blogger.com/profile/14602560263535951430noreply@blogger.com4tag:blogger.com,1999:blog-4457216402399127579.post-90285842500624441722017-01-17T16:21:00.001-05:002017-01-17T16:21:05.143-05:00GenomeWeb reports on updated Coffee Beetle genomeA nice review of my 2015 <a href="http://www.nature.com/articles/srep12525" target="_blank">Coffee Beetle paper </a>in <a href="https://www.genomeweb.com/sequencing/coffee-pest-plant-genomes-presented-pag-conference" target="_blank">GenomeWeb</a> today.<br />
My genome had 163 million bases and 19,222 predicted protein-coding genes. I am very pleased to learn that a revised version of the draft genome sequence (from a group in Colombia) contains 160 million bases and 22,000 gene models. They also confirm the 12 horizontally transferred genes that I identified.<br />
<br />
<a href="https://www.genomeweb.com/sequencing/coffee-pest-plant-genomes-presented-pag-conference" target="_blank">Coffee Pest, Plant Genomes Presented at PAG Conference</a><br />
<br />
Researchers from the USDA's Agricultural Research Service, New York
University, King Abdullah University of Science and Technology, and
elsewhere published information on a <a href="http://www.nature.com/articles/srep12525" target="_blank">163 million base draft genome</a> for the coffee berry borer in the journal <em>Scientific Reports</em> in 2015.<br />
That
genome assembly, produced with Illumina HiSeq 2000 reads, housed
hundreds of small RNAs and an estimated 19,222 protein-coding genes,
including enzymes, receptors, and transporters expected to contribute to
coffee plant predation, pesticide response, and defense against
potential pathogens. It also provided evidence of horizontal gene
transfer involving not only mannanase, but several other bacterial genes
as well.<br />
At the annual Plant and Animal Genomes meeting here this week, National
Center for Coffee Research (Cenicafe) scientist Lucio Navarro provided
an update on efforts to sequence and interpret the coffee berry borer
genome during a session on coffee genomics. For their own recent analyses, Navarro and his
colleagues upgraded an earlier version of a coffee berry borer genome
that had been generated by Roche 454 FLX sequencing, using Illumina
short reads from male and female coffee berry borers to produce a
consensus assembly spanning around 160 million bases. The assembly is
believed to represent roughly 96 percent of the insect's genome.<br />
In
addition to producing a genome with improved contiguity levels, he
reported, members of that team also combined 454 and Illumina reads to
get consensus transcriptomes for the beetle. With these and other data,
they identified almost 22,000 gene models, novel transposable element
families, and their own evidence of horizontal gene transfer.Anonymoushttp://www.blogger.com/profile/14602560263535951430noreply@blogger.com1tag:blogger.com,1999:blog-4457216402399127579.post-30743349763024090912016-12-08T14:56:00.001-05:002016-12-08T14:56:54.306-05:00Finding differences between bacterial strains 100 bases at a time. This work was conducted mostly by my research assistant <a href="http://www.biotechniques.org/students/2015/yuhan/" target="_blank">Yuhan Hao</a>.<br />
<br />
We have recently been using <a href="http://bowtie-bio.sourceforge.net/bowtie2/index.shtml" target="_blank">Bowtie2</a> for sequence comparisons related to <i>shotgun metagenomics, </i>where we directly sequence samples that contain mixtures of DNA from different organisms, with no PCR. Bowtie2 alignments can be made very stringently (>90% identity), and computed very rapidly for large data files with hundreds of millions of sequence reads. This allows us to identify DNA fragments by species and by gene function, provided we have a well-annotated database that contains DNA sequences from similar microbes. I know that <a href="https://bitbucket.org/biobakery/metaphlan2" target="_blank">MetaPhlan</a> and other tools already do this, but I want to focus on the difference between bacteria (and viruses) at the <b>strain </b>level.<br /><br />
I have been playing around with the idea of constructing a database that will give strain-specific identification for human-associated microbes using Bowtie2 alignments with shotgun metagenomic data. There are plenty of sources for bacterial genome sequences, but the <a href="https://www.patricbrc.org/" target="_blank">PATRIC</a> database has a very good collection of human pathogen genomes. However the database is very redundant - with dozens or even hundreds of different strains for the same species, and the whole thing is too large to build into a Bowtie database (>350 Gb). So I am looking at ways to eliminate redundancy from the database while retaining high sensitivity at the species level to identify any sequence fragment, <b>not just specific marker genes</b>, and the ability to make <b>strain-specific alignments</b>. So I want to identify which genome fragments have shared sequences among a group of strains within a single species and which fragments are unique to each strain.<br />
<br />
Since our sequence reads are usually ~100 bases, we are looking at the similarity between bacterial strains when the genome is chopped into 100 bp pieces.<br />
<br />Bowtie2 can be limited to only perfect alignments (100 bases with zero mismatches) using the parameters <i>--end-to-end</i> and <i>--score-min C,0,-1</i> and to something similar to 99% identity with <span style="font-family: Calibri, sans-serif; font-size: 11pt;"> --score-min 'L,0,-0.06' </span><br />
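To see where the "perfect" and "99%" labels come from, it may help to work through the scoring arithmetic. The sketch below assumes Bowtie2's default end-to-end penalty of 6 for a mismatch at a high-quality base and 100-base fragments, and it ignores gaps and Ns. The function names are made up for illustration and are not part of Bowtie2.<br />
<br />

```python
import math

# Assumed Bowtie2 default penalty for a mismatch at a high-quality base.
MISMATCH_PENALTY = 6


def min_score(score_min, read_len):
    """Evaluate a --score-min style function string such as 'L,0,-0.06'."""
    kind, const, coeff = score_min.split(",")
    const, coeff = float(const), float(coeff)
    if kind == "C":
        # Constant threshold: assumed here to be just the constant term.
        return const
    if kind == "L":
        return const + coeff * read_len
    if kind == "S":
        return const + coeff * math.sqrt(read_len)
    if kind == "G":
        return const + coeff * math.log(read_len)
    raise ValueError("unknown function type: " + kind)


def max_mismatches(score_min, read_len=100):
    """Largest number of high-quality mismatches that still passes the threshold."""
    budget = -min_score(score_min, read_len)
    # The small epsilon guards against floating point round-off at exact multiples.
    return max(0, math.floor((budget + 1e-9) / MISMATCH_PENALTY))


for setting in ("C,0,-1", "L,0,-0.06", "L,0,-0.2"):
    print(setting, max_mismatches(setting))
```

<br />
Under these assumptions, C,0,-1 allows no mismatches, L,0,-0.06 allows one mismatch per 100 bases, and L,0,-0.2 allows three, which is where the approximate 100%, 99%, and 97% identity figures come from.<br />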
<span style="font-family: Calibri, sans-serif; font-size: 11pt;"><br /></span>
<span style="font-family: Calibri, sans-serif;"><span style="font-size: 14.6667px;">Between two strains of <i>Streptococcus agalactiae</i>, if we limit the Bowtie2 alignments to perfect matches, half of the fragments align. So at the 100 base level, half of the genome is identical, and the other half has at least one variant. At 99% similarity (one mismatch per read), about 2/3 of the fragments align, and the other 1/3 has more than one variant, or in some cases no similarity at all to the other genome. Yuhan Hao extended this experiment to another species (</span></span><span style="font-family: Calibri, sans-serif; font-size: 14.6667px;"><i>Strep pyogenes</i>),</span><span style="font-family: Calibri, sans-serif; font-size: 14.6667px;"> where almost no genome fragments align to <i>Strep ag.</i> at 100%, 99%, or ~97% identity (see the Table below). I have previously used the 97% sequence identity threshold to separate bacterial species, but I thought it was only for 16S rDNA sequences - here it seems to apply to almost every 100 base fragment in the whole genome. </span><br />
<span style="font-family: Calibri, sans-serif; font-size: 14.6667px;"><br /></span>
<span style="font-family: Calibri, sans-serif; font-size: 14.6667px;">So we can build a database with one reference genome per species and safely eliminate the 2/3 of fragments that are nearly identical between strains, and retain only those portions of the genome that are unique to each strain. I'm going to think about how to apply this iteratively (without creating a huge computing task), so we add only the truly <u>unique </u>fragments for EACH strain, rather than just testing if a fragment differs from the Reference for that species. With such a database, stringent Bowtie2 alignments will identify each sequence read by species and by strain and have very low false matches across different species. </span><br />
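As a toy illustration of the iterative idea (not the code we ran), the sketch below chops each genome into non-overlapping fragments and keeps a fragment for a strain only if it does not occur exactly in any genome already added. Exact substring matching on both strands stands in for a stringent Bowtie2 alignment here, which is an assumption made to keep the example self-contained; it would be far too slow for real genomes, and the function names are hypothetical.<br />
<br />

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string (non-ACGT characters are left as-is)."""
    return seq.translate(COMPLEMENT)[::-1]


def fragments(seq, size=100):
    """Yield non-overlapping fragments of exactly `size` bases; a short tail is dropped."""
    for start in range(0, len(seq) - size + 1, size):
        yield seq[start:start + size]


def strain_specific_fragments(genomes, size=100):
    """genomes: dict of name -> sequence, in the order they are added.

    The first genome acts as the Reference and keeps all of its fragments.
    Each later genome keeps only fragments with no exact match (either
    strand) in any genome added before it.
    """
    earlier = []  # both strands of every genome already in the database
    kept = {}
    for name, seq in genomes.items():
        seq = seq.upper()
        kept[name] = [frag for frag in fragments(seq, size)
                      if not any(frag in genome for genome in earlier)]
        earlier.extend([seq, revcomp(seq)])
    return kept


toy = {"ref": "AAAACCCCGGGG", "st2": "AAAACCCCTTAG"}
print(strain_specific_fragments(toy, size=4))
```

<br />
In the toy example only the last fragment of st2 is new, so it is the only one that would be added for that strain.<br />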
<span style="font-family: Calibri, sans-serif; font-size: 14.6667px;"><br /></span>
<span style="font-family: Calibri, sans-serif; font-size: 14.6667px;">We can visualize the 100 base alignments between two strains at the 100% identity level, 99% identity, and at the least stringent Bowtie2 setting (very fast local) to see where the strains differ. This makes a pretty picture (below), which reveals some fairly obvious biology: There are parts of the genome that are more conserved and parts that are more variable between strains. The top panel shows only the 100% perfect matches (grey bars), the second panel shows that we can add in some 99% matching reads (white bars with a single vertical color stripe to mark the mismatch base), and the lowest panel shows reads that are highly diverged (lots of colored mismatch bases and some reads that are clearly mis-aligned from elsewhere on the genome). So if we keep only the fragments that are less than 99% identical to the reference, we can build a database that lets Bowtie2 identify different strains with few multiple matches or incorrectly matched reads. </span><br />
<span style="font-family: Calibri, sans-serif; font-size: 14.6667px;"><br /></span>
<span style="font-family: Calibri, sans-serif; font-size: 14.6667px;"><br /></span>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_V-nhdcsYNEKWdvD983-xVhrROnPOGeh6fZe6WV15nJqX8kTmh7E9VFWBjG7Wf_DKS5V7af5BNkirvU9ksUcgl6QxSc0KtIC8GrEpN65M8OuLNW1coDdVcefkS093OL5TxsMA8e9-KQJm/s1600/igv_kmer-quality.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="304" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_V-nhdcsYNEKWdvD983-xVhrROnPOGeh6fZe6WV15nJqX8kTmh7E9VFWBjG7Wf_DKS5V7af5BNkirvU9ksUcgl6QxSc0KtIC8GrEpN65M8OuLNW1coDdVcefkS093OL5TxsMA8e9-KQJm/s640/igv_kmer-quality.jpg" width="640" /></a></div>
<span style="font-family: Calibri, sans-serif; font-size: 14.6667px;"><br /></span>
<span style="font-family: Calibri, sans-serif; font-size: 14.6667px;"><br /></span>
<span style="font-family: Calibri, sans-serif; font-size: 14.6667px;">This table lists the number of UNMATCHED fragments after alignment of all non-overlapping 100 bp fragments from the genomes of each strain to the <i>Strep.</i> </span><i style="font-family: Calibri, sans-serif; font-size: 14.6667px;">agalactiae </i><span style="font-family: Calibri, sans-serif; font-size: 14.6667px;">genome that we arbitrarily chose as the Reference. C 0 -1 corresponds to perfect matches only, L 0 -0.06 is approximately 99% identity (by my calculation), and L 0 -0.2 is about 97% identity. The lower the stringency of alignment, the fewer fragments end up in the "unmatched" file. </span><br />
<span style="font-family: Calibri, sans-serif; font-size: 14.6667px;"><br /></span>
<div class="MsoNormal">
<span style="font-size: 9.0pt; line-height: 115%;"> S.ag Ref S.ag st2 S.ag
st3 S.pyo st1 S.pyo st2<o:p></o:p></span></div>
<div class="MsoNormal">
<u><span style="background: aqua; font-size: 9.0pt; line-height: 115%; mso-highlight: aqua;"># 100b kmers 20615 22213 21660 17869 17094</span></u><u><span style="font-size: 9.0pt; line-height: 115%;"><o:p></o:p></span></u></div>
<div class="MsoNormal">
<span style="font-size: 9.0pt; line-height: 115%;">C 0 -1 0 <span style="background: yellow; mso-highlight: yellow;">0 11243 10571 17718 16934</span> (perfect matches)<o:p></o:p></span></div>
<div class="MsoNormal">
<span style="font-size: 9.0pt; line-height: 115%;">L 0 -0.06 0 <span style="background: yellow; mso-highlight: yellow;">0 6605 5918 17533 16762</span> (99% ident)<o:p></o:p></span></div>
<div class="MsoNormal">
<span style="font-size: 9.0pt; line-height: 115%;">L 0 -0.1 0 <span style="background: yellow; mso-highlight: yellow;">0 6534 5839 17533 16762</span><o:p></o:p></span></div>
<div class="MsoNormal">
<span style="font-size: 9.0pt; line-height: 115%;">L 0 -0.12 0 <span style="background: yellow; mso-highlight: yellow;">0 4812 4058 17371 16600</span><o:p></o:p></span></div>
<div class="MsoNormal">
<span style="font-size: 9.0pt; line-height: 115%;">L 0 -0.15 0 <span style="background: yellow; mso-highlight: yellow;">0 4767 4006 17369 16600</span><o:p></o:p></span></div>
<div class="MsoNormal">
<span style="font-size: 9.0pt; line-height: 115%;">L 0 -0.17 0 <span style="background: yellow; mso-highlight: yellow;">0 4756 3997 17369 16600</span><o:p></o:p></span></div>
<br />
<div class="MsoNormal">
<span style="font-size: 9.0pt; line-height: 115%;">L 0 -0.2 0 <span style="background: yellow; mso-highlight: yellow;">0 4156 3419 17237 16441</span>
(97% ident)<o:p></o:p></span></div>
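<br />
The unmatched counts are easier to compare as percentages of each strain's fragments. The short sketch below does that conversion for three of the rows. It assumes that the last four numbers in each row belong to the four non-reference strains, in the order given in the header; the variable and function names are only for illustration.<br />
<br />

```python
# Number of 100 bp fragments per non-reference strain (from the "# 100b kmers" row).
total_fragments = {
    "S.ag st2": 22213,
    "S.ag st3": 21660,
    "S.pyo st1": 17869,
    "S.pyo st2": 16934 + 160,  # 17094, written out to match the table
}

# Unmatched fragment counts, in the same strain order as total_fragments.
unmatched = {
    "C 0 -1": [11243, 10571, 17718, 16934],
    "L 0 -0.06": [6605, 5918, 17533, 16762],
    "L 0 -0.2": [4156, 3419, 17237, 16441],
}


def percent_aligned(total, n_unmatched):
    """Percentage of fragments that aligned to the Reference."""
    return 100.0 * (total - n_unmatched) / total


for setting, counts in unmatched.items():
    for (strain, total), n_unmatched in zip(total_fragments.items(), counts):
        print(f"{setting:10s} {strain:10s} {percent_aligned(total, n_unmatched):5.1f}% aligned")
```

<br />
With this reading of the table, about half of the S.ag st2 fragments align perfectly and about 70% align at the 99% setting, while less than 1% of the S.pyo st1 fragments align perfectly.<br />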
Anonymoushttp://www.blogger.com/profile/14602560263535951430noreply@blogger.com2tag:blogger.com,1999:blog-4457216402399127579.post-11004839951222231702016-05-27T12:13:00.002-04:002016-05-27T12:13:46.146-04:00Functional Metagenomics from shotgun sequences<div class="MsoNormal">
A number of groups at our research center have recently become
interested in metagenomic shotgun sequencing (MGS), which is simply taking
samples that are presumed to contain some microbes, extracting DNA and
sequencing all of it, shotgun style. This is seen as an improvement over
metagenomics methods that amplify 16S ribosomal RNA genes using bacteria-specific
PCR primers, and then sequence these PCR products. The 16S approach
has had a lot of success, in that essentially all of the important
"<a href="http://hmpdacc.org/pubs/publications.php" target="_blank">microbiome" publications</a> over the past 5 years have been based on the
<a href="http://link.springer.com/chapter/10.1007%2F978-90-481-9039-3_28" target="_blank">16S method</a>. The 16S approach has a number of advantages – the amount of
sequencing effort per sample is quite small (1,000 to 10,000 sequences is
usually considered adequate) and fairly robust computational methods have been
developed to process the sequence data into abundance counts for taxonomic
groups of bacteria. <o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
There are a number of drawbacks for the 16S method. The data
is highly biased by DNA extraction methods, PCR primers and conditions, DNA
sequencing technology, and computational methods used to clean, trim, cluster,
and identify the sequences. What I mean by biased is that if you change any of these factors, then you get a different set of taxa and abundances from the samples. Even when these biases are carefully addressed, the
accuracy of the taxonomic calls is not very good or reliable. It is simply not
possible to identify with high precision and accuracy all bacterial species (or strains) present in a DNA sample with
just ~400 bp of 16S sequence data. Many 16S microbiome studies report differences
in bacterial abundance at the genus or even the phylum level. <o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Two samples may be discovered to have <a href="http://aem.asm.org/content/71/12/8228.full" target="_blank">reproducible differences in 16S sequence content</a>, but the actual bacterial species or
strains that differ are not confidently identified. Even more important, the low resolution of 16S
studies may be missing a lot of important biology. Huge changes in environment can
perhaps favor anaerobes over aerobes, but smaller changes in pH, nutrient
abundance, or immune cell populations and function may cause a shift from one species in a genus to another. And this
difference in species or strain may bring important changes in metabolite flux,
immune system interaction, etc. <o:p></o:p><br />
<br /></div>
<div class="MsoNormal">
It has been proposed that <a href="http://journal.frontiersin.org/article/10.3389/fpls.2014.00209/full" target="_blank">bulk metagenomic shotgun sequencing (MGS)</a> of all of the DNA in a biosample, rather than just PCR amplified 16S
sequences, would allow for more precise species and strain identification, and
quantification of the actual microbial genes present. Some MGS methods also
attempt to count genes in specific functional groups such as within the Gene
Ontology, or KEGG pathways. Other people would like to discover completely
novel genes that innovative bacteria have developed to do interesting metabolic
things. This leads to a computationally hard problem. <a href="http://www.illumina.com/areas-of-interest/microbiology/microbial-sequencing-methods/shotgun-metagenomic-sequencing.html" target="_blank">MGS data tend to be very large</a> (200 million to 1 billion reads per sample), and databases of bacterial
genes are incomplete. In fact, we
probably have complete genome sequences for very much less than 0.1 percent of
all bacteria in the world. We might also like to identify DNA from archaea,
viruses, and small eukaryotes in our samples. <o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Each fragment of DNA in an MGS data file has to be
identified based on some type of inexact matching to some set of reference
genes or genomes (a sequence alignment problem), which is computationally very
demanding. In my experience, <a href="http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastHomeNew" target="_blank"><b>BLAST</b></a> is the most sensitive tool to match diverged
DNA sequences, but depending on the size of the database, it takes from 0.1 to 10
cpu seconds for each sequence to be aligned by BLAST. 100,000 sequences take
at least overnight (on 32 CPUs, 128 GB RAM), if not all week. I have never
tried a billion. Scale that up to a
billion reads per sample and you can see that we have a serious compute
challenge. We have at least 200 samples with FASTQ files queued up for analysis,
more being sequenced, and more investigators are preparing to start new studies.
So we need a scalable solution; this problem cannot be solved just by brute force BLAST
searching on ever bigger collections of computers.<o:p></o:p></div>
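<div class="MsoNormal">
A back-of-the-envelope estimate makes the scale of the problem concrete. The sketch below uses the 0.1 to 10 cpu seconds per sequence quoted above and assumes perfect scaling across 32 CPUs, which is optimistic; the function name is just for illustration.</div>

```python
SECONDS_PER_DAY = 86400


def wall_clock_days(n_reads, cpu_seconds_per_read, n_cpus):
    """Elapsed days to align n_reads, assuming the work splits evenly over n_cpus."""
    return n_reads * cpu_seconds_per_read / n_cpus / SECONDS_PER_DAY


reads_per_sample = 1_000_000_000
for cpu_seconds in (0.1, 10):
    days = wall_clock_days(reads_per_sample, cpu_seconds, n_cpus=32)
    print(f"{cpu_seconds} cpu s/read: {days:,.0f} days per sample")
```

<div class="MsoNormal">
Even at the fast end of the range this comes to more than a month of wall-clock time for a single billion-read sample, and at the slow end to roughly ten years.</div>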
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
As more investigators have become interested in MGS, computing
methods for processing this data have popped up like weeds. It's really
difficult to review/benchmark all of these tools, since there are no clear
boundaries for what analysis results they should deliver, and what methods they
should be using. The <a href="http://omictools.com/" target="_blank"><b>omictools.com</b> </a>website lists 12 tools for "metagenomics
gene prediction" and 5 for "metagenomics functional annotation"
however these categories might be defined. The best benchmark paper I could find (<b><a href="http://www.ncbi.nlm.nih.gov/pubmed/26778510" target="_blank"><span id="goog_358110505"></span>Gardner et al 2015</a></b>)<span id="goog_358110506"></span> looks at about 14. <o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Should the data be primer and quality trimmed (YES!), human
and other contaminants removed (YES!), duplicates removed (maybe not???),
clustered (????), assembled into contigs or complete genomes? Then what sort of
database should the fragments be aligned to? The PFAM library of protein
motifs, a set of complete bacterial genomes, some set of candidate genes? <o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
So far, the most successful tools focus on the
taxonomy/abundance problem – choosing some subset of the sequence data and
comparing it to some set of reference sequences. I have chosen <a href="http://huttenhower.sph.harvard.edu/metaphlan" target="_blank"><b>MetaPhlAn</b> </a><<a href="http://www.ncbi.nlm.nih.gov/pubmed/22688413">http://www.ncbi.nlm.nih.gov/pubmed/22688413</a>> to
process our data because it does well in benchmark studies, runs quickly on our
data, and <a href="http://huttenhower.sph.harvard.edu/" target="_blank"><b>Curtis Huttenhower</b></a> has a superb track record of producing excellent bioinformatics
software. In addition, we removed primers and quality trimmed using
<a href="http://www.usadellab.org/cms/?page=trimmomatic" target="_blank"><b>Trimmomatic</b></a>, then removed human sequences by using <b><a href="http://bowtie-bio.sourceforge.net/bowtie2/index.shtml" target="_blank">Bowtie2</a> </b>to align to the
human reference genome (as much as 90% of the data in some samples). Using
MetaPhlAn on the cleaned data files, we got species/abundance counts for our
~200 WGS samples in less than a week. [Note, when I say 'we', I actually mean
that all the work was done by <a href="https://www.linkedin.com/in/hao-chen-88aa9b22" target="_blank"><b>Hao Chen</b>,</a> my excellent Bioinformatics Programmer ]<o:p></o:p></div>
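<div class="MsoNormal">
For readers who want to try something similar, here is a rough sketch of the three steps as command lines built in Python. It is not the script we actually ran. The file names, index name, adapter file, thread count, and trimming thresholds are all placeholders, the reads are treated as single-end for simplicity, and the options should be checked against the versions of Trimmomatic, Bowtie2, and MetaPhlAn2 that you have installed.</div>

```python
def trim_cmd(fastq_in, fastq_out, adapters="adapters.fa", jar="trimmomatic.jar"):
    """Trimmomatic single-end adapter and quality trimming (thresholds are placeholders)."""
    return ["java", "-jar", jar, "SE", "-phred33", fastq_in, fastq_out,
            f"ILLUMINACLIP:{adapters}:2:30:10", "SLIDINGWINDOW:4:20", "MINLEN:50"]


def human_filter_cmd(fastq_in, fastq_out, index="human_index", threads=8):
    """Bowtie2 against a human index; reads that do NOT align are written to fastq_out."""
    return ["bowtie2", "-p", str(threads), "-x", index, "-U", fastq_in,
            "--un", fastq_out, "-S", "/dev/null"]


def metaphlan_cmd(fastq_in, profile_out, threads=8):
    """MetaPhlAn2 taxonomic profile from the cleaned reads."""
    return ["metaphlan2.py", fastq_in, "--input_type", "fastq",
            "--nproc", str(threads), "-o", profile_out]


def pipeline(sample):
    """Return the three commands for one sample, in the order they would be run."""
    raw = f"{sample}.fastq"
    trimmed = f"{sample}.trimmed.fastq"
    clean = f"{sample}.nonhuman.fastq"
    return [trim_cmd(raw, trimmed),
            human_filter_cmd(trimmed, clean),
            metaphlan_cmd(clean, f"{sample}.profile.txt")]


for command in pipeline("sample01"):
    print(" ".join(command))
```

<div class="MsoNormal">
Each command list could be handed to subprocess.run(command, check=True) or written into a cluster job script; printing them first makes it easy to check the options before committing compute time.</div>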
<div class="MsoNormal">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsGrcSIVqjq_vGlXXTvVhr33u6bjHlJsMKALffdzhoIvkO_7UEgPsL9g-Yb1wHgnTXV_FgLCa6QEJPNzvKeeNAweBProEE7YV7Y9gGggrf4yQlBX1eS2s4BxhK0uB6gSMsDDquDlv6Aix0/s1600/blurryheatmap.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsGrcSIVqjq_vGlXXTvVhr33u6bjHlJsMKALffdzhoIvkO_7UEgPsL9g-Yb1wHgnTXV_FgLCa6QEJPNzvKeeNAweBProEE7YV7Y9gGggrf4yQlBX1eS2s4BxhK0uB6gSMsDDquDlv6Aix0/s320/blurryheatmap.png" width="235" /></a></div>
<h4>
<div class="MsoNormal">
<br />
<ol>
<li><span style="background-color: #cccccc; font-weight: normal;">An intentionally fuzzy top species abundance heatmap of
pre-publication data for some MGS samples processed with MetaPhlAn. If we added
info about which samples came from which patients, this might be considerably
more interesting. </span></li>
</ol>
<o:p></o:p></div>
</h4>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Moving on, our next objective is to identify microbial protein coding
genes and metabolic pathways. Here, the methods become much more challenging, and difficult to
benchmark. I basically don’t like the idea of assembling reads into contigs.
This adds a lot of compute time, huge inconsistency among different samples
(some will assemble better than others), and creates all kinds of bias for
various species (high vs low GC genomes, repeats, etc.) and genes. Also, I'm having a hard time coming up with a solid data analysis plan that meets all of our objectives. Bacteria are diverse. A specific enzyme that fulfills a metabolic function (for example in a <a href="http://metacyc.org/META/class-tree?object=Pathways" target="_blank"><b>MetaCyc </b>pathway</a>) could differ by 80% of its DNA sequence from one type of bacteria to another, but still do the job. Alternately, there are plenty of multi-gene families of enzymes in bacteria with paralogs that differ in DNA sequence by 20% or less within the genome of a single organism, but perform different metabolic functions. And of course there are sequence variants in individual strains that inactivate an enzyme, or just modify one of its functions. There is no way we can compute such subtle stuff on billions of raw Illumina reads from a mixture of DNA fragments from unknown organisms (many of which have no reference data). So how do we compute a report that realistically describes the functional differences in gene/pathway metabolic capacity between different sets of MGS samples? <o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
If I have to design the gene content identifier myself, I
will probably translate all reads in 6 frames, then use <a href="https://svn.janelia.org/eddylab/eddys/src/hmmer/trunk/documentation/man/hmmscan.man" target="_blank">hmmscan</a> vs. the <a href="ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release" target="_blank">PFAM library</a> of known protein functional domains. The upside is that I know how to
do this, and it is reasonably sensitive for most types of proteins. The downside
is that many proteins will receive a very general function (e.g.
"kinase" or "7-TM domain"), which does not reveal exactly
what metabolic functions they are involved in. And of course some 30-40% of
proteins will come up with no known function – just like every new genome that
we sequence. Another way to do this would be to use BLASTx against the set of
<a href="http://www.uniprot.org/taxonomy/?query=*&fil=ancestor%3A%22Bacteria+%5B2%5D%22" target="_blank">bacterial proteins from UniProt.</a> I can speed this up a bit by taking the <a href="http://www.uniprot.org/help/uniref" target="_blank">UniRef 90% identity clusters</a> (which reduces the database size by about 25%). An
alternate proposal, which is much more clumsy, slower, and ad hoc, would be to
BLASTn against a large set of bacterial genomes. Either all the <a href="http://www.ncbi.nlm.nih.gov/genomes/MICROBES/microbial_taxtree.html" target="_blank">complete microbial genomes in GenBank</a>, or all the ones collected by the <a href="http://hmpdacc.org/reference_genomes/reference_genomes.php" target="_blank">Human Microbiome Project</a>,
or some taxonomically filtered set (one genome per genus???). The point of searching against many complete genomes is that we have some chance of catching rare or unusual genes that are not well annotated as a <a href="http://www.ncbi.nlm.nih.gov/pubmed/25428365" target="_blank">COG</a>, or <a href="http://www.genome.jp/kegg/pathway.html" target="_blank">KEGG</a> pathway member. <o:p></o:p></div>
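<div class="MsoNormal">
The six-frame translation step is simple enough to sketch in a few lines. The version below is only an illustration: it uses the standard genetic code, writes X for any codon containing an ambiguous base, and leaves stop codons in as asterisks rather than splitting the translation into open reading frames. The peptides would still need to be written out as FASTA before a search against the PFAM library.</div>

```python
from itertools import product

# Standard genetic code, listed in the usual TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def translate(seq):
    """Translate from the first base; an incomplete trailing codon is ignored."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translate(read):
    """Return the three forward-strand frames followed by the three reverse-strand frames."""
    read = read.upper()
    reverse = read.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:])
            for strand in (read, reverse)
            for offset in range(3)]


print(six_frame_translate("ATGGCC"))
```

<div class="MsoNormal">
A 100 base read gives six peptides of about 33 amino acids each, so this step multiplies the number of sequences to be searched by six.</div>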
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
According to <a href="http://www.ncbi.nlm.nih.gov/pubmed/26778510" target="_blank">Gardner et al (Sci. Rep. 2015),</a> the best tools
for identifying gene content in MGS samples are the free public servers <b><a href="http://metagenomics.anl.gov/" target="_blank">MG-RAST</a></b>
and the <b><a href="https://www.ebi.ac.uk/metagenomics/" target="_blank">EBI Metagenomics portal</a></b>. So we have submitted some test samples to
these. However, the queue is at least 10 days for processing by MG-RAST, so
this is maybe not going to satisfy my backlog of metagenomics investigators who are rapidly
cranking out more MGS samples. We probably need a local tool that will be under our own control in terms of compute time and more amenable to tweaking to the goals of each project. <o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
At this point I'm asking blog readers for suggestions for a
decent method or tool to identify gene content in big MGS FASTQ files which will be reasonably
accurate in terms of protein/pathway function, and not crush my servers. Some support from a review or benchmark paper (other than by the tool's own authors...) would be nice also. <o:p></o:p></div>
Anonymoushttp://www.blogger.com/profile/14602560263535951430noreply@blogger.com2tag:blogger.com,1999:blog-4457216402399127579.post-48471274607572367392016-03-02T10:53:00.001-05:002016-03-03T16:16:38.583-05:00Irreproducible resultsI get frustrated by the lack of stability in genomics data collection and analysis, which of course leads to irreproducible results. I imagine (naively I'm sure) that in physics one measures a quant or a nanode of some particle and it stays measured that way for years and decades. I do accept the inevitable technology changes that lead us to measure similar things, such as gene expression, in different ways (Northern blots, microarrays, RNA-seq). However, my lab-based collaborators become very frustrated when the exact same data (such as RNA-seq FASTQ files) produce different results, such as changes in the list of top differentially expressed (DE) genes with different p-values, when analyzed with different software. This frustration grows even more severe when the different results come from a different version of the same software!<br />
<br />
I was working through my RNA-seq tutorial with a group of students this week and they pointed out that my tutorial worksheet was wrong.<a href="http://cole-trapnell-lab.github.io/cufflinks/getting_started/" target="_blank"> Cufflinks </a>did not produce any significant DE genes with our test data comparing two small RNA-seq data files. This was surprising to me, since I did the exact same workflow with the same data files last year and it worked out fine. So <i>golly</i> and <i>darn </i>it, I got hit with the irreproducible results bug. We keep past versions of software available on our computing server with an <a href="http://modules.sourceforge.net/" target="_blank">Environment Modules</a> system, so I was able to quickly run a test of the exact same data files (aligned with the same Tophat version to the same reference genome) using different versions of Cufflinks. We have the following versions installed:<br />
cufflinks/2.0.2 (July 2012)<br />
cufflinks/2.1.1 (April 2013)<br />
cufflinks/2.2.0 (March 2014)<br />
cufflinks/2.2.1 (May 2014)<br />
<br />
I just used the simple <a href="http://cole-trapnell-lab.github.io/cufflinks/cuffdiff/index.html" target="_blank">Cuffdiff </a>workflow and looked at the gene_exp.diff output file for each software version. The results are quite different. Version 2.0.2 has 46 genes called "significant=yes" (multi-test adjusted q-value less than 0.05) with q-values running as low as 4.14E-10 (ok, one has a q-value of zero). Now this is not a great result from a biostatistics standpoint, since how can you expect to get significant p-values from RNA expression levels with two samples and no replicates? But it did make for an expedient exercise, since we could take the DE genes into <a href="https://david.ncifcrf.gov/" target="_blank">DAVID</a> and look for enriched biological functions and pathways.<br />
<br />
Then in version 2.1.1 we have two "significant=yes" genes. In version 2.2.0 we have zero significant genes, and in version 2.2.1 also zero.<br />
<br />
The top 10 genes, ranked by q-value also differ. There are <b><span style="color: red;">no genes in common</span> </b>between the top 10 list for version 2.0.2 and 2.1.1, and only the top 2 genes are shared by version 2.2.0. Thankfully, there are no differences in top genes or q-values between 2.2.0 and 2.2.1 (versions released only 2 months apart). I'm sure that <a href="http://cole-trapnell-lab.github.io/images/team/cole-trapnell.png" target="_blank">Cole Trapnell</a> et al. are diligently improving the software, but the consequences for those of us trying to use the tools to make some sense out of biology experiments can be unsettling.<br />
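<br />
This kind of comparison is easy to script. The sketch below is one way to do it, not the exact commands I used: it reads the gene_id, q_value, and significant columns of a gene_exp.diff style table and compares the significant sets and top-ranked genes between two versions. The two tiny tables are invented so that the example runs on its own (real files have many more columns and rows), and the function names are just for illustration.<br />
<br />

```python
import csv
import io


def read_gene_exp_diff(handle):
    """Return ({gene_id: q_value}, set of gene_ids flagged significant=yes)."""
    q_values = {}
    significant = set()
    for row in csv.DictReader(handle, delimiter="\t"):
        q_values[row["gene_id"]] = float(row["q_value"])
        if row["significant"] == "yes":
            significant.add(row["gene_id"])
    return q_values, significant


def top_genes(q_values, n=10):
    """Gene ids ranked by q-value, with ties broken alphabetically."""
    return sorted(q_values, key=lambda gene: (q_values[gene], gene))[:n]


# Invented toy tables standing in for the output of two software versions.
old_text = "gene_id\tq_value\tsignificant\nA\t0.001\tyes\nB\t0.2\tno\nC\t0.04\tyes\n"
new_text = "gene_id\tq_value\tsignificant\nA\t0.3\tno\nB\t0.2\tno\nC\t0.6\tno\n"

old_q, old_sig = read_gene_exp_diff(io.StringIO(old_text))
new_q, new_sig = read_gene_exp_diff(io.StringIO(new_text))

print("significant in old only:", sorted(old_sig - new_sig))
print("shared top 2:", sorted(set(top_genes(old_q, 2)) & set(top_genes(new_q, 2))))
```

<br />
For real data, replace the io.StringIO objects with open("gene_exp.diff") for each version's output directory.<br />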
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEheeIpsIj3Vc2TfvxbhRazTdaWB-2UJ5XiaktssOMsybFVcPi0a1L32mWqnLUWh5tHh_D_AIXF7rCQl09QskHMyGFifBKonNGvx1XdHcHSCd85hUu2kbPYWDK8Tivj1ZSbGvd3TqNnj-CuD/s1600/Cuff_versions.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="137" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEheeIpsIj3Vc2TfvxbhRazTdaWB-2UJ5XiaktssOMsybFVcPi0a1L32mWqnLUWh5tHh_D_AIXF7rCQl09QskHMyGFifBKonNGvx1XdHcHSCd85hUu2kbPYWDK8Tivj1ZSbGvd3TqNnj-CuD/s320/Cuff_versions.PNG" width="320" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div>
<br /></div>
<br />
<br />Anonymoushttp://www.blogger.com/profile/14602560263535951430noreply@blogger.com6tag:blogger.com,1999:blog-4457216402399127579.post-8371857222332038672016-02-14T10:46:00.002-05:002016-02-14T10:46:30.136-05:00RNA-seq Workshop<br />
We ran a tutorial at NYU Med Center last week on the basics of RNA-seq data analysis. This tutorial was based on the use of our High-Performance Linux cluster, so we actually presented the class as 2 sessions: a 2-hour session on basic Linux commands (for the complete novice) plus writing and submitting Sun Grid Engine scripts. Then in the 2nd 2-hour session we focused on TopHat/Cufflinks data processing with some sample data files. So this tutorial has some parts that are rather specific to the NYUMC computing system, but it may be quite similar to what computing resources might be available at other schools and research centers.<br />
<br />
The YouTube links to the Linux session (screencast):<br />
<a href="https://youtu.be/M3RVfv6lUtc">https://youtu.be/M3RVfv6lUtc</a><br />
<br />
and the RNA-seq session (screencast):<br />
<a href="https://youtu.be/hksQlJLwKqo" target="_blank">https://youtu.be/hksQlJLwKqo</a><br />
<br />
A wiki website with the Powerpoint slides and a collection of other resources:<br />
<a href="https://genome.med.nyu.edu/hpcf/wiki/RNA-seq_tutorial_2016">https://genome.med.nyu.edu/hpcf/wiki/RNA-seq_tutorial_2016</a><br />
<br />Anonymoushttp://www.blogger.com/profile/14602560263535951430noreply@blogger.com1tag:blogger.com,1999:blog-4457216402399127579.post-76810864030520802472016-01-28T12:06:00.001-05:002016-01-28T12:06:36.485-05:00The Tardigrade Miscalculation<span style="font-family: inherit;"><span style="font-family: "Calibri","sans-serif"; font-size: 11pt; line-height: 115%; mso-ansi-language: EN-US; mso-ascii-theme-font: minor-latin; mso-bidi-font-family: "Times New Roman"; mso-bidi-language: AR-SA; mso-bidi-theme-font: minor-bidi; mso-fareast-font-family: Calibri; mso-fareast-language: EN-US; mso-fareast-theme-font: minor-latin; mso-hansi-theme-font: minor-latin;"><span style="font-size: small;">There
was a lot of publicity back in November about the <a href="http://dx.doi.org/10.1073/pnas.1510461112" target="_blank">genome sequence of the Tardigrade</a> </span>(</span><em><span style="color: #663333; font-family: "Verdana","sans-serif"; line-height: 115%; mso-ansi-language: EN-US; mso-bidi-font-family: "Times New Roman"; mso-bidi-language: AR-SA; mso-bidi-theme-font: minor-bidi; mso-fareast-font-family: Calibri; mso-fareast-language: EN-US; mso-fareast-theme-font: minor-latin;"><a href="https://en.wikipedia.org/wiki/Hypsibius_dujardini"><span style="color: #990000; text-decoration: none; text-underline: none;"><span style="font-size: x-small;">Hypsibius
dujardini</span></span></a>), </span></em><span style="color: black;"><em><span style="color: #663333; font-family: "Verdana","sans-serif"; font-style: normal; line-height: 115%; mso-ansi-language: EN-US; mso-bidi-font-family: "Times New Roman"; mso-bidi-font-style: italic; mso-bidi-language: AR-SA; mso-bidi-theme-font: minor-bidi; mso-fareast-font-family: Calibri; mso-fareast-language: EN-US; mso-fareast-theme-font: minor-latin;">a small animal <span style="font-size: x-small;">(</span></span></em><span style="font-size: x-small;"><span style="color: #663333; font-family: "Verdana","sans-serif"; line-height: 115%; mso-ansi-language: EN-US; mso-bidi-font-family: "Times New Roman"; mso-bidi-language: AR-SA; mso-bidi-theme-font: minor-bidi; mso-fareast-font-family: Calibri; mso-fareast-language: EN-US; mso-fareast-theme-font: minor-latin;">0.05 – 1mm</span>)</span></span></span><em><span style="font-family: "Verdana","sans-serif"; font-style: normal; mso-bidi-font-family: "Times New Roman"; mso-bidi-font-style: italic; mso-bidi-theme-font: minor-bidi;"><span style="font-family: inherit;"><span style="color: black;">
th</span>at is somewhat similar to nematodes. These are fascinating little creatures
that have been described as incredibly resistant to all manner of physical
stress – high and low temperatures (reportedly from -272°C to +151°C), high
pressure, complete vacuum (Tardigrades in Space = </span><a href="http://tardigradesinspace.blogspot.com/" target="_blank"><span style="font-family: inherit;">TARDIS</span></a><span style="font-family: inherit;"> {<em>I kid you not</em>}), ionizing
radiation, and can survive without food or water for more than 10 years as kind of a dehydrated little lump.</span> </span></em><br />
<em><br />
<span style="font-family: "Verdana","sans-serif"; font-style: normal; mso-bidi-font-family: "Times New Roman"; mso-bidi-font-style: italic; mso-bidi-theme-font: minor-bidi;"><div style="text-align: center;">
<a href="http://www.sciencealert.com/images/articles/processed/22318923511_af450b6df7_o_web_1024.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="http://www.sciencealert.com/images/articles/processed/22318923511_af450b6df7_o_web_1024.jpg" height="257" width="640" /></a><a href="https://www.flickr.com/photos/katexic/22318923511/" style="-webkit-text-stroke-width: 0px; background-color: #005588; color: white; font-size-adjust: none; font-stretch: normal; font: 12px/20.4px "Open Sans", Helvetica, Arial, sans-serif; letter-spacing: normal; text-align: right; text-decoration: none; text-indent: 0px; text-transform: none; transition: color 0.2s ease-out; white-space: normal; widows: 1; word-spacing: 0px;" target="_blank">Tippett Studio/Cosmos A Spacetime Odyssey</a></div>
<div style="text-align: center;">
</div>
<div style="text-align: left;">
<span style="font-family: Calibri;">The reason the genome of the Tardigrade was such big news in November is that the group doing the bioinformatics analysis claimed that the genome contained <span style="-webkit-text-stroke-width: 0px; background-color: white; color: #333333; display: inline !important; float: none; font-size-adjust: none; font-stretch: normal; font: 14px/21px Arial, sans-serif; letter-spacing: normal; text-align: justify; text-indent: 0px; text-transform: none; white-space: normal; widows: 1; word-spacing: 0px;">6,663 genes from bacteria, a full sixth of the genome, and twice as many horizontally transferred genes as have ever been seen in any other organism (Boothby et al, PNAS <span style="color: black; font-size: small;">112(52):15976-81. doi:
10.1073/pnas.1510461112. PMID: 26598659). This "weird science" observation was covered by <a href="http://news.nationalgeographic.com/2015/11/151128-animals-tardigrades-water-bears-science-dna/" target="_blank">National Geographic</a>, Science News, <a href="http://phys.org/news/2015-11-huge-chunk-tardigrade-genome-foreign.html" target="_blank">Phys.org</a>, <a href="http://news.meta.com/2015/11/23/waterbear/" target="_blank">Meta Science News</a>, and of course the Univ. of North Carolina press site.</span></span></span></div>
<div style="text-align: left;">
<span style="font-family: Calibri;"><span style="-webkit-text-stroke-width: 0px; background-color: white; color: #333333; display: inline !important; float: none; font-size-adjust: none; font-stretch: normal; font: 14px/21px Arial, sans-serif; letter-spacing: normal; text-align: justify; text-indent: 0px; text-transform: none; white-space: normal; widows: 1; word-spacing: 0px;"></span></span> </div>
<div style="text-align: left;">
<span style="font-family: Calibri;"><span style="-webkit-text-stroke-width: 0px; background-color: white; color: #333333; display: inline !important; float: none; font-size-adjust: none; font-stretch: normal; font: 14px/21px Arial, sans-serif; letter-spacing: normal; text-align: justify; text-indent: 0px; text-transform: none; white-space: normal; widows: 1; word-spacing: 0px;">However, it seems quite clear now that this claim about horizontal DNA from bacteria (and maybe other phyla) in the genome of the Tardigrade was wrong. In fact, another group <span style="font-size: x-small;">(<span class="highwire-citation-authors"><span class="highwire-citation-author first article-author-popup-processed has-tooltip" data-delta="0" rel="#hw-article-author-popups-node9195 .author-tooltip-0" title=""><span class="nlm-given-names"><span style="color: black;">Georgios</span></span><span style="color: black;"> </span><span class="nlm-surname"><span style="color: black;">Koutsovoulos</span></span></span><span style="color: black;">, </span><span class="highwire-citation-author article-author-popup-processed has-tooltip" data-delta="1" rel="#hw-article-author-popups-node9195 .author-tooltip-1" title=""><span class="nlm-given-names"><span style="color: black;">Sujai</span></span><span style="color: black;"> </span><span class="nlm-surname"><span style="color: black;">Kumar</span></span></span><span style="color: black;">, </span><span class="highwire-citation-author article-author-popup-processed has-tooltip" data-delta="2" rel="#hw-article-author-popups-node9195 .author-tooltip-2" title=""><span class="nlm-given-names"><span style="color: black;">Dominik R</span></span><span style="color: black;"> </span><span class="nlm-surname"><span style="color: black;">Laetsch</span></span></span><span style="color: black;">, </span><span class="highwire-citation-author article-author-popup-processed has-tooltip" data-delta="3" rel="#hw-article-author-popups-node9195 .author-tooltip-3" 
title=""><span class="nlm-given-names"><span style="color: black;">Lewis</span></span><span style="color: black;"> </span><span class="nlm-surname"><span style="color: black;">Stevens</span></span></span><span style="color: black;">, </span><span class="highwire-citation-author article-author-popup-processed has-tooltip" data-delta="4" rel="#hw-article-author-popups-node9195 .author-tooltip-4" title=""><span class="nlm-given-names"><span style="color: black;">Jennifer</span></span><span style="color: black;"> </span><span class="nlm-surname"><span style="color: black;">Daub</span></span></span><span style="color: black;">, </span><span class="highwire-citation-author article-author-popup-processed has-tooltip" data-delta="5" rel="#hw-article-author-popups-node9195 .author-tooltip-5"><span class="nlm-given-names"><span style="color: black;">Claire</span></span><span style="color: black;"> </span><span class="nlm-surname"><span style="color: black;">Conlon</span></span></span><span style="color: black;">, </span><span class="highwire-citation-author article-author-popup-processed has-tooltip" data-delta="6" rel="#hw-article-author-popups-node9195 .author-tooltip-6"><span class="nlm-given-names"><span style="color: black;">Habib</span></span><span style="color: black;"> </span><span class="nlm-surname"><span style="color: black;">Maroon</span></span></span><span style="color: black;">, </span><span class="highwire-citation-author article-author-popup-processed has-tooltip" data-delta="7" rel="#hw-article-author-popups-node9195 .author-tooltip-7" title=""><span class="nlm-given-names"><span style="color: black;">Fran</span></span><span style="color: black;"> </span><span class="nlm-surname"><span style="color: black;">Thomas</span></span></span><span style="color: black;">, </span><span class="highwire-citation-author article-author-popup-processed has-tooltip" data-delta="8" rel="#hw-article-author-popups-node9195 .author-tooltip-8" title=""><span 
class="nlm-given-names"><span style="color: black;">Aziz</span></span><span style="color: black;"> </span><span class="nlm-surname"><span style="color: black;">Aboobaker</span></span></span><span style="color: black;">, </span><span class="highwire-citation-author article-author-popup-processed has-tooltip" data-delta="9" rel="#hw-article-author-popups-node9195 .author-tooltip-9" title=""><span class="nlm-given-names"><span style="color: black;">Mark</span></span><span style="color: black;"> </span><span class="nlm-surname"><span style="color: black;">Blaxter)</span></span></span></span></span> also working on the sequence of the exact same species has rapidly published a preprint manuscript on the bioRxiv preprint server "<span style="color: black;"><a href="http://biorxiv.org/content/early/2015/12/13/033464" target="_blank">The genome of the tardigrade <em>Hypsibius dujardini</em>"</a> </span> that clearly refutes the claims of Boothby et al. and points out their mistakes in genome analysis: <span style="color: red;">"<span style="font-size: small;">Cross-comparison of the assemblies, using raw read and RNA-Seq data, confirmed that the overwhelming majority of the putative HGT candidates in the previous genome were predicted from scaffolds at very low coverage and were not transcribed."</span></span></span></span></div>
<div style="text-align: left;">
<span style="font-family: Calibri;"><span style="-webkit-text-stroke-width: 0px; background-color: white; color: #333333; display: inline !important; float: none; font-size-adjust: none; font-stretch: normal; font: 14px/21px Arial, sans-serif; letter-spacing: normal; text-align: justify; text-indent: 0px; text-transform: none; white-space: normal; widows: 1; word-spacing: 0px;"></span></span> </div>
<div style="text-align: left;">
<span style="font-family: Calibri;"><span style="-webkit-text-stroke-width: 0px; background-color: white; color: #333333; display: inline !important; float: none; font-size-adjust: none; font-stretch: normal; font: 14px/21px Arial, sans-serif; letter-spacing: normal; text-align: justify; text-indent: 0px; text-transform: none; white-space: normal; widows: 1; word-spacing: 0px;">It is quite easy to get contaminants when you are doing whole genome sequencing for a multicellular organism. You grind up your target species, extract DNA and put it into the sequencing machine. Any bacteria and other small organisms on the surface or in the gut come along for the ride and can contribute their DNA to the sequencing library. Surprisingly, a small amount of bacterial contaminating DNA (perhaps just 1%) can lead to a large number of bacterial contigs in the final genome assembly. I can think of a couple of reasons for this, based on the small size of bacterial genomes (~1 Mb) vs. metazoan genomes (most >100 Mb). First, the genome coverage of a contaminant bacterium will be much higher for each kb of sequence data, so the 1% of contaminating DNA may have deep coverage of a bacterial genome. Second, any two bacterial DNA fragments randomly selected from a library have a much higher chance to overlap (smaller, less complex genome), so they will assemble better. </span></span></div>
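<div style="text-align: left;">
To put rough numbers on the first point, here is a back-of-the-envelope sketch in Python. The 1% contaminant fraction and the ~1 Mb and ~100 Mb genome sizes come from the paragraph above; the 10 Gb total sequencing yield is purely an assumed figure for illustration.</div>

```python
def fold_coverage(total_bases, percent_of_library, genome_size):
    """Average fold coverage of a genome that makes up a given percent of the library."""
    return total_bases * percent_of_library / 100 / genome_size

TOTAL_BASES = 10_000_000_000   # assumed run yield: 10 Gb (illustrative only)
HOST_GENOME = 100_000_000      # ~100 Mb metazoan genome
BACTERIAL_GENOME = 1_000_000   # ~1 Mb bacterial genome

host_cov = fold_coverage(TOTAL_BASES, 99, HOST_GENOME)
contaminant_cov = fold_coverage(TOTAL_BASES, 1, BACTERIAL_GENOME)

print(f"host coverage:        {host_cov:.0f}x")
print(f"contaminant coverage: {contaminant_cov:.0f}x")
```

<div style="text-align: left;">
Under these assumptions the 1% contaminant ends up with about the same depth of coverage as the target genome, which is plenty for an assembler to build contigs from it.</div>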
<div style="text-align: left;">
<span style="font-family: Calibri;"><span style="-webkit-text-stroke-width: 0px; background-color: white; color: #333333; display: inline !important; float: none; font-size-adjust: none; font-stretch: normal; font: 14px/21px Arial, sans-serif; letter-spacing: normal; text-align: justify; text-indent: 0px; text-transform: none; white-space: normal; widows: 1; word-spacing: 0px;"></span></span> </div>
<div style="text-align: left;">
<span style="font-family: Calibri;"><span style="-webkit-text-stroke-width: 0px; background-color: white; color: #333333; display: inline !important; float: none; font-size-adjust: none; font-stretch: normal; font: 14px/21px Arial, sans-serif; letter-spacing: normal; text-align: justify; text-indent: 0px; text-transform: none; white-space: normal; widows: 1; word-spacing: 0px;">There are a few QC steps that one can take on the raw data. </span></span><span style="font-family: Calibri;"><span style="-webkit-text-stroke-width: 0px; background-color: white; color: #333333; display: inline !important; float: none; font-size-adjust: none; font-stretch: normal; font: 14px/21px Arial, sans-serif; letter-spacing: normal; text-align: justify; text-indent: 0px; text-transform: none; white-space: normal; widows: 1; word-spacing: 0px;">There is a nice tool called <a href="http://ccb.jhu.edu/software/kraken/" target="_blank">Kraken </a> <span style="font-size: x-small;">(<span style="color: black;">Wood DE, Salzberg SL</span><span style="color: black;"> </span><i><span style="color: black;">Genome Biology</span></i><span style="color: black;"> 2014, </span><b><span style="color: black;">15</span></b></span><span style="color: black; font-size: small;"><span style="font-size: x-small;">:R46)</span> that can quickly run through an entire FASTQ file (4 million reads per minute on a single core) and mark each read according to a set of reference genomes based on exact matching of 31 base k-mers. The Kraken team also make available a pre-built 4 GB database constructed from complete bacterial, archaeal, and viral genomes in RefSeq. <a href="http://edwards.sdsu.edu/cgi-bin/deconseq/deconseq.cgi" target="_blank">DeconSeq</a> is another good tool to find contaminants with an easy web interface. Of course, some legitimate reads from any target organism will share k-mer sized chunks with some bacteria, viruses, etc. 
(and some sequences from contaminating bacteria will not be in any database), so one has to make some tough choices about what to remove from the data before assembly. </span></span></span></div>
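<div style="text-align: left;">
The idea of exact k-mer matching can be sketched in a few lines of Python. This is a toy illustration only, not Kraken's actual algorithm or API (Kraken uses 31-base k-mers and a taxonomy-aware database); the reference sequences, labels and k=5 below are invented for the example, and reverse complements are ignored.</div>

```python
from collections import Counter

def kmers(seq, k):
    """Yield every overlapping k-mer in a sequence."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def build_index(references, k):
    """Map each k-mer to the set of reference labels that contain it."""
    index = {}
    for label, seq in references.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(label)
    return index

def classify(read, index, k):
    """Label a read by the reference with the most exact k-mer hits, or None."""
    hits = Counter()
    for kmer in kmers(read, k):
        for label in index.get(kmer, ()):
            hits[label] += 1
    if not hits:
        return None
    return hits.most_common(1)[0][0]

# Hypothetical toy references; a real database holds whole genomes.
K = 5
REFERENCES = {
    "bacterium": "ATGCGTACGTTAGCCGATAA",
    "virus": "TTTTGGGGCCCCAAAATTTT",
}
INDEX = build_index(REFERENCES, K)

print(classify("CGTACGTTAGC", INDEX, K))   # read taken from the toy bacterium
print(classify("ACACACACACAC", INDEX, K))  # read with no matching k-mers
```

<div style="text-align: left;">
Reads that come back unlabeled are either from the target organism or from a contaminant that is missing from the database, which is exactly why the removal decision is a judgment call.</div>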
<div style="text-align: left;">
<span style="color: black; font-family: Calibri; font-size: small;"><span style="-webkit-text-stroke-width: 0px; background-color: white; color: #333333; display: inline !important; float: none; font-size-adjust: none; font-stretch: normal; font: 14px/21px Arial, sans-serif; letter-spacing: normal; text-align: justify; text-indent: 0px; text-transform: none; white-space: normal; widows: 1; word-spacing: 0px;"></span></span> </div>
<div style="text-align: left;">
<span style="color: black; font-family: Calibri; font-size: small;"><span style="-webkit-text-stroke-width: 0px; background-color: white; color: #333333; display: inline !important; float: none; font-size-adjust: none; font-stretch: normal; font: 14px/21px Arial, sans-serif; letter-spacing: normal; text-align: justify; text-indent: 0px; text-transform: none; white-space: normal; widows: 1; word-spacing: 0px;">After assembly, there are some additional steps one can take to flag contaminants. It is extremely helpful (I would now say required) to have some RNA-seq data from the same organism. When the RNA-seq library is prepared with a poly-A selection protocol, very little bacterial RNA should be present, since bacterial transcripts generally lack the long poly-A tails that the protocol selects for. Any contigs (with predicted genes) that do not contain a reasonable number of aligned RNA-seq reads are highly suspect. Any contig that has predicted genes only from a different species is clearly a red flag. </span></span></div>
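<div style="text-align: left;">
These two post-assembly checks are easy to script. Here is a minimal Python sketch; the Contig fields, the taxon labels and the 1 read per kb threshold are all hypothetical choices for illustration, and in practice the read counts and gene taxonomy would come from your own aligner and annotation pipeline.</div>

```python
from dataclasses import dataclass

@dataclass
class Contig:
    name: str
    length: int        # contig length in bases (assumed > 0)
    rnaseq_reads: int  # RNA-seq reads aligned to this contig
    gene_taxa: tuple   # best-hit taxon label for each predicted gene

def flag_contig(contig, target_taxon, min_reads_per_kb=1.0):
    """Return the list of reasons a contig looks like contamination (empty if none)."""
    reasons = []
    reads_per_kb = contig.rnaseq_reads / (contig.length / 1000)
    # Only contigs with predicted genes are expected to show expression.
    if contig.gene_taxa and reads_per_kb < min_reads_per_kb:
        reasons.append("low RNA-seq support")
    if contig.gene_taxa and all(t != target_taxon for t in contig.gene_taxa):
        reasons.append("all predicted genes from another taxon")
    return reasons

# Hypothetical example contigs.
contigs = [
    Contig("contig_1", 50_000, 4_200, ("Tardigrada", "Tardigrada")),
    Contig("contig_2", 8_000, 2, ("Bacteria", "Bacteria")),
]

for c in contigs:
    print(c.name, flag_contig(c, target_taxon="Tardigrada"))
```

<div style="text-align: left;">
A flagged contig is only a candidate for removal; it still deserves a manual look before it is thrown out or reported as horizontal gene transfer.</div>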
<div style="text-align: left;">
<span style="font-family: Calibri;"><span style="-webkit-text-stroke-width: 0px; background-color: white; color: #333333; display: inline !important; float: none; font-size-adjust: none; font-stretch: normal; font: 14px/21px Arial, sans-serif; letter-spacing: normal; text-align: justify; text-indent: 0px; text-transform: none; white-space: normal; widows: 1; word-spacing: 0px;"></span></span> </div>
<div style="text-align: left;">
<span style="font-family: Calibri;"><span style="-webkit-text-stroke-width: 0px; background-color: white; color: #333333; display: inline !important; float: none; font-size-adjust: none; font-stretch: normal; font: 14px/21px Arial, sans-serif; letter-spacing: normal; text-align: justify; text-indent: 0px; text-transform: none; white-space: normal; widows: 1; word-spacing: 0px;">While the authors of the original paper have not (yet) published a retraction, the citation in PubMed does carry a link to the <a href="http://dx.doi.org/10.1101/033464" target="_blank">refuting article </a> provided by author <a href="http://www.ncbi.nlm.nih.gov/myncbi/sujai.kumar.1/comments/" target="_blank">Sujai Kumar</a>.</span></span></div>
<span style="font-family: Calibri;"><div style="text-align: left;">
</div>
<span style="-webkit-text-stroke-width: 0px; background-color: white; color: #333333; display: inline !important; float: none; font-size-adjust: none; font-stretch: normal; font: 14px/21px Arial, sans-serif; letter-spacing: normal; text-align: justify; text-indent: 0px; text-transform: none; white-space: normal; widows: 1; word-spacing: 0px;"><div style="text-align: left;">
Rather than rant on about proper workflows for genome annotation (a best practices document does exist: <a href="http://www.marcottelab.org/users/BIO337_2014/EukGeneAnnotation.pdf" target="_blank">Mark Yandell & Daniel Ence, <span class="journalname">Nature Reviews Genetics</span> <span class="journalnumber">13</span>, <span class="cite-pages">329-342</span> </a><span class="cite-doi"><span class="doi"><a href="http://www.marcottelab.org/users/BIO337_2014/EukGeneAnnotation.pdf" target="_blank"><abbr title="Digital Object Identifier">doi</abbr>:10.1038/nrg3174</a>) </span></span>let me just say to the authors, the reviewers and the editors at PNAS that "EXTRAORDINARY CLAIMS REQUIRE EXTRAORDINARY EVIDENCE" (<a href="https://carm.org/extraordinary-claims-require-extraordinary-evidence" target="_blank">Carl Sagan</a>). Or as said by Laplace: <span style="color: black; font-size: small;"><a href="http://todayinsci.com/L/Laplace_PierreSimon/LaPlacePierreSimon-Quotations.htm" target="_blank">“The weight of evidence for an extraordinary claim must be proportioned to its strangeness.”</a></span></div>
</span><div style="text-align: left;">
</div>
</span><div style="text-align: left;">
</div>
<span style="font-family: Calibri;"><div style="text-align: left;">
</div>
<span style="-webkit-text-stroke-width: 0px; background-color: white; color: #333333; display: inline !important; float: none; font-size-adjust: none; font-stretch: normal; font: 14px/21px Arial, sans-serif; letter-spacing: normal; text-align: justify; text-indent: 0px; text-transform: none; white-space: normal; widows: 1; word-spacing: 0px;"><div style="text-align: left;">
<span style="color: black; font-family: Times New Roman; font-size: small;">
</span></div>
</span><div class="separator" style="clear: both; text-align: center;">
<a href="https://whyevolutionistrue.files.wordpress.com/2015/11/osos_de_agua_puede_sobrevivir_sin_comida_ni_agua_durante_m_s_de_una_d_cada3wodo1_400.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="197" src="https://whyevolutionistrue.files.wordpress.com/2015/11/osos_de_agua_puede_sobrevivir_sin_comida_ni_agua_durante_m_s_de_una_d_cada3wodo1_400.gif" width="320" /></a></div>
<div style="text-align: left;">
</div>
</span><div style="text-align: left;">
</div>
</span></em>Anonymoushttp://www.blogger.com/profile/14602560263535951430noreply@blogger.com13tag:blogger.com,1999:blog-4457216402399127579.post-32656685209288222252016-01-20T11:25:00.000-05:002016-01-25T11:35:56.498-05:00Cancer Moonshot in the CloudI've been reading a bit about the "Cancer Moonshot" discussion at the Davos economics conference. <br />
<br />
<a href="http://www.weforum.org/events/world-economic-forum-annual-meeting-2016/sessions/cancer-moonshot-a-call-to-action?utm_content=buffer7fc44&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer">http://www.weforum.org/events/world-economic-forum-annual-meeting-2016/sessions/cancer-moonshot-a-call-to-action</a><br />
<br />
Naturally I'm interested in the possible <a href="http://www.cancerletter.com/articles/20160122_1" target="_blank">increase in funding for genomics and bioinformatics research</a>, but also the discussion of 'big data' and sharing of genomics data are issues that I bump into all the time. It is almost impossible to overstate the number of hoops an ordinary scientist has to jump through to obtain access or to share human genomic data that has already been published. There is an entire system of "<a href="https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?page=login" target="_blank">authorized access</a>" that requires not only that scientists swear to handle genomic data securely and make no attempt to connect genomic data back to patient identities, but also that the University (or research institute) where they work must monitor and enforce these rules. I have had to deal with this system to upload human microbiome data (DNA sequences from bacteria found in or on the human body) that are contaminated with some human DNA. [But not with the <a href="https://genome.med.nyu.edu/coffee-beetle/cbb.html" target="_blank">coffee beetle genome!</a>] Then I had to apply again for authorization to view my own data to make sure it had been loaded properly.<br />
<br />
<b>Why is cancer genomic data protected? </b>Unfortunately, some annoyingly clever people such as <a href="http://www.nature.com/news/privacy-protections-the-genome-hacker-1.12940" target="_blank">Yaniv Erlich</a> have shown that it is possible (fairly easy in his hands) to identify people by name and address just from some of their genome sequence. Patients who agree to participate in research are supposed to be guaranteed privacy - they wanted to share information with scientists about the genetic nature of their tumors, not to share their health care records with nosy neighbors, privacy hackers and identity thieves.<br />
<br />
<b>Why do we need thousands of cancer genomes? </b>One key goal of cancer research is personalized medicine - matching up people with customized treatments based on the genetics of their cancer. Current technology is pretty good for DNA sequencing of tumors - for a single cancer patient we can come up with a list of somatic mutations (found only in the tumor) for a few thousand dollars worth of sequencing effort (and a poorly measured amount of bioinformatician and oncologist time). One of the biggest challenges right now is sorting through the list of mutations to figure out which ones are important drivers of cancer growth and disease severity - and should therefore be targeted by drugs or other therapy. Some mutations are well known to be bad actors, others are <a href="http://cancer.sanger.ac.uk/census" target="_blank">new mutations in genes that have been found to be mutated in other cancers</a>, others are complete unknowns. Data is needed from (hundreds of) thousands of tumors together with records of treatment response and other medical outcomes in order to build strong predictive models that will reliably advise the doctor about the medical importance of each observed mutation. Another challenge is the <a href="http://www.nature.com/subjects/tumour-heterogeneity" target="_blank">heterogeneity of cells within a single person's cancer</a>. As DNA sequencing technology improves, investigators have started to sequence small bits of tumors, or even single cells. They observe different mutations in different cells or sub-clones. Now a key question is whether the common resistance to drug treatment is a result of new mutations that occur during (or after) treatment, or whether the resistant cells already exist in the tumor, but are selected for growth by drug treatment. 
Overall, this means that precision cancer treatment may require a large number of different genome sequences from each patient, both during diagnosis and to monitor the course of treatment and post-treatment. <br />
<br />
So cancer genomic research requires thousands of genomes (deeply sequenced for accuracy and control of artifacts), which means that each authorized investigator must download terabytes of data, and then come up with the data storage and compute power to run his or her clever analysis. In addition to the strictly administrative hurdles of applying for and
maintaining authorized access to cancer genomic data, there are the
problems of data transfer, data storage, and big computing power. So the NIH (or other funding agency) has to pay once to generate the cancer genomic data, then again to store it and provide a high bandwidth web or FTP data sharing system, then again to administer the authorized access system, then again for each interested scientist to build a local computing system powerful enough to download, store, and analyze the data (and for University administrators to triple check that they are doing it properly, and again for the NIH administrators to check up on the University administrators to ensure they are doing their checking properly). This is an impressive amount of redundancy and wasted effort, even for the US Government.<br />
<br />
There is an obvious solution to this problem: <a href="http://image.slidesharecdn.com/picmix-dorkbot-090327044549-phpapp01/95/the-picmix-experiment-6-728.jpg?cb=1238129288" target="_blank">'Use the Cloud, Luke'</a>. A single Cloud computing system can store all cancer genomic data in a central location, together with a sufficiently massive amount of compute power so that authorized investigators can log in and run their analysis remotely. This technology already exists; Google, Amazon, Microsoft, IBM, Verizon, and at least a dozen other companies already have data centers large enough to handle the necessary data storage and compute tasks. It would be handy to build a whizz bang compute system with all kinds of custom software designed for cancer genomics, but that would take time (and <a href="https://cbiit.nci.nih.gov/ncip/nci-cancer-genomics-cloud-pilots" target="_blank">government contractors</a>). A better, faster, simpler system would just stick the genomic data in a central location and let researchers <a href="https://aws.amazon.com/products/?nc2=h_ql_ny_livestream_blu" target="_blank">launch virtual machines </a>with whatever software they want (or design for themselves). <a href="http://www.zdnet.com/article/amazon-ec2-cloud-is-made-up-of-almost-half-a-million-linux-servers/" target="_blank">Amazon EC2 has this infrastructure already in place.</a> It could be merged with the NIH authorized access system in a week-long hackathon. Cancer research funding agencies could award Cloud compute credits (or just let people budget for Cloud computing in the standard grant application).<br />
<br />
<br />
Cancer Moonshot: <br />
<br />
<img alt="Use the force luke - use the cloud luke " class="border instance_large_img" src="http://cdn.meme.am/instances/500x/65639707.jpg" style="width: 500px;" /><br />
<br />
<br />Anonymoushttp://www.blogger.com/profile/14602560263535951430noreply@blogger.com6tag:blogger.com,1999:blog-4457216402399127579.post-11315137507141197002015-10-23T09:56:00.001-04:002015-10-23T09:56:20.156-04:00Masters in Biomedical Informatics at NYU School of MedicineWe are starting a new Masters program in Biomedical Informatics at NYU School of Medicine in 2016. We currently have about a dozen PhD students, but the Masters program is intended to serve a wider group with more diverse backgrounds.<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgpk9i5HxtyLjZZdlQ-T4x4vRaJH53vaEwygbEhzc9anIxGGedoroQOgJs9b2z8twxMHGfOPHzSIKFvQTzcWncUO4gWo5IiU5b0qwcA9K7ijftaG7wE2V16ozByo-2Uo2a4HBDzlkIskhpE/s1600/BMI1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgpk9i5HxtyLjZZdlQ-T4x4vRaJH53vaEwygbEhzc9anIxGGedoroQOgJs9b2z8twxMHGfOPHzSIKFvQTzcWncUO4gWo5IiU5b0qwcA9K7ijftaG7wE2V16ozByo-2Uo2a4HBDzlkIskhpE/s320/BMI1.png" width="240" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiD-vD6PdZesQclqaGbwAXpJzRHGwaEyhZ1Ztclnwe32AZUbwOr1i12VqNWHCSLbGGGtLZsz_G-hvg9GigUFhQyIcIvggBqIzmfkNEfBHR58g4ZWOaz5QDh9FCXTANLFkU2irh63F3wEHn4/s1600/BMI2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiD-vD6PdZesQclqaGbwAXpJzRHGwaEyhZ1Ztclnwe32AZUbwOr1i12VqNWHCSLbGGGtLZsz_G-hvg9GigUFhQyIcIvggBqIzmfkNEfBHR58g4ZWOaz5QDh9FCXTANLFkU2irh63F3wEHn4/s320/BMI2.png" width="240" /></a></div>
<br />Anonymoushttp://www.blogger.com/profile/14602560263535951430noreply@blogger.com7tag:blogger.com,1999:blog-4457216402399127579.post-29882774858213354412015-09-04T11:11:00.000-04:002015-09-04T11:11:29.765-04:00Research Adventure with ENCODE DataAt NYU, first-year PhD students in the <a href="http://www.med.nyu.edu/sackler/phd-program" target="_blank">Sackler Institute</a> start their first semester with a week-long full-time "Research Adventure" workshop. I was asked (at short notice) to mentor a group of students for something in Bioinformatics. Since I had recently attended the <a href="https://www.encodeproject.org/tutorials/encode-users-meeting-2015/" target="_blank">2015 ENCODE Users Meeting</a>, I decided to make the workshop all about working with ENCODE data.<br />
<br />
I included tutorials about access to ENCODE data, an Intro to Linux for complete computing novices (quite a few of our students), Genomic Intervals in the UCSC Genome Browser, use of BEDTools to compare genomic intervals for various factors, and a tutorial in R for data display. Later in the week we looked at gene expression with RNA-seq using TopHat and Cufflinks. The general plan for the 5-day workshop (for 6 students) was as follows:<br />
<br />
<div class="MsoNormal" style="text-align: justify; text-justify: inter-ideograph;">
<b><span style="font-family: "Calibri","sans-serif"; font-size: 11.0pt; mso-ascii-theme-font: major-latin; mso-bidi-font-family: Arial; mso-hansi-theme-font: major-latin;">Monday<o:p></o:p></span></b></div>
<div class="MsoNormal" style="margin-left: 1.0in; tab-stops: 1.0in; text-align: justify; text-indent: -1.0in; text-justify: inter-ideograph;">
<u><span style="font-family: "Calibri","sans-serif"; font-size: 11.0pt; mso-ascii-theme-font: major-latin; mso-bidi-font-family: Arial; mso-hansi-theme-font: major-latin;">9-11:00 am Lecture (2 hr): Introduction to Gene
Regulation and Epigenetics <o:p></o:p></span></u></div>
<div class="MsoNormal" style="margin-left: 1.0in; tab-stops: 1.0in; text-align: justify; text-indent: -1.0in; text-justify: inter-ideograph;">
<u><span style="font-family: "Calibri","sans-serif"; font-size: 11.0pt; mso-ascii-theme-font: major-latin; mso-bidi-font-family: Arial; mso-hansi-theme-font: major-latin;">11 am-12:00 pm Lecture (1 hr): <a href="https://genome.nyumc.org/hpcf/wiki/Manual:Cluster_User_Guide" target="_blank">Use of the High Performance Computing Cluster</a> <o:p></o:p></span></u></div>
<div class="MsoNormal" style="margin-left: 1.0in; tab-stops: 1.0in; text-align: justify; text-indent: -1.0in; text-justify: inter-ideograph;">
<span style="font-family: "Calibri","sans-serif"; font-size: 11.0pt; mso-ascii-theme-font: major-latin; mso-bidi-font-family: Arial; mso-hansi-theme-font: major-latin;">12:00-2 pm Working Lunch with HPC System Manager
(2 hr): Set up HPC account for each student, practice Linux commands, move
files from laptop to HPC account <o:p></o:p></span></div>
<div class="MsoNormal" style="margin-left: 1.0in; tab-stops: 1.0in; text-align: justify; text-indent: -1.0in; text-justify: inter-ideograph;">
<span style="font-family: "Calibri","sans-serif"; font-size: 11.0pt; mso-ascii-theme-font: major-latin; mso-bidi-font-family: Arial; mso-hansi-theme-font: major-latin;">2-4 pm <i>Exercise 1: Tutorials for Accessing ENCODE data through the <a href="https://www.encodeproject.org/documents/91d43bca-5647-4e16-b051-c7e215459611/@@download/attachment/Users-Meeting-Portal-Exercises.pdf" target="_blank">ENCODE Portal</a>, <a href="https://www.encodeproject.org/documents/92228d7b-f959-4dcd-94d3-39f5173fd92a/@@download/attachment/UsersMtg-UCSCbrowser.pdf" target="_blank">UCSC Genome Browser</a> and <a href="https://www.encodeproject.org/documents/fe85d457-efb7-40d9-8e28-5246784117e3/@@download/attachment/ENCODE_Ensembl_workbook.pdf" target="_blank">ENSEMBL Browser</a></i><o:p></o:p></span></div>
<div class="MsoNormal" style="margin-left: 1.0in; tab-stops: 1.0in; text-align: justify; text-indent: -1.0in; text-justify: inter-ideograph;">
<br /></div>
<div class="MsoNormal" style="text-align: justify; text-justify: inter-ideograph;">
<b><span style="font-family: "Calibri","sans-serif"; font-size: 11.0pt; mso-ascii-theme-font: major-latin; mso-bidi-font-family: Arial; mso-hansi-theme-font: major-latin;">Tuesday<o:p></o:p></span></b></div>
<div class="MsoNormal" style="margin-left: 1.0in; tab-stops: 1.0in; text-align: justify; text-indent: -1.0in; text-justify: inter-ideograph;">
<u><span style="font-family: "Calibri","sans-serif"; font-size: 11.0pt; mso-ascii-theme-font: major-latin; mso-bidi-font-family: Arial; mso-hansi-theme-font: major-latin;">9-11:00 am Lecture & Demo: (2 hr): The UCSC
Genome Browser, BED file format, and BEDTools software<o:p></o:p></span></u></div>
<div class="MsoNormal" style="margin-left: 1.0in; tab-stops: 1.0in; text-align: justify; text-indent: -1.0in; text-justify: inter-ideograph;">
<span style="font-family: "Calibri","sans-serif"; font-size: 11.0pt; mso-ascii-theme-font: major-latin; mso-bidi-font-family: Arial; mso-hansi-theme-font: major-latin;">11 am-12:00 pm <i>Exercise
2: <a href="http://quinlanlab.org/tutorials/cshl2014/bedtools.html" target="_blank">BEDTools Tutorial</a></i><o:p></o:p></span></div>
<div class="MsoNormal" style="margin-left: 1.0in; tab-stops: 1.0in; text-align: justify; text-indent: -1.0in; text-justify: inter-ideograph;">
<span style="font-family: "Calibri","sans-serif"; font-size: 11.0pt; mso-ascii-theme-font: major-latin; mso-bidi-font-family: Arial; mso-hansi-theme-font: major-latin;">12-1:00 pm Lunch <o:p></o:p></span></div>
<div class="MsoNormal" style="margin-left: 1.0in; tab-stops: 1.0in; text-align: justify; text-indent: -1.0in; text-justify: inter-ideograph;">
<span style="font-family: "Calibri","sans-serif"; font-size: 11.0pt; mso-ascii-theme-font: major-latin; mso-bidi-font-family: Arial; mso-hansi-theme-font: major-latin;">1-3:00 pm <i>Exercise
3: Use of ENCODE Data and BEDTools to compute the Intersection of DNase
hypersensitive sites with promoters of all RefSeq genes</i></span><span style="font-family: "Calibri","sans-serif"; mso-ascii-theme-font: major-latin; mso-bidi-font-family: Arial; mso-hansi-theme-font: major-latin;"><o:p></o:p></span></div>
<div class="MsoNormal" style="tab-stops: 1.0in; text-align: justify; text-justify: inter-ideograph;">
<br /></div>
<div class="MsoNormal" style="text-align: justify; text-justify: inter-ideograph;">
<b><span style="font-family: "Calibri","sans-serif"; font-size: 11.0pt; mso-ascii-theme-font: major-latin; mso-bidi-font-family: Arial; mso-hansi-theme-font: major-latin;">Wednesday <o:p></o:p></span></b></div>
<div class="MsoNormal" style="margin-left: 1.0in; tab-stops: 1.0in; text-align: justify; text-indent: -1.0in; text-justify: inter-ideograph;">
<span style="font-family: "Calibri","sans-serif"; font-size: 11.0pt; mso-ascii-theme-font: major-latin; mso-bidi-font-family: Arial; mso-hansi-theme-font: major-latin;">9-10:30 am <u>Lecture: Computing Gene Expression
with RNA-Seq (1.5 hr)<o:p></o:p></u></span></div>
<div class="MsoNormal" style="margin-left: 1.0in; tab-stops: 1.0in; text-align: justify; text-indent: -1.0in; text-justify: inter-ideograph;">
<span style="font-family: "Calibri","sans-serif"; font-size: 11.0pt; mso-ascii-theme-font: major-latin; mso-bidi-font-family: Arial; mso-hansi-theme-font: major-latin;">10:30 am-12:00 pm <i>
Exercise 4: Align ENCODE RNA-seq data to hg19 reference genome with <a href="http://tophat%20manual/" target="_blank">TopHat</a></i></span><span style="font-family: "Calibri","sans-serif"; mso-ascii-theme-font: major-latin; mso-bidi-font-family: Arial; mso-hansi-theme-font: major-latin;"><o:p></o:p></span></div>
<div class="MsoNormal" style="margin-left: 1.0in; tab-stops: 1.0in; text-align: justify; text-indent: -1.0in; text-justify: inter-ideograph;">
<span style="font-family: "Calibri","sans-serif"; font-size: 11.0pt; mso-ascii-theme-font: major-latin; mso-bidi-font-family: Arial; mso-hansi-theme-font: major-latin;">12-1:00 pm Lunch <o:p></o:p></span></div>
<div class="MsoNormal" style="margin-left: 1.0in; tab-stops: 1.0in; text-align: justify; text-indent: -1.0in; text-justify: inter-ideograph;">
<span style="font-family: "Calibri","sans-serif"; font-size: 11.0pt; mso-ascii-theme-font: major-latin; mso-bidi-font-family: Arial; mso-hansi-theme-font: major-latin;">1-4 pm <i>Continue work on Exercise 4</i><o:p></o:p></span></div>
<div class="MsoNormal" style="margin-left: 1.0in; tab-stops: 1.0in; text-align: justify; text-indent: -1.0in; text-justify: inter-ideograph;">
<br /></div>
<div class="MsoNormal" style="text-align: justify; text-justify: inter-ideograph;">
<b><span style="font-family: "Calibri","sans-serif"; font-size: 11.0pt; mso-ascii-theme-font: major-latin; mso-bidi-font-family: Arial; mso-hansi-theme-font: major-latin;">Thursday<o:p></o:p></span></b></div>
<div class="MsoNormal" style="margin-left: 1.0in; tab-stops: 1.0in; text-align: justify; text-indent: -1.0in; text-justify: inter-ideograph;">
<u><span style="font-family: "Calibri","sans-serif"; font-size: 11.0pt; mso-ascii-theme-font: major-latin; mso-bidi-font-family: Arial; mso-hansi-theme-font: major-latin;">9-10:00 am Lecture (1 hr): Intro to data
visualization with R<o:p></o:p></span></u></div>
<div class="MsoNormal" style="margin-left: 1.0in; tab-stops: 1.0in; text-align: justify; text-indent: -1.0in; text-justify: inter-ideograph;">
<i><span style="font-family: "Calibri","sans-serif"; font-size: 11.0pt; mso-ascii-theme-font: major-latin; mso-bidi-font-family: Arial; mso-hansi-theme-font: major-latin;">10 am-12:00 pm Exercise 5: <a href="http://tryr.codeschool.com/" target="_blank">Try R Code School tutorial</a>.<o:p></o:p></span></i></div>
<div class="MsoNormal" style="margin-left: 1.0in; tab-stops: 1.0in; text-align: justify; text-indent: -1.0in; text-justify: inter-ideograph;">
<span style="font-family: "Calibri","sans-serif"; font-size: 11.0pt; mso-ascii-theme-font: major-latin; mso-bidi-font-family: Arial; mso-hansi-theme-font: major-latin;">12-1:00 pm Lunch <o:p></o:p></span></div>
<div class="MsoNormal" style="margin-left: 1.0in; tab-stops: 1.0in; text-align: justify; text-indent: -1.0in; text-justify: inter-ideograph;">
<u><span style="font-family: "Calibri","sans-serif"; font-size: 11.0pt; mso-ascii-theme-font: major-latin; mso-bidi-font-family: Arial; mso-hansi-theme-font: major-latin;">1-2:00 pm Lecture (1 hr): Differential Gene
Expression with <a href="http://cole-trapnell-lab.github.io/cufflinks/" target="_blank">Cufflinks</a><o:p></o:p></span></u></div>
<div class="MsoNormal" style="margin-left: 1.0in; tab-stops: 1.0in; text-align: justify; text-indent: -1.0in; text-justify: inter-ideograph;">
<span style="font-family: "Calibri","sans-serif"; font-size: 11.0pt; mso-ascii-theme-font: major-latin; mso-bidi-font-family: Arial; mso-hansi-theme-font: major-latin;">2-4:00 pm <i>Planning
for Research Project – choose ENCODE data for transcription factors, gene
expression, and epigenetic markers. Literature search.</i></span><span style="font-family: "Calibri","sans-serif"; mso-ascii-theme-font: major-latin; mso-bidi-font-family: Arial; mso-hansi-theme-font: major-latin;"><o:p></o:p></span></div>
<div class="MsoNormal" style="tab-stops: 1.0in; text-align: justify; text-justify: inter-ideograph;">
<br /></div>
<div class="MsoNormal" style="text-align: justify; text-justify: inter-ideograph;">
<b><span style="font-family: "Calibri","sans-serif"; font-size: 11.0pt; mso-ascii-theme-font: major-latin; mso-bidi-font-family: Arial; mso-hansi-theme-font: major-latin;">Friday<o:p></o:p></span></b></div>
<div class="MsoNormal" style="margin-left: 1.0in; tab-stops: 1.0in; text-align: justify; text-indent: -1.0in; text-justify: inter-ideograph;">
<span style="font-family: "Calibri","sans-serif"; font-size: 11.0pt; mso-ascii-theme-font: major-latin; mso-bidi-font-family: Arial; mso-hansi-theme-font: major-latin;">9 am-12:00 pm <i>Work
on Research Project</i><o:p></o:p></span></div>
<div class="MsoNormal" style="margin-left: 1.0in; tab-stops: 1.0in; text-align: justify; text-indent: -1.0in; text-justify: inter-ideograph;">
<span style="font-family: "Calibri","sans-serif"; font-size: 11.0pt; mso-ascii-theme-font: major-latin; mso-bidi-font-family: Arial; mso-hansi-theme-font: major-latin;">12-1:00 pm Lunch <o:p></o:p></span></div>
<br />
<div class="MsoNormal" style="margin-left: 1.0in; tab-stops: 1.0in; text-align: justify; text-indent: -1.0in; text-justify: inter-ideograph;">
<span style="font-family: "Calibri","sans-serif"; font-size: 11.0pt; mso-ascii-theme-font: major-latin; mso-bidi-font-family: Arial; mso-hansi-theme-font: major-latin;">1-4:00 pm <i>Work
on Data analysis and prepare presentation<u><o:p></o:p></u></i></span></div>
<div class="MsoNormal" style="margin-left: 1.0in; tab-stops: 1.0in; text-align: justify; text-indent: -1.0in; text-justify: inter-ideograph;">
<br /></div>
<div class="MsoNormal" style="margin-left: 1.0in; tab-stops: 1.0in; text-align: justify; text-indent: -1.0in; text-justify: inter-ideograph;">
<br /></div>
<br />
I had six students in our Research team: Elaine Fisher, Reuben Moncada, Shushan Sargsian, Beny Shapiro, Jong Shin, and Bo Xia. I have pasted images from their final presentation below (can't upload PowerPoint or PDF in this Blogger).<br />
<br />
My overall impression of the week was that the students learned a huge amount of computing skills, but it was a bit bumpy when we got to the RNA-seq methods. They had really good success comparing various Transcription Factor binding sites to known genes (promoter region, TSS, 3'UTR, exons, introns, 5'UTR), and finding interactions between TFs by looking for overlapping or nearby binding sites. We also found nice overlaps between ChIP-seq TF binding sites and DNase sensitive sites, histone modification sites, and computationally predicted TF binding sites. Also, the students did a nice job of measuring overlapping vs. nearby binding sites (bedtools slop), and measuring the significance of intersections using bedtools shuffle to create a statistical model of random intersections as a control.<br />
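The overlap, "nearby", and shuffle-control ideas described above can be sketched in a few lines of plain Python. This is only an illustration of the logic, not what the students ran and not the bedtools implementation: the function names, the single-chromosome simplification, the toy coordinates, and the chromosome length are all assumptions made for the example.<br />

```python
import random

# Intervals are (start, end) tuples, half-open, all on one hypothetical chromosome.

def slop(intervals, pad, chrom_len):
    """Extend each interval by `pad` bases on both sides, clipped to the chromosome."""
    return [(max(0, s - pad), min(chrom_len, e + pad)) for s, e in intervals]

def count_overlapping(a, b):
    """Count intervals in `a` that overlap at least one interval in `b`."""
    return sum(any(s < be and bs < e for bs, be in b) for s, e in a)

def shuffle(intervals, chrom_len, rng):
    """Place each interval at a random position, keeping its length."""
    out = []
    for s, e in intervals:
        length = e - s
        start = rng.randrange(0, chrom_len - length + 1)
        out.append((start, start + length))
    return out

def empirical_p(a, b, chrom_len, n_shuffles=1000, seed=0):
    """Fraction of shuffles of `a` that overlap `b` at least as often as observed."""
    rng = random.Random(seed)
    observed = count_overlapping(a, b)
    hits = sum(
        count_overlapping(shuffle(a, chrom_len, rng), b) >= observed
        for _ in range(n_shuffles)
    )
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_shuffles + 1)

# Toy data (made up): TF binding sites and DNase sensitive sites.
chrom_len = 10_000
tf_sites = [(100, 150), (400, 450), (900, 950)]
dnase_sites = [(120, 200), (430, 500), (1000, 1100)]

direct = count_overlapping(tf_sites, dnase_sites)
nearby = count_overlapping(slop(tf_sites, 100, chrom_len), dnase_sites)
observed, p_value = empirical_p(tf_sites, dnase_sites, chrom_len)
print(direct, nearby, observed, p_value)
```

With real data the intervals would come from BED files and span many chromosomes, so the shuffle would have to respect chromosome sizes, which is the bookkeeping that bedtools handles.<br />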
<div class="MsoNormal" style="margin-left: 1.0in; tab-stops: 1.0in; text-align: justify; text-indent: -1.0in; text-justify: inter-ideograph;">
<span style="text-indent: -1in;"><br /></span></div>
<div>
FASTQ data download and alignment is slow and error-prone (we had a lot of trouble making SGE scripts that would run correctly on our compute cluster). I should have shown TopHat just as a demo and used a small local FASTQ data file as an example rather than download and re-align ENCODE data. Using Cufflinks/Cuffdiff to compare gene expression from different cell lines was feasible with real ENCODE BAM files, but we would have needed to learn this earlier in the week and spend more time creating SGE scripts that would run nicely with multithreading (to complete in a reasonable amount of time).<br />
<div>
<br /></div>
<div>
If I did this sort of tutorial again, I would figure out a way for the students to measure differential gene expression between cell lines from pre-computed ENCODE RNA-seq quantified data (wig files). </div>
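As a rough idea of what such an exercise could look like, here is a minimal Python sketch that computes a log2 fold change per gene from two tables of pre-computed expression values. The gene names, the numbers, and the pseudocount are all made up for illustration, and I am assuming the quantifications have already been parsed into one value per gene. A fold change alone is not a significance test, so this is a starting point for exploring the data and not a replacement for Cuffdiff.<br />

```python
import math

def log2_fold_changes(expr_a, expr_b, pseudocount=1.0):
    """Return log2((a + pseudocount) / (b + pseudocount)) for genes present in both tables."""
    shared = expr_a.keys() & expr_b.keys()
    return {
        gene: math.log2((expr_a[gene] + pseudocount) / (expr_b[gene] + pseudocount))
        for gene in sorted(shared)
    }

# Hypothetical expression values (e.g. FPKM) for two cell lines.
cell_line_a = {"GENE1": 7.0, "GENE2": 0.0, "GENE3": 15.0}
cell_line_b = {"GENE1": 1.0, "GENE2": 3.0, "GENE3": 15.0}

lfc = log2_fold_changes(cell_line_a, cell_line_b)
for gene, value in lfc.items():
    print(gene, round(value, 2))
```

The pseudocount keeps genes with zero expression in one cell line from producing a division by zero or an infinite fold change.<br />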
<div>
<br /></div>
<div>
<br />
<div class="MsoNormal" style="margin-left: 1in; text-align: left; text-indent: -1in;">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgfTO34hG_BzYMEHqkk1LRvzcx4VfyidmeJTbKKta1p23P5iDq4MyW9w2MySkgLncBCZllBWU1y5x5vhBGd_q9CZ9gy8XsOUuQZQoCPfHkGouP0oLNXN9DoIcl4V83f2lyliGDksplTJqzj/s1600/Slide6.PNG" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"></a><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipGQ5ZMZ37BrBgcAdnzf2i2j7zyO5EtmjYI8oTqRbgpbJtg-Em65L6evCoX3PtOfM2mYntD_MUyIoBuC9jtwQc-t5o3rrnPnSA_RGY9zTxMKKiXoYw6VPC49E5l5WWgA6X9EKcTw0vBu0W/s1600/Slide1.PNG" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipGQ5ZMZ37BrBgcAdnzf2i2j7zyO5EtmjYI8oTqRbgpbJtg-Em65L6evCoX3PtOfM2mYntD_MUyIoBuC9jtwQc-t5o3rrnPnSA_RGY9zTxMKKiXoYw6VPC49E5l5WWgA6X9EKcTw0vBu0W/s320/Slide1.PNG" width="320" /></a><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjRwREWXN5oFb8KBHDOTCz7S-EcRshHt33jRLmoXN4AYKMSyCZjF9EBpCoElGPhQtW94KN8HB801is8f2Wn3KJqa8uXgQQlWpkHuQRkNCHfhgmzDyyv0cy3iIK-djIX1mckINNKYFDpUgll/s1600/Slide3.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjRwREWXN5oFb8KBHDOTCz7S-EcRshHt33jRLmoXN4AYKMSyCZjF9EBpCoElGPhQtW94KN8HB801is8f2Wn3KJqa8uXgQQlWpkHuQRkNCHfhgmzDyyv0cy3iIK-djIX1mckINNKYFDpUgll/s320/Slide3.PNG" width="320" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg0jc7eaoJmCTBxqWV86X_odgSR2lAEjSl-tIgKbxPbvk8F1tYReNPA1axJ2xQRKhyphenhyphenpDVM03JbgIkjoXLfWcnYNvXXVCqdpTsqQ77eTOVoBAQ79K5WZcabt4tUYZ6Y8AiFH4ZaBadknwRMP/s1600/Slide4.PNG" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg0jc7eaoJmCTBxqWV86X_odgSR2lAEjSl-tIgKbxPbvk8F1tYReNPA1axJ2xQRKhyphenhyphenpDVM03JbgIkjoXLfWcnYNvXXVCqdpTsqQ77eTOVoBAQ79K5WZcabt4tUYZ6Y8AiFH4ZaBadknwRMP/s320/Slide4.PNG" width="320" /></a><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgLvKG8VBdSI3vQusKUhFOAnWyuT4rPsLMV9lECaQYF1SN4HIXqtGfI7elff0Lf9xNgWmDxJoXnmk51TBPQGyMBQzlhqCAJ0JtdxvskEET1Y98gEmtMw1L6ZPScOoEqT4fkS9TAE1uItRdG/s1600/Slide5.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgLvKG8VBdSI3vQusKUhFOAnWyuT4rPsLMV9lECaQYF1SN4HIXqtGfI7elff0Lf9xNgWmDxJoXnmk51TBPQGyMBQzlhqCAJ0JtdxvskEET1Y98gEmtMw1L6ZPScOoEqT4fkS9TAE1uItRdG/s320/Slide5.PNG" width="320" /></a></div>
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhp1h4IXVXUVFSnB23-4JdePDq1mfrDsHhPqXkIOHKkJCCsqK_mrgOZuol_4frSYpYowU0uRgdofMXum6jIkIfD1dCmDMQp0wnkPCYCSLmfNf5u1KcgYA_Rt2VeHjsroQh6_e9xrWp8ehMF/s1600/Slide10.PNG" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhp1h4IXVXUVFSnB23-4JdePDq1mfrDsHhPqXkIOHKkJCCsqK_mrgOZuol_4frSYpYowU0uRgdofMXum6jIkIfD1dCmDMQp0wnkPCYCSLmfNf5u1KcgYA_Rt2VeHjsroQh6_e9xrWp8ehMF/s320/Slide10.PNG" width="320" /></a><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiLO59Cz5TJH82KfO2xrRcW2i5KzZ3csuu_svaAaKzQ6YJfIO1ZAuoWfxMpBlxwxcO8ldpGsmaEyYSaioiQ8sRO_C-UGRZQangQoGxFUl5vgbkx60bSg082laeZ-CK7eI2QSPOIZAiMKvsl/s1600/Slide7.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiLO59Cz5TJH82KfO2xrRcW2i5KzZ3csuu_svaAaKzQ6YJfIO1ZAuoWfxMpBlxwxcO8ldpGsmaEyYSaioiQ8sRO_C-UGRZQangQoGxFUl5vgbkx60bSg082laeZ-CK7eI2QSPOIZAiMKvsl/s320/Slide7.PNG" width="320" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi52Q_gM6unMpav3L14LaMC3KGBUWXKt6c8J958XCznr1rqDpkcOMyxBl4oRS17RUFSqWXzr5VMsvDsfpS1FnKt-Uho3sDpdhR0qc7QzZnVPOHdNHVCgBBOFp-Mk-mvjWANfu7iwGIVKQuO/s1600/Slide8.PNG" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi52Q_gM6unMpav3L14LaMC3KGBUWXKt6c8J958XCznr1rqDpkcOMyxBl4oRS17RUFSqWXzr5VMsvDsfpS1FnKt-Uho3sDpdhR0qc7QzZnVPOHdNHVCgBBOFp-Mk-mvjWANfu7iwGIVKQuO/s320/Slide8.PNG" width="320" /></a><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhcjmwOGGfX1adWRzFNVZLRirVUiMLF7y11BNng_7FCbynSN-ekaVcycieEkn9QmUSbm032tsHCDnoy4y9hWHjXCqYxkIoTdXX6xlR1XK7cEKzebaa3tB3IJ2j2RzZX93NHWSe_3k1dD9Vz/s1600/Slide9.PNG" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhcjmwOGGfX1adWRzFNVZLRirVUiMLF7y11BNng_7FCbynSN-ekaVcycieEkn9QmUSbm032tsHCDnoy4y9hWHjXCqYxkIoTdXX6xlR1XK7cEKzebaa3tB3IJ2j2RzZX93NHWSe_3k1dD9Vz/s320/Slide9.PNG" width="320" /></a></div>
<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTOKyKqpWIVq4OjGkSDc-BpGUYimwqlFUbSUdbpNOdmyyXUiTkjnh4mrOWMgbrInPE5RT1kMUKODuqhY4RFz2iiYS1fv56Q_D0U0x0tJlLbDQJ8feMrecASXoj2XNQr9KrBrKAeYPUsaPj/s1600/Slide12.PNG" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTOKyKqpWIVq4OjGkSDc-BpGUYimwqlFUbSUdbpNOdmyyXUiTkjnh4mrOWMgbrInPE5RT1kMUKODuqhY4RFz2iiYS1fv56Q_D0U0x0tJlLbDQJ8feMrecASXoj2XNQr9KrBrKAeYPUsaPj/s320/Slide12.PNG" width="320" /></a><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjxRxjM0YAaxGng2GPpi7We3LUn-NZSZcnR0mj6pPjrYe-YivDarKFv74Yl3xxd-vONvuO-IJ7IENuz7STxiuslsfts0nsnGgI5pq5KM09ePcbPp5x3TxI5Ap0A8_JELe018oSe_NAI8Dql/s1600/Slide11.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjxRxjM0YAaxGng2GPpi7We3LUn-NZSZcnR0mj6pPjrYe-YivDarKFv74Yl3xxd-vONvuO-IJ7IENuz7STxiuslsfts0nsnGgI5pq5KM09ePcbPp5x3TxI5Ap0A8_JELe018oSe_NAI8Dql/s320/Slide11.PNG" width="320" /></a></div>
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidC4J9tugFk77ZkfUtU1gzg97KQdX_RdLguAeB1VYkwSx63eH2VqIHOjwc7Pf5HLIbmLHdBWGMnMt1tQ8zziZ3tvwjBimVmJUwTbBmqJjVJepdyWsdkqvwgnI_C8_3fD-SJAGyrz_2OOLE/s1600/Slide15.PNG" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidC4J9tugFk77ZkfUtU1gzg97KQdX_RdLguAeB1VYkwSx63eH2VqIHOjwc7Pf5HLIbmLHdBWGMnMt1tQ8zziZ3tvwjBimVmJUwTbBmqJjVJepdyWsdkqvwgnI_C8_3fD-SJAGyrz_2OOLE/s320/Slide15.PNG" width="320" /></a><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEifQ215JVrAGHKGPGULdIeEfySfDxk-UN3Wb-Sca2lWznoBXr-pnmfztZCo1LSaK9nm-qX0RleTPasjlWNK48MU2T4QYTR0cyKsKilwv-EAUCs4ngfGhCtAkXURUFE7JA8gAy45xgTR-mao/s1600/Slide14.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEifQ215JVrAGHKGPGULdIeEfySfDxk-UN3Wb-Sca2lWznoBXr-pnmfztZCo1LSaK9nm-qX0RleTPasjlWNK48MU2T4QYTR0cyKsKilwv-EAUCs4ngfGhCtAkXURUFE7JA8gAy45xgTR-mao/s320/Slide14.PNG" width="320" /></a></div>
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjmJHBAlX1q8AGULuTXr5QaoL30EcfLR0sCi6KYi31mmjvLTfS57vevCkEPMUOnvzV4lssl8qYN4n7P-2Vday1T0JQt_BdW1HgcjNf6yNRM_LfG_eox2HZMpv-LusZP5AMPOAtR_bZIJZv2/s1600/Slide17.PNG" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjmJHBAlX1q8AGULuTXr5QaoL30EcfLR0sCi6KYi31mmjvLTfS57vevCkEPMUOnvzV4lssl8qYN4n7P-2Vday1T0JQt_BdW1HgcjNf6yNRM_LfG_eox2HZMpv-LusZP5AMPOAtR_bZIJZv2/s320/Slide17.PNG" width="320" /></a><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgmBCLPFu7RMoidvM1JrKMt6npoWsPujyFW_QegRPk1P1SW2qz0GExmdkDi9ysXRY9SMN91hhXUzx5Ad-dsdPr_SD5916z9zXqL-PMOYpIVkqti3hMKUXOxYTPCPpl0CkbIo7wzrBN8dhyphenhyphenE/s1600/Slide16.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgmBCLPFu7RMoidvM1JrKMt6npoWsPujyFW_QegRPk1P1SW2qz0GExmdkDi9ysXRY9SMN91hhXUzx5Ad-dsdPr_SD5916z9zXqL-PMOYpIVkqti3hMKUXOxYTPCPpl0CkbIo7wzrBN8dhyphenhyphenE/s320/Slide16.PNG" width="320" /></a></div>
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidnYMkJ3l9_Fk3-OiyBMOxOtdqWkl57k7uiRVn_l6aWzCcxAXVFj73otI3Bh9_fsaT0wYq0S8PVxxn9EQ_wAiM4fll14pHhTc9VRxWzciyhm-f0Dly7FJcSmYk4tpnfDsADFkTgea8F0F-/s1600/Slide20.PNG" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidnYMkJ3l9_Fk3-OiyBMOxOtdqWkl57k7uiRVn_l6aWzCcxAXVFj73otI3Bh9_fsaT0wYq0S8PVxxn9EQ_wAiM4fll14pHhTc9VRxWzciyhm-f0Dly7FJcSmYk4tpnfDsADFkTgea8F0F-/s320/Slide20.PNG" width="320" /></a><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg9gg7UK9pZZXeRumUJPxkr9J8EkNt9ifFmz3KZ4z83BW7iZCuQY45gcZ7dQQbRtYMJdvYz9XHkrU2abqcN3zy_O49yxY8jwCvvJBfADl5ljohrlAHlxg3v2xYcW4ytzvUhYZoQ07pu5y3e/s1600/Slide19.PNG" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg9gg7UK9pZZXeRumUJPxkr9J8EkNt9ifFmz3KZ4z83BW7iZCuQY45gcZ7dQQbRtYMJdvYz9XHkrU2abqcN3zy_O49yxY8jwCvvJBfADl5ljohrlAHlxg3v2xYcW4ytzvUhYZoQ07pu5y3e/s320/Slide19.PNG" width="320" /></a><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHYIsfsyiKdCW-eIoCArgRg5DPWsgGMS-NUE0bqf9nX6mWayJoJ73jrHGFLJ8CJlw7ITt_XFk2h4-j03BRbp4jNJcLxfgYEJhKhdgZZHUYaZ2KW7Wmdtp9ZWcUKhDZG6FWZIjm9T-QNJuT/s1600/Slide18.PNG" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHYIsfsyiKdCW-eIoCArgRg5DPWsgGMS-NUE0bqf9nX6mWayJoJ73jrHGFLJ8CJlw7ITt_XFk2h4-j03BRbp4jNJcLxfgYEJhKhdgZZHUYaZ2KW7Wmdtp9ZWcUKhDZG6FWZIjm9T-QNJuT/s320/Slide18.PNG" width="320" /></a><a 
href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjOLMXCJle5cdYx04Xh5pOozCb2w5LjDH9591oj6vViwMyCDz1GxDrZIU2-wbP9JA6lVPd3cR3YM2rbP1Ja9146SXZovyxd_MZmgxxEkLi3RIa4wS1z9se-7oKyYuhVKxyutmFVDX7LMD3r/s1600/Slide21.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjOLMXCJle5cdYx04Xh5pOozCb2w5LjDH9591oj6vViwMyCDz1GxDrZIU2-wbP9JA6lVPd3cR3YM2rbP1Ja9146SXZovyxd_MZmgxxEkLi3RIa4wS1z9se-7oKyYuhVKxyutmFVDX7LMD3r/s320/Slide21.PNG" width="320" /></a></div>
<br />
<br />
<br />
<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<br /></div>
</div>
</div>
Anonymoushttp://www.blogger.com/profile/14602560263535951430noreply@blogger.com2tag:blogger.com,1999:blog-4457216402399127579.post-15677741527249212482015-07-31T11:05:00.000-04:002015-07-31T14:00:21.059-04:00Coffee Berry Borer genome published<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Our paper on the <u><i>de novo</i></u> genome sequence and annotation of the Coffee Berry Borer (a beetle) is published today in <a href="http://www.nature.com/srep/2015/150731/srep12525/full/srep12525.html" target="_blank">Nature Scientific Reports</a>. This was a really fun project, where I was pushed to do a lot more in-depth study of insect biology (such as antimicrobial and cytochrome P450 proteins). We also discovered that this beetle has captured a bunch of bacterial genes into its genome (horizontal gene transfer) - which seems odd, but was actually previously reported for this insect and many others. Interestingly, most of these captured bacterial genes encode starch-digesting enzymes, which support the beetle's lifestyle of living entirely inside of the coffee bean and eating nothing but coffee! We are of course hoping that these genes can be used as some sort of target for control of the pest, which causes something like a billion $$ of annual damage worldwide to our beloved coffee. </div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://www.nature.com/srep/2015/150731/srep12525/full/srep12525.html">http://www.nature.com/srep/2015/150731/srep12525/full/srep12525.html</a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div style="text-align: center;">
<a href="http://www.nature.com/srep/2015/150731/srep12525/pdf/srep12525.pdf">http://www.nature.com/srep/2015/150731/srep12525/pdf/srep12525.pdf</a></div>
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhCoaKDDUcpQRj3k-5hZaAbqSGUzKMSdP0B9qXqFP1tIMsD0_vSeaxwJcI2aPQiprZuryp4A2teIlpo0sa6y-Vu2cUFD4hppUmp49ukLnKK3kCco1tvbJyawtULdbDcVCzS7MhnHGxLkpE0/s1600/Vega+et+al+-+Draft+genome+of+the+most+devastating+insect+pest+of+coffee+-+Scientific+Reports+2015_Page_01.tiff" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhCoaKDDUcpQRj3k-5hZaAbqSGUzKMSdP0B9qXqFP1tIMsD0_vSeaxwJcI2aPQiprZuryp4A2teIlpo0sa6y-Vu2cUFD4hppUmp49ukLnKK3kCco1tvbJyawtULdbDcVCzS7MhnHGxLkpE0/s200/Vega+et+al+-+Draft+genome+of+the+most+devastating+insect+pest+of+coffee+-+Scientific+Reports+2015_Page_01.tiff" width="151" /></a><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhrovyE8fuWZonlSuPkszeLHK7rqI-LmSC9BqPraqYMqGdxpg1m0JV3fbbfQ73-ivYUW7GJ963pFn30FyISZnmt-hKJPpc1o0IkhlxfKxHQkgZgEDT9cNIMBgSBw85jxHYh_-VZOSN8AHQF/s1600/Vega+et+al+-+Draft+genome+of+the+most+devastating+insect+pest+of+coffee+-+Scientific+Reports+2015_Page_04.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhrovyE8fuWZonlSuPkszeLHK7rqI-LmSC9BqPraqYMqGdxpg1m0JV3fbbfQ73-ivYUW7GJ963pFn30FyISZnmt-hKJPpc1o0IkhlxfKxHQkgZgEDT9cNIMBgSBw85jxHYh_-VZOSN8AHQF/s200/Vega+et+al+-+Draft+genome+of+the+most+devastating+insect+pest+of+coffee+-+Scientific+Reports+2015_Page_04.tiff" width="151" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEilyWuKzyBORKZ5uIcVG2x4TRewPm5-RnUF6ONtUu2-ZS9_UPORdJw_UN6I7mDuEEFGC-tzlXE_TrrwqaTijNZBTMqvRX0azUs4K0W66rxTkMiRv0KeC0Z6rRuwHhXI6Kkg5DgXM7kRMA1s/s1600/Vega+et+al+-+Draft+genome+of+the+most+devastating+insect+pest+of+coffee+-+Scientific+Reports+2015_Page_06.tiff" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEilyWuKzyBORKZ5uIcVG2x4TRewPm5-RnUF6ONtUu2-ZS9_UPORdJw_UN6I7mDuEEFGC-tzlXE_TrrwqaTijNZBTMqvRX0azUs4K0W66rxTkMiRv0KeC0Z6rRuwHhXI6Kkg5DgXM7kRMA1s/s200/Vega+et+al+-+Draft+genome+of+the+most+devastating+insect+pest+of+coffee+-+Scientific+Reports+2015_Page_06.tiff" width="151" /></a><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEicv9-dCFbTDJc_cjD51R_AmGfe2gnP_TCXy34Ng-bD9dnj0vYWawopPp8B4O8rONhHwGbFMhMXeBoIUF7V3udIUFmRJjOVdEEWS6I2_3DFXcIzGMHrrQ7z_FyHwk3qG-NXN3fSak17pUfb/s1600/Vega+et+al+-+Draft+genome+of+the+most+devastating+insect+pest+of+coffee+-+Scientific+Reports+2015_Page_13.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEicv9-dCFbTDJc_cjD51R_AmGfe2gnP_TCXy34Ng-bD9dnj0vYWawopPp8B4O8rONhHwGbFMhMXeBoIUF7V3udIUFmRJjOVdEEWS6I2_3DFXcIzGMHrrQ7z_FyHwk3qG-NXN3fSak17pUfb/s200/Vega+et+al+-+Draft+genome+of+the+most+devastating+insect+pest+of+coffee+-+Scientific+Reports+2015_Page_13.tiff" width="151" /></a></div>
<br />
<br />Anonymoushttp://www.blogger.com/profile/14602560263535951430noreply@blogger.com6tag:blogger.com,1999:blog-4457216402399127579.post-32888156228768568182015-07-29T16:49:00.002-04:002015-07-29T16:50:43.990-04:00I am writing new lectures and organizing a lot of teaching material to teach 4 (!) classes this fall at two different universities (NYU and Fordham). I would like to keep the teaching materials in a nice easily accessible online location, and easily share with my students without a lot of hassle to sign them all up or whatever. I had a fairly good experience with Google Drive for a short course this Spring, so I'm trying it out now. Here is the master link to all of my 2015 teaching material:<br />
<br />
<a href="https://drive.google.com/open?id=0BzalvBlHvt6LfldpaWxZQXVLcTZxUmpWZFdqSTBGeWl0MlJHeXBFQmhTTHBaX3JHNXowVDg">https://drive.google.com/open?id=0BzalvBlHvt6LfldpaWxZQXVLcTZxUmpWZFdqSTBGeWl0MlJHeXBFQmhTTHBaX3JHNXowVDg</a><br />
<br />
<br />
Stuff will appear, change, and possibly disappear from this location as I keep sorting and rewriting, up to and during the classes. Most of the material is my own, along with some journal articles that I provide as readings to my students, and some shameless theft of good lectures, exercises, and tutorials from other folks smarter or better at explaining stuff than I am.<br />
<br />
We are also planning to make Screencast type videos of most of the lectures, which get dumped on YouTube. I will try to find some sensible way of organizing them and sharing via this NGS blog.Anonymoushttp://www.blogger.com/profile/14602560263535951430noreply@blogger.com0tag:blogger.com,1999:blog-4457216402399127579.post-85892985298519560082015-07-16T11:36:00.002-04:002015-07-16T11:36:38.097-04:00CSHL Press has made the RNA-seq chapter of my Next-Gen Seq book available free from their website: <a href="http://email.cshlpress.org/ct/uz7839343Biz25784890" style="font-family: Arial, sans-serif; font-size: 10.5pt;" target="_blank"><span style="color: #0d55b0; text-decoration: none; text-underline: none;">RNA Sequencing
with Next-Generation Sequencing.</span></a><br />
<br />
<span style="font-family: Arial, sans-serif;"><span style="font-size: 14px;">http://www.cshlpress.org/pdf/sample/2015/nextgen2/NGS2Chap13.pdf</span></span><div>
<span style="font-family: Arial, sans-serif;"><span style="font-size: 14px;"><br /></span></span><span style="font-family: Arial, sans-serif; font-size: 10.5pt;"><br /></span>
<span style="font-family: Arial, sans-serif; font-size: 10.5pt;"><br /></span>
<span style="font-family: Arial, sans-serif; font-size: 10.5pt;"><br /></span>
<span style="font-family: "Times New Roman","serif"; font-size: 12.0pt; mso-ansi-language: EN-US; mso-bidi-language: AR-SA; mso-fareast-font-family: "Times New Roman"; mso-fareast-language: EN-US;"><span style="color: blue; text-decoration: none; text-underline: none;"><a href="http://email.cshlpress.org/ct/uz7839343Biz25784888" target="_blank"><img alt="Cold Spring Harbor Laboratory Press banner image" border="0" height="165" id="_x0000_i1025" src="http://www.cshlpress.com/email_news/nextgen2/images/NextGen2email_01.jpg" style="display: block;" width="600" /></a></span></span><br />
<span style="font-family: "Times New Roman","serif"; font-size: 12.0pt; mso-ansi-language: EN-US; mso-bidi-language: AR-SA; mso-fareast-font-family: "Times New Roman"; mso-fareast-language: EN-US;"><a href="http://email.cshlpress.org/ct/uz7839343Biz25784889" target="_blank"><span style="color: blue; text-decoration: none; text-underline: none;"><img alt="Next-Generation DNA Sequencing Informatics, Second Edition banner image" border="0" height="262" id="_x0000_i1025" src="http://www.cshlpress.com/email_news/nextgen2/images/NextGen2email_02.jpg" style="display: block;" width="600" /></span></a></span></div>
Anonymoushttp://www.blogger.com/profile/14602560263535951430noreply@blogger.com2tag:blogger.com,1999:blog-4457216402399127579.post-80773171865194008662015-05-28T16:34:00.000-04:002015-05-28T16:34:52.167-04:00New 'Next-Gen Seq 2' book is at the printerThe second edition of the <strong><u>Next-Generation Sequencing Informatics</u></strong> book (that I edit) is at the printer and available for pre-order at<a href="http://www.cshlpress.com/default.tpl?action=full&cart=141113064514747758&--eqskudatarq=1041" target="_blank"> Cold Spring Harbor Press</a> and <a href="http://www.amazon.com/Next-Generation-DNA-Sequencing-Informatics-Second/dp/1621821234/ref=la_B001H6NZLC_1_1?s=books&ie=UTF8&qid=1432132601&sr=1-1" target="_blank">Amazon</a>. We think it will ship on June 30th, maybe a bit sooner. <br />
<br />
[James Hadfield at CoreGenomics blog has posted a review: <a href="http://core-genomics.blogspot.co.uk/2015/05/book-review-next-generation-dna.html" style="font-family: Calibri, sans-serif; font-size: 10.5pt;">http://core-genomics.blogspot.co.uk/2015/05/book-review-next-generation-dna.html</a> ]<br />
<br />
We have added new chapters on the latest sequencing technology, QC, de novo transcript assembly, and proteogenomics, along with lots of updates and expansion in areas such as RNA-seq and ChIP-seq. It has a beautiful cover and it's not too expensive. <br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjWqaAWaim_FvrjViMROryKPMX62rpQO1vIJaMuF9uACWCMNBZ4eOIWa1tnxTqY0FCJ4GDifcUVt94S0Kg9rSqTZlCWf91Yz_LgLdbbh243ssrPAj212iyFhnnNlJL4hMSzRDPdE07Jrbz7/s1600/NextGenDNA2_f.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjWqaAWaim_FvrjViMROryKPMX62rpQO1vIJaMuF9uACWCMNBZ4eOIWa1tnxTqY0FCJ4GDifcUVt94S0Kg9rSqTZlCWf91Yz_LgLdbbh243ssrPAj212iyFhnnNlJL4hMSzRDPdE07Jrbz7/s320/NextGenDNA2_f.jpg" width="215" /></a></div>
<br />
<br />
Here is the official publication blurb:<br />
<br />
<span style="color: purple;"><span style="color: black;">Next-generation DNA sequencing (NGS) technology has revolutionized biomedical research, making genome and RNA sequencing an affordable and frequently used tool for a wide variety of research applications including variant (mutation) discovery, gene expression, transcription factor analysis, metagenomics, and epigenetics. Bioinformatics methods to support DNA sequencing have become and remain a critical bottleneck for many researchers and organizations wishing to make use of NGS technology.</span> <em><span style="color: black;">Next-Generation DNA Sequencing Bioinformatics, Second edition,</span></em><span style="color: black;"> provides thorough, plain language introduction to the necessary informatics methods and tools for analyzing NGS data as did the first edition, and provides detailed descriptions of algorithms, strengths and weaknesses of specific tools, pitfalls and alternative methods. Four new chapters in this edition cover: experimental design, sample preparation, and quality assessment of NGS data; Public databases for DNA Sequencing data; De novo transcript assembly; proteogenomics; and emerging sequencing technologies. The remaining chapters from the first edition have been updated with the latest information. This book also provides extensive reference to best-practice bioinformatics methods for NGS applications and tutorials for common workflows. The second edition of</span> <em><span style="color: black;">Next-Generation DNA Sequencing Bioinformatics</span></em><span style="color: black;"> addresses the informatics needs of students, laboratory scientists, and computing specialists who wish to take advantage of the explosion of research opportunities offered by new DNA sequencing technologies.</span></span><br />
<br />
<br />
and the Table of Contents:<br />
<br />
<dt>1) Introduction to DNA Sequencing</dt>
<dd><em>Stuart M. Brown</em></dd>
<dt>2) Quality Control and Data Processing</dt>
<dd><em>Stuart M. Brown</em></dd>
<dt>3) History of Sequencing Informatics</dt>
<dd><em>Stuart M. Brown</em></dd>
<dt>4) Public Sequence Databases</dt>
<dd><em>Stuart M. Brown</em></dd>
<dt>5) Visualization of Next-Generation Sequencing Data</dt>
<dd><em>Philip Ross Smith, Kranti Konganti, and Stuart M. Brown</em></dd>
<dt>6) DNA Sequence Alignment</dt>
<dd><em>Efstratios Efstathiadis</em></dd>
<dt>7) Genome Assembly Using Generalized de Bruijn Digraphs</dt>
<dd><em>D. Frank Hsu</em></dd>
<dt>8) De Novo Assembly of Bacterial Genomes from Short Sequence Reads</dt>
<dd><em>Silvia Argimón and Stuart M. Brown</em></dd>
<dt>9) De Novo Transcriptome Assembly</dt>
<dd><em>Lisa Cohen, Steven Shen, and Efstratios Efstathiadis</em></dd>
<dt>10) Genome Annotation</dt>
<dd><em>Steven Shen and Stuart M. Brown</em></dd>
<dt>11) Using NGS to Detect Genome Sequence Variants</dt>
<dd><em>Jinhua Wang</em></dd>
<dt>12) ChIP-seq</dt>
<dd><em>Stuart M. Brown, Zuojian Tang, Christina Schweikert, and D. Frank Hsu</em></dd>
<dt>13) RNA-seq with Next-Generation Sequencing</dt>
<dd><em>Stuart M. Brown and Jeremy Goecks</em></dd>
<dt>14) Metagenomics</dt>
<dd><em>Guillermo I. Perez-Perez, Miroslav Blumenberg, and Alexander V. Alekseyenko</em></dd>
<dt>15) Proteogenomics</dt>
<dd><em>Kelly V. Ruggles and David Fenyö</em></dd>
<dt>16) DNA Sequencing Technologies and Applications</dt>
<dd><em>Gerald A. Higgins and Brian D. Athey</em></dd>
<dt>17) Cloud-based Next-Generation Sequencing Informatics</dt>
<dd><em>Konstantinos Krampis, Efstratios Efstathiadis, and Stuart M. Brown</em></dd>Anonymoushttp://www.blogger.com/profile/14602560263535951430noreply@blogger.com0tag:blogger.com,1999:blog-4457216402399127579.post-51584528744102966972015-02-09T10:03:00.002-05:002015-02-09T10:03:48.171-05:00Password hellThis is not a Bioinformatics post, just an amusing technology catch-22 that I encountered this morning. At NYU we have automatic mandatory password updates for our accounts with IT. This includes email, login to my Windows desktop computer, and wireless devices on the secure WiFi network in our building. Since I am lazy about these things, I did not heed the warnings and follow the instructions in the "Password Update" email from our IT Department. Instead, at home on Sunday night, I got a message when I tried to log in to my email account saying that I should update my password, and a helpful little box appeared where I could type the old password and a new password, hit submit, and it's all good.<br />
<br />
I made a new password and checked my mail, but after about 5 min I got knocked off the network and couldn't log back in. It's late, so I figure I'll deal with it at the office in the morning. At my desk, I can't log into my computer (it uses the same network "kerberos" password), and my phone complains that it can't get on the local wireless network. I try the new password, then the old password, and eventually get the helpful message that my account has been locked by the IT Dept, and I must call the helpdesk. It's 9 AM on Monday and the helpdesk picks up right away. Help Guy asks if I have any wireless devices that may be using the old password. I look at the offending iPhone, and shut off WiFi. Helpdesk says: "I still see wireless activity hitting your account with an invalid password." Back to my desk, where my desktop Mac is using WiFi and getting unhappy messages from the network. Shut down WiFi. Helpdesk still sees activity on my account. Think, think?? Into the drawer where I have a laptop that we use for teaching and public seminars. It is asleep, but somehow still hitting the wireless network with my old password. Turn off WiFi on that one, and finally the helpful helpdesk guy can unlock my account. Then I can go back to each device and rejoin the network with the new password. I guess I'm not the first idiot this has happened to. Moral of the story??? Follow instructions very carefully or your helpful technology tools will gang up against you.<br />
<br />
Happy Ice Storm Day from New York<br />
-StuartAnonymoushttp://www.blogger.com/profile/14602560263535951430noreply@blogger.com2tag:blogger.com,1999:blog-4457216402399127579.post-22253672223061799322014-09-10T15:14:00.000-04:002014-09-10T15:14:36.419-04:00Introduction to Biostatistics and Bioinformatics course at NYUMCWe are giving a new course in the PhD program at NYU Med School (Sackler Institute) this semester called "Introduction to Biostatistics and Bioinformatics". It will have a mixture of lectures on Bioinformatics, Biostatistics, and Python programming. Hopefully we will be able to show the students the intersection of these topics as something like "Data Science for Biology". Lecturers will be myself, David Fenyo, Judy Zhong, and Pamela Wu.<br />
<br />
<br />
<div style="background-color: white; color: #333333; font-family: Verdana; font-size: 14px; line-height: 18px;">
<b>Course Overview</b></div>
<div style="background-color: white; color: #333333; font-family: Verdana; font-size: 14px; line-height: 18px;">
The goal for the Introduction to Biostatistics and Bioinformatics course is to provide an introduction to statistics and informatics methods for the analysis of data generated in biomedical research. Practical examples covering both small-scale lab experiments and high-throughput assays will be explored. The course covers a wide range of topics in a short time, so the focus will be on the basic concepts, and in the practical programming exercises the students explore these basic concepts and common pitfalls. An introduction to basic Python and R programming will be given throughout the course and many exercises will involve programming.</div>
<div style="background-color: white; color: #333333; font-family: Verdana; font-size: 14px; line-height: 18px;">
<br /></div>
<br />
The lectures will be posted to YouTube each week. Here are our first ones from yesterday:<br />
Intro Lecture/Data Visualization: <a href="http://youtu.be/YDUPzq7i49U">http://youtu.be/YDUPzq7i49U</a><br />
Python programming #1: <a href="http://youtu.be/r2N-thn7j4o">http://youtu.be/r2N-thn7j4o</a><br />
<br />
<br />
The course curriculum and links to lecture slides (PPT), readings, and various handouts and exercises are here: <a href="http://fenyolab.org/ibb2014">http://fenyolab.org/ibb2014</a><br />
<br />Anonymoushttp://www.blogger.com/profile/14602560263535951430noreply@blogger.com2tag:blogger.com,1999:blog-4457216402399127579.post-25176596310351999522014-08-20T12:34:00.001-04:002014-08-20T12:34:25.729-04:00Cheap genome projects<div class="MsoNormal">
I have been helping out several groups who want to do cheap, quick genome
projects on previously "unsequenced" eukaryotic organisms. In terms of the
genetic diversity of living things, we have so far sampled very unevenly
across taxonomic domains. Insects are particularly underrepresented in the
whole genome database. By number of named species, insects
represent over 80% of animal species. The NCBI has over 10,000 whole genome
projects, but only 132 are insects (and 36 of those are Drosophila species). So
there is certainly taxonomic room to knock out some more insect genomes and
discover interesting new stuff. A wide variety of valid arguments can be made
for the usefulness of sequencing any number of as yet overlooked organisms in
all taxonomic domains.</div>
<div class="MsoNormal">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgf10E8l5y1EjnU8nj0Lio7G1-hsFhn_dgEfIguJii5H1VhJvhIbPnfIijjiHAWw5ZAnR6fhNwgGw6hZiX2CgEJGI0jBrBxdQi9Bv2KdLO9r5J8MD_sEA12LzAaSqs7OyK8rmAo4YVu_frx/s1600/txaon-pie-chart.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgf10E8l5y1EjnU8nj0Lio7G1-hsFhn_dgEfIguJii5H1VhJvhIbPnfIijjiHAWw5ZAnR6fhNwgGw6hZiX2CgEJGI0jBrBxdQi9Bv2KdLO9r5J8MD_sEA12LzAaSqs7OyK8rmAo4YVu_frx/s1600/txaon-pie-chart.jpg" height="640" width="497" /></a></div>
<div class="MsoNormal">
There are a lot of very useful experimental approaches that
require a draft genome as a reference: gene
expression, transcription factors and epigenetics (ChIP-seq), and just the
basic evolutionary biology of important genes that are present or absent in the
genome, novel paralogs in important gene families, etc. A few years ago, building a draft genome for a
new organism was a major undertaking that required substantial funding and a
dedicated research team. Today, the sequencing can be done fairly cheaply, but
the bioinformatics work is extremely open ended. Clearly you can work and work
and work to make a very well defined and annotated genome with maximum value
for all possible users. But what is the most useful set of genomic
information that we can produce with a few person-weeks of time, one that might serve as preliminary data
to bring in showers of additional funding for follow-up studies?</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<b>NOTE ONE</b>: Collect
both genomic and RNA sequence data. These two data types are extremely
complementary. Many earlier <i>de novo</i>
genome projects built on collections of existing EST sequence data, which was
the poor man's approach to draft genomes in the previous decade. The ESTs
provided seeds for gene finding on the genomic DNA, training of gene-finding
algorithms, etc. Now we can get a comprehensive genome wide set of RNA-seq data
for the cost of one lane on a HiSeq machine and a sample prep kit. If you have
the choice, get 100 bp paired-end sequencing of the RNA; it will map better and
end up giving more value for the dollar. It might be possible to use paired-end RNA to bridge DNA contigs into scaffolds, but as far as I know this is an untested area. It would be very helpful to have a
normalized RNA library, to get more coverage of poorly expressed genes; this
is on my wish list for future Cheap-O Genome projects.</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<b>NOTE TWO</b>: More
data is good. High total genome coverage is good, but long insert paired-end
DNA libraries build better genome contigs. This makes complete sense. Early
"shotgun" genome projects relied on sequencing the ends of clones
from libraries with various size inserts. It would be really nice to have 10 or
20 KB insert libraries, or "mate pair" sequences that come from the
junctions of large genomic fragments
that have been circularized, but these are generally not available when your
entire sequencing budget is in the single digit thousands of dollars. We were
able to get a 550 bp insert library for a recent project and it led to an
assembly with an N50 > 40 KB. Pretty good for two lanes of HiSeq data. </div>
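<div class="MsoNormal">
For readers who have not met the N50 statistic quoted above, here is a minimal Python sketch of how it is usually computed from a list of contig lengths. The function name and the toy lengths are made up for illustration; they are not from our assembly.</div>

```python
def n50(lengths):
    """Return the N50: the contig length L such that contigs of
    length >= L together hold at least half of all assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # an empty assembly has no N50


# Toy contig lengths in bp (illustrative numbers only)
toy_lengths = [80000, 50000, 40000, 20000, 10000]
print("N50 =", n50(toy_lengths))
```

<div class="MsoNormal">
A higher N50 means that more of the assembly sits in long contigs, which is why longer insert libraries tend to push the number up.</div>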
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
For our next cheap genome, we are using the Illumina TruSeq
Synthetic Long-Read kit (which is based on technology developed by Moleculo).
This is a really clever idea: first it breaks the genome into ~10
KB fragments and sorts the fragments into wells of a 384 well plate. Just a few dozen to
a few hundred fragments in each well. The fragments in each well are clonally
amplified (sort of like 454 technology), then sheared into the normal size
range for Illumina sequencing (300-500 bp) and tagged with barcode primers at
the ends. Then all the tagged fragments are
pooled and sequenced normally on a HiSeq machine. Illumina has a custom
assembly app (built in BaseSpace) that demultiplexes the data and does a separate de novo assembly on each
barcode set, so it is just assembling the small number of 10 KB fragments from
one well. The final output is a set of "synthetic long reads" that
really do seem to be 10 KB long.</div>
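<div class="MsoNormal">
The key trick is the binning step: reads are grouped by their well barcode before any assembly happens. The toy Python sketch below shows only that grouping idea. It is not Illumina's BaseSpace app (which I have not seen inside), and the barcode names and sequences are invented.</div>

```python
from collections import defaultdict


def bin_reads_by_well(reads):
    """Group (barcode, sequence) pairs so that each well's reads
    can be handed to an assembler on their own."""
    wells = defaultdict(list)
    for barcode, sequence in reads:
        wells[barcode].append(sequence)
    return dict(wells)


# Invented example reads: (well barcode, read sequence)
toy_reads = [
    ("WELL_001", "ACGTACGTAC"),
    ("WELL_002", "TTGACCGATA"),
    ("WELL_001", "CGTACGGATC"),
]

for barcode, sequences in sorted(bin_reads_by_well(toy_reads).items()):
    print(barcode, len(sequences), "reads to assemble separately")
```

<div class="MsoNormal">
Because each bin holds reads from only a few hundred 10 KB fragments, each small assembly has far fewer repeats to confuse it than the whole genome does.</div>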
<div class="MsoNormal">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcqx9gXraFSZetT7F3pGGBW4C3847TkvZNCo7iY8hPF-sBdcsPMb69XYhsLfQ2UlicuGreCNSnT1G3BIl2qmL-ATaOklwA5J2ngkxricZxtqVPHw-i5rPUjvQzxttiZ3raFSm6VZaLpk-j/s1600/long-reads.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcqx9gXraFSZetT7F3pGGBW4C3847TkvZNCo7iY8hPF-sBdcsPMb69XYhsLfQ2UlicuGreCNSnT1G3BIl2qmL-ATaOklwA5J2ngkxricZxtqVPHw-i5rPUjvQzxttiZ3raFSm6VZaLpk-j/s1600/long-reads.png" height="245" width="400" /></a></div>
<div class="MsoNormal" style="text-align: center;">
(From Illumina product literature)</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<b>NOTE THREE</b>: I
like the <b>SOAPdenovo </b>assembler
(127-kmer) for Illumina DNA sequence data. It did a good job for us on several
different species with only a moderate consumption of computing resources (an
overnight job on 32 processors with shared 128 GB of RAM). The final product is
a set of contigs in FASTA format, some quite big, and a lot of little ones.
Hopefully the sum of the contigs comes out to something similar to the expected
genome size of the organism. The fairly new <b>SOAPdenovo-Trans</b> assembler for
RNA-seq also worked well for us, at least in comparison to Trinity, which
is a huge computer hog.</div>
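<div class="MsoNormal">
A quick sanity check on the contig FASTA file is to add up the contig lengths and compare the total to the expected genome size. Here is a small Python sketch of that check. The function names, the in-memory toy FASTA, and the expected size of 16 bp are all made up for illustration; in practice you would open the real contig file and plug in your organism's estimated genome size.</div>

```python
import io


def contig_lengths(handle):
    """Return the length of every sequence in a FASTA file handle."""
    lengths = []
    current = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def assembly_summary(lengths, expected_genome_size):
    """Summarize an assembly against an expected genome size (in bp)."""
    total = sum(lengths)
    return {
        "contigs": len(lengths),
        "total_bp": total,
        "longest_bp": max(lengths, default=0),
        "fraction_of_expected": total / expected_genome_size,
    }


# Toy FASTA held in memory; a real run would use open("contigs.fa")
toy_fasta = io.StringIO(">contig1\nACGT\nAC\n>contig2\nGG\n")
print(assembly_summary(contig_lengths(toy_fasta), expected_genome_size=16))
```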
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
</div>
<div class="MsoNormal">
I visualize the bioinformatics work in two parts. <b>First</b>, find the genes in our data. <b>Second</b>, annotate the found genes and
the genome using reference data. [Annotation will be described in another blog post.]</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
OK, so here is my gene finding workflow for the Cheap-O
Genome Project.</div>
<div class="MsoNormal">
<b><u>Gene Finding
Workflow</u></b></div>
<div class="MsoNormal">
<span style="font-size: 7pt; text-indent: -0.25in;"><br /></span></div>
<div class="MsoNormal">
</div>
<ol>
<li><span style="font-size: 7pt; text-indent: -0.25in;"> </span><span style="text-indent: -0.25in;">Assemble DNA reads into genome contigs with
SOAPdenovo assembler (127-kmer)<br /></span></li>
<li><span style="font-size: 7pt; text-indent: -0.25in;"> </span><span style="text-indent: -0.25in;">De novo gene finding on the DNA contigs with
GENSCAN or GeneMark (I used GeneMark).<br /></span></li>
<li><span style="text-indent: -0.25in;">Assemble RNA-seq reads into
"transcripts" with SOAPdenovo-Trans<br /></span></li>
<li><span style="text-indent: -0.25in;">Map RNA-seq reads onto the DNA contigs with
TopHat<br /></span></li>
<li><span style="font-size: 7pt; text-indent: -0.25in;"> </span><span style="text-indent: -0.25in;">Make another set of transcripts with Cufflinks
(using no annotation file)<br /></span></li>
<li><span style="text-indent: -0.25in;">Use BLAT to map the de novo assembled
transcripts onto the DNA contigs<br /></span></li>
<li><span style="text-indent: -0.25in;">Use the extremely useful </span><b style="text-indent: -0.25in;"><span style="color: red; font-family: 'Arial','sans-serif';">psl_to_bed_best_score.pl</span></b><span style="text-indent: -0.25in;"> script written by Dave Tang (</span><a href="https://gist.github.com/davetang/7314846" style="text-indent: -0.25in;">https://gist.github.com/davetang/7314846</a><span style="text-indent: -0.25in;">)
to convert the output of BLAT (in .psl
format) into a .bed file, choosing only the best match for each query. Without
this sorting and conversion, the BLAT results as a PSL file look like garbage
in a genome browser.</span></li>
</ol>
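<div class="MsoNormal">
To show what the last step of the workflow is doing, here is a simplified Python sketch of the same idea: keep only the best-scoring BLAT hit per query and write it as a BED12 line. This is not Dave Tang's Perl script. The scoring formula (matches plus half the repeat matches, minus mismatches and insert counts) is my assumption, modeled on the usual PSL score for nucleotide alignments, and the function names and toy alignments are invented.</div>

```python
import io


def psl_score(f):
    """Simplified PSL score (an assumption, see the text above)."""
    matches, mismatches, rep_matches = int(f[0]), int(f[1]), int(f[2])
    q_inserts, t_inserts = int(f[4]), int(f[6])
    return matches + rep_matches // 2 - mismatches - q_inserts - t_inserts


def best_hits_as_bed(psl_handle):
    """Return one BED12 line per query, using its best-scoring PSL hit."""
    best = {}
    for line in psl_handle:
        f = line.rstrip("\n").split("\t")
        if len(f) != 21 or not f[0].isdigit():
            continue  # skip PSL header lines and anything malformed
        score = psl_score(f)
        query = f[9]
        if query not in best or score > best[query][0]:
            best[query] = (score, f)
    bed_lines = []
    for query, (score, f) in best.items():
        t_start = int(f[15])
        block_sizes = f[18].rstrip(",")
        # BED block starts are relative to the feature start
        block_starts = ",".join(
            str(int(s) - t_start) for s in f[20].rstrip(",").split(",")
        )
        bed_score = str(max(0, min(1000, score)))  # BED scores run 0-1000
        bed_lines.append("\t".join([
            f[13], f[15], f[16], query, bed_score, f[8],
            f[15], f[16], "0", f[17], block_sizes, block_starts,
        ]))
    return bed_lines


# Two invented hits for the same transcript; only the better one survives
hit_a = ["100", "0", "0", "0", "0", "0", "1", "50", "+", "tx1", "100", "0",
         "100", "contigA", "1000", "200", "350", "2", "60,40,", "0,60,",
         "200,310,"]
hit_b = ["50", "0", "0", "0", "0", "0", "0", "0", "-", "tx1", "100", "0",
         "50", "contigB", "500", "10", "60", "1", "50,", "50,", "10,"]
toy_psl = io.StringIO("\t".join(hit_a) + "\n" + "\t".join(hit_b) + "\n")
for bed_line in best_hits_as_bed(toy_psl):
    print(bed_line)
```

<div class="MsoNormal">
The BED12 block columns are what let a genome browser draw each transcript as exons joined by introns instead of one solid bar.</div>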
<br />
<div class="MsoNormal">
<span style="text-indent: -0.25in;"><br /></span></div>
<div class="MsoNormal">
OK, now assemble all 5 data sets into one nice visualization
using IGV or GBrowse. We have a genome track (the DNA contigs from SOAPdenovo
in FASTA format), an RNA track (the RNA-seq reads aligned to the draft genome
in BAM format), a gene prediction track (GeneMark GTF file), the Cufflinks
transcripts (transcripts.gtf), and the RNA assemblies (from SOAPdenovo-Trans)
as a BED file. For some genes, all of
the data agree quite nicely. For other
genes, it's your best guess.</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Here are two IGV screenshots of examples from the same genome
contig. The first is a nice gene with plenty of RNA where
all 3 annotation methods agree.</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgcEbeOqXgSOP7Aho4-uQWXLGgL7w9y5DmWczdhYWAvuTRdZSYnvUz1D9wTvjXKYpHiUa4rOt91Zl7jNe4P81HWe_pQvQKR6PMWODMidw-JUsG3zD2i6i42r1OMC1aMsL2xyH9JR-L6zWfx/s1600/otopetrin.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgcEbeOqXgSOP7Aho4-uQWXLGgL7w9y5DmWczdhYWAvuTRdZSYnvUz1D9wTvjXKYpHiUa4rOt91Zl7jNe4P81HWe_pQvQKR6PMWODMidw-JUsG3zD2i6i42r1OMC1aMsL2xyH9JR-L6zWfx/s1600/otopetrin.tiff" height="248" width="640" /></a></div>
<div class="MsoNormal">
<br /></div>
<br /><div class="MsoNormal">
The second is a messy region where no gene model makes much
sense and none of the methods agree at all, but there seems to be enough RNA (and
spliced alignments!) to suggest real transcription is happening. Time to add some reference data by homology modeling (in my next post on annotation).</div>
<div class="MsoNormal">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghliKavQG0FwX8-PnfS1fiS33ZoyWiXubXqG90rc3ZtwRJXaIOTu7sZk3qwPwRbKop4YXVdhyphenhyphenjMG9rjfqU7arahP81afZ3Ad7tpzkkCaeOCIlZ1YfG4CgQMMXWi_4z0nqpvl9455eLsCmI/s1600/messy-scaf12.tiff" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghliKavQG0FwX8-PnfS1fiS33ZoyWiXubXqG90rc3ZtwRJXaIOTu7sZk3qwPwRbKop4YXVdhyphenhyphenjMG9rjfqU7arahP81afZ3Ad7tpzkkCaeOCIlZ1YfG4CgQMMXWi_4z0nqpvl9455eLsCmI/s1600/messy-scaf12.tiff" height="172" width="640" /></a></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
</div>
Anonymoushttp://www.blogger.com/profile/14602560263535951430noreply@blogger.com2tag:blogger.com,1999:blog-4457216402399127579.post-70045174262533976582013-11-16T17:56:00.000-05:002013-11-16T17:56:11.424-05:00Dr. Evan Eichler speaks about genomic structural variation
<br />
<div class="MsoNormal">
<span style="mso-ansi-language: DE;">Yesterday I attended an
excellent symposium on genomic structural variation organized by the <a href="http://www.eventbrite.com/e/structural-variant-detection-tickets-7887147671?aff=eorg">Simons
Foundation</a>. The unifying theme from all of the speakers was the use of
Pacific Biosciences long read technology to resolve large-scale duplicated
sequences in the human genome. These long PacBio reads (5-10 kb) can be
assembled across genetic regions with complex patterns of repeat structures,
segmental duplications, inversions and deletions.</span></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span style="mso-ansi-language: DE;">For me, the highlight of
the afternoon was a talk by <a href="http://eichlerlab.gs.washington.edu/evan.html">Evan Eichler</a> from the
University of Washington. Dr.
Eichler presented both detailed sequencing data from specific loci and a grand
overview of structural variation that synthesizes copy number variation,
multi-gene families, the biology of autism and human evolution. His first point
was that the reference genome is missing substantial sections of duplicated
DNA, which has significant variation from person to person. Assembly software
will tend to collapse multiple, nearly identical paralogous gene copies into one
locus. Dr. Eichler’s group has constructed more accurate sequences for regions
with these complex patterns of segmental duplication using long PacBio reads.
He has identified paralogous copies of genes, which actually exist as multi-gene families, and then created
specific tags to track the copy number of various gene isoforms in different
human genomes (such as from the 1000 genomes project). For example the <a href="http://www.nature.com/nmeth/journal/v10/n9/extref/nmeth.2572-S1.pdf">SRGAP2
locus has 4 isoforms</a>, each of which may be repeated several times in the
genomes of some people.</span></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span style="mso-ansi-language: DE;">Second, he explained that
these <a href="http://www.ncbi.nlm.nih.gov/pubmed/23375656?ordinalpos=1&itool=EntrezSystem2.PEntrez.Pubmed.Pubmed_ResultsPanel.Pubmed_RVDocSum">regions
of frequent copy number variation are often the site of deletions in the genomes
of people with autism</a>.<span style="mso-spacerun: yes;"> </span>These
deletions and duplications may be quite large and typically include dozens of other
genes besides the family of paralogs. In fact, the genome has hotspots of CNVs
that are flanked by high-identity duplicated regions. In addition, some people
may have additional duplications at hotspots, which create a predisposition for
deletion or expansion events in their progeny. <o:p></o:p></span></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span style="mso-ansi-language: DE;">Why do these deletions
and duplications cause autism? Dr. Eichler suggested that brain development is a
process that involves many genes, and it is particularly sensitive to gene
dosage. <o:p></o:p></span></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span style="mso-ansi-language: DE;">Dr. Eichler proposed a
link to human evolution that is quite tantalizing. Many of the families of
duplicated genes at the CNV hotspots are involved in brain development. These
same genes are not duplicated in apes. A process of gene duplication and
sequence variation allows for positive selection for new brain development
phenotypes.<span style="mso-spacerun: yes;"> </span>So the gene
duplication process which created expanded and more complex human brains may
also make us susceptible to neurologically damaging CNV mutations. <o:p></o:p></span></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span style="mso-ansi-language: DE;"><a href="http://www.ncbi.nlm.nih.gov/pubmed/23892896?ordinalpos=1&itool=EntrezSystem2.PEntrez.Pubmed.Pubmed_ResultsPanel.Pubmed_RVDocSum">http://www.ncbi.nlm.nih.gov/pubmed/23892896?ordinalpos=1&itool=EntrezSystem2.PEntrez.Pubmed.Pubmed_ResultsPanel.Pubmed_RVDocSum</a><o:p></o:p></span></div>
<!--EndFragment-->Anonymoushttp://www.blogger.com/profile/14602560263535951430noreply@blogger.com2tag:blogger.com,1999:blog-4457216402399127579.post-50959348267287281352013-10-26T09:31:00.000-04:002013-10-26T09:31:10.141-04:00A bit more on genome annotationOne bit of follow-up on the pig genome annotation story. We usually visualize RNA-seq results in the IGV browser. It allows direct inspection of read alignments to your favorite genes and can also be helpful to spot sequence variations and splicing issues. However, IGV has a set of pre-loaded default genomes that also seem to be derived from RefSeq. So once again, working with data from the pig, there was no annotation for most of our genes of interest. This is fairly annoying since it means that the only way to look at the annotation of a gene is to first look up the gene in UCSC and then copy the exact chromosome coordinates to IGV, including intron-exon borders.<br />
<br />
It is possible to fix this by downloading the ENSEMBL gene annotations from the UCSC Table Browser to the local computer as a BED file (not too large), and then loading the BED file into IGV as another data track. This works nicely in terms of showing the genes and exons, but the gene labels still carry the ugly ENSEMBL names. Once again, the ensemblToGeneName track comes in handy, providing a table with the ENSEMBL name and the official gene symbol for about 20,000 genes. We were able to add the gene symbol to the BED file, but this has to be done carefully (in Perl or Awk) since making file edits in Excel seems to break the BED file (at least for me). Loading the edited BED file into IGV, I was then able to jump to genes by name and get screen shots of interesting regions that included a gene structure track with nice gene names.Anonymoushttp://www.blogger.com/profile/14602560263535951430noreply@blogger.com2
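<br />
<br />
The relabeling step described in the post above (swapping ENSEMBL names in the BED file for gene symbols) could be sketched roughly as follows. This is a minimal illustration in Python rather than the Perl or Awk actually used, and it rests on assumptions not stated in the post: that the ensemblToGeneName table was exported as a two-column, tab-separated file (ENSEMBL name, gene symbol), and that the BED file carries the ENSEMBL name in its fourth column. The function names are hypothetical.<br />

```python
import csv


def load_symbol_map(table_path):
    """Read a two-column, tab-separated table: ENSEMBL name, gene symbol.

    Assumed layout of an ensemblToGeneName export; lines starting with '#'
    (e.g. a Table Browser header) and short rows are skipped.
    """
    mapping = {}
    with open(table_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2 or row[0].startswith("#"):
                continue
            mapping[row[0]] = row[1]
    return mapping


def relabel_bed_line(line, mapping):
    """Swap the BED name field (column 4) for a gene symbol when one is known.

    Header lines (track/browser/comment) and blank lines pass through as-is.
    All other columns are left untouched, so tabs and coordinates survive.
    """
    stripped = line.rstrip("\n")
    if not stripped or stripped.startswith(("#", "track", "browser")):
        return line
    fields = stripped.split("\t")
    if len(fields) >= 4:
        # Keep the ENSEMBL name when the table has no symbol for it.
        fields[3] = mapping.get(fields[3], fields[3])
    return "\t".join(fields) + "\n"


def relabel_bed(bed_in, bed_out, mapping):
    """Stream a BED file line by line, writing the relabeled copy to bed_out."""
    with open(bed_in) as src, open(bed_out, "w") as dst:
        for line in src:
            dst.write(relabel_bed_line(line, mapping))
```

Calling load_symbol_map on the exported table and then relabel_bed on the downloaded BED file would produce an edited copy to load into IGV. Working on plain tab-separated text and touching only the name column avoids the kind of silent reformatting that spreadsheet edits can introduce.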