Next-Gen Sequencing: 2015

Oct 23, 2015

Masters in Biomedical Informatics at NYU School of Medicine

We are starting a new Masters program in Biomedical Informatics at NYU School of Medicine in 2016. We currently have about a dozen PhD students, but the Masters program is intended to serve a wider group with more diverse backgrounds.

Sep 4, 2015

Research Adventure with ENCODE Data

At NYU, first-year PhD students in the Sackler Institute start their first semester with a week-long full-time "Research Adventure" workshop. I was asked (at short notice) to mentor a group of students for something in Bioinformatics. Since I had recently attended the 2015 ENCODE Users Meeting, I decided to make the workshop all about working with ENCODE data.

I included tutorials about access to ENCODE data, an Intro to Linux for complete computing novices (quite a few of our students), Genomic Intervals in the UCSC Genome Browser, use of BEDTools to compare genomic intervals for various factors, and a tutorial in R for data display. Later in the week we looked at gene expression with RNA-seq using TopHat and Cufflinks. The general plan for the 5-day workshop (for 6 students) was as follows:

Monday

9-11:00 am Lecture (2 hr): Introduction to Gene Regulation and Epigenetics

11:00 am-12:00 pm Lecture (1 hr): Use of the High Performance Computing Cluster

12:00-2 pm Working Lunch with HPC System Manager (2 hr): Set up HPC account for each student, practice Linux commands, move files from laptop to HPC account

2-4 pm Exercise 1: Tutorials for Accessing ENCODE data through the ENCODE Portal, UCSC Genome Browser and ENSEMBL Browser

Tuesday

9-11:00 am Lecture & Demo (2 hr): The UCSC Genome Browser, BED file format, and BEDTools software

11:00 am-12:00 pm Exercise 2: BEDTools Tutorial

12-1:00 pm Lunch

1-3:00 pm Exercise 3: Use of ENCODE Data and BEDTools to compute the Intersection of DNase hypersensitive sites with promoters of all RefSeq genes

Wednesday

9-10:30 am Lecture: Computing Gene Expression with RNA-Seq (1.5 hr)

10:30 am-12:00 pm Exercise 4: Align ENCODE RNA-seq data to hg19 reference genome with TopHat

12-1:00 pm Lunch

1-4 pm Continue work on Exercise 4

Thursday

9-10:00 am Lecture (1 hr): Intro to data visualization with R

10:00 am-12:00 pm Exercise 5: Code School "Try R" tutorial.

12-1:00 pm Lunch

1-2:00 pm Lecture (1 hr): Differential Gene Expression with Cufflinks

2-4:00 pm Planning for Research Project – choose ENCODE data for transcription factors, gene expression, and epigenetic markers. Literature search.

Friday

9:00 am-12:00 pm Work on Research Project

12-1:00 pm Lunch

1-4:00 pm Work on Data analysis and prepare presentation

I had six students in our Research team: Elaine Fisher, Reuben Moncada, Shushan Sargsian, Beny Shapiro, Jong Shin, and Bo Xia. I have pasted images from their final presentation below (can't upload PowerPoint or PDF in this Blogger).

My overall impression of the week was that the students learned a huge amount of computing skills, but it was a bit bumpy when we got to the RNA-seq methods. They had really good success comparing various Transcription Factor binding sites to known genes (promoter region, TSS, 3'UTR, exons, introns, 5'UTR), and finding interactions between TFs by finding overlapping or nearby binding sites. We also found nice overlaps between ChIP-seq TF binding sites and DNase sensitive sites, histone modification sites, and computationally predicted TF binding sites. Also, the students did a nice job of measuring overlapping vs. nearby binding sites (bedtools slop), and measuring the significance of intersections using bedtools shuffle to create a statistical model of random intersections as a control.
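For readers who have not used these tools, here is a minimal pure-Python sketch of the idea behind the slop and shuffle steps. It is not how bedtools itself is implemented and not the students' actual code. The interval coordinates, the chromosome size, and the function names (slop, count_overlaps, shuffle, empirical_p) are all made up for illustration, and the shuffle here assumes each interval stays on its own chromosome.

```python
import random

def slop(intervals, pad, chrom_sizes):
    """Extend each (chrom, start, end) by pad on both sides, clipped to the chromosome."""
    return [(c, max(0, s - pad), min(chrom_sizes[c], e + pad)) for c, s, e in intervals]

def count_overlaps(a, b):
    """Count intervals in a that overlap at least one interval in b (half-open, BED-style)."""
    by_chrom = {}
    for c, s, e in b:
        by_chrom.setdefault(c, []).append((s, e))
    return sum(
        any(s < b_end and b_start < e for b_start, b_end in by_chrom.get(c, []))
        for c, s, e in a
    )

def shuffle(intervals, chrom_sizes, rng):
    """Move each interval to a random position on its own chromosome, keeping its length."""
    out = []
    for c, s, e in intervals:
        length = e - s
        new_start = rng.randint(0, chrom_sizes[c] - length)
        out.append((c, new_start, new_start + length))
    return out

def empirical_p(a, b, chrom_sizes, n_iter=1000, seed=0):
    """Return (observed overlap count, empirical p-value from shuffled controls)."""
    rng = random.Random(seed)
    observed = count_overlaps(a, b)
    as_extreme = sum(
        count_overlaps(shuffle(a, chrom_sizes, rng), b) >= observed
        for _ in range(n_iter)
    )
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (as_extreme + 1) / (n_iter + 1)

# Toy data (hypothetical coordinates, one small made-up chromosome).
chrom_sizes = {"chr1": 1_000_000}
tf_a = [("chr1", 1000, 1200), ("chr1", 50000, 50200), ("chr1", 400000, 400200)]
tf_b = [("chr1", 1150, 1350), ("chr1", 50300, 50500), ("chr1", 900000, 900200)]

direct = count_overlaps(tf_a, tf_b)                            # strictly overlapping sites
nearby = count_overlaps(slop(tf_a, 150, chrom_sizes), tf_b)    # sites within 150 bp
observed, p_value = empirical_p(tf_a, tf_b, chrom_sizes, n_iter=200)
print(direct, nearby, observed, p_value)
```

Padding the intervals first turns "nearby" into "overlapping", and the shuffled sets give a rough null distribution for how many overlaps to expect by chance. A real analysis would also need to think about excluding gaps and unmappable regions from the shuffle.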

FASTQ data download and alignment is slow and error-prone (we had a lot of trouble making SGE scripts that would run correctly on our compute cluster). I should have shown TopHat just as a demo and used a small local FASTQ data file as an example rather than download and re-align ENCODE data. Using Cufflinks/Cuffdiff to compare gene expression from different cell lines was feasible with real ENCODE BAM files, but we would have needed to learn this earlier in the week and spend more time creating SGE scripts that would run nicely with multithreading (to complete in a reasonable amount of time).
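To give a flavor of what such a multithreaded SGE script involves, here is a small Python sketch that writes one out for Cuffdiff. It is not the script we used. The file names, job name, and the parallel environment name ("threaded") are placeholders, and the parallel environment name in particular differs from cluster to cluster, so check with your own HPC admins.

```python
import shlex

def sge_cuffdiff_script(job_name, gtf, bam_a, bam_b, label_a, label_b,
                        out_dir, threads=8, parallel_env="threaded"):
    """Return the text of an SGE job script that runs Cuffdiff on two BAM files."""
    labels = shlex.quote(f"{label_a},{label_b}")
    command = (
        # SGE sets $NSLOTS to the number of slots granted to the job,
        # so Cuffdiff uses exactly as many threads as were requested.
        f"cuffdiff -p $NSLOTS -o {shlex.quote(out_dir)} -L {labels} "
        f"{shlex.quote(gtf)} {shlex.quote(bam_a)} {shlex.quote(bam_b)}"
    )
    lines = [
        "#!/bin/bash",
        f"#$ -N {job_name}",                 # job name
        "#$ -cwd",                           # run from the submission directory
        "#$ -j y",                           # merge stderr into stdout
        f"#$ -pe {parallel_env} {threads}",  # request slots (PE name is site-specific)
        "",
        command,
    ]
    return "\n".join(lines) + "\n"

# Hypothetical file names, for illustration only.
script = sge_cuffdiff_script(
    job_name="cuffdiff_demo",
    gtf="genes.gtf",
    bam_a="cellA.bam",
    bam_b="cellB.bam",
    label_a="cellA",
    label_b="cellB",
    out_dir="cuffdiff_out",
    threads=4,
)
print(script)
```

The printed text can be saved to a file and submitted with qsub. Tying the thread count to $NSLOTS avoids the common mismatch where the job asks the scheduler for one slot but the program tries to use many threads.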

If I did this sort of tutorial again, I would figure out a way for the students to measure differential gene expression between cell lines from pre-computed ENCODE RNA-seq quantified data (wig files).
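As a rough idea of what that could look like, here is a Python sketch that compares two pre-computed gene-level quantification tables with a simple log2 fold change. It assumes tab-delimited tables with a gene ID column and an expression column. The column names ("gene_id", "FPKM"), the gene names, and the values are assumptions for illustration, not the layout of any particular ENCODE file.

```python
import csv
import io
import math

def read_quant(handle, id_col="gene_id", value_col="FPKM"):
    """Read a tab-delimited quantification table into {gene_id: expression}."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_col]: float(row[value_col]) for row in reader}

def log2_fold_changes(quant_a, quant_b, pseudocount=1.0):
    """log2((B + pseudocount) / (A + pseudocount)) for genes present in both tables."""
    shared = quant_a.keys() & quant_b.keys()
    return {
        gene: math.log2((quant_b[gene] + pseudocount) / (quant_a[gene] + pseudocount))
        for gene in sorted(shared)
    }

# Toy tables standing in for two cell lines (made-up genes and values).
table_a = io.StringIO("gene_id\tFPKM\ngeneX\t3.0\ngeneY\t0.0\n")
table_b = io.StringIO("gene_id\tFPKM\ngeneX\t15.0\ngeneY\t0.0\ngeneZ\t7.5\n")

fold_changes = log2_fold_changes(read_quant(table_a), read_quant(table_b))
for gene, lfc in fold_changes.items():
    print(gene, round(lfc, 2))
```

The pseudocount keeps unexpressed genes from causing a division by zero. A fold change like this is only a ranking, not a statistical test, so without replicates it says nothing about significance.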

Jul 31, 2015

Coffee Berry Borer genome published

Our paper on the de novo genome sequence and annotation of the Coffee Berry Borer (a beetle) is published today in Nature Scientific Reports. This was a really fun project, where I was pushed to do a lot more in-depth study of insect biology (such as antimicrobial and cytochrome P450 proteins). We also discovered that this beetle has captured a bunch of bacterial proteins into its genome (horizontal gene transfer) - which seems odd, but was actually previously reported for this insect and many others. Interestingly, most of these captured bacterial proteins provide starch digesting enzymes, which support the beetle's lifestyle of living entirely inside of the coffee bean and eating nothing but coffee! We are of course hoping that these genes can be used as some sort of target for control of the pest, which causes something like a billion $$ of annual damage worldwide to our beloved coffee.

http://www.nature.com/srep/2015/150731/srep12525/full/srep12525.html

http://www.nature.com/srep/2015/150731/srep12525/pdf/srep12525.pdf

Jul 29, 2015

I am writing new lectures and organizing a lot of teaching material to teach 4 (!) classes this fall at two different universities (NYU and Fordham). I would like to keep the teaching materials in a nice easily accessible online location, and easily share with my students without a lot of hassle to sign them all up or whatever. I had a fairly good experience with Google Drive for a short course this Spring, so I'm trying it out now. Here is the master link to all of my 2015 teaching material:

https://drive.google.com/open?id=0BzalvBlHvt6LfldpaWxZQXVLcTZxUmpWZFdqSTBGeWl0MlJHeXBFQmhTTHBaX3JHNXowVDg

Stuff will appear, change, possibly disappear from this location as I keep sorting and rewriting, up to and during the classes. Most of the material is my own, along with some journal articles that I provide as readings to my students, and some shameless theft of good lectures, exercises, and tutorials from other folks smarter or better at explaining stuff than I am.

We are also planning to make Screencast type videos of most of the lectures, which get dumped on YouTube. I will try to find some sensible way of organizing them and sharing via this NGS blog.

Jul 16, 2015

CSHL Press has made the RNA-seq chapter of my Next-Gen Seq book available free from their website: RNA Sequencing with Next-Generation Sequencing.

http://www.cshlpress.org/pdf/sample/2015/nextgen2/NGS2Chap13.pdf


May 28, 2015

New 'Next-Gen Seq 2' book is at the printer

The second edition of the Next-Generation Sequencing Informatics book (that I edit) is at the printer and available for pre-order at Cold Spring Harbor Press and Amazon. We think it will ship on June 30th, maybe a bit sooner.

[James Hadfield at CoreGenomics blog has posted a review: http://core-genomics.blogspot.co.uk/2015/05/book-review-next-generation-dna.html ]

We have added new chapters on the latest sequencing technology, QC, de novo transcript assembly, proteogenomics and lots of updates and expansion in areas such as RNA-seq and ChIP-seq. It has a beautiful cover and it's not too expensive.

Here is the official publication blurb:

Next-generation DNA sequencing (NGS) technology has revolutionized biomedical research, making genome and RNA sequencing an affordable and frequently used tool for a wide variety of research applications including variant (mutation) discovery, gene expression, transcription factor analysis, metagenomics, and epigenetics. Bioinformatics methods to support DNA sequencing have become and remain a critical bottleneck for many researchers and organizations wishing to make use of NGS technology. Next-Generation DNA Sequencing Bioinformatics, Second edition, provides thorough, plain language introduction to the necessary informatics methods and tools for analyzing NGS data as did the first edition, and provides detailed descriptions of algorithms, strengths and weaknesses of specific tools, pitfalls and alternative methods. Four new chapters in this edition cover: experimental design, sample preparation, and quality assessment of NGS data; Public databases for DNA Sequencing data; De novo transcript assembly; proteogenomics; and emerging sequencing technologies. The remaining chapters from the first edition have been updated with the latest information. This book also provides extensive reference to best-practice bioinformatics methods for NGS applications and tutorials for common workflows. The second edition of Next-Generation DNA Sequencing Bioinformatics addresses the informatics needs of students, laboratory scientists, and computing specialists who wish to take advantage of the explosion of research opportunities offered by new DNA sequencing technologies.

and the Table of Contents:

1) Introduction to DNA Sequencing

Stuart M. Brown

2) Quality Control and Data Processing

Stuart M. Brown

3) History of Sequencing Informatics

Stuart M. Brown

4) Public Sequence Databases

Stuart M. Brown

5) Visualization of Next-Generation Sequencing Data

Philip Ross Smith, Kranti Konganti, and Stuart M. Brown

6) DNA Sequence Alignment

Efstratios Efstathiadis

7) Genome Assembly Using Generalized de Bruijn Digraphs

D. Frank Hsu

8) De Novo Assembly of Bacterial Genomes from Short Sequence Reads

Silvia Argimón and Stuart M. Brown

9) De Novo Transcriptome Assembly

Lisa Cohen, Steven Shen, and Efstratios Efstathiadis

10) Genome Annotation

Steven Shen and Stuart M. Brown

11) Using NGS to Detect Genome Sequence Variants

Jinhua Wang

12) ChIP-seq

Stuart M. Brown, Zuojian Tang, Christina Schweikert, and D. Frank Hsu

13) RNA-seq with Next-Generation Sequencing

Stuart M. Brown and Jeremy Goecks

14) Metagenomics

Guillermo I. Perez-Perez, Miroslav Blumenberg, and Alexander V. Alekseyenko

15) Proteogenomics

Kelly V. Ruggles and David Fenyö

16) DNA Sequencing Technologies and Applications

Gerald A. Higgins and Brian D. Athey

17) Cloud-based Next-Generation Sequencing Informatics

Konstantinos Krampis, Efstratios Efstathiadis, and Stuart M. Brown

Feb 9, 2015

Password hell

This is not a Bioinformatics post, just an amusing technology catch-22 that I encountered this morning. At NYU we have automatic mandatory password updates for our accounts with IT. This includes email, login to my Windows desktop computer, and wireless devices on the secure WiFi network in our building. Since I am lazy about these things, I did not heed the warnings and follow the instructions in the "Password Update" email from our IT Department. Instead, at home on Sunday night, I got a message when I tried to log in to my email account saying that I should update my password, and a helpful little box appears where it is possible to type old password and new password, hit submit and it's all good.

I made a new password, and checked my mail, but after about 5 min, I got knocked off the network and can't log back in. It's late, so I figure to deal with it at the office in the morning. At my desk, I can't log into my computer (uses the same network "kerberos" password), and my phone complains that it can't get on the local wireless network. I try new password, old password, and eventually get the helpful message that my account has been locked by the IT Dept, and I must call the helpdesk. It's 9 AM on Monday and the helpdesk picks up right away. Help Guy asks if I have any wireless devices that may be using the old password. I look at the offending iPhone, and shut off WiFi. Helpdesk says: "I still see wireless activity hitting your account with an invalid password." Back to my desk, where my desktop Mac is using WiFi and getting unhappy messages from the network. Shut down WiFi. Helpdesk still sees activity on my account. Think, think?? Into the drawer where I have a laptop that we use for teaching and public seminars. It is asleep, but somehow still hitting the wireless network with my old password. Turn off WiFi on that one, and finally the helpful helpdesk guy can unlock my account. Then I can go back to each device and rejoin the network with the new password. I guess I'm not the first idiot this has happened to. Moral of the story??? Follow instructions very carefully or your helpful technology tools will gang up against you.

Happy Ice Storm Day from New York
-Stuart


Stuart Brown
