Next-Gen Sequencing: October 2008

Oct 28, 2008

Gene-Boosted Assembly

Steven Salzberg describes a method for de novo assembly of a bacterial genome (Pseudomonas aeruginosa strain PAb1, 6.2 Mb) from a set of 33 bp Solexa reads, using two closely related strains as reference sequences, and "boosting" the assembly using predicted protein coding regions.

PLOS Computational Biology 4(9), Sept 26, 2008

Salzberg SL, Sommer DD, Puiu D, Lee VT (2008) Gene-Boosted Assembly of a Novel Bacterial Genome from Very Short Reads. PLoS Comput Biol 4(9): e1000186. doi:10.1371/journal.pcbi.1000186

The AMOS assembler used in this project employs several different software modules and a considerable amount of hands-on effort.

AMOScmp is a comparative alignment tool - it aligns short reads to a similar reference genome, and then builds contigs. This avoids the challenge of all-vs-all assembly for de novo genome sequencing projects.

Minimus is a highly stringent assembler that uses Smith-Waterman alignments to identify overlaps between reads.
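
To give a feel for the kind of local alignment score that overlap detection rests on, here is a minimal Smith-Waterman sketch. The scoring values (match +1, mismatch -1, gap -1) and the function name are my own assumptions for illustration; they are not Minimus defaults, and this is not Minimus code.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the best local alignment score between strings a and b.

    Scoring parameters are illustrative assumptions, not Minimus settings.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Two reads whose ends share a long, high-scoring local alignment are candidates for an overlap; a stringent assembler would only accept overlaps above a strict score or identity cutoff.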

Contigs were then scanned for protein coding sequences using a combination of Glimmer and BLAST. The ABBA program uses protein coding information, especially at the ends of contigs and singletons, to close gaps.

Velvet was also used to independently assemble all the reads into contigs, then MUMmer was used to combine contigs and fill gaps.

==================

This method is not going to work for every de novo sequencing problem, but we are going to try something similar for some new Plasmodium and Trichomonas species.

All software from the Salzberg lab at the Univ. of Maryland is freely available here:

http://cbcb.umd.edu/software/

and a page describing the Short Read Assembly methods here:

http://www.cbcb.umd.edu/research/SR-assembly.shtml

Oct 20, 2008

Public Chip-Seq Data

Here are some Chip-Seq data sets that have been published and are out there in the public domain.

Broad Institute

NHLBI

Jothi et al, - Site Identification from Short Sequence Reads

Barski et al - High-Resolution Profiling of Histone Methylations

Valouev et al, Sidow lab @ Stanford,

sample data to validate QuEST software

Robertson et al, 2007, Nature Methods 4(8) 651-7.

Eland processed sequence reads and FindPeaks output for Stat1 and FoxA2 transcription factors

NCBI GEO

NCBI Short Read Archive

File Formats

What is it with bioinformatics people and file formats?!

Why is it so bloody hard to produce and agree on a single standard to represent sequence data (with quality scores) and a standard for sequence reads aligned on a reference genome? With so many formats, we are all spending exponential amounts of time writing converters between all possible combinations.

Here are some of the file formats that I've dealt with in the past couple of weeks:

SEQUENCE FORMATS

FASTQ

Sequence plus Phred quality scores, each encoded as a single ASCII byte

@NCYC361-11a03.q1k bases 1 to 1576

GCGTGCCCGAAAAAATGCTTTTGGAGCCGCGCGTGAAAT

+NCYC361-11a03.q1k bases 1 to 1576

!)))))****(((***%%((((*(((+,**(((+**+,-
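
Here is a small sketch of how the quality line above can be decoded. It assumes the common Sanger-style convention of ASCII code minus 33 (the offset is my assumption; the example itself does not state it), and the function names are made up for illustration.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    offset=33 is an assumption (Sanger-style FASTQ).
    """
    return [ord(c) - offset for c in quality]


def error_probability(q):
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


quality = "!)))))****(((***%%((((*(((+,**(((+**+,-"
scores = phred_scores(quality)
# Under offset 33, '!' decodes to 0 and ')' decodes to 8.
```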

Solexa/Illumina FASTQ like thing...

s_*_sequence.txt

@HWI-EAS305_3-30gf5aaxx:8:1:415:1852
GTTAGATTTTGTGTAACTTGCATGTAATGTTAAAA
+HWI-EAS305_3-30gf5aaxx:8:1:415:1852
YYYYYYYYYYYYVYYYYYYVYYYYYYYYVYVVTUU
@HWI-EAS305_3-30gf5aaxx:8:1:187:1286
GTTACACTGAAAAACAAATTCGTTGGAAACGGGAT
+HWI-EAS305_3-30gf5aaxx:8:1:187:1286
YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYTVVVV
@HWI-EAS305_3-30gf5aaxx:8:1:202:440
GTGAAAAATGAGAAATGCACACTGAAGGACCTGGA
+HWI-EAS305_3-30gf5aaxx:8:1:202:440
YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYVVUVV
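
The quality characters in these files do not follow the same encoding as standard FASTQ. The sketch below assumes what is commonly described for Illumina pipeline output of this era: Solexa-scaled scores stored as ASCII code minus 64. Both the offset and the scale are assumptions about these particular files, so check your pipeline version before relying on it.

```python
import math


def solexa_char_to_phred(c, offset=64):
    """Convert one Solexa-encoded quality character to a Phred-scaled score.

    Assumes Solexa scaling (Q = -10 * log10(p / (1 - p))) and ASCII offset 64.
    """
    q_solexa = ord(c) - offset
    # Standard conversion from the Solexa scale to the Phred scale.
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)
```

At high quality the two scales nearly coincide; they only diverge noticeably for low scores.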

s_*_eland_extended.txt

Solexa output format from Eland extended

>HWI-EAS305_3-30gf5aaxx:8:1:63:487      GGAGGTAGAGGTATATGGCAAGAAAACTGAAAATC     NM      -
>HWI-EAS305_3-30gf5aaxx:8:1:415:1852    GTTAGATTTTGTGTAACTTGCATGTAATGTTAAAA     3:1:0   chr14.fa:35121238F35,35121282F35,35121326F32T1T,35121354F4T30
>HWI-EAS305_3-30gf5aaxx:8:1:187:1286    GTTACACTGAAAAACAAATTCGTTGGAAACGGGAT     0:4:5   chr6.fa:103599157R16C17A,chr2.fa:98502709R16C18,98502829R6A9C18,98505080F4AC29,98505200F1A14C18,98505320F16C18,98506416R16C13C2CA,98506537R16C18,chrX.fa:139917587R16C2A13CA
>HWI-EAS305_3-30gf5aaxx:8:1:202:440     GTGAAAAATGAGAAATGCACACTGAAGGACCTGGA     3:87:58 chr2.fa:98503100F33T1,98506780F35,98507265F35
>HWI-EAS305_3-30gf5aaxx:8:1:359:505     TATTCAATTTACATACTCTGGCTTTGCCAACATTT     1:0:0   chr9.fa:31339651R35
>HWI-EAS305_3-30gf5aaxx:8:1:1290:135    TTGATTGTATAGTAGGGGTGAAATGGAATTTTATC     1:0:1   chrM.fa:14790R35
>HWI-EAS305_3-30gf5aaxx:8:1:627:596     GTGATTTTGAAAGTTGTAGATTGTGTGTTTGTGAT     NM      -
>HWI-EAS305_3-30gf5aaxx:8:1:379:298     GACGTGAAATATGGCGAGGAAAACTGAAAAAGGTG     31:56:28        -
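
The last column packs several hits into one comma-separated field, and the chromosome name is only written when it changes. Here is a rough parser sketch based purely on the pattern visible in the example lines above (chromosome, position, F/R strand, then a match descriptor); the function name and regular expression are my own and are not taken from any Illumina documentation.

```python
import re

# Optional "chrom:" prefix, then position, strand letter, match descriptor.
HIT = re.compile(r"^(?:(?P<chrom>[^:]+):)?(?P<pos>\d+)(?P<strand>[FR])(?P<desc>\w*)$")


def parse_eland_hits(field):
    """Split an Eland hit field into (chrom, position, strand, descriptor) tuples."""
    hits = []
    if field == "-":  # no hits reported for this read
        return hits
    chrom = None
    for item in field.split(","):
        m = HIT.match(item)
        if m is None:
            raise ValueError("unrecognised hit: %r" % item)
        # The chromosome carries over from the previous hit when omitted.
        chrom = m.group("chrom") or chrom
        hits.append((chrom, int(m.group("pos")), m.group("strand"), m.group("desc")))
    return hits
```

The same function should also cope with the hit field of the eland_multi lines, where the descriptor is just a mismatch count.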

s_*_eland_multi.txt

Solexa output format from Eland multi

>HWI-EAS305_3-30gf5aaxx:8:1:414:208     GTAAACTATCAATAAAATAATTTGTTACTCTGTAT     20:7:0
>HWI-EAS305_3-30gf5aaxx:8:1:59:857      TAAATTGTCCACCTTTTTCAGTTTTCCTCGCTATA     0:0:35
>HWI-EAS305_3-30gf5aaxx:8:1:1414:307    GAGAAAACTGTAAATAAAGGTAAATGAGAAAAAAA     NM
>HWI-EAS305_3-30gf5aaxx:8:1:330:1758    GGTAAAGTCCACTAAGGAAAAGAAAGAAACAATGT     1:0:0   chr7.fa:97764095R0
>HWI-EAS305_3-30gf5aaxx:8:1:576:127     GAAGTCAATCTTATGAGTTATTAGGATGGCTACTC     0:7:255 chr7.fa:111867683F1,chr12.fa:51788781R1,115833262F1,chr6.fa:21403822R1,89734675R1,89780759R1,chrX.fa:15525553R1
>HWI-EAS305_3-30gf5aaxx:8:1:88:1045     GTTTCTCATTTTCCATGATTTTCAGTTTTCTTGCC     66:110:72
>HWI-EAS305_3-30gf5aaxx:8:1:939:613     TACTTTACTTTCTAGGGAATGTTCACTTCTAAGTG     1:0:0   chr1.fa:150051845R0

s_*_sorted.txt

filtered eland_extended alignments w/ quality scores and genome positions

HWI-EAS305      3-30gf5aaxx     8       66      580     1584                    AGTATGGGTATCGGTTGGTGCAGAGAACTACTGCA     YYYYYYYYYYYYYYYYYYYYYYYYVYYYYYVVUVU        chr10.fa                3001045 F       35      11
HWI-EAS305      3-30gf5aaxx     8       100     534     1062                    ATTTTCAGGTTGGAGTGACTCGCTAAAACAGCCAA     YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYTVVVV        chr10.fa                3002892 R       35      29
HWI-EAS305      3-30gf5aaxx     8       59      199     495                     CCACATGCTGTGGCAAAGCCCTTCTGAGCGGGGCG     YYYYTYYYYYYYYYYYRYYYYYYYYYYYYYTVUVV        chr10.fa                3008958 F       34A     20
HWI-EAS305      3-30gf5aaxx     8       76      779     1406                    AGATGTACAAATGCTCCTCAGATGTTTGTGTCATA     YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYVVVVV        chr10.fa                3009290 F       35      3
HWI-EAS305      3-30gf5aaxx     8       83      547     1480                    ATCCAAACAGTTACACAAAGTTTTGAGAACATTAT     YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYVVVVV

GENOME ALIGNMENT FORMATS

SGA ('Simplified' Genome Annotation)

GFF (General Feature Format)

UCSC Genome Browser

Sanger

EXAMPLE:

track name=regulatory description="TeleGene(tm) Regulatory Regions"
chr22 TeleGene enhancer 1000000 1001000 500 + . touch1
chr22 TeleGene promoter 1010000 1010100 900 + . touch1
chr22 TeleGene promoter 1020000 1020000 800 - . touch2

FPS (Functional Position Set)

Native format for Eukaryotic Promoter Database

EXAMPLE:

FP   Pv snRNA U1         :+S  EM:J03563.1          1+       352; 17001.098
FP   Ath snRNA U2.5      :+S  EM:AL353994.1        1-     73709; 24016.116
FP   Ath snRNA U5        :+S  EM:X13012.1          1+       678; 23040.
FP   Ta histone H3       :+S  EM:X00937.1          1+       186; 07001.

WIG (Wiggle)

UCSC Genome Browser track format

EXAMPLE

track type=wiggle_0 name="Bed Format" description="BED format" \
visibility=full color=200,100,0 altColor=0,100,200 priority=20
chr19 59302000 59302300 -1.0
chr19 59302300 59302600 -0.75
chr19 59302600 59302900 -0.50

BED

UCSC Genome Browser

Example:
Here's an example of an annotation track that uses a complete BED definition:

track name=pairedReads description="Clone Paired Reads" useScore=1
chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512
chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601
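
As an example of the converter-writing this forces on everyone, here is a minimal sketch for turning a GFF feature line (like the TeleGene example above) into a six-column BED line. The one real trap is the coordinate system: GFF starts are 1-based and inclusive, BED starts are 0-based with an exclusive end. Using the GFF feature type as the BED name is just my choice for illustration.

```python
def gff_to_bed(line):
    """Convert one whitespace-separated GFF feature line to a BED6 line.

    GFF: 1-based, inclusive start and end.
    BED: 0-based start, exclusive end, so only the start shifts by one.
    Assumes the GFF score is numeric (a '.' score would need special handling).
    """
    chrom, source, feature, start, end, score, strand = line.split()[:7]
    return "\t".join([chrom, str(int(start) - 1), end, feature, score, strand])
```

Track header lines ("track name=...") are not features and should be passed through or rewritten separately.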

ALN

Alignment format for CisGenome


chr1[tab]359077[tab]F
chr1[tab]376890[tab]R

….

column1 = chromosome where the read is aligned;
column2 = coordinate where the read is aligned;
column3 = ‘F’ or ‘+’: if the read is aligned to the forward strand of the genome assembly;
         ‘R’ or ‘-’: if the read is aligned to the reverse complement strand of the genome.
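
A sketch of reading and writing this three-column format, following the column description above. The function names are hypothetical, and I normalise the strand to F/R on output since both spellings are accepted on input.

```python
def parse_aln_line(line):
    """Parse one tab-separated ALN line into (chromosome, coordinate, strand)."""
    chrom, coord, strand = line.rstrip("\n").split("\t")
    if strand in ("F", "+"):
        strand = "F"
    elif strand in ("R", "-"):
        strand = "R"
    else:
        raise ValueError("unknown strand: %r" % strand)
    return chrom, int(coord), strand


def format_aln_line(chrom, coord, strand):
    """Write one ALN line from a chromosome, coordinate and F/R strand."""
    return "%s\t%d\t%s" % (chrom, coord, strand)
```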

Oct 19, 2008

Service Providers

Next-Gen Sequencing as a service (you don't need half a million $$ to play this game)

Agencourt

454, ABI SOLID, many related services, located in Beverly MA (USA)

SeqWright

454, ABI SOLID, located in Houston, Texas

Cofactor Genomics

Illumina GA II, can buy a single lane w/ multiplex primers, located in St. Louis MO
Article in "In Sequence"

Fasteris

Illumina Genome Analyzer, located in Geneva Switzerland

GATC Biotech

Illumina, 454, and ABI, located in Germany

GeneService

Illumina Genome Analyzer, custom bioinformatics

Applications

This is where the real action is. New applications for Chip-Seq technology are developing with the unlimited creativity of the worldwide scientific community.

Genome Resequencing

The 1000 Genomes Project

NHGRI

Personal Genome Project

SNP discovery/detection

Chip-Seq (transcription factor studies, epigenetics)

RNA-Seq (transcriptome, digital gene expression)

MetaGenomics

Human Microbiome Project

The New Science of Metagenomics (National Academies Press - free PDF)


Mapping translocation breakpoints

Chen et al, Genome Research 18:1143-1149, 2008.

Commercial Software

A large number of vendors are developing or adapting products to serve the Next-Gen Sequencing market. I will try to collect as much info as possible here, with a very brief description of functions. We can add comments or review pages for each.

Geospiza FinchLab Next-Gen

A LIMS and analysis system for complete lab management workflow

GenoLogics Geneus

A LIMS system that tracks sample submission, automates analysis pipeline commands, and keeps track of the resulting data.

CLC Genomics Workbench

Both desktop and server-based NG genomics tools: assembly, Chip-Seq, transcriptome.

Genomatix

solutions for transcript mapping (digital gene expression) and Chip-seq with an emphasis on mapping sequence tags to annotated genome regions (TF binding sites).

GenomeQuest

A full service bioinformatics pipeline and consultant as an online service - if you have a next-gen machine and no bioinformatics support!!

DNA*

SeqMan NGen

desktop solution for assembly of NG data.

NextGene

from Softgenetics

de novo assembly, SNP detection, and transcriptome/digital gene expression

Conferences

Upcoming conferences (and links to content from recent confs. if I can find any)

Cambridge Healthtech Inst: Next Generation Sequencing

meets every 6 months, next one in March '09, San Diego CA

Metagenomics 2008

Nov 3-7 Caltech, San Diego, CA

Magazines & Articles

These are news articles, editorials, etc. about Next-Gen Sequencing.

Published material that is not an official refereed journal article.

In Sequence

"The Inside Road on Genome Sequencing"

a GenomeWeb publication

BioInform

Short Read Sequence Software

Nature: Big Data special

http://www.nature.com/news/specials/bigdata

Nature Genetics

Focus on Next-Generation Sequencing

HHMI Bulletin

Next Generation Sequencing

Venture Beat

Interview with Bill Ericson

BioIT World

ABI and the $60K Human Genome

Chip-Seq

I have been working quite hard on Chip-Seq applications for Illumina (Solexa) data.

These boil down to four basic functions:

peak calling - taking sequence reads aligned to a reference genome and counting the number of hits per genome interval, subtracting background or a control lane, smoothing, cutting off shoulders, splitting double peaks, and coming up with some statistic that suggests that the peaks are real vs. false positives
annotation - finding the location of peaks on the genome as compared to known features, especially the transcription start sites of known genes
visualization - looking at peaks in one of the genome browsers
motif detection - finding patterns of common bases within the peaks, comparing these patterns with known transcription factor binding sites
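
To make the first step concrete, here is a bare-bones sketch of the counting and background-subtraction part of peak calling on a single chromosome. It deliberately leaves out smoothing, shoulder trimming, peak splitting and any statistics; the bin size, cutoff and function names are arbitrary assumptions for illustration, not what any of the packages below actually do.

```python
from collections import Counter


def bin_counts(positions, bin_size):
    """Count aligned read positions per fixed-width genome interval."""
    return Counter(pos // bin_size for pos in positions)


def call_peaks(chip, control, bin_size=100, min_excess=3):
    """Return (start, end, excess) for bins where ChIP exceeds control.

    chip and control are lists of read start positions on one chromosome.
    bin_size and min_excess are illustrative values only.
    """
    chip_bins = bin_counts(chip, bin_size)
    ctrl_bins = bin_counts(control, bin_size)
    peaks = []
    for b in sorted(chip_bins):
        excess = chip_bins[b] - ctrl_bins[b]  # a missing control bin counts as 0
        if excess >= min_excess:
            peaks.append((b * bin_size, (b + 1) * bin_size, excess))
    return peaks
```

If the control lane has a different number of reads than the ChIP lane, the control counts would need scaling before subtraction.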

We have evaluated quite a few different pieces of software that supply various of these functions:

CisGenome

"An integrated software system for analyzing ChIP-chip and ChIP-seq data"

Ji H, Jiang H, Ma W, Johnson DS, Myers RM, Wong WH.

Nat Biotechnol. 2008 Nov;26(11):1293-300.

FindPeaks

BC Cancer Agency: FindPeaks

Vancouver Short Read Analysis Package

This is a good peak finder, easy to use, with a reasonable statistical model (based on comparison of your genome mapped data vs. a MonteCarlo random distribution of tags)
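
The general idea of comparing against a Monte Carlo background can be sketched as follows: scatter the same number of tags at random, record the tallest pile-up, repeat, and use a high quantile of those maxima as the cutoff. This is only my illustration of the concept, not FindPeaks' actual model; every name and default here is an assumption.

```python
import random


def random_max_bin_count(n_tags, genome_length, bin_size, rng):
    """Place n_tags uniformly at random and return the largest bin count."""
    counts = {}
    for _ in range(n_tags):
        b = rng.randrange(genome_length) // bin_size
        counts[b] = counts.get(b, 0) + 1
    return max(counts.values())


def monte_carlo_threshold(n_tags, genome_length, bin_size,
                          n_trials=100, quantile=0.95, seed=0):
    """Estimate a peak-height cutoff from simulated random tag placements."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    maxima = sorted(
        random_max_bin_count(n_tags, genome_length, bin_size, rng)
        for _ in range(n_trials)
    )
    return maxima[min(n_trials - 1, int(quantile * n_trials))]
```

Observed peaks taller than the returned threshold are unlikely to arise from uniformly random tags; real genomes are not uniform, which is one reason a control lane is still worth having.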

SISSRs (Site Identification from Short Sequence Reads)

Makes use of +/- strand information in Chip-Seq reads to precisely identify transcription factor binding sites within a few tens of base pairs.

Jothi R, Cuddapah S, Barski A, Cui K, Zhao K. Genome-wide identification of in vivo protein-DNA binding sites from ChIP-Seq data. Nucleic Acids Res. 2008 Sep;36(16):5221-31.
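
The strand trick can be illustrated with a toy sketch: forward-strand reads tend to pile up on one side of a binding site and reverse-strand reads on the other, so the site sits near where the net count (forward minus reverse) flips from positive to negative. This is a heavily simplified illustration with made-up names and bin size, not the published algorithm, which also handles fragment length and significance thresholds.

```python
def net_strand_counts(reads, bin_size):
    """reads: iterable of (position, strand) with strand '+' or '-'."""
    net = {}
    for pos, strand in reads:
        b = pos // bin_size
        net[b] = net.get(b, 0) + (1 if strand == "+" else -1)
    return net


def candidate_sites(reads, bin_size=20):
    """Return positions where the net strand count flips from + to -."""
    net = net_strand_counts(reads, bin_size)
    bins = sorted(net)  # only occupied bins are compared
    sites = []
    for left, right in zip(bins, bins[1:]):
        if net[left] > 0 and net[right] < 0:
            # Midpoint between the centres of the two bins.
            sites.append(((left + right + 1) * bin_size) // 2)
    return sites
```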

MACS: Model-based Analysis for Chip-Seq

Genome Biology article

written by Yong Zhang and Tao Liu from the lab of Shirley Liu at Harvard

QuEST

C++ program (requires C++ compiler) - author Anton Valouev in Sidow lab at Stanford

Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data, Valouev, et al. Nature Methods 5, 829 - 834 (2008)

Wold Lab software suite (@ Caltech)

ChipSeq Peak Finder

GeneTrack

Peak finder and visualization via UCSC Genome Browser

MIT Integrative Genomics Viewer

MIT IGV

note the alignment processor that creates tag counts from Next-Gen aligned reads (such as Eland output files)

Chip-Seq Analysis Server

Web-based peak calling at the Swiss Institute of Bioinformatics

ChIPDiff - identification of differential histone modification sites by comparison of two ChIP-Seq libraries prepared from different tissues (various cell types, stages, or environmental responses). Uses a Hidden Markov Model to identify differences in ChIP tag counts.

http://bioinformatics.oxfordjournals.org/cgi/content/full/24/20/2344

Available from Genome Institute of Singapore

http://cmb.gis.a-star.edu.sg/ChIPSeq/tools.htm

Software

Software for basic next-gen sequencing operations.

Each of the commercial vendors has their own proprietary software, so we will emphasize the open source.

A great page about Next-Gen software on SeqAnswers

Image Processing

Basecalling

PyroBayes

alternative base calling for 454 sequencer with improved quality scores. Developed by the Marth lab at Boston College

Swift

Open source primary data analysis for next-gen DNA sequencers

Alignment to a Reference Sequence

Maq Mapping and Assembly with Quality

http://maq.sourceforge.net

MOSAIK

Gapped alignments to reference genome, another from the Marth lab at Boston College

Novoalign from Novocraft in Kuala Lumpur, Malaysia

SOAP — Short Oligonucleotide Alignment Program

GNU Public License

from the Bioinformatics Dept of the Beijing Genomics Institute

Ruiqiang Li, et al. SOAP: short oligonucleotide alignment program. Bioinformatics. 2008 24: 713-714

RMAP/RMAPQ

Maps sequence reads to genomic locations, allowing a variable number of mismatches and an overall quality cutoff score. By Andrew D. Smith and Zhenyu Xuan in the Zhang lab at Cold Spring Harbor.

ZOOM (Zillions Of Oligos Mapped)

Product of the Michael Zhang lab at Cold Spring Harbor

Bioinformatics paper

"Zoom is freely available to non-commercial users at http://bioinfor.com/zoom"

de novo Assembly

Edena (Exact de novo Assembler)

http://www.genomic.ch/edena.php

De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer.

D. Hernandez, P. François, L. Farinelli, M. Osteras, and J. Schrenzel.

Genome Research. 18:802-809, 2008.

Velvet (GPL, 64 bit Linux)

http://www.ebi.ac.uk/~zerbino/velvet

Velvet: algorithms for de novo short read assembly using de Bruijn graphs.

D.R. Zerbino and E. Birney. Genome Research 18:821-829.
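
For anyone who has not met de Bruijn graphs before, here is a minimal sketch of the construction: every k-mer in every read becomes an edge from its first k-1 bases to its last k-1 bases. This is the textbook form, written by me for illustration; Velvet's own data structures and error-correction steps are far more involved.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Build a de Bruijn graph: (k-1)-mer node -> set of successor (k-1)-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer links its prefix node to its suffix node.
            graph[kmer[:-1]].add(kmer[1:])
    return graph
```

Contigs correspond to unbranched paths through this graph, which is why the approach avoids all-vs-all read comparison.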

SHARCGS (SHort read Assembler based on Robust Contig extension for Genome Sequencing)

http://sharcgs.molgen.mpg.de/index.shtml

Dohm JC, Lottaz C, Borodina T, Himmelbauer H. SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing.

Genome Res. 2007 17: 1697-1706

ALLPATHS

ALLPATHS: De novo assembly of whole-genome shotgun microreads
Jonathan Butler, Iain MacCallum, Michael Kleber, Ilya A. Shlyakhter, Matthew K. Belmonte, Eric S. Lander, Chad Nusbaum, and David B. Jaffe
Genome Res. 2008 18: 810-820.

SSAKE (GNU Public License, written in Perl)

http://www.bsgc.ca/platform/bioinfo/software/ssake

The Short Sequence Assembly by K-mer search and 3' read Extension (SSAKE) is a genomics application for aggressively assembling millions of short nucleotide sequences by progressively searching for perfect 3'-most k-mers using a DNA prefix tree.

René L Warren, Granger G Sutton, Steven JM Jones, Robert A Holt. 2007.
Assembling millions of short DNA sequences using SSAKE.

Bioinformatics. 23:500-501.
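
A toy version of greedy 3' extension, to show the idea described above. Assumptions on my part: a plain dictionary keyed on the first k bases stands in for the prefix tree, k is fixed rather than progressively shortened, and the first unused candidate is taken instead of building a consensus. It is a sketch of the concept, not SSAKE itself.

```python
from collections import defaultdict


def build_prefix_index(reads, k):
    """Index reads by their first k bases (a stand-in for a prefix tree)."""
    index = defaultdict(list)
    for read in reads:
        if len(read) > k:
            index[read[:k]].append(read)
    return index


def extend_3prime(seed, reads, k):
    """Greedily extend seed at its 3' end using perfect k-mer overlaps."""
    index = build_prefix_index(reads, k)
    used = {seed}
    contig = seed
    while True:
        # Reads whose first k bases exactly match the contig's last k bases.
        candidates = [r for r in index.get(contig[-k:], []) if r not in used]
        if not candidates:
            return contig
        nxt = candidates[0]
        used.add(nxt)
        contig += nxt[k:]  # append only the overhanging bases
```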

SNP Discovery

PolyBayes

Another from the Marth Lab

Genome Viewers

EagleView

Another tool from the Marth Lab

Integrated Genome Browser

from Affymetrix & GenoViz

some IGB tips from Huntsman Cancer Inst. @ Univ of Utah

MIT Integrative Genomics Viewer

LIMS

Solexa Tools

Next-Gen Blogs

These are some of the good folks posting blogs full of useful Next-Gen sequencing information:

INDEPENDENT BLOGS:

SEQanswers

"The next-generation sequencing community"

http://seqanswers.com

Solexa Google Group

fejes.ca

Anthony Fejes is a grad student in bioinformatics at UBC in Vancouver.

He works on FindPeaks and the Vancouver Short Read Analysis Package.

MassGenomics

"Medical Genomics in the post-genome era"

http://massgenomics.wordpress.com

Genetic Future: Daniel MacArthur

http://scienceblogs.com/geneticfuture

The genetic and evolutionary basis of human variation,

and the companies trying to sell you information about your genome.

Systems Biology & Bioinformatics

http://lurena.vox.com

The Tree of Life

Jonathan Eisen writes about a lot more than Next-Gen Sequencing, but his blog is a must-read for everyone in bioinformatics, genomics, and evolutionary biology

VENDOR BLOGS:

CLC Bio NG Seq

(developer of CLC Genomics Workbench)

Geospiza

FinchTalk

DISCUSSION LIST:

BioConductor Short Read sequencing mail list

Next Gen Sequencing Vendors

A list of the vendors of next-generation, high-throughput DNA sequencing machines.

No editorializing here, there will be comment pages for each of these.

The Big Three:

454 (Roche)

www.454.com

Illumina (Solexa)

www.illumina.com

ABI SOLID

www.appliedbiosystems.com

The New New Thing:

Helicos

www.helicosbio.com

Pacific Biosciences

www.pacificbiosciences.com

[plans to ship first unit in 2010]

An analysis of PacBio by David Hamilton, Feb 10, 2008

The Polonator (Dover Systems aka George Church & Co.)

www.polonator.org

Intelligent Biosystems

www.intelligentbiosystems.com

VisiGen Biotechnologies

www.visigenbio.com

Greetings ultra-sequencers and bioinformatics geeks.

The community of next-generation sequencing and its supporting bioinformatics is developing very rapidly, but also fragmenting into many factions. It has become very difficult for anyone to keep track of what is going on in the many different technologies, software development projects, and the various commentators.

I am going to try to make this blog a catchall where people can keep up on news from the technology vendors, survey progress in all sequencing related software, and keep up to date on relevant journal publications.

I can't do it all myself, so everyone is welcome to send in notes on relevant material as soon as they notice it. Contributors may be invited to become co-authors.

—Cheers,

Stuart Brown

NYU Bioinformatics Core
