Oct 28, 2008

Gene-Boosted Assembly

Steven Salzberg describes a method for de novo assembly of a bacterial genome (Pseudomonas aeruginosa strain PAb1, 6.2 Mb) from a set of 33 bp Solexa reads, using two closely related strains as reference sequences and "boosting" the assembly with predicted protein-coding regions.

Salzberg SL, Sommer DD, Puiu D, Lee VT (2008) Gene-Boosted Assembly of a Novel Bacterial Genome from Very Short Reads. PLoS Comput Biol 4(9): e1000186. doi:10.1371/journal.pcbi.1000186

The AMOS assembler used in this project employs several different software modules and a considerable amount of hands-on effort. 

AMOScmp is a comparative alignment tool - it aligns short reads to a similar reference genome, and then builds contigs. This avoids the challenge of all-vs-all assembly for de novo genome sequencing projects. 

Minimus is a highly stringent assembler that uses Smith-Waterman alignments to identify overlaps between reads.
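
To make the Smith-Waterman idea concrete, here is a minimal sketch of local alignment scoring between two reads. It is a generic textbook version, not Minimus's actual overlap code; the function name and the match/mismatch/gap values are my own assumptions.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for aligning a[i-1] with b[j-1]
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: never let a cell drop below zero
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# The end of the first read overlaps the start of the second ("TACGT")
print(smith_waterman_score("ACGTACGT", "TACGTTTT"))
```

A stringent assembler would only accept an overlap when the score is close to the maximum possible for the overlapping length.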

Contigs were then scanned for protein coding sequences using a combination of Glimmer and BLAST. The ABBA program uses this protein coding information, especially at the ends of contigs and in singletons, to close gaps.

Velvet was also used to independently assemble all the reads into contigs, then MUMmer was used to combine contigs and fill gaps.


This method is not going to work for every de novo sequencing problem, but we are going to try something similar for some new Plasmodium and Trichomonas species. 

All software from the Salzberg lab at the Univ. of Maryland is freely available here:

and a page describing the Short Read Assembly methods here:

Oct 20, 2008

Public Chip-Seq Data

Here are some Chip-Seq data sets that have been published and are out there in the public domain.


Valouev et al, Sidow lab @ Stanford, 

Robertson et al, 2007, Nature Methods  4(8) 651-7.
Eland processed sequence reads and FindPeaks output for Stat1 and FoxA2 transcription factors

File Formats

What is it with bioinformatics people and file formats?!

Why is it so bloody hard to produce and agree on a single standard to represent sequence data (with quality scores) and a standard for sequence reads aligned on a reference genome? With so many formats, we are all spending enormous amounts of time writing converters between all possible combinations.

Here are some of the file formats that I've dealt with in the past couple of weeks:


Sequence plus Phred quality score encoded as single ascii bytes

@NCYC361-11a03.q1k bases 1 to 1576
+NCYC361-11a03.q1k bases 1 to 1576
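
A minimal sketch of how a quality line in this kind of record decodes, assuming the Sanger convention (ASCII code minus 33). Early Solexa/Illumina files use an offset of 64 and a different score scale, so the offset is something to check against your own data. The function names and the example quality string are made up for illustration.

```python
def phred_scores(quality, offset=33):
    """Decode a quality string into integer scores, one per base."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q):
    """Phred definition: Q = -10 * log10(P_error)."""
    return 10 ** (-q / 10)


scores = phred_scores("II5+!")  # hypothetical quality string
print(scores)
print([error_probability(q) for q in scores])
```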

Solexa/Illumina FASTQ like thing...

Solexa output format from Eland extended
>HWI-EAS305_3-30gf5aaxx:8:1:63:487      GGAGGTAGAGGTATATGGCAAGAAAACTGAAAATC     NM      -
>HWI-EAS305_3-30gf5aaxx:8:1:415:1852 GTTAGATTTTGTGTAACTTGCATGTAATGTTAAAA 3:1:0 chr14.fa:35121238F35,35121282F35,35121326F32T1T,351
>HWI-EAS305_3-30gf5aaxx:8:1:187:1286 GTTACACTGAAAAACAAATTCGTTGGAAACGGGAT 0:4:5 chr6.fa:103599157R16C17A,chr2.fa:98502709R16C18,985
>HWI-EAS305_3-30gf5aaxx:8:1:202:440 GTGAAAAATGAGAAATGCACACTGAAGGACCTGGA 3:87:58 chr2.fa:98503100F33T1,98506780F35,98507265F35
>HWI-EAS305_3-30gf5aaxx:8:1:359:505 TATTCAATTTACATACTCTGGCTTTGCCAACATTT 1:0:0 chr9.fa:31339651R35
>HWI-EAS305_3-30gf5aaxx:8:1:1290:135 TTGATTGTATAGTAGGGGTGAAATGGAATTTTATC 1:0:1 chrM.fa:14790R35
>HWI-EAS305_3-30gf5aaxx:8:1:379:298 GACGTGAAATATGGCGAGGAAAACTGAAAAAGGTG 31:56:28 -
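
Here is a rough parser for lines like the ones above. It is a sketch based on my reading of this listing, not on an official spec: I assume whitespace-separated columns (read name, sequence, match counts, hit list), that each hit is a position followed by F or R and a match descriptor, and that a hit without a "chrN.fa:" prefix sits on the same chromosome as the previous hit. The function names are my own.

```python
import re

# One hit looks like "35121238F35" or "103599157R16C17A":
# position, strand (F/R), then a match descriptor.
HIT = re.compile(r"^(\d+)([FR])(\S*)$")


def parse_hits(field):
    """Expand the hit column into (chrom, position, strand, descriptor) tuples."""
    if field in ("-", ""):
        return []
    hits, chrom = [], None
    for item in field.split(","):
        if ":" in item:
            chrom, item = item.split(":", 1)
        m = HIT.match(item)
        if m is None or chrom is None:
            continue  # skip truncated or malformed entries
        hits.append((chrom, int(m.group(1)), m.group(2), m.group(3)))
    return hits


def parse_line(line):
    """Split one record into (name, sequence, match counts, hits)."""
    fields = line.split()
    hits = parse_hits(fields[3]) if len(fields) > 3 else []
    return fields[0].lstrip(">"), fields[1], fields[2], hits


record = parse_line(
    ">HWI-EAS305_3-30gf5aaxx:8:1:359:505 "
    "TATTCAATTTACATACTCTGGCTTTGCCAACATTT 1:0:0 chr9.fa:31339651R35"
)
print(record[3])
```

Entries that were cut off mid-number simply fail the pattern and are skipped rather than guessed at.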

Solexa output format from Eland extended
>HWI-EAS305_3-30gf5aaxx:8:1:414:208     GTAAACTATCAATAAAATAATTTGTTACTCTGTAT     20:7:0
>HWI-EAS305_3-30gf5aaxx:8:1:330:1758 GGTAAAGTCCACTAAGGAAAAGAAAGAAACAATGT 1:0:0 chr7.fa:97764095R0
>HWI-EAS305_3-30gf5aaxx:8:1:576:127 GAAGTCAATCTTATGAGTTATTAGGATGGCTACTC 0:7:255 chr7.fa:111867683F1,chr12.fa:51788781R1,115833262F1
>HWI-EAS305_3-30gf5aaxx:8:1:939:613 TACTTTACTTTCTAGGGAATGTTCACTTCTAAGTG 1:0:0 chr1.fa:150051845R0

filtered eland_extended alignments w/ quality  scores and genome positions
HWI-EAS305      3-30gf5aaxx     8       66      580     1584                    AGTATGGGTATCGGTTGGTGCAGAGAACTACTGCA     YYYYYYYYYYYYYYYYYYYYYYYYVYYYYYVVUVU chr10.fa 3001045 F 35 11
YYYYYYYYYYYTVVVV chr10.fa 3002892 R 35 29
YYYYYYYYYYYTVUVV chr10.fa 3008958 F 34A 20
YYYYYYYYYYYVVVVV chr10.fa 3009290 F 35 3


SGA ('Simplified' Genome Annotation)

GFF  (General Feature Format)
track name=regulatory description="TeleGene(tm) Regulatory Regions"
chr22 TeleGene enhancer 1000000 1001000 500 + . touch1
chr22 TeleGene promoter 1010000 1010100 900 + . touch1
chr22 TeleGene promoter 1020000 1020000 800 - . touch2

FPS (Functional Position Set)
Native format for Eukaryotic Promoter Database

FP   Pv snRNA U1         :+S  EM:J03563.1          1+       352; 17001.098
FP Ath snRNA U2.5 :+S EM:AL353994.1 1- 73709; 24016.116
FP Ath snRNA U5 :+S EM:X13012.1 1+ 678; 23040.
FP Ta histone H3 :+S EM:X00937.1 1+ 186; 07001.

WIG (Wiggle)
UCSC Genome Browser track format

track type=wiggle_0 name="Bed Format" description="BED format" \
visibility=full color=200,100,0 altColor=0,100,200 priority=20
chr19 59302000 59302300 -1.0
chr19 59302300 59302600 -0.75
chr19 59302600 59302900 -0.50

UCSC Genome Browser
Here's an example of an annotation track that uses a complete BED definition:

track name=pairedReads description="Clone Paired Reads" useScore=1
chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512
chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601
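
Converting between the GFF and BED styles shown above is mostly a coordinate shift: GFF start positions are 1-based and inclusive, while BED starts are 0-based with an exclusive end. A minimal sketch, assuming simple whitespace-separated GFF lines like the example (real GFF is tab-delimited and the last column can contain spaces); the function name is my own.

```python
def gff_to_bed(line):
    """Convert one 9-column GFF line to a 6-column BED line."""
    f = line.split()
    chrom, _source, feature, start, end, score, strand = f[:7]
    name = f[8] if len(f) > 8 else feature  # use the group column as the BED name
    bed_start = int(start) - 1  # 1-based inclusive start -> 0-based start
    bed_end = int(end)          # inclusive 1-based end equals exclusive 0-based end
    return "\t".join([chrom, str(bed_start), str(bed_end), name, score, strand])


print(gff_to_bed("chr22 TeleGene enhancer 1000000 1001000 500 + . touch1"))
```

Forgetting the minus one is the classic off-by-one bug when moving features between these formats.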

Alignment format for CisGenome



column1 = chromosome where the read is aligned;
column2 = coordinate where the read is aligned;
column3 = ‘F’ or ‘+’: if the read is aligned to the forward strand of the genome assembly;
‘R’ or ‘-’: if the read is aligned to the reverse complement strand of the genome.
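
A tiny sketch of writing records in this three-column layout from aligner output that uses either strand notation. I am assuming tab-delimited columns here, and the function name is made up.

```python
def to_alignment_line(chrom, position, strand):
    """Format one aligned read as chromosome, coordinate, strand."""
    if strand in ("F", "+"):
        strand_out = "+"
    elif strand in ("R", "-"):
        strand_out = "-"
    else:
        raise ValueError(f"unrecognized strand: {strand!r}")
    return f"{chrom}\t{position}\t{strand_out}"


print(to_alignment_line("chr10", 3001045, "F"))
```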

Oct 19, 2008

Service Providers

Next-Gen Sequencing as a service (you don't need half a million $$ to play this game)

454, ABI SOLID, many related services, located in Beverly MA (USA)

454, ABI SOLID, located in Houston, Texas

Illumina GA II, can buy a single lane w/ multiplex primers, located in St. Louis MO
Article in "In Sequence"

Illumina Genome Analyzer, located in Geneva Switzerland

Illumina, 454, and ABI, located in Germany

Illumina Genome Analyzer, custom bioinformatics


This is where the real action is. New applications for Chip-Seq technology are developing with the unlimited creativity of the worldwide scientific community.

Genome Resequencing

SNP discovery/detection

Chip-Seq (transcription factor studies, epigenetics)

RNA-Seq (transcriptome, digital gene expression)

The New Science of Metagenomics (National Academies Press - free PDF)


Mapping translocation breakpoints

Commercial Software

A large number of vendors are developing or adapting products to serve the Next-Gen Sequencing market.  I will try to collect as much info as possible here with a very brief description of functions. We can add comments or review pages for each.

A LIMS and analysis system for complete lab management workflow

A LIMS system that tracks sample submission, automates analysis pipeline commands, and keeps track of the resulting data.

both desktop and server based NG Genomics tools, assembly, chip-seq, transcriptome,

solutions for transcript mapping (digital gene expression)  and Chip-seq with an emphasis on mapping sequence tags to annotated genome regions (TF binding sites).

A full service bioinformatics pipeline and consultant as an online service - if you have a next-gen machine and no bioinformatics support!!

desktop solution for assembly of NG data.

from Softgenetics
de novo assembly, SNP detection, and transcriptome/digital gene expression


Upcoming conferences (and links to content from recent confs. if I can find any)

Cambridge Healthtech Inst: Next Generation Sequencing
meets every 6 months, next one in March '09, San Diego CA

Nov 3-7 Caltech, San Diego, CA

Magazines & Articles

These are news articles, editorials, etc. about Next-Gen Sequencing.
Published material that is not an official refereed journal article.

"The Inside Road on Genome Sequencing"
a GenomeWeb publication

Nature: Big Data special 

Nature Genetics

HHMI Bulletin

Venture Beat

BioIT World


I have been working quite hard on Chip-Seq applications for Illumina (Solexa) data.

These boil down to four basic functions:
  • peak calling - taking sequence reads aligned to a reference genome and counting the number of hits per genome interval, subtracting background or a control lane, smoothing, cutting off shoulders, splitting double peaks, and coming up with some statistic that suggests that the peaks are real vs. false positives
  • annotation - finding the location of peaks on the genome as compared to known features, especially the transcription start sites of known genes
  • visualization - looking at peaks in one of the genome browsers 
  • motif detection - finding patterns of common bases within the peaks, comparing these patterns with known transcription factor binding sites
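
To give a feel for the first two functions, here is a stripped-down sketch: count read positions per fixed-width interval, subtract a control lane, and then find the nearest transcription start site for a peak. It assumes a single chromosome, skips smoothing and any real statistics, and all names and the threshold are my own choices rather than any particular package's method.

```python
from bisect import bisect_left
from collections import Counter


def bin_counts(positions, bin_size=100):
    """Count read positions per fixed-width genome interval."""
    return Counter(pos // bin_size for pos in positions)


def call_peaks(chip, control, bin_size=100, min_excess=3):
    """Return (bin_start, excess) for bins where ChIP beats control by min_excess."""
    chip_bins = bin_counts(chip, bin_size)
    ctrl_bins = bin_counts(control, bin_size)
    peaks = []
    for b in sorted(chip_bins):
        excess = chip_bins[b] - ctrl_bins.get(b, 0)
        if excess >= min_excess:
            peaks.append((b * bin_size, excess))
    return peaks


def nearest_tss(position, tss_sorted):
    """Return the closest TSS from a non-empty sorted list of coordinates."""
    i = bisect_left(tss_sorted, position)
    candidates = tss_sorted[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda t: abs(t - position))


chip = [1010, 1020, 1030, 1050, 1090, 2500]  # toy read positions
control = [1005]
peaks = call_peaks(chip, control)
print(peaks)
print([nearest_tss(start, [500, 5000]) for start, _ in peaks])
```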

We have evaluated quite a few different pieces of software that supply one or more of these functions:

"An integrated software system for analyzing ChIP-chip and ChIP-seq data"
Ji H, Jiang H, Ma W, Johnson DS, Myers RM, Wong WH.

BC Cancer Agency: FindPeaks
This is a good peak finder, easy to use, with a reasonable statistical model (based on comparison of your genome-mapped data vs. a Monte Carlo random distribution of tags).
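
The general idea of comparing against a Monte Carlo background can be sketched like this: scatter the same number of tags at random over the genome many times, record the tallest pileup in each trial, and use that distribution to pick a height cutoff. This is only my illustration of the concept, not FindPeaks' actual algorithm; the function names, bin-based pileup, and parameters are all assumptions.

```python
import random
from collections import Counter


def simulated_max_heights(n_tags, genome_size, bin_size, n_trials, seed=0):
    """Maximum per-bin tag count in each of n_trials random placements."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(n_trials):
        bins = Counter(rng.randrange(genome_size) // bin_size for _ in range(n_tags))
        maxima.append(max(bins.values()))
    return maxima


def height_threshold(maxima, alpha=0.05):
    """A height that random maxima reach in at most a fraction alpha of trials."""
    ordered = sorted(maxima)
    index = min(int((1 - alpha) * len(ordered)), len(ordered) - 1)
    return ordered[index] + 1


maxima = simulated_max_heights(n_tags=200, genome_size=10000, bin_size=100, n_trials=50, seed=1)
print(height_threshold(maxima))
```

Peaks in the real data taller than the cutoff are then unlikely to be explained by random tag placement alone.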

SISSRs (Site Identification from Short Sequence Reads)
Makes use of +/- strand information in Chip-Seq reads to precisely identify transcription factor binding sites within a few tens of base pairs. 

written by Yong Zhang and Tao Liu from the lab of Shirley Liu at Harvard

C++ program (requires C++ compiler) - author Anton Valouev in Sidow lab at Stanford
Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data, Valouev, et al. Nature Methods 5, 829 - 834 (2008)

Peak finder and visualization via UCSC Genome Browser

MIT Integrative Genome Viewer
note the alignment processor that creates tag counts from Next-Gen aligned reads (such as Eland output files)

Web-based peak calling at the Swiss Institute of Bioinformatics

ChIPDiff - identification of differential histone modification sites by comparison of two ChIP-Seq libraries prepared from different tissues (various cell types, stages, or environmental responses). Uses a Hidden Markov Model to identify differences in ChIP tag counts.
Available from Genome Institute of Singapore


Software for basic next-gen sequencing operations.
Each of the commercial vendors has their own proprietary software, so we will emphasize open-source tools.

A great page about Next-Gen software on SeqAnswers

Image Processing


alternative base calling for 454 sequencer with improved quality scores. Developed by the Marth lab at Boston College

Open source primary data analysis for next-gen DNA sequencers

Alignment to a Reference Sequence

Maq   Mapping and Assembly with Quality

Gapped alignments to reference genome, another from the Marth lab at Boston College

Novoalign from Novocraft in Kuala Lumpur, Malaysia

SOAP  —  Short Oligonucleotide Alignment Program
GNU Public License
from the Bioinformatics Dept of the Beijing Genomics Institute

Maps sequence reads to genomic location, variable number of mismatches and overall quality cutoff score. By Andrew D. Smith and Zhenyu Xuan in the Zhang lab at Cold Spring Harbor.

ZOOM (Zillions Of Oligos Mapped)
Product of the Michael Zhang lab at Cold Spring Harbor
"Zoom is freely available to non-commercial users at http://bioinfor.com/zoom"

de novo Assembly

Edena  (Exact de novo Assembler)
De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer.
D. Hernandez, P. François, L. Farinelli, M. Osteras, and J. Schrenzel.

Velvet  (GPL, 64 bit Linux)
Velvet: algorithms for de novo short read assembly using de Bruijn graphs.
D.R. Zerbino and E. Birney. Genome Research 18:821-829.

SHARCGS (SHort read Assembler based on Robust Contig extension for Genome Sequencing)
Dohm JC, Lottaz C, Borodina T, Himmelbauer H. SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing.


ALLPATHS: De novo assembly of whole-genome shotgun microreads
Jonathan Butler, Iain MacCallum, Michael Kleber, Ilya A. Shlyakhter, Matthew K. Belmonte, Eric S. Lander, Chad Nusbaum, and David B. Jaffe
Genome Res. 2008 18: 810-820.

SSAKE  (GNU Public License, written in Perl)
The Short Sequence Assembly by K-mer search and 3' read Extension (SSAKE) is a genomics application for aggressively assembling millions of short nucleotide sequences by progressively searching for perfect 3'-most k-mers using a DNA prefix tree.
René L Warren, Granger G Sutton, Steven JM Jones, Robert A Holt. 2007.
Assembling millions of short DNA sequences using SSAKE. 
Bioinformatics. 23:500-501.
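
The greedy 3' extension idea can be sketched in a few lines. This is a simplification of my own, not SSAKE's code: I use a plain dictionary keyed on read prefixes in place of a prefix tree, ignore reverse complements and consensus voting, and the function name and minimum overlap are assumptions.

```python
from collections import defaultdict


def greedy_extend(seed, reads, min_overlap=4):
    """Extend seed at its 3' end with reads whose prefix exactly matches its tail."""
    by_prefix = defaultdict(list)
    for r in reads:
        if len(r) > min_overlap:
            by_prefix[r[:min_overlap]].append(r)
    max_len = max((len(r) for r in reads), default=0)
    contig, used = seed, {seed}
    extended = True
    while extended:
        extended = False
        # Try the longest possible overlap first, then shorter ones.
        for olap in range(min(len(contig), max_len) - 1, min_overlap - 1, -1):
            tail = contig[-olap:]
            for r in by_prefix.get(tail[:min_overlap], []):
                if r not in used and len(r) > olap and r.startswith(tail):
                    contig += r[olap:]  # append the non-overlapping 3' bases
                    used.add(r)
                    extended = True
                    break
            if extended:
                break
    return contig


print(greedy_extend("ACGTAC", ["GTACGG", "ACGGTT", "CGGTTA"]))
```

Each read is used at most once, so the loop always terminates, but a greedy approach like this can be led astray by repeats.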

SNP Discovery
Another from the Marth Lab

Genome Viewers

Another tool from the Marth Lab

from Affymetrix & GenoViz
some IGB tips from Huntsman Cancer Inst. @ Univ of Utah


Next-Gen Blogs

These are some of the good folks posting blogs full of useful Next-Gen sequencing information:


"The next-generation sequencing community"

Anthony Fejes is a grad student in bioinformatics at UBC in Vancouver.
He works on FindPeaks and the Vancouver Short Read Analysis Package.

"Medical Genomics in the post-genome era"

Genetic Future: Daniel MacArthur
The genetic and evolutionary basis of human variation, 
and the companies trying to sell you information about your genome.

Systems Biology & Bioinformatics

Jonathan Eisen writes about a lot more than Next-Gen Sequencing, but his blog is a must-read for everyone in bioinformatics, genomics, and evolutionary biology


(developer of CLC Genomics Workbench)



Next Gen Sequencing Vendors

A list of the vendors of next-generation, high-throughput DNA sequencing machines. 
No editorializing here; there will be comment pages for each of these.

The Big Three:

454 (Roche)

Illumina (Solexa)


The New New Thing:


Pacific Biosciences
[plans to ship first unit in 2010]

The Polonator (Dover Systems aka George Church & Co.)

Intelligent Biosystems

VisiGen Biotechnologies

Greetings ultra-sequencers and bioinformatics geeks.

The community of next-generation sequencing and its supporting bioinformatics is developing very rapidly, but also fragmenting into many factions. It has become very difficult for anyone to keep track of what is going on in the many different technologies, software development projects, and the various commentators.

I am going to try to make this blog a catchall where people can keep up on news from the technology vendors, survey progress in all sequencing related software, and keep up to date on relevant journal publications.

I can't do it all myself, so everyone is welcome to send in notes on relevant material as soon as they notice it.  Contributors may be invited to become co-authors. 

Stuart Brown
NYU Bioinformatics Core