The storage of NGS data has reached and passed the critical point. The owner of a HiSeq machine can expect to generate hundreds of terabytes per year. Even more critical than the current large data volumes is the trend over the next few years: sequencing output is growing, and getting cheaper, much faster than hard drive capacity. Current trends show drive capacity doubling (at constant cost) every 18 months, but sequencing output doubling (also at constant cost) every 5 months. So you can expect to pay roughly 3X more for NGS data storage every year.
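As a quick back-of-the-envelope check on that 3X figure, here is a small Python sketch. It assumes clean exponential doubling at the two rates quoted above; the function and variable names are made up for illustration.

```python
def annual_growth(doubling_months: float) -> float:
    """Growth factor over 12 months for a quantity that doubles every `doubling_months`."""
    return 2 ** (12 / doubling_months)

# Assumed doubling times, taken from the trend figures quoted in the post.
drive_growth = annual_growth(18)       # drive capacity per dollar
sequencing_growth = annual_growth(5)   # sequencing output per dollar

# If data volume grows at the sequencing rate while storage only gets cheaper
# at the drive rate, the yearly storage bill grows by the ratio of the two.
storage_cost_growth = sequencing_growth / drive_growth

print(f"drives: {drive_growth:.2f}x per year")
print(f"sequencing: {sequencing_growth:.2f}x per year")
print(f"storage cost: {storage_cost_growth:.2f}x per year")
```

Under these assumptions the ratio comes out a little above 3, which is where the rough "3X per year" estimate comes from.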

The Pistoia Alliance, a trade group that includes most of the big Pharma companies and a bunch of software/informatics companies (but no sequencing machine vendors), has proposed a "Sequence Squeeze" challenge with a prize of $15,000 for the best novel open-source NGS compression algorithm. Nice.
www.sequencesqueeze.org

I think the basic outline of a solution has already been published in this paper by Hsi-Yang Fritz, Leinonen, Cochrane, and Birney:

Efficient storage of high throughput DNA sequencing data using reference-based compression.

http://www.ncbi.nlm.nih.gov/pubmed/21245279

Their basic idea is to avoid storing data that exactly reproduces a Reference Genome. Why store the same invariant data over and over again? Just save the interesting differences, and the quality scores near these differences.

First align all reads to a Reference Genome, then compress high quality reads (all bases Q>20) that perfectly match the Reference down to just a start position and a length. For Illumina reads, all the read lengths are the same, so that value just needs to be saved once for the entire data file. The aligned reads are sorted and indexed, so the position of each read can be marked just as an increment from the previous read. Groups of identical reads can be replaced by a count.
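To make the idea concrete, here is a toy Python sketch of the scheme described above for perfectly matching, fixed-length reads. It is not the encoding from the paper, just an illustration under simplifying assumptions: reads are given only as 0-based start positions, quality scores are dropped entirely, and the function names and record layout are invented here.

```python
from itertools import groupby

def encode_perfect_reads(starts, read_length):
    """Encode perfectly matching reads as (position delta, duplicate count) pairs.

    `starts` holds the 0-based alignment start of each read. The read length
    is stored once for the whole file because all reads share it.
    """
    records = []
    previous = 0
    for pos, group in groupby(sorted(starts)):
        count = sum(1 for _ in group)             # identical reads collapse to a count
        records.append((pos - previous, count))   # increment from the previous read
        previous = pos
    return {"read_length": read_length, "records": records}

def decode_perfect_reads(encoded, reference):
    """Rebuild the read sequences from the reference and the encoded records."""
    n = encoded["read_length"]
    reads = []
    pos = 0
    for delta, count in encoded["records"]:
        pos += delta
        reads.extend([reference[pos:pos + n]] * count)
    return reads

reference = "ACGTACGTTAGC"
encoded = encode_perfect_reads([4, 0, 4, 7], read_length=4)
print(encoded["records"])
print(decode_perfect_reads(encoded, reference))
```

The small deltas and counts are the kind of values a general-purpose entropy coder handles well, which is the point of sorting the reads first.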

For reads that do not perfectly match the Reference Genome, there may still be stretches of high quality matching bases. These can be represented by a set of start-stop coordinates with respect to the read start position, plus a compact encoding of the non-matching bases and the qualities of the surrounding bases. Many such variant summary formats already exist.
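Here is a minimal Python sketch of the mismatch case, again as an illustration rather than the published method. It assumes substitutions only (no insertions or deletions) and keeps the quality score only at each mismatching base rather than for its neighbors; the function names and tuple layout are hypothetical.

```python
def encode_read(read, quals, start, reference):
    """Store a read as its start position plus its differences from the reference.

    Each difference is (offset within read, observed base, quality score).
    Matching bases are not stored at all.
    """
    diffs = [
        (i, base, qual)
        for i, (base, qual) in enumerate(zip(read, quals))
        if reference[start + i] != base
    ]
    return start, diffs

def decode_read(start, diffs, read_length, reference):
    """Rebuild the read bases by patching the differences onto the reference."""
    bases = list(reference[start:start + read_length])
    for offset, base, _qual in diffs:
        bases[offset] = base
    return "".join(bases)

reference = "ACGTACGTTAGC"
read = "ACCTAC"                    # differs from the reference at offset 2
quals = [30, 31, 12, 33, 30, 29]   # made-up Phred scores
start, diffs = encode_read(read, quals, 0, reference)
print(start, diffs)
print(decode_read(start, diffs, len(read), reference))
```

A real format would also need to handle indels, clipped bases, and whatever quality scores the user decides to retain.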

Another interesting idea is to use many different Reference Genomes (for humans), and match each sample to the most similar Reference. This might reduce the number of common variants observed by anything from 2X to 10X.

A number of new companies have recently been created, or have refocused their primary business effort on opportunities in clinical sequencing and personalized medicine. This area has received a lot of speculative attention in the past few years, but the recent development of “Exome” sequencing technology has suddenly made it a practical area for commercial investment.

There are several challenges that must be overcome to make DNA sequencing a clinically relevant tool:

1) the cost of the assay, which includes sample collection from the patient, sample preparation, and operation of the DNA sequencing machine

2) bioinformatics to identify sequence variants in the patient’s DNA

3) filtering and interpretation of sequence variants for clinical relevance – i.e. identify variants that provide information that directly impacts disease treatment decisions.

Exome sequencing addresses all three of these challenges. The exome is defined as the protein coding exons of genes, which make up approximately 50 Mb (megabases) of the human genome – about 1.5% of the entire genome. New sample preparation reagents make it possible to capture this portion of the genome in a single step in a single tube for less than $100. The current Illumina HiSeq sequencing machine produces about 20 Gbp per lane for about $1500, which is equivalent to 400X coverage of the exome. Since current bioinformatics methods require only 50-100X coverage for optimal discovery of sequence variants, this allows 4 to 8 samples to be multiplexed into a single lane. Therefore exome sequencing can be used to scan all of a patient’s genes for under $500 in sequencing and sample preparation costs. The $1000 genome is available right now.
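The arithmetic behind those numbers can be laid out in a few lines of Python. This is only a sketch using the round figures quoted above, and it makes the optimistic assumption that every sequenced base lands on the exome; the constant and function names are invented for illustration.

```python
EXOME_BP = 50_000_000            # ~50 Mb exome target
LANE_YIELD_BP = 20_000_000_000   # ~20 Gbp per HiSeq lane
LANE_COST = 1500                 # dollars per lane
CAPTURE_COST = 100               # dollars per sample (upper bound quoted above)

# Coverage if a whole lane were spent on a single exome.
lane_coverage = LANE_YIELD_BP / EXOME_BP

def per_sample(target_coverage):
    """Return (samples per lane, cost per sample in dollars) at a target coverage."""
    samples = int(lane_coverage // target_coverage)
    cost = LANE_COST / samples + CAPTURE_COST
    return samples, cost

print(f"lane coverage: {lane_coverage:.0f}X")
for coverage in (100, 50):
    samples, cost = per_sample(coverage)
    print(f"{coverage}X: {samples} samples per lane, ${cost:.2f} per sample")
```

At 100X this works out to 4 samples per lane, which is the case that comes in just under $500 per sample.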

Since the exome is a much smaller amount of sequence than the entire genome, and it is focused on the best characterized regions, the task of identifying variants is simplified. The problem of false positives is reduced both by the smaller extent of sequence and by the deeper coverage (≥50X). The challenge of interpretation is also greatly reduced since exons are by definition protein coding. All exon sequence variants can be characterized as changing amino acids or not (or as creating frameshifts and/or stop codons), and the likely impact of an amino acid change on a protein can be assessed by a number of existing algorithms. Most genes can be further characterized by existing knowledge about protein function such as metabolic and regulatory pathways, as well as databases of clinical genetic and pharmacogenetic information.
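As a rough illustration of that first-pass characterization, the Python sketch below classifies a single-base substitution by comparing the reference and variant codons under the standard genetic code. It is a toy: it assumes the codon is already known and in frame, it ignores frameshifts and splice sites, and the function name and category labels are chosen here rather than taken from any particular annotation tool.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def classify_substitution(ref_codon, alt_codon):
    """Label a codon change by its effect on the encoded amino acid."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop gained"
    if ref_aa == "*":
        return "stop lost"
    return "missense"

for ref, alt in [("GAA", "GAG"), ("GAA", "GTA"), ("TGG", "TGA")]:
    print(ref, "->", alt, classify_substitution(ref, alt))
```

Judging how damaging a missense change is likely to be is the harder part, and that is where the existing prediction algorithms mentioned above come in.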

Since the technical ability to perform exome sequencing and basic discovery of sequence variants is available to anyone with a HiSeq machine (and a few skilled bioinformaticians), companies are currently trying to distinguish themselves with the clinical interpretation that they can offer. Some companies are skipping the sequencing entirely and focusing solely on the interpretation of clinical sequence data.

• Ambry Genetics

Ambry Genetics is the first laboratory to provide CLIA-approved exome services for applications in clinical diagnostics along with clinical interpretation and classification of variant data. The expert bioinformatics team makes Clinical Diagnostic Exome™ possible with a robust data analysis pipeline for Mendelian disease discovery.

• Knome Offers Whole-Genome Sequencing, Interpretation for $5K

Founder: George Church

KnomeSelect, a targeted sequencing service that covers the exome, costs $24,500 for individuals. A comparative analysis of genomes includes a short list of suspect variants, genes and networks. Custom desktop software is provided for further analysis, including KnomeFinder for candidate variant discovery and KnomePathways for finding gene-gene interactions and gene networks. The company recently opened up its services to scientists interested in sequencing exomes or genomes of small numbers of humans as part of research studies.

• Personalis

Founders (all from Stanford): Russ Altman, chair of the bioengineering department; Euan Ashley, director of the Stanford Center for Inherited Cardiovascular Disease; Atul Butte, chief of the division of systems medicine at the department of pediatrics; and Michael Snyder, chair of the genetics department and director of the Stanford Center for Genomics and Personalized Medicine. John West, the former CEO of Solexa, is the new firm's CEO.

“Its core capability will be the medical interpretation of human genomes. Personalis expects to work closely with a variety of sequencing technology and service providers — including Illumina, Complete Genomics, and others.”

• Omica is a new startup company. It has developed and published the VAAST system for annotating sequence variants. VAAST is a probabilistic search tool that identifies disease-causing variants in genome sequence data. It combines elements of existing amino acid substitution and aggregative approaches to increase accuracy and ease of use. The tool can score both coding and non-coding variants, and evaluate rare and common variants. The platform, to be used for clinical annotations of both whole genomes and more targeted data such as exomes or gene panels, is currently in beta testing with several undisclosed collaborators. Besides VAAST, which generates disease candidate lists, the Omica service will also include annotation tools that provide additional information about the role of the genes. Users can submit their genome sequence, and the service layers all the clinical annotations on top of it. It also has an interface that can relate variants to diseases.

• GenomeQuest is a provider of cloud-based computing solutions for analysis of Next Generation sequencing data. The GQ-Dx℠ product analyzes and reports comprehensive genomic information about variations and changes in genes and proteins to improve disease treatment. The workflow can be used for whole-genome, whole-exome, and selected gene panel data, and includes:

- Automated transfer of raw data from sequencing machines

- Alignment of the reads against reference genomes

- Variant detection and annotation

- Mapping and documentation of variants against known inherited and somatic mutations

- Integration with other clinical data systems such as Electronic Health Records and therapy protocols to create a comprehensive patient diagnostic record

Designed for academic research laboratories, diagnostics labs, IVD manufacturers, and pharmaceutical companion diagnostic groups, GQ-Dx is already being used in clinical research. In collaboration with GenomeQuest, pathologists at Beth Israel Deaconess Medical Center, a teaching hospital of Harvard Medical School, are developing “clinical grade” annotation methods and databases for cancer diagnoses. GenomeQuest has also created a GeneTests-based diagnostic panel that generates a comprehensive report on disease susceptibility, diagnosis, and treatment on more than 2,000 disorders from a single, whole-genome sequence of a patient.

• Foundation Medicine has narrowed the focus even further. They provide diagnostic exome sequencing of 300 cancer-related genes on FFPE tumor samples submitted by clinical pathologists. They sequence these 300 genes to very deep coverage (500X) to allow detection of rare somatic variants in heterogeneous tumor tissue. The selected gene set is intended to include only genes with directly disease-related functions that impact cancer treatment decisions. The test is intended to replace many different single-gene diagnostic tests currently on the market.

• 23andMe has started a pilot program that offers full exome sequencing for $999. While the company’s regular personal genome service uses Illumina genotyping arrays with around 1 million SNPs (single nucleotide polymorphisms), the exome sequencing actually sequences around 50 million DNA bases with 80x coverage.

Customers will get the raw data, without any additional reports, so it will only be useful to people who actually know how to handle this raw genetic data. 23andMe plans to eventually add a limited set of tools and content that utilize exome sequence data.

23andMe is not the first company to offer whole-genome sequencing to consumers, but it is the first to do so at a sub-$1000 pricepoint. For hardcore bioscientists who know their way around raw genetic data, this is as good a deal as you can currently get.

Next-Gen Sequencing

Oct 25, 2011

Sequence Squeeze

Oct 5, 2011

Exome: A Goldrush in Clinical Sequencing

Stuart Brown
