Oct 25, 2011

Sequence Squeeze

The storage of NGS data has reached and passed the critical point. The owner of a HiSeq machine can expect to generate hundreds of Terabytes per year. Even more critical than the current large data volumes is the trend over the next few years - sequencing will grow faster and cheaper much more rapidly than hard drives. Current trends show the doubling of drive capacity (at a constant cost) every 18 months, but the doubling of sequencing output (also at constant cost) every 5 months. So you can expect to pay 3X more for NGS data storage every year.

The Pistoia Alliance, a trade group that includes most of the big Pharma companies and a bunch of software/informatics companies (but no sequencing machine vendors), has proposed a "Sequence Squeeze" challenge with a prize of $15,000 for the best novel open-source NGS compression algorithm. Nice.
www.sequencesqueeze.org

I think the basic outline of a solution has already been published in this paper by Hsi-Yang Fritz, Leinonen, Cochrane, and Birney:

Efficient storage of high throughput DNA sequencing data using reference-based compression.

http://www.ncbi.nlm.nih.gov/pubmed/21245279

Their basic idea is to reduce the amount of data stored that exactly reproduces a Reference Genome. Why store the same invariant data over and over again? Just save the interesting differences, and the quality scores near these differences.

First align all reads to a Reference Genome, then compress high quality reads (all bases Q>20) that perfectly match the Reference down to just a start position and a length. For Illumina reads, all the read lengths are the same, so that value just needs to be saved once for the entire data file. The aligned reads are sorted and indexed, so the position of each read can be marked just as an increment from the previous read. Groups of identical reads can be replaced by a count.

For reads that do not perfectly match the Ref. Genome, there may still be stretches of high quality matching bases. These can be represented by a set of start-stop coordinates with respect to the read start position, then an efficient formula to store differences for non-matching bases and the qualities of surrounding bases. Many such variant summaries already exist.

Another interesting idea is to use many different Reference Genomes (for humans), and match each sample to the most similar Reference. This might reduce the number of common variants observed by anything from 2x to 10X.