The Pistoia Alliance, a trade group that includes most of the big Pharma companies and a bunch of software/informatics companies (but no sequencing machine vendors), has proposed a "Sequence Squeeze" challenge with a prize of $15,000 for the best novel open-source NGS compression algorithm. Nice.
www.sequencesqueeze.org
I think the basic outline of a solution has already been published in this paper by Hsi-Yang Fritz, Leinonen, Cochrane, and Birney:
Efficient storage of high throughput DNA sequencing data using reference-based compression.
http://www.ncbi.nlm.nih.gov/pubmed/21245279
Their basic idea is to avoid storing data that simply reproduces the Reference Genome. Why store the same invariant data over and over again? Just save the interesting differences, plus the quality scores near those differences.
First align all reads to a Reference Genome, then compress high-quality reads (all bases Q>20) that perfectly match the Reference down to just a start position and a length. For Illumina reads, all the read lengths are the same, so that value only needs to be saved once for the entire data file. The aligned reads are sorted and indexed, so the position of each read can be stored as just an increment from the previous read. Groups of identical reads can be replaced by a count.
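Here's a minimal Python sketch of the perfect-match case, assuming the reads have already been aligned and sorted by position; the (delta, count) record layout is my own illustration, not the paper's actual on-disk format:

    def compress_perfect_matches(starts, read_length):
        """starts: sorted start positions of reads that match the
        reference exactly (all bases Q>20). Because such reads are
        identical to the reference substring, only positions matter.
        Returns (read_length, records), where each record is
        (delta_from_previous_start, count_of_identical_reads)."""
        records = []
        prev_start = 0
        i = 0
        while i < len(starts):
            start = starts[i]
            count = 1  # run-length encode reads stacked at one position
            while i + count < len(starts) and starts[i + count] == start:
                count += 1
            records.append((start - prev_start, count))  # delta encoding
            prev_start = start
            i += count
        return read_length, records

For example, compress_perfect_matches([100, 100, 100, 142, 171], 36) returns (36, [(100, 3), (42, 1), (29, 1)]): the read length is stored once, and each group of identical reads collapses to a small (increment, count) pair.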
For reads that do not perfectly match the Reference Genome, there may still be stretches of high-quality matching bases. These can be represented as a set of start-stop coordinates relative to the read start position, plus an efficient format for storing the non-matching bases and the qualities of the surrounding bases. Many such variant summaries already exist.
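A hedged sketch of that encoding, again with an invented record layout: runs of matching bases become (offset, length) pairs relative to the read start, and each mismatch is stored with its base and quality. The paper also keeps qualities of nearby bases; this sketch keeps only the mismatched position's quality for brevity:

    def encode_read(read, quals, ref_window):
        """read and ref_window are equal-length strings; quals holds
        per-base Phred scores. Returns (match_blocks, substitutions)."""
        match_blocks, substitutions = [], []
        block_start = None
        for i, (r, g) in enumerate(zip(read, ref_window)):
            if r == g:
                if block_start is None:
                    block_start = i  # open a new matching run
            else:
                if block_start is not None:  # close the current run
                    match_blocks.append((block_start, i - block_start))
                    block_start = None
                substitutions.append((i, r, quals[i]))  # base + quality
        if block_start is not None:  # run extends to the read end
            match_blocks.append((block_start, len(read) - block_start))
        return match_blocks, substitutions

So encode_read("ACGT", [30, 30, 12, 30], "ACTT") gives ([(0, 2), (3, 1)], [(2, 'G', 12)]): two matching blocks and one substitution carrying its quality score.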
Another interesting idea is to use many different Reference Genomes (for humans), and match each sample to the most similar Reference. This might reduce the number of common variants observed by anywhere from 2x to 10x.
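A toy sketch of that reference-selection step; count_variants is a hypothetical callback (say, align the sample's reads and count mismatching positions), since the similarity measure isn't specified here:

    def pick_reference(reads, references, count_variants):
        """references: dict mapping reference name -> genome sequence.
        Returns the name of the reference yielding the fewest variants."""
        return min(references,
                   key=lambda name: count_variants(reads, references[name]))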