Mar 14, 2012

Collaborative work on Exome SNPs

I have noticed that a fair number of people who actually work with Next-Gen Sequence data read this blog, so perhaps we can use it for a collaborative project.

I want to write a paper about uneven coverage in exome sequencing leading to incorrect SNP calls. Our data is from tumor-normal pairs, and we see a lot of false negatives - failure to detect a SNP in a sample due to low coverage at that spot. Exome capture methods seem to have more than their fair share of low coverage spots (even with an average coverage over 100x), and these low coverage spots do differ somewhat from sample to sample. I'd like some other people to share data with us and/or do some similar analysis on other data sets so that we can make a stronger paper.
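To make the "low coverage spots" idea concrete, here is a minimal sketch of how one might flag them from per-base depth and compare them between the two samples of a pair. The function names, the threshold of 10 reads, and the toy depth values are all assumptions made for illustration; they are not taken from our data or pipeline.

```python
from typing import List, Set, Tuple


def low_coverage_positions(depths: List[int], min_depth: int = 10) -> Set[int]:
    """Return the 0-based positions whose depth is below min_depth."""
    return {pos for pos, depth in enumerate(depths) if depth < min_depth}


def low_coverage_intervals(depths: List[int], min_depth: int = 10) -> List[Tuple[int, int]]:
    """Merge consecutive low-depth positions into half-open [start, end) runs."""
    intervals = []
    start = None
    for pos, depth in enumerate(depths):
        if depth < min_depth:
            if start is None:
                start = pos  # a new low-coverage run begins here
        elif start is not None:
            intervals.append((start, pos))
            start = None
    if start is not None:
        # The run extends to the end of the region.
        intervals.append((start, len(depths)))
    return intervals


# Toy per-base depths over the same 7-base region (invented numbers).
tumor = [120, 95, 8, 4, 60, 110, 3]
normal = [100, 90, 40, 6, 7, 105, 80]

tumor_low = low_coverage_positions(tumor)
normal_low = low_coverage_positions(normal)

# Positions where at least one sample is too shallow to trust a call.
unreliable = tumor_low | normal_low
# Positions that are shallow in only one sample of the pair.
sample_specific = tumor_low ^ normal_low
```

A SNP that is called in one sample but falls on a position in `unreliable` is a candidate false negative in the other sample rather than a real tumor-normal difference.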

6 comments:

Elia Stupka said...

This is a very interesting issue, which raises a more general one. Due to the sensitive nature of genetic data, what happened to gene expression (GEO, ArrayExpress, MIAME, etc.) did not happen with exome and whole genome sequencing, leading to "each lab on its own" and to only Sanger, Broad, etc. being able to generate enough data to spot patterns, issues and problems. There is a need for ethical sharing of trends and biases, and I think your comment goes in that direction.

You have our full support (we analyzed 185 exomes at my previous lab, and are now starting to process many exomes at my new lab), and we can share anonymized coverage files if we agree on a format, etc.

Another aspect that might be helpful would be to compare a high-coverage whole genome with an exome from the same sample, to spot differences in coverage and biases.

best regards,

Elia Stupka

CY said...

I have recently been working on methods for detecting copy number alterations from whole genome sequencing and have run into the exact problem that Elia has identified.

Getting hold of sequencing data is extremely difficult and it can take months for Material Transfer Agreements to go through unless you are a member of one of the major genome sequencing centers or have the ability to generate data in-house.

Christopher Yau.

Anna Ross said...

If one has to validate an NGS test for clinical use, how many samples do you think should be analysed?

Attila Bérces said...

What exactly do you mean? Technical replicates or different samples from different patients?
We carry out validation for targeted sequencing of HLA and of BRCA. The two are completely different. HLA has the highest diversity of mutations, but most of the mutations are known. BRCA has a low mutation rate, but when a mutation is found it is often novel. In principle, for HLA we can design a test with a limited number of samples that cover all known mutations. With BRCA this is impossible.
Another point is that the computational analysis is probably the most significant source of error. This, however, can be tested exactly. Since human analysis is (almost) always reference-based, one can use a mutated reference to see whether the introduced mutation can be recovered.
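The mutated-reference check described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `mutate_reference` and `recovery_rate` are hypothetical helper names, only single-base substitutions are introduced, and the actual alignment and variant calling against the mutated reference are left to whatever pipeline is being validated.

```python
import random
from typing import List, Tuple

# (0-based position, base in the mutated reference, base expected in the sample)
Variant = Tuple[int, str, str]


def mutate_reference(ref: str, n_mutations: int, seed: int = 0) -> Tuple[str, List[Variant]]:
    """Introduce n_mutations random substitutions into an uppercase reference.

    Returns the mutated sequence and the variants the pipeline should report
    when the unchanged sample is analysed against the mutated reference.
    """
    rng = random.Random(seed)
    # Only uppercase A/C/G/T positions are eligible; N and soft-masked bases are skipped.
    candidates = [i for i, base in enumerate(ref) if base in "ACGT"]
    positions = sorted(rng.sample(candidates, n_mutations))
    seq = list(ref)
    expected = []
    for pos in positions:
        original = seq[pos]
        new_base = rng.choice([b for b in "ACGT" if b != original])
        seq[pos] = new_base
        # Relative to the mutated reference, the sample carries new_base -> original.
        expected.append((pos, new_base, original))
    return "".join(seq), expected


def recovery_rate(expected: List[Variant], called: List[Variant]) -> float:
    """Fraction of introduced mutations that appear among the called variants."""
    if not expected:
        return 1.0
    called_set = set(called)
    return sum(v in called_set for v in expected) / len(expected)


# Toy reference (invented sequence) with three introduced substitutions.
ref = "ACGTACGTACGTACGTACGT"
mutated, expected = mutate_reference(ref, 3, seed=1)
```

In use, the sample's reads would be aligned to `mutated`, the resulting calls converted to the same `(position, ref, alt)` tuples, and passed to `recovery_rate` together with `expected`.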

rocketprince said...

This perspectives article cites high GC content as a possible source of bias (among others), something to consider:

http://www.nature.com/ng/journal/v44/n6/full/ng.2303.html
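One simple way to look for the GC effect in one's own data is to bin capture targets by GC content and compare the mean depth per bin. The sketch below assumes each target is available as a (sequence, mean depth) pair; the function names, the ten-bin layout, and the toy targets are invented for illustration and are not from the cited article.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def gc_bin(seq: str, n_bins: int = 10) -> int:
    """Return the GC-content bin (0 .. n_bins - 1) of a sequence, ignoring non-ACGT bases."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    called = gc + seq.count("A") + seq.count("T")
    if called == 0:
        return 0
    # Integer arithmetic avoids floating-point trouble at bin boundaries.
    return min(gc * n_bins // called, n_bins - 1)


def mean_depth_by_gc_bin(targets: Iterable[Tuple[str, float]], n_bins: int = 10) -> Dict[int, float]:
    """Average the mean depth of all targets that fall in the same GC bin."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for seq, depth in targets:
        b = gc_bin(seq, n_bins)
        sums[b] += depth
        counts[b] += 1
    return {b: sums[b] / counts[b] for b in sums}


# Toy targets: (sequence, mean depth). All values are invented.
targets = [
    ("ATATATATAT", 120.0),
    ("ATGCATATAT", 100.0),
    ("GCGCGCGCGC", 20.0),
    ("GCGCGCGCGA", 30.0),
]
by_bin = mean_depth_by_gc_bin(targets)
```

If the high-GC bins show consistently lower depth across samples, that points to a systematic capture bias rather than sample-to-sample noise.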

DNA Testing said...

Thanks for posting such an informative article. Keep posting articles like this; I am very happy with this one.