I am speaking at the CHI Next-Gen Sequencing conference in Providence on 9.26.2010 (Sunday short course). My topic is the role of bioinformatics in QC for NG sequencing, with examples from ChIPseq, where I have the most experience.
My main point is that the informatics team works hardest on experiments that produce poor data - or where the data contradict the investigator's expectations. When the experiment is beautiful, you can use your automated (or semi-automated) pipeline and hand over the analyzed data with a standard report. For a transcription factor type ChIPseq, the standard result is a set of peaks with p-value and fold change vs. an input DNA, annotated by distance to the nearest gene's Transcription Start Site (TSS). If pressed, we can deliver this about 2 days after the sequencing run is complete. For an epigenomics type ChIPseq (histone methylation, acetylation, etc.) we deliver both peaks vs. input DNA and some type of fold-change for each peak comparing one biological condition vs. another.
However, we spend a lot more time squabbling about runs with high PCR duplication, weird artifacts, low yield, peaks in the input DNA lane, etc. To deal with this, we have been developing a variety of tools to quantify overall data quality in a ChIPseq run. We are looking at the overall clustering of mapped reads on the reference genome (average spacing of adjacent/overlapping reads), as well as coverage at various depths. Some of these metrics make interesting graphs, but we have not completely pinned down their predictive power for understanding the data.
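To make the metrics above concrete, here is a minimal Python sketch of three of them: duplication rate, spacing between adjacent read starts, and fraction of the genome covered at a given depth. This is an illustration, not our actual tools. The function names, the (chrom, start, strand) read tuples, and the fixed read length are all assumptions, and duplicates are defined simply as reads sharing the same chromosome, start and strand.

```python
from collections import Counter
from statistics import median

def duplication_rate(reads):
    """Fraction of reads that repeat an already-seen (chrom, start, strand)."""
    if not reads:
        return 0.0
    return 1.0 - len(set(reads)) / len(reads)

def median_adjacent_spacing(reads):
    """Median gap between consecutive read starts, computed per chromosome."""
    by_chrom = {}
    for chrom, start, _strand in reads:
        by_chrom.setdefault(chrom, []).append(start)
    gaps = []
    for starts in by_chrom.values():
        starts.sort()
        gaps.extend(b - a for a, b in zip(starts, starts[1:]))
    return median(gaps) if gaps else None

def coverage_at_depths(reads, read_len, genome_size, depths=(1, 2, 5)):
    """Fraction of the genome covered by at least d reads, for each d.

    Simplification (assumption): every read covers [start, start + read_len),
    regardless of strand.
    """
    events = {}
    for chrom, start, _strand in reads:
        ev = events.setdefault(chrom, [])
        ev.append((start, 1))              # read begins
        ev.append((start + read_len, -1))  # read ends
    bases_at_depth = Counter()
    for ev in events.values():
        ev.sort()
        depth, prev = 0, None
        for pos, delta in ev:
            if prev is not None and pos > prev:
                bases_at_depth[depth] += pos - prev
            depth += delta
            prev = pos
    return {
        d: sum(n for k, n in bases_at_depth.items() if k >= d) / genome_size
        for d in depths
    }

# Toy data: four reads on one chromosome, two of them exact duplicates.
toy_reads = [("chr1", 100, "+"), ("chr1", 100, "+"),
             ("chr1", 110, "+"), ("chr1", 200, "-")]
dup = duplication_rate(toy_reads)
spacing = median_adjacent_spacing(toy_reads)
cov = coverage_at_depths(toy_reads, read_len=20, genome_size=1000)
```

A tightly clustered ChIP lane should show a small median spacing and a relatively large share of its covered bases at higher depths, while an input lane should look closer to uniform. Where to set thresholds on any of these numbers is exactly the part we have not pinned down.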
We have recently been playing with selecting sets of genes based on external data, such as gene expression values from microarray or RNAseq experiments, and looking at the aggregate profile of reads mapped near the TSS of groups of genes that are upregulated, downregulated, unchanged, etc. By combining reads for a bunch of genes, we get smoother curves, and you can actually say fairly clearly that upregulated genes have (or do NOT have) a change in histone methylation near the TSS as compared with downregulated or unchanged genes.
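The aggregate profile idea can be sketched in a few lines of Python. Again, this is a rough illustration under stated assumptions rather than our pipeline: `aggregate_tss_profile`, the window and bin sizes, and the input layout (a dict of sorted read positions per chromosome, plus a list of (chrom, tss, strand) tuples for one gene group) are all hypothetical choices.

```python
from bisect import bisect_left, bisect_right

def aggregate_tss_profile(read_positions, tss_list, window=2000, bin_size=100):
    """Mean read count per gene in bins spanning -window..+window around the TSS.

    read_positions: dict chrom -> sorted list of read positions
    tss_list: list of (chrom, tss, strand) for one group of genes
    Offsets are flipped for '-' strand genes so that upstream is always
    on the left of the profile.
    """
    n_bins = (2 * window) // bin_size
    counts = [0] * n_bins
    for chrom, tss, strand in tss_list:
        positions = read_positions.get(chrom, [])
        lo = bisect_left(positions, tss - window)
        hi = bisect_right(positions, tss + window)
        for pos in positions[lo:hi]:
            offset = pos - tss
            if strand == "-":
                offset = -offset
            if -window <= offset < window:
                counts[(offset + window) // bin_size] += 1
    n_genes = max(len(tss_list), 1)
    return [c / n_genes for c in counts]

# Toy data: one '+' gene with TSS at 1000 and one '-' gene with TSS at 5000.
toy_positions = {"chr1": [950, 1000, 1050, 5100]}
toy_genes = [("chr1", 1000, "+"), ("chr1", 5000, "-")]
profile = aggregate_tss_profile(toy_positions, toy_genes, window=100, bin_size=50)
```

In practice you would call this once per gene group (upregulated, downregulated, unchanged) on the same read set and overlay the resulting curves. Dividing by the number of genes keeps groups of different sizes comparable.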