Dec 15, 2011

Job Opening: Sequencing Informatics Scientist

One of my very good informatics people is leaving at the end of the year, so I have a job vacancy to fill in our Sequencing Informatics unit (funded as an institutional core to support our Next-Gen Sequencing Lab and our investigators, not from any one grant). I want someone with either a Masters and some experience with Next-Gen sequencing informatics, or a PhD (bioinformatics, computer science, or something similar) who is looking for a more stable, service-oriented position rather than the usual highly competitive postdoc. There will be opportunity for both collaborative and independent work on various projects, and publications are expected. UNIX/Perl/Java skills are necessary.

The job previously involved informatics support for 454 sequencing, but that turned out to be less than 30% of the actual work. Looking forward, our microbiome work will be done mostly on Illumina, bacterial genomes on Illumina... you get the idea. Send CVs to

Oct 25, 2011

Sequence Squeeze

The storage of NGS data has reached and passed the critical point. The owner of a HiSeq machine can expect to generate hundreds of terabytes per year. Even more critical than the current large data volumes is the trend over the next few years: sequencing will get faster and cheaper much more rapidly than hard drives. Current trends show drive capacity (at constant cost) doubling every 18 months, but sequencing output (also at constant cost) doubling every 5 months. So you can expect to pay about 3x more for NGS data storage every year.
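The arithmetic behind that 3x figure is easy to check. A quick sanity-check sketch, using the doubling times stated above:

```python
# Assumptions from the text: drive capacity per dollar doubles every
# 18 months; sequencing output per dollar doubles every 5 months.

def annual_growth(doubling_time_months):
    """Factor by which capacity-per-dollar grows in one year."""
    return 2 ** (12 / doubling_time_months)

drive_growth = annual_growth(18)       # ~1.59x per year
sequencing_growth = annual_growth(5)   # ~5.28x per year

# The cost of storing a year's worth of sequencing output grows by
# the ratio of the two growth rates:
cost_ratio = sequencing_growth / drive_growth
print(f"Storage cost multiplier per year: {cost_ratio:.1f}x")  # ~3.3x
```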

The Pistoia Alliance, a trade group that includes most of the big Pharma companies and a bunch of software/informatics companies (but no sequencing machine vendors), has proposed a "Sequence Squeeze" challenge with a prize of $15,000 for the best novel open-source NGS compression algorithm. Nice.

I think the basic outline of a solution has already been published in this paper by Hsi-Yang Fritz, Leinonen, Cochrane, and Birney:

Efficient storage of high throughput DNA sequencing data using reference-based compression.

Their basic idea is to reduce the amount of data stored that exactly reproduces a Reference Genome. Why store the same invariant data over and over again? Just save the interesting differences, and the quality scores near these differences.

First align all reads to a Reference Genome, then compress high quality reads (all bases Q>20) that perfectly match the Reference down to just a start position and a length. For Illumina reads, all the read lengths are the same, so that value just needs to be saved once for the entire data file. The aligned reads are sorted and indexed, so the position of each read can be marked just as an increment from the previous read. Groups of identical reads can be replaced by a count.
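Here is a toy sketch of that position-encoding step (my own illustration, not the authors' actual file format): sorted read starts are delta-encoded, and groups of identical reads collapse to a count.

```python
from itertools import groupby

def compress_perfect_matches(starts):
    """Compress the sorted start positions of perfectly matching,
    fixed-length reads: each position is stored as an increment from
    the previous read, and duplicate reads become a single
    (delta, count) pair."""
    out = []
    prev = 0
    for pos, group in groupby(sorted(starts)):
        count = sum(1 for _ in group)
        out.append((pos - prev, count))  # increment from previous read
        prev = pos
    return out

# 6 reads, with duplicates at positions 1000 and 1024
starts = [1000, 1000, 1000, 1024, 1024, 1500]
print(compress_perfect_matches(starts))
# [(1000, 3), (24, 2), (476, 1)]
```

Since aligned reads cluster, the deltas are mostly small integers, which then compress very well with a generic entropy coder.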

For reads that do not perfectly match the reference genome, there may still be stretches of high-quality matching bases. These can be represented by a set of start-stop coordinates relative to the read start, plus an efficient encoding of the non-matching bases and the qualities of the surrounding bases. Many such variant summaries already exist.

Another interesting idea is to use many different reference genomes (for humans), and match each sample to the most similar reference. This might reduce the number of common variants observed by anything from 2x to 10x.

Oct 5, 2011

Exome: A Goldrush in Clinical Sequencing

A number of new companies have recently been created, or have refocused their primary business effort on opportunities in clinical sequencing and personalized medicine. This area has received a lot of speculative attention in the past few years, but the recent development of “Exome” sequencing technology has suddenly made it a practical area for commercial investment.

There are several challenges that must be overcome to make DNA sequencing a clinically relevant tool:

  1) the cost of the assay, which includes sample collection from the patient, sample preparation, and operation of the DNA sequencing machine

  2) bioinformatics to identify sequence variants in the patient's DNA

  3) filtering and interpretation of sequence variants for clinical relevance - i.e. identifying variants that provide information that directly impacts disease treatment decisions.

Exome sequencing addresses all three of these challenges. The exome is defined as the protein-coding exons of genes, which make up approximately 50 Mb of the human genome – about 1.5% of the entire genome. New sample preparation reagents make it possible to capture this portion of the genome in a single step in a single tube for less than $100. The current Illumina HiSeq sequencing machine produces about 20 Gbp per lane for about $1500, which is equivalent to 400x coverage of the exome. Since current bioinformatics methods require only 50-100x coverage for optimal discovery of sequence variants, this allows 4 to 8 samples to be multiplexed into a single lane. Therefore exome sequencing can be used to scan all of a patient's genes for under $500 in sequencing and sample preparation costs. The $1000 genome is available right now.
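The back-of-envelope math in this paragraph, written out (prices and yields as quoted above, which will certainly drift):

```python
# Numbers quoted in the text (2011 figures)
exome_size = 50e6          # ~50 Mb of protein-coding exons
lane_yield = 20e9          # ~20 Gbp per HiSeq lane
lane_cost = 1500           # dollars per lane
capture_cost = 100         # single-tube exome capture per sample
target_coverage = 100      # upper end of the 50-100x needed for calling

raw_coverage = lane_yield / exome_size                   # 400x per lane
samples_per_lane = int(raw_coverage // target_coverage)  # 4 at 100x each
cost_per_sample = lane_cost / samples_per_lane + capture_cost

print(f"{raw_coverage:.0f}x per lane, {samples_per_lane} samples/lane, "
      f"~${cost_per_sample:.0f} per exome")
# 400x per lane, 4 samples/lane, ~$475 per exome
```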

Since the exome is a much smaller amount of sequence than the entire genome, and it is focused on the best characterized regions, the task of identifying variants is simplified. The problem of false positives is reduced both by the smaller extent of sequence and by the deeper coverage (≥50X). The challenge of interpretation is also greatly reduced since exons are by definition protein coding. All exon sequence variants can be characterized as changing amino acids or not (or creating frameshifts &/or stop codons), and the likely impact on a protein of an amino acid change can be assessed by a number of existing algorithms. Most genes can be further characterized by existing knowledge about protein function such as metabolic and regulatory pathways, as well as databases of clinical genetic and pharmacogenetic information.

Since the technical ability to perform exome sequencing and basic discovery of sequence variants is available to anyone with a HiSeq machine (and a few skilled bioinformaticians), companies are currently trying to distinguish themselves with the clinical interpretation that they can offer. Some companies are skipping the sequencing entirely and focusing solely on the interpretation of clinical sequence data.

Ambry Genetics
Ambry Genetics is the first laboratory to provide CLIA-approved exome services for applications in clinical diagnostics along with clinical interpretation and classification of variant data. The expert bioinformatics team makes Clinical Diagnostic Exome™ possible with a robust data analysis pipeline for Mendelian disease discovery.

Knome Offers Whole-Genome Sequencing, Interpretation for $5K
Founder: George Church
KnomeSelect, a targeted sequencing service that covers the exome, costs $24,500 for individuals. A comparative analysis of genomes includes a short list of suspect variants, genes and networks. Custom desktop software is provided for further analysis, including KnomeFinder for candidate variant discovery and KnomePathways for finding gene-gene interactions and gene networks. The company recently opened up its services to scientists interested in sequencing exomes or genomes of small numbers of humans as part of research studies.

Personalis
Founders: Russ Altman, chair of Stanford's bioengineering department; Euan Ashley, director of the Stanford Center for Inherited Cardiovascular Disease; Atul Butte, chief of the division of systems medicine in the department of pediatrics; and Michael Snyder, chair of the genetics department and director of the Stanford Center for Genomics and Personalized Medicine. John West, the former CEO of Solexa, is the new firm's CEO.
Its core capability will be the medical interpretation of human genomes. Personalis expects to work closely with a variety of sequencing technology and service providers, including Illumina, Complete Genomics, and others.

Omica is a new startup company that has developed and published the VAAST system for annotating sequence variants. VAAST is a probabilistic search tool that identifies disease-causing variants in genome sequence data. It combines elements from existing amino acid substitution and aggregative approaches to increase accuracy and ease of use. The tool can score both coding and non-coding variants, and evaluate rare and common variants. The platform, to be used for clinical annotation of both whole genomes and more targeted data such as exomes or gene panels, is currently in beta testing with several undisclosed collaborators. Besides VAAST, which generates disease candidate lists, the Omica service will also include annotation tools that provide additional information about the role of the genes. Users can submit their genome sequence, and Omica layers all the clinical annotations on top of it, with an interface that can relate variants to diseases.

GenomeQuest is a provider of cloud-based computing solutions for analysis of Next Generation sequencing data. The GQ-Dx℠ product analyzes and reports comprehensive genomic information about variations and changes in genes and proteins to improve disease treatment. The workflow can be used for Whole-Genome, Whole-Exome, and selected Gene Panels including:
- Automated transfer of raw data from sequencing machines
- Alignment of the reads against reference genomes
- Variant detection and annotation
- Mapping and documentation of variants against known inherited and somatic mutations
- Integration with other clinical data systems such as Electronic Health Records and therapy protocols to create a comprehensive patient diagnostic record
Designed for academic research laboratories, diagnostics labs, IVD manufacturers, and pharmaceutical companion diagnostic groups, GQ-Dx is already being used in clinical research. In collaboration with GenomeQuest, pathologists at Beth Israel Deaconess Medical Center, a teaching hospital of Harvard Medical School, are developing “clinical grade” annotation methods and databases for cancer diagnoses. GenomeQuest has also created a GeneTests-based diagnostic panel that generates a comprehensive report on disease susceptibility, diagnosis, and treatment on more than 2,000 disorders from a single, whole-genome sequence of a patient.

Foundation Medicine has narrowed the focus even further. They provide diagnostic exome sequencing of 300 cancer related genes on FFPE tumor samples submitted by clinical pathologists. They sequence these 300 genes to very deep coverage (500X) to allow detection of rare somatic variants in heterogeneous tumor tissue. The selected gene set is intended to include only genes with directly disease related functions that impact cancer treatment decisions. The test is intended to replace many different single-gene diagnostic tests currently on the market. 

23andMe has started a pilot program that offers full exome sequencing for $999. While the company’s regular personal genome service uses Illumina genotyping arrays with around 1 million SNPs (single nucleotide polymorphisms), the exome sequencing actually sequences around 50 million DNA bases with 80x coverage.
Customers will get the raw data, without any additional reports, so it will only be useful to people who actually know how to handle this raw genetic data. 23andMe plans to eventually add a limited set of tools and content that utilize exome sequence data.
23andMe is not the first company to offer sequencing to consumers, but it is the first to do so at a sub-$1000 price point. For hardcore bioscientists who know their way around raw genetic data, this is as good a deal as you can currently get.

Sep 29, 2011

Foundation Medicine grabs for the low-hanging fruit of NGS cancer diagnostics

I was at the CHI APPLYING NEXT-GENERATION SEQUENCING conference in Providence RI, where I heard an extremely interesting presentation from a new Genomics company called Foundation Medicine. This company plans to offer a clinical diagnostic test based on very deep sequencing of all exons from about 300 cancer related genes. They will sequence directly from pathologist's FFPE blocks using Illumina HiSeq to a depth of 500 to 1000X.

Here is a recent poster they presented at ASCO, but the information at the CHI conference was updated and more in depth.
ASCO poster

Here is why I think this is very important. First, this test will include all existing genes that are currently being tested for any type of cancer (BRCA1&2, KRAS, BRAF, HER2, EGFR, etc.), but with all exons and greater diagnostic sensitivity for mutations present in low abundance in heterogeneous samples, which may suffer from mixed tumor and normal tissue, multiple clones, mixed aneuploidy, etc. It will likely also contain the majority of known pharmacogenomic genes. So this one test could put all the other providers of cancer-related genetic tests out of business.

It is also very important that the test is highly targeted only at "actionable" genes. Foundation Med. plans to deliver a report for each patient (in 14 days) that lists all mutations observed in the diagnostic genes, as well as some key items drawn from the literature, clinical trials, and a curated knowledge base about treatments relevant to those genes. In the presentation, COO Kevin Krenitsky said that they typically found 2-3 mutated genes per patient. This is an amount of data that the oncologist or pathologist can reasonably be expected to deal with — rather than the hundreds to thousands of mutated genes with questionable to zero clinical implications that will be produced by whole genome sequencing.

Another interesting discovery reported by Foundation Med. was that in a small number of cases (perhaps 5%), they found mutations for genes that were associated with a different type of cancer. This suggests the use of a non-traditional drug, possibly in combination with other more typical therapies, as an individualized treatment for that one patient. There are currently about 30 drugs for which genetic information can aid in treatment decisions, but this is clearly an area of intense development. Foundation Med. can easily modify its test to include any relevant new genes. We are clearly heading to the point where every cancer patient will benefit from an individualized genomics workup.

Jul 18, 2011

GWAS vs Exome Sequencing

I learned something interesting today about the SNP arrays used for GWAS. There has been a lot of discussion about the nature of mutations/alleles discovered by GWAS studies in terms of the "common disease, common variant" hypothesis. It is clear that SNP arrays are designed to cover common variants - alleles that are present in at least 2% of the human population (or at least of some population). Conversely, genome sequencing studies tend to focus on rare variants. In fact, a number of recent studies show that major diseases such as cancer and autism tend to be associated with novel, very severe mutations in coding regions of genes.

Now this is the interesting part. We took a look at the intersection between the Illumina 2.5 M SNP array and the regions targeted by the Agilent Sure Select exon enrichment kit. It turns out that only about 90K of the Illumina SNPs are in the exon regions. This matches up with Illumina's own annotation file showing that more than 80% of the SNPs on the array are intron or intergenic.  My human genetics colleague suggests that the SNP array targets sequence variants (alleles) with small effects, while the exon sequencing strategy targets mutations with large effects. So we can't really replace the SNP array with exome sequencing, they are looking at completely different things.
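The intersection itself is just an interval lookup. A toy version (single chromosome, half-open intervals; the actual comparison used real annotation files and would normally be done with BED files and a tool like bedtools):

```python
import bisect

def count_snps_in_exons(snp_positions, exons):
    """Count SNPs that fall inside any exon interval.
    exons: sorted, non-overlapping (start, end) tuples, half-open."""
    starts = [s for s, _ in exons]
    n = 0
    for pos in snp_positions:
        # Find the rightmost exon starting at or before this SNP
        i = bisect.bisect_right(starts, pos) - 1
        if i >= 0 and exons[i][0] <= pos < exons[i][1]:
            n += 1
    return n

exons = [(100, 200), (500, 650)]
snps = [50, 150, 199, 200, 600, 900]
print(count_snps_in_exons(snps, exons))  # 3  (positions 150, 199, 600)
```

Run over the 2.5M array positions and the Sure Select target regions, this kind of count is what produced the ~90K overlap figure above.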

Jun 28, 2011

The False Discovery of Mutations by Sequencing

     I am amazed by the success reported in recent papers finding mutations by Next-Gen Sequencing in rare genetic diseases and cancer. In our lab, the sequence data for SNPs is messy and difficult to interpret. The basic problem is that NGS data, particularly Illumina data in our case, contains a moderate level of sequencing errors. We get somewhere between 0.5% and 1% errors in our sequence reads from the GAII and HiSeq machines. This is not bad for many practical purposes (ChIP-seq and RNA-seq experiments have no trouble with this data), and this error level "is within specified operating parameters" according to Illumina Tech Support. The errors are not random; they occur much more frequently at the ends of long (100 bp) reads. Some types of errors are systematic in all Illumina sequencing (A>T miscalls are most common), and other types of errors are common to a particular sample, run, or lane of sequence data. Also, when you are screening billions of bases looking for mutations, rare overlaps of errors will occur.
     So if sequence data contains errors, and the point of your experiment is to find mutations, then when you find a difference between your data and the reference genome (a variant), you had better make doubly sure that the difference is real. There is a lot of software designed to filter out real mutations (SNPs) from the random sequence errors. The basic idea is to first filter out bad, low quality bases using the built-in quality scores produced by the sequencer. Second, require that multiple reads show the same variant, and that the fraction of reads showing the variant makes sense in your experiment: 40-60% might be good for a heterozygous allele in a human germline sample, 10% or less might make sense if you are screening for a rare variant in a sample from a mixed population of cells.  Also, it is usually wise to filter out all common SNPs in the dbSNP database - we assume that these are not cancer causing, and they have a high likelihood of being present in healthy germline cells as well as tumor cells.
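As a concrete illustration, here is a toy version of the heterozygous-germline check described above. The thresholds are the illustrative numbers from the text, not a recommendation, and a real pipeline would also use base qualities and dbSNP membership:

```python
def call_variant(ref_reads, var_reads, min_depth=10,
                 min_vaf=0.40, max_vaf=0.60):
    """Toy heterozygous-germline SNP filter: require adequate read
    support and a variant allele fraction (VAF) that makes sense for
    one allele of a diploid sample. Thresholds are illustrative."""
    depth = ref_reads + var_reads
    if depth < min_depth:
        return False  # too few reads to trust anything
    vaf = var_reads / depth
    return min_vaf <= vaf <= max_vaf

print(call_variant(30, 28))   # True  (VAF ~0.48, plausible het)
print(call_variant(55, 3))    # False (VAF ~0.05, likely sequencing error)
```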
     We have used the SNP calling tools in the Illumina CASAVA software, the MAQ software package, similar tools in SAMtools, and recently the GATK toolkit. In all cases, it is possible to tweak parameters to get a stringent set of predicted mutations, filtering out low quality bases, low frequency mutations, and SNPs that are near other types of genomic problems such as insertion/deletion sites, repetitive sequence, etc. Using their own tools Illumina has published data showing a false positive detection rate of 2.89% (Illumina FP Rate).  Under many experimental designs, validating 97% of your predicted mutations would be excellent.
     Unfortunately, our medical scientists don't want predicted SNPs vs. an arbitrary reference genome. They want to find mutations in cancer cells vs. the normal cells (germline or wild type) of the same patient. This is where all the tools seem to fall apart. When we run the same SNP detection tools on two NGS samples, and then look for the mutations that are unique to the tumor vs the wild type (WT), we get a list of garbage, thousands of lines long. We get stupid positions with 21% variant allele detected in tumor and 19% variant in WT. Or we get positions where the 80% variant allele frequency is not called as a SNP in WT because 2 out of 80 reads have a one base deletion near that base. So the stringent settings on our SNP discovery software create FALSE NEGATIVES where we miss real SNPs in the WT genome, which then show up as tumor-specific mutations in our SNP discovery pipeline.
     Zuojian Tang is creating a post-SNP data filter that imposes a sanity check on the data based on allele frequencies. We are trying out various parameters, but something like a minimum of 40% variant in the tumor and less than 5% variant in the WT narrows the list of tumor-specific mutations down to a manageable number that could be validated by PCR or Sequenom.
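A toy version of that post-SNP sanity check (the thresholds are the ballpark figures quoted above, still being tuned; the real filter also needs depth and quality requirements):

```python
def tumor_specific(tumor_ref, tumor_var, wt_ref, wt_var,
                   min_tumor_vaf=0.40, max_wt_vaf=0.05):
    """Keep a candidate somatic mutation only if it is well supported
    in the tumor AND nearly absent in the matched wild-type sample.
    Comparing allele fractions directly avoids the false negatives
    created by stringent per-sample SNP calling."""
    tumor_vaf = tumor_var / (tumor_ref + tumor_var)
    wt_vaf = wt_var / (wt_ref + wt_var)
    return tumor_vaf >= min_tumor_vaf and wt_vaf < max_wt_vaf

# The 21% tumor / 19% WT position from the text: correctly rejected
print(tumor_specific(79, 21, 81, 19))   # False
# A well-supported somatic candidate: kept for validation
print(tumor_specific(50, 50, 98, 2))    # True
```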

Jun 6, 2011

Involve Bioinformatics in design of every experiment... please

Two interesting projects came through our informatics group last week, both in the 'data drop' mode where the investigator asks for help to analyze data as it comes off of the sequencers. I have noted many times before that our informatics effort is much greater on the poorly designed and failed experiments.

Experiment #1 was a seemingly standard SNP detection using exome sequences with 100 bp paired-end reads on Illumina HiSeq (Agilent Sure Select capture) - the entire thing done by a private sequencing contractor. The contractor also supplied SNP calls using Illumina CASAVA software. Our job was simply to find overlaps between the SNP calls for various samples and controls, and to annotate the SNPs with genomic information (coding or non-coding, conservative mutations, biological pathways, etc.). However, we have an obsession with QC data, which the vendor was very reluctant to supply. Turns out that these sequencing reads have a 1.5% error rate, while our internal sequencing lab generates 0.5% error. We also see 10K novel SNPs in each sample with only minimal overlap across samples (a red flag for me). More QC data is extracted from the vendor, and now we see a steep increase in error at the ends of reads. So we wish to trim all reads down by 10-25% and re-call SNPs - which means extracting files from the vendor a third time (Illumina requires a LOT of runtime and intermediate files in order to run CASAVA for SNP calling).

Meanwhile, Experiment #2 is an RNAseq project where the investigator is interested in alternative splicing. We analyzed one earlier data set with 50 bp reads with only moderate success. It seems that very deep coverage is needed to get valid data for alt-splicing, especially when levels of a poorly expressed isoform are suspected to change by a small amount due to biological treatment. The investigator saw some published results suggesting that paired-end RNAseq data would provide more information about splicing isoforms. So, WITHOUT a bioinformatics consult, they sent an existing sample (created for 50 bp single-end sequencing) to the lab for 100 bp paired-end sequencing. This data came out of our pipeline with more than 20% error and a strange mix of incorrectly oriented read pairs (facing outward rather than inward). After a few days of head scratching and escalating levels of Illumina bioinformatics tech support, we have an explanation. A 225 bp library fragment contains 130 bp of primers and adapters. Thus the insert has an average size of about 95 bp. Some are shorter! Thus, our 100-cycle reads go off the far end of most inserts, adding 5 or more bases of adapter sequence where the alignment software is expecting genomic sequence. In addition, the paired ends overlap more than 100% - the start of one read falls inside the end of its mate. Thus they map in the opposite orientation, with an apparent insert size of 5-10 bp. Our best effort to analyze this data will involve chopping all reads back to 36 bp and repeating the paired-end analysis. So that was 3 days of bioinformatics analysis time not so well spent on forensic QC.
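The arithmetic that finally explained the mess, written out:

```python
# Numbers from Experiment #2: a short library sequenced too long.
fragment_len = 225   # average library fragment, bp
adapter_len = 130    # primers + adapters, bp
read_len = 100       # cycles per read

insert_len = fragment_len - adapter_len             # only ~95 bp of insert
adapter_readthrough = max(0, read_len - insert_len) # 5+ bases of adapter
                                                    # at the end of each read
pair_overlap = max(0, 2 * read_len - insert_len)    # mates overlap by more
                                                    # than the insert itself

print(f"insert: {insert_len} bp, adapter read-through: "
      f"{adapter_readthrough} bp, mate overlap: {pair_overlap} bp")
# insert: 95 bp, adapter read-through: 5 bp, mate overlap: 105 bp
```

Since each mate reads entirely through the insert and out the other side, the aligner sees the pair facing the wrong way, which is exactly the outward-facing orientation we observed.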

Now we are looking back at Experiment #1 and wondering about insert sizes in that library. What if that library's insert size was about 110 or 120 bp (perhaps with a sizeable tail of much smaller fragments), and a fraction of those reads also ran off into the adapter, adding mismatched bases at the ends of alignments and thus jacking up the overall error rate?

Two conclusions: 1) talk to bioinformatics BEFORE you build your sequencing libraries
2) if you want something done right, do it yourself.

May 19, 2011

$10K bioinformatics on thousand dollar genome

It is now possible to get 100x coverage of the exome sequence for a cancer sample (or any other type of human genomic sample) on one lane of an Illumina HiSeq machine. With the Sure Select 50 Mb exome kit, it still costs quite a bit more than one thousand dollars to get this data, but it is getting close. At maximum yield, it might currently be possible to multiplex 4 samples into a single lane and still get 100x coverage of each. This will certainly be true when planned upgrades to the HiSeq machine are available.

Illumina provides some nice software (called CASAVA) that is typically run at the default settings by core labs and sequencing outsourcing companies. This software gives high-quality genome alignments and pretty good SNP calls - useful for many purposes. However, real-world research needs are often not satisfied by default automated bioinformatics analysis. Narrowing down hundreds of thousands of SNP calls to the few real disease-related mutations is difficult hands-on work for skilled bioinformaticians. Today in my lab group, we are fighting with false negatives: SNPs that were present but not called in the germline sample, leading to false identification of mutations unique to the tumor. It looks like we will have to re-run the SNP detection software many times with small changes in various parameters to optimize specificity vs. sensitivity in each sample. Investigators may sub-contract this type of work to the lab that does the sequencing, they may have skilled bioinformaticians in their lab group, or they may hire bioinformatics consultants. In any case, $1K of sequence data may cost more than $10K for analysis.

Feb 15, 2011

Archiving NGS data

Anyone who has worked with NextGen sequence data quickly gains an appreciation for the difficulties associated with long-term data storage. The current 'state of the art,' at least for Illumina machines, involves saving some fairly raw data files such as fastq text to the NCBI Sequence Read Archive (SRA).

Our GAIIx is producing about 30 million reads per lane, which gives files of 8-10 GB (72 cycles) per lane in either qseq (completely unfiltered) or fastq (quality scored) format. If we max out two runs per week, that is about 140 GB of raw sequence data per Illumina machine per week.

There has been some recent discussion about the possibility of phasing out the SRA at NCBI.
[see this post which claims to be a memo from NCBI director David Lipman: "The Sequence Read Archive (SRA) will also be phased out over the next 12 months."]
If cost cutting is truly necessary for our national biomedical research infrastructure, I can see why the raw SRA data might be growing at an awkwardly rapid rate and have less value than the highly used databases such as GenBank non-redundant nucleotide, GEO, etc.

I think that it is interesting to turn this discussion around and ask why we are archiving all of this raw sequence data. The trivial argument is that "journals require open access to raw data as a condition of publication." But that argument ignores the more interesting question: What is the 'raw data' for a sequencing project? No one is loading Illumina (or SOLiD or 454) image data into public archives. The impracticality of saving multiple terabytes of image data for each run made that approach moot a couple of years ago. We are saving raw qseq or fastq files right now because our methods for basecalling and SNP calling (and indel/translocation/copy number calling) are imprecise. I have seen data analysts go back into primary sequence reads for a single sample and find a SNP that was not called because a few reads had below-threshold quality scores.

If we consider the actual "useful" data content of a NGS run on a single sample, the landscape looks quite different. ChIP-seq is our most common NGS application. The useful data from a ChIP-seq run is actually just a set of genome positions where read starts are mapped. At most, this is 20-30 million positions. In actuality, 30% of reads are not mapped, and another 10-50% are duplicates (multiple reads that map to the exact same position), so the final data set might be compressed to about 10 million genomic loci with a read count at each spot. After sorting and indexing, this information could be efficiently stored in a very compact file.
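In code terms, the reduction is just a sorted position/count table (a toy illustration of the idea, not a real file format):

```python
from collections import Counter

def compact_chipseq(mapped_starts):
    """Collapse mapped read-start positions into sorted
    (position, count) pairs -- the 'useful' ChIP-seq data described
    above, with duplicate reads reduced to a single count."""
    return sorted(Counter(mapped_starts).items())

reads = [1000, 1000, 1003, 2500, 2500, 2500]
print(compact_chipseq(reads))
# [(1000, 2), (1003, 1), (2500, 3)]
```

Ten million such loci, stored as deltas with small counts, fit comfortably in tens of megabytes rather than the gigabytes of raw fastq they came from.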

RNA sequencing is becoming increasingly popular. Our clients are typically not interested in the sequence data itself, only in gene expression counts - essentially the same data as produced by a microarray. However, there are some cool new applications that look at alternative splicing, so we may have to keep the actual sequence reads on hand for a while longer.

Human (and mouse) SNP/indel/CNV detection is another popular NGS application. We are only really interested in the variants. However, SNP calling software requires both the numbers of reads with reference vs. variant bases and quality scores for each basecall. Some software also uses context-dependent quality metrics, such as distance from other SNPs, distance from indels, etc. Given the highly diverse collection of existing SNP detection software, and the likelihood of new software development, it seems impossible to compress this class of data to a set of variant calls and discard the raw reads. This is very unfortunate, since typical variant detection projects use anything from 20x to 50x coverage of the genome. So we are storing 150 GB of raw sequence data in order to track a few million bytes' worth of actual variation in the genome of each research sample.

Other applications, such as de novo genome sequencing of new organisms, or metagenomic sequencing of environmental or medical samples will not be easily compressed. Fortunately, these data are currently archived in places other than the SRA.

Jan 24, 2011

Genetic Disease Diagnostics with NGS

A couple of recent papers demonstrate a significant opportunity for the use of NextGen Sequencing in the diagnosis of genetic disease. Dennis Lo et al., at The Chinese University of Hong Kong, have published results for an NGS fetal genetic diagnostic test based on recovery of fragments of fetal DNA from the mother's blood plasma. Preliminary results show that complete coverage of the fetal diploid genome is possible [Science Translational Medicine] at a resolution that allows for differentiation of heterozygous vs. homozygous mutations in disease genes, and also that aneuploidy, such as trisomy 21, can be detected with high specificity and sensitivity [British Medical Journal]. The key benefit of this approach is that it can be done non-invasively from a simple blood draw from the mother, so it avoids the relatively high incidence of pregnancy complications created by amniocentesis or chorionic villus sampling procedures.

Meanwhile, the lab of Stephen Kingsmore at the US National Center for Genome Resources reported results of a targeted sequencing carrier screen for a total of 448 severe (rare) recessive genetic diseases [Science Translational Medicine]. This work is particularly significant because the screen is designed to work in multiplex, allowing for a potential total cost per patient below $500 (less than $1 per disease screened). While each disease is rare in isolation, the combined screen found an average of 2.8 mutations per individual tested in the proof-of-concept phase of the study.

Taken together, these advances suggest that routine clinical applications of NGS will soon be practical, attractive, and economically feasible for large numbers of healthy people (pregnant women and marriage-minded couples). This is great news for NGS equipment vendors, and also suggests a software engineering opportunity for the development of much more robust bioinformatics pipelines for processing this data and including it in electronic medical records. At the same time, I am worried that the lab folks may be progressing much more rapidly than the thinking in the ELSI community. What kind of databases will be created when every pregnancy and every marriage license is associated with gigabyte files of deep sequencing data? This issue is all the more problematic because disease carrier testing and Down syndrome screening are already so widely accepted. Changing prenatal tests to use sequencing in order to reduce complications in pregnancy, and adding pre-conception tests for diseases that were previously thought to be too rare to merit widespread screening, are non-controversial medical advances. The downside might come from the unintentional discovery of other genetic information, the availability to law enforcement and other organizations of large files of genetic information on every person, etc.