Jan 4, 2018

Genome Annotation Challenges

Public databases of genetic information have a fundamental garbage-in, garbage-out problem. A huge number of useful databases are populated by pulling information from other databases and adding new value through computational inference, but automated linking of databases can propagate incorrect information. The curators of primary repositories such as GenBank make a substantial effort to publish only correct information, so they are very conservative and annotate genes only with verifiable information. NCBI also has a policy that the original depositor of any given entry (a gene, protein, genome, experimental dataset, etc.) is the author of its annotation and metadata, and no one else can alter it.


Staphylococcus aureus is an important human pathogen with perhaps the largest number of whole genome sequences in public repositories of any bacterium. NCBI had 8367 Staph genomes in its “Genomes” section as of Jan 1, 2018, and another ~40,000 in the SRA and Whole Genome Shotgun sections. However, GenBank has chosen strain NCTC 8325 as the Reference Genome for Staph, and put its genes in RefSeq.


This genome was sequenced, annotated, and submitted on 27-Jan-2006 by Gillaspy et al. from the Oklahoma Health Sciences Center. As a result of the “Reference Genome” designation, an automatic lookup of a Staph gene in GenBank is likely to return the annotation from NCTC 8325. This particular genome has 2,767 protein-coding genes (plus 30 pseudogenes, 61 tRNA genes, and 16 rRNA genes), but 1,496 of these proteins (about 54%) are annotated only as “hypothetical protein” in their “gene product” or “description” field. This is very confusing, since many of these genes are 100% identical to proteins that have specific and well-documented functions in other Staph strains.

Here is one example:

hypothetical protein SAOUHSC_00010 [Staphylococcus aureus subsp. aureus NCTC 8325]
NCBI Reference Sequence: YP_498618.1
FEATURES             Location/Qualifiers
     source          1..231
                     /organism="Staphylococcus aureus subsp. aureus NCTC 8325"
                     /strain="NCTC 8325"
                     /sub_species="aureus"
                     /db_xref="taxon:93061"
     Protein         1..231
                     /product="hypothetical protein"

GenBank knows that this is not the correct annotation for this protein. In the “Region” sub-field of the record (which is very rarely used by automated annotation tools that take data from GenBank), an appropriate function, a COG, and a CDD conserved domain are noted:

     Region          7..231
                     /region_name="AzlC"
                     /note="Predicted branched-chain amino acid permease
                     (azaleucine resistance) [Amino acid transport and
                     metabolism]; COG1296"
                     /db_xref="CDD:224215"
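An annotation tool could, in principle, read that Region feature whenever the product is uninformative. What follows is only a minimal sketch of that idea in Python, using the standard library and the feature-table text quoted above; the function names parse_features and best_description are made up for illustration and are not part of any NCBI tool. It also assumes the simple flat-file layout shown here (feature keys indented 5 spaces, quoted qualifiers), and a conserved-domain hit is weaker evidence than a curated product name, so the result should be treated as a suggestion.

```python
import re

# Excerpt of the feature table quoted above (YP_498618.1).
RECORD = '''\
     Protein         1..231
                     /product="hypothetical protein"
     Region          7..231
                     /region_name="AzlC"
                     /note="Predicted branched-chain amino acid permease
                     (azaleucine resistance) [Amino acid transport and
                     metabolism]; COG1296"
                     /db_xref="CDD:224215"
'''


def parse_features(text):
    """Return a list of (feature_key, {qualifier: value}) tuples."""
    features = []
    qualifier = None  # name of the qualifier currently being read
    for line in text.splitlines():
        # A feature line is indented 5 spaces: key, then location.
        head = re.match(r"^ {5}(\S+)\s+\S+", line)
        if head:
            features.append((head.group(1), {}))
            qualifier = None
            continue
        body = line.strip()
        if not features or not body:
            continue
        quals = features[-1][1]
        start = re.match(r'^/(\w+)="?(.*)$', body)
        if start:
            qualifier = start.group(1)
            quals[qualifier] = start.group(2).rstrip('"')
        elif qualifier:
            # Continuation of a qualifier value wrapped over several lines.
            quals[qualifier] += " " + body.rstrip('"')
    return features


def best_description(features):
    """Prefer the product; fall back to the first Region if it is uninformative."""
    product = next((q.get("product") for k, q in features if k == "Protein"), None)
    if product and product.lower() != "hypothetical protein":
        return product
    for key, quals in features:
        if key == "Region" and "region_name" in quals:
            note = quals.get("note", "")
            name = quals["region_name"]
            return f"{name}: {note}" if note else name
    return product


print(best_description(parse_features(RECORD)))
```

For a real pipeline, a maintained GenBank parser would be a safer choice than this hand-rolled one; the point is only that the information is already in the record.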


NCBI also links this gene to an “Identical Protein Group” where 3957 proteins are listed with 100% amino acid identity, which are annotated variously as: “azaleucine resistance protein AzlC”, “branched-chain amino acid ABC transporter permease”, “AzlC”, and “Inner membrane protein YgaZ”. A very conservative annotation bot might panic at this level of inconsistency and default to the lowest common denominator of “hypothetical protein”. However, a more sophisticated automaton might compare the protein sequence to PFAM or COG protein functional families and assign a common annotation to them all.
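As a rough illustration of the two bot behaviors described above, here is a small Python sketch that classifies the product names found in one identical protein group. The names in the example are the ones listed in this post; the function summarize_group and the UNINFORMATIVE set are assumptions made for the example, not an NCBI API. It does not perform the PFAM or COG comparison itself; it only separates groups that have no informative name at all from groups where informative names exist but disagree, which are the ones worth sending on to a family search or a human curator.

```python
# Names that carry no functional information (an assumed, incomplete list).
UNINFORMATIVE = {"hypothetical protein", "uncharacterized protein"}


def summarize_group(names):
    """Classify the product names found in one identical protein group."""
    informative = sorted(
        {n.strip() for n in names if n.strip().lower() not in UNINFORMATIVE}
    )
    if not informative:
        status = "unannotated"   # nothing to transfer
    elif len(informative) == 1:
        status = "consistent"    # one name, safe to propose
    else:
        status = "conflicting"   # needs a family search or a curator
    return status, informative


# Product names reported for the group containing SAOUHSC_00010.
azlc_group = [
    "hypothetical protein",
    "azaleucine resistance protein AzlC",
    "branched-chain amino acid ABC transporter permease",
    "AzlC",
    "Inner membrane protein YgaZ",
]

status, names = summarize_group(azlc_group)
print(status)
for name in names:
    print(" ", name)
```

A very conservative bot would treat "conflicting" the same as "unannotated" and write “hypothetical protein”; a less conservative one would keep the list of candidate names and try to reconcile them.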

The incorrect “hypothetical” annotations for Staph genes in GenBank can be found downstream in many other databases, such as the Database of Essential Genes, AureoWiki, KEGG, UniProt, etc., which all take their primary annotation from GenBank. So someone sequencing a new strain of Staph and using any of these resources to annotate predicted genes will probably end up assigning “hypothetical protein” to the AzlC gene and many hundreds of others, perpetuating the cycle of misinformation.








In many other cases, it does not seem possible for an algorithm to resolve messy annotations that a human expert might be able to figure out. For example, Staph strain COL has many hypothetical genes such as SACOL1097. Its NCBI Identical Protein Group also shows only “hypothetical protein” annotations. However, a BLAST search shows 95% identity to nitrogen fixation protein NifR.


hypothetical protein SACOL1097 [Staphylococcus aureus subsp. aureus COL]
GenBank: AAW37977.1
LOCUS       AAW37977                  59 aa            linear   BCT 31-JAN-2014
DEFINITION  hypothetical protein SACOL1097 [Staphylococcus aureus subsp. aureus COL].
ACCESSION   AAW37977
VERSION     AAW37977.1
DBLINK      BioProject: PRJNA238
            BioSample: SAMN02603996
DBSOURCE    accession CP000046.1
SOURCE      Staphylococcus aureus subsp. aureus COL































