Next-Gen Sequencing: Genome Annotation Challenges

Jan 4, 2018

Genome Annotation Challenges

Public databases of genetic information have a fundamental garbage-in, garbage-out problem. Many useful databases are populated by pulling information from other databases and adding value through computational inference, but automated linking of databases can propagate incorrect information. The curators of primary repositories such as GenBank make a substantial effort to publish only correct information, so they are very conservative and annotate genes only with verifiable information. NCBI also has a policy that the original depositor of any given entry (a gene, protein, genome, experimental dataset, etc.) is the author of its annotation and metadata, and no one else can alter it.

Staphylococcus aureus is an important human pathogen with perhaps the largest number of whole genome sequences in public repositories of any bacterium. NCBI has 8367 Staph genomes in its “Genomes” section (as of Jan 1, 2018), and another ~40,000 in the SRA and Whole Genome Shotgun sections. However, GenBank has chosen strain NCTC 8325 as the Reference Genome for Staph, and put its genes in RefSeq.

This genome was sequenced, annotated, and submitted on 27-Jan-2006 by Gillaspy et al. from the Oklahoma Health Sciences Center. As a result of this “Reference Genome” designation, an automatic lookup of a Staph gene in GenBank is likely to get the annotation from NCTC 8325. This particular Staph genome has 2,767 protein coding genes (plus 30 pseudogenes, 61 tRNA, and 16 rRNA genes); however, 1,496 of these proteins (about 54%) are annotated with only “hypothetical protein” in their “gene product” or “description” field. This is very confusing, since many of these genes are 100% identical to proteins that have specific and well documented functions in other Staph strains.

Here is one example:

hypothetical protein SAOUHSC_00010 [Staphylococcus aureus subsp. aureus NCTC 8325]

NCBI Reference Sequence: YP_498618.1

FEATURES             Location/Qualifiers
     source          1..231
                     /organism="Staphylococcus aureus subsp. aureus NCTC 8325"
                     /strain="NCTC 8325"
                     /sub_species="aureus"
                     /db_xref="taxon:93061"
     Protein         1..231
                     /product="hypothetical protein"

GenBank knows that this is not the correct annotation for this protein. In the “Region” sub-field of the record (which is very rarely used by automated annotation tools that take data from GenBank) an appropriate function, COG and a CDD conserved domain are noted:

     Region          7..231
                     /region_name="AzlC"
                     /note="Predicted branched-chain amino acid permease (azaleucine resistance)
                     [Amino acid transport and metabolism]; COG1296"
                     /db_xref="CDD:224215"
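An annotation tool could in principle read that “Region” feature instead of stopping at the product field. The following is a minimal sketch, assuming a GenPept-style flat-file layout with feature keys indented 5 spaces and qualifiers indented 21 spaces. The names `parse_features` and `suggest_function` are hypothetical helpers written for this illustration, and the embedded record is an abridged copy of the excerpt above rather than a live download.

```python
import re

# Abridged feature table modeled on the YP_498618.1 excerpt (not fetched from NCBI).
RECORD = '''\
     Protein         1..231
                     /product="hypothetical protein"
     Region          7..231
                     /region_name="AzlC"
                     /note="Predicted branched-chain amino acid permease
                     (azaleucine resistance) [Amino acid transport and
                     metabolism]; COG1296"
                     /db_xref="CDD:224215"
'''


def parse_features(text):
    """Return a list of (feature_key, location, qualifiers_dict) tuples."""
    features = []
    last_key = None  # qualifier that a continuation line would extend
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        indent = len(line) - len(line.lstrip())
        if stripped.startswith("/"):
            # Qualifier line such as /product="hypothetical protein"
            match = re.match(r"/(\w+)=(.*)", stripped)
            if match and features:
                last_key = match.group(1)
                features[-1][2][last_key] = match.group(2).strip('"')
            else:
                last_key = None
        elif indent < 21:
            # Feature header line such as "Region  7..231"
            parts = stripped.split(None, 1)
            if len(parts) == 2:
                features.append((parts[0], parts[1], {}))
            last_key = None
        elif features and last_key:
            # Continuation of a multi-line qualifier value
            features[-1][2][last_key] += " " + stripped.strip('"')
    return features


def suggest_function(features):
    """Fall back to a Region annotation when the product is uninformative."""
    product = next(
        (quals.get("product") for key, _, quals in features if key == "Protein"),
        None,
    )
    if product != "hypothetical protein":
        return product
    for key, _, quals in features:
        if key == "Region" and "region_name" in quals:
            return f'{quals["region_name"]}: {quals.get("note", "no note")}'
    return product


if __name__ == "__main__":
    print(suggest_function(parse_features(RECORD)))
```

A production pipeline would more likely use an established flat-file parser; the point here is only that the functional hint is present in the record and reachable with very little logic.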

NCBI also links this gene to an “Identical Protein Group” where 3957 proteins are listed with 100% amino acid identity, which are annotated variously as: “azaleucine resistance protein AzlC”, “branched-chain amino acid ABC transporter permease”, “AzlC”, and “Inner membrane protein YgaZ”. A very conservative annotation bot might panic at this level of inconsistency and default to the lowest common denominator of “hypothetical protein”. However, a more sophisticated automaton might compare the protein sequence to PFAM or COG protein functional families and assign a common annotation to them all.
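As a rough illustration of how a bot could avoid the “lowest common denominator” outcome, here is a minimal sketch of a simple vote over the names in an identical protein group. This is a deliberately simpler stand-in for the PFAM or COG comparison suggested above, not an implementation of it. The function name `consensus_annotation`, the list of uninformative labels, and the per-name counts in the example are all assumptions made up for this sketch; only the four annotation strings come from the NCBI group described in the text.

```python
from collections import Counter

# Labels treated as carrying no functional information (an assumed list).
UNINFORMATIVE = {
    "hypothetical protein",
    "conserved hypothetical protein",
    "uncharacterized protein",
}


def consensus_annotation(names):
    """Pick the most common informative name in an identical protein group.

    Returns (name, support), where support is the share of informative
    names that agree with the winner. Falls back to "hypothetical protein"
    only when no member of the group has an informative name.
    """
    informative = [
        name.strip() for name in names
        if name.strip().lower() not in UNINFORMATIVE
    ]
    if not informative:
        return "hypothetical protein", 0.0
    winner, count = Counter(informative).most_common(1)[0]
    return winner, count / len(informative)


if __name__ == "__main__":
    # Made-up counts; the real group lists 3957 proteins.
    group = (
        ["hypothetical protein"] * 5
        + ["azaleucine resistance protein AzlC"] * 3
        + ["branched-chain amino acid ABC transporter permease"] * 2
        + ["AzlC", "Inner membrane protein YgaZ"]
    )
    name, support = consensus_annotation(group)
    print(f"{name} (support {support:.2f} among informative names)")
```

A vote like this treats “AzlC” and “azaleucine resistance protein AzlC” as different names, which is exactly the inconsistency that mapping every member to a shared protein family would resolve.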

The incorrect “hypothetical” annotations for Staph genes in GenBank can be found downstream in many other databases, such as the Database of Essential Genes, AureoWiki, KEGG, and UniProt, all of which take their primary annotation from GenBank. So someone sequencing a new strain of Staph and using any of these resources to annotate predicted genes will probably end up assigning “hypothetical protein” to the AzlC gene and many hundreds of others, perpetuating the cycle of misinformation.

In many other cases, it does not seem possible for an algorithm to resolve messy annotations that a human expert might be able to figure out. For example, Staph strain COL has many hypothetical genes such as SACOL1097. Its NCBI Identical Protein Group also shows only “hypothetical protein” annotations. However, a BLAST search shows 95% identity to nitrogen fixation protein NifR.

hypothetical protein SACOL1097 [Staphylococcus aureus subsp. aureus COL]

GenBank: AAW37977.1

LOCUS AAW37977 59 aa linear BCT 31-JAN-2014

DEFINITION hypothetical protein SACOL1097 [Staphylococcus aureus subsp. aureus COL].

ACCESSION AAW37977

VERSION AAW37977.1

DBLINK BioProject: PRJNA238

BioSample: SAMN02603996

DBSOURCE accession CP000046.1

SOURCE Staphylococcus aureus subsp. aureus COL


Stuart Brown
