Public databases of genetic information have a fundamental garbage-in, garbage-out
problem. A huge number of useful databases are populated by pulling information
from other databases and adding new value by computational inferences, but
automated linking of databases can propagate incorrect information. The
curators of primary repositories such as GenBank make a substantial effort to
publish only correct information, so they are very conservative and annotate
genes only with verifiable information. NCBI also has a policy that
the original depositor of any given entry (a gene, protein, genome, experimental
dataset, etc.) is the author of its annotation and metadata, and no one else
can alter it.
Staphylococcus aureus is an important human pathogen with perhaps the largest
number of whole genome sequences in public repositories of any bacterium. NCBI
has 8367 Staph genomes in its “Genomes” section (as of Jan 1, 2018), and another
~40,000 in the SRA and Whole Genome Shotgun sections. However, GenBank has chosen
strain NCTC 8325 as the Reference Genome for Staph, and put its genes in RefSeq.
This genome was sequenced, annotated, and submitted on 27-Jan-2006 by Gillaspy
et al. from the Oklahoma Health Sciences Center. As a result of this “Reference
Genome” designation, an automatic lookup of a Staph gene in GenBank is likely to
get the annotation from NCTC 8325. This particular Staph genome has 2,767 protein
coding genes (plus 30 pseudogenes, 61 tRNA, and 16 rRNA genes). However, 1,496 of
these proteins (about 54%) are annotated only as “hypothetical protein” in their
“gene product” or “description” field. This is very confusing, since many of
these genes are 100% identical to proteins that have specific and well
documented functions in other Staph strains.
Here is one example:
hypothetical protein SAOUHSC_00010 [Staphylococcus aureus subsp. aureus NCTC 8325]
NCBI Reference Sequence: YP_498618.1
FEATURES Location/Qualifiers
source 1..231
  /organism="Staphylococcus aureus subsp. aureus NCTC 8325"
  /strain="NCTC 8325"
  /sub_species="aureus"
  /db_xref="taxon:93061"
Protein 1..231
  /product="hypothetical protein"
GenBank knows that this is not the correct annotation for this protein. In the “Region” sub-field of the record (which is very rarely used by automated annotation tools that take data from GenBank), an appropriate function, a COG, and a CDD conserved domain are noted:
/region_name="AzlC"
/note="Predicted branched-chain amino acid permease (azaleucine resistance)
[Amino acid transport and metabolism]; COG1296"
/db_xref="CDD:224215"
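As a rough illustration of how an annotation tool could use the “Region” information instead of ignoring it, here is a minimal Python sketch. It assumes the record is available as plain text with GenBank-style /name="value" qualifiers. The function names (qualifiers, suggest_product) are hypothetical, the sample text contains only the qualifier lines quoted in this post, and this is not how any particular pipeline actually works.

```python
import re

# GenBank-style qualifier: /name="value". The value may wrap across lines.
QUALIFIER = re.compile(r'/(\w+)="([^"]*)"')


def qualifiers(record_text):
    """Return (name, value) pairs, collapsing line wraps inside values."""
    return [(name, " ".join(value.split()))
            for name, value in QUALIFIER.findall(record_text)]


def suggest_product(record_text):
    """Return the product, or Region evidence when the product is a placeholder."""
    quals = qualifiers(record_text)
    product = next((v for n, v in quals if n == "product"), None)
    if product is None or product.lower() != "hypothetical protein":
        return product
    # Fall back on the Region feature: its name and its free-text note.
    region = next((v for n, v in quals if n == "region_name"), None)
    note = next((v for n, v in quals if n == "note"), None)
    if region and note:
        return f"{region}: {note}"
    return region or note or product


# Qualifier lines as quoted in this post (feature locations omitted).
SAMPLE = '''
/product="hypothetical protein"
/region_name="AzlC"
/note="Predicted branched-chain amino acid permease (azaleucine resistance)
[Amino acid transport and metabolism]; COG1296"
/db_xref="CDD:224215"
'''

print(suggest_product(SAMPLE))
```

For this record the sketch reports the AzlC region and its COG note rather than “hypothetical protein”. A real tool would still want a human or a curated rule set to decide whether a domain hit justifies a full product name.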
NCBI also links this gene to an “Identical Protein Group”
where 3957 proteins are listed with 100% amino acid identity, which are
annotated variously as: “azaleucine resistance protein AzlC”, “branched-chain
amino acid ABC transporter permease”, “AzlC”, and “Inner membrane protein YgaZ”.
A very conservative annotation bot might panic at this level of inconsistency
and default to the lowest common denominator of “hypothetical protein”.
However, a more sophisticated automaton might compare the protein sequence to Pfam
or COG protein functional families and assign a common annotation to them all.
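A much simpler rule than a Pfam or COG comparison already avoids the “lowest common denominator” outcome: ignore placeholder names and only fall back to “hypothetical protein” when nothing informative exists in the group. The Python sketch below illustrates that rule. The function name summarize_group and the placeholder list are assumptions made for illustration. The post does not give per-name counts for the 3957 proteins, so each name appears once here and the tie is broken by whichever informative name is seen first.

```python
from collections import Counter

# Names that carry no functional information (an illustrative, incomplete list).
PLACEHOLDERS = {
    "hypothetical protein",
    "conserved hypothetical protein",
    "uncharacterized protein",
}


def summarize_group(names):
    """Pick a product name for a group of identical proteins.

    Returns (chosen_name, candidates): the most frequent informative name
    (ties go to the first one encountered) plus all distinct informative
    names, so a human curator can review the alternatives.
    """
    informative = [n for n in names if n.strip().lower() not in PLACEHOLDERS]
    if not informative:
        return "hypothetical protein", []
    counts = Counter(informative)
    best, _ = counts.most_common(1)[0]
    return best, sorted(counts)


# Names quoted in this post for the identical protein group, one copy each.
group = [
    "hypothetical protein",
    "azaleucine resistance protein AzlC",
    "branched-chain amino acid ABC transporter permease",
    "AzlC",
    "Inner membrane protein YgaZ",
]

best, candidates = summarize_group(group)
print(best)
print(candidates)
```

With real group data the counts would do the work of choosing. The useful part is that the alternatives are kept for review instead of being discarded.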
The incorrect “hypothetical” annotations for Staph genes in
GenBank can be found downstream in many other databases, such as the Database
of Essential Genes, AureoWiki, KEGG, and UniProt, which all import their
primary annotation from GenBank. So someone sequencing a new strain of Staph
and using any of these resources to annotate predicted genes will probably end
up assigning “hypothetical protein” for the AzlC gene and many hundreds of
others, perpetuating the cycle of misinformation.
In many other cases, an algorithm does not seem able to resolve messy
annotations that a human expert might be able to figure out. For example,
Staph strain COL has many hypothetical genes such as SACOL1097. The NCBI Identical
Proteins report for it also shows only “hypothetical protein” annotations. However,
a BLAST search shows 95% identity to nitrogen fixation protein NifR.
hypothetical protein SACOL1097 [Staphylococcus aureus subsp. aureus COL]
GenBank: AAW37977.1
LOCUS AAW37977 59 aa linear BCT 31-JAN-2014
DEFINITION hypothetical protein SACOL1097 [Staphylococcus aureus subsp. aureus COL].
ACCESSION AAW37977
VERSION AAW37977.1
DBLINK BioProject: PRJNA238
  BioSample: SAMN02603996
DBSOURCE accession CP000046.1
SOURCE Staphylococcus aureus subsp. aureus COL