Next-Gen Sequencing: Contaminated Genomes

Dec 22, 2017

Contaminated Genomes

This is a long post, so first a quick summary. Some genome sequences contain contaminants. These contaminants create many problems when we use a trusted resource like GenBank or UniProt to summarize the sequences in a taxonomic group. I have illustrated one typical example, but there are thousands (maybe tens of thousands) of others.

I have been obsessing over errors and contamination in our public sequence databases. This week I was trying to use UniProt as a set of reference sequences for fungi. Our goal is fairly simple: to find the fungal DNA in a metagenomic shotgun sequence sample, which is just a mixture of all the DNA present in a scraping from the mouth, throat, or any other body site.

UniProt makes it quite easy to sort all their proteins by taxonomy, and to download a subset of the data clustered at 100% (combining all exact duplicate sequences), 90%, or 50% amino acid identity. One might expect that fungal genes should not match bacteria at more than 50% identity. But surprisingly there are quite a lot of 50% and 90% clusters that contain both bacterial and fungal sequences (about 3000 of the 90% fungal clusters also contain bacterial proteins).

The UniProt support staff provided some very useful help to build a query on their system that finds only those clusters of 90% identical proteins that contain fungal genes, but NO (NO!) bacterial genes. In case you like this sort of thing, here is the exact query:

uniprot:(taxonomy:"Fungi [4751]") NOT taxonomy:"Bacteria [2]" AND identity:0.9
[Note the careful use of quote marks, parentheses, and square brackets; this stuff is rather tricky.]

So I downloaded this set of putative fungal proteins (UniProt very helpfully creates a single 'representative' UniRef sequence in FASTA format for each cluster). I tested the fungal proteins against all the gene coding sequences (CDS) from the E. coli genome using BLASTx. Once again, there were far too many high-similarity matches.
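For anyone who wants to repeat this kind of screen, here is a minimal sketch of how the high-similarity matches could be pulled out of BLAST tabular output (`-outfmt 6`). The 90% identity and 100-residue length cutoffs, the function name, and the example rows are my own assumptions for illustration; they are not the settings or results from the search described above.

```python
import csv
import io

# Standard column order of BLAST tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def high_identity_hits(handle, min_pident=90.0, min_length=100):
    """Yield (query, subject, pident, length) for hits passing both cutoffs.

    The cutoffs are illustrative assumptions, not values from the post.
    """
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        pident = float(hit["pident"])
        length = int(hit["length"])
        if pident >= min_pident and length >= min_length:
            yield hit["qseqid"], hit["sseqid"], pident, length


# Made-up example rows (hypothetical IDs), not real BLAST output.
EXAMPLE = (
    "ecoli_cds_1\tUniRef90_X1\t98.0\t700\t14\t0\t1\t2100\t1\t700\t0.0\t1400\n"
    "ecoli_cds_2\tUniRef90_X2\t45.5\t210\t100\t5\t10\t640\t3\t212\t1e-40\t160\n"
    "ecoli_cds_3\tUniRef90_X3\t95.0\t40\t2\t0\t1\t120\t1\t40\t1e-15\t80\n"
)

hits = list(high_identity_hits(io.StringIO(EXAMPLE)))
print(hits)
```

The length cutoff matters as much as the identity cutoff: a short, highly conserved motif can score 95% identity over 40 residues without saying anything about contamination.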

One of the top matches is to a gene (Guanosine-3',5'-bis(diphosphate) 3'-pyrophosphohydrolase) from the fungus Beauveria bassiana that has 98% identity to E. coli. Since I am in an obsessive mood about this sort of thing, I decided that for this one example, I would collect some evidence to decide whether we have strong sequence homology between bacteria and fungi for this gene, whether Beauveria bassiana acquired it by horizontal gene transfer, or whether E. COLI CAN BE A CONTAMINANT IN GENOME SEQUENCES (!!!) [emphasis mine].

I put this Beauveria gene into a generic NCBI BLAST against all 'nr' proteins, and I got a very interesting result. There are exactly two matches to eukaryotes (Beauveria and a nematode), and 11,858 matches to bacteria, including lots of E. coli.
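The lopsided count above is the whole argument, so it is worth making it explicit. This is a small sketch, assuming each BLAST hit has already been labeled with a superkingdom by some taxonomy lookup (that lookup is not shown, and the accession strings are hypothetical placeholders). Only the two totals come from the search described above.

```python
from collections import Counter


def tally_by_superkingdom(hits):
    """Count hits per superkingdom from (accession, superkingdom) pairs."""
    return Counter(kingdom for _accession, kingdom in hits)


def eukaryote_fraction(counts):
    """Fraction of all hits that are eukaryotic (0.0 if there are no hits)."""
    total = sum(counts.values())
    return counts.get("Eukaryota", 0) / total if total else 0.0


# Placeholder accessions; the counts (2 and 11,858) are the ones reported above.
labeled_hits = [("euk_hit_%d" % i, "Eukaryota") for i in range(2)]
labeled_hits += [("bac_hit_%d" % i, "Bacteria") for i in range(11858)]

counts = tally_by_superkingdom(labeled_hits)
fraction = eukaryote_fraction(counts)
print(counts)
print("eukaryote fraction: %.6f" % fraction)
```

A gene that is genuinely conserved between fungi and bacteria should turn up in many fungal genomes, not in two eukaryotic assemblies out of nearly twelve thousand hits.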

So I traced the Beauveria bassiana protein in UniProt back to its source as a whole genome shotgun sequence uploaded to GenBank on Nov 3, 2014 by the Institute of Plant Protection, Jilin Academy of Agricultural Sciences, Accession PRJNA178080, WGS ANFO00000000, Assembly GCA_000770705.1.

I downloaded the whole genome assembly and BLASTed it with the E. coli hydrolase gene from above. This very quickly pinpointed a contig, 00271 (ANFO01000251.1 Beauveria bassiana D1-5 contig00271), that contains the matching sequence. The contig is 72,232 bases long. I then put this contig into NCBI BLAST against Bacteria. I got matches that correspond to lots of bacterial genes (POL I, RecG, iPGM, XanP, CpxA, GTP binding protein, GSI beta, and my ppGpp hydrolase), all with >90% identity and BLAST e-value 0.0.
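One way to turn a list of hits like this into a single number is to ask what fraction of the contig is covered by bacterial alignments. The sketch below merges overlapping hit intervals and divides by the contig length. The interval coordinates are invented for illustration (I did not record the real ones here); only the contig length of 72,232 comes from the assembly described above.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent 1-based inclusive intervals.

    Each interval may be given in either orientation, since BLAST reports
    reverse-strand hits with start > end.
    """
    merged = []
    for start, end in sorted((min(s, e), max(s, e)) for s, e in intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(m) for m in merged]


def covered_bases(intervals):
    """Number of distinct contig positions covered by at least one hit."""
    return sum(end - start + 1 for start, end in merge_intervals(intervals))


CONTIG_LENGTH = 72232  # length of contig00271 as stated above

# Hypothetical hit coordinates on the contig, for illustration only.
example_hits = [(1, 3000), (2500, 6000), (9000, 7001), (20000, 30000)]

merged = merge_intervals(example_hits)
covered = covered_bases(example_hits)
print(merged)
print("covered: %d of %d bases (%.1f%%)"
      % (covered, CONTIG_LENGTH, 100.0 * covered / CONTIG_LENGTH))
```

A contig where bacterial hits at >90% identity tile most of the length is a very different thing from a fungal contig with one bacterial-looking gene in the middle, which is what a horizontal transfer would look like.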

Final answer: This is a contaminant. Some E. coli DNA was sequenced and assembled with the Beauveria DNA, and nobody checked before loading these sequences into GenBank.

My recommendation to GenBank and de novo genome sequencers everywhere is to check all predicted proteins from new genomes for matches to bacteria and human before loading them into a trusted database.


Stuart Brown