Next-Gen Sequencing: The Tardigrade Miscalculation

Jan 28, 2016

The Tardigrade Miscalculation

There was a lot of publicity back in November about the genome sequence of the Tardigrade (Hypsibius dujardini), a small animal (0.05 to 1 mm) that is somewhat similar to nematodes. These are fascinating little creatures that have been described as incredibly resistant to all manner of physical stress: high and low temperatures (reportedly from -272 °C to +151 °C), high pressure, complete vacuum (Tardigrades in Space = TARDIS {I kid you not}), and ionizing radiation. They can also survive without food or water for more than 10 years as kind of a dehydrated little lump.

Image credit: Tippett Studio/Cosmos: A Spacetime Odyssey

The reason the genome of the Tardigrade was such big news in November is that the group doing the bioinformatics analysis claimed that the genome contained 6,663 genes from bacteria, a full sixth of the genome, and twice as many horizontally transferred genes as have ever been seen in any other organism (Boothby et al., PNAS 112(52):15976-81. doi: 10.1073/pnas.1510461112. PMID: 26598659). This "weird science" observation was covered by National Geographic, Science News, Phys.org, Meta Science News, and of course the Univ. of North Carolina press site.

However, it seems quite clear now that this claim about horizontally transferred DNA from bacteria (and maybe other phyla) in the genome of the Tardigrade was wrong. Another group working on the sequence of the exact same species (Georgios Koutsovoulos, Sujai Kumar, Dominik R Laetsch, Lewis Stevens, Jennifer Daub, Claire Conlon, Habib Maroon, Fran Thomas, Aziz Aboobaker, Mark Blaxter) has rapidly posted a manuscript on the bioRxiv preprint server, "The genome of the tardigrade Hypsibius dujardini". It clearly refutes the claims of Boothby et al. and points out their mistakes in genome analysis: "Cross-comparison of the assemblies, using raw read and RNA-Seq data, confirmed that the overwhelming majority of the putative HGT candidates in the previous genome were predicted from scaffolds at very low coverage and were not transcribed."

It is quite easy to get contaminants when you are doing whole genome sequencing for a multicellular organism. You grind up your target species, extract DNA and put it into the sequencing machine. Any bacteria and other small organisms on the surface or in the gut come along for the ride and can contribute their DNA to the sequencing library. Surprisingly, a small amount of contaminating bacterial DNA (perhaps just 1%) can lead to a large number of bacterial contigs in the final genome assembly. I can think of a couple of reasons for this, based on the small size of bacterial genomes (on the order of 1 Mb) vs. metazoan genomes (most >100 Mb). First, the relative genome coverage of a contaminating bacterium will be much higher for each kb of sequence data, so the 1% of contaminating DNA may still give deep coverage of a bacterial genome. Second, any two bacterial DNA fragments randomly selected from a library have a much higher chance of overlapping (smaller, less complex genome), so they will assemble better.
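The coverage argument can be checked with a little arithmetic. This is a minimal sketch, not an analysis of the real data: the 10 Gb run size is a made-up number, the genome sizes are the round figures from the paragraph above, and the 1% of contaminating DNA is assumed to come from a single bacterial species.

```python
def fold_coverage(total_bases, fraction, genome_size):
    """Average sequencing depth: bases drawn from a genome / genome length."""
    return total_bases * fraction / genome_size

# Illustrative numbers only (assumptions, not from any real run).
total_bases = 10_000_000_000      # hypothetical 10 Gb of sequence data
contaminant_fraction = 0.01       # 1% of the library is bacterial DNA
bacterial_genome = 1_000_000      # ~1 Mb contaminant genome
target_genome = 100_000_000       # ~100 Mb metazoan genome

bacterial_cov = fold_coverage(total_bases, contaminant_fraction, bacterial_genome)
target_cov = fold_coverage(total_bases, 1 - contaminant_fraction, target_genome)

print(f"bacterial contaminant: {bacterial_cov:.0f}x")
print(f"target organism:       {target_cov:.0f}x")
```

With these numbers the contaminant ends up at about the same depth as the target (roughly 100x versus 99x), which is more than enough for an assembler to build long bacterial contigs. A contaminant present at a much lower fraction would instead show up as low-coverage scaffolds.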

There are a few QC steps that one can take on the raw data. There is a nice tool called Kraken (Wood DE, Salzberg SL, Genome Biology 2014, 15:R46) that can quickly run through an entire FASTQ file (4 million reads per minute on a single core) and label each read according to a set of reference genomes based on exact matching of 31-base k-mers. The Kraken team also makes available a pre-built 4 GB database constructed from complete bacterial, archaeal, and viral genomes in RefSeq. DeconSeq is another good tool for finding contaminants, with an easy web interface. Of course, some legitimate reads from any target organism will share k-mer sized chunks with some bacteria, viruses, etc. (and some sequences from contaminating bacteria will not be in any database), so one has to make some tough choices about what to remove from the data before assembly.
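To make the idea of exact k-mer matching concrete, here is a toy sketch in Python. It is not Kraken's implementation: Kraken maps k-mers onto a taxonomy and handles both strands, while this version just counts shared k-mers per reference label on the forward strand. The function names, the `min_hits` threshold, and the reference labels are all hypothetical.

```python
from collections import Counter

K = 31  # k-mer length mentioned in the text; use a smaller k for toy data

def kmers(seq, k=K):
    """Yield every overlapping k-mer of seq (forward strand only)."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def build_index(references, k=K):
    """Map each k-mer to the set of reference labels that contain it."""
    index = {}
    for label, seq in references.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(label)
    return index

def classify_read(read, index, k=K, min_hits=1):
    """Return the label sharing the most exact k-mers with the read, or None."""
    hits = Counter()
    for kmer in kmers(read, k):
        for label in index.get(kmer, ()):
            hits[label] += 1
    if not hits:
        return None  # no k-mer matched any reference: read stays unclassified
    label, count = hits.most_common(1)[0]
    return label if count >= min_hits else None

# Toy usage with made-up sequences and a short k so the example is readable.
toy_refs = {
    "bacterium": "ATGCGTACGTTAGCCGATAGCTAGGCTA",
    "phage": "TTTTGGGGCCCCAAAATTTTGGGGCCCC",
}
toy_index = build_index(toy_refs, k=5)
print(classify_read("CGTACGTTAGCC", toy_index, k=5))
print(classify_read("AAAAAAAAAA", toy_index, k=5))
```

Raising `min_hits` is one way to be more conservative about the "tough choices" mentioned above, since a single shared k-mer between a genuine target read and a bacterial reference would otherwise be enough to label the read as a contaminant.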

After assembly, there are some additional steps one can take to flag contaminants. It is extremely helpful (I would now say required) to have some RNA-seq data from the same organism. When the RNA-seq library is prepared using a poly-A selection protocol, bacterial RNA contaminants should be essentially absent. Any contigs (with predicted genes) that do not have a reasonable number of aligned RNA-seq reads are highly suspect. Any contig whose predicted genes all match a different species is clearly a red flag.
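The two post-assembly checks described above can be sketched as a simple filter. This is only an illustration under stated assumptions: the per-contig numbers would come from your own read aligner and gene annotation, and the `ContigStats` fields, the taxon labels, and the `min_reads_per_kb` threshold are hypothetical choices, not recommended values.

```python
from dataclasses import dataclass, field

@dataclass
class ContigStats:
    name: str
    length: int                 # contig length in bases
    rnaseq_reads: int           # RNA-seq reads aligned to this contig
    gene_hit_taxa: list = field(default_factory=list)  # best-hit taxon per predicted gene

def flag_contig(contig, target_taxon, min_reads_per_kb=1.0):
    """Return a list of reasons this contig looks like a contaminant (empty if none)."""
    reasons = []
    if not contig.gene_hit_taxa:
        return reasons  # no predicted genes, so neither check applies
    reads_per_kb = contig.rnaseq_reads / (max(contig.length, 1) / 1000)
    if reads_per_kb < min_reads_per_kb:
        reasons.append("predicted genes but little RNA-seq support")
    if all(taxon != target_taxon for taxon in contig.gene_hit_taxa):
        reasons.append("all predicted genes match other taxa")
    return reasons

# Made-up contigs for illustration.
contigs = [
    ContigStats("scaffold_1", 50_000, 1200, ["Metazoa", "Metazoa"]),
    ContigStats("scaffold_2", 8_000, 0, ["Bacteria", "Bacteria"]),
]
for c in contigs:
    print(c.name, flag_contig(c, target_taxon="Metazoa"))
```

Flagged contigs are candidates for a closer look rather than automatic removal, since a genuinely transferred gene would also match another taxon; the RNA-seq and coverage evidence is what helps tell the two cases apart.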

While the authors of the original paper have not (yet) published a retraction, the citation in PubMed does carry a link to the refuting article, provided by author Sujai Kumar.

Rather than rant on about proper workflows for genome annotation (a best practices document does exist: Mark Yandell & Daniel Ence, Nature Reviews Genetics 13, 329-342 doi:10.1038/nrg3174), let me just say to the authors, the reviewers and the editors at PNAS that "EXTRAORDINARY CLAIMS REQUIRE EXTRAORDINARY EVIDENCE" (Carl Sagan). Or as said by Laplace: "The weight of evidence for an extraordinary claim must be proportioned to its strangeness."

13 comments:

Sujai said...: Thanks for this post. (Disclaimer: I'm one of the authors on the followup tardigrade genome paper "No evidence for extensive horizontal gene transfer in the genome of the tardigrade Hypsibius dujardini". This latest version includes a supp info section where each of the Boothby et al. claims is refuted in detail.)

I have great admiration for Mark Yandell and his group's work - we use the MAKER gene finding software for many of our projects. And I agree that they have an excellent best-practices workflow. However, I'd just like to point out that Yandell was a co-author on the Boothby et al paper that proposed 17% HGT. I'm guessing that Yandell's group was asked to find genes on the tardigrade genome assembly without the bacterial contaminants removed and they ran their MAKER pipeline on it. As we show at gist.github.com/GDKO/bc507bc9b620e6006a44 - a eukaryotic gene finder will find "eukaryotic genes", complete with introns, even on a pure bacterial genome (E. coli in the github gist example).

This might be something for you and readers of your excellent blog to keep in mind, that a great tool (or model) applied to the wrong data can easily result in wrong results.; Mar 8, 2016, 3:39:00 AM
Anonymous said...: The group involved with the sequencing of this genome was already looking for horizontal gene transfer even before the assembly was annotated (based on their own lab experiments). It was a young graduate student from Mark's lab who assisted with running MAKER, and she both identified and pointed out the contamination in the assembly to the Tardigrade group. They unfortunately passed over her concerns.; Oct 13, 2019, 2:21:00 PM

Stuart Brown
