We are working on a new de novo sequence for the little skate (Leucoraja erinacea), an interesting system for studying developmental biology. The skate has a BIG genome, estimated at 4 Gb (bigger than human), so this is going to be a difficult project. The existing skate genome in GenBank and on the SkateBase website is not in very good shape (3 million contigs with an N50 size of 665 bp).
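For readers who have not met the N50 statistic: sort the contigs from longest to shortest and add up their lengths until you reach half of the total assembly size; the length of the contig that gets you there is the N50. Here is a minimal sketch of that calculation. The function name `n50` and the toy lengths are made up for illustration and are not the skate data.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths.

    The N50 is the length L such that contigs of length >= L
    together cover at least half of the total assembly.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths given")


# Toy example (hypothetical lengths in bp, not real contigs)
toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_contigs))
```

An N50 of 665 bp means that half of the assembled bases sit in contigs of 665 bp or shorter, which is far too fragmented for most gene-level work.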
We got a couple of preliminary Oxford MinION reads from skate DNA - not nearly enough coverage to make a dent in this project, we are just having a look at the data. Oxford produces two kinds of data (in their own annoying FAST5 format, but I won't rant about that right now): single-pass 1D and double-pass 2D. [My bioinformatics programmer Yuhan Hao did this analysis.] Here is what our data looks like.
So the 1D reads are really long - some more than 50 kb. The 2D reads are mostly 2-10 kb. SkateBase has a complete contig of the mitochondrial genome, so we were able to align the Oxford sequences to it as a reference. Coverage was low, but we do have some regions where both 1D and 2D reads match the reference. What we can see is that the 1D reads have a lot of SNPs vs the reference, while the 2D reads have very few SNPs, so it is clear that the 2D reads have been successfully error corrected. Strangely, both the 1D and 2D reads have a lot of insertion-deletion errors (several per hundred bases) compared to the reference, and the indels in the 1D reads do not match the indels in the 2D reads, so we consider all of them to be novel, uncorrected errors.
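To put numbers like "several per hundred bases" on a read, one option is to tally the operations in its alignment CIGAR string. The sketch below is not the analysis we ran; it is a rough illustration that assumes the aligner wrote extended CIGAR strings, where `=` is a match and `X` is a mismatch (a plain `M` cannot separate the two). The function name `error_rates` and the example CIGAR are hypothetical.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def error_rates(cigar):
    """Return mismatches, inserted bases and deleted bases per 100
    reference bases covered by the alignment.

    Assumes an extended CIGAR string ('=' and 'X' instead of 'M').
    """
    counts = {"match": 0, "mismatch": 0, "insertion": 0, "deletion": 0}
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op == "=":
            counts["match"] += n
        elif op == "X":
            counts["mismatch"] += n
        elif op == "I":
            counts["insertion"] += n
        elif op == "D":
            counts["deletion"] += n
        elif op == "M":
            raise ValueError("plain 'M' cannot separate matches from mismatches")
        # S, H, N and P do not count as sequencing errors here.

    ref_bases = counts["match"] + counts["mismatch"] + counts["deletion"]
    if ref_bases == 0:
        raise ValueError("alignment covers no reference bases")
    return {
        key: 100.0 * counts[key] / ref_bases
        for key in ("mismatch", "insertion", "deletion")
    }


# Hypothetical alignment: 97 matches, 1 mismatch, 2 inserted and 2 deleted bases
print(error_rates("50=1X20=2I27=2D"))
```

Comparing these rates between 1D and 2D reads over the same stretch of the mitochondrial contig would show the pattern described above: the mismatch rate drops in the 2D reads while the insertion and deletion rates stay high.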
poor quality Oxford sequences. With more coverage, we may be able to use the Oxford reads as scaffolds for our de novo assembly. It may be possible to use Illumina reads for error correction and to mark all uncorrected areas of the Oxford sequences as low quality, but that is not the usual method for submitting draft genomes to GenBank.