Next-Gen Sequencing: File Formats

Oct 20, 2008

File Formats

What is it with bioinformatics people and file formats?!

Why is it so bloody hard to agree on a single standard for representing sequence data (with quality scores), and another for sequence reads aligned to a reference genome? With so many formats, we all spend absurd amounts of time writing converters between every possible pair of them.

Here are some of the file formats that I've dealt with in the past couple of weeks:

SEQUENCE FORMATS

FASTQ

Sequence plus Phred quality scores, with each score encoded as a single ASCII character

@NCYC361-11a03.q1k bases 1 to 1576
GCGTGCCCGAAAAAATGCTTTTGGAGCCGCGCGTGAAAT
+NCYC361-11a03.q1k bases 1 to 1576
!)))))****(((***%%((((*(((+,**(((+**+,-
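
To make "encoded as single ASCII characters" concrete, here is a minimal sketch that decodes the quality line of the record above. It assumes the Sanger convention of an ASCII offset of 33; the function names `decode_phred` and `error_probability` are made up for illustration and do not come from any particular library.

```python
def decode_phred(quality, offset=33):
    """Turn a FASTQ quality string into a list of integer Phred scores.

    offset=33 is the Sanger FASTQ convention (an assumption for this file).
    """
    return [ord(ch) - offset for ch in quality]


def error_probability(q):
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


quality = "!)))))****(((***%%((((*(((+,**(((+**+,-"
scores = decode_phred(quality)

# '!' is ASCII 33, so it decodes to 0; ')' is ASCII 41, so it decodes to 8.
print(scores[:6])
print(error_probability(scores[1]))
```

The decoded list has one score per base, so its length should always match the length of the sequence line.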

Solexa/Illumina FASTQ-like thing...

s_*_sequence.txt

@HWI-EAS305_3-30gf5aaxx:8:1:415:1852
GTTAGATTTTGTGTAACTTGCATGTAATGTTAAAA
+HWI-EAS305_3-30gf5aaxx:8:1:415:1852
YYYYYYYYYYYYVYYYYYYVYYYYYYYYVYVVTUU
@HWI-EAS305_3-30gf5aaxx:8:1:187:1286
GTTACACTGAAAAACAAATTCGTTGGAAACGGGAT
+HWI-EAS305_3-30gf5aaxx:8:1:187:1286
YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYTVVVV
@HWI-EAS305_3-30gf5aaxx:8:1:202:440
GTGAAAAATGAGAAATGCACACTGAAGGACCTGGA
+HWI-EAS305_3-30gf5aaxx:8:1:202:440
YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYVVUVV
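
The quality characters here sit much higher in the ASCII table than in the Sanger example, which is what you would expect from the Solexa/Illumina habit of using an offset of 64 rather than 33. A rough sketch of re-encoding such a quality string to the Sanger offset follows. The offset of 64 is an assumption about these particular files, the function name is hypothetical, and the sketch does not convert Solexa log-odds scores to true Phred scores, which pipelines of this era may require.

```python
def illumina_to_sanger(quality, src_offset=64, dst_offset=33):
    """Re-encode a quality string from one ASCII offset to another.

    Assumes src_offset=64 (Solexa/Illumina style) and dst_offset=33 (Sanger).
    Scores below zero are clamped to zero, which is a crude simplification.
    """
    out = []
    for ch in quality:
        score = max(ord(ch) - src_offset, 0)
        out.append(chr(score + dst_offset))
    return "".join(out)


qual = "YYYYYYYYYYYYVYYYYYYVYYYYYYYYVYVVTUU"
converted = illumina_to_sanger(qual)

# 'Y' (ASCII 89) is score 25 under offset 64, which is ':' under offset 33.
print(converted)
```

Only the quality line changes; the header and sequence lines of each record are copied through untouched.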

s_*_eland_extended.txt

Solexa output format from Eland extended

>HWI-EAS305_3-30gf5aaxx:8:1:63:487      GGAGGTAGAGGTATATGGCAAGAAAACTGAAAATC     NM      -
>HWI-EAS305_3-30gf5aaxx:8:1:415:1852    GTTAGATTTTGTGTAACTTGCATGTAATGTTAAAA     3:1:0   chr14.fa:35121238F35,35121282F35,35121326F32T1T,35121354F4T30
>HWI-EAS305_3-30gf5aaxx:8:1:187:1286    GTTACACTGAAAAACAAATTCGTTGGAAACGGGAT     0:4:5   chr6.fa:103599157R16C17A,chr2.fa:98502709R16C18,98502829R6A9C18,98505080F4AC29,98505200F1A14C18,98505320F16C18,98506416R16C13C2CA,98506537R16C18,chrX.fa:139917587R16C2A13CA
>HWI-EAS305_3-30gf5aaxx:8:1:202:440     GTGAAAAATGAGAAATGCACACTGAAGGACCTGGA     3:87:58 chr2.fa:98503100F33T1,98506780F35,98507265F35
>HWI-EAS305_3-30gf5aaxx:8:1:359:505     TATTCAATTTACATACTCTGGCTTTGCCAACATTT     1:0:0   chr9.fa:31339651R35
>HWI-EAS305_3-30gf5aaxx:8:1:1290:135    TTGATTGTATAGTAGGGGTGAAATGGAATTTTATC     1:0:1   chrM.fa:14790R35
>HWI-EAS305_3-30gf5aaxx:8:1:627:596     GTGATTTTGAAAGTTGTAGATTGTGTGTTTGTGAT     NM      -
>HWI-EAS305_3-30gf5aaxx:8:1:379:298     GACGTGAAATATGGCGAGGAAAACTGAAAAAGGTG     31:56:28        -
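
The last column is the fiddly part to convert. Judging only from the records above, each comma-separated hit is a position, a strand letter (F or R) and a match descriptor, and a hit without a `name.fa:` prefix appears to inherit the reference name of the hit before it. The sketch below parses the column under that assumption; `parse_eland_hits` and the regular expression are illustrative, not part of the Solexa pipeline.

```python
import re

# Optional "reference:" prefix, then position, strand letter, match descriptor.
HIT_RE = re.compile(
    r"^(?:(?P<ref>[^:]+):)?(?P<pos>\d+)(?P<strand>[FR])(?P<desc>\S*)$"
)


def parse_eland_hits(field):
    """Split the hits column into (reference, position, strand, descriptor).

    Assumption: a hit with no reference prefix belongs to the most recently
    named reference. A lone '-' means the read has no reported hits.
    """
    if field in ("", "-"):
        return []
    hits = []
    ref = None
    for item in field.split(","):
        m = HIT_RE.match(item)
        if m is None:
            raise ValueError("unrecognised hit: %r" % item)
        if m.group("ref"):
            ref = m.group("ref")
        hits.append((ref, int(m.group("pos")), m.group("strand"), m.group("desc")))
    return hits


hits = parse_eland_hits("chr2.fa:98503100F33T1,98506780F35,98507265F35")
for hit in hits:
    print(hit)
```

The match descriptor is left as a string here, since decoding it depends on details of the format that the examples alone do not pin down.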

s_*_eland_multi.txt

Solexa output format from Eland multi

>HWI-EAS305_3-30gf5aaxx:8:1:414:208     GTAAACTATCAATAAAATAATTTGTTACTCTGTAT     20:7:0
>HWI-EAS305_3-30gf5aaxx:8:1:59:857      TAAATTGTCCACCTTTTTCAGTTTTCCTCGCTATA     0:0:35
>HWI-EAS305_3-30gf5aaxx:8:1:1414:307    GAGAAAACTGTAAATAAAGGTAAATGAGAAAAAAA     NM
>HWI-EAS305_3-30gf5aaxx:8:1:330:1758    GGTAAAGTCCACTAAGGAAAAGAAAGAAACAATGT     1:0:0   chr7.fa:97764095R0
>HWI-EAS305_3-30gf5aaxx:8:1:576:127     GAAGTCAATCTTATGAGTTATTAGGATGGCTACTC     0:7:255 chr7.fa:111867683F1,chr12.fa:51788781R1,115833262F1,chr6.fa:21403822R1,89734675R1,89780759R1,chrX.fa:15525553R1
>HWI-EAS305_3-30gf5aaxx:8:1:88:1045     GTTTCTCATTTTCCATGATTTTCAGTTTTCTTGCC     66:110:72
>HWI-EAS305_3-30gf5aaxx:8:1:939:613     TACTTTACTTTCTAGGGAATGTTCACTTCTAAGTG     1:0:0   chr1.fa:150051845R0

s_*_sorted.txt

Filtered eland_extended alignments with quality scores and genome positions

HWI-EAS305      3-30gf5aaxx     8       66      580     1584                    AGTATGGGTATCGGTTGGTGCAGAGAACTACTGCA     YYYYYYYYYYYYYYYYYYYYYYYYVYYYYYVVUVU        chr10.fa                3001045 F       35      11
HWI-EAS305      3-30gf5aaxx     8       100     534     1062                    ATTTTCAGGTTGGAGTGACTCGCTAAAACAGCCAA     YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYTVVVV        chr10.fa                3002892 R       35      29
HWI-EAS305      3-30gf5aaxx     8       59      199     495                     CCACATGCTGTGGCAAAGCCCTTCTGAGCGGGGCG     YYYYTYYYYYYYYYYYRYYYYYYYYYYYYYTVUVV        chr10.fa                3008958 F       34A     20
HWI-EAS305      3-30gf5aaxx     8       76      779     1406                    AGATGTACAAATGCTCCTCAGATGTTTGTGTCATA     YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYVVVVV        chr10.fa                3009290 F       35      3
HWI-EAS305      3-30gf5aaxx     8       83      547     1480                    ATCCAAACAGTTACACAAAGTTTTGAGAACATTAT     YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYVVVVV

GENOME ALIGNMENT FORMATS

SGA ('Simplified' Genome Annotation)

GFF (General Feature Format)

UCSC Genome Browser

Sanger

EXAMPLE:

track name=regulatory description="TeleGene(tm) Regulatory Regions"
chr22 TeleGene enhancer 1000000 1001000 500 + . touch1
chr22 TeleGene promoter 1010000 1010100 900 + . touch1
chr22 TeleGene promoter 1020000 1020000 800 - . touch2
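
For anyone writing yet another converter, the nine GFF columns are seqname, source, feature, start, end, score, strand, frame and group, with 1-based inclusive coordinates. A minimal sketch of reading one data line is below; the `GffRecord` and `parse_gff_line` names are made up for this example, and track or comment lines are assumed to have been filtered out by the caller.

```python
from collections import namedtuple

GffRecord = namedtuple(
    "GffRecord",
    "seqname source feature start end score strand frame group",
)


def parse_gff_line(line):
    """Parse a single GFF data line into a GffRecord.

    Real GFF is tab-delimited; splitting on any whitespace also copes with
    space-separated copies like the example above. maxsplit=8 keeps the
    free-text group column in one piece.
    """
    fields = line.split(None, 8)
    if len(fields) < 8:
        raise ValueError("expected at least 8 GFF columns: %r" % line)
    seqname, source, feature, start, end, score, strand, frame = fields[:8]
    group = fields[8].strip() if len(fields) > 8 else ""
    return GffRecord(
        seqname, source, feature, int(start), int(end),
        None if score == "." else float(score),  # "." means no score
        strand, frame, group,
    )


rec = parse_gff_line("chr22 TeleGene enhancer 1000000 1001000 500 + . touch1")

# Coordinates are 1-based and inclusive, so the length is end - start + 1.
print(rec.feature, rec.end - rec.start + 1)
```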

FPS (Functional Position Set)

Native format for Eukaryotic Promoter Database

EXAMPLE:

FP   Pv snRNA U1         :+S  EM:J03563.1          1+       352; 17001.098
FP   Ath snRNA U2.5      :+S  EM:AL353994.1        1-     73709; 24016.116
FP   Ath snRNA U5        :+S  EM:X13012.1          1+       678; 23040.
FP   Ta histone H3       :+S  EM:X00937.1          1+       186; 07001.

WIG (Wiggle)

UCSC Genome Browser track format

EXAMPLE:

track type=wiggle_0 name="Bed Format" description="BED format" \
visibility=full color=200,100,0 altColor=0,100,200 priority=20
chr19 59302000 59302300 -1.0
chr19 59302300 59302600 -0.75
chr19 59302600 59302900 -0.50

BED

UCSC Genome Browser

EXAMPLE:

An annotation track that uses a complete BED definition:

track name=pairedReads description="Clone Paired Reads" useScore=1
chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512
chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601
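
The block columns are the part of BED that most often trips up converters. BED coordinates are 0-based and half-open, and the last two columns list block sizes and block starts relative to chromStart. A short sketch of turning a 12-column line into absolute block intervals is below; `bed_blocks` is a hypothetical helper name, not part of any UCSC tool.

```python
def bed_blocks(line):
    """Return absolute (chrom, start, end) intervals for each BED12 block.

    Columns 11 and 12 (blockSizes, blockStarts) are comma-separated lists
    that may carry a trailing comma. Starts are relative to chromStart.
    """
    fields = line.split()
    if len(fields) < 12:
        raise ValueError("expected 12 BED columns: %r" % line)
    chrom = fields[0]
    chrom_start = int(fields[1])
    sizes = [int(x) for x in fields[10].split(",") if x]
    starts = [int(x) for x in fields[11].split(",") if x]
    return [
        (chrom, chrom_start + rel, chrom_start + rel + size)
        for rel, size in zip(starts, sizes)
    ]


line = "chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512"
blocks = bed_blocks(line)
print(blocks)
```

For the cloneA record the second block ends exactly at chromEnd (1000 + 3512 + 488 = 5000), which is a handy sanity check when converting.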

ALN

Alignment format for CisGenome


chr1[tab]359077[tab]F
chr1[tab]376890[tab]R

….

column1 = chromosome where the read is aligned;
column2 = coordinate where the read is aligned;
column3 = ‘F’ or ‘+’: if the read is aligned to the forward strand of the genome assembly;
         ‘R’ or ‘-’: if the read is aligned to the reverse complement strand of the genome.
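
Since ALN is just three tab-separated columns, it makes an easy conversion target. A minimal sketch of reading and writing ALN lines follows, based only on the column description above. The helper names are hypothetical, and the choice to write 'F'/'R' rather than '+'/'-' is arbitrary, since the description allows either.

```python
# Both spellings of each strand are accepted on input.
STRAND_TO_ALN = {"F": "F", "+": "F", "R": "R", "-": "R"}


def to_aln_line(chrom, position, strand):
    """Format one alignment as an ALN line: chromosome, coordinate, strand."""
    if strand not in STRAND_TO_ALN:
        raise ValueError("unknown strand: %r" % strand)
    return "%s\t%d\t%s" % (chrom, position, STRAND_TO_ALN[strand])


def parse_aln_line(line):
    """Parse one ALN line, normalising the strand to 'F' or 'R'."""
    chrom, position, strand = line.rstrip("\n").split("\t")
    if strand not in STRAND_TO_ALN:
        raise ValueError("unknown strand: %r" % strand)
    return chrom, int(position), STRAND_TO_ALN[strand]


print(to_aln_line("chr1", 359077, "+"))
print(parse_aln_line("chr1\t376890\tR"))
```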

3 comments:

Anonymous said...: Dont forget maq and bowtie. It is getting rather frustrating...; Apr 27, 2009, 1:01:00 PM
Anonymous said...: personally, it's driving me to drink. Next stop painkillers.; Aug 13, 2009, 9:57:00 PM
Anonymous said...: oh, and don't forget s_*_eland_result.txt

here's an example:

s_5_eland_result.txt
>HWUSI-EAS528:5:1:764:491#0/1 TACTGCAAGGACCTCTGACCTCCACGCAGGTGTGCT U0 1 0 0 chr12.fa 109518808 R DD

Aug 13, 2009, 10:03:00 PM

Stuart Brown