Oct 20, 2008

File Formats

What is it with bioinformatics people and file formats?!

Why is it so bloody hard to produce and agree on a single standard to represent sequence data (with quality scores) and a single standard for sequence reads aligned to a reference genome? With so many formats, we are all spending ridiculous amounts of time writing converters between every possible pair of them.

Here are some of the file formats that I've dealt with in the past couple of weeks:

SEQUENCE FORMATS

FASTQ: sequence plus Phred quality scores, each score encoded as a single ASCII byte

@NCYC361-11a03.q1k bases 1 to 1576
GCGTGCCCGAAAAAATGCTTTTGGAGCCGCGCGTGAAAT
+NCYC361-11a03.q1k bases 1 to 1576
!)))))****(((***%%((((*(((+,**(((+**+,-
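For anyone who hasn't stared at one of these quality lines before, here is a minimal sketch of how it decodes. It assumes the standard Sanger convention, where each character's ASCII code minus 33 is the Phred score; the function names are my own, not part of any library.

```python
def phred33_to_scores(quality):
    """Decode a Sanger-style quality string (assumed ASCII offset 33) into Phred scores."""
    return [ord(ch) - 33 for ch in quality]


def error_probability(q):
    """Phred definition: Q = -10 * log10(p), so p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


# Quality line from the record above; one character per base.
quality = "!)))))****(((***%%((((*(((+,**(((+**+,-"
scores = phred33_to_scores(quality)
print(scores)
print([round(error_probability(q), 3) for q in scores[:3]])
```

The first character, `!`, is ASCII 33 and so decodes to a score of 0, which is as low as this encoding goes.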


Solexa/Illumina FASTQ like thing...
s_*_sequence.txt
@HWI-EAS305_3-30gf5aaxx:8:1:415:1852
GTTAGATTTTGTGTAACTTGCATGTAATGTTAAAA
+HWI-EAS305_3-30gf5aaxx:8:1:415:1852
YYYYYYYYYYYYVYYYYYYVYYYYYYYYVYVVTUU
@HWI-EAS305_3-30gf5aaxx:8:1:187:1286
GTTACACTGAAAAACAAATTCGTTGGAAACGGGAT
+HWI-EAS305_3-30gf5aaxx:8:1:187:1286
YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYTVVVV
@HWI-EAS305_3-30gf5aaxx:8:1:202:440
GTGAAAAATGAGAAATGCACACTGAAGGACCTGGA
+HWI-EAS305_3-30gf5aaxx:8:1:202:440
YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYVVUVV
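And here is the kind of converter we all end up writing. This is a rough sketch that re-encodes one of these quality strings in the Sanger style, assuming the file uses Solexa-scaled scores with an ASCII offset of 64. Whether a given file really uses that scaling depends on the pipeline version that produced it, so check before trusting the output. The function names are made up for this example.

```python
import math


def solexa_to_phred(q_solexa):
    """Convert a Solexa-scaled score to the nearest integer Phred score.

    Uses Q_phred = 10 * log10(10 ** (Q_solexa / 10) + 1).
    """
    return round(10 * math.log10(10 ** (q_solexa / 10) + 1))


def solexa64_to_sanger(quality):
    """Re-encode an (assumed) offset-64 Solexa quality string as offset-33 Phred."""
    return "".join(chr(solexa_to_phred(ord(ch) - 64) + 33) for ch in quality)


# Quality line from the first record above.
print(solexa64_to_sanger("YYYYYYYYYYYYVYYYYYYVYYYYYYYYVYVVTUU"))
```

For high scores the two scales are nearly identical, so most of the work is just shifting the offset; the log correction only matters for low-quality bases.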

s_*_eland_extended.txt
Solexa output format from Eland extended
>HWI-EAS305_3-30gf5aaxx:8:1:63:487      GGAGGTAGAGGTATATGGCAAGAAAACTGAAAATC     NM      -
>HWI-EAS305_3-30gf5aaxx:8:1:415:1852 GTTAGATTTTGTGTAACTTGCATGTAATGTTAAAA 3:1:0 chr14.fa:35121238F35,35121282F35,35121326F32T1T,35121354F4T30
>HWI-EAS305_3-30gf5aaxx:8:1:187:1286 GTTACACTGAAAAACAAATTCGTTGGAAACGGGAT 0:4:5 chr6.fa:103599157R16C17A,chr2.fa:98502709R16C18,98502829R6A9C18,98505080F4AC29,98505200F1A14C18,98505320F16C18,98506416R16C13C2CA,98506537R16C18,chrX.fa:139917587R16C2A13CA
>HWI-EAS305_3-30gf5aaxx:8:1:202:440 GTGAAAAATGAGAAATGCACACTGAAGGACCTGGA 3:87:58 chr2.fa:98503100F33T1,98506780F35,98507265F35
>HWI-EAS305_3-30gf5aaxx:8:1:359:505 TATTCAATTTACATACTCTGGCTTTGCCAACATTT 1:0:0 chr9.fa:31339651R35
>HWI-EAS305_3-30gf5aaxx:8:1:1290:135 TTGATTGTATAGTAGGGGTGAAATGGAATTTTATC 1:0:1 chrM.fa:14790R35
>HWI-EAS305_3-30gf5aaxx:8:1:627:596 GTGATTTTGAAAGTTGTAGATTGTGTGTTTGTGAT NM -
>HWI-EAS305_3-30gf5aaxx:8:1:379:298 GACGTGAAATATGGCGAGGAAAACTGAAAAAGGTG 31:56:28 -
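The annoying part of this format is the last column. Judging by the lines above, it is a comma-separated list of hits where the reference name is only written when it changes, so a bare position inherits the name from the hit before it. Here is a sketch of a parser under that reading. The regular expression and the function names are my own guesses based on these examples, not an official grammar.

```python
import re

# One hit looks like "35121282F35" or, when the reference changes,
# "chr14.fa:35121238F35": optional name, position, strand (F/R), match descriptor.
HIT = re.compile(r"^(?:(?P<ref>[^:]+):)?(?P<pos>\d+)(?P<strand>[FR])(?P<desc>\w*)$")


def parse_hits(field):
    """Split the hit column into (reference, position, strand, descriptor) tuples."""
    if field == "-":  # no positions reported for this read
        return []
    hits, ref = [], None
    for item in field.split(","):
        m = HIT.match(item)
        if m is None:
            raise ValueError("unrecognised hit: %r" % item)
        ref = m.group("ref") or ref  # a bare position reuses the previous reference
        hits.append((ref, int(m.group("pos")), m.group("strand"), m.group("desc")))
    return hits


def parse_line(line):
    """Parse one whitespace-separated eland_extended line into a dict."""
    fields = line.split()
    return {
        "name": fields[0].lstrip(">"),
        "sequence": fields[1],
        "counts": fields[2],  # e.g. "3:1:0", or "NM" for no match
        "hits": parse_hits(fields[3]) if len(fields) > 3 else [],
    }


record = parse_line(
    ">HWI-EAS305_3-30gf5aaxx:8:1:202:440 GTGAAAAATGAGAAATGCACACTGAAGGACCTGGA "
    "3:87:58 chr2.fa:98503100F33T1,98506780F35,98507265F35"
)
for hit in record["hits"]:
    print(hit)
```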


s_*_eland_multi.txt
Solexa output format from Eland multi
>HWI-EAS305_3-30gf5aaxx:8:1:414:208     GTAAACTATCAATAAAATAATTTGTTACTCTGTAT     20:7:0
>HWI-EAS305_3-30gf5aaxx:8:1:59:857 TAAATTGTCCACCTTTTTCAGTTTTCCTCGCTATA 0:0:35
>HWI-EAS305_3-30gf5aaxx:8:1:1414:307 GAGAAAACTGTAAATAAAGGTAAATGAGAAAAAAA NM
>HWI-EAS305_3-30gf5aaxx:8:1:330:1758 GGTAAAGTCCACTAAGGAAAAGAAAGAAACAATGT 1:0:0 chr7.fa:97764095R0
>HWI-EAS305_3-30gf5aaxx:8:1:576:127 GAAGTCAATCTTATGAGTTATTAGGATGGCTACTC 0:7:255 chr7.fa:111867683F1,chr12.fa:51788781R1,115833262F1,chr6.fa:21403822R1,89734675R1,89780759R1,chrX.fa:15525553R1
>HWI-EAS305_3-30gf5aaxx:8:1:88:1045 GTTTCTCATTTTCCATGATTTTCAGTTTTCTTGCC 66:110:72
>HWI-EAS305_3-30gf5aaxx:8:1:939:613 TACTTTACTTTCTAGGGAATGTTCACTTCTAAGTG 1:0:0 chr1.fa:150051845R0

s_*_sorted.txt
filtered eland_extended alignments w/ quality  scores and genome positions
HWI-EAS305      3-30gf5aaxx     8       66      580     1584                    AGTATGGGTATCGGTTGGTGCAGAGAACTACTGCA     YYYYYYYYYYYYYYYYYYYYYYYYVYYYYYVVUVU chr10.fa 3001045 F 35 11
HWI-EAS305 3-30gf5aaxx 8 100 534 1062 ATTTTCAGGTTGGAGTGACTCGCTAAAACAGCCAA YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYTVVVV chr10.fa 3002892 R 35 29
HWI-EAS305 3-30gf5aaxx 8 59 199 495 CCACATGCTGTGGCAAAGCCCTTCTGAGCGGGGCG YYYYTYYYYYYYYYYYRYYYYYYYYYYYYYTVUVV chr10.fa 3008958 F 34A 20
HWI-EAS305 3-30gf5aaxx 8 76 779 1406 AGATGTACAAATGCTCCTCAGATGTTTGTGTCATA YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYVVVVV chr10.fa 3009290 F 35 3
HWI-EAS305 3-30gf5aaxx 8 83 547 1480 ATCCAAACAGTTACACAAAGTTTTGAGAACATTAT YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYVVVVV



GENOME ALIGNMENT FORMATS

SGA ('Simplified' Genome Annotation)

GFF  (General Feature Format)
EXAMPLE:
track name=regulatory description="TeleGene(tm) Regulatory Regions"
chr22 TeleGene enhancer 1000000 1001000 500 + . touch1
chr22 TeleGene promoter 1010000 1010100 900 + . touch1
chr22 TeleGene promoter 1020000 1020000 800 - . touch2

FPS (Functional Position Set)
Native format for Eukaryotic Promoter Database

EXAMPLE:
FP   Pv snRNA U1         :+S  EM:J03563.1          1+       352; 17001.098
FP Ath snRNA U2.5 :+S EM:AL353994.1 1- 73709; 24016.116
FP Ath snRNA U5 :+S EM:X13012.1 1+ 678; 23040.
FP Ta histone H3 :+S EM:X00937.1 1+ 186; 07001.

WIG (Wiggle)
UCSC Genome Browser track format

EXAMPLE
track type=wiggle_0 name="Bed Format" description="BED format" \
visibility=full color=200,100,0 altColor=0,100,200 priority=20
chr19 59302000 59302300 -1.0
chr19 59302300 59302600 -0.75
chr19 59302600 59302900 -0.50

BED (Browser Extensible Data)
UCSC Genome Browser annotation track format

EXAMPLE:
An annotation track that uses a complete BED definition:

track name=pairedReads description="Clone Paired Reads" useScore=1
chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512
chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601
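Even when two formats hold the same information, the coordinates don't line up: BED starts are 0-based with an exclusive end, while GFF is 1-based and inclusive. Here is a small sketch of a BED-to-GFF converter that handles that off-by-one. The `source` and `feature` defaults are placeholders I made up, since BED has no equivalent columns, and the block structure in columns 10 to 12 is simply dropped.

```python
def bed_to_gff(bed_line, source="bed2gff", feature="region"):
    """Convert one BED line (at least 3 columns) into a 9-column GFF line."""
    f = bed_line.split()
    chrom, start, end = f[0], int(f[1]), int(f[2])
    name = f[3] if len(f) > 3 else "."
    score = f[4] if len(f) > 4 else "."
    strand = f[5] if len(f) > 5 else "."
    # BED: 0-based start, exclusive end. GFF: 1-based start, inclusive end.
    # So only the start shifts by one; the end value stays the same.
    return "\t".join(
        [chrom, source, feature, str(start + 1), str(end), score, strand, ".", name]
    )


print(bed_to_gff("chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512"))
```

Forgetting the `+ 1` is probably the most common bug in homemade converters, and it is silent: every feature just ends up one base to the left.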

ALN
Alignment format for CisGenome

chr1[tab]359077[tab]F
chr1[tab]376890[tab]R

….

column1 = chromosome where the read is aligned;
column2 = coordinate where the read is aligned;
column3 = strand: ‘F’ or ‘+’ if the read is aligned to the forward strand of the genome assembly;
‘R’ or ‘-’ if the read is aligned to the reverse complement strand of the genome.
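To give a flavour of the converters this forces on us, here is a sketch that turns one s_*_sorted.txt record (as shown earlier in this post) into an ALN line. It assumes the whitespace-collapsed layout seen here, where splitting on whitespace puts the chromosome, position and strand in fields 9 to 11; a real tab-delimited file with empty columns would need different indices. Stripping the ".fa" suffix is also my assumption, based on the "chr1" in the ALN example, and the position is passed through unchanged.

```python
def sorted_to_aln(line):
    """Turn one sorted.txt record into 'chromosome<TAB>position<TAB>strand'.

    Returns None for records without a usable alignment (e.g. truncated lines).
    """
    fields = line.split()
    if len(fields) < 11:
        return None
    chrom, pos, strand = fields[8], fields[9], fields[10]
    if strand not in ("F", "R") or not pos.isdigit():
        return None
    if chrom.endswith(".fa"):
        chrom = chrom[:-3]  # assumption: ALN wants "chr10", not "chr10.fa"
    return "\t".join([chrom, pos, strand])


example = (
    "HWI-EAS305 3-30gf5aaxx 8 100 534 1062 "
    "ATTTTCAGGTTGGAGTGACTCGCTAAAACAGCCAA "
    "YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYTVVVV chr10.fa 3002892 R 35 29"
)
print(sorted_to_aln(example))
```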

3 comments:

Anonymous said...

Don't forget maq and bowtie. It is getting rather frustrating...

Anonymous said...

personally, it's driving me to drink. Next stop painkillers.

Anonymous said...

oh, and don't forget s_*_eland_result.txt

here's an example:

s_5_eland_result.txt
>HWUSI-EAS528:5:1:764:491#0/1 TACTGCAAGGACCTCTGACCTCCACGCAGGTGTGCT U0 1 0 0 chr12.fa 109518808 R DD