What is it with bioinformatics people and file formats?!
Why is it so bloody hard to produce and agree on a single standard to represent sequence data (with quality scores), and another for sequence reads aligned to a reference genome? With so many formats, we all end up spending absurd amounts of time writing converters between every possible pair of them.
Here are some of the file formats that I've dealt with in the past couple of weeks:
SEQUENCE FORMATS
FASTQ: sequence plus Phred quality scores, each encoded as a single ASCII byte
@NCYC361-11a03.q1k bases 1 to 1576
GCGTGCCCGAAAAAATGCTTTTGGAGCCGCGCGTGAAAT
+NCYC361-11a03.q1k bases 1 to 1576
!)))))****(((***%%((((*(((+,**(((+**+,-
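To make "encoded as single ASCII bytes" concrete, here is a minimal sketch of decoding that quality line. It assumes the Sanger convention of an offset of 33 (which the leading '!' in the sample is consistent with); the function names are mine, not part of any library.

```python
def phred33_to_scores(quality: str) -> list[int]:
    """Decode a quality string: one ASCII byte per base, offset 33 (assumed)."""
    return [ord(ch) - 33 for ch in quality]


def error_probability(q: int) -> float:
    """Phred definition: Q = -10 * log10(P_error), so P_error = 10^(-Q/10)."""
    return 10 ** (-q / 10)


scores = phred33_to_scores("!)))))****(((***%%((((*(((+,**(((+**+,-")
print(scores[:6])  # [0, 8, 8, 8, 8, 8]
```

So '!' is a score of 0 and ')' is a score of 8, i.e. these are pretty low-quality bases.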
Solexa/Illumina FASTQ-like thing...
s_*_sequence.txt
@HWI-EAS305_3-30gf5aaxx:8:1:415:1852
GTTAGATTTTGTGTAACTTGCATGTAATGTTAAAA
+HWI-EAS305_3-30gf5aaxx:8:1:415:1852
YYYYYYYYYYYYVYYYYYYVYYYYYYYYVYVVTUU
@HWI-EAS305_3-30gf5aaxx:8:1:187:1286
GTTACACTGAAAAACAAATTCGTTGGAAACGGGAT
+HWI-EAS305_3-30gf5aaxx:8:1:187:1286
YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYTVVVV
@HWI-EAS305_3-30gf5aaxx:8:1:202:440
GTGAAAAATGAGAAATGCACACTGAAGGACCTGGA
+HWI-EAS305_3-30gf5aaxx:8:1:202:440
YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYVVUVV
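This is exactly the kind of converter I keep rewriting. A rough sketch of turning these quality strings into the Sanger-style encoding, assuming the scores are Phred-scaled with an ASCII offset of 64 (some early Solexa pipeline versions used a different, Solexa-specific score scale, which would need a different conversion). The function name is hypothetical.

```python
def illumina64_to_sanger33(quality: str) -> str:
    """Re-encode a quality string from offset 64 to offset 33.

    Assumes Phred-scaled scores stored as chr(Q + 64); this is an
    assumption about the pipeline version, not something the file states.
    """
    out = []
    for ch in quality:
        q = ord(ch) - 64
        if q < 0:
            raise ValueError(f"character {ch!r} is below the assumed offset of 64")
        out.append(chr(q + 33))
    return "".join(out)


print(illumina64_to_sanger33("YYYYVTUU"))  # ::::7566
```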
s_*_eland_extended.txt
Solexa output format from Eland extended
>HWI-EAS305_3-30gf5aaxx:8:1:63:487 GGAGGTAGAGGTATATGGCAAGAAAACTGAAAATC NM -
>HWI-EAS305_3-30gf5aaxx:8:1:415:1852 GTTAGATTTTGTGTAACTTGCATGTAATGTTAAAA 3:1:0 chr14.fa:35121238F35,35121282F35,35121326F32T1T,35121354F4T30
>HWI-EAS305_3-30gf5aaxx:8:1:187:1286 GTTACACTGAAAAACAAATTCGTTGGAAACGGGAT 0:4:5 chr6.fa:103599157R16C17A,chr2.fa:98502709R16C18,98502829R6A9C18,98505080F4AC29,98505200F1A14C18,98505320F16C18,98506416R16C13C2CA,98506537R16C18,chrX.fa:139917587R16C2A13CA
>HWI-EAS305_3-30gf5aaxx:8:1:202:440 GTGAAAAATGAGAAATGCACACTGAAGGACCTGGA 3:87:58 chr2.fa:98503100F33T1,98506780F35,98507265F35
>HWI-EAS305_3-30gf5aaxx:8:1:359:505 TATTCAATTTACATACTCTGGCTTTGCCAACATTT 1:0:0 chr9.fa:31339651R35
>HWI-EAS305_3-30gf5aaxx:8:1:1290:135 TTGATTGTATAGTAGGGGTGAAATGGAATTTTATC 1:0:1 chrM.fa:14790R35
>HWI-EAS305_3-30gf5aaxx:8:1:627:596 GTGATTTTGAAAGTTGTAGATTGTGTGTTTGTGAT NM -
>HWI-EAS305_3-30gf5aaxx:8:1:379:298 GACGTGAAATATGGCGAGGAAAACTGAAAAAGGTG 31:56:28 -
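Reading the samples above, the last column is a comma-separated list of hits where the chromosome name is only given when it changes. A minimal parsing sketch under that reading; the names `Hit`, `parse_hits` and `parse_eland_extended` are mine, the match-count field is kept as an opaque string, and I am assuming whitespace-separated columns.

```python
import re
from typing import NamedTuple


class Hit(NamedTuple):
    chrom: str
    pos: int
    strand: str      # 'F' or 'R'
    descriptor: str  # match descriptor, e.g. '16C17A', left undecoded


_HIT = re.compile(r"(\d+)([FR])(\S*)")


def parse_hits(field: str) -> list[Hit]:
    """Parse the hit list; a hit without 'chrom:' reuses the previous chromosome."""
    hits: list[Hit] = []
    if field == "-":
        return hits
    chrom = None
    for item in field.split(","):
        if ":" in item:
            chrom, item = item.split(":", 1)
        m = _HIT.fullmatch(item)
        if m is None or chrom is None:
            raise ValueError(f"cannot parse hit {item!r}")
        hits.append(Hit(chrom, int(m.group(1)), m.group(2), m.group(3)))
    return hits


def parse_eland_extended(line: str):
    """Split one line into (read_id, sequence, counts, hits)."""
    fields = line.split()
    hits = parse_hits(fields[3]) if len(fields) > 3 else []
    return fields[0].lstrip(">"), fields[1], fields[2], hits


line = (">HWI-EAS305_3-30gf5aaxx:8:1:202:440 GTGAAAAATGAGAAATGCACACTGAAGGACCTGGA "
        "3:87:58 chr2.fa:98503100F33T1,98506780F35,98507265F35")
for hit in parse_eland_extended(line)[3]:
    print(hit)
```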
s_*_eland_multi.txt
Solexa output format from Eland multi
>HWI-EAS305_3-30gf5aaxx:8:1:414:208 GTAAACTATCAATAAAATAATTTGTTACTCTGTAT 20:7:0
>HWI-EAS305_3-30gf5aaxx:8:1:59:857 TAAATTGTCCACCTTTTTCAGTTTTCCTCGCTATA 0:0:35
>HWI-EAS305_3-30gf5aaxx:8:1:1414:307 GAGAAAACTGTAAATAAAGGTAAATGAGAAAAAAA NM
>HWI-EAS305_3-30gf5aaxx:8:1:330:1758 GGTAAAGTCCACTAAGGAAAAGAAAGAAACAATGT 1:0:0 chr7.fa:97764095R0
>HWI-EAS305_3-30gf5aaxx:8:1:576:127 GAAGTCAATCTTATGAGTTATTAGGATGGCTACTC 0:7:255 chr7.fa:111867683F1,chr12.fa:51788781R1,115833262F1,chr6.fa:21403822R1,89734675R1,89780759R1,chrX.fa:15525553R1
>HWI-EAS305_3-30gf5aaxx:8:1:88:1045 GTTTCTCATTTTCCATGATTTTCAGTTTTCTTGCC 66:110:72
>HWI-EAS305_3-30gf5aaxx:8:1:939:613 TACTTTACTTTCTAGGGAATGTTCACTTCTAAGTG 1:0:0 chr1.fa:150051845R0
s_*_sorted.txt
filtered eland_extended alignments w/ quality scores and genome positions
HWI-EAS305 3-30gf5aaxx 8 66 580 1584 AGTATGGGTATCGGTTGGTGCAGAGAACTACTGCA YYYYYYYYYYYYYYYYYYYYYYYYVYYYYYVVUVU chr10.fa 3001045 F 35 11
HWI-EAS305 3-30gf5aaxx 8 100 534 1062 ATTTTCAGGTTGGAGTGACTCGCTAAAACAGCCAA YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYTVVVV chr10.fa 3002892 R 35 29
HWI-EAS305 3-30gf5aaxx 8 59 199 495 CCACATGCTGTGGCAAAGCCCTTCTGAGCGGGGCG YYYYTYYYYYYYYYYYRYYYYYYYYYYYYYTVUVV chr10.fa 3008958 F 34A 20
HWI-EAS305 3-30gf5aaxx 8 76 779 1406 AGATGTACAAATGCTCCTCAGATGTTTGTGTCATA YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYVVVVV chr10.fa 3009290 F 35 3
HWI-EAS305 3-30gf5aaxx 8 83 547 1480 ATCCAAACAGTTACACAAAGTTTTGAGAACATTAT YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYVVVVV
GENOME ALIGNMENT FORMATS
SGA ('Simplified' Genome Annotation)
GFF (General Feature Format)
EXAMPLE:
track name=regulatory description="TeleGene(tm) Regulatory Regions"
chr22 TeleGene enhancer 1000000 1001000 500 + . touch1
chr22 TeleGene promoter 1010000 1010100 900 + . touch1
chr22 TeleGene promoter 1020000 1020000 800 - . touch2
FPS (Functional Position Set)
Native format for Eukaryotic Promoter Database
EXAMPLE:
FP Pv snRNA U1 :+S EM:J03563.1 1+ 352; 17001.098
FP Ath snRNA U2.5 :+S EM:AL353994.1 1- 73709; 24016.116
FP Ath snRNA U5 :+S EM:X13012.1 1+ 678; 23040.
FP Ta histone H3 :+S EM:X00937.1 1+ 186; 07001.
UCSC Genome Browser wiggle track format
EXAMPLE:
track type=wiggle_0 name="Bed Format" description="BED format" \
visibility=full color=200,100,0 altColor=0,100,200 priority=20
chr19 59302000 59302300 -1.0
chr19 59302300 59302600 -0.75
chr19 59302600 59302900 -0.50
UCSC Genome Browser BED format
EXAMPLE:
Here's an example of an annotation track that uses a complete BED definition:
track name=pairedReads description="Clone Paired Reads" useScore=1
chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512
chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601
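The last two columns of a full BED line (block sizes and block starts) are the part that trips people up, because the starts are relative to the feature start. A small sketch of turning them into absolute intervals, assuming whitespace-separated columns and the usual 0-based, half-open BED coordinates; `bed12_blocks` is a name I made up.

```python
def bed12_blocks(line: str) -> list[tuple[int, int]]:
    """Return absolute (start, end) intervals for each block of a 12-column BED line."""
    fields = line.split()
    chrom_start = int(fields[1])
    # Both lists may carry a trailing comma, so empty pieces are skipped.
    sizes = [int(x) for x in fields[10].split(",") if x]
    starts = [int(x) for x in fields[11].split(",") if x]
    return [(chrom_start + s, chrom_start + s + size)
            for s, size in zip(starts, sizes)]


print(bed12_blocks("chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512"))
# [(1000, 1567), (4512, 5000)]
```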
ALN
Alignment format for CisGenome
chr1[tab]359077[tab]F
chr1[tab]376890[tab]R
….
column1 = chromosome where the read is aligned;
column2 = coordinate where the read is aligned;
column3 = ‘F’ or ‘+’: if the read is aligned to the forward strand of the genome assembly;
‘R’ or ‘-’: if the read is aligned to the reverse complement strand of the genome.
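Since ALN allows two spellings for each strand, any reader has to normalise them first. A minimal sketch based only on the column description above; the function name and the choice of '+'/'-' as the normalised form are my own.

```python
_STRAND = {"F": "+", "+": "+", "R": "-", "-": "-"}


def parse_aln_line(line: str) -> tuple[str, int, str]:
    """Parse one tab-separated ALN line into (chromosome, coordinate, strand)."""
    chrom, coord, strand = line.rstrip("\n").split("\t")
    if strand not in _STRAND:
        raise ValueError(f"unknown strand {strand!r}")
    return chrom, int(coord), _STRAND[strand]


print(parse_aln_line("chr1\t359077\tF"))  # ('chr1', 359077, '+')
```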