GFF and GTF file support

GFF (General Feature Format) consists of one line per feature, each containing 9 columns of data, plus optional track definition lines. The following documentation is based on the Version 2 specification.

The GTF (General Transfer Format) is nearly identical to GFF version 2.

ZENBU currently provides full support for parsing GFF, GFF2, and GTF files. GFF3 file parsing is mostly supported, except for the linking of parents and children (the Parent= and ID= attributes) to create transcripts with subfeatures. Full GFF3 support will come in a future version of ZENBU.

Fields

Fields must be tab-separated. All but the final field in each feature line must contain a value; "empty" columns should be denoted with a '.'.

  1. seqname - name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix.
  2. source - name of the program that generated this feature, or the data source (database or project name)
  3. feature - feature type name, e.g. Gene, Variation, Similarity
  4. start - Start position of the feature, with sequence numbering starting at 1.
  5. end - End position of the feature, with sequence numbering starting at 1.
  6. score - A floating point value.
  7. strand - defined as + (forward) or - (reverse).
  8. frame - One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on.
  9. attribute - A semicolon-separated list of tag-value pairs, providing additional information about each feature. Format is tag=value or tag <space> value.
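
The nine columns above can be read with a few lines of Python. This is only a minimal sketch, not the ZENBU parser: the names GffRecord and parse_gff_line are hypothetical, and it assumes that lines starting with '#' or 'track ' are not feature lines and that a missing ninth column is treated as empty.

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    """One GFF2/GTF feature line (hypothetical container, not a ZENBU type)."""
    seqname: str
    source: str
    feature: str
    start: int              # 1-based start position
    end: int                # 1-based end position
    score: Optional[float]  # None when the column holds '.'
    strand: str
    frame: str
    attribute: str


def parse_gff_line(line: str) -> Optional[GffRecord]:
    """Split one tab-separated feature line into its nine columns."""
    line = line.rstrip("\r\n")
    # Assumption: comments and track definition lines are skipped.
    if not line or line.startswith("#") or line.startswith("track "):
        return None
    cols = line.split("\t")
    if len(cols) < 8:
        raise ValueError("expected at least 8 tab-separated columns, got %d" % len(cols))
    # Only the final (attribute) column may be missing; pad it with "".
    cols += [""] * (9 - len(cols))
    seqname, source, feature, start, end, score, strand, frame, attribute = cols[:9]
    return GffRecord(
        seqname, source, feature,
        int(start), int(end),
        None if score == "." else float(score),
        strand, frame, attribute.strip(),
    )
```

For example, parse_gff_line("X\tEnsembl\tVariation\t2413805\t2413805\t.\t+\t.") gives a record whose score is None and whose attribute is an empty string.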

Sample GFF output from Ensembl export:

X	Ensembl	Repeat	2419108	2419128	42	.	.	hid=trf; hstart=1; hend=21
X	Ensembl	Repeat	2419108	2419410	2502	-	.	hid=AluSx; hstart=1; hend=303
X	Ensembl	Repeat	2419108	2419128	0	.	.	hid=dust; hstart=2419108; hend=2419128
X	Ensembl	Pred.trans.	2416676	2418760	450.19	-	2	genscan=GENSCAN00000019335
X	Ensembl	Variation	2413425	2413425	.	+	.	
X	Ensembl	Variation	2413805	2413805	.	+	.

ZENBU interpretation of GFF / GTF files

The GFF/GTF file format maps easily into the ZENBU data model. Both ZENBU and GFF use a 1base-exclusive coordinate system, so no adjustment is needed between coordinate spaces.

  • seqname : is mapped to the Feature chromosome
  • source : is not interpreted but simply stored as metadata on the Feature with the tag gff:source
  • feature : is interpreted as a FeatureSources category multiplexer. This allows a complex GFF file with many different feature / category types to be organized into separate ZENBU FeatureSources after loading.
  • start : is the Feature chrom_start, in the 1base-exclusive coordinate system
  • end : is the Feature chrom_end
  • score : is stored in the Feature significance. On data upload there is an option to copy the score into an Expression value of a specified DataType.
  • strand : is the Feature strand
  • frame : is not interpreted but simply stored as metadata on the Feature
  • attributes : is parsed into Metadata attached to the Feature. In the future ZENBU will support the GFF3 special tags for extended parsing [1]
    • ID= is used for feature/subfeature linking and is not stored into metadata. Not currently supported.
    • Parent= is used for feature/subfeature linking and is not stored into metadata. Not currently supported.
    • Name= is stored as the name of the Feature and is not stored into metadata. Not currently supported.
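
The mapping above, including the role of the feature column as a category multiplexer, can be illustrated with a short Python sketch. It does not use any ZENBU API: the function group_by_category and the dictionary layout are hypothetical, and the metadata tag gff:frame is an assumption borrowed from the OSCtable column name.

```python
from collections import defaultdict


def group_by_category(lines):
    """Group GFF feature lines by the 'feature' column (column 3).

    Each group stands in for one ZENBU FeatureSource; the dictionaries
    are an illustrative stand-in for ZENBU Feature objects.
    """
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # assumption: comment lines carry no features
        cols = line.split("\t")
        if len(cols) < 8:
            continue  # not a complete feature line
        groups[cols[2]].append({
            "chrom": cols[0],
            "chrom_start": int(cols[3]),  # copied as-is, no coordinate shift
            "chrom_end": int(cols[4]),
            "significance": None if cols[5] == "." else float(cols[5]),
            "strand": cols[6],
            # source and frame are stored, not interpreted
            "metadata": {"gff:source": cols[1], "gff:frame": cols[7]},
        })
    return dict(groups)
```

Applied to the Ensembl sample above, this would produce separate groups for Repeat, Pred.trans. and Variation.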

ZENBU supports all variations of attribute formatting found in GFF, GFF2, GTF, GTF2 and GFF3:

 some_tag=some_value;
 some_tag="some value";
 some_tag some_value;
 some_tag "some value";
 some_tag="value 1","value 2","value 3";  
 some_tag "value 1","value 2","value 3";
 some_tag=value1,value2,value3;
 some_tag value1,value2,value3;
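
One way to handle all of the variations listed above in a single pass is sketched below in Python. This is not the ZENBU implementation; the function names are hypothetical, and the sketch assumes that double quotes are balanced and that every tag maps to a list of values.

```python
import re

# A tag, then either '=' (with optional spaces) or whitespace, then the values.
_TAG_VALUE = re.compile(r'([^\s=]+)(?:\s*=\s*|\s+)(.*)', re.S)


def split_outside_quotes(text, sep):
    """Split text on sep, ignoring separators inside double quotes."""
    parts, buf, in_quotes = [], [], False
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes
        if ch == sep and not in_quotes:
            parts.append("".join(buf))
            buf = []
        else:
            buf.append(ch)
    parts.append("".join(buf))
    return parts


def parse_attributes(column):
    """Parse a GFF/GTF attribute column into {tag: [value, ...]}."""
    attrs = {}
    for pair in split_outside_quotes(column, ";"):
        pair = pair.strip()
        if not pair:
            continue  # trailing ';' or empty column
        match = _TAG_VALUE.match(pair)
        if match:
            tag, raw = match.group(1), match.group(2)
        else:
            tag, raw = pair, ""  # a bare tag without a value
        values = [v.strip().strip('"') for v in split_outside_quotes(raw, ",")]
        attrs.setdefault(tag, []).extend(v for v in values if v)
    return attrs
```

With this sketch, both some_tag="value 1","value 2","value 3"; and some_tag "value 1","value 2","value 3"; yield the same three values for some_tag.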

GFF as OSCtable header

GFF files can easily be represented with an OSCtable column header line using the ZENBU extended column namespace.

eedb:chrom	gff:source	eedb:fsrc_category	eedb:start.1base	eedb:end	eedb:score	eedb:strand	gff:frame	gff:attributes

The gff:attributes column has a complete ZENBU parser attached to it. The parser can interpret this column in either the older GFF/GTF tag<space>value format or the GFF2/GFF3 style tag=value format. In the future, this gff:attributes parser will be expanded to parse the special GFF3 specification tags for 'feature names' and the GFF3 style of storing feature/subfeature relationships. Currently all data in the gff:attributes column is parsed into metadata of the Feature.

gff:source and gff:frame are currently not interpreted but simply stored as Metadata.
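
Since the column order is the same, a GFF file can be turned into an OSCtable by writing the header line first and copying the feature lines after it. The Python sketch below shows the idea; the function gff_to_osctable is hypothetical, and dropping comment and track lines and padding a missing ninth column are assumptions rather than documented ZENBU requirements.

```python
OSC_HEADER = [
    "eedb:chrom", "gff:source", "eedb:fsrc_category",
    "eedb:start.1base", "eedb:end", "eedb:score",
    "eedb:strand", "gff:frame", "gff:attributes",
]


def gff_to_osctable(lines):
    """Yield OSCtable lines: the column header, then each GFF feature line."""
    yield "\t".join(OSC_HEADER)
    for line in lines:
        line = line.rstrip("\r\n")
        # Assumption: comments and track definition lines are dropped.
        if not line or line.startswith("#") or line.startswith("track "):
            continue
        cols = line.split("\t")
        # Pad a missing attribute column so every row has nine fields.
        cols += [""] * (len(OSC_HEADER) - len(cols))
        yield "\t".join(cols[:len(OSC_HEADER)])
```

A typical use would be to open the GFF file, pass the file object to gff_to_osctable, and write each yielded line followed by a newline to the output file.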

GFF / GTF specifications

For more information about these file formats, see the documentation on these external websites:
http://asia.ensembl.org/info/website/upload/gff.html
http://www.sanger.ac.uk/resources/software/gff/spec.html
http://www.sequenceontology.org/gff3.shtml
http://genome.ucsc.edu/FAQ/FAQformat#format4
http://gmod.org/wiki/GFF