BED file support

From ZENBU documentation wiki
Revision as of 18:00, 19 October 2018 by Jessica Severin (talk | contribs)

BED files are a common interchange format for genomic annotations. BED lines have three required fields and nine additional optional fields. The number of fields per line must be consistent throughout any single set of data in an annotation track. The order of the optional fields is binding: lower-numbered fields must always be populated if higher-numbered fields are used.

The 12 BED columns are labeled as follows:

  1. chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671).
  2. chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
  3. chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
  4. name - Defines the name of the BED line.
  5. score - A score between 0 and 1000.
  6. strand - Defines the strand - either '+' or '-'.
  7. thickStart - The starting position at which the feature is drawn thickly (for example, the start codon in gene displays).
  8. thickEnd - The ending position at which the feature is drawn thickly (for example, the stop codon in gene displays).
  9. itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0). If the track line itemRgb attribute is set to "On", this RGB value will determine the display color of the data contained in this BED line.
  10. blockCount - The number of blocks (exons) in the BED line.
  11. blockSizes - A comma-separated list of the block sizes. The number of items in this list should correspond to blockCount.
  12. blockStarts - A comma-separated list of block starts. All of the blockStart positions should be calculated relative to chromStart. The number of items in this list should correspond to blockCount.

Note that ZENBU applies a special translation to BED4 (4-column) files so that they match the bedGraph format (chrom, start, end, score). BED3, BED6 and BED12 are the standard versions commonly used in the field.
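The column rules above can be illustrated with a minimal Python sketch. It is not ZENBU's parser: the names BedRecord and parse_bed_line are hypothetical, and for brevity it accepts only the BED3, BED4, BED6 and BED12 widths. Following the note above, a 4-column line is read with its fourth column as a score.

```python
from dataclasses import dataclass
from typing import List, Optional

# Widths handled by this sketch. Other widths (e.g. BED9) are valid BED
# but are left out here to keep the example short.
SUPPORTED_WIDTHS = {3, 4, 6, 12}


@dataclass
class BedRecord:  # hypothetical container, not a ZENBU class
    chrom: str
    chrom_start: int                 # 0-based
    chrom_end: int                   # not included in the feature
    name: Optional[str] = None
    score: Optional[float] = None
    strand: Optional[str] = None
    thick_start: Optional[int] = None
    thick_end: Optional[int] = None
    item_rgb: Optional[str] = None
    block_sizes: Optional[List[int]] = None
    block_starts: Optional[List[int]] = None


def _int_list(text: str) -> List[int]:
    # Skips empty items so a trailing comma ("577,2559,") is tolerated.
    return [int(item) for item in text.split(",") if item]


def parse_bed_line(line: str) -> BedRecord:
    cols = line.rstrip("\n").split("\t")
    if len(cols) not in SUPPORTED_WIDTHS:
        raise ValueError(f"unsupported column count: {len(cols)}")

    rec = BedRecord(cols[0], int(cols[1]), int(cols[2]))
    if len(cols) == 4:
        # BED4 treated like bedGraph: the fourth column is the score.
        rec.score = float(cols[3])
        return rec
    if len(cols) >= 6:
        rec.name = cols[3]
        rec.score = float(cols[4])
        rec.strand = cols[5]
    if len(cols) == 12:
        rec.thick_start = int(cols[6])
        rec.thick_end = int(cols[7])
        rec.item_rgb = cols[8]
        block_count = int(cols[9])
        rec.block_sizes = _int_list(cols[10])
        rec.block_starts = _int_list(cols[11])
        # blockSizes and blockStarts must both have blockCount items.
        if not (len(rec.block_sizes) == len(rec.block_starts) == block_count):
            raise ValueError("blockCount does not match blockSizes/blockStarts")
    return rec
```

Calling parse_bed_line("chr1\t0\t100\t5.5") yields a record with score 5.5 and no name, while a 5-column line raises ValueError in this sketch.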

ZENBU interpretation of BED files

The BED file format easily maps into the ZENBU data model.

  • chrom, chromStart, chromEnd, strand are directly interpreted as genomic coordinates. Note that BED files use a 0-based, end-exclusive (half-open) coordinate space, while ZENBU uses a 1-based, inclusive coordinate space. ZENBU automatically handles the conversion between coordinate spaces.
  • name is stored in the ZENBU Feature name.
  • score is stored in the Feature significance. When uploading data, there is an option to copy the score into an Expression value of a specified DataType.
  • the three columns blockCount, blockSizes, blockStarts work together and are interpreted into SubFeatures on the primary Feature. Each of these SubFeatures is created with a FeatureSource category of block. If these columns generate SubFeatures, then ZENBU can also interpret the thickStart and thickEnd columns as follows:
    • if thickStart is not equal to chromStart then the region from chromStart to thickStart is interpreted into a SubFeature of category 5utr
    • if thickEnd is not equal to chromEnd then the region from thickEnd to chromEnd is interpreted into a SubFeature of category 3utr
  • itemRgb allows manual coloring of features in tracks (loaded from BED12 or BED9 files). If the itemRgb column is empty, it is not inserted into the metadata of the features. To visualize the color stored in the itemRgb metadata, make sure the metadata is present and that "full_feature" is enabled for the source outmode.
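The coordinate conversion mentioned in the first bullet amounts to shifting the start by one and leaving the end unchanged. A minimal Python sketch follows; the function names are hypothetical and this is not ZENBU's own code.

```python
def bed_to_one_based(chrom_start: int, chrom_end: int) -> tuple:
    """Convert a BED interval (0-based start, end not included)
    to a 1-based interval with both ends included."""
    return chrom_start + 1, chrom_end


def one_based_to_bed(start: int, end: int) -> tuple:
    """Inverse conversion, e.g. for writing a BED line back out."""
    return start - 1, end


# The first 100 bases of a chromosome: BED 0..100 becomes 1..100.
print(bed_to_one_based(0, 100))  # (1, 100)
```

The feature length is unchanged by the conversion: chromEnd - chromStart in BED equals end - start + 1 in the 1-based form.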

For example, this BED line

chr5	137801180	137805004	NM_001964	0.00	+	137801451	137803770	0	2	577,2559	0,1265

is interpreted into the ZENBU data model (here displayed in the ZENBU XML export/interchange format):

<feature name="NM_001964" start="137801181" end="137805004" strand="+" >
    <chrom chr="chr5" asm="hg19" ucsc_sm="hg19" ncbi_asm="GRCh37" taxon_id="9606" length="180915260"/>
    <featuresource category="refgene" name="UCSC_hg19_refgene" feature_count="35067"/>
    <subfeatures count="4">
        <feature category="5utr" start="137801181" end="137801451" strand="+"/>
        <feature category="block" start="137801181" end="137801757" strand="+"/>
        <feature category="block" start="137802446" end="137805004" strand="+"/>
        <feature category="3utr" start="137803770" end="137805004" strand="+"/>
    </subfeatures>
</feature>
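The two block SubFeatures in this export can be recomputed by hand. The Python sketch below is an illustration, not ZENBU's implementation, and block_subfeatures is a hypothetical name. It uses chromStart 137801180, block starts 0,1265 and block sizes 577,2559, the values that reproduce the block coordinates shown in the export. The 5utr and 3utr SubFeatures are not covered here.

```python
def block_subfeatures(chrom_start, block_sizes, block_starts):
    """Return (start, end) pairs in 1-based inclusive coordinates,
    one per BED block."""
    features = []
    for size, rel_start in zip(block_sizes, block_starts):
        zero_based_start = chrom_start + rel_start  # blockStarts are relative to chromStart
        start = zero_based_start + 1                # shift to 1-based
        end = zero_based_start + size               # half-open end equals inclusive end
        features.append((start, end))
    return features


blocks = block_subfeatures(137801180, [577, 2559], [0, 1265])
print(blocks)  # [(137801181, 137801757), (137802446, 137805004)]
```

The last block ends at chromEnd (137805004), as the BED format requires for the final block.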

BED as OSCtable header

BED files can be represented with an OSCtable column header line using the ZENBU extended column namespace. The header for a full 12-column BED file is:

eedb:chrom	eedb:start.0base	eedb:end	eedb:name	eedb:score	eedb:strand	eedb:bed_thickstart	eedb:bed_thickend	bed:itemRgb	eedb:bed_block_count	eedb:bed_block_sizes	eedb:bed_block_starts

BED6

eedb:chrom	eedb:start.0base	eedb:end	eedb:name	eedb:score	eedb:strand

BED4 : aka bedGraph

eedb:chrom	eedb:start.0base	eedb:end	eedb:score

BED3

eedb:chrom	eedb:start.0base	eedb:end
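As an illustration, the Python sketch below picks one of the header lines above based on the number of columns in the first data line and prepends it to the BED content. The helper names are hypothetical, and whether the result can be uploaded as-is is an assumption; the column names are copied from the headers listed on this page.

```python
# OSCtable column names for each BED width, as listed on this page.
OSC_HEADERS = {
    12: ["eedb:chrom", "eedb:start.0base", "eedb:end", "eedb:name",
         "eedb:score", "eedb:strand", "eedb:bed_thickstart",
         "eedb:bed_thickend", "bed:itemRgb", "eedb:bed_block_count",
         "eedb:bed_block_sizes", "eedb:bed_block_starts"],
    6: ["eedb:chrom", "eedb:start.0base", "eedb:end", "eedb:name",
        "eedb:score", "eedb:strand"],
    4: ["eedb:chrom", "eedb:start.0base", "eedb:end", "eedb:score"],
    3: ["eedb:chrom", "eedb:start.0base", "eedb:end"],
}


def osc_header_for(bed_line: str) -> str:
    """Return the tab-separated OSCtable header matching the column count."""
    width = len(bed_line.rstrip("\n").split("\t"))
    if width not in OSC_HEADERS:
        raise ValueError(f"no OSCtable header listed for {width} columns")
    return "\t".join(OSC_HEADERS[width])


def bed_to_osctable(bed_lines) -> str:
    """Prepend the matching header to the non-empty BED lines."""
    rows = [ln.rstrip("\n") for ln in bed_lines if ln.strip()]
    if not rows:
        raise ValueError("no BED lines given")
    return "\n".join([osc_header_for(rows[0])] + rows) + "\n"
```

For a 4-column input the fourth header column is eedb:score, which matches the bedGraph-style reading of BED4 files.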

BED specification

The official BED specification is available at http://genome.ucsc.edu/FAQ/FAQformat.html#format1