BED file support

From ZENBU documentation wiki

Latest revision as of 18:00, 19 October 2018

BED files are a common interchange format for genomic annotations. BED lines have three required fields and nine additional optional fields. The number of fields per line must be consistent throughout any single set of data in an annotation track. The order of the optional fields is binding: lower-numbered fields must always be populated if higher-numbered fields are used.

The 12 BED columns are labeled as follows:

  1. chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671).
  2. chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
  3. chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
  4. name - Defines the name of the BED line.
  5. score - A score between 0 and 1000.
  6. strand - Defines the strand - either '+' or '-'.
  7. thickStart - The starting position at which the feature is drawn thickly (for example, the start codon in gene displays).
  8. thickEnd - The ending position at which the feature is drawn thickly (for example, the stop codon in gene displays).
  9. itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0). If the track line itemRgb attribute is set to "On", this RGB value will determine the display color of the data contained in this BED line.
  10. blockCount - The number of blocks (exons) in the BED line.
  11. blockSizes - A comma-separated list of the block sizes. The number of items in this list should correspond to blockCount.
  12. blockStarts - A comma-separated list of block starts. All of the blockStart positions should be calculated relative to chromStart. The number of items in this list should correspond to blockCount.

Note: ZENBU applies a special translation to 4-column (BED4) files so that they match the bedGraph format (chrom, start, end, score). BED3, BED6 and BED12 are the standard versions commonly used in the field.
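The column layout above can be sketched as a small parser. This is only an illustration, not ZENBU's loader: the function name `parse_bed_line` and the returned dict layout are hypothetical, and the handling of 4-column lines follows the bedGraph note above.

```python
# Column names from the BED specification, in order.
BED_COLUMNS = [
    "chrom", "chromStart", "chromEnd", "name", "score", "strand",
    "thickStart", "thickEnd", "itemRgb",
    "blockCount", "blockSizes", "blockStarts",
]


def parse_bed_line(line):
    """Split one whitespace-delimited BED line into a dict keyed by column name.

    Hypothetical helper for illustration; not part of ZENBU.
    """
    values = line.split()
    if not 3 <= len(values) <= len(BED_COLUMNS):
        raise ValueError(f"expected 3 to 12 fields, got {len(values)}")

    if len(values) == 4:
        # 4-column files are read as bedGraph: the fourth column is a score.
        names = ["chrom", "chromStart", "chromEnd", "score"]
    else:
        names = BED_COLUMNS[:len(values)]
    record = dict(zip(names, values))

    # Positions and the block count are integers.
    for key in ("chromStart", "chromEnd", "thickStart", "thickEnd", "blockCount"):
        if key in record:
            record[key] = int(record[key])

    # Block lists are comma-separated; empty items (e.g. from a trailing comma) are skipped.
    for key in ("blockSizes", "blockStarts"):
        if key in record:
            record[key] = [int(v) for v in record[key].split(",") if v]

    # Both lists should have exactly blockCount items.
    if "blockStarts" in record:
        count = record["blockCount"]
        if len(record["blockSizes"]) != count or len(record["blockStarts"]) != count:
            raise ValueError("blockSizes/blockStarts do not match blockCount")
    return record
```

The score is left as a string here because, in practice, files may carry either integer or decimal scores; convert it as needed for your data.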

ZENBU interpretation of BED files

The BED file format easily maps into the ZENBU data model.

  • chrom, chromStart, chromEnd, strand are directly interpreted as genomic coordinates. Note that BED files use a 0-based, end-exclusive coordinate space, while ZENBU uses a 1-based, end-inclusive coordinate space. ZENBU automatically handles the conversion between coordinate spaces.
  • name is stored in the ZENBU Feature name.
  • score is stored in the Feature significance. During data upload there is an option to copy the score into an Expression value of a specified DataType.
  • the three columns blockCount, blockSizes, blockStarts work together and are interpreted into SubFeatures on the primary Feature. Each of these SubFeatures is created with a FeatureSource category of block. If these columns generate SubFeatures, then ZENBU can also interpret the thickStart and thickEnd columns as follows:
    • if thickStart is not equal to chromStart, then the region from chromStart to thickStart is interpreted into a SubFeature of category 5utr
    • if thickEnd is not equal to chromEnd, then the region from thickEnd to chromEnd is interpreted into a SubFeature of category 3utr
  • itemRgb allows manual coloring of features in tracks (loaded from BED12 or BED9 files). If the itemRgb column is empty, it is not inserted into the metadata of the features. To visualize the color stored in the itemRgb metadata, make sure the metadata is present and that "full_feature" is enabled for the source outmode.

For example, this BED line

chr5	137801180	137805004	NM_001964	0.00	+	137801451	137803770	0	2	576,2558	0,1265

is interpreted into the ZENBU data model (shown here in the ZENBU XML export/interchange format):

<feature name="NM_001964" start="137801181" end="137805004" strand="+" >
    <chrom chr="chr5" asm="hg19" ucsc_sm="hg19" ncbi_asm="GRCh37" taxon_id="9606" length="180915260"/>
    <featuresource category="refgene" name="UCSC_hg19_refgene" feature_count="35067"/>
    <subfeatures count="4">
        <feature category="5utr" start="137801181" end="137801451" strand="+"/>
        <feature category="block" start="137801181" end="137801757" strand="+"/>
        <feature category="block" start="137802446" end="137805004" strand="+"/>
        <feature category="3utr" start="137803770" end="137805004" strand="+"/>
    </subfeatures>
</feature>
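The start and end attributes of the top-level feature follow from the coordinate conversion described earlier: moving from 0-based, end-exclusive to 1-based, end-inclusive adds one to the start and leaves the end unchanged. A minimal sketch, with hypothetical function names that are not part of ZENBU:

```python
def bed_to_one_based(chrom_start, chrom_end):
    """Convert a 0-based, end-exclusive BED interval to 1-based, end-inclusive."""
    # Only the start shifts; the exclusive end already equals the inclusive end.
    return chrom_start + 1, chrom_end


def one_based_to_bed(start, end):
    """Inverse conversion: 1-based, end-inclusive back to BED coordinates."""
    return start - 1, end


# chromStart and chromEnd from the example BED line.
start, end = bed_to_one_based(137801180, 137805004)
print(start, end)  # 137801181 137805004
```

The feature length is the same in both spaces: chromEnd - chromStart in BED, and end - start + 1 in 1-based inclusive coordinates.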

BED as OSCtable header

BED files can easily be represented with an OSCtable column header line using the ZENBU extended column namespace.

eedb:chrom	eedb:start.0base	eedb:end	eedb:name	eedb:score	eedb:strand	eedb:bed_thickstart	eedb:bed_thickend	bed:itemRgb	eedb:bed_block_count	eedb:bed_block_sizes	eedb:bed_block_starts

BED6

eedb:chrom	eedb:start.0base	eedb:end	eedb:name	eedb:score	eedb:strand

BED4 : aka bedGraph

eedb:chrom	eedb:start.0base	eedb:end	eedb:score

BED3

eedb:chrom	eedb:start.0base	eedb:end
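As a convenience, the header lines listed above can be generated from the column count of a BED file. The sketch below only covers the four variants shown on this page; the function name `osctable_header` is hypothetical, and the column names are copied verbatim from the headers above.

```python
# OSCtable column names for a full 12-column BED file, as listed above.
OSC_BED12_COLUMNS = [
    "eedb:chrom", "eedb:start.0base", "eedb:end",
    "eedb:name", "eedb:score", "eedb:strand",
    "eedb:bed_thickstart", "eedb:bed_thickend", "bed:itemRgb",
    "eedb:bed_block_count", "eedb:bed_block_sizes", "eedb:bed_block_starts",
]


def osctable_header(column_count):
    """Return the tab-separated OSCtable header for a BED3, BED4, BED6 or BED12 file."""
    if column_count == 4:
        # BED4 is treated as bedGraph, so the fourth column is the score, not the name.
        columns = OSC_BED12_COLUMNS[:3] + ["eedb:score"]
    elif column_count in (3, 6, 12):
        columns = OSC_BED12_COLUMNS[:column_count]
    else:
        raise ValueError(f"no header listed for {column_count}-column BED files")
    return "\t".join(columns)


print(osctable_header(6))
```

Other column counts raise an error here simply because this page does not list a header for them.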

BED specification

The official BED specification is available at http://genome.ucsc.edu/FAQ/FAQformat.html#format1