Revision as of 12:27, 25 October 2017
OSCtable file format
The OSCTable is an open-structure, tab-separated text table format. Any tab-text file that can be loaded into a spreadsheet program like Excel or into data analysis programs like R can be loaded into ZENBU using the OSCtable format (file extension .osc). Just like an Excel file or most analysis-related tab-text files, an OSCtable file uses the first line of the file as a header to describe the columns. The difference is that OSCTable uses a controlled nomenclature of header terms to control how the file is loaded and interpreted. This means that OSCTable allows flexible column ordering and a user-specified number of columns.
ZENBU includes a rich set of header terms so that any tab-text table can be translated into .osc format and loaded. This column header name nomenclature is described below.
Here is a simple example of adding an OSCTable header line to a bed6 file so that it can be loaded as an OSCTable file with a .osc file extension. In fact, this is how ZENBU translates BED files internally.
eedb:chrom eedb:start.0base eedb:end eedb:name eedb:score eedb:strand
chr1 67092175 67134971 NM_001276352 0 -
chr1 201283451 201332993 NM_000299 0 +
chr1 67092175 67134971 NM_001276351 0 -
chr1 67092175 67134971 NR_075077 0 -
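The wrapping step shown above can be sketched in a few lines of Python. This is only an illustration of prepending the header line to an existing bed6 file; the function name wrap_bed6_as_osc is hypothetical and is not part of ZENBU.

```python
# Column names for a bed6 file, in the order given in the example above.
BED6_OSC_HEADER = [
    "eedb:chrom", "eedb:start.0base", "eedb:end",
    "eedb:name", "eedb:score", "eedb:strand",
]


def wrap_bed6_as_osc(bed_text: str) -> str:
    """Return bed_text with a tab-separated OSCtable header line prepended."""
    header = "\t".join(BED6_OSC_HEADER)
    # Make sure the data ends with a newline so the result is a clean text file.
    body = bed_text if bed_text.endswith("\n") else bed_text + "\n"
    return header + "\n" + body


bed = "chr1\t67092175\t67134971\tNM_001276352\t0\t-\n"
osc = wrap_bed6_as_osc(bed)
print(osc, end="")
```

The result would then be saved with a .osc file extension before loading.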
OSCtable is one of the main interchange formats for ZENBU. It allows all possible mapping of data into the ZENBU data model. Since the OSCtable specification is highly flexible, it was possible for the ZENBU OSCtable parser to have an extended vocabulary of metadata directives and column name spaces.
Column header name nomenclature
The first non-commented line of the file is interpreted as the column-header line. This line is required for OSCTable files because without it the file cannot be translated.
core column name nomenclature
The official OSCtable specification has very few predefined column names, but all are understood by ZENBU and easily mapped onto the Feature:
- eedb:chrom -- chromosome name
- eedb:start.0base -- chromosome start in a 0base-exclusive coordinate system (like BED files)
- eedb:start.1base -- chromosome start in a 1base-inclusive coordinate system (like GFF files)
- eedb:end -- chromosome end location
- eedb:strand -- chromosome strand
- eedb:name -- the name of the Feature
- exp.YYY.ZZZ -- interpreted as expression value of datatype YYY for experiment named ZZZ. This will create the experiment source named ZZZ.
- exp.YYY -- interpreted as expression value of datatype YYY for the primary/default experiment of the file. This will not create additional experiments within the file.
- raw.ZZZ -- interpreted as expression value of datatype raw for experiment named ZZZ. This will create the experiment source named ZZZ.
- norm.ZZZ -- interpreted as expression value of datatype norm for experiment named ZZZ. This will create the experiment source named ZZZ.
While ZENBU uses a 1-based-inclusive coordinate space internally, it can automatically handle the conversion between coordinate spaces at load time and when exporting data out of the system. Please just specify the correct coordinate space for your file, there is no need to convert your files.
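For readers who want to check coordinates by hand, the arithmetic behind the two start conventions is small. The sketch below is an illustration only, since ZENBU performs this conversion itself at load and export time; the function names are hypothetical. It assumes the usual relationship between BED-style and GFF-style intervals: the start differs by one and the end is unchanged.

```python
def to_1base_inclusive(start_0base: int, end: int) -> tuple[int, int]:
    """Convert a BED-style (0-based start) interval to 1-based inclusive."""
    return start_0base + 1, end


def to_0base(start_1base: int, end: int) -> tuple[int, int]:
    """Convert a 1-based inclusive interval back to a 0-based start."""
    return start_1base - 1, end


# First data row of the bed6 example earlier on this page.
print(to_1base_inclusive(67092175, 67134971))
```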
additional column name nomenclature
- eedb:fsrc_category -- causes creation of multiple FeatureSources from a single OSCtable file using the value of this column as different source categories
- eedb:score -- is stored in the Feature significance.
- eedb:bed_block_count -- are taken from the BED file specification.
- eedb:bed_block_sizes -- are taken from the BED file specification.
- eedb:bed_block_starts -- are taken from the BED file specification.
- These three columns work together and are interpreted into SubFeatures on the primary Feature. Each of these SubFeatures are created with a FeatureSource category of block.
- eedb:bed_thickstart
- eedb:bed_thickend -- are taken from the BED file specification.
- If eedb:bed_thickstart is not equal to start, then the region from start to eedb:bed_thickstart is interpreted into a SubFeature of category 5utr
- If eedb:bed_thickend is not equal to end, then the region from eedb:bed_thickend to end is interpreted into a SubFeature of category 3utr
- bed:itemRgb -- bed file style rgb color stored as Feature Metadata
- eedb:genome -- if specified in a column, this allows data from multiple-genomes within the same OSCtable file
- eedb:sam_flag -- ZENBU parses strand out of the SAM file flag column.
- eedb:sam_cigar -- can be parsed into chrom_end and subfeatures
- eedb:sam_opt
- eedb:ctg_cigar -- a ZENBU specific extension of the cigar concept allowing multiple overlapping subfeature layers with different FeatureSource categories.
- example eedb:ctg_cigar -- 3utr:4738N415M,5utr:375M,block:404M1628N769M1377N977M
- gff:attributes -- GFF2/GFF3 style tag=value; data with extended support for feature/subfeature linking
- gff:source -- GFF source column. Interpreted as metadata
- gff:frame -- GFF frame column. Interpreted as metadata
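The eedb:ctg_cigar example in the list above can be taken apart with a short Python sketch. The layout assumed here (comma-separated layers, each written as category:cigar) is inferred from that single example rather than from a formal grammar, and parse_ctg_cigar is a hypothetical helper, not the ZENBU parser.

```python
import re

# One CIGAR operation: a length followed by a single operation letter.
CIGAR_OP = re.compile(r"(\d+)([A-Za-z=])")


def parse_ctg_cigar(value: str) -> dict[str, list[tuple[int, str]]]:
    """Split a ctg_cigar string into {category: [(length, op), ...]}."""
    layers = {}
    for layer in value.split(","):
        category, _, cigar = layer.partition(":")
        layers[category] = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    return layers


example = "3utr:4738N415M,5utr:375M,block:404M1628N769M1377N977M"
for category, ops in parse_ctg_cigar(example).items():
    print(category, ops)
```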
Any column with an unknown name is mapped into the Metadata of the Feature. The column name becomes the key of the metadata. For example you could use the column name barcode for a column and since that is not part of the controlled nomenclature, it will be interpreted as metadata and the values in that column will be added to the row feature as barcode=xxxxxx.
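The fallback rule can be pictured with a small Python sketch. The CONTROLLED set below is a deliberately partial, hypothetical subset of the nomenclature, used only to show how an unrecognised column such as barcode would end up as metadata; it does not reproduce the real ZENBU parser.

```python
# Partial, illustrative subset of the controlled column names.
CONTROLLED = {
    "eedb:chrom", "eedb:start.0base", "eedb:start.1base",
    "eedb:end", "eedb:strand", "eedb:name", "eedb:score",
}


def split_row(header: list[str], row: list[str]) -> tuple[dict, dict]:
    """Separate one row into controlled columns and free-form metadata."""
    core, metadata = {}, {}
    for name, value in zip(header, row):
        target = core if name in CONTROLLED else metadata
        target[name] = value
    return core, metadata


header = ["eedb:chrom", "eedb:start.0base", "eedb:end", "barcode"]
row = ["chr1", "100", "200", "ACGTAC"]
core, metadata = split_row(header, row)
print(metadata)
```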
Aliased column names
Several column names have aliases to other column name spaces
- name -- same as eedb:name
- ID -- same as eedb:name
- score -- same as eedb:score
- chrom -- same as eedb:chrom
- start.0base -- same as eedb:start.0base
- start.1base -- same as eedb:start.1base
- end -- same as eedb:end
- strand -- same as eedb:strand
- eedb:mapcount -- same as mapcount
- eedb:significance -- same as eedb:score
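The alias list above amounts to a lookup table. A minimal Python sketch, assuming a simple dictionary is enough to describe it (the names ALIASES and canonical_names are hypothetical):

```python
# Alias -> canonical name, transcribed from the list above.
ALIASES = {
    "name": "eedb:name",
    "ID": "eedb:name",
    "score": "eedb:score",
    "chrom": "eedb:chrom",
    "start.0base": "eedb:start.0base",
    "start.1base": "eedb:start.1base",
    "end": "eedb:end",
    "strand": "eedb:strand",
    "eedb:mapcount": "mapcount",
    "eedb:significance": "eedb:score",
}


def canonical_names(header: list[str]) -> list[str]:
    """Replace aliased column names; leave every other name untouched."""
    return [ALIASES.get(name, name) for name in header]


print(canonical_names(["chrom", "start.0base", "end", "ID", "barcode"]))
```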
ignoring columns
The OSCtable format makes it easy to wrap any tab-text file into an OSCtable by simply prepending a header with the appropriate column names. But sometimes these original files contain columns which one does not really need. To simplify the process of loading, ZENBU added a special column name:
- ignore.xxxx -- ignore this column, where xxxx would be the original column name
If a column is labeled this way, it is stripped from the data file on loading and discarded. This can simplify the process for bioinformaticians and avoid unneeded scripting to munge data prior to loading.
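The effect of the ignore. prefix can be mimicked outside ZENBU with a short Python sketch, which may help when previewing what a file will look like after loading. The function drop_ignored is a hypothetical illustration and not the ZENBU loader.

```python
def drop_ignored(header: list[str], rows: list[list[str]]):
    """Remove every column whose header name starts with 'ignore.'."""
    keep = [i for i, name in enumerate(header) if not name.startswith("ignore.")]
    new_header = [header[i] for i in keep]
    new_rows = [[row[i] for i in keep] for row in rows]
    return new_header, new_rows


header = ["eedb:chrom", "eedb:start.0base", "ignore.bin", "eedb:end"]
rows = [["chr1", "100", "585", "200"]]
print(drop_ignored(header, rows))
```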
Wrapping external file formats with OSCtable headers
With the extended vocabulary of the ZENBU OSCtable parser, it is possible to wrap external file formats very easily with an OSCtable header and load them into ZENBU. In fact, the ZENBU upload support for BED, GTF and GFF is done by wrapping predefined OSCtable column headers onto these files.
BED oscheader
Here is the column header line to wrap a BED file
eedb:chrom eedb:start.0base eedb:end eedb:name eedb:score eedb:strand eedb:bed_thickstart eedb:bed_thickend bed:itemRgb eedb:bed_block_count eedb:bed_block_sizes eedb:bed_block_starts
GFF oscheader
GFF files can easily be represented with an OSCtable column header line using the ZENBU extended column namespace.
eedb:chrom gff:source eedb:fsrc_category eedb:start.1base eedb:end eedb:score eedb:strand gff:frame gff:attributes
The gff:attributes column has a complete ZENBU parser attached to it. The parser can interpret this column in either the older GFF/GTF tag<space>value format or the GFF2/GFF3 style tag=value format. The gff:attributes column can be used to store feature/subfeature relationships (GFF3 specification), the name of the feature (GFF2 & GFF3), and all variable metadata of the Feature (original GFF specification).
gff:source and gff:frame are currently not interpreted but simply stored as Metadata.
SAM oscheader
Here is the column header line to wrap a SAM file
eedb:name eedb:sam_flag eedb:chrom eedb:start.1base eedb:score eedb:sam_cigar sam:mrnm sam:mpos sam:isize eedb:seqread sam:qual eedb:sam_opt
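As an illustration of what parsing strand out of the eedb:sam_flag column involves, the sketch below reads bit 0x10 of the SAM flag, which the SAM specification defines as "read is reverse complemented". How ZENBU treats other flag bits (for example unmapped reads) is not described here, so this hypothetical helper covers the strand bit only.

```python
SAM_FLAG_REVERSE = 0x10  # SAM flag bit: sequence is reverse complemented


def strand_from_sam_flag(flag: int) -> str:
    """Return '-' when the reverse-strand bit is set, otherwise '+'."""
    return "-" if flag & SAM_FLAG_REVERSE else "+"


for flag in (0, 16, 83, 99):
    print(flag, strand_from_sam_flag(flag))
```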
ENCODE NarrowPeak oscheader
ENCODE narrowPeak (or Point-Source) format is used to provide called peaks of signal enrichment based on pooled, normalized (interpreted) data. It is a BED6+4 format.
- chrom - Name of the chromosome (or contig, scaffold, etc.).
- chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
- chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
- name - Name given to a region (preferably unique). Use '.' if no name is assigned.
- score - Indicates how dark the peak will be displayed in the browser (0-1000). If all scores were '0' when the data were submitted to the DCC, the DCC assigned scores 1-1000 based on signal value. Ideally the average signalValue per base spread is between 100-1000.
- strand - +/- to denote strand or orientation (whenever applicable). Use '.' if no orientation is assigned.
- signalValue - Measurement of overall (usually, average) enrichment for the region.
- pValue - Measurement of statistical significance (-log10). Use -1 if no pValue is assigned.
- qValue - Measurement of statistical significance using false discovery rate (-log10). Use -1 if no qValue is assigned.
- peak - Point-source called for this peak; 0-based offset from chromStart. Use -1 if no point-source called.
Here is an example of narrowPeak format:
track type=narrowPeak visibility=3 db=hg19 name="nPk" description="ENCODE narrowPeak Example"
browser position chr1:9356000-9365000
chr1 9356548 9356648 . 0 . 182 5.0945 -1 50
chr1 9358722 9358822 . 0 . 91 4.6052 -1 40
chr1 9361082 9361182 . 0 . 182 9.2103 -1 75
To wrap an ENCODE narrowPeak (or Point-Source) file with an OSCtable header and load it into ZENBU, the OSCtable header should be
##ParameterValue[filetype] = osc
##ParameterValue[display_name] = <THE-DISPLAY-NAME-TO-BE-EDITED-HERE>
##ExperimentMetadata[x][eedb:display_name] = <THE-DISPLAY-NAME-TO-BE-EDITED-HERE>
##ColumnVariable[eedb:chrom] = chromosome name
##ColumnVariable[eedb:start.0base] = chromosome start in 0base coordinate system
##ColumnVariable[eedb:end] = chromosome end
##ColumnVariable[eedb:strand] = chromosome strand
##ColumnVariable[eedb:score] = score or significance of the feature
##ColumnVariable[exp.signal.x] = measurement of overall (usually, average) enrichment for the region
##ColumnVariable[exp.qvalue.x] = measurement of statistical significance using false discovery rate (-log10). Use -1 if no qValue is assigned
##ColumnVariable[exp.pvalue.x] = measurement of statistical significance (-log10). Use -1 if no pValue is assigned
##ColumnVariable[point_source] = point-source called for this peak; 0-based offset from chromStart. Use -1 if no point-source called
eedb:chrom eedb:start.0base eedb:end eedb:name eedb:score eedb:strand exp.signal.x exp.pvalue.x exp.qvalue.x point_source
Upon data loading, leave the "display name" textbox empty so as not to overwrite the one available from the OSCtable (and to keep the FeatureSource and ExperimentSource names in sync).
ENCODE BroadPeak oscheader
ENCODE broadPeak (or Regions) format is used to provide called regions of signal enrichment based on pooled, normalized (interpreted) data. It is a BED 6+3 format.
- chrom - Name of the chromosome (or contig, scaffold, etc.).
- chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
- chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
- name - Name given to a region (preferably unique). Use '.' if no name is assigned.
- score - Indicates how dark the peak will be displayed in the browser (0-1000). If all scores were '0' when the data were submitted to the DCC, the DCC assigned scores 1-1000 based on signal value. Ideally the average signalValue per base spread is between 100-1000.
- strand - +/- to denote strand or orientation (whenever applicable). Use '.' if no orientation is assigned.
- signalValue - Measurement of overall (usually, average) enrichment for the region.
- pValue - Measurement of statistical significance (-log10). Use -1 if no pValue is assigned.
- qValue - Measurement of statistical significance using false discovery rate (-log10). Use -1 if no qValue is assigned.
Here is an example of broadPeak format:
track type=broadPeak visibility=3 db=hg19 name="bPk" description="ENCODE broadPeak Example"
browser position chr1:798200-800700
chr1 798256 798454 . 116 . 4.89716 3.70716 -1
chr1 799435 799507 . 103 . 2.46426 1.54117 -1
chr1 800141 800596 . 107 . 3.22803 2.12614 -1
To wrap an ENCODE broadPeak (or Regions) file with an OSCtable header and load it into ZENBU, the OSCtable header should be
##ParameterValue[filetype] = osc
##ParameterValue[display_name] = <THE-DISPLAY-NAME-TO-BE-EDITED-HERE>
##ExperimentMetadata[x][eedb:display_name] = <THE-DISPLAY-NAME-TO-BE-EDITED-HERE>
##ColumnVariable[eedb:chrom] = chromosome name
##ColumnVariable[eedb:start.0base] = chromosome start in 0base coordinate system
##ColumnVariable[eedb:end] = chromosome end
##ColumnVariable[eedb:strand] = chromosome strand
##ColumnVariable[eedb:score] = score or significance of the feature
##ColumnVariable[exp.signal.x] = measurement of overall (usually, average) enrichment for the region
##ColumnVariable[exp.qvalue.x] = measurement of statistical significance using false discovery rate (-log10). Use -1 if no qValue is assigned
##ColumnVariable[exp.pvalue.x] = measurement of statistical significance (-log10). Use -1 if no pValue is assigned
eedb:chrom eedb:start.0base eedb:end eedb:name eedb:score eedb:strand exp.signal.x exp.pvalue.x exp.qvalue.x
Upon data loading, leave the "display name" textbox empty so as not to overwrite the one available from the OSCtable (and to keep the FeatureSource and ExperimentSource names in sync).
Any line in the file can start with a # character to represent a comment. These lines are ignored by the ZENBU OSCTable parser, but can be useful for making your data file more human readable. In addition, the original OSCTable specification included a special set of ## comment lines prior to the column-header line. ZENBU does not parse the ##NameSpace metadata directive but instead uses a controlled vocabulary of column names for interpreting the data and mapping it into the ZENBU DataModel.
While the official OSCtable specification includes mandatory metadata elements, the ZENBU OSCtable parser relaxes this requirement. All ## metadata lines are parsed as optional metadata. Even ##ColumnVariable[] directives are considered optional metadata. The only requirement for a valid OSCtable for ZENBU is the column header line.
The primary ## metadata syntaxes used by ZENBU are:
##ParameterValue[key] = value
##ColumnVariable[col_name] = description
##key = value
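The three syntaxes above are regular enough to be recognised with two regular expressions. The following Python sketch is one possible reading of them, not the ZENBU parser; parse_metadata_line is a hypothetical name, and the decision to return a plain tuple is only for illustration.

```python
import re

# ##Directive[key]... = value (one or more bracketed keys)
BRACKETED = re.compile(r"^##\s*(\w+)((?:\[[^\]]*\])+)\s*=\s*(.*)$")
# ##key = value
PLAIN = re.compile(r"^##\s*([^=\[\]]+?)\s*=\s*(.*)$")


def parse_metadata_line(line: str):
    """Return (directive, *keys, value), (key, value), or None."""
    line = line.rstrip("\n")
    m = BRACKETED.match(line)
    if m:
        keys = re.findall(r"\[([^\]]*)\]", m.group(2))
        return (m.group(1), *keys, m.group(3).strip())
    m = PLAIN.match(line)
    if m:
        return (m.group(1), m.group(2).strip())
    return None  # ordinary comment or data line


print(parse_metadata_line("##ParameterValue[genome] = mm9"))
print(parse_metadata_line("##Date = 20090602"))
```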
Experiment metadata
The original OSCtable specification works very well with single experiment data files, but does not provide enough fine control of assigning metadata in a multi-experiment data file. To alleviate this, ZENBU added an additional metadata directive to the OSCtable specification
- ##ExperimentMetadata[experiment-name][key] = value
experiment-name is the same as the ZZZ in the expression column header descriptions. By referencing the experiment-name, it is possible to have the same experiment used in multiple columns with different datatypes.
Example (please note that the text is wrapped below for display)
##ParameterValue[filetype] = osc
##ParameterValue[genome] = mm9
##ColumnVariable[eedb:chrom] = chromosome name
##ColumnVariable[eedb:start.0base] = chromosome start in 0base coordinate system
##ColumnVariable[eedb:end] = chromosome end
##ColumnVariable[eedb:strand] = chromosome strand
##ColumnVariable[eedb:score] = score or significance of the feature
##ColumnVariable[exp.tagcount.Mouse_Embryoid_Body_RNAseq_exonic] = tagcount Mouse_Embryoid_Body_RNAseq_exonic
##ExperimentMetadata[Mouse_Embryoid_Body_RNAseq_exonic][eedb:display_name] = Mouse_Embryoid_Body_RNAseq_exonic
##ExperimentMetadata[Mouse_Embryoid_Body_RNAseq_exonic][eedb:platform] = SQRL_RNAseq
##ExperimentMetadata[Mouse_Embryoid_Body_RNAseq_exonic][description] = This is the exon junction signal for RNAseq of Mouse Embryoid body cells after 4 days of differentiation to the 'primitive streak stage' (see PMID:17286599 , and should contain expression of brachyury, mixl1, tbx6, and flk1) carried out at the IMB on Applied Biosystems SOLiD system (PMID: 18516046). Mouse strain: SV129. Mapping: published version.
##ExperimentMetadata[Mouse_Embryoid_Body_RNAseq_exonic][strain] = SV129
##ExperimentMetadata[Mouse_Embryoid_Body_RNAseq_exonic][tissue] = Mouse Embryoid body
##ColumnVariable[exp.tagcount.Mouse_Embryonic_Stem_Cell_RNAseq_exonic_signal] = tagcount Mouse_Embryonic_Stem_Cell_RNAseq_exonic_signal
##ExperimentMetadata[Mouse_Embryonic_Stem_Cell_RNAseq_exonic_signal][eedb:display_name] = Mouse_Embryonic_Stem_Cell_RNAseq_exonic_signal
##ExperimentMetadata[Mouse_Embryonic_Stem_Cell_RNAseq_exonic_signal][eedb:platform] = SQRL_RNAseq
##ExperimentMetadata[Mouse_Embryonic_Stem_Cell_RNAseq_exonic_signal][description] = Description: This is the exonic signal for RNAseq of Mouse Embryonic Stem Cells carried out at the IMB on Applied Biosystems SOLiD system (PMID: 18516046). Mouse strain: SV129. Mapping: published version
##ExperimentMetadata[Mouse_Embryonic_Stem_Cell_RNAseq_exonic_signal][strain] = SV129
##ExperimentMetadata[Mouse_Embryonic_Stem_Cell_RNAseq_exonic_signal][tissue] = Mouse Embryonic Stem Cells
eedb:chrom eedb:start.0base eedb:end eedb:name eedb:score eedb:strand exp.tagcount.Mouse_Embryoid_Body_RNAseq_exonic exp.tagcount.Mouse_Embryonic_Stem_Cell_RNAseq_exonic_signal exp.tpm.Mouse_Embryoid_Body_RNAseq_exonic exp.tpm.Mouse_Embryonic_Stem_Cell_RNAseq_exonic_signal
chr7 52253690 52253691 block_chr7:52253691..52253691+ 186.00 + 84.00 102.00 8.40 10.20
chr7 52253691 52253692 block_chr7:52253692..52253692+ 184.00 + 83.00 101.00 8.30 10.10
chr7 52253692 52253693 block_chr7:52253693..52253693+ 185.00 + 83.00 102.00 8.30 10.20
chr7 52253693 52253694 block_chr7:52253694..52253694+ 180.00 + 81.00 99.00 8.10 9.90
chr7 52253694 52253695 block_chr7:52253695..52253695+ 174.00 + 77.00 97.00 7.70 9.70
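To see how ##ExperimentMetadata lines group by experiment-name, the directives from a header like the one above can be collected into a nested dictionary. This is a hedged sketch with a hypothetical function name; it assumes one directive per line and lets a repeated key overwrite the earlier value, which may differ from what ZENBU does.

```python
import re
from collections import defaultdict

EXP_META = re.compile(
    r"^##ExperimentMetadata\[([^\]]+)\]\[([^\]]+)\]\s*=\s*(.*)$"
)


def collect_experiment_metadata(lines):
    """Group ##ExperimentMetadata directives as {experiment: {key: value}}."""
    experiments = defaultdict(dict)
    for line in lines:
        m = EXP_META.match(line.rstrip("\n"))
        if m:
            name, key, value = m.groups()
            experiments[name][key] = value.strip()
    return dict(experiments)


header_lines = [
    "##ParameterValue[genome] = mm9",
    "##ExperimentMetadata[Mouse_Embryoid_Body_RNAseq_exonic][eedb:platform] = SQRL_RNAseq",
    "##ExperimentMetadata[Mouse_Embryoid_Body_RNAseq_exonic][strain] = SV129",
]
print(collect_experiment_metadata(header_lines))
```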
Basic structure
- A simple tab-delimited text file.
- Column order is flexible
- Lines starting with '#' are comments.
- Lines starting with '##' are attributes or metadata of the table. (See Metadata section below)
- The first line after the comments/metadata (see below) is a header line, which indicates the column names of the table.
- All the comment and attribute lines should appear above the header line
- The first column should describe a 'key' (unique in many cases, but not necessarily) of the data, and the column name should be 'id'
- If a cell needs to include multiple values, comma(',') is recommended to be used as a separator.
- All the columns should be described in Metadata (See 'Metadata' section below)
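The structure listed above can be summarised as a small Python sketch that separates a file into metadata lines, comment lines, the header and the data rows. The function split_osc is hypothetical, and skipping blank lines is an assumption made here for convenience rather than something the specification states.

```python
def split_osc(text: str):
    """Split OSCtable text into (metadata, comments, header, rows)."""
    metadata, comments, header, rows = [], [], None, []
    for line in text.splitlines():
        if line.startswith("##"):
            metadata.append(line)          # attribute / metadata line
        elif line.startswith("#"):
            comments.append(line)          # plain comment
        elif not line.strip():
            continue                       # assumption: blank lines are skipped
        elif header is None:
            header = line.split("\t")      # first non-comment line
        else:
            rows.append(line.split("\t"))
    return metadata, comments, header, rows


sample = "##FileFormat = OSCtable1.1\n# a comment\nid\tchrom\nc1\tchr1\n"
print(split_osc(sample))
```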
File Metadata ('##' lines)
The basic structure is '##qualifier = value'.
Required metadata: FileFormat, Date, ProtocolREF, ColumnVariable, ContactName, ContactEmail
'genome_assembly' for parameter value is required, when using
##NameSpace=genomic_coordinate
Required (mandatory) metadata
- FileFormat -- describes file format of this file.
##FileFormat = OSCtable1.1
- Date -- describes the date when the data file is generated
##Date = 20090602
- ProtocolREF -- describes the protocol used to generate the data file.
##ProtocolREF = CAGEmappingv1.0
- ColumnVariable -- describes ALL the columns used in the data file
##ColumnVariable[start] = this is a start position of the genomic coordinate
##ColumnVariable[end] = this is a stop position of the genomic coordinate
##ColumnVariable[norm.THP10h] = this is TPM normalized value with 10h
##ColumnVariable[entrez_gene_id] = Entrez gene ID, which is assigned to the cluster
- ContactName -- describes the contact name about the data file.
##ContactName = Hideya Kawaji
- ContactEmail -- describes the contact address about the data file
##ContactEmail = kawaji@gsc.riken.jp
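A file can be checked against the required qualifiers of the original specification with a few lines of Python. This is a sketch only: missing_required is a hypothetical helper, it checks that each qualifier appears at least once rather than validating its value, and ZENBU itself does not enforce these fields.

```python
REQUIRED = (
    "FileFormat", "Date", "ProtocolREF",
    "ColumnVariable", "ContactName", "ContactEmail",
)


def missing_required(lines):
    """Return the required qualifiers that never appear in the ## lines."""
    seen = set()
    for line in lines:
        if line.startswith("##"):
            # Take the text before '=' and drop any [bracketed] key.
            qualifier = line[2:].split("=", 1)[0].split("[", 1)[0].strip()
            seen.add(qualifier)
    return [q for q in REQUIRED if q not in seen]


header_lines = [
    "##FileFormat = OSCtable1.1",
    "##Date = 20090602",
    "##ColumnVariable[start] = this is a start position of the genomic coordinate",
]
print(missing_required(header_lines))
```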
Optional metadata
- InputFile -- describes the file(s) used to generate the data file
##InputFile = lane1.fa
##InputFile = lane2.fa
- ParameterValue -- describes the parameter(s) used to generate the data file in the protocol. The parameter(s) should be consistent with the protocol description
##ParameterValue[alignment_program] = BWA
##ParameterValue[alignment_program_version] = 1.3.5
##ParameterValue[UCSC_gene_tracks] = RefSeq
##ParameterValue[UCSC_gene_tracks] = ENSEMBL transcript
- NameSpace -- describes the name space for the column names. See below (NameSpace)
##NameSpace=genomic_coordinate
##NameSpace=expression
Column Name Spaces
- A set of column names (and parameters) to be used for a specific purpose or context.
- The same column names with the same name space are recognized as the same (equivalent) meaning.
- Supported name space: genomic_coordinate, expression
genomic_coordinate
- column names are: chrom, start.0base, start.1base, end, strand
- parameter value: genome_assembly
- chrom: chromosome name used in the genome assembly. For example, chr1, chr2, chr3, ... chrM for the UCSC hg18 genome assembly.
- start.0base: start position (bp) on the chromosome in 0start coordinate system (BED, PSL, BLAT, exonerate, and nexAlign style)
- start.1base: start position (bp) on the chromosome in 1start coordinate system (conventional coordinate system; adopted in GFF as well)
- end: end position (bp) on the chromosome
- strand: strand on the chromosome; optional
- Note:
- Not all of the above columns are required. For example, start.0base is not required if you have start.1base, and strand is not required if the annotation does not have strand distinction.
- 'genome_assembly' for parameter value is required.
expression
- generic expression tags -- column names follow the form : exp.YYY.ZZZ, raw.ZZZ, norm.ZZZ, or mapcount
- exp.YYY.ZZZ is the general form for describing an expression column. YYY labels the DataType of the expression and should not include any dot (.) characters. ZZZ indicates the name of the Experiment source of the expression.
- raw.ZZZ is shorthand for exp.raw.ZZZ where raw is a datatype for un-processed values of expression such as raw_counts and raw signal intensities.
- norm.ZZZ is shorthand for exp.norm.ZZZ where norm is a datatype for normalized value.
- mapcount specifies the number of locations where this element has been mapped onto the genome.
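The expression column forms can be decoded with a short Python sketch. It assumes, as stated above, that the datatype YYY contains no dots, so everything after the second dot is treated as the experiment name ZZZ. The exp.YYY form without an experiment name, which ZENBU assigns to the primary/default experiment of the file, is returned with None as the experiment. The function name is hypothetical.

```python
def parse_expression_column(name: str):
    """Return (datatype, experiment) for an expression column, else None."""
    parts = name.split(".", 2)
    if parts[0] == "exp" and len(parts) == 3:
        return parts[1], parts[2]               # exp.YYY.ZZZ
    if parts[0] == "exp" and len(parts) == 2:
        return parts[1], None                   # exp.YYY, default experiment
    if parts[0] in ("raw", "norm") and len(parts) >= 2:
        return parts[0], name.split(".", 1)[1]  # shorthand for exp.raw / exp.norm
    return None                                 # not an expression column


for column in ("exp.tagcount.Mouse_Embryoid_Body_RNAseq_exonic",
               "norm.THP10h", "exp.tpm", "eedb:chrom"):
    print(column, parse_expression_column(column))
```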