Difference between revisions of "OSCtable"

From ZENBU documentation wiki

Revision as of 09:45, 8 May 2012

OSCtable1.1

Basic structure

  • A simple tab-delimited text file.
  • Lines starting with '#' are comments.
  • Lines starting with '##' are attributes or metadata of the table. (See 'Metadata' section below)
  • The first line after the comments/metadata (see below) is a header line, which indicates the column names of the table.
  • Column order is flexible
  • All the comment and attribute lines should appear above the header line
  • The first column should describe a 'key' (unique in many cases, but not necessarily) of the data, and the column name should be 'id'
  • If a cell needs to include multiple values, a comma (',') is the recommended separator.
  • All the columns should be described in Metadata (See 'Metadata' section below)
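As a rough illustration of the structure above, the following Python sketch splits an OSCtable into comments, metadata lines, the header, and data rows. The function name read_osctable is hypothetical, and the sketch assumes that every comment and '##' line appears above the header, as the rules require.

```python
def read_osctable(text):
    """Split OSCtable text into (comments, metadata, header, rows)."""
    comments, metadata, header, rows = [], [], None, []
    for line in text.splitlines():
        if header is None:
            # '##' must be tested before '#', since both start with '#'.
            if line.startswith("##"):
                metadata.append(line[2:].strip())
                continue
            if line.startswith("#"):
                comments.append(line[1:].strip())
                continue
            if not line.strip():
                continue
            # First non-comment, non-metadata line is the header.
            header = line.split("\t")
        elif line.strip():
            # Assumption: every line after the header is a data row.
            rows.append(dict(zip(header, line.split("\t"))))
    return comments, metadata, header, rows
```

Rows are returned as dictionaries keyed by column name, which fits the rule that column order is flexible.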


Metadata ('##' lines)

The basic structure is '##qualifier = value'.
Required metadata: FileFormat, Date, ProtocolREF, ColumnVariable, ContactName, ContactEmail

The 'genome_assembly' parameter value is required when using

##NameSpace=genomic_coordinate

Required (mandatory) metadata

FileFormat

describes file format of this file.
example:

##FileFormat = OSCtable1.1

Date

describes the date when the data file is generated

##Date = 20090602

ProtocolREF

describes the protocol used to generate the data file.

##ProtocolREF = CAGEmappingv1.0

ColumnVariable

describes ALL the columns used in the data file

##ColumnVariable[start] = this is a start position of the genomic coordinate
##ColumnVariable[end] = this is a stop position of the genomic coordinate
##ColumnVariable[norm.THP10h] = this is TPM normalized value with 10h
##ColumnVariable[entrez_gene_id] = Entrez gene ID, which is assigned to the cluster

ContactName

describes the contact name about the data file.

##ContactName = Hideya Kawaji

ContactEmail

describes the contact address about the data file

##ContactEmail = kawaji@gsc.riken.jp

Optional metadata

InputFile

describes the file(s) used to generate the data file

##InputFile = lane1.fa
##InputFile = lane2.fa

ParameterValue

describes the parameter(s) used to generate the data file in the protocol
the parameter(s) should be consistent with the protocol description

##ParameterValue[alignment_program] = BWA
##ParameterValue[alignment_program_version] = 1.3.5
##ParameterValue[UCSC_gene_tracks] = RefSeq
##ParameterValue[UCSC_gene_tracks] = ENSEMBL transcript

NameSpace

describes the name space for the column names. See below (NameSpace)

##NameSpace=genomic_coordinate
##NameSpace=expression
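The '##qualifier = value' and '##qualifier[key] = value' forms shown in this section could be parsed along the following lines. This is a minimal Python sketch, not a reference parser: the names META_RE, parse_metadata, and missing_required are hypothetical, values are kept as lists because a qualifier such as InputFile may repeat, and lines with any other bracket layout are simply skipped.

```python
import re
from collections import defaultdict

# Matches '##Qualifier = value' and '##Qualifier[key] = value',
# with or without spaces around '='.
META_RE = re.compile(r"^##\s*(\w+)(?:\[([^\]]+)\])?\s*=\s*(.*)$")

# Required qualifiers as listed in the Metadata section.
REQUIRED = ("FileFormat", "Date", "ProtocolREF",
            "ColumnVariable", "ContactName", "ContactEmail")

def parse_metadata(lines):
    """Map (qualifier, key or None) to the list of values seen."""
    meta = defaultdict(list)
    for line in lines:
        m = META_RE.match(line)
        if m is None:
            continue  # plain comments and unrecognised lines are skipped
        qualifier, key, value = m.groups()
        meta[(qualifier, key)].append(value.strip())
    return dict(meta)

def missing_required(meta):
    """Return the required qualifiers that do not appear at all."""
    present = {qualifier for qualifier, _ in meta}
    return [q for q in REQUIRED if q not in present]
```

Keeping repeated values in a list preserves cases like the two UCSC_gene_tracks entries in the ParameterValue example.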

Column Name Spaces

  • A set of column names (and parameters) to be used for a specific purpose or context.
  • Identical column names within the same name space are treated as having the same (equivalent) meaning.
  • Supported name space: genomic_coordinate, expression

genomic_coordinate

  • column names are: chrom, start.0base, start.1base, end, strand
  • parameter value: genome_assembly
    • chrom: chromosome name used in the genome assembly. For example, chr1, chr2, chr3, ... chrM for the UCSC hg18 genome assembly.
    • start.0base: start position (bp) on the chromosome in 0start coordinate system (BED, PSL, BLAT, exonerate, and nexAlign style)
    • start.1base: start position (bp) on the chromosome in 1start coordinate system (conventional coordinate system; adopted in GFF as well)
    • end: end position (bp) on the chromosome
    • strand: strand on the chromosome; optional
    • Note:
      • Not all of the above columns are required. For example, start.0base is not required if you have start.1base, and strand is not required if the annotation does not have strand distinction.
      • 'genome_assembly' for parameter value is required.
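Because a row may carry either start.0base or start.1base, a reader has to normalise the start position before comparing features. The Python sketch below shows one way to do that; the function name to_one_based is hypothetical. It assumes the end column holds the same number under both conventions (a 0-based, half-open end equals a 1-based, inclusive end), which matches the single end column defined above.

```python
def to_one_based(row):
    """Return (chrom, start_1base, end, strand) for a genomic_coordinate row.

    `row` is a dict of column name -> string cell value.
    """
    if "start.1base" in row:
        start = int(row["start.1base"])
    elif "start.0base" in row:
        # A 0-based start is one less than the 1-based start.
        start = int(row["start.0base"]) + 1
    else:
        raise KeyError("row needs start.0base or start.1base")
    # strand is optional, so a missing column yields None.
    return row["chrom"], start, int(row["end"]), row.get("strand")
```

For example, a row with start.0base 100 and end 200 describes the same region as a row with start.1base 101 and end 200.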

expression

generic expression tags
  • column names are: raw.ZZZ, norm.ZZZ, or exp.YYY.ZZZ
    • raw.ZZZ : for raw value of expression such as raw_counts and raw signal intensities.
    • norm.ZZZ : for normalized value.
    • ZZZ indicates the experiment (cell conditions, RNAs, etc) of the expressions
    • YYY should not include dot (.), and indicates the type of expression.
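The naming scheme above can be decoded mechanically. The following Python sketch (parse_expression_column is a hypothetical name) returns the expression type and the experiment for a column name, or None for non-expression columns. Since only YYY is forbidden from containing a dot, the sketch assumes that everything after the second dot belongs to ZZZ.

```python
def parse_expression_column(name):
    """Return (datatype, experiment) for an expression column, else None."""
    prefix, _, rest = name.partition(".")
    # raw.ZZZ and norm.ZZZ: the prefix itself is the datatype.
    if prefix in ("raw", "norm") and rest:
        return prefix, rest
    # exp.YYY.ZZZ: YYY has no dot, so split at the first dot only.
    if prefix == "exp":
        datatype, _, experiment = rest.partition(".")
        if datatype and experiment:
            return datatype, experiment
    return None
```

With this reading, norm.THP10h yields the pair (norm, THP10h), while a coordinate column such as start.0base is not treated as an expression column.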
'mapcount' tag

To specify that alignments may be mapped on more than one location, you can either use...


ZENBU interpretation of OSCtable files

OSCtable is one of the main interchange formats for ZENBU. It allows all possible mappings of data into the ZENBU data model. Because the OSCtable specification is highly flexible, the ZENBU OSCtable parser was able to adopt an extended vocabulary of metadata directives and column name spaces.

Metadata

While the official OSCtable specification includes mandatory metadata elements, the ZENBU OSCtable parser relaxes this requirement. All ## metadata lines are parsed as optional metadata. The only requirement for a valid OSCtable for ZENBU is a column header line.

Experiment metadata

The original OSCtable specification works very well with single-experiment data files, but does not provide fine enough control over assigning metadata in a multi-experiment data file. To address this, ZENBU added an additional metadata directive to the OSCtable specification:

  • ##ExperimentMetadata[experiment-name][key] = value

experiment-name is the same as the ZZZ in the expression column header descriptions. By referencing the experiment-name, it is possible to have the same experiment used in multiple columns with different datatypes.

Example

##ParameterValue[filetype] = osc
##ParameterValue[genome] = mm9
##ColumnVariable[eedb:chrom] = chromosome name
##ColumnVariable[eedb:start.0base] = chromosome start in 0base coordinate system
##ColumnVariable[eedb:end] = chromosome end
##ColumnVariable[eedb:strand] = chromosome strand
##ColumnVariable[eedb:score] = score or significance of the feature
##ColumnVariable[exp.tagcount.Mouse_Embryoid_Body_RNAseq_exonic] = tagcount Mouse_Embryoid_Body_RNAseq_exonic
##ExperimentMetadata[Mouse_Embryoid_Body_RNAseq_exonic][eedb:display_name] = Mouse_Embryoid_Body_RNAseq_exonic
##ExperimentMetadata[Mouse_Embryoid_Body_RNAseq_exonic][eedb:platform] = SQRL_RNAseq
##ExperimentMetadata[Mouse_Embryoid_Body_RNAseq_exonic][description] = This is the exon junction signal for RNAseq of Mouse Embryoid body cells after 4 days of differentiation to the 'primitive streak stage' (see PMID:17286599 , and should contain expression of brachyury, mixl1, tbx6, and flk1) carried out at the IMB on Applied Biosystems SOLiD system (PMID: 18516046). Mouse strain: SV129. Mapping: published version.
##ExperimentMetadata[Mouse_Embryoid_Body_RNAseq_exonic][strain] = SV129
##ExperimentMetadata[Mouse_Embryoid_Body_RNAseq_exonic][tissue] = Mouse Embryoid body
##ColumnVariable[exp.tagcount.Mouse_Embryonic_Stem_Cell_RNAseq_exonic_signal] = tagcount Mouse_Embryonic_Stem_Cell_RNAseq_exonic_signal
##ExperimentMetadata[Mouse_Embryonic_Stem_Cell_RNAseq_exonic_signal][eedb:display_name] = Mouse_Embryonic_Stem_Cell_RNAseq_exonic_signal
##ExperimentMetadata[Mouse_Embryonic_Stem_Cell_RNAseq_exonic_signal][eedb:platform] = SQRL_RNAseq
##ExperimentMetadata[Mouse_Embryonic_Stem_Cell_RNAseq_exonic_signal][description] = Description: This is the exonic signal for RNAseq of Mouse Embryonic Stem Cells carried out at the IMB on Applied Biosystems SOLiD system (PMID: 18516046). Mouse strain: SV129. Mapping: published version
##ExperimentMetadata[Mouse_Embryonic_Stem_Cell_RNAseq_exonic_signal][strain] = SV129
##ExperimentMetadata[Mouse_Embryonic_Stem_Cell_RNAseq_exonic_signal][tissue] = Mouse Embryonic Stem Cells
eedb:chrom	eedb:start.0base	eedb:end	eedb:name	eedb:score	eedb:strand	exp.tagcount.Mouse_Embryoid_Body_RNAseq_exonic	exp.tagcount.Mouse_Embryonic_Stem_Cell_RNAseq_exonic_signal	exp.tpm.Mouse_Embryoid_Body_RNAseq_exonic	exp.tpm.Mouse_Embryonic_Stem_Cell_RNAseq_exonic_signal
chr7	52253690	52253691	block_chr7:52253691..52253691+	186.00	+	84.00	102.00	8.40	10.20
chr7	52253691	52253692	block_chr7:52253692..52253692+	184.00	+	83.00	101.00	8.30	10.10
chr7	52253692	52253693	block_chr7:52253693..52253693+	185.00	+	83.00	102.00	8.30	10.20
chr7	52253693	52253694	block_chr7:52253694..52253694+	180.00	+	81.00	99.00	8.10	9.90
chr7	52253694	52253695	block_chr7:52253695..52253695+	174.00	+	77.00	97.00	7.70	9.70
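The ##ExperimentMetadata lines in a header like the one above can be grouped per experiment with a small amount of code. This Python sketch is only an illustration of the directive's two-bracket layout; EXP_META_RE and parse_experiment_metadata are hypothetical names, and how ZENBU stores the values internally is not shown here.

```python
import re

# '##ExperimentMetadata[experiment-name][key] = value'
EXP_META_RE = re.compile(
    r"^##\s*ExperimentMetadata\[([^\]]+)\]\[([^\]]+)\]\s*=\s*(.*)$"
)

def parse_experiment_metadata(lines):
    """Group metadata as {experiment-name: {key: [values]}}."""
    experiments = {}
    for line in lines:
        m = EXP_META_RE.match(line)
        if m is None:
            continue  # other '##' directives are left to other code
        name, key, value = m.groups()
        experiments.setdefault(name, {}).setdefault(key, []).append(value.strip())
    return experiments
```

The experiment name recovered here is the same string as the ZZZ part of the exp.tagcount and exp.tpm column names, so the resulting dictionary can be joined to the expression columns by that key.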

Column name spaces

official OSCtable column names

The official OSCtable specification has very few predefined column names

  • chrom -- chromosome name
  • start.0base -- chromosome start in a 0base coordinate system
  • start.1base -- chromosome start in a 1base coordinate system
  • end -- chromosome end location
  • strand -- chromosome strand
  • ID -- interpreted as the name of the Feature

While ZENBU uses a 1-based-inclusive coordinate space internally, it can automatically handle the conversion between coordinate spaces at load time.

ZENBU additional column namespaces

  • eedb:name -- the name of the Feature
  • eedb:score -- is stored in the Feature significance.
  • bed:blockCount, bed:blockSizes, bed:blockStarts -- are taken from the BED file specification. These three columns work together and are interpreted into SubFeatures on the primary Feature. Each of these SubFeatures is created with a FeatureSource category of block.
  • bed:thickStart and bed:thickEnd -- are taken from the BED file specification.
    • if bed:thickStart is not equal to start then the region from start to bed:thickStart is interpreted into a SubFeature of category 5utr
    • if bed:thickEnd is not equal to end then the region from bed:thickEnd to end is interpreted into a SubFeature of category 3utr
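The BED-derived rules above can be sketched in Python as follows. This is an illustration of the rules as worded, not ZENBU's implementation: the function names are hypothetical, SubFeatures are represented as plain (category, start, end) tuples, coordinates are assumed to be 0-based half-open as in BED, block starts are assumed to be offsets from the feature start as the BED specification defines them, and strand is not considered.

```python
def parse_int_list(cell):
    """Parse a comma-separated cell; BED allows a trailing comma."""
    return [int(v) for v in cell.split(",") if v.strip()]

def block_subfeatures(start, block_sizes, block_starts):
    """One ('block', start, end) tuple per BED block."""
    sizes = parse_int_list(block_sizes)
    offsets = parse_int_list(block_starts)
    return [("block", start + off, start + off + size)
            for off, size in zip(offsets, sizes)]

def utr_subfeatures(start, end, thick_start, thick_end):
    """Apply the thickStart/thickEnd rules literally, ignoring strand."""
    subfeatures = []
    if thick_start != start:
        subfeatures.append(("5utr", start, thick_start))
    if thick_end != end:
        subfeatures.append(("3utr", thick_end, end))
    return subfeatures
```

A feature whose thick region spans the whole feature produces no UTR SubFeatures under these rules.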

Aliased column names

Several column names have aliases to other column name spaces

  • name -- same as eedb:name
  • score -- same as eedb:score
  • eedb:chrom -- same as chrom
  • eedb:start.0base -- same as start.0base
  • eedb:start.1base -- same as start.1base
  • eedb:end -- same as end
  • eedb:strand -- same as strand
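One way to handle these aliases when reading a header is to map every column name to a single canonical spelling before further processing. In the Python sketch below, the ALIASES table and the canonical_columns function are hypothetical, choosing the eedb: form as the canonical one is an arbitrary decision made for the example, and the single-colon spelling follows the example header shown earlier.

```python
# Plain column name -> eedb: alias, following the list above.
ALIASES = {
    "name": "eedb:name",
    "score": "eedb:score",
    "chrom": "eedb:chrom",
    "start.0base": "eedb:start.0base",
    "start.1base": "eedb:start.1base",
    "end": "eedb:end",
    "strand": "eedb:strand",
}

def canonical_columns(header):
    """Rewrite aliased names to one spelling; leave other names alone."""
    return [ALIASES.get(name, name) for name in header]
```

After this step, a header written with plain names and one written with eedb: names compare equal column by column.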

ignoring columns

The OSCtable format makes it easy to wrap any tab-delimited text file into an OSCtable by simply prepending a header with the appropriate column names. But sometimes these original files contain columns which one might not really need. To simplify the process of loading, ZENBU added a special column name:

  • ignore.xxxx -- ignore this column, where xxxx would be the original column name

If a column is labeled as such, it is stripped from the data file and discarded on loading. This can simplify the process for bioinformaticians and avoids unneeded scripting to munge data prior to loading.
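The effect of the ignore.xxxx column name can be pictured with a short Python sketch. The function name drop_ignored is hypothetical and the code only mimics the described behaviour on an in-memory table; it is not ZENBU's loader.

```python
def drop_ignored(header, rows):
    """Remove every column whose name starts with 'ignore.'.

    `header` is a list of column names, `rows` a list of cell lists.
    """
    keep = [i for i, name in enumerate(header)
            if not name.startswith("ignore.")]
    new_header = [header[i] for i in keep]
    new_rows = [[row[i] for i in keep] for row in rows]
    return new_header, new_rows
```

For instance, a header of chrom, ignore.comment, end keeps only the chrom and end cells of each row.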