Data Abstraction Model

From ZENBU documentation wiki

Although the internal data abstraction model is not directly exposed to users of the system, understanding it helps explain how data is stored and processed. For advanced users of the script processing system, understanding the data model is important for writing custom processing scripts.


ZENBU internal data model

The data model is an evolution of the model first described in the FANTOM4 EdgeExpress system (Genome Biol. 2009;10(4):R39. Epub 2009 Apr 19).

The ZENBU data model is composed of

  • data sources (FeatureSource, Experiment, EdgeSource)
  • genomic location information (Features)
  • numerical signal-based value data (Signal)
  • connections between Features (Edges)
  • and descriptive metadata.

[Figure: EEDB data model diagram (EEDB DataModel.jpg)]

Features and associated SubFeatures

Features

The Feature is the central element in the data model.
It represents a generic object in the system. A Feature must belong to a FeatureSource. The primary attributes of a Feature include a name, a significance, and genomic coordinates.
Genomic coordinates are defined as:

  • chromosome assigned to a specific species assembly
  • chrom_start
  • chrom_end
  • strand

For Features to be visualized in the ZENBU genome browser, genomic coordinates are mandatory.
ZENBU Feature genomic coordinates are 1-based and inclusive, which means that chromosomes start at 1 and a feature of length 1 bp has the same chrom_start and chrom_end.
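
As a minimal sketch of what this convention implies (chromosomes start at 1, and a 1 bp feature has equal chrom_start and chrom_end): the length of a feature is chrom_end - chrom_start + 1, and converting to a 0-based, half-open format such as BED only shifts the start. The function names below are illustrative and are not part of ZENBU.

```python
from typing import Tuple


def feature_length(chrom_start: int, chrom_end: int) -> int:
    """Length in bp when both start and end positions are 1-based and included."""
    if chrom_start < 1 or chrom_end < chrom_start:
        raise ValueError("expected 1 <= chrom_start <= chrom_end")
    return chrom_end - chrom_start + 1


def to_zero_based_half_open(chrom_start: int, chrom_end: int) -> Tuple[int, int]:
    """Convert to the 0-based, half-open convention used by formats such as BED."""
    return chrom_start - 1, chrom_end


# A 1 bp feature has identical start and end coordinates.
print(feature_length(100, 100))           # 1
print(to_zero_based_half_open(100, 100))  # (99, 100)
```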

In addition, a Feature can have Signal and Metadata attached to it.

SubFeatures

A Feature can also have other Features attached under it, which are called SubFeatures.
The most common use for SubFeatures is to define the exon/intron/UTR structure of a spliced gene model for the primary Feature, but any category can be defined through the FeatureSources of the attached SubFeatures. For example, one could define protein domains as SubFeature regions of the primary Feature, with their own categories, in addition to the exon structure.
SubFeatures are allowed to overlap each other and do not need to be exclusive.
Currently (as of version 2.5) SubFeatures cannot have another layer of SubFeatures under them.
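
The containment rules above can be sketched as a small data structure. This is an illustrative model only: the class, field, and method names are assumptions and do not reflect ZENBU's internal classes. It accepts overlapping SubFeatures but rejects a second layer of nesting.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Feature:
    # Hypothetical, simplified stand-in for a ZENBU Feature.
    category: str
    chrom_start: int
    chrom_end: int
    strand: str = "+"
    is_subfeature: bool = False
    subfeatures: List["Feature"] = field(default_factory=list)

    def add_subfeature(self, sub: "Feature") -> None:
        # Only one layer of SubFeatures is allowed.
        if self.is_subfeature:
            raise ValueError("a SubFeature cannot have SubFeatures of its own")
        if sub.subfeatures:
            raise ValueError("cannot attach a Feature that already has SubFeatures")
        sub.is_subfeature = True
        self.subfeatures.append(sub)


gene = Feature("refgene", 137801181, 137805004)
gene.add_subfeature(Feature("5utr", 137801181, 137801451))
# This block overlaps the 5' UTR, which is permitted.
gene.add_subfeature(Feature("block", 137801181, 137801757))
```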

Here is an example of a Feature with SubFeatures, displayed in the ZENBU XML export/interchange format:

<feature name="NM_001964" start="137801181" end="137805004" strand="+" >
    <chrom chr="chr5" asm="hg19" ucsc_asm="hg19" ncbi_asm="GRCh37" taxon_id="9606" length="180915260"/>
    <featuresource category="refgene" name="UCSC_hg19_refgene" feature_count="35067"/>
    <subfeatures count="4">
        <feature category="5utr" start="137801181" end="137801451" strand="+"/>
        <feature category="block" start="137801181" end="137801757" strand="+"/>
        <feature category="block" start="137802446" end="137805004" strand="+"/>
        <feature category="3utr" start="137803770" end="137805004" strand="+"/>
    </subfeatures>
</feature>
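
For script authors, the following is a minimal sketch of reading such an export with Python's standard library. The element and attribute names are taken from the example above (abbreviated here); the function name and the returned dictionary layout are illustrative choices, not a ZENBU API.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the example above.
XML_TEXT = """
<feature name="NM_001964" start="137801181" end="137805004" strand="+">
    <chrom chr="chr5" asm="hg19"/>
    <featuresource category="refgene" name="UCSC_hg19_refgene"/>
    <subfeatures count="4">
        <feature category="5utr" start="137801181" end="137801451" strand="+"/>
        <feature category="block" start="137801181" end="137801757" strand="+"/>
        <feature category="block" start="137802446" end="137805004" strand="+"/>
        <feature category="3utr" start="137803770" end="137805004" strand="+"/>
    </subfeatures>
</feature>
"""


def read_feature(xml_text):
    """Parse one <feature> element into a plain dictionary."""
    root = ET.fromstring(xml_text.strip())
    chrom = root.find("chrom")
    subs = [
        (sub.get("category"), int(sub.get("start")), int(sub.get("end")))
        for sub in root.findall("./subfeatures/feature")
    ]
    return {
        "name": root.get("name"),
        "chrom": chrom.get("chr"),
        "assembly": chrom.get("asm"),
        "start": int(root.get("start")),
        "end": int(root.get("end")),
        "strand": root.get("strand"),
        "subfeatures": subs,
    }


feature = read_feature(XML_TEXT)
blocks = [s for s in feature["subfeatures"] if s[0] == "block"]
print(feature["name"], feature["chrom"], len(feature["subfeatures"]), len(blocks))
# NM_001964 chr5 4 2
```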

Signal data

A Signal represents a single signal-based data element and must be attached to a Feature. In addition to the actual signal value, each Signal element has a mandatory DataType. The DataType describes and categorizes the values so that signal from many FeatureSources and many Experiments can be pooled together for comparison. Example DataTypes include "tagcount", "tpm", "mapquality", "score", "pvalue", and "rle", to name a few.

By definition each Signal data element has one Feature, one Experiment, one FeatureSource, one DataType and one value (number).
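
This definition can be sketched as a flat record, with pooling by DataType as a simple grouping step. The class and function names are hypothetical, identifiers are plain strings for brevity, and the experiment names and values are made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    # Hypothetical flat record: one Feature, one Experiment,
    # one FeatureSource, one DataType and one numeric value.
    feature: str
    experiment: str
    feature_source: str
    datatype: str
    value: float


def pool_by_datatype(signals):
    """Group values by DataType so that only comparable values are pooled."""
    pooled = defaultdict(list)
    for s in signals:
        pooled[s.datatype].append(s.value)
    return dict(pooled)


# Made-up values for illustration only.
signals = [
    Signal("NM_001964", "exp_A", "UCSC_hg19_refgene", "tagcount", 12.0),
    Signal("NM_001964", "exp_B", "UCSC_hg19_refgene", "tagcount", 30.0),
    Signal("NM_001964", "exp_A", "UCSC_hg19_refgene", "tpm", 4.5),
]
pooled = pool_by_datatype(signals)
print(pooled)  # {'tagcount': [12.0, 30.0], 'tpm': [4.5]}
```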

Edge

A connection between two Features in the system. Currently Edges are rarely used in the ZENBU system, but they have been retained from the EdgeExpress system for backward compatibility and possible future expansions.

Data Sources

These represent a collection of data of a certain class in the system and are made visible to the users in the data explorer interface.
Every DataSource has metadata describing the source, which also allows users to search for and find data sets so that the data can be manipulated and visualized.

FeatureSource

A collection of Features. Each Feature is part of only one FeatureSource.
A FeatureSource is often used to represent a collection of annotations such as "Human hg19 Entrez genes". In addition, every file uploaded into the system is assigned a primary FeatureSource to represent that file as a collection of data.
FeatureSources can be dynamically generated by processing modules of the system to represent dynamically created Features.

Experiment

A collection of Signal data, and by connection a collection of Features.
Since a Feature can have many Signal objects attached, the Experiment is critical to describing the Signal.

EdgeSource

A collection of Edges.
This is rarely used by the current ZENBU system, but has been retained for backward compatibility to EdgeExpressDB and for future expansion capabilities.

Metadata system

Metadata is descriptive text which can be attached to any object in the ZENBU data model. Metadata is divided into two concepts: Metadata and Symbols.

Metadata

Metadata elements are not themselves searchable; each represents a blob of text or data.
The ZENBU system automatically extracts keyword Symbols from Metadata text, so that to the user the Metadata effectively appears searchable.
In general Metadata is used for descriptive text, but it can also be XML or uuencoded data.
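
The actual extraction rules are not described here. The sketch below only illustrates the general idea of deriving searchable keyword symbols from a free-text Metadata value; the tokenization rule, the lowercasing, and the minimum keyword length are all assumptions.

```python
import re


def extract_keywords(text, min_length=3):
    """Split free text into lowercase keyword candidates (assumed rules)."""
    tokens = re.findall(r"[A-Za-z0-9_]+", text)
    return {t.lower() for t in tokens if len(t) >= min_length}


keywords = extract_keywords("Human hg19 Entrez genes")
print(sorted(keywords))  # ['entrez', 'genes', 'hg19', 'human']
```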

Symbols

Symbols are small atomic text units which can be searched.
These are often keywords or controlled vocabulary terms. Symbols can be ad-hoc or from controlled Ontologies.

Search system

The ZENBU system provides a complete metadata search system modeled on Google/Yahoo-style searching, with the addition of rigorous logic control: and, or, not, and parentheses ( ).
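
To illustrate how such logic control can be evaluated, here is a minimal sketch of a boolean keyword matcher. It is not ZENBU's search implementation, and the exact query syntax is not specified on this page: the operator spellings, the precedence (not binds tighter than and, which binds tighter than or), and the function name are assumptions.

```python
import re


def matches(query, symbols):
    """Evaluate a boolean keyword query against a set of lowercase symbols."""
    tokens = re.findall(r"\(|\)|[^\s()]+", query.lower())
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = peek()
        if tok is None:
            raise ValueError("unexpected end of query")
        pos += 1
        return tok

    def parse_or():
        result = parse_and()
        while peek() == "or":
            take()
            rhs = parse_and()
            result = result or rhs
        return result

    def parse_and():
        result = parse_not()
        while peek() == "and":
            take()
            rhs = parse_not()
            result = result and rhs
        return result

    def parse_not():
        if peek() == "not":
            take()
            return not parse_not()
        return parse_atom()

    def parse_atom():
        tok = take()
        if tok == "(":
            result = parse_or()
            if take() != ")":
                raise ValueError("expected ')'")
            return result
        if tok in (")", "and", "or"):
            raise ValueError(f"unexpected token {tok!r}")
        # A plain keyword matches if it is one of the object's symbols.
        return tok in symbols

    result = parse_or()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return result


symbols = {"human", "hg19", "entrez", "genes"}
print(matches("human and (entrez or refseq)", symbols))  # True
print(matches("human and not hg19", symbols))            # False
```

In this sketch, adjacent keywords without an explicit operator are rejected rather than treated as an implicit "and".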