Data Stream Processing

One of the unique features of the ZENBU system is the ability to apply data processing and analysis on demand, at query time and as part of the visualization process. This means that raw or unprocessed data can be loaded into the ZENBU system, which translates it into the internal Data Model; ZENBU can then perform many of the data manipulations and analyses that previously required bioinformatics experts with knowledge of the unix command line and a collection of bioinformatics tools.

The data processing system is applied at the track level at query time, so no intermediate result needs to be stored in a database or on disk. This allows the user to modify processing parameters and immediately see the effect of the change in the visualization. It also makes the system very fast, since data is processed in memory without the overhead of reading and writing to slow disks.

Because data processing is applied to each track, and tracks are loaded independently, there is a level of parallelism inherent in the design of the system. The processed data generated by ZENBU on demand can also be downloaded as data files for further analysis in external systems like R, BioConductor, or BioPython.

Data processing is controlled through a Scripting system based on chaining Processing modules together in a manner similar to digital signal processing [1].

Sorted Data Stream

The central concept of any track in the ZENBU system is that all data comes through the system as a single stream of data. This single data stream is often the result of pooling multiple data sources together.

This central data stream concept means that any object of the Data Model can be passed on this stream. This gives the processing and visualization systems a great deal of flexibility since all information can be made available on the data stream.

For genomic Features, every data stream in the system preserves a region-location sort order. When multiple sources are merged together in a Pool, the Features are "merge sorted" so that this sort order is preserved. When Features are processed by different processing modules, the sort order is also preserved. Because all data streams are required to follow this sort order, it becomes very easy to write signal-processing modules that can take advantage of it efficiently. This means that many processing operations can be performed without buffering data or requiring massive amounts of memory. This is one of the key features of the ZENBU system which allows it to work with terabytes of data yet still run on modest hardware.

The genomic location sort order for Features appearing on the stream is as follows

  • chrom_start
  • chrom_end
  • strand

This means that location takes priority over strand. One advantage of this sort order is that it becomes very easy to flip between stranded and strandless analysis without requiring buffering or resorting.

Scripting

ZENBU data processing scripts are written in an XML description language. The basic form of the script starts with an outer XML tag structure of

  <zenbu_script>
    ...
  </zenbu_script>

[Screenshot: configuring a custom XML script]

Within that structure there are several sections

  • <datastream>: allows specification of alternate "virtual Data Source pools" for use in coordination with Proxy modules. Each different "datastream" gets its own tag section.
  • <stream_processing>: defines the streaming chain of processing modules which are injected between the DataSources of the track and the Visualization. Data processing happens in a signal-processing style by daisy-chaining[1] multiple processing modules together. Some processing modules operate by combining data from multiple data streams through the use of a <side_stream> specification inside the module configuration.
    In the example script shown after this list, the data on the primary stream is first processed by TemplateCluster against a side-stream of gencode data sources, which collates the expression into the gencode annotation features; the second module, NormalizeRPKM, then normalizes the expression; and the third module in the chain, CalcFeatureSignificance, recalculates the combined expression of all experiments into the significance of each Feature on the stream.
  • <track_defaults>: defines default options in the track configuration panel when used with "sharing a predefined script". When a predefined script with a track_defaults section is loaded into a track, those parameters of the "Track configuration panel" are toggled into this new default state. This makes it easy for script writers to define a package of both processing and visualization as a "saved predefined script". Only one <track_defaults> tag is allowed inside a script. Attribute options are:
    • source_outmode: sets the "feature mode" in the Data Source section
    • datatype: sets the "data source type" in the Data Source section
    • glyphStyle: sets the visualization style
    • scorecolor: sets the score_color by name
    • backColor: sets the background color
    • hidezeroexps: sets the state of the hide zero experiments checkbox
    • exptype: sets the display datatype
    • height: sets the "track pixel height" for express style tracks
    • expscaling: sets the "express scale" option for express style tracks
    • strandless: sets the "strandless" option for express style tracks
    • logscale: sets the "log scale" option for express style tracks
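
As a concrete illustration, here is a hedged sketch that assembles these sections into the TemplateCluster → NormalizeRPKM → CalcFeatureSignificance chain described above. The section tags (<zenbu_script>, <datastream>, <stream_processing>, <side_stream>, <track_defaults>) and the track_defaults attribute names are taken from this page; the per-module tag form (written here as <spstream module="...">), the way the side stream refers back to the named datastream, and the example attribute values are assumptions, so consult a saved predefined script or the individual module pages for the exact syntax.

  <zenbu_script>
    <!-- virtual Data Source pool used on the side stream; the child tags
         that reference the actual gencode sources are omitted here -->
    <datastream name="gencode_pool">
      ...
    </datastream>

    <stream_processing>
      <!-- collate expression from the primary stream into the gencode
           annotation features supplied on the side stream -->
      <spstream module="TemplateCluster">
        <side_stream>
          <datastream name="gencode_pool"/>
        </side_stream>
      </spstream>

      <!-- normalize the collated expression (reads per kilobase per million) -->
      <spstream module="NormalizeRPKM"/>

      <!-- recalculate the combined expression of all experiments into a
           per-Feature significance -->
      <spstream module="CalcFeatureSignificance"/>
    </stream_processing>

    <!-- default state of the Track configuration panel when this script is
         loaded; attribute names are from the list above, values are illustrative -->
    <track_defaults glyphStyle="express" exptype="raw" height="100"
                    logscale="1" strandless="1"/>
  </zenbu_script>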


After a script has been created and is working as desired, it can be saved and shared with other users through the save script button in the track reconfigure panel.

Please check out each of the processing modules below. Every module's wiki page includes an example script showing how that module can be used and illustrating the structure of the scripting XML language. Many module pages also contain a hyperlink to an active ZENBU view page as a live example of the script in action.

Processing modules

Processing is accomplished by chaining a series of processing modules (or plugins) together between the pooled data source and the visualization / data download output. In addition, some modules provide for side-chaining additional data streams into the main signal-processing data stream. Side chains can be simple or complex chains of processing modules, as in this case study.

The processing modules can be broken down into several conceptual categories

Infrastructure modules

These modules provide access to additional data sources for use on side-streams (see the sketch after this list).

  • Proxy: Provide security-checked access to data sources loaded into ZENBU.
  • FeatureEmitter: Create regular grids of features dynamically.
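
As a rough sketch of how an infrastructure module sits on a side-stream, the fragment below uses FeatureEmitter to dynamically generate a regular grid of features that a collation module such as TemplateCluster can use as templates for binning expression. The <side_stream> placement follows the description above; the <spstream module="..."> tag form is an assumption, and the FeatureEmitter parameters (such as the grid/bin width) are left as a comment because they are documented on the module's own page.

  <spstream module="TemplateCluster">
    <side_stream>
      <!-- FeatureEmitter dynamically emits a regular grid of features for
           the primary-stream expression to be collated into; its parameters
           (e.g. bin width) are documented on the FeatureEmitter page -->
      <spstream module="FeatureEmitter">
        ...
      </spstream>
    </side_stream>
  </spstream>

A Proxy module would occupy the same position when the side-stream should instead draw from a named <datastream> pool, as in the full script sketch above.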

Clustering, collation, peak calling

These modules provide for high-level manipulations of data to reduce the number of features on the data stream by grouping them into related concepts.

Filtering

These modules remove data from the stream based on filtering criteria (a sketch follows the list).

  • TemplateFilter: Use a side-chain-stream as a mask to filter features on the primary stream.
  • CutoffFilter: Filter features using simple cutoff filters (high pass, low pass, band pass).
  • ExpressionDatatypeFilter: Filter expression from features based on datatype.
  • FeatureLengthFilter: Filter Features based on min/max length criteria.
  • TopHits: Filter neighborhood-regions based on best feature significance.
  • NeighborCutoff: Noise filtering relative to strongest signal within a neighborhood-region.
  • StrandFilter: Filter features based on strand.
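
The exact parameters accepted by each filter are given on its own wiki page. As a hedged sketch of the TemplateFilter pattern described above, the fragment below masks the primary stream with the features of a side-stream pool (the pool name "mask_regions" is purely illustrative, and the <spstream module="..."> tag form is again an assumption).

  <stream_processing>
    <!-- keep only primary-stream features that overlap the mask features
         supplied on the side stream -->
    <spstream module="TemplateFilter">
      <side_stream>
        <datastream name="mask_regions"/>
      </side_stream>
    </spstream>
  </stream_processing>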

Data normalization and rescaling

These modules alter the expression in a stream based on normalization or rescaling algorithms (a sketch follows the list).

  • NormalizeByFactor: Normalize expression with respect to the experiments' associated metadata.
  • NormalizePerMillion: Normalize expression with respect to the total expression of the associated experiments (stored as metadata at upload time).
  • NormalizeRPKM: Reads Per Kilobase per Million (RPKM) based expression normalization.
  • RescalePseudoLog: Pseudo-log transformation of expression values.
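
Since NormalizePerMillion and RescalePseudoLog work from the expression values and from metadata stored at upload time, they are shown in the hedged sketch below without any module-specific options (the <spstream module="..."> tag form is an assumption; any available options are listed on the individual module pages).

  <stream_processing>
    <!-- scale expression against the per-experiment totals stored as metadata at upload time -->
    <spstream module="NormalizePerMillion"/>
    <!-- then compress the dynamic range with a pseudo-log transform -->
    <spstream module="RescalePseudoLog"/>
  </stream_processing>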

Metadata manipulation

General manipulation

These modules are general-purpose lego blocks for manipulating objects on the stream, helping to get data into the right format for the next module in the stream.