Data Stream Processing Concept

From ZENBU documentation wiki
Revision as of 12:00, 1 October 2012 by Nicolas.bertin (talk | contribs)

One of the unique features of the ZENBU system is the ability to apply data processing and analysis on demand, at query time, as part of the visualization process. Raw or unprocessed data can be loaded into ZENBU, which translates it into the internal Data Model. ZENBU can then perform many of the data manipulations and analyses that previously required bioinformatics experts with knowledge of the Unix command line and a collection of bioinformatics tools.

Data processing is applied at the track level at query time, so no intermediate result needs to be stored in a database or on disk. This allows the user to modify processing parameters and immediately see the effect of the change in the visualization. It also makes the system very fast, since data is processed in memory and there is no overhead of writing intermediate results to slow disks and reading them back.

Because data processing is applied to each track, and tracks are loaded independently, a level of parallelism is inherent in the design of the system. The processed results that ZENBU generates on demand can also be downloaded as data files for further analysis by external systems like R, Bioconductor, or Biopython.

Data processing is controlled through a Scripting system in which Processing modules are chained together, in a manner similar to digital signal processing [1].
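
The chaining idea can be illustrated with a minimal Python sketch built on generators. This is an analogy only, not ZENBU's scripting syntax: the `Feature` type and the `min_score` and `scale_score` modules are hypothetical names invented for illustration. Each module consumes a stream and yields a new one, so modules compose like stages in a signal-processing chain, and nothing is computed until the final stream is read.

```python
from typing import Iterable, Iterator, NamedTuple


class Feature(NamedTuple):
    """Hypothetical stand-in for a genomic Feature on the stream."""
    chrom_start: int
    chrom_end: int
    strand: str
    score: float


def min_score(stream: Iterable[Feature], threshold: float) -> Iterator[Feature]:
    """Hypothetical module: pass through only Features scoring >= threshold."""
    for feature in stream:
        if feature.score >= threshold:
            yield feature


def scale_score(stream: Iterable[Feature], factor: float) -> Iterator[Feature]:
    """Hypothetical module: multiply each Feature's score by factor."""
    for feature in stream:
        yield feature._replace(score=feature.score * factor)


source = [
    Feature(100, 150, "+", 0.5),
    Feature(120, 170, "-", 2.0),
    Feature(300, 350, "+", 4.0),
]

# Chain the modules: the output of one is the input of the next.
pipeline = scale_score(min_score(source, threshold=1.0), factor=10.0)
result = list(pipeline)
```

Changing a parameter such as `threshold` only requires re-running the chain over the source stream, which mirrors the on-demand reprocessing described above.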

Sorted Data Stream

The central concept of any track in the ZENBU system is that all data comes through the system as a single stream of data. This single data stream is often the result of pooling multiple data sources together.

This central data stream concept means that any object of the Data Model can be passed on this stream. This gives the processing and visualization systems a great deal of flexibility since all information can be made available on the data stream.

For genomic Features, every data stream in the system preserves a region-location sort order. When multiple sources are merged together in a Pool, the Features are "merge sorted" so that this sort order is preserved. The sort order is also preserved when Features pass through processing modules. Because every data stream is required to follow this sort order, it is easy to write signal-processing modules that take advantage of it efficiently. Many processing operations can therefore be performed without buffering data or requiring massive amounts of memory. This is one of the key features of the ZENBU system, allowing it to work with terabytes of data while still running on modest hardware.
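
The merge step can be sketched in Python with the standard library's `heapq.merge`, which lazily combines already-sorted inputs while holding only one pending item per source. This is a sketch under stated assumptions, not ZENBU's implementation: Features are modeled as plain `(chrom_start, chrom_end, strand)` tuples, `pool` and `location_key` are hypothetical names, and the relative order of "+" and "-" is simply Python's string order.

```python
import heapq
from typing import Iterable, Iterator, Tuple

# Assumed representation: (chrom_start, chrom_end, strand)
FeatureTuple = Tuple[int, int, str]


def location_key(feature: FeatureTuple) -> Tuple[int, int, str]:
    """Sort key: start first, then end, then strand (assumed ordering)."""
    return (feature[0], feature[1], feature[2])


def pool(*sources: Iterable[FeatureTuple]) -> Iterator[FeatureTuple]:
    """Lazily merge sources that are each already sorted by location_key."""
    return heapq.merge(*sources, key=location_key)


source_a = [(100, 150, "+"), (300, 350, "-")]
source_b = [(100, 150, "-"), (200, 250, "+")]

merged = list(pool(source_a, source_b))
```

Because the merge is lazy, downstream modules can start consuming Features before any source has been fully read, which is what keeps memory use low.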

The genomic location sort order for Features appearing on the stream is as follows, in decreasing priority:

  • chrom_start
  • chrom_end
  • strand

This means that location takes priority over strand. One advantage of this sort order is that it becomes very easy to flip between stranded and strandless analysis without requiring buffering or re-sorting.
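
A small Python sketch shows why no re-sorting is needed. Because strand is the last sort key, Features sharing a location are adjacent on the stream whether or not strand is considered, so a single pass can group them either way. The tuple layout `(chrom_start, chrom_end, strand)` and the function name `count_by_location` are assumptions made for illustration, not part of ZENBU.

```python
from itertools import groupby
from typing import Iterable, Iterator, Tuple

# Assumed representation: (chrom_start, chrom_end, strand)
FeatureTuple = Tuple[int, int, str]


def count_by_location(
    stream: Iterable[FeatureTuple], stranded: bool
) -> Iterator[Tuple[tuple, int]]:
    """Count adjacent Features per location in one pass over a sorted stream."""
    if stranded:
        key = lambda f: (f[0], f[1], f[2])  # keep strands separate
    else:
        key = lambda f: (f[0], f[1])        # ignore strand
    # groupby only groups adjacent items, so it relies on the sort order.
    for location, group in groupby(stream, key=key):
        yield location, sum(1 for _ in group)


# Already sorted by (chrom_start, chrom_end, strand).
stream = [(100, 150, "+"), (100, 150, "+"), (100, 150, "-"), (200, 250, "-")]

stranded_counts = list(count_by_location(stream, stranded=True))
strandless_counts = list(count_by_location(stream, stranded=False))
```

Had strand been the first sort key instead, the strandless grouping would need the whole stream buffered and re-sorted before Features at the same location became adjacent.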