Difference between revisions of "Data Stream Processing"
Revision as of 16:39, 27 November 2012
One of the unique features of the ZENBU system is the ability to apply data processing and analysis on demand, at query time, as part of the visualization process. This means that raw or unprocessed data can be loaded into ZENBU, which translates it into the internal Data Model, and ZENBU can then perform many of the data manipulations and analyses that previously required bioinformatics experts with knowledge of the Unix command line and a collection of bioinformatics tools.
The data processing system is applied on a track level at query time. This means that no intermediary result needs to be stored in a database or on disk. This allows the user to modify processing parameters and immediately see the effect of the change in the visualization. It also makes the system very fast since data is processed in memory and there is no overhead of reading and writing to slow disks.
Because data processing is applied on each track, and tracks are loaded independently, there is a level of parallelism inherent in the design of the system. The processed data result generated by ZENBU on demand can also be downloaded into data files for further analysis by external systems like R, Bioconductor, or Biopython.
Data processing is controlled through a scripting system based on chaining processing modules together, in a manner similar to digital signal processing.
Sorted Data Stream
The central concept of any track in the ZENBU system is that all data comes through the system as a single stream of data. This single data stream is often the result of pooling multiple data sources together.
This central data stream concept means that any object of the Data Model can be passed on this stream. This gives the processing and visualization systems a great deal of flexibility since all information can be made available on the data stream.
For genomic Features, every data stream in the system preserves a region-location sort order. When multiple sources are merged together in a Pool, the Features are "merge sorted" so that this sort order is preserved. When Features are processed by different processing modules, the sort order is also preserved. Because all data streams are required to follow this sort order, it becomes very easy to write signal-processing modules that take advantage of it efficiently. Many processing operations can therefore be performed without buffering data or requiring massive amounts of memory. This is one of the key features of the ZENBU system, which allows it to work with terabytes of data yet still run on modest hardware.
The genomic location sort order for Features appearing on the stream is as follows
This means that location takes priority over strand. One advantage of this sort order is that it becomes very easy to flip between stranded and strandless analysis without requiring buffering or re-sorting.
ZENBU data processing scripts are written in an XML description language. The basic form of a script starts with an outer XML tag structure of
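The merge-sorted pooling described above can be sketched in Python with the standard-library `heapq.merge`. This is only an illustration, not ZENBU code: the `Feature` record is hypothetical, and the exact sort key (chromosome, start, end, strand) is an assumption chosen to be consistent with location taking priority over strand.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class Feature(NamedTuple):
    """Hypothetical minimal feature record, not the ZENBU Data Model."""
    chrom: str
    start: int
    end: int
    strand: str


def sort_key(feature: Feature):
    # Assumed ordering: location fields first, strand last.
    return (feature.chrom, feature.start, feature.end, feature.strand)


def pool(*sources: Iterable[Feature]) -> Iterator[Feature]:
    """Lazily merge already-sorted sources into one sorted stream."""
    return heapq.merge(*sources, key=sort_key)


source_a = [Feature("chr1", 100, 200, "+"), Feature("chr1", 500, 600, "-")]
source_b = [Feature("chr1", 100, 200, "-"), Feature("chr1", 300, 400, "+")]
merged = list(pool(source_a, source_b))
```

Because `heapq.merge` only looks at the next item of each source, the pooled stream is produced without loading any source fully into memory, which is the property the sort-order requirement is meant to provide.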
<zenbu_script> ... </zenbu_script>
Within that structure there are several sections:
- <datastream> : allows specification of alternate "virtual Data Source pools" for use in coordination with Proxy modules. Each different "datastream" gets its own tag section
- <stream_processing> : defines the streaming chain of processing modules which are injected between the DataSource of the track and the Visualization. Data processing happens in a signal-processing style by daisy-chaining multiple processing modules together. Some processing modules operate by combining data from multiple data streams through the use of a <side_stream> specification inside the module configuration.
As an example, consider a script whose chain has three modules. The data on the primary stream is first processed by TemplateCluster against a side-stream of gencode data sources, which collates the expression into the gencode annotation features. The second module, NormalizeRPKM, then normalizes the expression. The third module in the chain, CalcFeatureSignificance, recalculates the combined expression of all experiments into the significance of each Feature on the stream.
- <track_defaults> : defines default options in the track configuration panel when used with "sharing a predefined script". When a predefined script with a track_defaults section is loaded into a track, those parameters of the "Track configuration panel" are toggled into this new default state. This makes it easy for script writers to define a package of both processing and visualization as a "saved predefined script". Only one <track_defaults> tag is allowed inside a script. Attribute options are:
- source_outmode : sets the "feature mode" in the Data Source section
- datatype : sets the "expression datatype" in the Data Source section
- glyphStyle : sets the visualization style
- scorecolor : sets the score_color by name
- backColor : sets the background color
- hidezeroexps : sets the state of the hide zero experiments checkbox
- exptype : sets the display datatype
- height : sets the "track pixel height" for express style tracks
- expscaling : sets the "express scale" option for express style tracks
- strandless : sets the "strandless" option for express style tracks
- logscale : sets the "log scale" option for express style tracks
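The structural rules stated above (a `<zenbu_script>` outer tag, at most one `<track_defaults>` section, and the listed attribute names) can be checked with a small Python sketch using the standard `xml.etree.ElementTree` module. The function `check_script` is hypothetical and not part of ZENBU, the attribute set is simply the list above and may not be exhaustive, and the attribute value in the example is a placeholder.

```python
import xml.etree.ElementTree as ET

# Attribute names copied from the list above; ZENBU may accept others.
KNOWN_TRACK_DEFAULTS = {
    "source_outmode", "datatype", "glyphStyle", "scorecolor", "backColor",
    "hidezeroexps", "exptype", "height", "expscaling", "strandless", "logscale",
}


def check_script(xml_text: str) -> list:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    root = ET.fromstring(xml_text)
    if root.tag != "zenbu_script":
        problems.append(f"outer tag is <{root.tag}>, expected <zenbu_script>")
    defaults = root.findall("track_defaults")
    if len(defaults) > 1:
        problems.append("more than one <track_defaults> section")
    for section in defaults:
        for name in section.attrib:
            if name not in KNOWN_TRACK_DEFAULTS:
                problems.append(f"unknown track_defaults attribute: {name}")
    return problems


# The height value is a placeholder chosen for illustration only.
example = (
    "<zenbu_script>"
    "<stream_processing/>"
    '<track_defaults height="100"/>'
    "</zenbu_script>"
)
problems = check_script(example)
```

The check only covers what this page states; it says nothing about whether the module configuration inside `<stream_processing>` is valid.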
After a script has been created and is working as desired, it can be saved and shared with other users through the save script button in the track reconfigure panel.
Please check out each of the processing modules below. Every module's wiki page includes an example script of how that module can be used and shows the structure of the scripting XML language. Many module pages also contain a hyperlink to an active ZENBU view page as a live example of the script in action.
Processing is accomplished by chaining a series of processing modules (or plugins) together between the pooled data source and the visualization / data download output. In addition, some modules may provide for side-chaining additional data streams into the main signal-processing data stream. Side chains can be simple or complex chains of processing modules like in this case study
The processing modules can be broken down into several concept categories:
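The chaining idea can be sketched with Python generators, where each module consumes a stream and yields a new one. This is a conceptual illustration under stated assumptions, not the ZENBU implementation: the dictionary-based features and the two toy modules `scale_score` and `make_strandless` are hypothetical.

```python
from functools import reduce
from typing import Callable, Dict, Iterable, Iterator, List

Feature = Dict[str, object]  # hypothetical stand-in for a stream object
Module = Callable[[Iterable[Feature]], Iterator[Feature]]


def scale_score(factor: float) -> Module:
    """Toy module: multiply every feature score by a constant."""
    def run(stream: Iterable[Feature]) -> Iterator[Feature]:
        for feature in stream:
            yield {**feature, "score": feature["score"] * factor}
    return run


def make_strandless() -> Module:
    """Toy module: blank out the strand of every feature."""
    def run(stream: Iterable[Feature]) -> Iterator[Feature]:
        for feature in stream:
            yield {**feature, "strand": ""}
    return run


def chain(source: Iterable[Feature], modules: List[Module]) -> Iterator[Feature]:
    """Feed the source through each module in order, lazily."""
    return reduce(lambda stream, module: module(stream), modules, source)


source = [
    {"start": 10, "strand": "+", "score": 1.0},
    {"start": 20, "strand": "-", "score": 5.0},
]
result = list(chain(source, [scale_score(10.0), make_strandless()]))
```

Each module maps features one at a time and does not reorder them, so the sort order of the input stream is preserved through the whole chain.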
These modules provide access to additional data sources for use on side-streams
- Proxy: provide security-checked access to data sources loaded into ZENBU
- FeatureEmitter: create regular grids of features dynamically
Clustering and collation
These modules provide for high-level manipulations of data to reduce the number of features on the data stream by grouping them into related concepts.
- TemplateCluster: use side-chain-stream as template to collate expression.
- UniqueFeature: cluster and count features matching 'unique' criteria
- Paraclu: hierarchical clustering algorithm http://www.cbrc.jp/paraclu/
These modules remove data from the stream based on filtering criteria
- TemplateFilter: use side-chain-stream as mask to filter primary stream features
- CutoffFilter: filter features using simple cutoff filters (high pass, low pass, band pass)
- ExpressionDatatypeFilter: filter expression from features based on datatype
- FeatureLengthFilter: filter Features based on min/max length criteria
- TopHits: Filter neighborhood-regions based on best feature significance.
- NeighborCutoff: noise filtering relative to strongest signal within a neighborhood-region
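As a rough illustration of the high pass, low pass, and band pass cutoffs mentioned for CutoffFilter, the following Python sketch filters a stream of scored features. The tuple layout, the parameter names, and the choice of inclusive bounds are all assumptions; the module's actual options are described on its own wiki page.

```python
from typing import Iterable, Iterator, Optional, Tuple

ScoredFeature = Tuple[int, int, float]  # hypothetical (start, end, score)


def cutoff_filter(
    stream: Iterable[ScoredFeature],
    min_score: Optional[float] = None,
    max_score: Optional[float] = None,
) -> Iterator[ScoredFeature]:
    """Yield features whose score lies within the given inclusive bounds.

    Only min_score set: high pass. Only max_score set: low pass.
    Both set: band pass.
    """
    for feature in stream:
        score = feature[2]
        if min_score is not None and score < min_score:
            continue
        if max_score is not None and score > max_score:
            continue
        yield feature


stream = [(100, 150, 0.5), (200, 250, 3.0), (300, 350, 12.0)]
high_pass = list(cutoff_filter(stream, min_score=1.0))
low_pass = list(cutoff_filter(stream, max_score=5.0))
band_pass = list(cutoff_filter(stream, min_score=1.0, max_score=5.0))
```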
Data normalization and rescaling
These modules alter the expression in a stream based on normalization or rescaling algorithms
- NormalizeByFactor : Normalize expression with respect to metadata associated with the experiments
- NormalizePerMillion : Normalize expression with respect to the total expression of the associated experiments (stored as metadata at upload time)
- NormalizeRPKM : RPKM-based expression normalization
- RescalePseudoLog : pseudo-log transformation of expression value
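For orientation, the arithmetic behind per-million and RPKM normalization can be written out in Python. The formulas below are the standard definitions of these measures; whether the ZENBU modules compute exactly this is an assumption, and the function and parameter names are hypothetical.

```python
def per_million(count: float, total_count: float) -> float:
    """Counts per million: scale a count by the experiment total."""
    return count * 1e6 / total_count


def rpkm(count: float, length_bp: int, total_count: float) -> float:
    """Reads per kilobase of feature per million mapped reads."""
    return count * 1e9 / (length_bp * total_count)


# A feature of 2,000 bp with 50 reads in an experiment of 10 million reads.
cpm_value = per_million(50, 10_000_000)
rpkm_value = rpkm(50, 2000, 10_000_000)
```

With these inputs the per-million value is 5.0 and the RPKM value is 2.5, since dividing by the 2 kb feature length halves the per-million figure.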
Metadata manipulation
- OverlapAnnotate : Transfer metadata between overlapping Features
- MetadataFilter : Filter Features based on matching metadata
- MetadataManipulate : Alters the metadata of stream objects (features or sources).
- RenameExperiments : Create new Experiment name based on concatenating metadata
- FeatureRename : Rename the Feature to its FeatureSource name
General manipulation
These modules are general-purpose Lego blocks that manipulate objects on the stream so that the data is in the right format for the next module in the stream.
- CalcFeatureSignificance : Aggregate the associated expression values onto the score of a feature
- CalcInterSubfeatures : Stream the regions between the subfeatures of a parent feature (e.g. introns)
- StreamSubfeatures : Stream the sub-features rather than the parent feature
- FilterSubfeatures : Rebuild a feature/subfeature structure by filtering subfeatures
- ResizeFeatures : Alter the boundaries of a feature (shrink toward 5', 3', start and end)
- MakeStrandless : Alter the strand of a feature
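To make the idea behind CalcInterSubfeatures concrete, the following Python sketch derives the regions between the subfeatures of one parent feature, for example introns from exons. It is a minimal sketch, not the module's implementation: the function name is hypothetical, and 1-based inclusive coordinates are assumed.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # hypothetical (start, end), 1-based inclusive


def inter_subfeatures(subfeatures: List[Interval]) -> List[Interval]:
    """Return the gaps between consecutive subfeatures of one parent."""
    ordered = sorted(subfeatures)
    gaps = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        # Adjacent or overlapping subfeatures leave no region in between.
        if next_start > prev_end + 1:
            gaps.append((prev_end + 1, next_start - 1))
    return gaps


exons = [(100, 200), (301, 400), (500, 600)]
introns = inter_subfeatures(exons)
```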