From ZENBU documentation wiki



The Proxy is a special placeholder processing module designed to work in coordination with the <datastream> section of the ZENBU scripting system.

Each <datastream> has a name attribute and a pool of data sources, each given by a <source> tag. Each data source is identified by its ZENBU system id. Any other attributes on a <source> are ignored, but can be helpful to script writers as comments. Here is an example of a datastream pool of four RNA-seq experiments from the ENCODE project on HepG2 cells:

<datastream name="encode_wold_hepg2" output="skip_metadata" datatype="tagcount" >
   <source id="904A696A-62EC-4665-85B9-4F92DDFA9814::2:::Experiment" platform="RNA-seq"/>
   <source id="8EB257B8-6B26-4DB7-8470-07A708EC7CEF::2:::Experiment" platform="RNA-seq"/>
   <source id="A80763D0-F12C-449D-AFEA-288BEBE55C4A::2:::Experiment" platform="RNA-seq"/>
   <source id="74E98401-90F9-4FE4-B534-2AC4D3955753::2:::Experiment" platform="RNA-seq"/>
</datastream>

The matching Proxy for this datastream in a script would look like this:

  <spstream module="Proxy" name="encode_wold_hepg2"/>

Separating the data sources from the Proxy placeholders has several benefits:

  • commonly used <datastream> blocks are easy to copy and paste between different scripts.
  • the system can check that the current user is allowed access to the data sources defined in the <datastream> sections.
  • the pooled data sources can be reused in different sections of the same script by placing multiple Proxy modules with the same name.
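
Because each Proxy is tied to its pool only by name, a quick check that every Proxy name has a matching <datastream> can catch typos before a script is used. The following is a minimal sketch using Python's standard library, not part of ZENBU itself. The function name unmatched_proxies, the pool name missing_pool, and the <zenbu_script> root element (added so the fragment parses as one XML document) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical script text. The <zenbu_script> root element is an assumption
# made so that the fragment parses as a single XML document.
SCRIPT = """
<zenbu_script>
  <datastream name="encode_wold_hepg2" output="skip_metadata" datatype="tagcount">
    <source id="904A696A-62EC-4665-85B9-4F92DDFA9814::2:::Experiment" platform="RNA-seq"/>
  </datastream>
  <spstream module="Proxy" name="encode_wold_hepg2"/>
  <spstream module="Proxy" name="encode_wold_hepg2"/>
  <spstream module="Proxy" name="missing_pool"/>
</zenbu_script>
"""


def unmatched_proxies(script_xml):
    """Return the names of Proxy modules with no <datastream> of the same name."""
    root = ET.fromstring(script_xml)
    # Names of all datastream pools defined anywhere in the script.
    pools = {ds.get("name") for ds in root.iter("datastream")}
    # Names referenced by Proxy modules; the same name may appear many times.
    proxies = [sp.get("name") for sp in root.iter("spstream")
               if sp.get("module") == "Proxy"]
    return sorted({name for name in proxies if name not in pools})


print(unmatched_proxies(SCRIPT))
```

Here the pool encode_wold_hepg2 is reused by two Proxy modules, which is allowed, while the hypothetical missing_pool is reported because no <datastream> defines it.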

Datastream attributes

Proxy and <datastream> are a special module pairing in the ZENBU script system and use these attributes to control the nature of the data on the datastream.

  • name : name of the <datastream> which will be injected in place of this Proxy at query time. The Proxy name must match the name of a <datastream> definition for the Proxy to initialize correctly.
  • output : defines the level of data which will be provided on this datastream. By limiting the level of data loading, performance can be increased. Valid values are:
    • full_feature : Features are loaded with all available data: genome coordinates, name, subfeatures, expression and feature metadata.
    • simple_feature : Features are loaded with only genome coordinates and names. No subfeatures, expression or feature metadata. Default if not specified.
    • subfeature : Features are loaded with genome coordinates, name and subfeatures. No expression or metadata.
    • expression : Features are loaded with genome coordinates, name and expression. No subfeatures or metadata.
    • skip_metadata : Features are loaded with genome coordinates, name, subfeatures and expression. No metadata.
    • skip_expression : Features are loaded with genome coordinates, name, subfeatures and metadata. No expression.
  • datatype : defines the expression datatype for the datastream. If not specified, no expression will be available on the datastream.
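
The output levels listed above can be written down as a small lookup table, which makes it easier to pick the lightest level that still carries what a script needs. This is only an illustrative sketch in Python: the names OUTPUT_LEVELS, DEFAULT_OUTPUT and cheapest_level are hypothetical, and counting component types as a stand-in for loading cost is an assumption, not something ZENBU does.

```python
# Components loaded at each output level, transcribed from the list above.
OUTPUT_LEVELS = {
    "full_feature":    {"coordinates", "name", "subfeatures", "expression", "metadata"},
    "simple_feature":  {"coordinates", "name"},
    "subfeature":      {"coordinates", "name", "subfeatures"},
    "expression":      {"coordinates", "name", "expression"},
    "skip_metadata":   {"coordinates", "name", "subfeatures", "expression"},
    "skip_expression": {"coordinates", "name", "subfeatures", "metadata"},
}

# Level used when the output attribute is not given.
DEFAULT_OUTPUT = "simple_feature"


def cheapest_level(needed):
    """Return the level that loads everything in `needed` with the fewest components.

    Assumption: fewer component types means less data loaded. Coordinates and
    name are always included because every level loads them.
    """
    needed = set(needed) | {"coordinates", "name"}
    candidates = [(len(parts), level)
                  for level, parts in OUTPUT_LEVELS.items()
                  if needed <= parts]
    if not candidates:
        raise ValueError(f"no output level provides {sorted(needed)}")
    return min(candidates)[1]


print(cheapest_level({"expression"}))
print(cheapest_level({"subfeatures", "expression"}))
```

Remember that a level which includes expression only delivers it when the datatype attribute is also set on the <datastream>.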

Getting datastream xml definitions from tracks

The easiest way to get the XML for a Proxy Datastream pool is to use an already existing track configured with the desired data sources.

For example, below we can see a Gencode annotation track on top, which we want to use as a Proxy datastream to collate expression from the second track in order to create the third track.

[Figure: Gencode collation RNAseq.png]

First open the Track Reconfiguration panel (icon: Track controls-reconfigure track.jpg) and then select the datastream xml control. This brings up a pop-up panel with the XML datastream definition for this track, which can be copied and pasted into your script in another track. Note that in this interface the default name of the datastream is an arbitrary track number. It is best to rename the datastream pool to something easier to remember after copying it into your script.
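
If many copied definitions need renaming, the step can also be scripted. The sketch below uses Python's standard library on a copied <datastream> block. The function name rename_datastream and the default name track_12 are hypothetical, chosen only for illustration. Any Proxy that refers to the pool must be given the same new name.

```python
import xml.etree.ElementTree as ET


def rename_datastream(datastream_xml, new_name):
    """Return the <datastream> XML with its name attribute set to new_name."""
    elem = ET.fromstring(datastream_xml)
    if elem.tag != "datastream":
        raise ValueError("expected a <datastream> element")
    elem.set("name", new_name)
    return ET.tostring(elem, encoding="unicode")


# Hypothetical copied definition; "track_12" stands in for the arbitrary
# default name produced by the datastream xml panel.
copied = (
    '<datastream name="track_12" output="full_feature">'
    '<source id="D71B7748-1450-4C62-92CB-7E913AB12899::19:::FeatureSource"/>'
    '</datastream>'
)

renamed = rename_datastream(copied, "gencode")
print(renamed)
```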

[Figure: Datastream xml widget.png]


This is a script which incorporates a Proxy / TemplateCluster to collate expression into Gencode V10 gene models. The expression is then normalized via the NormalizeRPKM normalization module. The script finishes with CalcFeatureSignificance so that the Features can be displayed via score-coloring.

	<datastream name="gencode" output="full_feature">
		<source id="D71B7748-1450-4C62-92CB-7E913AB12899::19:::FeatureSource"/>
	</datastream>
		<spstream module="TemplateCluster">
				<spstream module="Proxy" name="gencode"/>
		</spstream>

		<spstream module="NormalizeRPKM"/>

		<spstream module="CalcFeatureSignificance"/>

Here is a ZENBU view showing this script in use