Clustering CAGE along predefined regions



In this case study we illustrate how to visualize CAGE-based expression (yellow background track) along predefined clusters (blue background track) and collate expression onto them (green background track). The example below uses the FANTOM4 CAGE data and the significant cluster boundaries generated by the FANTOM4 consortium in "The transcriptional network that controls growth arrest and differentiation in a human myeloid leukemia cell line", Nat Genet. 2009 May;41(5):553-62.
http://fantom.gsc.riken.jp/zenbu/gLyphs/#config=hXV1oi_zCdOjH3oPeHd2VC;loc=hg18::chr19:54860670..54861358
Clustering CAGE along predefined region.png

Creating a track with the expression datastream to be clustered

Using the Expression/Experiment tab of the Data Explorer interface, we select all CAGE experiments from FANTOM4: we search for the keyword FANTOM4, then filter the results further by platform (CAGE).

We will name this track "FANTOM4 CAGE" and render it as a wiggle plot.
http://fantom.gsc.riken.jp/zenbu/dex/#section=Experiments;search=CAGE%20FANTOM4

Selecting F4CAGE.1.png Selecting F4CAGE.2.png

We also select from the Annotation tab the FANTOM CAGE promoters L2 dataset to construct a second track.
http://fantom.gsc.riken.jp/zenbu/dex/#section=Annotation;search=%20FANTOM4

Selecting F4CAGE.3.png Selecting F4CAGE.4.png

Finally, we visualize these two tracks.
http://fantom.gsc.riken.jp/zenbu/gLyphs/#config=6BRs_cRFPlhiqufjip_ey;loc=hg18::chr19:54853066..54862509
Selecting F4CAGE.5.png




Modifying the track to collate expression along FANTOM4 L2 cluster boundaries

The script "FANTOM4 (hg18) CAGE promoter expression collation" collates an expression stream along the boundaries defined as valid CAGE promoter regions by the FANTOM4 consortium (Nat Genet. 2009 May;41(5):553-62). Users interested in applying the same approach to their own predefined set of regions can write their own TemplateCluster-based expression collation script.


Using the predefined "FANTOM4 (hg18) CAGE promoter expression collation" script

As a first step, let's duplicate the track (this is not strictly necessary, but it helps illustrate the outcome of the expression collation).
On the duplicated track, clicking the "configure track" (grey gear) icon opens the track reconfiguration panel.
The section "Stream Processing script" describes how the CAGE data stream is currently processed as a wiggle plot.
http://fantom.gsc.riken.jp/zenbu/gLyphs/#config=9fOol9rSBRjlHDB0jdcDkB;loc=hg18::chr19:54853066..54862509
Cluster collating F4CAGE.1.png Cluster collating F4CAGE.2.png

We change it so that the CAGE-based expression levels are collated along the FANTOM4 (hg18) CAGE promoter regions.
A script performing this operation is already stored in ZENBU and can be recalled by selecting "predefined script" in the drop-down menu.
The "predefined script" selection opens a search panel which queries ZENBU for existing scripts matching the search criteria.

Cluster collating F4CAGE.3.png Cluster collating F4CAGE.4.png

We select the processing script called "FANTOM4 (hg18) CAGE promoter expression collation".
To reflect the different processing we also modify the current title of the track and provide a quick description of the new content.
Note that the detailed content of this script can be brought up by editing it (See below).


http://fantom.gsc.riken.jp/zenbu/gLyphs/#config=zW_uLdNRF5FqL0Q5V3zYNC;loc=hg18::chr19:54853066..54862509
Cluster collating F4CAGE.5.png Cluster collating F4CAGE.6.png

Zooming in on the region :
http://fantom.gsc.riken.jp/zenbu/gLyphs/#config=zW_uLdNRF5FqL0Q5V3zYNC;loc=hg18::chr19:54860464..54861262
Clicking on one of the clusters displays its expression level across all the streamed-in experiments.
Cluster collating F4CAGE.7.png Cluster collating F4CAGE.8.png

Writing your own cluster/region based expression stream collation script

We provide an already written script which allows the exploration of any expression data stream arising from reads mapped onto hg18.
http://fantom.gsc.riken.jp/zenbu/dex/#section=Scripts;search=FANTOM4%20CAGE%20promoter%20expression%20collation

Below is the complete script, further detailed in the next paragraph of this page.

<zenbu_script>
	<note> track default visualization parameters : thick-arrow with color-coded total expression</note>
	<track_defaults source_outmode="skip_metadata" scorecolor="fire1" backColor="" hidezeroexps="true" glyphStyle="thick-arrow"/>

	<note> Lists the sources against which we will collate the expression </note>
	<datastream name="cluster" output="simple_feature">
		<source id="72DA22E8-B95F-48B8-B7E3-3698E820E331::48:::FeatureSource" category="L2_promoter" name="CAGE_L2_promoter_april2008"/>
	</datastream>

	<stream_processing>

		<note> Get the Transcriptional Start Sites (TSS) revealed by the 5'extremity of CAGE derived reads </note>
		<spstream module="ResizeFeatures">
			<mode>shrink_5prime</mode>
		</spstream>

		<note> Collate the CAGE TSS along regions defined by FANTOM4 L2 clusters </note>
		<spstream module="TemplateCluster">
			<ignore_strand value="false"/>
			<side_stream>
				<spstream module="Proxy" name="cluster"/>
			</side_stream>
		</spstream>

		<note> Sum up the expression over all samples and save the value as the refseq score to color it accordingly </note>
		<spstream module="CalcFeatureSignificance">
			<expression_mode>sum</expression_mode>
		</spstream>

	</stream_processing>
</zenbu_script>



Let's describe how the "FANTOM4 (hg18) CAGE promoter expression collation" script is constructed, so that you can easily write your own cluster/region-based expression stream collation script. The script uses the following processing modules :

  • The Proxy : a special placeholder processing module designed to work in coordination with the <datastream> section of the ZENBU scripting system. Each <datastream> has a name attribute and a pool of data sources with tag <source>. Each data source is defined by its ZENBU system id. The other attributes of each <source> are ignored, but can be helpful for script writers as comments.
  • The ResizeFeatures processing module : designed to work on Features to alter their genomic coordinates. The module will re-sort features on the data stream as needed to preserve the stream integrity. Its typical use case, illustrated here, is to shrink each feature to its 5' end and make it 1 bp wide (the Transcriptional Start Sites revealed by the 5' extremity of CAGE reads).
  • The TemplateCluster processing module : takes a stream of template features on a side stream, defined here by the Proxy, and performs overlap comparison against features on the primary data stream (here the CAGE data stream). When an overlap occurs, the expression of the primary-stream feature is collated onto the overlapping template feature, subject to this module's parameter settings (here ignore_strand is set to false).
  • The CalcFeatureSignificance processing module : designed to sit in the middle of a processing stream and transform the multiple Experiment / Expression data of a Feature into the single significance for that Feature.



Let's break the script down into its basic parts.

  • The first part of the zenbu script defines the default visualization parameters that will be used to render the processed track :
	<note> track default visualization parameters : thick-arrow with color-coded total expression</note>
	<track_defaults source_outmode="skip_metadata" scorecolor="fire1" backColor="" hidezeroexps="true" glyphStyle="thick-arrow"/>

The default rendering will be thick arrows, color-coded using the "fire1" scaling (grey -> yellow -> orange -> red).
Features with no expression are hidden automatically (hidezeroexps="true"), and the metadata associated with the regions onto which the expression stream is collated is not reported (source_outmode="skip_metadata").


  • The second part of the script defines the CAGE_L2_promoter regions onto which the expression stream will be collated :


Cluster collating F4CAGE.2.3.png

	<note> Lists the sources against which we will collate the expression </note>
	<datastream name="cluster" output="simple_feature">
		<source id="72DA22E8-B95F-48B8-B7E3-3698E820E331::48:::FeatureSource" category="L2_promoter" name="CAGE_L2_promoter_april2008"/>
	</datastream>

The datasource id(s) referencing the regions onto which the expression stream will be collated must be defined here.
The datastream is referred to later, in the processing directives, through a Proxy named "cluster".
The source id of a datastream can be obtained through DEX :
http://fantom.gsc.riken.jp/zenbu/dex/#section=Annotation;search=FANTOM4%20CAGE
Cluster collating F4CAGE.2.1.png
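
When writing your own script, an easy mistake is a Proxy whose name does not match any <datastream>. The following is a minimal Python sketch, not part of ZENBU, that checks this link offline using only the standard library. The function names unresolved_proxies and source_ids are hypothetical helpers introduced for illustration; the only assumption about the script format is what is visible in the example above (a name attribute on <datastream>, and module="Proxy" with a name attribute on <spstream>).

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the script above, keeping only the parts that link
# the <datastream> declaration to the Proxy that uses it.
SCRIPT = """
<zenbu_script>
  <datastream name="cluster" output="simple_feature">
    <source id="72DA22E8-B95F-48B8-B7E3-3698E820E331::48:::FeatureSource"/>
  </datastream>
  <stream_processing>
    <spstream module="TemplateCluster">
      <ignore_strand value="false"/>
      <side_stream>
        <spstream module="Proxy" name="cluster"/>
      </side_stream>
    </spstream>
  </stream_processing>
</zenbu_script>
"""


def unresolved_proxies(script_xml):
    """Return the Proxy names that have no matching <datastream name=...>."""
    root = ET.fromstring(script_xml.strip())
    declared = {ds.get("name") for ds in root.iter("datastream")}
    used = [sp.get("name") for sp in root.iter("spstream")
            if sp.get("module") == "Proxy"]
    return [name for name in used if name not in declared]


def source_ids(script_xml):
    """Map each datastream name to the list of source ids it declares."""
    root = ET.fromstring(script_xml.strip())
    return {ds.get("name"): [src.get("id") for src in ds.findall("source")]
            for ds in root.iter("datastream")}


if __name__ == "__main__":
    print("unresolved proxies:", unresolved_proxies(SCRIPT))
    print("declared sources:", source_ids(SCRIPT))
```

An empty list from unresolved_proxies means every Proxy points at a declared datastream. This only checks the internal consistency of the XML; it does not verify that the source id exists in ZENBU.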


  • The third part of the script defines how the expression stream is processed :
	<stream_processing>
		...
	</stream_processing>

First, the stream is modified so that only the 5' end of each streamed feature (i.e. the TSS associated with each CAGE read) is considered: the data is "resized" to extract only the 5' extremity of the reads.

		<note> Get the Transcriptional Start Sites (TSS) revealed by the 5'extremity of CAGE derived reads </note>
		<spstream module="ResizeFeatures">
			<mode>shrink_5prime</mode>
		</spstream>
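
To make the effect of shrink_5prime concrete, here is a conceptual Python sketch of the operation as described above. It is not ZENBU's implementation. The Feature type, the function names and the read coordinates are hypothetical, and coordinates are assumed to be 1-based and inclusive. The 5' end of a read is its start on the + strand and its end on the - strand, so the order of the features can change after resizing, which is why the stream is re-sorted.

```python
from typing import List, NamedTuple


class Feature(NamedTuple):
    chrom: str
    start: int   # 1-based, inclusive (assumption made for this sketch)
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def shrink_5prime(feature: Feature) -> Feature:
    """Reduce a feature to the single base at its 5' end."""
    pos = feature.start if feature.strand == "+" else feature.end
    return feature._replace(start=pos, end=pos)


def shrink_stream(features: List[Feature]) -> List[Feature]:
    """Shrink every feature, then re-sort so the stream stays ordered."""
    shrunk = (shrink_5prime(f) for f in features)
    return sorted(shrunk, key=lambda f: (f.chrom, f.start))


if __name__ == "__main__":
    # Hypothetical reads, already sorted by start position.
    reads = [
        Feature("chr19", 54860700, 54860726, "+"),
        Feature("chr19", 54860705, 54860731, "-"),
        Feature("chr19", 54860710, 54860736, "+"),
    ]
    for tss in shrink_stream(reads):
        print(tss)
```

In this example the minus-strand read moves from second to last position once it is reduced to its 5' end, which illustrates why the module has to re-sort.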

Second, this stream, resized to the 5' ends, is intersected with the side stream defined by the datastream above. To do so we use the TemplateCluster module together with the Proxy module, which is referred to by its name :

		<note> Collate the CAGE TSS along regions defined by FANTOM4 L2 clusters </note>
		<spstream module="TemplateCluster">
			<ignore_strand value="false"/>
			<side_stream>
				<spstream module="Proxy" name="cluster"/>
			</side_stream>
		</spstream>
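
The collation step can be pictured with the following conceptual Python sketch. It is a naive illustration of the idea (add up the expression of every TSS that falls inside a template region, per experiment), not ZENBU's actual algorithm, and the exact overlap rules of TemplateCluster may differ. The function name collate, the region names, the coordinates and the experiment names are all hypothetical. The ignore_strand argument is assumed to mean that, when false, a TSS only counts towards a region on the same strand.

```python
from collections import defaultdict


def collate(tss_list, templates, ignore_strand=False):
    """Sum expression per experiment for each template region.

    tss_list:  tuples (chrom, pos, strand, experiment, value)
    templates: tuples (name, chrom, start, end, strand), inclusive coordinates
    """
    totals = {name: defaultdict(float) for name, *_ in templates}
    for chrom, pos, strand, experiment, value in tss_list:
        for name, t_chrom, t_start, t_end, t_strand in templates:
            strand_ok = ignore_strand or strand == t_strand
            if chrom == t_chrom and t_start <= pos <= t_end and strand_ok:
                totals[name][experiment] += value
    return {name: dict(per_exp) for name, per_exp in totals.items()}


if __name__ == "__main__":
    # Hypothetical template regions and 1 bp TSS features.
    templates = [
        ("L2_a", "chr19", 100, 150, "+"),
        ("L2_b", "chr19", 140, 200, "-"),
    ]
    tss_list = [
        ("chr19", 120, "+", "exp1", 3.0),
        ("chr19", 145, "+", "exp2", 2.0),
        ("chr19", 145, "-", "exp1", 1.0),
        ("chr19", 300, "+", "exp1", 5.0),  # outside every region: dropped
    ]
    print(collate(tss_list, templates))
    print(collate(tss_list, templates, ignore_strand=True))
```

With ignore_strand left at false, the two TSS at position 145 go to different regions because they lie on opposite strands; with ignore_strand set to true, both count towards both overlapping regions.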
  • Finally, we modify the score of the cluster to be the sum of the expression over all experiments in the input stream.

This score, computed by CalcFeatureSignificance, is then used to color-code the relative intensity of the signal.

                 
		<note> Sum up the expression over all samples and save the value as the refseq score to color it accordingly </note>
		<spstream module="CalcFeatureSignificance">
			<expression_mode>sum</expression_mode>
		</spstream>
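
As a last illustration, the "sum" mode amounts to adding up the per-experiment expression of each cluster into one score. The Python sketch below shows that arithmetic only; the function name sum_significance, the cluster names and the values are hypothetical, and how ZENBU maps the resulting score onto the "fire1" color scale is not modeled here.

```python
def sum_significance(expression_by_experiment):
    """Collapse per-experiment expression values into a single score."""
    return sum(expression_by_experiment.values())


if __name__ == "__main__":
    # Hypothetical collated clusters: expression per experiment.
    clusters = {
        "L2_a": {"exp1": 3.0, "exp2": 2.0},
        "L2_b": {"exp1": 1.0},
        "L2_c": {},  # no overlapping TSS in any experiment
    }
    scores = {name: sum_significance(exps) for name, exps in clusters.items()}
    for name, score in scores.items():
        print(name, score)
```

A cluster with no expression in any experiment gets a score of zero, which corresponds to the features that the track defaults above hide.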