ShortRNA denovo clustering

From ZENBU documentation wiki
Revision as of 19:50, 28 November 2012 by Nicolas.bertin

ZENBU enables the dynamic creation of short RNA clusters, which allow researchers to delineate the structure and expression levels of short RNAs from the raw mappings of reads sequenced from short RNA fractions. To cluster the reads, we employ Paraclu.

Paraclu finds clusters in data attached to sequences. It was first applied to transcription start counts in genome sequences (see: A code for transcription initiation in mammalian genomes, MC Frith, E Valen, A Krogh, Y Hayashizaki, P Carninci, A Sandelin, Genome Research 2008 18(1):1-12). Paraclu is intended to explore the data, imposing minimal prior assumptions and letting the data speak for itself. One consequence of this is that Paraclu can find clusters within clusters. Real data sometimes exhibit clustering at multiple scales: there may be large, rarefied clusters, and within each large cluster there may be several small, dense clusters. The original Paraclu is a Perl script, which is available here. A newer C++ version, which works identically to the original but is much faster and copes with much larger data, can be retrieved here.

We have implemented a similar approach and adapted it to ZENBU's data streaming for very efficient local clustering of data attached to any sequence. For the specific case of mapped reads obtained from sequenced short RNA fractions, we provide tailored processing scripts. In this case study, we illustrate the basics of short RNA de novo clustering by exploring the sub-cellular and sub-nuclear short RNA fractions of K562 cells from ENCODE.

http://fantom.gsc.riken.jp/zenbu/gLyphs/#config=BEcqkfmQNxHoGp5vS-sRt;loc=hg19::chr19:54069805..54407320


Retrieving K562 cells sub-cellular and sub-nuclear short RNA fractions from ENCODE

DEX based datasource selection

The Data Explorer (DEX) tab lets us retrieve all the K562 sub-cellular and sub-nuclear short RNA fractions from ENCODE. We first select the "Expression experiments" sub-tab and search for wgEncodeCshlShortRnaSeqK562.
This retrieves 19 datasets containing the raw mappings of short (and not-so-short) RNA fractions from CSHL, extracted from various sub-cellular and sub-nuclear compartments (cytosol, nucleus, nucleolus, ...).

CaseStudy ShortRNAClustering DataSelection1.png

Track creation

We then create a single track from the union of all those sources (we will see later in this case study how this track can very easily be broken down) by selecting them all and clicking on the "build tracks" button located in the upper-right corner of the DEX page.

Doing so opens a "configure new track from data sources" dialogue box, which lets us specify how the data should be displayed.
We display the data as a histogram of the summed expression level rather than displaying each tag, and select "area" as the region rather than the default "5'end".
We also rename this track "Encode Cshl ShortRna K562".

CaseStudy ShortRNAClustering DataSelection2.png

After clicking on "accept config", we are back on the DEX page, and the upper-right panel now shows "1 tracks" ready to be displayed.

View creation

Clicking on "visualize" sends us to the gLyphs genome browser page, with one visible track displaying the pile-up of mapped reads from the K562 sub-cellular and sub-nuclear short RNA fractions. Let's go to the microRNA-rich region: hg19::chr19:54134256..54310581
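
The location string above appears to follow an assembly::chromosome:start..end pattern. As a minimal sketch, assuming that pattern and inclusive coordinates (the helper parse_location is hypothetical and not part of ZENBU), the string can be split into its parts to check the size of the region being viewed:

```python
import re

# Hypothetical helper, not part of ZENBU: splits a location string of the
# form "assembly::chrom:start..end", as seen in the URLs of this case study.
LOC_PATTERN = re.compile(
    r"^(?P<assembly>[^:]+)::(?P<chrom>[^:]+):(?P<start>\d+)\.\.(?P<end>\d+)$"
)

def parse_location(loc):
    """Return (assembly, chrom, start, end) parsed from a location string."""
    match = LOC_PATTERN.match(loc)
    if match is None:
        raise ValueError(f"unrecognized location string: {loc!r}")
    return (
        match["assembly"],
        match["chrom"],
        int(match["start"]),
        int(match["end"]),
    )

assembly, chrom, start, end = parse_location("hg19::chr19:54134256..54310581")
span = end - start + 1  # assumes both coordinates are inclusive
```

Under the inclusive-coordinate assumption, the region spans roughly 176 kb of chr19.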

Novomir.2.png

Clustering the mapped short RNA reads

We wish to identify known as well as novel short RNA species by clustering the mapped reads, and to visualize their expression levels in each of the sub-cellular and sub-nuclear fractions.
We duplicate the "Encode Cshl ShortRna K562" track and edit the duplicate by clicking on the yellow square icon.
We then reconfigure the duplicated track by clicking on the grey gear icon.

Mirbased exp.2.png Mirbased exp.3.png

From the "Stream Processing script" section, we select "predefined script" and search for "shortRNA".
We select the "shortRNA Paraclu Clustering" script, whose description reads: "works from primary mapping results, merge reads based on 1bp overlap, clusters with Paraclu (cluster length < 200 and expression level > 30 reads)"

Mirbased exp.4.png Mirbased exp.5.png


Because we are interested here in the span of the short RNA read clusters, we change the selected "gLyph style" to "medium arrow". In addition, ticking "color by score" colors each arrow according to the total number of tags in its cluster.

Novomir.3.png

http://fantom.gsc.riken.jp/zenbu/gLyphs/#config=BEcqkfmQNxHoGp5vS-sRt;loc=hg19::chr19:54069805..54407320

Fine-tuning the "ShortRNA Paraclu clustering" script default parameters

Comparing de novo clustering to known mir

We will first modify "ShortRNA Paraclu clustering.1" to separate the de novo clusters that are already known from the novel ones.
A text box lets us edit the processing of the streamed source data and take a quick look at the script.

The first step instructs ZENBU to map the signal onto a 1 bp grid, a necessary condition for the Paraclu module to operate optimally:

		<spstream module="TemplateCluster">
			<overlap_mode>area</overlap_mode>
			<expression_mode>sum</expression_mode>
			<side_stream>
				<spstream module="FeatureEmitter">
					<width>1</width>
					<fixed_grid>true</fixed_grid>
					<both_strands>true</both_strands>
				</spstream>
			</side_stream>
		</spstream>
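
To make the effect of this step concrete, here is a minimal Python sketch of what projecting reads onto a 1 bp, strand-aware grid with summed expression amounts to. It only illustrates the idea and is not ZENBU's implementation; the function name and the (start, end, strand, expression) tuple layout are assumptions made for the example.

```python
from collections import defaultdict

def bin_reads_1bp(reads):
    """Sum read expression over every 1 bp position a read covers, per strand.

    `reads` is an iterable of (start, end, strand, expression) tuples with
    inclusive coordinates; this layout is an assumption for illustration.
    """
    grid = defaultdict(float)
    for start, end, strand, expression in reads:
        # "area" overlap: the read contributes to every position it spans.
        for pos in range(start, end + 1):
            grid[(pos, strand)] += expression
    return dict(grid)

# Three hypothetical reads: two on the + strand, one on the - strand.
reads = [
    (100, 102, "+", 1.0),
    (101, 103, "+", 2.0),
    (101, 102, "-", 1.0),
]
grid = bin_reads_1bp(reads)
```

Positions covered by several reads accumulate their expression, and the two strands are kept apart, which is the kind of per-base signal a clustering step can then work from.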

Because we will retain only clusters with at least 30 tags, the second step, CalcFeatureSignificance, sums up the tag counts from all source experiments:

		<spstream module="CalcFeatureSignificance"/>
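
The reason for summing before thresholding can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the CalcFeatureSignificance code: the function name, the fraction labels used as keys, and the tag counts are all hypothetical.

```python
def feature_significance(expression_by_experiment):
    """Sum one feature's tag counts over all experiments."""
    return sum(expression_by_experiment.values())

# Hypothetical tag counts for a single position in three fractions.
counts = {"cytosol": 12, "nucleus": 10, "nucleolus": 9}

total = feature_significance(counts)
passes_individually = any(count >= 30 for count in counts.values())
passes_combined = total >= 30
```

In this example no single fraction reaches 30 tags, but the combined signal does, so a threshold applied to the summed value retains the locus.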

Finally, we call on Paraclu, instructing it to split clusters to the next most stable level if their length is greater than 200 bp and to retain only clusters with at least 30 reads:

		<spstream module="Paraclu">
			<min_cutoff>30</min_cutoff>
			<max_cluster_length>200</max_cluster_length>
		</spstream>
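
The following Python sketch illustrates how two such thresholds could be applied to a hierarchy of nested clusters. It does not reproduce the Paraclu algorithm or ZENBU's module: the Cluster class, the select_clusters function, the example coordinates, and the choice of inclusive boundaries (length <= 200, reads >= 30) are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Cluster:
    """A node in a hypothetical hierarchy of nested clusters."""
    start: int
    end: int  # inclusive
    reads: float
    children: List["Cluster"] = field(default_factory=list)

    @property
    def length(self):
        return self.end - self.start + 1

def select_clusters(cluster, min_cutoff=30, max_cluster_length=200):
    """Keep the largest nested clusters that satisfy both thresholds."""
    if cluster.reads < min_cutoff:
        # Sub-clusters hold a subset of these reads, so none can pass.
        return []
    if cluster.length <= max_cluster_length:
        return [cluster]
    # Too long: descend to the next level of the hierarchy.
    selected = []
    for child in cluster.children:
        selected.extend(select_clusters(child, min_cutoff, max_cluster_length))
    return selected

# Hypothetical 500 bp cluster containing three sub-clusters.
root = Cluster(1000, 1499, 120, children=[
    Cluster(1000, 1089, 70),
    Cluster(1300, 1340, 12),
    Cluster(1400, 1499, 38),
])
selected = select_clusters(root)
```

Here the 500 bp parent is too long and is replaced by its sub-clusters, of which only the two with at least 30 reads are kept.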

We wish to create two tracks using TemplateFilter: one with clusters overlapping known microRNAs, and one with novel loci.
For this purpose, we instruct the scripting system to declare an additional datasource containing known miRNAs, which we couple with TemplateFilter.
The featureSourceID of UCSC_sno-miRNAgene can easily be obtained through DEX by searching for miR and looking up the associated metadata.

Novomir.11.png

Let us first filter FOR known microRNAs.
The following two pieces of code need to be added to the current script:

  • Declaring an additional datasource with known mir
	<datastream name="mir">
		<source id="D71B7748-1450-4C62-92CB-7E913AB12899::10:::FeatureSource" name="UCSC_sno-miRNAgene_hg19_20120101"/>
	</datastream>
  • Using it to filter FOR known microRNAs amongst the de novo clustered short RNA signal
		<spstream module="TemplateFilter">
			<overlap_mode>area</overlap_mode>
			<overlap_subfeatures>false</overlap_subfeatures>
			<side_stream>
				<spstream module="Proxy" name="mir"/>
			</side_stream>
		</spstream>

Novomir.10.png Novomir.12.png

Let us now also create a track that filters AGAINST known microRNAs.
After duplicating the latest track we created, we edit its script to filter against known microRNAs with the very same approach, stating that we want TemplateFilter to act as a mask by setting inverse to true:

		<spstream module="TemplateFilter">
			<overlap_mode>area</overlap_mode>
			<overlap_subfeatures>false</overlap_subfeatures>
			<inverse>true</inverse>
			<side_stream>
				<spstream module="Proxy" name="mir"/>
			</side_stream>
		</spstream>
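
The difference between filtering FOR and filtering AGAINST known microRNAs comes down to keeping or discarding the clusters that overlap a template. A minimal Python sketch of that logic follows, assuming simple (start, end) inclusive intervals on a single chromosome and ignoring strand; the function names and coordinates are hypothetical and this is not the TemplateFilter implementation.

```python
def overlaps(a, b):
    """True if two inclusive (start, end) intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]

def template_filter(features, templates, inverse=False):
    """Keep features overlapping any template, or the others if inverse."""
    kept = []
    for feature in features:
        hit = any(overlaps(feature, template) for template in templates)
        if hit != inverse:
            kept.append(feature)
    return kept

# Hypothetical de novo clusters and known miRNA loci.
clusters = [(100, 160), (300, 350), (500, 590)]
known_mirnas = [(150, 220), (580, 600)]

known_track = template_filter(clusters, known_mirnas)
novel_track = template_filter(clusters, known_mirnas, inverse=True)
```

Every cluster ends up in exactly one of the two result lists, which mirrors the pair of tracks built in this section.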

Novomir.13.png Novomir.14.png

http://fantom.gsc.riken.jp/zenbu/gLyphs/#config=BEcqkfmQNxHoGp5vS-sRt;loc=hg19::chr19:54069805..54407320