Paraclu finds clusters in data attached to sequences. It was first applied to transcription start counts in genome sequences (see citation below), but it can be applied to any genomic signal (shortRNA, CAGE, RNAseq, ChIPSeq).
Paraclu is intended to explore the data, imposing minimal prior assumptions, and letting the data speak for itself. One consequence of this is that paraclu can find clusters within clusters. Real data sometimes exhibits clustering at multiple scales: there may be large, rarefied clusters; and within each large cluster there may be several small, dense clusters.
The ZENBU implementation reproduces the hierarchical clustering of the original paraclu, plus the paraclu-cut.sh filter/selection process. The main difference is that clusters longer than max_cluster_length are never reported. The ZENBU implementation also offers two additional selection/cut modes for picking a set of non-overlapping clusters out of the hierarchy.
Because the Paraclu algorithm was designed for signal-strength input at 1bp genome resolution, it is important to follow the script examples below, in which a 1bp-wide FeatureEmitter/TemplateCluster combination is placed before Paraclu.
- <min_cutoff> : clusters must have more than min_cutoff signal in order to be selected. If a cluster does not, Paraclu will select a larger cluster higher in the hierarchy which does have sufficient signal. Regions which are longer than max_cluster_length and have less than min_cutoff signal are discarded as background noise and not clustered.
- <max_cluster_length> : clusters longer than max_cluster_length are not output. Thus cluster regions longer than max_cluster_length are always sub-divided. Since ZENBU uses streaming buffers to implement paraclu, increasing max_cluster_length also affects the memory usage and performance of the algorithm. ZENBU buffers at least 8x max_cluster_length so that there is sufficient hierarchy above the output clusters to ensure correct results.
- <stability> : Paraclu is based on density of signal. Stability is the ratio of the density of a child cluster to the density of its most dense parent. Only children more dense than their parents are considered stable clusters. The stability parameter is only used in modes stability_cut and small_stable, and has a different effect in each mode. stability is always >= 1.0.
- <mode> : defines the "selection" mode, i.e. which layer of the full hierarchy of clusters to cut at.
- full_hierarchy : returns all nested clusters in the hierarchy above min_cutoff signal and below max_cluster_length. Ignores the "stability" parameter.
- stability_cut : the original paraclu-cut selection method based on walking down the hierarchy. Picks the largest stable cluster in the hierarchy above min_cutoff signal, below max_cluster_length, and with a child/parent density ratio greater than stability. Increasing the stability parameter above 1.0 will cause less stable clusters to be filtered out of the hierarchy.
- most_stable : a ZENBU variation of paraclu-cut. Uses the same full hierarchy, but selects the most stable child within each branch of the hierarchy tree. Ignores the "stability" parameter.
- small_stable : a ZENBU variation of paraclu-cut which chooses the smallest stable cluster in the hierarchy (walking up from the bottom) which is above min_cutoff signal, below max_cluster_length, and with a child/parent density ratio greater than stability. Lowering the stability parameter will choose smaller clusters (deeper children) in the hierarchy. Setting a large stability will not cause clusters to be filtered (unlike the original paraclu-cut and mode stability_cut), but instead will push the selection toward the largest cluster in the full_hierarchy.
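As a minimal sketch of how the parameters above fit together, the following Paraclu stage uses the most_stable mode. The numeric values are illustrative assumptions, not recommended settings, and the stage is assumed to follow a 1bp-wide FeatureEmitter/TemplateCluster stage as in the full examples.

```xml
<spstream module="Paraclu">
  <!-- select the most stable child within each branch of the hierarchy -->
  <mode>most_stable</mode>
  <!-- illustrative value: clusters need more than this much signal -->
  <min_cutoff>10</min_cutoff>
  <!-- kept for completeness; most_stable ignores the stability parameter -->
  <stability>1</stability>
  <!-- illustrative value: longer clusters are never reported -->
  <max_cluster_length>100</max_cluster_length>
</spstream>
```

Switching the mode element to full_hierarchy with the same values would instead return every nested cluster that passes min_cutoff and max_cluster_length.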
Paraclu shortRNA - putative novel miRNA
Example showing how Paraclu can be used with shortRNA RNAseq alignment data to identify potentially novel microRNA clusters.
This is a complex script which incorporates a FeatureEmitter / TemplateCluster expression histogram binning with de-novo clustering via Paraclu followed by several filtering steps including NeighborCutoff, CutoffFilter, and a final FeatureLengthFilter to remove very tiny clusters.
<zenbu_script>
  <stream_processing>
    <spstream module="TemplateCluster">
      <overlap_mode>area</overlap_mode>
      <expression_mode>sum</expression_mode>
      <side_stream>
        <spstream module="FeatureEmitter">
          <width>1</width>
          <fixed_grid>true</fixed_grid>
          <both_strands>true</both_strands>
        </spstream>
      </side_stream>
    </spstream>
    <spstream module="Paraclu">
      <mode>stability_cut</mode>
      <min_cutoff>10</min_cutoff>
      <stability>1</stability>
      <max_cluster_length>100</max_cluster_length>
    </spstream>
    <spstream module="CalcFeatureSignificance"/>
    <spstream module="NeighborCutoff">
      <ratio>300</ratio>
      <distance>100</distance>
    </spstream>
    <spstream module="CutoffFilter">
      <min_cutoff>100</min_cutoff>
    </spstream>
    <spstream module="FeatureLengthFilter">
      <max_length>50</max_length>
    </spstream>
  </stream_processing>
</zenbu_script>
Example ZENBU view using this script with shortRNA RNAseq alignment data to reveal potentially novel microRNA clusters. The example shows Paraclu clustering followed by different levels of post-filtering in different tracks.
Paraclu for ChIPSeq peak calling
In this example we use Paraclu as a simple peak-calling algorithm for ChIPSeq data.
<zenbu_script>
  <track_defaults source_outmode="full_feature" backcolor="" scorecolor="chakra" hidezeroexps="false" glyphStyle="thick-arrow"/>
  <stream_processing>
    <spstream module="TemplateCluster">
      <overlap_mode>area</overlap_mode>
      <ignore_strand>true</ignore_strand>
      <expression_mode>sum</expression_mode>
      <side_stream>
        <spstream module="FeatureEmitter">
          <width>1</width>
          <fixed_grid>true</fixed_grid>
          <both_strands>false</both_strands>
        </spstream>
      </side_stream>
    </spstream>
    <spstream module="CalcFeatureSignificance"/>
    <spstream module="Paraclu">
      <min_cutoff>50</min_cutoff>
      <stability>1.15</stability>
      <max_cluster_length>500</max_cluster_length>
      <mode>small_stable</mode>
    </spstream>
  </stream_processing>
</zenbu_script>
Example ZENBU view using this script with ChIPSeq alignment data to show the peak-calling capabilities of Paraclu. This view also demonstrates the different selection modes and the effect of different parameters on the clustering/peak-calling.
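One way to compare selection modes on the same data is to duplicate the track and change only the Paraclu stage. The fragment below is a sketch of such a variant: it keeps the min_cutoff, stability and max_cluster_length values from the script above and switches the mode to stability_cut. Whether these values suit a given dataset is an assumption to check in the view.

```xml
<spstream module="Paraclu">
  <min_cutoff>50</min_cutoff>
  <!-- in stability_cut mode, clusters with a child/parent density ratio
       at or below this value are filtered out -->
  <stability>1.15</stability>
  <max_cluster_length>500</max_cluster_length>
  <!-- picks the largest stable cluster, walking down the hierarchy -->
  <mode>stability_cut</mode>
</spstream>
```

With the same stability value, stability_cut is expected to favor larger clusters than small_stable, since it selects from the top of the hierarchy rather than from the bottom.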
A code for transcription initiation in mammalian genomes, MC Frith, E Valen, A Krogh, Y Hayashizaki, P Carninci, A Sandelin, Genome Research 2008 18(1):1-12