Data Stream Pool

From ZENBU documentation wiki

The ZENBU system allows for the dynamic creation of merged virtual Data Sources referred to as "data stream pools". This provides for a great deal of flexibility both in terms of data loading and data processing. With data pooling, there is no need to load new data every time a different "mix" is needed when configuring ZENBU tracks. One can simply use the data already loaded in the ZENBU system and create a new virtual DataSource mix.

Data pooling can be applied either to a mix of annotation FeatureSources or to a mix of expression Experiment data sources. In both cases the pool is a union, or merging, of the data from its member sources.

The main advantage of data pooling is for data processing and analysis. It becomes possible to pool many experiments from many samples or across replicates for differential expression visualization and analysis. With the data download capabilities, these processed data pools can also be exported into statistical systems like R and Bioconductor for more advanced analysis. It is also easy to create merged annotation datasets without requiring a special upload. For example, it is possible to create a merged data pool of all gene models (gencode, refseq, ensembl, ucsc known gene) in a single track.

Data pooling can also help with data organization and the data loading process. Since ZENBU can merge data on demand, it becomes possible to organize data prior to upload at a more atomic level. For example, we can keep each sequencing replicate of a sample in a separate BAM file and allow ZENBU to create the virtual mix of all data from the same sample. This makes it possible to load the data files as they exist, rather than requiring pre-processing of the data prior to loading. It also gives the user great flexibility in creating new groupings of data after the data has been loaded, even if the grouping was not in the original experimental design.
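The grouping idea can be sketched in a few lines of Python. This is only an illustration of the concept, not ZENBU code: the file names, the metadata fields, and the pool_by helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical data sources: one entry per uploaded replicate BAM file.
# File names and metadata fields are invented for illustration.
replicate_sources = [
    {"file": "sampleA_rep1.bam", "sample": "sampleA", "batch": "run1"},
    {"file": "sampleA_rep2.bam", "sample": "sampleA", "batch": "run2"},
    {"file": "sampleB_rep1.bam", "sample": "sampleB", "batch": "run1"},
    {"file": "sampleB_rep2.bam", "sample": "sampleB", "batch": "run2"},
]


def pool_by(sources, key):
    """Group data sources into virtual pools keyed by one metadata field."""
    pools = defaultdict(list)
    for src in sources:
        pools[src[key]].append(src["file"])
    return dict(pools)


# Pool replicates of the same sample, as in the BAM example above.
sample_pools = pool_by(replicate_sources, "sample")

# A different grouping over the same loaded files, with no re-upload.
batch_pools = pool_by(replicate_sources, "batch")
```

Because the files stay atomic, a new grouping is only a new selection over sources that are already loaded.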


To better illustrate the concept of data pooling, we present several examples.

Pooling annotation FeatureSources - different mixes of repeat sets

There are many different classes of repeats in the genome. Sometimes we need to work with a specific repeat class, sometimes we want to work with a broader group of repeat classes, and sometimes we don't care about the class and are only concerned with whether any repeat is present. Data pooling works very well here.

For example, we have loaded the mouse mm9 repeatmasker data from UCSC, where each class of repeat is mapped into a different annotation FeatureSource;asm=mm9;search=repeat

In this example we have created three different tracks with different data pooling of the repeat annotation FeatureSources.

  1. repeat track with a mix of all RNA-based repeats (RNA, rRNA, scRNA, snRNA, srpRNA, tRNA)
  2. repeat track with a mix of only LINE and SINE repeats
  3. repeat track with a mix of all 16 different repeat classes
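As a rough sketch of what such a mix amounts to, the following Python merges the sorted feature streams of a chosen set of repeat classes into one stream. The coordinates and feature names are made up, pool_stream is a hypothetical helper, and nothing here reflects how ZENBU implements pooling internally.

```python
import heapq

# Made-up features per FeatureSource: (start, end, name), sorted by start.
feature_sources = {
    "LINE": [(100, 400, "line_a"), (900, 1500, "line_b")],
    "SINE": [(250, 380, "sine_a"), (1600, 1750, "sine_b")],
    "tRNA": [(500, 572, "trna_a")],
    "rRNA": [(2000, 2150, "rrna_a")],
}


def pool_stream(sources, selected):
    """Lazily merge the sorted feature streams of the selected classes."""
    streams = [
        [(start, end, name, cls) for start, end, name in sources[cls]]
        for cls in selected
    ]
    # heapq.merge keeps the combined stream ordered by start coordinate.
    return heapq.merge(*streams, key=lambda feature: feature[0])


rna_pool = list(pool_stream(feature_sources, ["rRNA", "tRNA"]))
line_sine_pool = list(pool_stream(feature_sources, ["LINE", "SINE"]))
all_pool = list(pool_stream(feature_sources, feature_sources.keys()))
```

Each pool is only a different selection over the same loaded sources, which is why the three tracks need no extra upload.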


link to genome browser view referenced above;loc=mm9::chr10:20835721..20889803
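The loc value in the link above follows the pattern assembly::chromosome:start..end. A minimal Python sketch for splitting such a string into its parts is shown here; the pattern is inferred only from the links on this page, and parse_location is a hypothetical helper, not part of ZENBU.

```python
import re

# Pattern inferred from the loc values on this page (an assumption):
#   <assembly>::<chromosome>:<start>..<end>
LOC_PATTERN = re.compile(
    r"^(?P<assembly>[^:]+)::(?P<chrom>[^:]+):(?P<start>\d+)\.\.(?P<end>\d+)$"
)


def parse_location(loc):
    """Split a location string into (assembly, chrom, start, end)."""
    match = LOC_PATTERN.match(loc)
    if match is None:
        raise ValueError(f"unrecognized location string: {loc!r}")
    return (
        match["assembly"],
        match["chrom"],
        int(match["start"]),
        int(match["end"]),
    )


region = parse_location("mm9::chr10:20835721..20889803")
```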

Pooling expression Experiments - FANTOM3 differential promoter expression

Differential expression is one of the key aspects when studying RNA. RNA is inherently expressed at different levels in different tissues and samples. One available RNA expression technique is CAGE, which not only records the expression level of RNAs but also identifies the location of each RNA's 5' end on the genome, which is interpreted as the "transcription start site" of the RNA. In the FANTOM3 project, 465 different mouse samples were analyzed with CAGE and sequenced. In this example we create three different tracks with different mixes of these 465 sample Experiments.

In the following view we have created three different expression signal-based tracks with different data pooling of the FANTOM3 CAGE expression Experiments.

  1. expression signal-based track with a virtual mix of all 465 FANTOM3 CAGE expression samples. In the view on the GSN gelsolin gene we can see two distinct CAGE expression peaks, which correspond to different transcription start sites and thus to expression of different splicing isoforms of the GSN gene.
  2. expression signal-based track with only the 26 blood-related FANTOM3 CAGE samples. In this track we can see that the blood-related samples exclusively express the leftmost CAGE transcription start site. By comparing to the known annotation tracks, we can see there is a long Ensembl transcript/gene (ENSMUST00000113016gene) which aligns perfectly with this CAGE peak. It is thus easy to infer that this splicing form is the one expressed in the blood.
  3. expression signal-based track with only the 28 lung-related FANTOM3 CAGE samples. In this track we can see that the lung-related samples exclusively express the rightmost CAGE transcription start site and thus the main splicing isoform of the GSN gene.
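The effect of the three mixes can be imitated with a small Python sketch that sums tag counts over the selected experiments. All names and counts are invented for illustration and are not FANTOM3 data. Summing is only one possible way to combine experiments and is an assumption here, and pooled_signal is a hypothetical helper.

```python
from collections import Counter

# Invented experiments: a tissue label plus tag counts at two
# transcription start sites. These numbers are NOT FANTOM3 data.
experiments = [
    {"name": "blood_1", "tissue": "blood", "counts": {"tss_left": 12, "tss_right": 0}},
    {"name": "blood_2", "tissue": "blood", "counts": {"tss_left": 7, "tss_right": 0}},
    {"name": "lung_1", "tissue": "lung", "counts": {"tss_left": 0, "tss_right": 20}},
    {"name": "lung_2", "tissue": "lung", "counts": {"tss_left": 0, "tss_right": 9}},
]


def pooled_signal(exps, tissue=None):
    """Sum tag counts over the selected experiments (all if tissue is None)."""
    total = Counter()
    for exp in exps:
        if tissue is None or exp["tissue"] == tissue:
            total.update(exp["counts"])  # adds counts position by position
    return dict(total)


all_signal = pooled_signal(experiments)
blood_signal = pooled_signal(experiments, tissue="blood")
lung_signal = pooled_signal(experiments, tissue="lung")
```

With these toy numbers the pool of all experiments shows signal at both start sites, while each tissue pool shows signal at only one, which mirrors the pattern described for the three tracks.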


link to genome browser view referenced above;loc=mm9::chr2:35103454..35167609