Repeat-associated Transcription Start Sites
Case study on extracting the sub-cellular compartment-specific expression of repetitive elements from the ENCODE K562 cell line analyzed by CAGE.
- 1 Finding the relevant Expression and Annotation data in DEX
- 2 Extracting Transcriptional Start Sites (TSS) originating from repeats
- 3 TSS clustering (TC)
- 4 Extracting Transcriptional Start Site Clusters (TC) overlapping repeats
- 5 Downloading the results for local post processing
Finding the relevant Expression and Annotation data in DEX
Sub-cellular compartment specific expression from ENCODE K562 cell line analyzed by CAGE
In order to find all the CAGE experiments related to the K562 cell line in ENCODE, we simply go to the "DEX - expression/experiment" tab and type "K562" in the upper right search box. The lower drop-down menu allows us to further inspect all the "experiment platforms" under which K562-related experiments have been stored and to filter the results of the "K562" query for a specific experimental platform. Since we are interested in exploring the relation between transposable elements and transcription initiation, we limit the retrieved results to "CAGE" as a platform.
ZENBU stores mapped CAGE tags, whose 5' end represents by design a transcription initiation event (for a review of the CAGE technology, we refer the reader to "Cap-Analysis Gene Expression (CAGE) - the Science of Decoding Gene Transcription" by Piero Carninci and the references therein).
ZENBU offers the possibility to display this data as a histogram of the 5' ends of CAGE tags by selecting:
- glyph style "express" (simple wiggle-plot like histogram binning)
- region : "5'end" (only the very first base at the 5' end will be graphed)
- binning : "sum", which will sum up overlapping 5' ends from tags originating from all the selected samples
In addition we will display values as raw tagcount. Note that we could also have selected tagcount_pm, which normalizes the expression of each library by dividing the expression seen at a given TSS by the total expression level across the genome.
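As a rough illustration of this normalization, the sketch below assumes that the "_pm" suffix stands for "per million", so that each raw count is divided by the total tag count of its library and scaled by one million. The function name and the sample counts are hypothetical, and this is not ZENBU's own code.

```python
def tags_per_million(raw_counts):
    """Scale the raw tag counts of one library so that they sum to one million.

    Assumption: "per million" scaling; the exact ZENBU definition may differ.
    """
    total = sum(raw_counts.values())
    if total == 0:
        # An empty library has no meaningful normalized expression.
        return {position: 0.0 for position in raw_counts}
    return {position: count * 1_000_000 / total
            for position, count in raw_counts.items()}


# Hypothetical library: raw 5'-end tag counts keyed by (chromosome, position, strand).
library = {
    ("chr19", 50161300, "+"): 30,
    ("chr19", 50165000, "-"): 10,
    ("chr19", 50170000, "+"): 60,
}
normalized = tags_per_million(library)
```

Normalized values of this kind make libraries of different sequencing depth comparable, whereas raw tag counts keep the original evidence for each TSS.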
Repetitive elements annotation data
Similarly, we can search DEX for annotation corresponding to the repetitive elements. The most common source for such data is the outcome of a genome-wide analysis of repeats using software such as RepeatMasker and repeat definitions such as RepBase.
If searching DEX for the latest RepeatMasker-based mapping of repetitive elements does not retrieve anything, we can upload this data ourselves. The simplest source of such mapped data is UCSC.
Downloading the UCSC rmsk MySQL table dump
A full table dump of the content of a UCSC track can be obtained from the UCSC table browser.
The full data content can be seen by clicking "describe table schema".
The RepeatMasker (rmsk) track can be exported as a tab delimited file by selecting:
- the assembly "Feb.2009 GRCh37/hg19"
- the group "repeats and variations"
- the track "RepeatMasker"
- and finally the table "rmsk"
As we want the complete genome-wide set of repetitive elements to be loaded into ZENBU, we select:
- region: "genome"
We want the complete RepeatMasker track table dump, so we further select:
- output format: "all fields from selected table"
- output file: we will name the file "UCSC_rmsk.hg19.table.gz"
- file type returned: "gzip compressed"
This will allow us to download the complete available data locally as a tab delimited text file.
Creating a custom OSCheader
In order to load the data as an OSC table, we need to prepend an OSCheader to the tab delimited "UCSC_rmsk.hg19.table.gz" file that we have just retrieved from the UCSC table dump.
Generic wrapping of standard formats (BED, GFF, ...) is described in the ZENBU interpretation of OSCtable files wrapping section.
In this case the UCSC table dump does not correspond to any of those generic formats. A quick look at the content of the table schema (see screenshot above) tells us that the simplest OSCheader, defining each column in terms understood by the OSCtable parser, will be:
- genoName -> chrom
- genoStart -> start.0base (all start coordinates in UCSC database are 0-based)
- genoEnd -> end
If we want the repeat family to be the primary name of the feature in ZENBU, we then modify:
- repFamily -> name
In addition, we may wish to ignore the columns "bin" and "id", which are internal to UCSC, by adding the prefix "ignore." to those columns:
- bin -> ignore.bin
- id -> ignore.id
To do so, you can edit the file with your favorite text editor and modify the first line (or add a new first line, since lines starting with # will be ignored).
Here is an overview of the first lines of the thus modified file.
For the UNIX savvy, this can easily be done with the following simple commands:
# Build the OSCheader from the first line of the UCSC table dump
zcat UCSC_rmsk.hg19.table.gz | head -n 1 \
  | sed -e 's/^#//' \
  | sed -e 's/bin/ignore.bin/' \
  | sed -e 's/genoName/chrom/' \
  | sed -e 's/genoStart/start.0base/' \
  | sed -e 's/genoEnd/end/' \
  | sed -e 's/repFamily/name/' \
  | sed -e 's/id/ignore.id/' \
  | gzip -c > UCSC_rmsk.hg19.table.oscheader.gz
# Prepend the OSCheader to the table and recompress the result
# (the original "#"-prefixed header line is kept but ignored by the parser)
zcat UCSC_rmsk.hg19.table.oscheader.gz UCSC_rmsk.hg19.table.gz \
  | gzip -c > UCSC_rmsk.hg19.table.osc.gz
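For readers who prefer not to rely on sed, the same column renaming can be sketched in Python. Unlike a substring substitution, this version matches whole column names, so a name such as "id" cannot accidentally be replaced inside another column name. The function and variable names are hypothetical, and the example header line is assumed to match the rmsk table schema shown by the UCSC table browser.

```python
# Mapping from UCSC rmsk column names to the names expected by the OSCtable parser.
RENAME = {
    "bin": "ignore.bin",
    "genoName": "chrom",
    "genoStart": "start.0base",  # UCSC start coordinates are 0-based
    "genoEnd": "end",
    "repFamily": "name",
    "id": "ignore.id",
}


def to_osc_header(ucsc_header_line):
    """Turn the '#'-prefixed UCSC header line into an OSCheader line."""
    columns = ucsc_header_line.rstrip("\n").lstrip("#").split("\t")
    # Columns that are not in RENAME are kept unchanged.
    return "\t".join(RENAME.get(column, column) for column in columns)


# Header line assumed to be the first line of the rmsk table dump.
ucsc_header = "#" + "\t".join([
    "bin", "swScore", "milliDiv", "milliDel", "milliIns",
    "genoName", "genoStart", "genoEnd", "genoLeft", "strand",
    "repName", "repClass", "repFamily", "repStart", "repEnd", "repLeft", "id",
])
osc_header = to_osc_header(ucsc_header)
```

The resulting line can then be written in front of the table dump in the same way as the gzipped header file produced by the shell commands.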
Uploading the UCSC rmsk OSCtable
In order to load annotation or expression/experiment data into ZENBU, we need to be logged in as a ZENBU user (as uploaded files need to have an owner).
Clicking on the "User" tab of the ZENBU interface brings us to the "user profile" page if we are already logged into ZENBU, or to the log-in page otherwise. The "Data Upload" tab provides us with the interface for file uploading.
As we have named our file "UCSC_rmsk.hg19.table.osc.gz", ZENBU automatically recognizes that this is an OSCtable formatted file.
This data does not contain relevant expression information; therefore, we leave both check-boxes "single-best-mapping expression" unchecked.
Once the file is uploaded, the "my data" section shows the uploaded OSCtable file and offers us the possibility to share it with collaborators.
Extracting Transcriptional Start Sites (TSS) originating from repeats
First overview of the CAGE signal
ZENBU stores mapped CAGE tags, whose 5' end represents by design a transcription initiation event. The data is rendered in this track as a simple wiggle-plot like histogram, binned by summing up overlapping 5' ends from tags originating from all the selected samples (middle top part of the figure below).
In addition, the bottom glyph tab provides a summary of the expression level over the displayed region, split up by sample. Selecting a given area will trigger the display of this per-sample expression summary for the selected area.
Adding Entrez gene as convenient landmarks
It is often convenient to display the location of genes along the observed signal, in order to 1) get an idea of scale; 2) check, in the case of CAGE, that the signal, which corresponds to transcription initiation events, co-localizes as expected with the start positions of known genes and transcripts; and 3) search for a given gene name using the search box located on top of the tracks view.
The bottom glyph panel allows tracks to be added either by searching for already stored tracks or by building one, similarly to what the DEX interface allows but within the context of the glyph tab interface. We choose here to exemplify the former: clicking on "add predefined track" opens a search box that allows us to find any predefined track of interest by searching its name or associated metadata. An example of the latter can be seen in the section Adding RepeatMasker annotations.
Keeping track of the work in progress / saving views
ZENBU allows researchers to save the current view for personal reference, and also to share observations with collaborators, by clicking on "save configuration".
A saved configuration can be given a name and a description, whose content is parsed to offer easy retrieval from the DEX or glyph interfaces. The URL of the configuration can also be used to share the data, the config id being a unique string identifying the track content and processing: http://fantom.gsc.riken.jp/zenbu/gLyphs/#config=lcptFGiOAFLhJKEmM_JCBD;loc=hg19::chr19:50161252..50170707
A first approach : TemplateCluster-based strategy
As a first approach, we will filter and report TSS located within the boundaries of repeat elements and export the results grouped by individual repeat element. This output can easily be used to produce summaries of the transcription initiation events arising from transposable elements, grouped by their family or class. In the section Extracting TSS Clusters overlapping repeats we will detail an alternative strategy that allows for counting tags that, albeit not initiated strictly within repeats, can be viewed as belonging to the same transcription initiation events.
Adding RepeatMasker annotations
In the previous section we uploaded into ZENBU the RepeatMasker track retrieved from UCSC. We now wish to add this track to the view, which can be done by using "configure new track".
We simply search for an annotation containing the word "repeat" in its name or associated metadata and select the data source of interest among the matching ones. Note that we could also have narrowed the search further by adding keywords, or restricted the search to data that we (as a registered user) have uploaded.
We decide to display the repeats as thick arrows, which will allow us to view their name along with their orientation (visually encoded by the default color, green on the plus strand and purple on the minus strand, and by the direction in which the arrowhead points).
Notice how the name of the configuration has now been prepended with the word "temporary", signaling that the latest saved configuration and the current one differ (also see the new config id). At any point in time, users can browse the content of this configuration using the newly provided URL, but the configuration is not yet stored in DEX in a way that would allow for its retrieval using its name or keywords extracted from its description or provided metadata.
Obtaining TSS colocated within repeats
In the next step we wish to display and retrieve solely the TSS initiated within repeat elements. To do so, we will use the dynamic processing capabilities of ZENBU.
- First, the signal we wish to analyse is the transcription initiation as measured by CAGE in the various samples and conditions we have selected. Therefore we simply duplicate the CAGE track and apply the processing to this duplicated track, thus keeping the original repeat and TSS wiggle tracks in view. Duplication is done simply by clicking on the appropriate orange square icon located in the upper right corner of every track.
- Second, we edit the duplicated track by clicking on the little grey gear icon, also located in the upper right corner of every track. This opens up an edit panel.
- Third, we proceed to create a custom script that will call on TemplateCluster to merge the CAGE signal and the repeat elements.
We select custom script from the edit drop-down menu and enter the following script:
<zenbu_script>
  <parameters>
    <source_outmode>skip_metadata</source_outmode>
    <skip_default_expression_binning>true</skip_default_expression_binning>
  </parameters>
  <datastream name="rmsk">
    <source id="44C2532A-2922-4AAD-9397-0D855B943256::1:::FeatureSource" category="" name="RepeatMasker (UCSC rmsk complete table dump)"/>
  </datastream>
  <stream_stack>
    <spstream module="TemplateCluster">
      <overlap_mode>5end</overlap_mode>
      <expression_mode>sum</expression_mode>
      <overlap_subfeatures>true</overlap_subfeatures>
      <side_stream>
        <spstream module="Proxy" name="rmsk"/>
      </side_stream>
    </spstream>
  </stream_stack>
</zenbu_script>
The content and organization of processing scripts are described in detail in the DataProcessing wiki pages.
In this script, the key points that deserve particular attention for this case study are the following lines:
datastream / side_stream
<datastream name="rmsk">
  <source id="..."/>
</datastream>
<side_stream>
  <spstream module="Proxy" name="rmsk"/>
</side_stream>
This specifies the repeat data source that we loaded previously as the Proxy used as a template to consolidate the input signal (here the CAGE TSS track we have just duplicated). The source id (UUID) of a particular annotation set can be obtained by looking up the metadata associated with the source in DEX.
TemplateCluster module / side_stream
<spstream module="TemplateCluster">
  <overlap_mode>5end</overlap_mode>
  <expression_mode>sum</expression_mode>
  <overlap_subfeatures>true</overlap_subfeatures>
  <side_stream>
    <spstream module="Proxy" name="rmsk"/>
  </side_stream>
</spstream>
This specifies how the signal will be processed: the overlap is tested between the 5' end of each feature from the main source stream (the CAGE signal) and the template, expression is summed, etc.
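To make the idea concrete, the following Python sketch mimics the effect described here on toy data: the counts of CAGE 5' ends falling inside each template (repeat) are summed, and templates without any signal are dropped. It is only a conceptual illustration and not ZENBU's implementation; the function name and the data are hypothetical, coordinates are assumed to be 0-based and half-open as in the UCSC table, and strand is ignored for simplicity, which may differ from what the TemplateCluster module actually does.

```python
def sum_tss_within_templates(templates, tss_counts):
    """Sum the 5'-end counts that fall inside each template interval.

    templates:  iterable of (chrom, start, end, name), 0-based half-open
    tss_counts: iterable of (chrom, position, count), 0-based position
    Returns only the templates that collected a non-zero total.
    """
    by_chrom = {}
    for chrom, position, count in tss_counts:
        by_chrom.setdefault(chrom, []).append((position, count))

    clustered = []
    for chrom, start, end, name in templates:
        total = sum(count
                    for position, count in by_chrom.get(chrom, [])
                    if start <= position < end)
        if total > 0:
            clustered.append((chrom, start, end, name, total))
    return clustered


# Hypothetical repeats and CAGE 5'-end counts.
repeats = [
    ("chr19", 100, 400, "Alu"),
    ("chr19", 500, 800, "L1"),   # no CAGE signal: will not be reported
    ("chr19", 900, 1200, "Alu"),
]
cage_5prime_ends = [
    ("chr19", 150, 5),
    ("chr19", 399, 2),
    ("chr19", 400, 7),   # just outside the first repeat (half-open end)
    ("chr19", 950, 1),
]
result = sum_tss_within_templates(repeats, cage_5prime_ends)
```

The nested scan is quadratic and only suitable for a toy example; a genome-wide analysis would rely on sorted intervals or an interval index.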
- In addition, we set the output to be formatted as "thick arrow" and color the features according to their "expression level".
- For convenience, and to be able to reuse it in other contexts, we can save the script. In this example we save it in our private folder, but it can also be saved in the public folder, giving the whole ZENBU user community the ability to use it (provided they have been granted access to the datastream).
- We thus obtain the following new track, where:
- only repeated elements overlapping with CAGE signal are reported
- note how the longest Alu element in the center of the view, which does not have any CAGE signal, is not reported in the TemplateCluster output
- the color of each repeat element represents the number of colocalized TSS
- clicking on an individual repeat element allows us to quickly visualize the distribution of the expression across samples
- Finally, we can download the entire track as oscdata for further processing with our favorite software (R, Excel, etc.).
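As an example of such local post-processing, the sketch below sums the expression per repeat name (the repeat family, given the OSCheader chosen earlier) from a tab delimited export. The column names "name" and "tagcount", the function name and the sample rows are assumptions made for illustration only; the actual columns of the downloaded file should be checked and the parameters adapted accordingly.

```python
import csv
import io
from collections import defaultdict


def summarize_by_name(handle, name_column="name", value_column="tagcount"):
    """Sum one numeric column per feature name in a tab delimited table.

    Lines starting with '#' are treated as comments and skipped.
    """
    rows = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t")
    totals = defaultdict(float)
    for row in reader:
        totals[row[name_column]] += float(row[value_column])
    return dict(totals)


# Hypothetical export; the real file layout may differ.
example = io.StringIO(
    "# hypothetical export, not actual ZENBU output\n"
    "chrom\tstart.0base\tend\tname\ttagcount\n"
    "chr19\t100\t400\tAlu\t7\n"
    "chr19\t900\t1200\tAlu\t1\n"
    "chr19\t2000\t2600\tL1\t4\n"
)
family_totals = summarize_by_name(example)
```

For a real download, the in-memory example can be replaced by an open file handle, for instance one obtained with gzip.open(path, "rt") if the file is compressed.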