Uploading UCSC repetitive elements track

From ZENBU documentation wiki

In this case study we illustrate how to upload annotations from third-party sources into ZENBU, using the latest hg19 RepeatMasker track available from UCSC as an example.
The MySQL table dump from UCSC provides the name and genomic location of each repetitive element, as well as the class and family it belongs to. We will show how to upload to ZENBU either a simple BED-based version (containing only the repeat names and locations) or the more complete MySQL table dump, which contains all the available information (alignment scores, repeat classes and families, ...) and can be used by ZENBU for manipulation and processing.

This example is also used as part of a more comprehensive case study focused on extracting the sub-cellular-compartment-specific expression of repetitive elements from the ENCODE K562 cell line analyzed by CAGE.

UCSC RepeatMasker (rmsk) track

Track description

The RepeatMasker (rmsk) track was created by using Arian Smit's RepeatMasker program, which screens DNA sequences for interspersed repeats and low complexity DNA sequences. The program outputs a detailed annotation of the repeats that are present in the query sequence (represented by this track), as well as a modified version of the query sequence in which all the annotated repeats have been masked.

RepeatMasker uses the Repbase Update library of repeats from the Genetic Information Research Institute (GIRI). Data are generated using the RepeatMasker -s flag. UCSC also used the Tandem Repeat Finder (trf) program, masking out repeats of period 12 or less. The repeats are just "soft" masked. Alignments may extend through repeats, but are not permitted to initiate in them.

Track content

This track contains, among others, the following classes of repeats:

  • Short interspersed nuclear elements (SINE), which include ALUs
  • Long interspersed nuclear elements (LINE)
  • Long terminal repeat elements (LTR), which include retroposons
  • DNA repeat elements (DNA)
  • Simple repeats (micro-satellites)
  • Low complexity repeats
  • Satellite repeats
  • RNA repeats (including RNA, tRNA, rRNA, snRNA, scRNA, srpRNA)
  • Other repeats, which includes class RC (Rolling Circle)
  • Unknown

A "?" at the end of the "Family" or "Class" (for example, DNA?) signifies that the curator was unsure of the classification. At some point in the future, either the "?" will be removed or the classification will be changed.

References and credits

Thanks to UCSC for providing the track and to Arian Smit and GIRI for providing the tools and repeat libraries used to generate it.


Simple BED based upload

downloading UCSC rmsk data as BED

BED-formatted UCSC track content can be obtained from the UCSC Table Browser.
The RepeatMasker (rmsk) track can be exported as a BED file by selecting

  • the assembly "Feb.2009 GRCh37/hg19"
  • the group "repeats and variations"
  • the track "RepeatMasker"
  • and finally the table "rmsk"

As we want the complete genome-wide set of repetitive elements to be loaded into ZENBU, we select

  • region: "genome"

ZENBU can load gzip-compressed BED files directly, so we further select:

  • output format: "BED - browser extensible data"
  • output file: we will name the file "UCSC_rmsk.hg19.bed.gz"
  • file type returned: "gzip compressed"

UCSC.TableDump.rmsk.asBED.1.png UCSC.TableDump.rmsk.asBED.2.png

Finally we click "get output", which opens a new page offering to download "one BED record per: Whole Gene" or to extend each entry by a fixed-length segment.
The compressed BED file should then be downloaded locally, ready to be transferred as is into ZENBU.
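
Before uploading, it can be worth checking that the download is complete and well-formed. The following is a minimal sketch, assuming a standard tab-delimited BED file; `summarize_bed` is a hypothetical helper name introduced here, not a ZENBU or UCSC tool. It reports the number of records and how many records have each column count.

```shell
# Sketch: sanity-check a gzip-compressed BED file before uploading it.
# summarize_bed is a hypothetical helper name, not part of ZENBU or UCSC.
summarize_bed() {
  # $1: path to a gzip-compressed, tab-delimited BED file
  gzip -dc "$1" \
    | awk -F'\t' '
        # skip optional "track"/"browser" header lines and comment lines
        /^(track|browser|#)/ { next }
        # count records and tally how many fields each record has
        { n++; cols[NF]++ }
        END {
          printf "records\t%d\n", n
          for (c in cols) printf "columns=%d\t%d\n", c, cols[c]
        }'
}

# Example usage (file name taken from the download step above):
# summarize_bed UCSC_rmsk.hg19.bed.gz
```

If more than one "columns=" line is printed, some records have a different number of fields than others, which usually points to a truncated or corrupted download.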

uploading UCSC rmsk downloaded data

In order to load annotations or expression/experiment data into ZENBU, we need to be logged in as a ZENBU user (as uploaded files need to have an owner).

ZENBU.login.1.png ZENBU.login.2.png

Clicking on the "User" tab of the ZENBU interface brings us to the "user profile" page if we are already logged into ZENBU, or to the log-in page otherwise. The "Data Upload" tab provides the interface for file uploading.

ZENBU.Upload.1.png ZENBU.Upload.rmsk.asBED.1.png

As we have named our file "UCSC_rmsk.hg19.bed.gz", ZENBU automatically recognizes it as a BED-formatted file.
The UCSC table dump provides a score column containing the Smith-Waterman (SW) alignment score. Since we are only interested in the location of the repetitive elements, there is no need to treat this score as an expression value, as we would for a full-fledged "experiment" (in which case ZENBU provides automatic computation of per-million expression normalization, with or without multimapping correction).
We therefore leave both check-boxes, "BED.score column has expression values" and "single-best-mapping expression", unchecked. Note that the BED score is still kept as a plain score associated with each entry, so we can still use it (for example, to filter repeats on the basis of their SW alignment score).
Once the file is uploaded, the "my data" section shows the uploaded BED file and offers us the possibility to share it with collaborators.
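
To get a feel for what filtering on the SW alignment score does, the same kind of filter can be previewed locally on the BED file. This is only a sketch of the idea, not how ZENBU implements filtering; `filter_bed_by_score` is a hypothetical helper name and the threshold in the usage line is arbitrary. It assumes the score sits in the fifth BED column.

```shell
# Sketch: keep only BED records whose score (column 5) reaches a minimum.
# filter_bed_by_score is a hypothetical helper name; ZENBU applies its own
# filtering inside its interface, this is just a local preview.
filter_bed_by_score() {
  # $1: gzip-compressed BED file, $2: minimum score to keep
  gzip -dc "$1" | awk -F'\t' -v min="$2" '$5 + 0 >= min + 0'
}

# Example usage (the threshold 1000 is an arbitrary illustration):
# filter_bed_by_score UCSC_rmsk.hg19.bed.gz 1000 | head
```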

ZENBU.Upload.rmsk.asBED.2.png

Comprehensive OSC table based upload

When the data are retrieved from UCSC in BED format, only the repeat name (repName) is obtained along with the genomic location.
Valuable information that ZENBU can use for manipulation and processing, such as the class (repClass) and the family (repFamily) each repeat belongs to, is not retrieved.

In order to get all the information provided in this track into ZENBU, one can alternatively upload the data as an OSCtable, which allows more information about each repeat (feature) to be stored in ZENBU as (feature-associated) metadata.

downloading UCSC rmsk mysql table dump

A full table dump of UCSC track content can be obtained from the UCSC Table Browser.
The full data content can be seen by clicking "describe table schema".

UCSC.TableDump.rmsk.content.1.png

The RepeatMasker (rmsk) track can be exported as a tab-delimited file by selecting

  • the assembly "Feb.2009 GRCh37/hg19"
  • the group "repeats and variations"
  • the track "RepeatMasker"
  • and finally the table "rmsk"

As we want the complete genome-wide set of repetitive elements to be loaded into ZENBU, we select

  • region: "genome"

We want the complete RepeatMasker track table dump, so we further select:

  • output format: "all fields from selected table"
  • output file: we will name the file "UCSC_rmsk.hg19.table.gz"
  • file type returned: "gzip compressed"

UCSC.TableDump.rmsk.asTableDump.1.png

This allows us to download the complete data locally as a tab-delimited text file.
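
Once downloaded, the extra information carried by the table dump can be inspected from the command line. The sketch below tallies the values found in one named column, for example repClass. It assumes the first line of the dump is a tab-delimited header line, optionally prefixed with "#"; `count_column_values` is a hypothetical helper name introduced for illustration.

```shell
# Sketch: tally the values of one named column in a gzip-compressed,
# tab-delimited table dump whose first line is a header.
# count_column_values is a hypothetical helper name.
count_column_values() {
  # $1: gzip-compressed table dump, $2: column name (e.g. repClass)
  gzip -dc "$1" \
    | awk -F'\t' -v col="$2" '
        NR == 1 {
          sub(/^#/, "", $1)   # drop the leading "#" of the header, if any
          for (i = 1; i <= NF; i++) if ($i == col) idx = i
          if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
          next
        }
        { count[$idx]++ }
        END { for (v in count) printf "%s\t%d\n", v, count[v] }' \
    | sort -t "$(printf '\t')" -k2,2nr -k1,1   # most frequent value first
}

# Example usage:
# count_column_values UCSC_rmsk.hg19.table.gz repClass
```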

creating a custom OSCheader

In order to load the data as an OSCtable, we need to prepend an OSCheader to the tab-delimited "UCSC_rmsk.hg19.table.gz" file that we have just retrieved from the UCSC table dump.
Generic wrapping of standard formats (BED, GFF, ...) is described in the ZENBU interpretation of OSCtable files wrapping section.

In this case, the UCSC table dump does not correspond to any of those generic formats. A quick look at the table schema (see screenshot above) tells us that the simplest OSCheader, defining each column in terms understood by the OSCtable parser, is:

  • genoName -> chrom
  • genoStart -> start.0base (all start coordinates in UCSC database are 0-based)
  • genoEnd -> end

If we want the repeat family to be the primary name of the feature in ZENBU, we additionally rename

  • repFamily -> name

In addition, we may wish to ignore the columns "bin" and "id", which are internal to UCSC, by adding the prefix "ignore." to those columns

  • bin -> ignore.bin
  • id -> ignore.id

To do so, edit the file with your favorite text editor and modify the first line (or add a new first line, since lines starting with # are ignored).
Here is an overview of the first lines of the modified file.

ZENBU.Upload.rmsk.asTableDump.OSCheader edition1.png


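For the UNIX savvy, this can be done with the following commands:

# Build the OSCheader from the first (header) line of the UCSC table dump.
# The "id" column is the last one, so that pattern is anchored to the line end.
zcat UCSC_rmsk.hg19.table.gz | head -n 1 \
  | sed -e 's/^#//' \
  | sed -e 's/^bin/ignore.bin/' \
  | sed -e 's/genoName/chrom/' \
  | sed -e 's/genoStart/start.0base/' \
  | sed -e 's/genoEnd/end/' \
  | sed -e 's/repFamily/name/' \
  | sed -e 's/id$/ignore.id/' \
  | gzip -c > UCSC_rmsk.hg19.table.oscheader.gz
# Prepend the new header and recompress the result.
# The original "#bin ..." line stays in the data but is ignored as a comment.
zcat UCSC_rmsk.hg19.table.oscheader.gz UCSC_rmsk.hg19.table.gz \
  | gzip -c > UCSC_rmsk.hg19.table.osc.gz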
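
After building the OSCtable file, the header can be checked before uploading. The sketch below reads the first line of the compressed file and verifies that the column names used in this case study (chrom, start.0base, end, name) are present. `check_osc_header` is a hypothetical helper name, and the list of required columns reflects the choices made above rather than a rule of the OSCtable format.

```shell
# Sketch: verify that the first line of a gzip-compressed OSCtable file
# contains the column names chosen in this case study.
# check_osc_header is a hypothetical helper name.
check_osc_header() {
  # $1: gzip-compressed OSCtable file
  header=$(gzip -dc "$1" | head -n 1)
  status=0
  for col in chrom start.0base end name; do
    # split the header on tabs and look for an exact, whole-field match
    if printf '%s\n' "$header" | tr '\t' '\n' | grep -qxF "$col"; then
      echo "ok       $col"
    else
      echo "missing  $col"
      status=1
    fi
  done
  return "$status"
}

# Example usage:
# check_osc_header UCSC_rmsk.hg19.table.osc.gz
```

The function returns a non-zero status when a column is missing, so it can be used as a guard in a larger script.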

uploading UCSC rmsk OSCtable

In order to load annotations or expression/experiment data into ZENBU, we need to be logged in as a ZENBU user (as uploaded files need to have an owner).

ZENBU.login.1.png ZENBU.login.2.png

Clicking on the "User" tab of the ZENBU interface brings us to the "user profile" page if we are already logged into ZENBU, or to the log-in page otherwise. The "Data Upload" tab provides the interface for file uploading.

ZENBU.Upload.1.png ZENBU.Upload.rmsk.asOSCtable.1.png

As we have named our file "UCSC_rmsk.hg19.table.osc.gz", ZENBU automatically recognizes it as an OSCtable-formatted file.
This data does not contain relevant expression information, so we leave the "single-best-mapping expression" check-box unchecked.
Once the file is uploaded, the "my data" section shows the uploaded OSCtable file and offers us the possibility to share it with collaborators.
