Data loading

From ZENBU documentation wiki

ZENBU supports several file types for uploading primary data into the system. Since ZENBU provides built-in data processing capabilities, it is possible to upload data in a more raw or primary format. When data is loaded into the system, it is first translated into the internal ZENBU Data Model, which allows the ZENBU system to manipulate that data as genomic annotation, expression data, and descriptive metadata.



File formats

Since ZENBU can process data internally to create its visualizations, it does not need to support many visualization file formats. Instead, it can focus on a few data interchange file formats that are commonly used in bioinformatics analysis. Leading tools in bioinformatics today are bedtools (BED files) and samtools (BAM/SAM files), which makes BED and BAM the most important data interchange formats. Supporting only a few file formats is a benefit of ZENBU: the bioinformatics pipelines that feed into ZENBU only need to handle a handful of already common file formats.

The file types currently supported by ZENBU upload are:

  • BAM & SAM sequence alignment files. These are the primary data files produced by sequence alignment and are the starting point for next generation sequencing (DNA/RNA) based bioinformatics. ZENBU can work directly with these files to create many different tracks. Since all information is available via BAM/SAM this is the recommended format for loading your RNA/DNA sequencing data into ZENBU.
  • BED annotation files. This is a general purpose genome annotation format which has become very commonly used by bioinformaticians for genome coordinate data interchange. ZENBU can also interpret the BED score as an expression value.
  • GFF, GFF2, GTF, and GFF3 annotation files. This is another common genome annotation file format, primarily used by Ensembl and GBrowse.
  • OSCtable. This is a highly flexible tabbed-text table format which is compatible with Excel, R, and any program which can parse tabbed tables. OSCtable includes a controlled vocabulary for column names and metadata to allow ZENBU to automatically parse these files into the internal data model. Even most custom bioinformatics analysis table output can be wrapped with an OSCtable header, which allows it to be loaded into ZENBU.
  • Other tab-delimited formats such as BED+n fields (ENCODE BroadPeaks: BED6+3, ENCODE NarrowPeaks: BED6+4) can easily be uploaded as OSCtable with a custom header to take advantage of those additional n fields.
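
As a rough illustration of the last point, the sketch below prepends a header line to ENCODE narrowPeak (BED6+4) rows. The six `eedb:` column names are the ones used in the OSCtable example at the end of this page; the four extra column names and the function `wrap_narrowpeak` are hypothetical placeholders, and a real upload may need further header directives as described in the OSCtable specifications page.

```python
# Column names for the BED6 part, taken from the OSCtable example on this page.
BED6_COLUMNS = [
    "eedb:chrom", "eedb:start.0base", "eedb:end",
    "eedb:name", "eedb:score", "eedb:strand",
]
# ASSUMPTION: placeholder names for the four extra narrowPeak fields.
# These are not confirmed ZENBU vocabulary; check the OSCtable specification.
NARROWPEAK_EXTRA = ["signalValue", "pValue", "qValue", "peak"]


def wrap_narrowpeak(lines):
    """Return a header line followed by the data rows of a narrowPeak file."""
    columns = BED6_COLUMNS + NARROWPEAK_EXTRA
    out = ["\t".join(columns)]
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines, comments, and UCSC track/browser lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) != len(columns):
            raise ValueError(
                "expected %d fields, got %d" % (len(columns), len(fields))
            )
        out.append("\t".join(fields))
    return out


if __name__ == "__main__":
    sample = ["chr1\t100\t200\tpeak1\t500\t+\t12.5\t3.2\t2.1\t50\n"]
    print("\n".join(wrap_narrowpeak(sample)))
```

The row count check is there so that a file with a different number of extra fields (for example BED6+3 broadPeak) fails loudly instead of being loaded with misaligned columns.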

The ZENBU track data download system can export data in these file formats:

  • BED annotation files.
  • GFF, GFF2, and GTF annotation files.
  • OSCtable.
  • ZENBU XML. This is the native ZENBU XML data interchange format which contains the full data model content.
  • DAS XML. The XML interchange format used by the DAS system http://www.biodas.org/

Secured data uploading

ZENBU provides for data loading through the secured user profile system.
This guarantees that the data is only available to the specific users who should have access to it.
After a user has securely logged into the ZENBU system, they can upload data for either private use or for sharing with specific collaborations.

User-upload.png

Uploading of data with associated experiment/expression

The UCSC genome browser and the IGV genome browser tie the data upload format to its visualization. For example, in UCSC, BED files are always displayed as annotation and wig files are always displayed as "wiggle" tracks. With UCSC or IGV, all processing must be performed outside the system before creating the visualization files.
In contrast, ZENBU offers greater flexibility: typical annotation files (ESTs, gene models, ...) in BED format can be used to produce wiggle tracks or heatmaps, BAM files can be displayed as annotations (so as to see individual reads), and so on.

Experiment expression data can be loaded via different means.

  • as BED files
    • BED file based data uploading offers the option to use the score column and assign its value to a specific expression data type by clicking the [BED.score column has expression values] option and selecting the datatype associated with those expression values.
    • If the expression is simply a count of '1' for every feature (for example, when loading mapped reads), then one can use BED or GFF style files and simply check the [single-best-mapping expression] option.
  • as OSCtable files
    • The ZENBU OSCtable parser is able to parse both tab-separated and space-separated files.
    • OSCtables provide a rich controlled vocabulary to specify multiple experiments within a single file, experiment metadata, and multiple datatypes in multiple columns in the file.
    • OSCtable based data uploading allows all possible mappings of data into the internal data model. The ZENBU OSCtable parser supports an extended vocabulary of metadata directives and column name spaces. Details can be found in the OSCtable specifications page.
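
The two BED options above can be pictured with the following sketch. It is only a conceptual illustration of the difference between "score column as expression" and "count of 1 per feature", not ZENBU's actual loader; the function name `bed_expression` and its `use_score` parameter are hypothetical.

```python
def bed_expression(line, use_score=True):
    """Return (chrom, start, end, expression) for one tab-separated BED line.

    use_score=True  : take the expression value from the BED score column.
    use_score=False : count every feature as an expression of 1.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs at least chrom, start, and end")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    if use_score:
        if len(fields) < 5:
            raise ValueError("no score column (BED field 5) on this line")
        value = float(fields[4])
    else:
        value = 1.0
    return (chrom, start, end, value)


if __name__ == "__main__":
    row = "chr19\t50167762\t50167789\tp1.IRF3\t8362\t-"
    print(bed_expression(row))                   # score used as expression
    print(bed_expression(row, use_score=False))  # one count per feature
```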

Uploading of data with associated metadata

Uploading of data with associated hyperlinked metadata

  • as OSCtable file
    • oscheader column name must be `zenbu:hyperlink`
    • the column content must be of the form <a href="http://somewhere.com/blah">name to appear in panel</a>
eedb:chrom eedb:start.0base eedb:end eedb:name eedb:score eedb:strand zenbu:hyperlink FF:cluster_id FF:promter_name
chr19 50167762 50167789 p1.IRF3 8362 - <a href="http://somewhere.com/blah">name to appear in panel</a> chr19:50167762..50167789,- p1.IRF3
chr19 50168861 50168873 p2.IRF3 4721 - <a href="http://somewhere.com/blah">name to appear in panel</a> chr19:50168861..50168873,- p2.IRF3
chr19 50169013 50169062 p1.IRF3 25266 - <a href="http://somewhere.com/blah">name to appear in panel</a> chr19:50169013..50169062,- p1.IRF3
chr19 50169064 50169121 p1.IRF3 63287 - <a href="http://somewhere.com/blah">name to appear in panel</a> chr19:50169064..50169121,- p1.IRF3
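
Rows like the ones above could be generated with a small script along these lines. This is a sketch under assumptions: the helper names `hyperlink_cell` and `osc_row` are hypothetical, the output is written tab-separated, and HTML-escaping the URL and label is a precaution whose handling by ZENBU has not been verified here.

```python
from html import escape

# Column names copied from the example header above (first seven columns).
COLUMNS = [
    "eedb:chrom", "eedb:start.0base", "eedb:end", "eedb:name",
    "eedb:score", "eedb:strand", "zenbu:hyperlink",
]


def hyperlink_cell(url, label):
    """Build the <a href="...">label</a> content for the zenbu:hyperlink column."""
    return '<a href="{}">{}</a>'.format(escape(url, quote=True), escape(label))


def osc_row(chrom, start, end, name, score, strand, url, label):
    """Return one tab-separated data row matching COLUMNS."""
    cells = [chrom, str(start), str(end), name, str(score), strand,
             hyperlink_cell(url, label)]
    return "\t".join(cells)


if __name__ == "__main__":
    print("\t".join(COLUMNS))
    print(osc_row("chr19", 50167762, 50167789, "p1.IRF3", 8362, "-",
                  "http://somewhere.com/blah", "name to appear in panel"))
```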