Data loading

From ZENBU documentation wiki

ZENBU supports several file types for uploading primary data into the system. Since ZENBU provides built-in data processing capabilities, it is possible to upload data in a more raw or primary format. When data is loaded into the system, it is first translated into the internal ZENBU Data Model, which allows the ZENBU system to manipulate that data as genomic annotation, expression data, and descriptive metadata.



File formats

Since ZENBU can process data internally to create its visualizations, it does not need to support many visualization file formats. Instead, it can focus on just a few data interchange file formats which are commonly used for bioinformatics analysis. The leading tools in bioinformatics these days are bedtools (BED files) and samtools (BAM/SAM files), which makes BED and BAM the most important data interchange files. Supporting only a few file formats is a benefit of ZENBU: the bioinformatics pipelines that feed into ZENBU only need to handle a handful of already common file formats.

The file types currently supported by ZENBU upload are:

  • BAM & SAM sequence alignment files. These are the primary data files produced by sequence alignment and are the starting point for next generation sequencing (DNA/RNA) based bioinformatics. ZENBU can work directly with these files to create many different tracks. Since all information is available via BAM/SAM this is the recommended format for loading your RNA/DNA sequencing data into ZENBU.
  • BED annotation files. This is a general purpose genome annotation format which has become very commonly used by bioinformaticians for genome coordinate data interchange. ZENBU can also interpret the BED score as an expression value.
  • GFF, GFF2, GTF, and GFF3 annotation files. This is another common genome annotation file format, primarily used by Ensembl and GBrowse.
  • OSCtable. This is a highly flexible tabbed-text table format which is compatible with Excel, R, and any program which can parse tabbed tables. OSCtable includes a controlled vocabulary for column names and metadata to allow ZENBU to automatically parse these files into the internal data model. Even most custom bioinformatics analysis table output can be wrapped with an OSCtable header, which allows it to be loaded into ZENBU.
  • Other tab-delimited formats such as BED+n fields (ENCODE BroadPeaks: BED6+3, ENCODE NarrowPeaks: BED6+4) can easily be uploaded as OSCtable with a custom header to take advantage of those additional n fields.
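
As a rough illustration of wrapping a BED6+4 file with a column header, the sketch below prepends one tab-separated header line to an ENCODE narrowPeak file. The first six column names are the eedb: names used in the hyperlink example on this page. The last four (signalValue, pValue, qValue, peak) are simply the ENCODE field names and are an assumption here: the names ZENBU actually expects for extra columns should be taken from the OSCtable specifications page. The function name wrap_narrowpeak is hypothetical.

```shell
# Prepend a tab-separated column-name line to a narrowPeak (BED6+4) file.
# Columns 1-6 reuse the eedb: names shown elsewhere on this page.
# Columns 7-10 are ASSUMED names (the ENCODE field names), not confirmed
# OSCtable vocabulary; check the OSCtable specification before uploading.
wrap_narrowpeak() {
  printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    eedb:chrom eedb:start.0base eedb:end eedb:name eedb:score eedb:strand \
    signalValue pValue qValue peak
  cat -- "$1"
}

# Usage (file names are placeholders):
#   wrap_narrowpeak peaks.narrowPeak > peaks_with_header.txt
```

The data rows are passed through unchanged, so the original file stays usable by other tools.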

The ZENBU track data download system can export data in these file formats:

  • BED annotation files.
  • GFF, GFF2, and GTF annotation files.
  • OSCtable.
  • ZENBU XML. This is the native ZENBU XML data interchange format which contains the full data model content.
  • DAS XML. The XML interchange format used by the DAS system http://www.biodas.org/

Secured data uploading

ZENBU provides data loading through the secured user profile system.
This guarantees that the data is only available to the specific users who should have access to it.
After a user has securely logged into the ZENBU system, they can upload data either for private use or for sharing with specific collaborations.

[Image: User-upload.png]

Uploading of data with associated experiment/expression

The UCSC genome browser and the IGV genome browser tie the data upload format to its visualization. For example, in UCSC, BED files are always displayed as annotation and wig files are always displayed as "wiggle" tracks. With UCSC or IGV, all processing must be performed outside the system before creating the visualization files.
In contrast, ZENBU offers greater flexibility: typical annotation files (ESTs, gene models, ...) in BED format can be used to produce wiggle tracks or heatmaps, BAM files can be displayed as annotations (so as to see individual reads), etc.

Experiment expression data can be loaded via different means.

  • as BED files
    • BED file based data uploading offers the option to use the score column and assign its value to a specific expression data type by clicking the [BED.score column has expression values] option and selecting the datatype associated with those expression values.
    • If the expression is simply a count of '1' for every feature (for example, when loading mapped reads), then one can use BED or GFF style files and simply check the [single-best-mapping expression] option.
  • as OSCtable files
    • The ZENBU OSCtable parser can parse both tab-separated and space-separated files.
    • OSCtables provide a rich controlled vocabulary to specify multiple experiments within a single file, experiment metadata, and multiple datatypes in multiple columns in the file.
    • OSCtable based data uploading allows every possible mapping of data into the internal data model. The ZENBU OSCtable parser supports an extended vocabulary of metadata directives and column name spaces. Details can be found in the OSCtable specifications page.
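
When the BED score column is going to be used as an expression value, it can help to confirm beforehand that every score is actually a number. The following is a minimal sketch of such a check, assuming a tab-separated BED file; the function name check_bed_score is hypothetical and is not part of ZENBU.

```shell
# Report BED lines whose 5th column (score) is not a plain number.
# Assumes tab-separated input; "track", "browser" and "#" lines are skipped.
# Returns a non-zero exit status if at least one bad line is found.
check_bed_score() {
  awk -F'\t' '
    /^(track|browser|#)/ { next }
    NF < 5 || $5 !~ /^-?[0-9]+(\.[0-9]+)?([eE][-+]?[0-9]+)?$/ {
      printf "line %d: score \"%s\" is not numeric\n", NR, $5
      bad = 1
    }
    END { exit bad }
  ' "$1"
}

# Usage (file name is a placeholder):
#   check_bed_score my_expression.bed && echo "scores look numeric"
```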

Uploading of data with associated metadata

  • as BED files
    • the RGB column will be automatically stored as metadata with the key bed:itemRgb
  • as GFF, GFF2, or GTF files
    • all GTF attributes present will be stored as key/value metadata
  • as OSCtable files
    • The ZENBU OSCtable parser can parse both tab-separated and space-separated files.
    • OSCtables provide a rich controlled vocabulary to specify multiple experiments within a single file, experiment metadata, and multiple datatypes in multiple columns in the file.
    • OSCtable based data uploading allows every possible mapping of data into the internal data model. The ZENBU OSCtable parser supports an extended vocabulary of metadata directives and column name spaces. Details can be found in the OSCtable specifications page.

Uploading of data with associated hyperlinked metadata

  • as OSCtable file
    • similar to uploading data with associated metadata, but with a special column named zenbu:hyperlink
    • the column content must be of the form <a title="prefix-title" href="http://somewhere.com/blah">name to appear in panel</a> where the title="" is optional. If no title="" is specified, a generic hyperlink: will appear as the prefix before the hyperlink.
    • note that special characters like & should be escaped in the URL, as shown in the example below (where & is written as %26)
eedb:chrom  eedb:start.0base  eedb:end  eedb:name  eedb:score  eedb:strand  zenbu:hyperlink 
chr1  38327991  38327992  rs980  0  -  <a prefix="dbSNP" href="http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?searchType=adhoc_search%26type=rs%26rs=rs980">rs980</a>
chr1  236140736 236140737 rs982  0  -  <a prefix="dbSNP" href="http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?searchType=adhoc_search%26type=rs%26rs=rs982">rs982</a>
chr1  92148235  92148236  rs990  0  -  <a prefix="dbSNP" href="http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?searchType=adhoc_search%26type=rs%26rs=rs990">rs990</a>
chr1  160968208 160968209 rs993  0  +  <a prefix="dbSNP" href="http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?searchType=adhoc_search%26type=rs%26rs=rs993">rs993</a>
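
A column like the one above can be generated rather than typed by hand. The sketch below builds one zenbu:hyperlink cell for a dbSNP identifier and writes each & in the query string as %26, matching the example rows. The function name make_dbsnp_link is hypothetical, and the optional prefix attribute is left out for simplicity.

```shell
# Build a zenbu:hyperlink cell for a dbSNP rs identifier.
# Each "&" in the URL is replaced by %26, as in the example rows above.
# The optional prefix attribute is omitted in this sketch.
make_dbsnp_link() {
  rsid=$1
  url="http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?searchType=adhoc_search&type=rs&rs=$rsid"
  escaped=$(printf '%s' "$url" | sed 's/&/%26/g')
  printf '<a href="%s">%s</a>' "$escaped" "$rsid"
}

# Usage: append the result as the last tab-separated field of a row, e.g.
#   printf 'chr1\t38327991\t38327992\trs980\t0\t-\t%s\n' "$(make_dbsnp_link rs980)"
```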

Bulk command-line upload of datafiles : zenbu_upload

In addition to the web interfaces for loading data, we provide (as of version 2.8.2) a command line tool, zenbu_upload, for bulk or scripted uploading of data. The program must be compiled from the source code (https://sourceforge.net/projects/zenbu/?source=directory). After installation on your servers, zenbu_upload can be found at /zenbu/bin/zenbu_upload, /usr/local/bin/zenbu_upload or /zenbu/src/ZENBU_2.x.x/c++/tools/zenbu_upload. It is designed to be compiled on Linux computation servers to enable bulk loading to a remote ZENBU server.

To enable zenbu_upload, each user needs to create a directory ~/.zenbu/ and a file ~/.zenbu/id_hmac. This file should contain one tab-separated line: the user's email address and their hmac key for the ZENBU server they wish to upload to. The user's hmac key is visible on the user's profile page on the ZENBU server, for example https://fantom.gsc.riken.jp/zenbu/user. If no hmac key is visible, click the [generate random hmac key] button.

Your ~/.zenbu/id_hmac file content should look like this (with your email and your hmac key):

john.smith@gmail.com     b38046a646db4ed492b88dc18ba8a78f273febbb8d363c4f98d140b3d819665d6912cd7bfd6ba87defffa147e23fedd86d4e1e35e9ccd5216123f618b70faf2
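
One possible way to create this file from the shell is sketched below. The function name write_zenbu_id_hmac and its optional third argument (an alternative home directory, handy for trying it out) are hypothetical. The chmod step is a precaution added here and is not something the ZENBU documentation requires.

```shell
# Write the tab-separated id_hmac file under a home directory.
# $1 = email address, $2 = hmac key copied from the ZENBU profile page,
# $3 = home directory (optional, defaults to $HOME).
write_zenbu_id_hmac() {
  email=$1
  hmac_key=$2
  home_dir=${3:-$HOME}
  mkdir -p "$home_dir/.zenbu" &&
    printf '%s\t%s\n' "$email" "$hmac_key" > "$home_dir/.zenbu/id_hmac" &&
    # Precaution (an assumption, not a documented requirement): keep the key private.
    chmod 600 "$home_dir/.zenbu/id_hmac"
}

# Usage (both values are placeholders):
#   write_zenbu_id_hmac "john.smith@gmail.com" "<your hmac key>"
```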

The hmac key also allows users to synchronize their private and collaboration data among remote ZENBU servers. Just copy your hmac key to all of your user accounts (with the same email address) and the remote ZENBU servers will synchronize the user accounts.

An example of bulk loading all the BAM files in a directory would be

 for FILE in /quality_control/delivery/2710/Mapping-version21001/RDhi*/*/genome_mapped/*bam; do echo "$FILE"; zenbu_upload -url "https://fantom.gsc.riken.jp/zenbu" -collab_uuid DktADfFGRUj5GxDI -assembly hg38 -platform RNAseq -file "$FILE"; done

To see your available collaborations

zenbu_upload -url "https://fantom.gsc.riken.jp/zenbu" -collabs

To search your previous uploads for data files

zenbu_upload -url "https://fantom.gsc.riken.jp/zenbu" -list -filter "THP1 RNAseq"

Bulk uploading with control file

zenbu_upload also includes an option (-filelist) to use a control file for bulk uploading, for example of 20,000 single cell experiment files. In addition to making the loading process more manageable, the -filelist option performs a "duplicate file check" based on the original full path of each file. This prevents duplicate reloads and also makes it possible to double-check that all the files in the batch were loaded. Due to network glitches, a bulk upload will occasionally fail to send some files from the batch correctly (maybe 0.1%); rerunning the same control file will catch the missing files and resend only those.

Here is an example command

 n208(1025 bin)> zenbu_upload -url "https://fantom.gsc.riken.jp/zenbu" -collab_uuid Xxxxxxxx -assembly hg38 -platform CAGE -desc "FANTOM5 remap hg38" -filelist f5-remap-hg38-load-files2

The control file has four tab-separated columns:

  • col1- full path to file
  • col2- display name; defaults to the filename if empty, just like the command line -name
  • col3- description
  • col4- GFF-attribute style metadata [ tag1=value1;tag2=value2a,value2b;tag3=value3; ]

The minimum requirement for the control file is just column 1, the path to the file. An easy way to create one is to run ls with the full path to generate column 1. The file can be edited later with programs like Excel if one wants to add the display name and description columns.

[jessica@n208 LS2831]$ ls `pwd`/*bam > upload-col1

An alternative method to create the control file is to use shell commands like basename, awk...

[jessica@n208 LS2831]$ for bed in *bed ; do printf '%s\t%s\t%s\t\n' "$(pwd)/$bed" "$(basename "$bed" .bed)" "some-long-desc $(pwd) $bed" ; done > touploadalittle

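Here is an example using advanced shell scripting and a metadata file to create a control file with rich experimental metadata for column 4. Because paste joins the two halves line by line, the rows of the metadata file must be in the same order as the BED files listed by firstHalf.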

firstHalf() {
  # Columns 1 and 2: full path and display name (file name without .bed)
  BASEDIR=/home/me/myproject
  for file in "$BASEDIR"/mybedfiles/*bed
  do
    echo -e "$file\t$(basename "$file" .bed)"
  done
}

secondHalf() {
  IFS=$'\t'
  sed -e 1d \
      -e 's/"//g' \
      -e 's/,/\t/g' \
      /home/me/myproject/metadata-table.csv |
    while read sample_id Control Comment sampleQC discard timePoint
    do
      echo -n $timePoint $sample_id myProject upload1
      echo -ne "\t"
      echo -n "Control=$Control;Comment=$Comment;sampleQC=$sampleQC;"
      echo    "discard=$discard;timePoint=$timePoint"
    done
}

paste <(firstHalf) <(secondHalf) > zUpload.tsv

Here are some example lines from this control file:

/home/me/myproject/mybedfiles/1772-144-108_A01.bed   1772-144-108_A01        day0 1 1772-144-108_A01 myProject upload1    Control=NA;Comment=NA;sampleQC=TRUE;discard=FALSE;timePoint=day0
/home/me/myproject/mybedfiles/1772-144-108_A02.bed   1772-144-108_A02        day0 1 1772-144-108_A02 myProject upload1    Control=NA;Comment=NA;sampleQC=TRUE;discard=FALSE;timePoint=day0
/home/me/myproject/mybedfiles/1772-144-108_A03.bed   1772-144-108_A03        day0 1 1772-144-108_A03 myProject upload1    Control=NA;Comment=NA;sampleQC=TRUE;discard=FALSE;timePoint=day0
/home/me/myproject/mybedfiles/1772-144-108_A04.bed   1772-144-108_A04        day0 1 1772-144-108_A04 myProject upload1    Control=NA;Comment=NA;sampleQC=FALSE;discard=TRUE;timePoint=day0
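
Before starting a large upload, it may be worth checking the control file for obvious layout problems. The sketch below is one possible check based only on the column description above: one to four tab-separated columns, a full path in column 1, and tag=value pairs separated by ; in column 4. The function name check_control_file is hypothetical, and zenbu_upload itself may accept or reject things this check does not cover.

```shell
# Check the layout of a zenbu_upload control file.
# Rules (taken from the column description on this page):
#   - 1 to 4 tab-separated columns per line
#   - column 1 is a full (absolute) path
#   - column 4, if present, is a list of tag=value pairs separated by ";"
# Returns a non-zero exit status if any problem is reported.
check_control_file() {
  awk -F'\t' '
    NF < 1 || NF > 4 { printf "line %d: expected 1-4 columns, found %d\n", NR, NF; bad = 1; next }
    $1 !~ /^\//      { printf "line %d: column 1 is not a full path\n", NR; bad = 1 }
    NF == 4 && $4 != "" {
      n = split($4, pairs, ";")
      for (i = 1; i <= n; i++)
        if (pairs[i] != "" && pairs[i] !~ /^[^=]+=/) {
          printf "line %d: malformed metadata \"%s\"\n", NR, pairs[i]
          bad = 1
        }
    }
    END { exit bad }
  ' "$1"
}

# Usage: check_control_file zUpload.tsv && echo "control file layout looks fine"
```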

In an ideal situation, the user can use a scripting language (Perl, Python) or R to query databases for files and their related experimental metadata, and from that create a control file which includes a rich set of metadata in column 4.
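
As a small step in that direction that stays within the shell, the sketch below matches each file to its metadata row by sample id instead of by line position, so the two inputs do not need to be in the same order. It assumes a tab-separated metadata table with a header line whose first column is the sample id, and BED files named after that id. That layout and the function name build_control_file are assumptions made for illustration.

```shell
# Build a control file by looking up each BED file in a metadata table.
# $1     = tab-separated metadata table; header line, sample id in column 1
#          (ASSUMED layout, adapt to your own table)
# $2...  = full paths to BED files named <sample id>.bed
# Output = path, display name, empty description, tag=value; metadata
build_control_file() {
  metadata=$1
  shift
  for path in "$@"; do
    printf '%s\t%s\n' "$(basename "$path" .bed)" "$path"
  done |
  awk -F'\t' -v OFS='\t' '
    # First input (metadata): remember header names, then one entry per sample.
    NR == FNR && FNR == 1 { for (i = 2; i <= NF; i++) key[i] = $i; ncol = NF; next }
    NR == FNR {
      attrs = ""
      for (i = 2; i <= ncol; i++) attrs = attrs key[i] "=" $i ";"
      meta[$1] = attrs
      next
    }
    # Second input (stdin): sample id and path for each file.
    ($1 in meta) { print $2, $1, "", meta[$1]; next }
    { print "no metadata for " $1 > "/dev/stderr" }
  ' "$metadata" -
}

# Usage (paths are placeholders):
#   build_control_file metadata.tsv /home/me/myproject/mybedfiles/*.bed > zUpload.tsv
```

Files without a matching metadata row are reported on standard error and left out of the control file, which makes missing samples easy to spot.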