Data Sources

The ZENBU system was designed from first principles to be a collaborative OMICS data integration system where primary data is dynamically uploaded by users of the system. Because of the data processing capabilities built into ZENBU, this uploaded data serves as input to data processing scripts, and the result of that processing can then be downloaded or visualized in the ZENBU genome browser.

The primary data which has been uploaded into the system is collectively referred to as Data Sources. When data is loaded into the system, each data file is translated into the internal ZENBU data model and grouped into one or more annotation data sources and/or Expression experiment data sources, depending on the uploaded data file format and the upload parameter options.

Since ZENBU loads data into an abstract data model, it is important to be able to find your data after upload, because there is not always a direct one-to-one mapping of uploaded file to DataSource. To accomplish this, ZENBU utilizes a metadata search system. When data is uploaded, users are asked to provide a name and description of the file and its data content. Providing good descriptions not only allows you to easily find your data at a later time, but also makes it easier for your collaborators to understand the content and nature of your data.

Uploadable data file formats

Before data can become available as a Data Source in ZENBU, it must first be uploaded into the system through one of the supported file types. The file types currently supported by ZENBU upload are:

  • BAM & SAM sequence alignment files.
  • BED UCSC style genome annotation files
  • GFF GFF2 GTF Ensembl/gbrowse style genome annotation files
  • OSCTable open format tab separated tables for genome annotation and multiple experiment expression (RIKEN/ZENBU format).

Annotation Data Sources

An annotation data source [also called a FeatureSource] is a collection of genomic features. This corresponds to data sets like "gene sets", "promoter sets", "microarray probe sets", or "repeat elements". In the UCSC genome browser, this is what is referred to as a track (i.e. a data track) loaded from a BED file. When uploading data into ZENBU, the data in each file is mapped into one or more FeatureSources.

BED is the simplest file format: each BED file maps to a single FeatureSource. When uploading a BED file there is an option to extract expression from it and optionally create an Experiment for the file. The options include mapping the BED score onto expression or simply counting each location with an expression value of 1.
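As an illustration of these two options, here is a minimal sketch in Python of how a single BED line could be turned into a feature plus an expression value; the function and field names are hypothetical and this is not the actual ZENBU loader.

  # Illustrative sketch of the two BED expression options described above;
  # names and structures are hypothetical, not the actual ZENBU implementation.
  def bed_line_to_feature(line, score_as_expression=False):
      cols = line.rstrip("\n").split("\t")
      chrom, start, end = cols[0], int(cols[1]), int(cols[2])
      name = cols[3] if len(cols) > 3 else "%s:%d-%d" % (chrom, start, end)
      score = float(cols[4]) if len(cols) > 4 else 0.0
      feature = {"chrom": chrom, "start": start, "end": end, "name": name}
      if score_as_expression:
          expression = score   # option 1: map the BED score onto expression
      else:
          expression = 1.0     # option 2: count each location with a value of 1
      return feature, expression

For example, the BED line "chr1  100  200  peak1  37  +" would yield an expression value of 37.0 with score_as_expression=True, or 1.0 otherwise.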

Because BAM/SAM files always contain sequence alignments, each file is mapped into a single FeatureSource and a single Experiment.

GFF/GTF files can be mapped into one or more FeatureSources depending on the content of the file's 3rd column. This 3rd column holds the GFF feature type, which corresponds directly to the ZENBU FeatureSource concept: every distinct GFF/GTF feature type is mapped into a different ZENBU FeatureSource.
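A rough sketch of this grouping step is shown below (Python; illustrative only, not the actual ZENBU loader).

  # Sketch: group GFF/GTF records by feature type (the 3rd column); each group
  # would become a separate FeatureSource. Illustrative only.
  from collections import defaultdict

  def group_gff_by_feature_type(path):
      feature_sources = defaultdict(list)        # feature type -> list of records
      with open(path) as gff:
          for line in gff:
              if line.startswith("#") or not line.strip():
                  continue
              cols = line.rstrip("\n").split("\t")
              feature_sources[cols[2]].append(cols)   # cols[2] is e.g. "gene", "exon", "CDS"
      return feature_sources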

OSCTable files are also mapped into one or more FeatureSources depending on how they are configured. The OSCTable format allows for complete flexibility to control how the data is mapped into FeatureSources upon loading into ZENBU.

Experiment Data Sources

A signal-based experiment data source [also simply called an Experiment] is a collection of signal data, which can be things like RNA expression signal, p-values from calculations, or ChIP-seq signal. By the definition of the ZENBU data model, each signal data element is attached to a genomic feature which is itself part of a FeatureSource. Since a genomic Feature can have many signal data points attached, the Experiment is critical to describing where each signal value came from.

Expression refers to a single measurement data element with an associated signal-based DataType (e.g. "tagcount", "tpm", "mapquality", "score", "pvalue", "rle", to name a few). Depending on the data file used, there can be anywhere from zero to many Experiments associated with each uploaded data file. Although the data model calls this "expression", it can represent any type of numerical signal.
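The relationships described above can be pictured roughly as follows; this is an illustrative approximation in Python, not the actual ZENBU data model classes.

  # Rough approximation of the Feature / Expression / Experiment relationships;
  # illustrative only, not the real ZENBU classes.
  from dataclasses import dataclass, field
  from typing import List

  @dataclass
  class Experiment:                 # describes where a signal value came from
      name: str

  @dataclass
  class Expression:                 # one signal value of a given DataType
      datatype: str                 # e.g. "tagcount", "tpm", "pvalue"
      value: float
      experiment: Experiment

  @dataclass
  class Feature:                    # a genomic location belonging to a FeatureSource
      chrom: str
      start: int
      end: int
      expressions: List[Expression] = field(default_factory=list)

  @dataclass
  class FeatureSource:              # a collection of genomic features
      name: str
      features: List[Feature] = field(default_factory=list)

In this picture a single Feature can carry many Expression values, and the Experiment on each value is what distinguishes, for example, the tag counts of two different samples at the same genomic location.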

BED is primarily used for genomic annotations, but ZENBU allows optional expression experiments to be defined. If enabled at upload time, an Experiment is created and associated with the data file. There are options to interpret the BED score as expression or to simply count each BED line location with an expression value of 1.

GFF/GTF: in the same way as for BED files, ZENBU allows the score column to be mapped onto Expression and an optional Experiment to be created and associated with each GFF/GTF file.

BAM/SAM: as noted above, because these files always contain sequence alignments, each file is mapped into a single FeatureSource and a single Experiment.

OSCTable format allows for complete flexibility to control how the data is mapped into Experiments and Expression. There can be many Experiments defined within a single OSCTable file.
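As a conceptual sketch of how multiple Experiments can live in one table, the snippet below scans a tab-separated header for expression-style columns; the "exp.<datatype>.<experiment>" column pattern is purely an assumption for illustration and is not the real OSCTable column naming convention.

  # Conceptual sketch only: discover experiments from expression-style columns
  # in a tab-separated header. The "exp.<datatype>.<experiment>" pattern is an
  # assumption for illustration, not the real OSCTable specification.
  def discover_experiments(header_line):
      experiments = {}            # experiment name -> list of (column index, datatype)
      for idx, col in enumerate(header_line.rstrip("\n").split("\t")):
          if col.startswith("exp."):
              _, datatype, experiment = col.split(".", 2)
              experiments.setdefault(experiment, []).append((idx, datatype))
      return experiments

Under that assumed naming, a header such as "chrom  start  end  exp.tagcount.sampleA  exp.tpm.sampleA  exp.tagcount.sampleB" would yield two Experiments, sampleA and sampleB, each with its own DataTypes.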

Types of data which can be loaded

ZENBU was designed with a data model abstraction which allows all types of genomic annotation and expression data to be uploaded into the ZENBU system. Here are examples of the different types of experimental and analysis results data which can be loaded by users into ZENBU using the available upload file types.

Genome mapped RNA/DNA sample sequences

This class of data includes RNAseq, shortRNA, CAGE, and ChIP-seq. The nature of this data is a sample of DNA or RNA which is processed by a molecular biology protocol and then sequenced. It is now very common to use next-generation sequencing instruments like the Illumina HiSeq2000, Illumina G3, SOLiD, 454, or Heliscope for this sequencing. This class of instrument produces millions to hundreds of millions of short sequences, often referred to as sequence tags. Because of the short length of these tags, often the best way to analyze them is to first map them onto a reference genome assembly with a program like BWA or TopHat.

The ZENBU system can directly load these genome-aligned sequences without the need for additional processing. A common format for these alignments is BAM.

Genome annotations

This class of data is often the result of a bioinformatics analysis pipeline or through manual curation efforts. Common data file formats for genome annotations include BED and GFF/GTF. Since the nature of this data is descriptive, it is sometimes very useful to include descriptive metadata along with genomic location information. The OSCTable format is a highly flexible format which allows for attaching very complex metadata, expression, and numerical values onto genomic annotation features.

Microarray expression

The ZENBU system provides a means to load microarray data. Once microarray data is loaded, it can be processed and visualized as either "expression tracks" or as "annotated-expression hybrid tracks".

Currently the loading of microarray data is a little complicated, but we hope to make this process easier for users in the future. ZENBU currently has several microarray probe-sets from Illumina and Affymetrix mapped onto the genome. To load microarray expression, one needs to download one of these probe-sets in OSCTable format, extend the columns of the file to add the raw/normalized microarray expression for each probe, and then upload the modified OSCTable file back into ZENBU.
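A minimal sketch of the "extend the columns" step is shown below; the probe identifier column name ("probe_id"), the new column label, and the assumption that the file is a plain tab-separated table with a single header row are all hypothetical choices for illustration.

  # Sketch of extending a downloaded probe-set table with microarray expression
  # values before re-upload. Column names and file layout are assumptions.
  import csv

  def add_expression_column(probe_table_path, expression_by_probe, out_path,
                            new_column="exp.norm.myarray"):
      with open(probe_table_path, newline="") as src, open(out_path, "w", newline="") as dst:
          reader = csv.DictReader(src, delimiter="\t")
          writer = csv.DictWriter(dst, fieldnames=reader.fieldnames + [new_column],
                                  delimiter="\t")
          writer.writeheader()
          for row in reader:
              row[new_column] = expression_by_probe.get(row["probe_id"], "0")
              writer.writerow(row)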

Annotated Expression analysis results

The ZENBU system is able to work with very complex analysis results which can often consist of genomic locations, descriptive metadata, expression signal from multiple samples, and numerical analysis results. The OSCTable format allows any complex table of data to be mapped with column names and loaded into ZENBU. Since each analysis process/result is often unique, the flexibility of OSCTable allows the data to be loaded in its original form, rather than having to convert it into a standard file format and thereby throw away some information.

After loading complex analysis data from OSCTable files, ZENBU can process and manipulate any and all aspects of the dataset. It can be treated as simply as a genomic annotation, or in complex ways with data processing and hybrid annotated-expression visualization tracks.

Novel genomes

The ZENBU system was designed in an era when many novel genomes are being generated, so we made sure the system could scale into the range of thousands of genomes. In ZENBU, genomes are treated as a "namespace", which means that the process of creating a new genome is as easy as naming it. For example, if a data file names genome "human-bobsmith" and chromosome "chr3", ZENBU will create them if they do not already exist. The reference genome sequence and the size of each chromosome are treated as additional data loaded into that genome namespace. Once uploaded, the ZENBU system becomes aware of the new genome name.

Although currently not available through the web interfaces, novel genome sequences can be loaded by system administrators through the ZENBU command line tools. In a near-future upgrade this functionality will be made available to users via the web interface.

Virtual DataSources - Track Data pooling

Although DataSources are defined at data load time, the ZENBU system provides a dynamic, flexible data mixing technology called Data Pooling. This allows virtual DataSources to be created when configuring ZENBU tracks by mixing and matching from the uploaded physical DataSources. This provides a great deal of flexibility, both in terms of data loading and data processing, without requiring external data processing and additional data loading.