ZDX file

From ZENBU documentation wiki

The ZDX (zenbu data exchange) file format was designed primarily for internal ZENBU use, providing the file-based persistence [1] layer behind the ZENBU track caching system. ZENBU was designed primarily as a dataflow system in which processed and manipulated data exists only in computer memory as it flows through the data stream processing system. This works extremely well for medium-weight tracks (dozens of data sources with simple data processing), but we started to see that heavy tracks (1000s of data sources, or very complex scripting with many side-streams) could be slow for users who just wanted to see the results of someone else's views. Thus we designed the TrackCaching system and the ZDX file.

Because ZENBU is designed around random access, the ZDX file needed to provide very fast random-access write capabilities. For example, if a user configures a new track that the track-cache system has never seen before and looks at the data around EGR1 (hg19:: chr5 137800224-137805959) and then at RUNX1 (hg19:: chr21 36094723-364869689), the TrackBuilding system writes the results of these two streaming queries into the ZDX file, while all other locations in the genome remain empty.

To make it possible to write data at arbitrary genomic locations in any order, we designed the ZDX file as a binary format based on the computer science principles behind filesystem design. In particular, we based the design on the ideas of inodes, directory tables and file blocks.
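The filesystem analogy can be illustrated with a minimal sketch. This is not the actual ZDX layout, which is not specified here: the class name `SparseRegionStore`, the `SEGMENT_SIZE` constant and the in-memory directory are all assumptions made for illustration. It only shows how a directory table that maps genomic segments to data blocks allows out-of-order writes while unvisited regions stay empty.

```python
import io

SEGMENT_SIZE = 100_000  # hypothetical segment width in bases, not a ZDX constant


class SparseRegionStore:
    """Toy block store: a directory table maps genomic segments to data blocks."""

    def __init__(self, fileobj=None):
        # Data blocks are appended to this file-like object in arrival order.
        self.data = fileobj if fileobj is not None else io.BytesIO()
        # Directory table: (chrom, segment index) -> (byte offset, byte length).
        self.directory = {}

    def write_segment(self, chrom, position, payload):
        """Append payload as a new block and record it for the segment at position."""
        key = (chrom, position // SEGMENT_SIZE)
        offset = self.data.seek(0, io.SEEK_END)  # seek returns the new offset
        self.data.write(payload)
        self.directory[key] = (offset, len(payload))

    def read_segment(self, chrom, position):
        """Return the block covering position, or None if it was never written."""
        entry = self.directory.get((chrom, position // SEGMENT_SIZE))
        if entry is None:
            return None
        offset, length = entry
        self.data.seek(offset)
        return self.data.read(length)


# Two queries arrive in arbitrary order; only those segments get blocks.
store = SparseRegionStore()
store.write_segment("chr5", 137_800_224, b"<feature name='EGR1 region'/>")
store.write_segment("chr21", 36_094_723, b"<feature name='RUNX1 region'/>")
```

Reading any position that was never queried, for example `store.read_segment("chr1", 1000)`, returns `None`. A cache built this way can tell "not built yet" apart from "built, but empty".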

ZDX uses ZENBU data model

Because the ZDX file is designed for ZENBU data persistence, it uses the ZENBU data model. The ZENBU data model exists as an abstract design, as a series of C++ classes for server-side systems, and as an XML representation for data transport.

Assembly.cpp     Datatype.cpp     EdgeSource.cpp   Feature.cpp        MetadataSet.cpp
Assembly.h       Datatype.h       EdgeSource.h     Feature.h          MetadataSet.h
Chrom.cpp        Edge.cpp         Experiment.cpp   FeatureSource.cpp  Metadata.cpp
Chrom.h          Edge.h           Experiment.h     FeatureSource.h    Metadata.h
DataSource.cpp   EdgeSet.cpp      Expression.cpp                      Symbol.cpp
DataSource.h     EdgeSet.h        Expression.h                        Symbol.h

For example, here is the ZENBU XML version of a GENCODE gene:

<feature id="D0040C4F-6DC4-466A-9E86-AE85037937F5::10003" name="ENST00000377139.3" start="50162829" end="50169132" strand="-">
 <chrom chr="chr19" asm="hg19"/> 
 <featuresource id="D0040C4F-6DC4-466A-9E86-AE85037937F5::1:::FeatureSource" category="bed_region" 
   name="UCSC gencode v12 comprehensive hg19" create_date="Mon Nov 26 14:26:20 2012" create_timestamp="1353907580" 
   feature_count="167536" owner_openid="https://id.mixi.jp/28555316"/> 
 <mdata type="eedb:display_name">ENST00000377139.3</mdata> 
 <mdata type="bed:itemRgb">0</mdata> 
 <subfeatures count="11"> 
   <feature ctg="3utr" start="50162829" end="50162904" strand="-" name="ENST00000377139.3_3utr" fsrc_id="D0040C4F-6DC4-466A-9E86-AE85037937F5::3:::FeatureSource"/> 
   <feature ctg="block" start="50162829" end="50163090" strand="-" name="ENST00000377139.3_block1" fsrc_id="D0040C4F-6DC4-466A-9E86-AE85037937F5::2:::FeatureSource"/> 
   <feature ctg="block" start="50163970" end="50164085" strand="-" name="ENST00000377139.3_block2" fsrc_id="D0040C4F-6DC4-466A-9E86-AE85037937F5::2:::FeatureSource"/> 
   <feature ctg="block" start="50165205" end="50165585" strand="-" name="ENST00000377139.3_block3" fsrc_id="D0040C4F-6DC4-466A-9E86-AE85037937F5::2:::FeatureSource"/> 
   <feature ctg="block" start="50165682" end="50165874" strand="-" name="ENST00000377139.3_block4" fsrc_id="D0040C4F-6DC4-466A-9E86-AE85037937F5::2:::FeatureSource"/>
   <feature ctg="block" start="50166445" end="50166515" strand="-" name="ENST00000377139.3_block5" fsrc_id="D0040C4F-6DC4-466A-9E86-AE85037937F5::2:::FeatureSource"/> 
   <feature ctg="block" start="50166600" end="50166771" strand="-" name="ENST00000377139.3_block6" fsrc_id="D0040C4F-6DC4-466A-9E86-AE85037937F5::2:::FeatureSource"/> 
   <feature ctg="block" start="50167931" end="50168103" strand="-" name="ENST00000377139.3_block7" fsrc_id="D0040C4F-6DC4-466A-9E86-AE85037937F5::2:::FeatureSource"/> 
   <feature ctg="5utr" start="50168096" end="50168103" strand="-" name="ENST00000377139.3_5utr" fsrc_id="D0040C4F-6DC4-466A-9E86-AE85037937F5::4:::FeatureSource"/> 
   <feature ctg="block" start="50168888" end="50169132" strand="-" name="ENST00000377139.3_block8" fsrc_id="D0040C4F-6DC4-466A-9E86-AE85037937F5::2:::FeatureSource"/> 
   <feature ctg="5utr" start="50168888" end="50169132" strand="-" name="ENST00000377139.3_5utr" fsrc_id="D0040C4F-6DC4-466A-9E86-AE85037937F5::4:::FeatureSource"/> 
 </subfeatures> 
</feature>
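To show how this XML maps onto a feature with coordinates, metadata and subfeatures, here is a minimal Python sketch using only the standard library. The function name `summarize_feature` is hypothetical, and the embedded XML is an abbreviated copy of the example above (only two of the eleven subfeatures, with the `featuresource` element and `fsrc_id` attributes dropped for brevity). The real ZENBU parsers are C++ classes and are not shown here.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the GENCODE feature above (2 of 11 subfeatures kept).
FEATURE_XML = """
<feature id="D0040C4F-6DC4-466A-9E86-AE85037937F5::10003" name="ENST00000377139.3" start="50162829" end="50169132" strand="-">
 <chrom chr="chr19" asm="hg19"/>
 <mdata type="eedb:display_name">ENST00000377139.3</mdata>
 <mdata type="bed:itemRgb">0</mdata>
 <subfeatures count="11">
   <feature ctg="3utr" start="50162829" end="50162904" strand="-" name="ENST00000377139.3_3utr"/>
   <feature ctg="block" start="50162829" end="50163090" strand="-" name="ENST00000377139.3_block1"/>
 </subfeatures>
</feature>
"""


def summarize_feature(xml_text):
    """Extract location, metadata and subfeature intervals from one <feature>."""
    root = ET.fromstring(xml_text)
    chrom = root.find("chrom")
    subfeatures = root.findall("./subfeatures/feature")
    return {
        "name": root.get("name"),
        "assembly": chrom.get("asm"),
        "chrom": chrom.get("chr"),
        "start": int(root.get("start")),
        "end": int(root.get("end")),
        "strand": root.get("strand"),
        # Metadata is a list of typed key/value pairs.
        "metadata": {m.get("type"): m.text for m in root.findall("mdata")},
        # Each subfeature carries its own category (ctg) and interval.
        "subfeatures": [
            (s.get("ctg"), int(s.get("start")), int(s.get("end")))
            for s in subfeatures
        ],
    }


summary = summarize_feature(FEATURE_XML)
```

The resulting dictionary keeps the same nesting as the XML: one parent feature located on an assembly and chromosome, free-form typed metadata, and a list of categorized subfeatures.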

LZ4 compression of XML

Although XML appears very verbose in its uncompressed form, XML compresses extremely well (sometimes up to a 20x compression ratio). For ZDX, ZENBU uses the LZ4 compression algorithm (http://code.google.com/p/lz4/) when reading and writing the data blocks of the ZDX file.
LZ4 is designed to have exceptionally fast compression and decompression speeds with very good compression ratios. LZ4 was designed primarily for compressing data for transport, where the data is transient in nature. This fits the needs of ZDX well, because ZDX is used in a caching system where compression and decompression speed matters more than the absolute smallest file size.
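The claim that repetitive feature XML compresses well can be checked with a short sketch. Note the substitution: LZ4 is not part of the Python standard library, so this example uses `zlib` at its fastest level as a stand-in. The ratio it reports is therefore not an LZ4 figure. The sample data is synthetic, modelled on the subfeature lines of the GENCODE example, and `make_sample_xml` is a hypothetical helper.

```python
import zlib


def make_sample_xml(n_features=1000):
    """Build synthetic subfeature XML lines shaped like the GENCODE example."""
    lines = [
        '<feature ctg="block" start="%d" end="%d" strand="-" '
        'name="ENST00000377139.3_block%d"/>'
        % (50162829 + i * 500, 50162829 + i * 500 + 200, i)
        for i in range(n_features)
    ]
    return "\n".join(lines).encode("utf-8")


raw = make_sample_xml()
packed = zlib.compress(raw, 1)      # level 1 favors speed over size (zlib, not LZ4)
restored = zlib.decompress(packed)  # lossless round trip
ratio = len(raw) / len(packed)

print(f"raw: {len(raw)} bytes, compressed: {len(packed)} bytes, ratio: {ratio:.1f}x")
```

Most of the bytes in each line are repeated tag names, attribute names and identifier prefixes, which is why dictionary-based compressors such as LZ4 or zlib do well on this kind of data.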

Even so, LZ4 compression of ZENBU XML yields excellent final file sizes, similar to those of BAM files.

As a test we took a BAM alignment file from the ENCODE project wgEncodeUwDnaseMonocd14ro1746AlnRep1.bam with 33,322,702 alignments and converted it into a ZDX file. Because the ZENBU data model is flexible, it is possible to also strip the alignments down to their minimal components (alignment chrom,start,end,strand, name, score) and store that into a ZDX file.

content                              file                                               size
original BAM file                    wgEncodeUwDnaseMonocd14ro1746AlnRep1.bam           1.3 GBytes
ZDX full alignment information       wgEncodeUwDnaseMonocd14ro1746AlnRep1.zdx           1.3 GBytes
ZDX minimal alignment information    wgEncodeUwDnaseMonocd14ro1746AlnRep1_simple.zdx    788 MBytes

ZDX performance

Due to the map-reduce parallel-processing capabilities of ZENBU and the ZDX file format, ZDX files can be created very quickly.

Taking the same example ENCODE BAM file, wgEncodeUwDnaseMonocd14ro1746AlnRep1.bam (33,322,702 alignments, 1.3 GBytes of file space), we ran the following timing tests.
The samtools commands come first and show equivalent steps on the same file, to give perspective on ZENBU and ZDX build performance.


command                                                           process time                  comments
samtools view -h wgEncodeUwDnaseMonocd14ro1746AlnRep1.bam         62.9 seconds                  convert BAM into SAM
samtools view -bS wgEncodeUwDnaseMonocd14ro1746AlnRep1.sam        5 min 5 sec (305 seconds)     convert SAM into BAM
samtools sort wgEncodeUwDnaseMonocd14ro1746AlnRep1.bam            7 min 19 sec (439 seconds)    sorting of BAM file required for indexing
samtools index wgEncodeUwDnaseMonocd14ro1746AlnRep1.bam           30.1 seconds                  build index on sorted BAM file
ZENBU create ZDX from wgEncodeUwDnaseMonocd14ro1746AlnRep1.bam    82.3 seconds                  8 threads, builds a sorted and indexed ZDX file; equivalent of samtools sort and index
ZENBU export wgEncodeUwDnaseMonocd14ro1746AlnRep1.zdx to BED      2 min 39 sec (159 seconds)    1 thread, read from ZDX and export to BED format
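To put the timings in proportion, the arithmetic behind the "equivalent of samtools sort and index" comparison can be written out. This sketch uses only the figures from the table above. The helper names are hypothetical, and the result is a wall-clock ratio only: the ZDX build used 8 threads, and the table does not state how many threads samtools used.

```python
# Figures copied from the timing table above.
ALIGNMENTS = 33_322_702
SECONDS = {
    "samtools sort": 439.0,
    "samtools index": 30.1,
    "ZENBU create ZDX": 82.3,      # 8 threads
    "ZENBU export to BED": 159.0,  # 1 thread
}


def wall_clock_ratio(baseline_steps, candidate_step):
    """Total baseline time divided by the candidate step's time."""
    baseline = sum(SECONDS[step] for step in baseline_steps)
    return baseline / SECONDS[candidate_step]


def alignments_per_second(step):
    """Throughput of one step over the whole 33,322,702-alignment file."""
    return ALIGNMENTS / SECONDS[step]


ratio = wall_clock_ratio(["samtools sort", "samtools index"], "ZENBU create ZDX")
build_rate = alignments_per_second("ZENBU create ZDX")
export_rate = alignments_per_second("ZENBU export to BED")

print(f"sort + index vs ZDX build: {ratio:.1f}x")
print(f"ZDX build:  {build_rate:,.0f} alignments/s")
print(f"BED export: {export_rate:,.0f} alignments/s")
```

Sorting plus indexing takes 469.1 seconds in total, so the ZDX build finishes in roughly 5.7 times less wall-clock time for the equivalent work.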

Although these tests were done on a single BAM file to allow for comparison, ZDX is best suited to storing processed results rather than sequence alignments, thanks to its completely flexible data model. In contrast, BAM was designed and optimized for the storage of sequence alignments.