BAM and SAM file support
SAM/BAM files are commonly produced by sequence alignment programs such as BWA and TopHat, and most next-generation sequencing systems provide BAM alignments as their final output. ZENBU provides both native BAM/SAM file support and an extended column namespace for the OSCtable format that allows 'wrapping' SAM files with an OSCtable header.
ZENBU interpretation of BAM alignments
Since BAM is an alignment file format, ZENBU must interpret these alignments in terms of the ZENBU data model. The ZENBU data model is composed of data sources (FeatureSource, Experiment), genomic location information (Features), expression count data (Expression), and descriptive metadata. The BAM alignments are mapped into the ZENBU system as follows:
- The genomic bounds of the alignment are mapped to the Feature genomic location.
- Currently the BAM CIGAR is not interpreted.
- Each BAM alignment is given an expression of "1 tagcount". If each sequence is aligned to a single location, this is an accurate interpretation. If the alignment program reports multiple locations for the same sequence, those sequence tags can be over-represented.
- The BAM mapping quality (MAPQ) is mapped into both the Feature score and an expression value of type "mapquality".
- A single FeatureSource and a single Experiment are created for each BAM file.
- The BAM header is parsed for descriptive metadata, which is transferred into the FeatureSource and Experiment. The following SAM/BAM header tags are parsed into ZENBU metadata:
- ReadGroup ID => sam:id
- ReadGroup CN (SequencingCenter) => sam:center
- ReadGroup SM (sample) => sam:sample
- ReadGroup LB (library) => sam:library
- ReadGroup DS (description) => sam:description
- ReadGroup FO (flow order) => sam:flow_order
- ReadGroup PL (platform/technology) => sam:platform
- ReadGroup PI (predicted median insert size) => sam:predicted_insert_size
- ReadGroup PU (platform unit) => sam:platform_unit
- ReadGroup KS (key sequence) => sam:key_sequence
- ReadGroup DT (date of production) => sam:production_date
- ReadGroup PG (processing program) => sam:program
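The tag mapping above can be illustrated with a short Python sketch. This is not ZENBU's own parser; the function name `read_group_metadata` and the choice to return a list of key/value pairs are assumptions made for illustration. Only the tag-to-key table is taken from the list above.

```python
# Mapping of SAM @RG header tags to ZENBU metadata keys, copied from the list above.
RG_TAG_TO_METADATA = {
    "ID": "sam:id",
    "CN": "sam:center",
    "SM": "sam:sample",
    "LB": "sam:library",
    "DS": "sam:description",
    "FO": "sam:flow_order",
    "PL": "sam:platform",
    "PI": "sam:predicted_insert_size",
    "PU": "sam:platform_unit",
    "KS": "sam:key_sequence",
    "DT": "sam:production_date",
    "PG": "sam:program",
}


def read_group_metadata(header_text):
    """Return (key, value) metadata pairs found on @RG lines of a SAM header.

    Hypothetical helper: ZENBU's internal implementation may differ.
    """
    metadata = []
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue  # only read-group lines carry the tags listed above
        for field in line.split("\t")[1:]:
            # Each field is TAG:VALUE; split on the first colon only,
            # since values such as dates may contain further colons.
            tag, sep, value = field.partition(":")
            key = RG_TAG_TO_METADATA.get(tag)
            if sep and key is not None:
                metadata.append((key, value))
    return metadata
```

Tags that are not in the table are skipped in this sketch; whether ZENBU keeps or discards them is not stated here.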
In the future we hope to extend the ZENBU interpretation of BAM alignments. These future extensions will not require reloading of data, since the BAM files are retained in their original form and interpretation is performed on demand at query time.
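Because interpretation happens per alignment at query time, the current mapping can be pictured as a small function applied to one record. The Python sketch below works on a SAM text line rather than on binary BAM, and the function name `interpret_alignment` and the dictionary layout are hypothetical. Since the CIGAR is not interpreted, the end coordinate here is approximated as start plus read length minus one; that approximation is an assumption of this sketch, not a statement of how ZENBU computes the bounds.

```python
def interpret_alignment(sam_line):
    """Map one SAM alignment line to a feature-like dictionary.

    Illustrative only: field names in the result are hypothetical.
    """
    fields = sam_line.rstrip("\n").split("\t")
    name = fields[0]          # QNAME
    chrom = fields[2]         # RNAME
    start = int(fields[3])    # POS, 1-based
    mapq = float(fields[4])   # MAPQ
    seq = fields[9]           # SEQ, may be "*" when not stored

    # Assumption: without CIGAR interpretation, use the read length
    # to approximate the end of the alignment on the genome.
    read_length = len(seq) if seq != "*" else 1
    end = start + read_length - 1

    return {
        "name": name,
        "chrom": chrom,
        "start": start,
        "end": end,
        "score": mapq,  # MAPQ becomes the Feature score
        # Every alignment counts as one tag; MAPQ is also an expression value.
        "expression": {"tagcount": 1.0, "mapquality": mapq},
    }
```

A read reported at several locations produces several such records, each with a tagcount of 1, which is the over-representation described above.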
SAM as OSCtable header
Since SAM files are tab-separated text files, it is possible to represent them with an OSCtable column header line using the ZENBU extended column namespace and load them directly into the ZENBU system.
eedb:name eedb:sam_flag eedb:chrom eedb:start.1base eedb:score eedb:sam_cigar sam:mrnm sam:mpos sam:isize eedb:seqread sam:qual eedb:sam_opt
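As a minimal sketch of the wrapping step, the Python generator below prepends the column header above to the alignment lines of a SAM file. The function name `wrap_sam_as_osctable` is hypothetical, and two points are assumptions of this sketch rather than documented ZENBU behavior: that the header columns are tab-separated, and that the SAM `@` header lines are dropped from the wrapped file.

```python
# Column names copied from the OSCtable header line above, in SAM column order.
OSC_SAM_COLUMNS = [
    "eedb:name", "eedb:sam_flag", "eedb:chrom", "eedb:start.1base",
    "eedb:score", "eedb:sam_cigar", "sam:mrnm", "sam:mpos",
    "sam:isize", "eedb:seqread", "sam:qual", "eedb:sam_opt",
]


def wrap_sam_as_osctable(sam_lines):
    """Yield an OSCtable header line followed by the SAM alignment lines."""
    # Assumption: OSCtable columns are separated by tabs, like SAM fields.
    yield "\t".join(OSC_SAM_COLUMNS) + "\n"
    for line in sam_lines:
        if line.startswith("@"):
            continue  # assumption: SAM header lines are not kept
        yield line if line.endswith("\n") else line + "\n"
```

For example, `out.writelines(wrap_sam_as_osctable(open("reads.sam")))` would write a wrapped copy of a hypothetical `reads.sam` to an already opened output file `out`.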
The official SAM/BAM specification is available at http://samtools.sourceforge.net/SAM1.pdf