BAM and SAM file support

From ZENBU documentation wiki

SAM/BAM files are commonly produced by sequence alignment programs such as BWA and TopHat, and most next-generation sequencing pipelines provide BAM alignments as their final output. ZENBU supports these files in two ways: native BAM/SAM file support, and an extended column namespace in the OSCtable format that allows 'wrapping' a SAM file with an OSCtable header.

ZENBU interpretation of BAM alignments

Because BAM is an alignment file format, ZENBU must interpret these alignments in terms of the ZENBU data model. The ZENBU data model is composed of data sources (FeatureSource, Experiment), genomic location information (Features), expression count data (Expression), and descriptive metadata. The BAM alignments are mapped into the ZENBU system as follows:

  • The genomic bounds of the alignment are mapped to the Feature genome location.
    • Currently the BAM CIGAR is not interpreted.
  • Each BAM alignment is given an expression of "1 tagcount". If each sequence is aligned to a single location, this is an accurate interpretation. If the alignment program reports multiple locations for a sequence, those sequence tags can be over-represented.
  • The BAM mapping quality (MAPQ) is mapped into both the Feature score and an expression value of type "mapquality".
  • A single FeatureSource and Experiment are created for each BAM file.
  • The BAM header is parsed for descriptive metadata, which is transferred into the FeatureSource and Experiment. The following SAM/BAM header tags are parsed into ZENBU metadata:
    • ReadGroup ID => sam:id
    • ReadGroup CN (SequencingCenter) => sam:center
    • ReadGroup SM (sample) => sam:sample
    • ReadGroup LB (library) => sam:library
    • ReadGroup DS (description) => sam:description
    • ReadGroup FO (flow order) => sam:flow_order
    • ReadGroup PL (platform/technology) => sam:platform
    • ReadGroup PI (predicted median insert size) => sam:predicted_insert_size
    • ReadGroup PU (platform unit) => sam:platform_unit
    • ReadGroup KS (key sequence) => sam:key_sequence
    • ReadGroup DT (date of production) => sam:production_date
    • ReadGroup PG (processing program) => sam:program
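The mapping above can be illustrated with a minimal Python sketch. It is not ZENBU's implementation: the function names read_group_metadata and interpret_alignment and the shape of the returned dictionaries are hypothetical. How ZENBU derives the end coordinate of a Feature is not stated here, so the sketch assumes the conventional approach of summing the reference-consuming CIGAR operations.

```python
import re

# SAM @RG header tag -> ZENBU metadata key, as listed above.
RG_TAG_TO_ZENBU = {
    "ID": "sam:id",
    "CN": "sam:center",
    "SM": "sam:sample",
    "LB": "sam:library",
    "DS": "sam:description",
    "FO": "sam:flow_order",
    "PL": "sam:platform",
    "PI": "sam:predicted_insert_size",
    "PU": "sam:platform_unit",
    "KS": "sam:key_sequence",
    "DT": "sam:production_date",
    "PG": "sam:program",
}


def read_group_metadata(header_line):
    """Translate one '@RG' header line into ZENBU-style metadata pairs."""
    fields = header_line.rstrip("\n").split("\t")
    if fields[0] != "@RG":
        return {}
    metadata = {}
    for field in fields[1:]:
        # Split on the first colon only, since values such as dates contain colons.
        tag, _, value = field.partition(":")
        if tag in RG_TAG_TO_ZENBU:
            metadata[RG_TAG_TO_ZENBU[tag]] = value
    return metadata


def interpret_alignment(sam_line):
    """Sketch of one alignment as a feature with two expression values."""
    cols = sam_line.rstrip("\n").split("\t")
    chrom, start, mapq, cigar = cols[2], int(cols[3]), int(cols[4]), cols[5]
    # Assumption: the end is the start plus the reference-consuming
    # CIGAR operations (M, D, N, =, X). No splice blocks are derived.
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    ref_len = sum(int(n) for n, op in ops if op in "MDN=X")
    return {
        "chrom": chrom,
        "start": start,  # 1-based, as in SAM
        "end": start + max(ref_len, 1) - 1,
        "score": mapq,
        "expression": {"tagcount": 1, "mapquality": mapq},
    }
```

Note that a read reported at three locations would yield three features, each with a tagcount of 1, which is the over-representation described above.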

Future extensions

In the future we hope to extend the ZENBU interpretation of BAM alignments. These future extensions will not require reloading data, since the BAM files are retained in their original form and interpretation is performed on demand at query time.

SAM as OSCtable header

Because SAM files are tab-separated text files, they can be described by an OSCtable column header line that uses the ZENBU extended column namespace, and then loaded directly into the ZENBU system. The columns of the header line follow the order of the SAM alignment fields:

eedb:name	eedb:sam_flag	eedb:chrom	eedb:start.1base	eedb:score	eedb:sam_cigar	sam:mrnm	sam:mpos	sam:isize	eedb:seqread	sam:qual	eedb:sam_opt
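As a rough illustration of how the header line lines up with SAM fields, the following Python sketch pairs each column name with the corresponding field of an alignment line. The helper names label_sam_line and wrap_sam_lines are hypothetical. Two things are assumptions rather than documented ZENBU behaviour: that all optional SAM fields fall under the final eedb:sam_opt column, and that '@' header lines are dropped when wrapping.

```python
# Column names taken from the OSCtable header line above, in SAM field order.
OSC_COLUMNS = [
    "eedb:name",         # QNAME
    "eedb:sam_flag",     # FLAG
    "eedb:chrom",        # RNAME
    "eedb:start.1base",  # POS (1-based)
    "eedb:score",        # MAPQ
    "eedb:sam_cigar",    # CIGAR
    "sam:mrnm",          # mate reference name
    "sam:mpos",          # mate position
    "sam:isize",         # insert size
    "eedb:seqread",      # SEQ
    "sam:qual",          # QUAL
    "eedb:sam_opt",      # optional fields
]


def label_sam_line(sam_line):
    """Pair each SAM field with its OSCtable column name."""
    # Assumption: everything after the 11 mandatory fields is kept
    # together as the value of the last column.
    fields = sam_line.rstrip("\n").split("\t", len(OSC_COLUMNS) - 1)
    return dict(zip(OSC_COLUMNS, fields))


def wrap_sam_lines(lines):
    """Yield the OSCtable header line followed by the alignment lines."""
    yield "\t".join(OSC_COLUMNS) + "\n"
    for line in lines:
        # Assumption: SAM '@' header lines are not part of the table body.
        if not line.startswith("@"):
            yield line
```

For example, label_sam_line applied to an alignment line returns a dictionary in which "eedb:chrom" holds the reference name and "eedb:start.1base" holds the 1-based start position as text.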

SAM specification

The official SAM/BAM specification is available here