Installation guide: loading data

Most data loading can be performed through the web interface “upload” system, but several command-line programs are also available for loading data.

Loading new genomes

In the current version of ZENBU (2.9.1), new genomes must be loaded into MySQL databases using a command-line Perl script. In future versions, we will be adding genome creation/loading into the upload system. First create a new MySQL database to hold the new genome sequence. Genome sequences are often very large (the human genome is 3 billion bases and thus requires about 3 GB for the MySQL database), so make sure the MySQL server has enough disk space.

From inside the mysql server

CREATE DATABASE zenbu_susScr3_pig;
GRANT SELECT, CREATE TEMPORARY TABLES, LOCK TABLES on zenbu_susScr3_pig.* to 'read'@"%";
GRANT SELECT, CREATE TEMPORARY TABLES, LOCK TABLES, INSERT, UPDATE, CREATE, ALTER,  DELETE, INDEX on zenbu_susScr3_pig.* to 'zenbu_admin'@"%";
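
When several genomes are to be loaded, the statements above can be generated from the database name rather than typed by hand. The following is a minimal sketch, assuming the same 'read' and 'zenbu_admin' accounts as in the example; the function name genome_db_sql is hypothetical and not part of ZENBU.

```shell
#!/bin/sh
# Hypothetical helper (not part of ZENBU): print the CREATE/GRANT statements
# shown above for a given database name.
genome_db_sql() {
    db="$1"
    # Refuse empty names and names with characters other than letters,
    # digits and underscore, so the name is safe to paste into SQL.
    case "$db" in
        ''|*[!A-Za-z0-9_]*) echo "invalid database name: $db" >&2; return 1 ;;
    esac
    cat <<EOF
CREATE DATABASE $db;
GRANT SELECT, CREATE TEMPORARY TABLES, LOCK TABLES ON $db.* TO 'read'@'%';
GRANT SELECT, CREATE TEMPORARY TABLES, LOCK TABLES, INSERT, UPDATE, CREATE, ALTER, DELETE, INDEX ON $db.* TO 'zenbu_admin'@'%';
EOF
}

# Example: print the statements for the pig genome database
genome_db_sql zenbu_susScr3_pig
```

The output can be reviewed and then piped to the mysql client under an account that is allowed to create databases.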

From the command-line

cmdline> mysql -hmysql_hostname -uzenbu_admin -pzenbu_admin -P3306 zenbu_susScr3_pig < /zenbu/src/ZENBU_2.9.1/sql/schema.sql
cmdline> /zenbu/bin/zenbu_register_peer -url "mysql://zenbu_admin:zenbu_admin@mysql_hostname:3306/zenbu_susScr3_pig" -newpeer

From inside the mysql server

 use zenbu_susScr3_pig;
 INSERT INTO `assembly` (`assembly_id`, `taxon_id`, `ncbi_version`, `ucsc_name`,`osc_name`, `release_date`, `taxon_name`, `sequence_loaded`) 
      VALUES (1,9823,'Sscrofa10.2','susScr3','susScr3','2011-09-07','Sus scrofa','y');
 INSERT INTO `taxon` (`taxon_id`, `genus`, `species`, `sub_species`, `common_name`,`classification`) 
      VALUES (9823,'Sus','scrofa',NULL,'pig', 'cellular organisms; Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria; Deuterostomia; Chordata; Craniata; Vertebrata; Gnathostomata; Teleostomi; Euteleostomi; Sarcopterygii; Dipnotetrapodomorpha; Tetrapoda; Amniota; Mammalia; Theria; Eutheria; Boreoeutheria; Laurasiatheria; Cetartiodactyla; Suina; Suidae; Sus');

The taxon information can be found at the NCBI Taxonomy Browser: http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=9823

Once the database has been created, you can load the genome sequence. For the above example, the pig genome Sscrofa10.2 can be found at NCBI at this location: http://www.ncbi.nlm.nih.gov/assembly/GCF_000003025.5

The sequence FASTA files are here (reached by clicking the “GenBank FTP site” link): ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Sus_scrofa/Sscrofa10.2/Primary_Assembly/assembled_chromosomes/FASTA/

Download the chr--.fa.gz files (one gzipped FASTA file per chromosome) into a local directory on your server, for example /zenbu/genomes/susScr3_pig
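
Before loading, it can help to count the bases in the downloaded files, since the database needs roughly one byte per base (as in the human example above). The following is a minimal sketch; the function name count_bases is hypothetical, and it assumes the files are gzipped FASTA with header lines starting with ">".

```shell
#!/bin/sh
# Hypothetical helper (not part of ZENBU): count sequence characters in one
# or more gzipped FASTA files.
count_bases() {
    # Decompress, drop the ">" header lines, strip newlines and other
    # whitespace, then count the remaining characters.
    gzip -dc "$@" | grep -v '^>' | tr -d '[:space:]' | wc -c | tr -d ' '
}

# Usage (prints a single number):
#   count_bases /zenbu/genomes/susScr3_pig/*.fa.gz
```

The count includes any N placeholders in the sequence, so it approximates the stored sequence length rather than the number of resolved bases.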

From the command line, run the eedb_chromChunkTool.pl script. It may have been copied into /zenbu/bin, or it may still be in the source directory.

 cmdline> /zenbu/src/ZENBU_2.9.1/scripts/eedb_chromChunkTool.pl -url "mysql://zenbu_admin:zenbu_admin@mysql_hostname.gsc.riken.jp:3306/zenbu_susScr3_pig" -assembly susScr3 -seqdir /zenbu/genomes/susScr3_pig/ -withseq -create -store

Finally, modify the ZENBU web-service configuration XML file: add the new genome database to the <federation_seeds> section.

<zenbu_server_config>
  . . . . . snip . . .
  <federation_seeds>
     <seed> mysql://read:read@mysql_hostname.gsc.riken.jp:3306/zenbu_susScr3_pig/</seed>
     <seed>zenbu://fantom.gsc.riken.jp/zenbu/</seed>
  </federation_seeds>
  . . . . . snip . . .
</zenbu_server_config>

Make sure that the remote connection to the RIKEN ZENBU server is at the bottom of the federation seeds list. This ensures that the local databases are used before any remote searches go back to RIKEN.
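
The ordering can be checked from the command line after editing the file. The following is a minimal sketch, assuming one <seed> element per line and that remote servers use the zenbu:// scheme as in the example above; the function names are hypothetical, and the path to the configuration file is passed as an argument because it depends on your installation.

```shell
#!/bin/sh
# Hypothetical helpers (not part of ZENBU).

# Print the URL inside the last <seed> element of the given config file.
last_seed() {
    sed -n 's/.*<seed>[[:space:]]*\([^<[:space:]]*\)[[:space:]]*<\/seed>.*/\1/p' "$1" | tail -n 1
}

# Succeed only if the last seed is a remote zenbu:// server.
check_seed_order() {
    case "$(last_seed "$1")" in
        zenbu://*) echo "ok: remote ZENBU seed is last" ;;
        *) echo "warning: last seed is not a zenbu:// remote seed" >&2; return 1 ;;
    esac
}
```

Calling check_seed_order with the path to your configuration file prints a confirmation, or a warning with a non-zero exit status when a local database is listed after the remote server.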

Bulk command-line upload of datafiles

In addition to the web interfaces for loading data, we provide (as of version 2.8.2) a command-line tool for bulk or scripted uploading of data. The program must be compiled from the source code (https://sourceforge.net/projects/zenbu/?source=directory). After installation on your servers, the program (zenbu_upload) can be found at /zenbu/bin/zenbu_upload, /usr/local/bin/zenbu_upload or /zenbu/src/ZENBU_2.x.x/c++/tools/. zenbu_upload is designed to be compiled on Linux computation servers to enable bulk loading to a remote ZENBU server.

To enable zenbu_upload, each user needs to create a directory ~/.zenbu/ and a file ~/.zenbu/id_hmac. This file should contain one tab-separated line: the user's email address and their hmac key for the ZENBU server they wish to upload to. The user's hmac key is visible on the user's profile page on the ZENBU server, for example https://fantom.gsc.riken.jp/zenbu/user. If no hmac key is visible, click the [generate random hmac key] button.

Your ~/.zenbu/id_hmac file content should look like this (with your email and your hmac key)

john.smith@gmail.com     b38046a646db4ed492b88dc18ba8a78f273febbb8d363c4f98d140b3d819665d6912cd7bfd6ba87defffa147e23fedd86d4e1e35e9ccd5216123f618b70faf2
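
The file can also be written from the shell, which avoids editors that replace the tab with spaces. The following is a minimal sketch; the function name write_id_hmac is hypothetical, and the chmod step is a precaution assumed here, not a requirement stated by ZENBU.

```shell
#!/bin/sh
# Hypothetical helper (not part of ZENBU): write ~/.zenbu/id_hmac from an
# email address and an hmac key.
write_id_hmac() {
    email="$1"
    key="$2"
    mkdir -p "$HOME/.zenbu"
    # One line, tab separated: email<TAB>hmac key
    printf '%s\t%s\n' "$email" "$key" > "$HOME/.zenbu/id_hmac"
    # Assumption: the key acts like a credential, so restrict the file
    # to its owner.
    chmod 600 "$HOME/.zenbu/id_hmac"
}

# Usage, with the email and key copied from your ZENBU profile page:
#   write_id_hmac "your_email_address" "your_hmac_key"
```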

The hmac key also allows users to synchronize their private and collaboration data among remote zenbu servers. Just copy your hmac key to all of your user accounts (with the same email address) and the remote zenbu servers will synchronize the user accounts.

An example of bulk loading all the BAM files in a directory would be

 for FILE in /quality_control/delivery/2710/Mapping-version21001/RDhi*/*/genome_mapped/*bam; do echo "$FILE"; zenbu_upload -url "https://fantom.gsc.riken.jp/zenbu" -collab_uuid DktADfFGRUj5GxDI -assembly hg38 -platform RNAseq -file "$FILE"; done
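
For larger batches it can be useful to keep going after a failed upload and report the failures at the end. The following is a minimal sketch that uses only the zenbu_upload options shown above; the function name bulk_upload and the ZENBU_UPLOAD variable are hypothetical, and it assumes that zenbu_upload exits with a non-zero status when an upload fails, which should be verified on your installation.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of ZENBU). ZENBU_UPLOAD may be set to the
# full path of the binary, e.g. /zenbu/bin/zenbu_upload.
ZENBU_UPLOAD="${ZENBU_UPLOAD:-zenbu_upload}"

# bulk_upload URL COLLAB_UUID ASSEMBLY PLATFORM FILE...
bulk_upload() {
    url="$1"; collab="$2"; assembly="$3"; platform="$4"
    shift 4
    failed=0
    for file in "$@"; do
        # Skip patterns that matched no files
        [ -f "$file" ] || continue
        echo "uploading $file"
        # Assumption: a non-zero exit status means the upload failed
        if ! "$ZENBU_UPLOAD" -url "$url" -collab_uuid "$collab" \
                -assembly "$assembly" -platform "$platform" -file "$file"; then
            echo "FAILED: $file" >&2
            failed=$((failed + 1))
        fi
    done
    echo "$failed file(s) failed"
    [ "$failed" -eq 0 ]
}

# Usage:
#   bulk_upload "https://fantom.gsc.riken.jp/zenbu" DktADfFGRUj5GxDI hg38 RNAseq /path/to/*.bam
```

Failed file names go to standard error, so they can be redirected to a file and retried later.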

To see your available collaborations

zenbu_upload -url "https://fantom.gsc.riken.jp/zenbu" -collabs

To search your previous uploads for data files

zenbu_upload -url "https://fantom.gsc.riken.jp/zenbu" -list -filter "THP1 RNAseq"