Introduction
ParaBWT is a new and practical parallelized Burrows-Wheeler transform (BWT) and suffix array construction algorithm for big genome data, which has a linear space complexity with a small constant factor. The performance of ParaBWT has been evaluated using two sequences generated from two human genome assemblies: the Ensembl Homo sapiens assembly and the human reference genome, on a workstation with two Intel Xeon X5650 hex-core CPUs and 96 GB RAM, running the Ubuntu 12.04 LTS operating system. Our performance comparison to FMDindex and Bwt-disk reveals that on 12 CPU cores, ParaBWT runs up to 2.2 times faster than FMD-index, reducing the runtime from 26.56 hours to 12.34 hours for a sequence of about 60 billion nucleotides, and up to 99.0 times faster than Bwt-disk.
Downloads
- latest release (v1.0.8)
Only executable binaries are available for download at present. More details about the changes in this version are available at ChangeLog.
Citation
- Yongchao Liu, Thomas Hankeln and Bertil Schmidt: "Parallel and space-efficient construction of Burrows-Wheeler transform and suffix array for big genome data". IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2016, 13(3): 592-598
Other related papers
- Yongchao Liu, Bertil Schmidt, and Douglas L. Maskell: " CUSHAW: a CUDA compatible short read aligner to large genomes based on the Burrows-Wheeler transform". Bioinformatics, 2012, 28(14): 1830-1837
- Yongchao Liu and Bertil Schmidt: "Long read alignment based on maximal exact match seeds". Bioinformatics, 2012, 28(18): i318-324
- Yongchao Liu and Bertil Schmidt: "CUSHAW2-GPU: empowering faster gapped short-read alignment using GPU computing". IEEE Design & Test, 2014, 31(1): 31-39
- Yongchao Liu, Bernt Popp, and Bertil Schmidt: "CUSHAW3: sensitive and accurate base-space and color-space short-read alignment with hybrid seeding." PLOS ONE, 2014, 9(1):e86869
Parameters
- -o < str> (prefix of output file names)
- -t < int> (overall number of threads, default = automatically detected)
- -p < int> ((maximum number of threads distributed for coprocessing, default = 6)
- -d < float> (proportion of threads given to coprocessing, default = 0.5)
This option has been bsolete and has already been removed from version 1.0.8.
- -w < int> (genome partition size, default = 20971520)
- -s < int> (genome strands used for the BWT and FM-index construction, default = 1)
1: forward strand
2: reverse strand
3: both forward and reverse strands (i.e 2 times number of bases) - -r < int> (force to overwrite the file, default = 1)
- -a < int> (build the suffix array using BWT, default = 0)
Installation and Usage
Prerequisites
- Intel Threading Building Blocks (TBB) library, which is available at here.
Download and compiling
- Download the binary tarball and uncompress using the "tar" command.
Typical Usage
This algorithm accepts FASTA/FASTQ format and the typical usage is like "ParaBWT [options] infile.fa [infile1.fa]".
- Get the command line options
type command line: parabwt.
- Build the Burrows-Wheeler transform and suffix array
type command line: parabwt -o parabwt_test -s 3 -a 1 -t 4 genome.fa
Change Log
- June 16, 2014 (v1.0.8)
- Removed the option "-d".
- Adding a new option "-p".
- Set a maximum threshold on the number of slave threads and further optimized the scheduling of these slaves.
- Re-written and optimized some core code in order to improve the speed.
- December 03, 2013 (v1.0.6)
- First release of our algorithm
- The corresponding paper is submitted.
Contact
If any questions or improvements, please feel free to contact Liu, Yongchao.