ParaBWT - Parallel Burrows Wheeler Transform Construction

Introduction

ParaBWT is a new and practical parallelized Burrows-Wheeler transform (BWT) and suffix array construction algorithm for big genome data, which has a linear space complexity with a small constant factor. The performance of ParaBWT has been evaluated using two sequences generated from two human genome assemblies: the Ensembl Homo sapiens assembly and the human reference genome, on a workstation with two Intel Xeon X5650 hex-core CPUs and 96 GB RAM, running the Ubuntu 12.04 LTS operating system. Our performance comparison to FMDindex and Bwt-disk reveals that on 12 CPU cores, ParaBWT runs up to 2.2 times faster than FMD-index, reducing the runtime from 26.56 hours to 12.34 hours for a sequence of about 60 billion nucleotides, and up to 99.0 times faster than Bwt-disk.

Downloads

latest release (v1.0.8)
Only executable binaries are available for download at present. More details about the changes in this version are available at ChangeLog.

Citation

Yongchao Liu, Thomas Hankeln and Bertil Schmidt: "Parallel and space-efficient construction of Burrows-Wheeler transform and suffix array for big genome data". IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2016, 13(3): 592-598

Other related papers

Yongchao Liu, Bertil Schmidt, and Douglas L. Maskell: " CUSHAW: a CUDA compatible short read aligner to large genomes based on the Burrows-Wheeler transform". Bioinformatics, 2012, 28(14): 1830-1837
Yongchao Liu and Bertil Schmidt: "Long read alignment based on maximal exact match seeds". Bioinformatics, 2012, 28(18): i318-324
Yongchao Liu and Bertil Schmidt: "CUSHAW2-GPU: empowering faster gapped short-read alignment using GPU computing". IEEE Design & Test, 2014, 31(1): 31-39
Yongchao Liu, Bernt Popp, and Bertil Schmidt: "CUSHAW3: sensitive and accurate base-space and color-space short-read alignment with hybrid seeding." PLOS ONE, 2014, 9(1):e86869

Parameters

-o < str> (prefix of output file names)
-t < int> (overall number of threads, default = automatically detected)
-p < int> ((maximum number of threads distributed for coprocessing, default = 6)
-d < float> (proportion of threads given to coprocessing, default = 0.5)
This option has been bsolete and has already been removed from version 1.0.8.
-w < int> (genome partition size, default = 20971520)
-s < int> (genome strands used for the BWT and FM-index construction, default = 1)
1: forward strand
2: reverse strand
3: both forward and reverse strands (i.e 2 times number of bases)
-r < int> (force to overwrite the file, default = 1)
-a < int> (build the suffix array using BWT, default = 0)

Installation and Usage

Prerequisites

Intel Threading Building Blocks (TBB) library, which is available at here.

Download and compiling

Download the binary tarball and uncompress using the "tar" command.

Typical Usage

This algorithm accepts FASTA/FASTQ format and the typical usage is like "ParaBWT [options] infile.fa [infile1.fa]".

Get the command line options
type command line: parabwt.
Build the Burrows-Wheeler transform and suffix array
type command line: parabwt -o parabwt_test -s 3 -a 1 -t 4 genome.fa

Change Log

June 16, 2014 (v1.0.8)
1. Removed the option "-d".
2. Adding a new option "-p".
3. Set a maximum threshold on the number of slave threads and further optimized the scheduling of these slaves.
4. Re-written and optimized some core code in order to improve the speed.
December 03, 2013 (v1.0.6)
1. First release of our algorithm
2. The corresponding paper is submitted.

Contact

If any questions or improvements, please feel free to contact Liu, Yongchao.

ParaBWT - Parallel Burrows Wheeler Transform Construction

Site Map

Project Links

List of My Software

Big Data

Machine Learning

Scientific Computing

Sequence Alignment

Motif Discovery

NGS Read Alignment

NGS Read Error Correction

NGS de novo Assembly

NGS SNV calling

NGS Metagenomics

Inspire Innovation