BLAST WASM

Why do bioinformaticians make great DJs? Because their BLAST results are always a hit!

BLAST is a family of tools used to compare biological sequences, such as protein or DNA sequences, against a database of sequences. It is used in bioinformatics applications, including genome annotation, gene discovery, and phylogenetic analysis of DNA and protein sequences.

The BLAST source code is written in C++ and is part of the NCBI C++ Toolkit—a collection of open-source libraries and applications for bioinformatics.

I've compiled the BLAST programs to WebAssembly using Emscripten so that each program can run entirely in the browser. This means you can run them locally on your computer without installing anything.

You can try it out in the shell below by running the makeblastdb, blastdbcmd, blastp, blastn, blastx, tblastn, and tblastx commands.

Demo

WASM shell

The process of compiling BLAST to WebAssembly

The NCBI C++ Toolkit codebase is large and complex, so it took me a while to figure out how to make it work with Emscripten.

First, because I had to experiment with different compilation options, I needed a way to quickly build only the BLAST binaries without building the entire NCBI C++ Toolkit. The build process is thoroughly documented in The NCBI C++ Toolkit Book, which mentions multiple ways to limit what is built. For me, the easiest way was to use the flat makefile and then selectively target only the BLAST executables for compilation.

Second, as with many other large C++ projects, the NCBI C++ Toolkit build bootstraps itself: it first builds the project_tree_builder and DATATOOL build tools and then runs them later in the same build. This fails when compiling with Emscripten because the Emscripten-compiled project_tree_builder and DATATOOL can't be run directly. My solution was to run two separate builds: a native build that produces the flat makefile and the DATATOOL executable, and an Emscripten build that picks up both from the native build.

Third, there were some compilation errors due to incompatibilities with Emscripten that I had to fix by modifying the source code.

Fourth, I had to enable exception handling in WASM because the BLAST executables use exceptions as part of normal program flow. For example, they throw exceptions to report input errors, which are then caught and printed to the console.
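
To illustrate the pattern, here is a minimal C++ sketch of a tool that reports bad input by throwing and catching an exception. It is not BLAST code: validate_dna and run_tool are hypothetical names made up for this example. Emscripten disables exception catching by default, so code like this needs a flag such as -fexceptions or -fwasm-exceptions at both compile and link time. Which of the two fits a given build is a separate decision that this sketch doesn't address.

```cpp
#include <cassert>
#include <exception>
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical validator (not a BLAST function): throws when the input
// is not a plain DNA string made of A, C, G, T, or N.
void validate_dna(const std::string& seq) {
    if (seq.empty()) {
        throw std::invalid_argument("empty sequence");
    }
    const std::string alphabet = "ACGTN";
    for (char c : seq) {
        if (alphabet.find(c) == std::string::npos) {
            throw std::invalid_argument(
                std::string("unexpected character '") + c + "'");
        }
    }
}

// Hypothetical entry point: returns a process-style exit code. The error is
// reported by catching the exception, so if catching is compiled out, the
// program aborts instead of printing a message.
int run_tool(const std::string& seq) {
    try {
        validate_dna(seq);
        return 0;
    } catch (const std::exception& e) {
        std::cerr << "Error: " << e.what() << '\n';
        return 1;
    }
}
```

Calling run_tool("ACGT") returns 0, while run_tool("AC-GT") prints an error message to standard error and returns 1.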

Finally, I noticed during testing that the WASM version of makeblastdb failed with an error when run in the browser (in the Emscripten runtime environment) with the -parse_seqids option, whereas the native version worked fine. It took a lot of digging through the source code and debugging both the native and WASM executables to find the cause. In the process, I learned a lot about BLAST's use of LMDB, LMDB itself (and its creator Howard Chu, who is an unsung hero of open source), the Emscripten runtime environment, and debugging WASM and C++ in general. The problem had to do with how Emscripten handled memory-mapped files.
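
For readers unfamiliar with memory-mapped files: LMDB keeps its database in a memory-mapped file, so it relies on the platform's mmap behaving as POSIX describes. The C++ sketch below is neither BLAST nor LMDB code, and it doesn't reproduce the makeblastdb error. It only shows the general access pattern, where data written through a shared mapping is expected to be visible to ordinary file reads. The helper name and the file path in the usage note are hypothetical.

```cpp
#include <cassert>
#include <cstring>
#include <string>

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical helper: writes `text` into the file at `path` through a shared
// memory mapping, then reads the file back with a regular pread() call.
// Returns the bytes read, or an empty string on any failure.
std::string write_via_mmap_and_read_back(const char* path,
                                         const std::string& text) {
    assert(!text.empty());  // mmap rejects zero-length mappings

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) return "";

    // Grow the file to the size of the mapping before mapping it.
    if (ftruncate(fd, static_cast<off_t>(text.size())) != 0) {
        close(fd);
        return "";
    }

    void* addr = mmap(nullptr, text.size(), PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) {
        close(fd);
        return "";
    }

    // Write through the mapping, then ask for the pages to be flushed.
    std::memcpy(addr, text.data(), text.size());
    bool synced = (msync(addr, text.size(), MS_SYNC) == 0);
    munmap(addr, text.size());
    if (!synced) {
        close(fd);
        return "";
    }

    // Read the same bytes back through the file descriptor.
    std::string result(text.size(), '\0');
    ssize_t n = pread(fd, &result[0], result.size(), 0);
    close(fd);
    if (n < 0) return "";
    result.resize(static_cast<std::size_t>(n));
    return result;
}
```

On a native POSIX system, write_via_mmap_and_read_back("/tmp/mmap_sketch.bin", "ACGT") is expected to return "ACGT". Comparing the result of a check like this between a native build and an Emscripten build is one way to probe whether the two environments treat mappings the same way.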

Made by Anton Vasetenkov.
