Scalable genomic alignment with Progressive Cactus

Multiple genome alignment is an important method in comparative genomics and evolutionary studies. It attempts to map every region in each input genome to the corresponding segments in every other genome. Such alignments help researchers understand the relationships between those segments and unlock key insights into genome evolution.

With the growing number of published genomic sequences, many studies seek to analyse increasingly large sets of complex genomes. Multiple genome alignment tools therefore need to scale to ever-growing sets of input genomes.

An important class of multiple genome alignment tools is the reference-free aligners (also known as non-reference-based aligners), which do not require a reference sequence to construct the alignment. One such tool, Cactus, provides highly accurate alignments and has been shown to outperform its peers.

The original implementation of Cactus dates back to 2012, and it has since been used in many genomic projects and studies. The runtime of Cactus, however, increases quadratically with the total number of input bases, which means that it cannot, for example, be used to align more than 10 large vertebrate genomes.

Progressive Cactus is a new extension of the Cactus aligner designed to perform well on large sets of input genomes (hundreds to thousands of large genomes). Unlike its predecessor, Progressive Cactus implements a linear-time "progressive" algorithm, which recursively breaks the multiple alignment problem down into smaller subproblems. The resulting sub-alignments are then aligned back together to form the final alignment.
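To make the recursive decomposition more concrete, here is a minimal sketch in Python. It assumes the subproblems are laid out along a guide tree whose leaves are the input genomes, and it replaces the alignment step itself with a placeholder. The names `Node`, `align_children` and `progressive_schedule`, as well as the example tree, are hypothetical and are not part of the Progressive Cactus code base.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node in a hypothetical guide tree; leaves are input genomes."""
    name: str
    children: list["Node"] = field(default_factory=list)


def align_children(parent: Node) -> str:
    """Placeholder for one subproblem: align the children of `parent`.

    A real aligner would compute an alignment here; this sketch only
    returns a label describing which inputs are being combined.
    """
    inputs = ", ".join(child.name for child in parent.children)
    return f"{parent.name} <- align({inputs})"


def progressive_schedule(root: Node) -> list[str]:
    """List the subproblems in post-order, so children come before parents."""
    steps: list[str] = []

    def visit(node: Node) -> None:
        if not node.children:
            return  # a leaf is an input genome: nothing to align
        for child in node.children:
            visit(child)
        steps.append(align_children(node))

    visit(root)
    return steps


# Hypothetical guide tree: ((human, chimp)anc1, (mouse, rat)anc2)root
tree = Node("root", [
    Node("anc1", [Node("human"), Node("chimp")]),
    Node("anc2", [Node("mouse"), Node("rat")]),
])

for step in progressive_schedule(tree):
    print(step)
```

In this sketch every internal node of the tree is one small subproblem. A binary tree with n leaves has n - 1 internal nodes, so the number of alignment steps grows linearly with the number of genomes, and each step involves only a few inputs.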

Bottom line

Genome alignment is the sine qua non of comparative genomics and evolutionary studies. As the scale of such studies increases, genome alignment tools must continually improve to cope with the ever-growing complexity of multiple genome alignment problems.

By implementing the progressive alignment strategy, Progressive Cactus becomes suitable for aligning hundreds to thousands of large input genomes and provides the opportunity to uncover new insights into genome evolution and natural history.

Made by Anton Vasetenkov.
