Scalable genomic alignment with Progressive Cactus

How progressive alignment makes it possible to efficiently align hundreds to thousands of large genomes.

Published on by Anton Vasetenkov

Multiple genome alignment, an important method in comparative genomics and evolutionary studies, attempts to map every region of each input genome to the corresponding segments in every other genome. Such alignments help us understand the relationships between those segments and unlock key insights into genome evolution.

With the growing number of published genomic sequences, many studies seek to analyse increasingly large sets of complex genomes. Multiple genome alignment tools therefore need to scale to handle ever-growing sets of input genomes.

An important class of multiple genome alignment tools is reference-free aligners, also known as non-reference-based aligners, which do not require a reference sequence to construct the alignment. One such tool, Cactus, provides highly accurate alignment results and has been shown to outperform its peers.

The original implementation of Cactus dates back to 2012, and it has since been used in many genomic projects and studies. The runtime of Cactus, however, increases quadratically with the total number of input bases, which means that it cannot, for example, be used to align more than 10 large vertebrate genomes.
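To get a feel for what quadratic scaling means in practice, here is a rough back-of-the-envelope sketch. It is a toy cost model, not a measurement of Cactus: the function name `quadratic_cost` and the default of roughly 3 billion bases per genome are assumptions chosen purely for illustration.

```python
def quadratic_cost(num_genomes: int, bases_per_genome: float = 3e9) -> float:
    """Relative work under a toy model where cost grows with the
    square of the total number of input bases (assumed, not measured)."""
    total_bases = num_genomes * bases_per_genome
    return total_bases ** 2


# Compare larger inputs against a 10-genome baseline.
baseline = quadratic_cost(10)
for n in (10, 20, 100, 1000):
    ratio = quadratic_cost(n) / baseline
    print(f"{n:>5} genomes: about {ratio:,.0f}x the work of 10 genomes")
```

Under this model, doubling the number of genomes quadruples the work, and a tenfold increase in genomes costs a hundred times more, which is why a quadratic aligner stalls long before it reaches hundreds of genomes.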

Progressive Cactus is a newer extension of the Cactus aligner designed to perform well on large sets of input genomes (hundreds to thousands of large genomes). Unlike its predecessor, Progressive Cactus implements a linear-time "progressive" algorithm: it recursively breaks the multiple alignment problem down into smaller subproblems, then aligns the resulting sub-alignments back together to form the final alignment.
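The recursive decomposition can be sketched in a few lines. This is a minimal illustration and not Progressive Cactus's actual code: it assumes the subproblems follow a guide tree (a phylogeny of the input genomes), as is common for progressive aligners. The `plan_subproblems` helper, the `Anc…` labels, and the species in the example tree are all hypothetical.

```python
from typing import List, Tuple, Union

# A guide tree is either a leaf (a genome name) or a tuple of subtrees.
Tree = Union[str, Tuple["Tree", ...]]
Subproblem = Tuple[str, List[str]]


def plan_subproblems(tree: Tree, plan: List[Subproblem]) -> str:
    """Post-order walk: every child is resolved before its parent.

    Appends one (parent_label, child_labels) subproblem to `plan` per
    internal node and returns the label that stands for this subtree.
    """
    if isinstance(tree, str):
        return tree  # a leaf is an input genome, so there is nothing to align
    child_labels = [plan_subproblems(child, plan) for child in tree]
    parent_label = f"Anc{len(plan)}"  # hypothetical name for the merged sub-alignment
    plan.append((parent_label, child_labels))
    return parent_label


# Illustrative guide tree with five input genomes.
guide_tree: Tree = ((("human", "chimp"), "mouse"), ("chicken", "zebra_finch"))
plan: List[Subproblem] = []
root = plan_subproblems(guide_tree, plan)

for parent, children in plan:
    print(f"align {children} -> {parent}")
```

Each step only ever aligns a handful of genomes or sub-alignments, and a binary tree with n leaves has n - 1 internal nodes, so the number of subproblems in this sketch grows linearly with the number of input genomes.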

Bottom line

Genome alignment is the sine qua non of comparative genomics and evolutionary studies. As the scale of such studies increases, genome alignment tools must continually improve to cope with the ever-growing complexity of multiple genome alignment problems.

By implementing the progressive alignment strategy, Progressive Cactus becomes suitable for aligning hundreds to thousands of large input genomes, opening the way to new insights into genome evolution and natural history.
