BWT construction and search at the terabase scale.

Bioinformatics (Oxford, England)
Authors
Abstract

MOTIVATION: The Burrows-Wheeler Transform (BWT) is a common component in full-text indices. Initially developed for data compression, it is particularly powerful for encoding redundant sequences such as pangenome data. However, BWT construction is resource intensive and hard to parallelize, and many methods for querying large full-text indices only report exact matches or their simple extensions. These limitations have hampered the biological applications of full-text indices.

RESULTS: We developed ropebwt3 for efficient BWT construction and query. Ropebwt3 indexed 320 assembled human genomes in 65 hours and indexed 7.3 terabases of commonly studied bacterial assemblies in 26 days. This was achieved using up to 170 gigabytes of memory at the peak without working disk space. Ropebwt3 can find maximal exact matches and inexact alignments under affine-gap penalties, and can retrieve similar local haplotypes matching a query sequence. It demonstrates the feasibility of full-text indexing at the terabase scale.

AVAILABILITY AND IMPLEMENTATION:
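
For readers unfamiliar with how a BWT supports exact-match queries, the following is a minimal toy sketch in Python. It is not ropebwt3's algorithm: it builds the BWT naively by sorting all rotations (quadratic memory, so only suitable for tiny inputs) and stores full occurrence tables, whereas terabase-scale tools rely on far more compact and incremental data structures. The function names (`bwt`, `build_fm_index`, `count_exact`) and the `$` sentinel are assumptions chosen for this illustration.

```python
from collections import Counter


def bwt(text: str, sentinel: str = "$") -> str:
    """Naive BWT: sort all rotations of text+sentinel, take the last column.

    Illustration only; assumes the sentinel sorts before every symbol in text.
    """
    assert sentinel not in text
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def build_fm_index(bwt_str: str):
    """Return (C, occ) tables for backward search over bwt_str.

    C[c]      = number of symbols in the text strictly smaller than c
    occ[c][i] = number of occurrences of c in bwt_str[:i]
    """
    counts = Counter(bwt_str)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    occ = {c: [0] * (len(bwt_str) + 1) for c in counts}
    for i, ch in enumerate(bwt_str):
        for c in counts:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return C, occ


def count_exact(pattern: str, C, occ, n: int) -> int:
    """Count occurrences of pattern via backward search on the BWT."""
    lo, hi = 0, n  # half-open interval of sorted rotations
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    text = "GATTACAGATTACA"  # made-up example sequence
    b = bwt(text)
    C, occ = build_fm_index(b)
    print(b, count_exact("ATTA", C, occ, len(b)))
```

Each step of the backward search narrows an interval of sorted rotations by one pattern symbol, so the query cost depends on the pattern length rather than on the size of the indexed text. Finding maximal exact matches or affine-gap alignments, as described in the abstract, requires additional machinery that this sketch does not attempt.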

Year of Publication
2024
Journal
Bioinformatics (Oxford, England)
Date Published
11/2024
ISSN
1367-4811
DOI
10.1093/bioinformatics/btae717
PubMed ID
39607778