TY - GEN
T1 - Computing the Original eBWT Faster, Simpler, and with Less Memory
AU - Sciortino, Marinella
AU - Boucher, Christina
AU - Lipták, Zsuzsanna
AU - Rossi, Massimiliano
AU - Cenzato, Davide
PY - 2021
Y1 - 2021
N2 - Mantaci et al. [TCS 2007] defined the eBWT to extend the definition of the BWT to a collection of strings. However, since this introduction, it has been used more generally to describe any BWT of a collection of strings, and the fundamental property of the original definition (i.e., the independence from the input order) is frequently disregarded. In this paper, we propose a simple linear-time algorithm for the construction of the original eBWT, which does not require the preprocessing of Bannai et al. [CPM 2021]. As a byproduct, we obtain the first linear-time algorithm for computing the BWT of a single string that uses neither an end-of-string symbol nor Lyndon rotations. We combine our new eBWT construction with a variation of prefix-free parsing to allow for scalable construction of the eBWT. We evaluate our algorithm (pfpebwt) on sets of human chromosomes 19, Salmonella, and SARS-CoV2 genomes, and demonstrate that it is the fastest method for all collections, with a maximum speedup of 7.6 × on the second best method. The peak memory is at most 2 × larger than the second best method. Comparing with methods that are also, as our algorithm, able to report suffix array samples, we obtain a 57.1 × improvement in peak memory. The source code is publicly available at https://github.com/davidecenzato/PFP-eBWT.
AB - Mantaci et al. [TCS 2007] defined the eBWT to extend the definition of the BWT to a collection of strings. However, since this introduction, it has been used more generally to describe any BWT of a collection of strings, and the fundamental property of the original definition (i.e., the independence from the input order) is frequently disregarded. In this paper, we propose a simple linear-time algorithm for the construction of the original eBWT, which does not require the preprocessing of Bannai et al. [CPM 2021]. As a byproduct, we obtain the first linear-time algorithm for computing the BWT of a single string that uses neither an end-of-string symbol nor Lyndon rotations. We combine our new eBWT construction with a variation of prefix-free parsing to allow for scalable construction of the eBWT. We evaluate our algorithm (pfpebwt) on sets of human chromosomes 19, Salmonella, and SARS-CoV2 genomes, and demonstrate that it is the fastest method for all collections, with a maximum speedup of 7.6 × on the second best method. The peak memory is at most 2 × larger than the second best method. Comparing with methods that are also, as our algorithm, able to report suffix array samples, we obtain a 57.1 × improvement in peak memory. The source code is publicly available at https://github.com/davidecenzato/PFP-eBWT.
UR - http://hdl.handle.net/10447/526395
M3 - Conference contribution
SN - 978-3-030-86691-4; 978-3-030-86692-1
T3 - LECTURE NOTES IN COMPUTER SCIENCE
SP - 129
EP - 142
BT - String Processing and Information Retrieval -
28th International Symposium, SPIRE 2021, Lille, France, October 4–6, 2021, Proceedings
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
ER -