Computing the Original eBWT Faster, Simpler, and with Less Memory

Marinella Sciortino, Christina Boucher, Zsuzsanna Lipták, Massimiliano Rossi, Davide Cenzato

Risultato della ricerca: Conference contribution

Abstract

Mantaci et al. [TCS 2007] defined the eBWT to extend the definition of the BWT to a collection of strings. However, since this introduction, it has been used more generally to describe any BWT of a collection of strings, and the fundamental property of the original definition (i.e., the independence from the input order) is frequently disregarded. In this paper, we propose a simple linear-time algorithm for the construction of the original eBWT, which does not require the preprocessing of Bannai et al. [CPM 2021]. As a byproduct, we obtain the first linear-time algorithm for computing the BWT of a single string that uses neither an end-of-string symbol nor Lyndon rotations. We combine our new eBWT construction with a variation of prefix-free parsing to allow for scalable construction of the eBWT. We evaluate our algorithm (pfpebwt) on sets of human chromosomes 19, Salmonella, and SARS-CoV2 genomes, and demonstrate that it is the fastest method for all collections, with a maximum speedup of 7.6 × on the second best method. The peak memory is at most 2 × larger than the second best method. Comparing with methods that are also, as our algorithm, able to report suffix array samples, we obtain a 57.1 × improvement in peak memory. The source code is publicly available at https://github.com/davidecenzato/PFP-eBWT.
Lingua originaleEnglish
Titolo della pubblicazione ospiteString Processing and Information Retrieval - 28th International Symposium, SPIRE 2021, Lille, France, October 4–6, 2021, Proceedings Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Pagine129-142
Numero di pagine14
Stato di pubblicazionePublished - 2021

Serie di pubblicazioni

NomeLECTURE NOTES IN COMPUTER SCIENCE

All Science Journal Classification (ASJC) codes

  • ???subjectarea.asjc.2600.2614???
  • ???subjectarea.asjc.1700.1700???

Cita questo