Lightweight algorithms for constructing and inverting the BWT of string collections

Giovanna Rosone, Anthony J. Cox, Markus J. Bauer

Risultato della ricerca: Articlepeer review

100 Citazioni (Scopus)

Abstract

Recent progress in the field of \{DNA\} sequencing motivates us to consider the problem of computing the Burrows–Wheeler transform (BWT) of a collection of strings. A human genome sequencing experiment might yield a billion or more sequences, each 100 characters in length. Such a dataset can now be generated in just a few days on a single sequencing machine. Many algorithms and data structures for compression and indexing of text have the \{BWT\} at their heart, and it would be of great interest to explore their applications to sequence collections such as these. However, computing the \{BWT\} for 100 billion characters or more of data remains a computational challenge. In this work we address this obstacle by presenting a methodology for computing the \{BWT\} of a string collection in a lightweight fashion. A first implementation of our algorithm needs O ( m log m ) bits of memory to process m strings, while a second variant makes additional use of external memory to achieve \{RAM\} usage that is constant with respect to m and negligible in size for a small alphabet such as DNA. The algorithms work on any number of strings and any size. We evaluate our algorithms on collections of up to 1 billion strings and compare their performance to other approaches on smaller datasets. We take further steps toward making the \{BWT\} a practical tool for processing string collections on this scale. First, we give two algorithms for recovering the strings in a collection from its BWT. Second, we show that if sequences are added to or removed from the collection, then the \{BWT\} of the original collection can be efficiently updated to obtain the \{BWT\} of the revised collection.
Lingua originaleEnglish
pagine (da-a)134-148
Numero di pagine15
RivistaTheoretical Computer Science
Volume483
Stato di pubblicazionePublished - 2013

All Science Journal Classification (ASJC) codes

  • ???subjectarea.asjc.2600.2614???
  • ???subjectarea.asjc.1700.1700???

Fingerprint

Entra nei temi di ricerca di 'Lightweight algorithms for constructing and inverting the BWT of string collections'. Insieme formano una fingerprint unica.

Cita questo