Lightweight LCP Construction for Next-Generation Sequencing Datasets

Marinella Sciortino, Anthony J. Cox, Markus J. Bauer

Research output: Chapter

27 Citations (Scopus)

Abstract

The advent of "next-generation" DNA sequencing (NGS) technologies has meant that collections of hundreds of millions of DNA sequences are now commonplace in bioinformatics. Knowing the longest common prefix array (LCP) of such a collection would facilitate the rapid computation of maximal exact matches, shortest unique substrings and shortest absent words. CPU-efficient algorithms for computing the LCP of a string have been described in the literature, but they require large data structures to be held in RAM, which makes such methods infeasible for NGS datasets. In this paper we propose the first lightweight method that simultaneously computes, via sequential scans, the LCP and BWT of very large collections of sequences. Computational results on collections as large as 800 million 100-mers demonstrate that our algorithm scales to the vast sequence collections encountered in human whole genome sequencing experiments.
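
To make the two arrays named in the abstract concrete, the following is a minimal in-memory sketch of what the BWT and LCP of a small collection look like. It is not the lightweight, scan-based method the chapter proposes: it sorts all suffixes in RAM, which is exactly what becomes infeasible at NGS scale. The function name `bwt_and_lcp` is hypothetical, and the sketch assumes that each sequence ends with its own distinct end marker (written `$`, ordered by sequence index) and that every input character sorts after `$`, as A, C, G and T do.

```python
def bwt_and_lcp(collection):
    """Naive BWT and LCP of a collection of strings (illustration only).

    Assumption: each string has its own end marker, smaller than any
    letter, with markers ordered by string index. Distinct markers never
    match each other, so they do not count towards LCP values.
    """
    # Collect every suffix of every string, including the bare end marker.
    suffixes = []
    for i, s in enumerate(collection):
        for j in range(len(s) + 1):
            # Appending "$" makes a shorter suffix sort before a longer one
            # that extends it; the index i breaks ties between markers.
            suffixes.append((s[j:] + "$", i, j))
    suffixes.sort()

    # BWT: the character preceding each suffix in its own string,
    # or the end marker when the suffix is the whole string.
    bwt = "".join(collection[i][j - 1] if j > 0 else "$"
                  for _, i, j in suffixes)

    # LCP: length of the common prefix of each suffix with its predecessor
    # in sorted order, ignoring the end markers.
    lcp = [0]
    for (prev, _, _), (cur, _, _) in zip(suffixes, suffixes[1:]):
        a, b = prev[:-1], cur[:-1]
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        lcp.append(n)
    return bwt, lcp


bwt, lcp = bwt_and_lcp(["ACG", "ACT"])
print(bwt)  # GT$$AACC
print(lcp)  # [0, 0, 0, 2, 0, 1, 0, 0]
```

In the output, the LCP value 2 records that the suffixes `ACG` and `ACT` share the prefix `AC`, and the value 1 records that `CG` and `CT` share `C`. Shared prefixes of this kind are the information that computations such as maximal exact matches draw on.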
Original language: English
Title of host publication: ALGORITHMS IN BIOINFORMATICS
Pages: 326-337
Number of pages: 12
Publication status: Published - 2012

Publication series

Name: LECTURE NOTES IN COMPUTER SCIENCE

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Computer Science(all)
