TY - CONF
T1 - The colored longest common prefix array computed via sequential scans
AU - Sciortino, Marinella
AU - Rosone, Giovanna
AU - Verzotto, Davide
AU - Rosone, Giovanna
PY - 2018
Y1 - 2018
N2 - Due to the increased availability of large datasets of biological sequences, tools for sequence comparison are now relying on efficient alignment-free approaches to a greater extent. Most alignment-free approaches require the computation of statistics when comparing sequences, even if such computations may not scale well in in internal memory when very large collections of long sequences are considered. In this paper, we present a new conceptual data structure, the colored longest common prefix array (cLCP), that allows to efficiently tackle several problems with an alignment-free approach. In fact, we show that such a data structure can be computed via sequential scans in semi-external memory. By using cLCP, we propose an efficient lightweight strategy to solve the multi-string Average Common Substring (ACS) problem, that consists in the pairwise comparison of a single string against a collection of m strings simultaneously, in order to obtain m ACS induced distances. Experimental results confirm the high practical efficiency of our approach.
AB - Due to the increased availability of large datasets of biological sequences, tools for sequence comparison are now relying on efficient alignment-free approaches to a greater extent. Most alignment-free approaches require the computation of statistics when comparing sequences, even if such computations may not scale well in in internal memory when very large collections of long sequences are considered. In this paper, we present a new conceptual data structure, the colored longest common prefix array (cLCP), that allows to efficiently tackle several problems with an alignment-free approach. In fact, we show that such a data structure can be computed via sequential scans in semi-external memory. By using cLCP, we propose an efficient lightweight strategy to solve the multi-string Average Common Substring (ACS) problem, that consists in the pairwise comparison of a single string against a collection of m strings simultaneously, in order to obtain m ACS induced distances. Experimental results confirm the high practical efficiency of our approach.
KW - Alignment-free methods
KW - Average common substring
KW - Burrows-wheeler transform
KW - Computer Science (all)
KW - Longest common prefix
KW - Matching statistics
KW - Theoretical Computer Science
KW - Alignment-free methods
KW - Average common substring
KW - Burrows-wheeler transform
KW - Computer Science (all)
KW - Longest common prefix
KW - Matching statistics
KW - Theoretical Computer Science
UR - http://hdl.handle.net/10447/337234
UR - https://link.springer.com/book/10.1007%2F978-3-030-00479-8
M3 - Other
SP - 153
EP - 167
ER -