Linear-size suffix tries

Chiara Epifanio, Filippo Mignosi, Roberto Grossi, Maxime Crochemore

Research output: Contribution to journal › Article

3 Citations (Scopus)

Abstract

Suffix trees are highly regarded data structures for text indexing and string algorithms [McCreight 76, Weiner 73]. For any given string w of length n = |w|, a suffix tree for w takes O(n) nodes and links. It is often presented as a compacted version of a suffix trie for w, where the latter is the trie (or digital search tree) built on the suffixes of w. Here the compaction process replaces each maximal chain of unary nodes with a single arc. For this, the suffix tree requires that the labels of its arcs be substrings encoded as pointers to w (or equivalent information). In contrast, the arcs of the suffix trie are labeled by single symbols, but there can be Θ(n^2) nodes and links in the worst case because of unary nodes. It is an interesting question whether the suffix trie can be stored using O(n) nodes. We present the linear-size suffix trie, which guarantees O(n) nodes. We use a new technique for reducing the number of unary nodes to O(n), which stems from some results on antidictionaries. For instance, by using the linear-size suffix trie, we are able to check whether a pattern p of length m = |p| occurs in w in O(m log |Σ|) time, and we can find the longest common substring of two strings w1 and w2 in O((|w1| + |w2|) log |Σ|) time for an alphabet Σ.
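
To make the Θ(n^2) worst case concrete, the following is a minimal sketch of the classical, uncompacted suffix trie described in the abstract. It is not the linear-size suffix trie of the paper, whose construction is not given on this page. The names TrieNode, build_suffix_trie, count_nodes and occurs are hypothetical, chosen only for illustration. Children are kept in a hash map, so each step of the pattern walk takes expected constant time; the O(m log |Σ|) bound in the abstract would correspond to ordered children searched by comparison.

```python
from typing import Dict


class TrieNode:
    """One node of an uncompacted suffix trie; each arc carries one symbol."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}


def build_suffix_trie(w: str) -> TrieNode:
    """Insert every suffix of w symbol by symbol (quadratic time and space)."""
    root = TrieNode()
    for start in range(len(w)):
        node = root
        for symbol in w[start:]:
            child = node.children.get(symbol)
            if child is None:
                child = TrieNode()
                node.children[symbol] = child
            node = child
    return root


def count_nodes(root: TrieNode) -> int:
    """Count all nodes, root included, using an explicit stack."""
    total, stack = 0, [root]
    while stack:
        node = stack.pop()
        total += 1
        stack.extend(node.children.values())
    return total


def occurs(root: TrieNode, p: str) -> bool:
    """Return True if p is a substring of the indexed string."""
    node = root
    for symbol in p:
        node = node.children.get(symbol)
        if node is None:
            return False
    return True


if __name__ == "__main__":
    k = 10
    w = "a" * k + "b" * k      # n = 2k = 20 symbols
    trie = build_suffix_trie(w)
    print(count_nodes(trie))   # (k + 1)**2 = 121 nodes
    print(occurs(trie, "aabb"), occurs(trie, "ba"))
```

For w = a^k b^k the distinct substrings are exactly the strings a^i b^j with 0 ≤ i, j ≤ k, so the trie has (k + 1)^2 nodes including the root, which is quadratic in n = 2k. A suffix tree for the same string would need only O(n) nodes, at the cost of arc labels that point back into w.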
Original language: English
Pages (from-to): 171-178
Number of pages: 8
Journal: Theoretical Computer Science
Volume: 638
Publication status: Published - 2016

Fingerprint

Suffix
Data structures
Labels
Compaction
Suffix Tree
Vertex of a graph
Unary
Arc of a curve
Strings
Text Indexing
String Algorithms
Search Trees

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Computer Science(all)

Cite this

Epifanio, C., Mignosi, F., Grossi, R., & Crochemore, M. (2016). Linear-size suffix tries. Theoretical Computer Science, 638, 171-178.

Linear-size suffix tries. / Epifanio, Chiara; Mignosi, Filippo; Grossi, Roberto; Crochemore, Maxime.

In: Theoretical Computer Science, Vol. 638, 2016, p. 171-178.

Epifanio, C, Mignosi, F, Grossi, R & Crochemore, M 2016, 'Linear-size suffix tries', Theoretical Computer Science, vol. 638, pp. 171-178.
Epifanio C, Mignosi F, Grossi R, Crochemore M. Linear-size suffix tries. Theoretical Computer Science. 2016;638:171-178.
Epifanio, Chiara ; Mignosi, Filippo ; Grossi, Roberto ; Crochemore, Maxime. / Linear-size suffix tries. In: Theoretical Computer Science. 2016 ; Vol. 638. pp. 171-178.
@article{f9c3ffd842424ff88f0a73b3ec7cd3ef,
title = "Linear-size suffix tries",
abstract = "Suffix trees are highly regarded data structures for text indexing and string algorithms [McCreight 76, Weiner 73]. For any given string w of length n = |w|, a suffix tree for w takes O(n) nodes and links. It is often presented as a compacted version of a suffix trie for w, where the latter is the trie (or digital search tree) built on the suffixes of w. Here the compaction process replaces each maximal chain of unary nodes with a single arc. For this, the suffix tree requires that the labels of its arcs be substrings encoded as pointers to w (or equivalent information). In contrast, the arcs of the suffix trie are labeled by single symbols, but there can be Θ(n^2) nodes and links in the worst case because of unary nodes. It is an interesting question whether the suffix trie can be stored using O(n) nodes. We present the linear-size suffix trie, which guarantees O(n) nodes. We use a new technique for reducing the number of unary nodes to O(n), which stems from some results on antidictionaries. For instance, by using the linear-size suffix trie, we are able to check whether a pattern p of length m = |p| occurs in w in O(m log |Σ|) time, and we can find the longest common substring of two strings w1 and w2 in O((|w1| + |w2|) log |Σ|) time for an alphabet Σ.",
author = "Chiara Epifanio and Filippo Mignosi and Roberto Grossi and Maxime Crochemore",
year = "2016",
language = "English",
volume = "638",
pages = "171--178",
journal = "Theoretical Computer Science",
issn = "0304-3975",
publisher = "Elsevier",

}

TY - JOUR

T1 - Linear-size suffix tries

AU - Epifanio, Chiara

AU - Mignosi, Filippo

AU - Grossi, Roberto

AU - Crochemore, Maxime

PY - 2016

Y1 - 2016

AB - Suffix trees are highly regarded data structures for text indexing and string algorithms [McCreight 76, Weiner 73]. For any given string w of length n = |w|, a suffix tree for w takes O(n) nodes and links. It is often presented as a compacted version of a suffix trie for w, where the latter is the trie (or digital search tree) built on the suffixes of w. Here the compaction process replaces each maximal chain of unary nodes with a single arc. For this, the suffix tree requires that the labels of its arcs be substrings encoded as pointers to w (or equivalent information). In contrast, the arcs of the suffix trie are labeled by single symbols, but there can be Θ(n^2) nodes and links in the worst case because of unary nodes. It is an interesting question whether the suffix trie can be stored using O(n) nodes. We present the linear-size suffix trie, which guarantees O(n) nodes. We use a new technique for reducing the number of unary nodes to O(n), which stems from some results on antidictionaries. For instance, by using the linear-size suffix trie, we are able to check whether a pattern p of length m = |p| occurs in w in O(m log |Σ|) time, and we can find the longest common substring of two strings w1 and w2 in O((|w1| + |w2|) log |Σ|) time for an alphabet Σ.

UR - http://hdl.handle.net/10447/213681

M3 - Article

VL - 638

SP - 171

EP - 178

JO - Theoretical Computer Science

JF - Theoretical Computer Science

SN - 0304-3975

ER -