Substring index - Reference.org

On this page

Substring index

Data structure

In computer science, a data structure called a substring index enables efficient substring search within texts or document collections, achieving sublinear query times. After construction, it locates all pattern occurrences in time proportional to the pattern length, with minimal dependence on document size. Though often called a full-text index, this term is ambiguous because it also refers to structures like inverted files used in document retrieval. For more details, see full text search.

We don't have any images related to Substring index yet.

You can add one yourself here.

We don't have any YouTube videos related to Substring index yet.

You can add one yourself here.

We don't have any PDF documents related to Substring index yet.

You can add one yourself here.

We don't have any Books related to Substring index yet.

You can add one yourself here.

We don't have any archived web articles related to Substring index yet.

You can submit a link to a page to archive here.

General considerations

These data structures typically treat their text and pattern as strings over a fixed alphabet, and search for locations where the pattern occurs as a substring of the text. The symbols of the alphabet may be characters (for instance in Unicode) but in practical applications for text retrieval it may be preferable to treat the (stemmed) words of a document as the symbols of its alphabet, because doing this reduces the lengths of both the text and pattern as measured in numbers of symbols.²

Examples

Specific data structures that can be used as substring indexes include:

The suffix tree, a radix tree of the suffixes of the string, allowing substring search to be performed symbol-by-symbol³ ⁴
The suffix automaton, the minimal deterministic finite automaton that recognizes substrings of a given text, closely related to the suffix tree and constructable by variants of the same algorithms.⁵
The suffix array, a sorted array of the starting positions of suffixes of the string, allowing substring search to be performed by binary search ⁶ ⁷ Augmenting a suffix array with an LCP array of the lengths of common prefixes of consecutive suffixes allows the search to be performed symbol-by-symbol, matching the search time of the suffix tree.⁸
The compressed suffix array, a data structure that combines data compression with the suffix array, allowing the structure to be stored in space sublinear in the text length⁹ ¹⁰
The FM-index, another compressed substring index based on the Burrows–Wheeler transform and closely related to the suffix array¹¹

References

Barsky, Marina; Stege, Ulrike; Thomo, Alex (2012), "Chapter 1: Structures for Indexing Substrings", Full-Text (Substring) Indexes in External Memory, Synthesis Lectures on Data Management, Springer International Publishing, pp. 1–15, doi:10.1007/978-3-031-01885-5_1, ISBN 9783031018855 9783031018855 ↩
Risvik, Knut Magne (1998), "Approximate word sequence matching over sparse suffix trees", in Farach-Colton, Martin (ed.), Combinatorial Pattern Matching, 9th Annual Symposium, CPM 98, Piscataway, New Jersey, USA, July 20–22, 1998, Proceedings, Lecture Notes in Computer Science, vol. 1448, Springer, pp. 65–79, doi:10.1007/BFB0030781, ISBN 978-3-540-64739-3 978-3-540-64739-3 ↩
Barsky, Marina; Stege, Ulrike; Thomo, Alex (2012), "Chapter 1: Structures for Indexing Substrings", Full-Text (Substring) Indexes in External Memory, Synthesis Lectures on Data Management, Springer International Publishing, pp. 1–15, doi:10.1007/978-3-031-01885-5_1, ISBN 9783031018855 9783031018855 ↩
Grossi, Roberto; Vitter, Jeffrey Scott (2005), "Compressed suffix arrays and suffix trees with applications to text indexing and string matching" (PDF), SIAM Journal on Computing, 35 (2): 378–407, doi:10.1137/S0097539702402354, hdl:1808/18962, MR 2191449 /wiki/Jeffrey_Vitter ↩
Blumer, Anselm; Blumer, J.; Ehrenfeucht, Andrzej; Haussler, David; McConnell, Ross M. (1984), "Building the minimal DFA for the set of all subwords of a word on-line in linear time", in Paredaens, Jan (ed.), Automata, Languages and Programming, 11th Colloquium, Antwerp, Belgium, July 16–20, 1984, Proceedings, Lecture Notes in Computer Science, vol. 172, Springer, pp. 109–118, doi:10.1007/3-540-13345-3_9, ISBN 978-3-540-13345-2 978-3-540-13345-2 ↩
Barsky, Marina; Stege, Ulrike; Thomo, Alex (2012), "Chapter 1: Structures for Indexing Substrings", Full-Text (Substring) Indexes in External Memory, Synthesis Lectures on Data Management, Springer International Publishing, pp. 1–15, doi:10.1007/978-3-031-01885-5_1, ISBN 9783031018855 9783031018855 ↩
Grossi, Roberto; Vitter, Jeffrey Scott (2005), "Compressed suffix arrays and suffix trees with applications to text indexing and string matching" (PDF), SIAM Journal on Computing, 35 (2): 378–407, doi:10.1137/S0097539702402354, hdl:1808/18962, MR 2191449 /wiki/Jeffrey_Vitter ↩
Manber, Udi; Myers, Gene (1993), "Suffix arrays: a new method for on-line string searches", SIAM Journal on Computing, 22 (5): 935–948, doi:10.1137/0222058, MR 1237156 /wiki/Udi_Manber ↩
Barsky, Marina; Stege, Ulrike; Thomo, Alex (2012), "Chapter 1: Structures for Indexing Substrings", Full-Text (Substring) Indexes in External Memory, Synthesis Lectures on Data Management, Springer International Publishing, pp. 1–15, doi:10.1007/978-3-031-01885-5_1, ISBN 9783031018855 9783031018855 ↩
Grossi, Roberto; Vitter, Jeffrey Scott (2005), "Compressed suffix arrays and suffix trees with applications to text indexing and string matching" (PDF), SIAM Journal on Computing, 35 (2): 378–407, doi:10.1137/S0097539702402354, hdl:1808/18962, MR 2191449 /wiki/Jeffrey_Vitter ↩
Ferragina, Paolo; Manzini, Giovanni (2005), "Indexing compressed text", Journal of the ACM, 52 (4): 552–581, doi:10.1145/1082036.1082039, MR 2164632 /wiki/Journal_of_the_ACM ↩