ngramindekser - Infinite Lexicon

ngramindekser

ngramindekser (often translated as n-gram indexer) is a data structure and set of algorithms used to build an index of n-grams—contiguous sequences of n items—from a collection of text. The index maps each n-gram to the documents in which it appears, typically including positional information to support more complex queries. This enables efficient substring search and, with additional scoring, approximate matching.
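The mapping described above can be sketched as a small positional inverted index. This is a minimal illustration, not a reference implementation: the function names (`char_ngrams`, `build_index`, `substring_search`), the choice of character trigrams, and the in-memory dictionary layout are all assumptions made for the example.

```python
from collections import defaultdict


def char_ngrams(text, n=3):
    """Yield (position, ngram) pairs for every contiguous n-character window."""
    for i in range(len(text) - n + 1):
        yield i, text[i:i + n]


def build_index(docs, n=3):
    """Map each n-gram to {doc_id: [positions]} (a positional inverted index)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, gram in char_ngrams(text, n):
            index[gram][doc_id].append(pos)
    return index


def substring_search(index, docs, query, n=3):
    """Return the sorted ids of documents that contain `query` as a substring."""
    if len(query) < n:
        # Too short to form a single n-gram: fall back to a linear scan.
        return sorted(d for d, text in docs.items() if query in text)
    grams = [g for _, g in char_ngrams(query, n)]
    # A matching document must contain every n-gram of the query.
    candidates = set(index.get(grams[0], {}))
    for gram in grams[1:]:
        candidates &= set(index.get(gram, {}))
    # Verify candidates, since the n-grams may occur without being adjacent.
    return sorted(d for d in candidates if query in docs[d])
```

The index only narrows the candidate set; the final `query in docs[d]` check confirms each match. A fuller implementation could use the stored positions to verify adjacency without rereading the document text.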

Two main flavors exist: character-level n-grams and word-level n-grams. Character n-grams are derived from the raw
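To make the distinction between the two flavors concrete, here is a short sketch. The helper names `character_ngrams` and `word_ngrams` are hypothetical, and splitting on whitespace is an assumed, simplistic tokenization.

```python
def character_ngrams(text, n):
    """Return every contiguous run of n characters in `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Return every contiguous run of n whitespace-separated words as a tuple."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
```

For example, `character_ngrams("index", 3)` gives `["ind", "nde", "dex"]`, while `word_ngrams("to be or not", 2)` gives `[("to", "be"), ("be", "or"), ("or", "not")]`.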

Construction and querying usually follow a similar workflow. Text is normalized (lowercased, diacritics handled) and tokenized
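The normalization and tokenization step might look like the following sketch. It assumes that "handling" diacritics means stripping them and that tokens are runs of word characters; both are illustrative choices, and the names `normalize` and `tokenize` are hypothetical.

```python
import re
import unicodedata


def normalize(text):
    """Lowercase the text and strip diacritics (one possible way to handle them)."""
    # NFKD splits accented characters into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text.lower())
    # Drop the combining marks and keep the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text):
    """Split normalized text into word tokens (runs of word characters)."""
    return re.findall(r"\w+", normalize(text))
```

With these definitions, `normalize("Café Über")` returns `"cafe uber"`. The same normalization has to be applied to queries, otherwise query n-grams will not line up with the indexed ones.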

Typical applications include full-text search with fuzzy matching, autocomplete, spell checking, and plagiarism detection. Key considerations
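As an illustration of fuzzy matching for spell checking, the sketch below ranks vocabulary words by the Jaccard similarity of their character bigram sets. The scoring function, the `threshold` value, and the names `ngram_set`, `jaccard`, and `suggest` are assumptions for this example. For brevity it scans the vocabulary linearly, where a real indexer would look up candidates in the n-gram index.

```python
def ngram_set(word, n=2):
    """Return the set of character n-grams in `word`."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def suggest(query, vocabulary, n=2, threshold=0.3):
    """Return vocabulary words scoring at least `threshold`, best match first."""
    q = ngram_set(query, n)
    scored = [(jaccard(q, ngram_set(word, n)), word) for word in vocabulary]
    # Sort by descending score, breaking ties alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for score, word in scored if score >= threshold]
```

For the misspelling `"serch"` and the vocabulary `["search", "march", "church", "perch"]`, calling `suggest` with `threshold=0.4` returns `["perch", "search"]`: "perch" shares three of five distinct bigrams with the query, and "search" shares three of six.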
