ngramindex - Infinite Lexicon

ngramindex

An ngram index, more commonly written n-gram index, is a data structure used in text retrieval and processing that records where contiguous sequences of n characters (or, less commonly, n words) occur in a collection of documents. Because the documents containing a given sequence can be looked up directly, it supports fast substring search, approximate matching, and related text analysis tasks.

Construction typically involves two steps. First, each document is decomposed into overlapping n-grams using a sliding
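
As an illustration of the construction described here, the following is a minimal Python sketch, not a reference implementation. It assumes character trigrams (n = 3) and assumes that the index simply maps each n-gram to the set of document ids containing it. The names `char_ngrams` and `build_index` are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def char_ngrams(text: str, n: int = 3) -> list[str]:
    """Return the overlapping character n-grams of text, in order.

    A window of width n slides one character at a time; a text shorter
    than n yields no n-grams.
    """
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_index(docs: dict[int, str], n: int = 3) -> dict[str, set[int]]:
    """Map each n-gram to the ids of the documents that contain it.

    Assumption for this sketch: only document ids are stored, not the
    positions of each occurrence inside the document.
    """
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in char_ngrams(text, n):
            index[gram].add(doc_id)
    return dict(index)


if __name__ == "__main__":
    docs = {1: "banana", 2: "bandana"}
    print(char_ngrams("banana"))       # ['ban', 'ana', 'nan', 'ana']
    print(build_index(docs)["ban"])    # {1, 2}
```

Storing sets of document ids keeps the sketch short; an index that also needs to report match positions would store (document id, offset) pairs instead.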

Query processing relies on matching the query’s n-grams against the index. The system retrieves candidate documents
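
The query step can be sketched in Python as follows. This is a hedged example under stated assumptions: character trigrams, an index that maps each n-gram to a set of document ids, and a final verification pass that checks each candidate with an ordinary substring test. The verification pass and the fallback for queries shorter than n are assumptions of this sketch, and the names `build_index` and `search` are hypothetical.

```python
from collections import defaultdict


def char_ngrams(text: str, n: int = 3) -> list[str]:
    """Return the overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_index(docs: dict[int, str], n: int = 3) -> dict[str, set[int]]:
    """Map each n-gram to the ids of the documents that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in char_ngrams(text, n):
            index[gram].add(doc_id)
    return dict(index)


def search(query: str, docs: dict[int, str],
           index: dict[str, set[int]], n: int = 3) -> list[int]:
    """Return sorted ids of documents that contain query as a substring."""
    grams = char_ngrams(query, n)
    if not grams:
        # Query shorter than n: the index cannot help, so scan everything.
        return sorted(d for d, text in docs.items() if query in text)
    # Candidates must contain every n-gram of the query.
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Sharing all n-grams is necessary but not sufficient, so verify.
    return sorted(d for d in candidates if query in docs[d])


if __name__ == "__main__":
    docs = {1: "banana", 2: "bandana", 3: "nan ana"}
    index = build_index(docs)
    print(search("anan", docs, index))  # [1]
    print(search("band", docs, index))  # [2]
```

In this example document 3 contains both "ana" and "nan" and so survives the candidate step for the query "anan", but the n-grams are not adjacent there, which is why the sketch re-checks each candidate before returning it.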

Applications include search engines, spell checkers, plagiarism detection, and DNA or biosequence analysis where short, exact
