kgram - Infinite Lexicon

kgram

A k-gram (also written K-gram or kgram) is a contiguous substring of length k drawn from a larger string. K-grams are typically defined at the character level, but the concept also applies to word-level units in some contexts; for DNA or protein sequences, k-grams are commonly called k-mers. For a string S of length n with 0-based indexing, the k-grams are the substrings S[i..i+k-1] for i from 0 to n-k, collected either as a set (distinct values only) or as a multiset (with repetitions). If n < k, the string contains no k-grams. Otherwise, the total number of k-grams counted with repetition is n-k+1, while the number of distinct k-grams depends on how often substrings repeat and is at most min(n-k+1, σ^k) for an alphabet of σ symbols.

An example: for the word "hello" and k = 2, the 2-grams are "he", "el", "ll", and "lo".
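The definition and the "hello" example can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name kgrams and the sample inputs "banana" and "the cat sat on the mat" are assumptions introduced here. The same slicing works for a list of words, which gives the word-level variant mentioned above.

```python
from collections import Counter


def kgrams(seq, k):
    """Return the contiguous length-k slices of seq, in order of position.

    `kgrams` is a hypothetical helper name used only for this sketch.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # range(len(seq) - k + 1) is empty when len(seq) < k,
    # so a sequence shorter than k yields no k-grams.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


# Character-level 2-grams of the example word.
bigrams = kgrams("hello", 2)
print(bigrams)  # ['he', 'el', 'll', 'lo']

# Multiset view: Counter records how often each k-gram occurs.
counts = Counter(kgrams("banana", 2))
print(sum(counts.values()))  # total k-grams, n - k + 1 = 5
print(len(counts))           # distinct k-grams: 3

# Word-level variant: slice a list of tokens instead of a string.
words = "the cat sat on the mat".split()
word_bigrams = [tuple(g) for g in kgrams(words, 2)]
print(word_bigrams[0])  # ('the', 'cat')
```

Returning a list keeps positions and duplicates (the multiset view); wrapping the result in set() gives the distinct k-grams instead.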

K-grams are used in a variety of applications. In text processing and information retrieval, they enable fuzzy

Key considerations include the choice of k: small k yields many overlapping k-grams and potential noise, while
