Unigrams
The unigram is the simplest unit in n-gram language models. A unigram is a single element of a sequence, most commonly a word. In word-based unigram models, the probability of a text is approximated by the product of the probabilities of its individual words, which treats each word as independent of its context. Unigrams can also be single characters in character-level modeling, where individual letters are treated as tokens.
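The product-of-probabilities idea can be illustrated with a minimal sketch. It assumes relative-frequency (maximum likelihood) estimates and no smoothing; the function names `train_unigram` and `log_prob` and the toy corpus are hypothetical, chosen only for illustration. Log-probabilities are summed rather than multiplying raw probabilities, which is equivalent to the product but avoids numerical underflow on long texts.

```python
import math
from collections import Counter

def train_unigram(tokens):
    """Estimate unigram probabilities by relative frequency (an assumption;
    real systems usually add smoothing)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}

def log_prob(tokens, model):
    """Log-probability of a sequence as the sum of per-token log-probabilities.

    Returns -inf if any token was never seen in training, since an
    unsmoothed model assigns it probability zero.
    """
    total = 0.0
    for tok in tokens:
        p = model.get(tok, 0.0)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total

text = "the cat sat on the mat"

# Word-level unigrams: each whitespace-separated word is a token.
word_model = train_unigram(text.split())

# Character-level unigrams: each letter is a token (spaces dropped here).
char_model = train_unigram(text.replace(" ", ""))

# P("the cat") is approximated as P("the") * P("cat").
p_the_cat = math.exp(log_prob(["the", "cat"], word_model))
```

The same two functions serve both the word-level and the character-level case, because a unigram model only needs a sequence of tokens and does not care what a token is.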
Use cases: In text classification and information retrieval, unigrams form the basis of bag-of-words representations, where
Advantages and limitations: Unigrams are simple and robust to small corpora, fast to compute, and provide a
Variants and related concepts: In character-level modeling, unigrams are single characters; higher-order n-grams (bigrams, trigrams) capture