Unigrams

Unigrams are the simplest unit in n-gram language models. A unigram is a single element of a sequence, most commonly a word. In word-based unigram models, the probability of a text is approximated by the product of the probabilities of its individual words. In character-level modeling, a unigram is instead a single character, and each character is treated as a token.
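The product-of-probabilities idea can be illustrated with a minimal sketch. It assumes whitespace tokenization and maximum-likelihood estimates (count divided by total); the function names `train_unigram` and `log_prob` and the toy corpus are hypothetical, chosen only for illustration. Log-probabilities are summed rather than multiplying raw probabilities, which avoids numerical underflow on longer texts.

```python
import math
from collections import Counter

def train_unigram(tokens):
    """Estimate P(w) as count(w) / total tokens (maximum likelihood)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def log_prob(tokens, model):
    """Log-probability of a token sequence under the unigram model.

    Summing logs is equivalent to multiplying the per-word probabilities.
    A word never seen in training has probability 0, so the result is -inf.
    """
    result = 0.0
    for w in tokens:
        p = model.get(w, 0.0)
        if p == 0.0:
            return float("-inf")
        result += math.log(p)
    return result

# Toy corpus, assumed for illustration only.
corpus = "the cat sat on the mat".split()
model = train_unigram(corpus)

# P("the cat") = P("the") * P("cat") = (2/6) * (1/6)
seen = log_prob(["the", "cat"], model)
unseen = log_prob(["the", "dog"], model)
```

Here `unseen` is negative infinity because "dog" never occurs in the toy corpus, which is the zero-probability problem that smoothing is meant to address.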

Use cases: In text classification and information retrieval, unigrams form the basis of bag-of-words representations, where documents are converted into vectors of word counts or frequencies. They are used in term frequency (TF) and TF-IDF weighting, as well as in probabilistic classifiers such as Naive Bayes and in simple language models that estimate P(w) from observed frequencies.

Advantages and limitations: Unigrams are simple and robust to small corpora, fast to compute, and provide a broad view of vocabulary usage. They, however, ignore word order and multiword expressions, and the resulting models can miss important context. Also, rare words can cause high-dimensional sparse representations, requiring smoothing or dimensionality reduction.

Variants and related concepts: In character-level modeling, unigrams are single characters; higher-order n-grams (bigrams, trigrams) capture local word sequences. A unigram model assumes independence among tokens, which is a strong simplification but serves as a useful baseline and feature set.
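
The bag-of-words representation and the smoothing mentioned above can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization, a vocabulary fixed from the example documents, and add-alpha (Laplace) smoothing as one possible smoothing method; the helper names `build_vocab`, `bow_vector`, and `laplace_prob` are hypothetical.

```python
from collections import Counter

def build_vocab(docs):
    """Sorted list of distinct tokens across all documents."""
    return sorted({tok for doc in docs for tok in doc.split()})

def bow_vector(doc, vocab):
    """Count vector for one document, aligned with the vocabulary order."""
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

def laplace_prob(word, counts, vocab_size, alpha=1.0):
    """Add-alpha smoothed estimate of P(word).

    Assumes the vocabulary size is fixed in advance; every vocabulary
    word gets a pseudo-count of alpha, so no probability is exactly zero.
    """
    total = sum(counts.values())
    return (counts[word] + alpha) / (total + alpha * vocab_size)

# Toy documents, assumed for illustration only.
docs = ["the cat sat", "the dog sat on the mat"]
vocab = build_vocab(docs)            # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
vectors = [bow_vector(d, vocab) for d in docs]

# Pooled counts over all documents, used for the smoothed estimate.
pooled = Counter(tok for d in docs for tok in d.split())
p_cat = laplace_prob("cat", pooled, len(vocab))
```

Both vectors have one entry per vocabulary word, so they share a dimension regardless of document length. Word order is discarded, which is the limitation noted above.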