Trigrams

Trigrams are sequences of three adjacent items drawn from a text or other sequence. They are the n = 3 case of the n-gram, a concept used in linguistics and natural language processing to model local context. Trigrams can be defined at the word level, as three consecutive words, or at the character level, as three consecutive characters.

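Both kinds can be read off by sliding a window of length three over a token list or a raw string. The following minimal Python sketch (with a made-up sample sentence) illustrates this:

    def trigrams(items):
        """Return all sequences of three adjacent items as tuples."""
        return [tuple(items[i:i + 3]) for i in range(len(items) - 2)]

    text = "the cat sat on the mat"

    # Word-level trigrams: three consecutive words.
    word_trigrams = trigrams(text.split())
    # [('the', 'cat', 'sat'), ('cat', 'sat', 'on'), ...]

    # Character-level trigrams: three consecutive characters (spaces included).
    char_trigrams = trigrams(text)
    # [('t', 'h', 'e'), ('h', 'e', ' '), ...]
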
In language modeling, the probability of a word given its two predecessors is written P(w3 | w1, w2). This probability is typically estimated from counts in a large corpus as c(w1, w2, w3) / c(w1, w2). Because many trigrams do not occur in a given dataset, smoothing methods are applied to assign nonzero probabilities to unseen triples. Common techniques include add-one (Laplace) smoothing, backoff, and Kneser-Ney smoothing.

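A minimal sketch of this estimation, assuming a toy corpus and using add-one smoothing for the unseen case (chosen here only for illustration), might look like:

    from collections import Counter

    corpus = "the cat sat on the mat the cat slept".split()  # toy corpus (assumed)
    vocab = set(corpus)

    trigram_counts = Counter(zip(corpus, corpus[1:], corpus[2:]))
    bigram_counts = Counter(zip(corpus, corpus[1:]))

    def p_mle(w1, w2, w3):
        """Maximum-likelihood estimate: c(w1, w2, w3) / c(w1, w2)."""
        if bigram_counts[(w1, w2)] == 0:
            return 0.0
        return trigram_counts[(w1, w2, w3)] / bigram_counts[(w1, w2)]

    def p_laplace(w1, w2, w3):
        """Add-one (Laplace) smoothing: every continuation gets a nonzero count."""
        return (trigram_counts[(w1, w2, w3)] + 1) / (bigram_counts[(w1, w2)] + len(vocab))

    print(p_mle("the", "cat", "sat"))      # 0.5: seen in the toy corpus
    print(p_mle("the", "cat", "ran"))      # 0.0: unseen triple
    print(p_laplace("the", "cat", "ran"))  # small but nonzero after smoothing
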
Trigrams have broad applications in text processing, including predicting the next word in a sentence, speech recognition, spelling correction, machine translation, information retrieval, and authorship analysis. Character trigrams are especially useful for language identification and for processing morphologically rich languages, while word trigrams can capture more semantic relations but require larger vocabularies.

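As a very rough illustration of the language-identification case (the reference texts and the overlap score below are assumed toy choices, far too small for real use), one can compare the character-trigram frequency profile of an input against stored profiles and pick the closest:

    from collections import Counter

    def char_trigram_profile(text):
        """Relative frequency of each character trigram in the text."""
        text = text.lower()
        counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
        total = sum(counts.values())
        return {gram: c / total for gram, c in counts.items()}

    # Toy reference profiles (assumed; real systems train on large corpora).
    profiles = {
        "english": char_trigram_profile("the quick brown fox jumps over the lazy dog"),
        "german": char_trigram_profile("der schnelle braune fuchs springt ueber den faulen hund"),
    }

    def identify(text):
        """Pick the reference language whose trigram profile overlaps most with the text."""
        sample = char_trigram_profile(text)
        def overlap(profile):
            return sum(min(freq, profile.get(gram, 0.0)) for gram, freq in sample.items())
        return max(profiles, key=lambda lang: overlap(profiles[lang]))

    print(identify("the dog jumps"))     # likely "english"
    print(identify("der hund springt"))  # likely "german"
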
Limitations of trigram models include data sparsity, since the number of possible triples grows rapidly with vocabulary size, and the fact that they only capture context from the two immediately preceding items, limiting long-range dependencies. They are often contrasted with higher-order n-gram models and with neural language models, which can model longer-range context and nonlinear patterns.