digrams

A digram, also called a bigram in many applications, is a pair of adjacent elements in a sequence. Digrams can refer to any two consecutive units, such as letters, phonemes, or words. In statistical text analysis, digrams are the second-order units used to study the structure and patterns of a language.

In language analysis, two common forms are letter-level digrams and word-level digrams. Letter-level digrams are pairs of consecutive characters; for example, in English the digrams “th,” “he,” “in,” and “er” occur frequently. Word-level digrams consist of consecutive words, such as “to be,” “in any,” or “the end.”

Applications of digrams span several fields. In natural language processing, digrams are foundational to n-gram models used for language modeling, text prediction, spelling correction, and auto-completion. In cryptography, digram frequency analysis examines how often pairs of letters occur to break ciphers and infer plaintext. Digrams also appear in data compression and authorship attribution, where patterns of adjacent items help distinguish texts.

Computation typically involves sliding a two-element window across the text, tallying occurrences of each digram, and optionally converting counts to probabilities, such as P(b|a) = count(ab) / count(a). For word-level digrams, the text is first split into word tokens with appropriate preprocessing (case normalization, punctuation handling), and each pair of consecutive tokens forms a digram.

Limitations include data sparsity for less common digrams, sensitivity to preprocessing choices, and dependence on corpus quality. Digrams capture local dependencies but not long-range structure, so they are often complemented by longer n-grams and smoothing techniques in practical applications.
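
The sliding-window computation described above can be sketched in a few lines of Python. This is a minimal illustration and not a reference implementation: the function names are hypothetical, the tokenizer is a deliberately simple assumption (lowercase, ASCII letters only, punctuation dropped), and count(a) is taken to be the number of digrams that begin with a, so that the estimates for each a sum to 1.

```python
from collections import Counter
import re


def tokenize(text):
    # Assumed preprocessing: lowercase, keep only runs of ASCII letters.
    return re.findall(r"[a-z]+", text.lower())


def digrams(items):
    # Slide a two-element window across the sequence.
    items = list(items)
    return list(zip(items, items[1:]))


def letter_digram_counts(text):
    # Count letter pairs inside each word; pairs never span a word boundary.
    counts = Counter()
    for word in tokenize(text):
        counts.update(digrams(word))
    return counts


def word_digram_counts(text):
    # Count pairs of consecutive word tokens.
    return Counter(digrams(tokenize(text)))


def conditional_probabilities(pair_counts):
    # P(b|a) = count(ab) / count(a), with count(a) = digrams starting with a.
    first_counts = Counter()
    for (a, _), n in pair_counts.items():
        first_counts[a] += n
    return {(a, b): n / first_counts[a] for (a, b), n in pair_counts.items()}


text = "To be, or not to be."
word_counts = word_digram_counts(text)
probs = conditional_probabilities(word_counts)
print(word_counts[("to", "be")])  # 2
print(probs[("to", "be")])        # 1.0
```

These are unsmoothed estimates, so any digram absent from the text gets no probability at all, which is the sparsity problem noted above. Changing the tokenizer (for example, keeping apostrophes or non-ASCII letters) changes the counts, which illustrates the sensitivity to preprocessing.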