Subword

Subword refers to any linguistic or computational unit smaller than a word. In linguistics, subword units include morphemes (the smallest meaningful units, such as prefixes and suffixes) as well as syllables and other segments used in analyses of word formation. In computing and natural language processing, the term usually denotes a unit produced by subword tokenization: a token that is not a full word but a shorter piece of one. Subword modeling helps handle languages with rich morphology, as well as out-of-vocabulary words, by representing text with a fixed vocabulary of subword units rather than a fixed set of whole words.

Subword tokenization methods such as Byte-Pair Encoding (BPE), WordPiece, and SentencePiece learn a vocabulary of subword units from data. They typically start with characters or short symbols and iteratively merge frequently co-occurring sequences to form longer units, stopping at a predefined vocabulary size. The result is that many words can be built from combinations of subword units, enabling the model to form representations for unseen words.
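
To make the merge procedure concrete, below is a minimal sketch of BPE vocabulary learning in Python. It is an illustration, not any particular library's implementation: real tokenizers add word-boundary markers, byte-level fallbacks, and much faster pair counting. The function name learn_bpe_merges and the corpus format (a plain list of words) are assumptions made here for clarity.

```python
from collections import Counter

def learn_bpe_merges(corpus, num_merges):
    """Learn BPE merge rules from a list of words (illustrative sketch)."""
    # Each distinct word starts as a tuple of single characters,
    # weighted by how often the word appears in the corpus.
    word_freqs = Counter(corpus)
    words = {word: tuple(word) for word in word_freqs}

    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pair_counts = Counter()
        for word, symbols in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += word_freqs[word]
        if not pair_counts:
            break  # every word is a single symbol; nothing left to merge

        # Merge the most frequent pair into one longer symbol everywhere.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        merged = best[0] + best[1]
        for word, symbols in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            words[word] = tuple(out)
    return merges
```

On a toy corpus such as ["low", "lower", "lowest"], the first two learned merges are ('l', 'o') and then ('lo', 'w'), since those adjacent pairs occur most often; the merge count plays the role of the predefined vocabulary-size budget.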

For example, the word "unhappiness" might be tokenized as ["un", "happiness"] if "happiness" is in the vocabulary, or as ["un", "hap", "pi", "ness"] under a more granular scheme.
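
One simple way to apply such a vocabulary at tokenization time is greedy longest-match-first segmentation, in the spirit of WordPiece (BPE proper instead replays its learned merges in order). The sketch below, with the assumed helper name segment, reproduces the two splittings of "unhappiness" from the example above:

```python
def segment(word, vocab):
    """Split a word into the longest vocabulary entries, left to right."""
    pieces, start = [], 0
    while start < len(word):
        # Try the longest remaining substring first, shrinking the window
        # until a vocabulary entry matches (or only one character is left,
        # assuming single characters are always acceptable fallbacks).
        end = len(word)
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        pieces.append(word[start:end])
        start = end
    return pieces

print(segment("unhappiness", {"un", "happiness"}))          # ['un', 'happiness']
print(segment("unhappiness", {"un", "hap", "pi", "ness"}))  # ['un', 'hap', 'pi', 'ness']
```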

Advantages and limitations

Benefits include smaller vocabularies, better handling of neologisms and agglutinative languages, and improved robustness in NLP models. Limitations include potential mismatch with linguistic morphology and the need for large, representative training data to learn effective subword units.
