subtokens

Subtokens are the smaller units used to represent text in subword tokenization schemes. In many contemporary natural language processing systems, words are not stored as indivisible tokens; instead, they are decomposed into subwords or subtokens. Subtoken vocabularies are built by algorithms such as Byte Pair Encoding (BPE), WordPiece, or SentencePiece, which aim to cover a language with a compact yet expressive set of units.

During training, a tokenizer starts with a basic symbol set (often characters) and repeatedly merges the most frequent adjacent pairs to form new subtokens. The resulting vocabulary contains many short units that can be concatenated to form most words. At inference time, an unseen word is broken into subtokens that exist in the vocabulary, enabling approximate representation without requiring a new token.
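
For concreteness, the sketch below implements this merge loop on a tiny made-up corpus; the word frequencies, helper names, and number of merges are illustrative choices, not taken from any particular tokenizer library.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across all words (each word is a tuple of symbols)."""
    counts = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(words, pair):
    """Replace every occurrence of the given pair with a single merged symbol."""
    a, b = pair
    merged = {}
    for word, freq in words.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and word[i] == a and word[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus (made up for illustration): word -> frequency, split into characters to start.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = {tuple(w): f for w, f in corpus.items()}

merges = []
for _ in range(10):                      # learn up to 10 merges
    pairs = get_pair_counts(words)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)     # most frequent adjacent pair
    words = merge_pair(words, best)
    merges.append(best)

print(merges)        # learned merge rules, e.g. ('e', 's'), ('es', 't'), ...
print(list(words))   # words now represented as sequences of subtokens
```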

Subtokens reduce the size of the vocabulary and improve handling of morphology, making it easier to model languages with rich inflection or compounding. They enable models to generalize to rare or unseen words by composing familiar subwords, and they help manage multilingual or cross-domain data where fixed-word vocabularies would be insufficient.
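
As a concrete illustration of composing unseen words from familiar pieces, here is a self-contained sketch of greedy longest-match segmentation, the strategy WordPiece-style tokenizers apply at inference time; the vocabulary and example words are invented for the example.

```python
# Made-up subtoken vocabulary for illustration only.
VOCAB = {"un", "break", "able", "ly", "token", "iz", "ation", "s"}

def segment(word, vocab):
    """Greedily take the longest vocabulary entry that prefixes the remaining text."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):       # try the longest candidate first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])              # fall back to a single character
            i += 1
    return pieces

print(segment("unbreakable", VOCAB))     # ['un', 'break', 'able']
print(segment("tokenizations", VOCAB))   # ['token', 'iz', 'ation', 's']
```

Neither full word appears in the vocabulary, yet both are represented exactly by concatenating known subtokens.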

Limitations include dependence on the chosen segmentation scheme, which can affect interpretability and downstream performance. Subtoken boundaries may not align with linguistic morphemes, and the extra processing steps add computational overhead. Evaluating model behavior with subtokens can also be more complex than with traditional word-based representations.
