endoftext

Endoftext refers to a delimiter used in certain language model training datasets, most notably in OpenAI's GPT-2 WebText corpus. It is the literal string "<|endoftext|>" appended to the end of each document to signal a boundary between texts when multiple documents are concatenated for model training.

Origins and usage

The endoftext marker originated with the WebText dataset created for GPT-2. The dataset was assembled by crawling content linked from Reddit and other sources, and each document in the collection was separated by the endoftext marker. This delimiter allowed the training process to distinguish where one document ends and the next begins within long input sequences.
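
As an illustration of that boundary scheme, the sketch below concatenates a few stand-in documents with the marker appended to each one. The document contents and the joining convention are assumptions for illustration only, not a description of the actual WebText pipeline.

```python
# A minimal sketch of boundary insertion, assuming a plain Python list of
# scraped documents; the contents are hypothetical, not WebText data.
documents = [
    "First scraped article ...",
    "Second scraped article ...",
    "Third scraped article ...",
]

END_OF_TEXT = "<|endoftext|>"

# Append the marker to each document, then concatenate everything into one
# long training stream in which every document boundary is explicitly marked.
training_stream = "".join(doc + END_OF_TEXT for doc in documents)
```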

Technical context

Endoftext is a textual marker rather than a dedicated model-internal token by itself. During preprocessing, the marker is included in the raw data and subsequently tokenized along with the surrounding text. Depending on the tokenizer and model configuration, the marker may be represented as one or more tokens in the model's vocabulary, or treated as a special boundary indicator within the sequence.
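
The GPT-2 byte-pair encoding makes this distinction concrete. The snippet below uses the tiktoken library, which is one possible tokenizer choice rather than anything the article prescribes: when the marker is allowed as a special token it maps to a single vocabulary ID (50256), whereas encoding the same string as ordinary text splits the marker into several regular BPE tokens.

```python
import tiktoken

# Load the byte-pair encoding used by GPT-2.
enc = tiktoken.get_encoding("gpt2")

text = "First document.<|endoftext|>Second document."

# Marker treated as a special token: it becomes a single vocabulary entry.
as_special = enc.encode(text, allowed_special={"<|endoftext|>"})

# Marker treated as ordinary text: it is split into several regular BPE tokens.
as_plain = enc.encode_ordinary(text)

print(enc.eot_token)                   # 50256, the dedicated end-of-text token ID
print(len(as_special), len(as_plain))  # the "plain" encoding is longer
```

Which behavior applies in practice depends on how the preprocessing pipeline and tokenizer are configured, which is why the article describes both possibilities.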

Significance

The marker illustrates a broader practice in dataset design: embedding explicit boundaries to help models learn the structure of long corpora. It also highlights considerations around data provenance, boundary handling, and reproducibility in large-scale language model training. Discussions about data quality and boundary markers in training data often reference endoftext as a concrete example.

See also

End-of-file and end-of-sequence markers, corpus design, data preprocessing in language model training.
