endoftext
Endoftext refers to a delimiter used in certain language model training datasets, most notably in OpenAI’s GPT-2 WebText corpus. It is the literal string "endoftext" appended to the end of each document to signal a boundary between texts when multiple documents are concatenated for model training.
The endoftext marker originated with the WebText dataset created for GPT-2. The dataset was assembled by crawling
Endoftext is a textual marker rather than a dedicated model-internal token by itself. During preprocessing, the
The marker illustrates a broader practice in dataset design: embedding explicit boundaries to help models learn
End-of-file and end-of-sequence markers, corpus design, data preprocessing in language model training.