endoftext

Endoftext refers to a delimiter used in certain language model training datasets, most notably in OpenAI's GPT-2 WebText corpus. It is the literal string "<|endoftext|>" appended to the end of each document to signal a boundary between texts when multiple documents are concatenated for model training.

Origins and usage

The endoftext marker originated with the WebText dataset created for GPT-2. The dataset was assembled by crawling content linked from Reddit and other sources, and each document in the collection was separated by the endoftext marker. This delimiter allowed the training process to distinguish where one document ends and the next begins within long input sequences.
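
As an illustration of that boundary scheme, the sketch below concatenates a few stand-in documents with the marker appended to each one. The document contents and the joining convention are assumptions for illustration only, not a description of the actual WebText pipeline.

```python
# A minimal sketch of boundary insertion, assuming a plain Python list of
# scraped documents; the contents are hypothetical, not WebText data.
documents = [
    "First scraped article ...",
    "Second scraped article ...",
    "Third scraped article ...",
]

END_OF_TEXT = "<|endoftext|>"

# Append the marker to each document, then concatenate everything into one
# long training stream in which every document boundary is explicitly marked.
training_stream = "".join(doc + END_OF_TEXT for doc in documents)
```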

Technical context

Endoftext is a textual marker rather than a dedicated model-internal token by itself. During preprocessing, the marker is included in the raw data and subsequently tokenized along with the surrounding text. Depending on the tokenizer and model configuration, the marker may be represented as one or more tokens in the model's vocabulary, or treated as a special boundary indicator within the sequence.
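
The GPT-2 byte-pair encoding makes this distinction concrete. The snippet below uses the tiktoken library, which is one possible tokenizer choice rather than anything the article prescribes: when the marker is allowed as a special token it maps to a single vocabulary ID (50256), whereas encoding the same string as ordinary text splits the marker into several regular BPE tokens.

```python
import tiktoken

# Load the byte-pair encoding used by GPT-2.
enc = tiktoken.get_encoding("gpt2")

text = "First document.<|endoftext|>Second document."

# Marker treated as a special token: it becomes a single vocabulary entry.
as_special = enc.encode(text, allowed_special={"<|endoftext|>"})

# Marker treated as ordinary text: it is split into several regular BPE tokens.
as_plain = enc.encode_ordinary(text)

print(enc.eot_token)                   # 50256, the dedicated end-of-text token ID
print(len(as_special), len(as_plain))  # the "plain" encoding is longer
```

Which behavior applies in practice depends on how the preprocessing pipeline and tokenizer are configured, which is why the article describes both possibilities.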

Significance

The marker illustrates a broader practice in dataset design: embedding explicit boundaries to help models learn the structure of long corpora. It also highlights considerations around data provenance, boundary handling, and reproducibility in large-scale language model training. Discussions about data quality and boundary markers in training data often reference endoftext as a concrete example.

See also

End-of-file and end-of-sequence markers, corpus design, data preprocessing in language model training.
