Sprogmarkering
Sprogmarkering (Danish for "language marking") is the practice of annotating a text or spoken utterance with the language it is in. It identifies which language a document, sentence, or token belongs to, typically to support search, processing, indexing, or analysis. The practice is common in data annotation, publishing, and digital systems that handle multilingual content. Language markings can be applied at several levels: document level (the primary language of a text), sentence level (a language tag on each sentence), or token level (individual words or phrases tagged within multilingual segments). Markers typically use standardized codes, such as ISO 639-1 two-letter codes, or other tagging schemes, and can be encoded as metadata fields, as attributes such as the HTML lang attribute or xml:lang, or within annotation schemas.
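The three levels of marking can be illustrated with a small data model. The following is a minimal sketch, not a standard schema: the class names Document, Sentence, and Token and the helper token_languages are hypothetical, and the language codes are assumed to be ISO 639-1 two-letter codes.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Token:
    text: str
    lang: str  # token-level tag, assumed to be an ISO 639-1 code such as "da"


@dataclass
class Sentence:
    lang: str  # sentence-level tag
    tokens: List[Token] = field(default_factory=list)


@dataclass
class Document:
    lang: str  # document-level tag: the primary language of the text
    sentences: List[Sentence] = field(default_factory=list)


def token_languages(doc: Document) -> Set[str]:
    """Collect every language code that appears at token level."""
    return {tok.lang for sent in doc.sentences for tok in sent.tokens}


# A Danish sentence containing one English word (code-switching).
doc = Document(
    lang="da",
    sentences=[
        Sentence(
            lang="da",
            tokens=[
                Token("Det", "da"),
                Token("er", "da"),
                Token("really", "en"),
                Token("godt", "da"),
            ],
        )
    ],
)

print(doc.lang)              # document-level language
print(token_languages(doc))  # languages seen at token level
```

Keeping a tag at each level lets a system treat the document as Danish for indexing while still recording that one token is English.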
Methods include manual annotation by humans, automatic language identification (LID) using statistical or machine learning models,
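As a rough illustration of automatic language identification, the sketch below scores a text by counting matches against small function-word lists. This is a deliberately simple stand-in for the statistical or machine learning models used in practice. The word lists, the function name identify_language, and the fallback code "und" (the ISO 639-2 code for an undetermined language) are assumptions made for this example.

```python
# Tiny, hand-picked function-word lists; real systems use far richer features.
STOPWORDS = {
    "da": {"og", "det", "ikke", "jeg", "er", "til"},
    "en": {"the", "and", "is", "not", "of", "it"},
    "de": {"und", "das", "ist", "nicht", "ich", "zu"},
}


def identify_language(text: str, default: str = "und") -> str:
    """Return the code whose word list matches the most words in the text.

    Words are split on whitespace only, so punctuation attached to a word
    prevents a match. If no word matches any list, `default` is returned.
    """
    words = text.lower().split()
    scores = {
        lang: sum(word in vocab for word in words)
        for lang, vocab in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


print(identify_language("Det er ikke til at forstå"))
print(identify_language("The cat is not here"))
print(identify_language("12345"))
```

A word-list approach like this has little evidence to work with on very short inputs, which is one reason production systems rely on character-level features and trained models.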
Challenges include short or noisy text, code-switching, brand or place names that differ in language, script
Examples: in HTML, the lang attribute specifies the document language; subtitle tracks include a language code;
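The HTML case can be sketched with Python's standard html.parser module, which reports each start tag together with its attributes. The class name LangCollector, the helper collect_langs, and the sample markup are hypothetical and serve only to show how lang values might be read from a page.

```python
from html.parser import HTMLParser


class LangCollector(HTMLParser):
    """Record (tag, lang) pairs for every start tag that carries a lang attribute."""

    def __init__(self):
        super().__init__()
        self.langs = []  # pairs in document order

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        for name, value in attrs:
            if name == "lang" and value:
                self.langs.append((tag, value))


def collect_langs(html: str):
    parser = LangCollector()
    parser.feed(html)
    parser.close()
    return parser.langs


sample = (
    '<html lang="da"><body>'
    '<p>Hej <span lang="en">world</span></p>'
    "</body></html>"
)

print(collect_langs(sample))
```

Here the value on the html element gives the document-level language, while the value on the span marks an inline segment in another language.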
Sprogmarkering is foundational for multilingual information systems and language technology, aligning with standards such as ISO