Corpora
Corpora, plural of corpus, are structured collections of authentic language data assembled for linguistic analysis. They may include written texts, transcripts of spoken language, or both, and are typically annotated with metadata such as genre, date, author, and source. Corpora serve as the empirical basis for linguistic description, the study of language variation, and the development of language technologies.
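As a rough illustration of how texts and their metadata can be held together, the following minimal sketch models a corpus as a list of records and counts word frequencies per genre. The `Document` class, the `tokenize` and `frequencies_by_genre` helpers, and the two sample sentences are all invented for this example; they do not come from any particular corpus or toolkit, and the whitespace tokenizer is a deliberate simplification.

```python
from collections import Counter
from dataclasses import dataclass

PUNCTUATION = ".,;:!?\"'()"


@dataclass
class Document:
    """One corpus text plus the metadata fields named above (hypothetical layout)."""
    text: str
    genre: str
    date: str
    author: str
    source: str


def tokenize(text):
    """Naive tokenizer: split on whitespace, strip edge punctuation, lowercase."""
    tokens = []
    for raw in text.split():
        word = raw.strip(PUNCTUATION).lower()
        if word:
            tokens.append(word)
    return tokens


def frequencies_by_genre(corpus):
    """Return a dict mapping each genre to a Counter of its token frequencies."""
    counts = {}
    for doc in corpus:
        counts.setdefault(doc.genre, Counter()).update(tokenize(doc.text))
    return counts


# Toy data, made up purely for illustration.
corpus = [
    Document("The cat sat on the mat.", "fiction", "2001", "A. Author", "example-1"),
    Document("The market fell sharply.", "news", "2002", "B. Writer", "example-2"),
]

freqs = frequencies_by_genre(corpus)
print(freqs["fiction"].most_common(1))
```

Keeping the metadata alongside each text is what makes it possible to compare usage across genres, periods, or authors; a real corpus would typically store this information in a dedicated format rather than in in-memory objects.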
Corpora vary by content and purpose. General corpora aim to represent broad language use, while specialized
Most corpora are annotated to facilitate analysis, with layers such as tokens, lemmas, part-of-speech tags, syntactic
Creating a corpus involves data collection, cleaning, annotation, and validation, along with consideration of licensing, representativeness,
Applications include linguistic description, lexicography, language education, and the development and evaluation of natural language processing