tekstikogude
Tekstikogude (text corpora) are large, structured collections of natural language text that are used for linguistic research and natural language processing. They typically consist of raw texts plus metadata (language, source, date, genre) and may include linguistic annotations such as tokenization, lemmatization, part-of-speech tags, syntactic parses, named entities, or semantic roles.
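As a rough illustration of that structure, the sketch below models a corpus as a list of documents, each holding raw text, a metadata dictionary, and a token layer whose annotation slots start empty. The names `Token`, `Document`, `tokenize`, `build_corpus`, and `token_frequencies` are hypothetical and chosen only for this example, and the regular-expression tokenizer is a deliberately naive stand-in for a real one.

```python
import re
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Token:
    """One token; lemma and pos stay None until an annotator fills them in."""
    form: str
    lemma: Optional[str] = None
    pos: Optional[str] = None


@dataclass
class Document:
    """Raw text plus metadata (language, source, date, genre) and a token layer."""
    text: str
    metadata: dict = field(default_factory=dict)
    tokens: list = field(default_factory=list)


def tokenize(text):
    # Naive tokenizer (an assumption for this sketch): a token is either a run
    # of word characters or a single non-space punctuation character.
    return [Token(form=match) for match in re.findall(r"\w+|[^\w\s]", text)]


def build_corpus(records):
    # Each record is a (text, metadata) pair; the metadata dict is copied so
    # that the corpus does not share state with the caller's input.
    return [
        Document(text=text, metadata=dict(metadata), tokens=tokenize(text))
        for text, metadata in records
    ]


def token_frequencies(corpus):
    # Case-insensitive frequency count of token forms across all documents.
    counts = Counter()
    for doc in corpus:
        counts.update(tok.form.lower() for tok in doc.tokens)
    return counts


# Made-up sample records, used only to exercise the functions above.
corpus = build_corpus([
    ("The cat sat.", {"language": "en", "genre": "fiction", "date": "2020"}),
    ("The dog barked!", {"language": "en", "genre": "news", "date": "2021"}),
])
freqs = token_frequencies(corpus)
```

Keeping metadata and annotations alongside the raw text, rather than replacing it, means later annotation layers such as lemmas or part-of-speech tags can be added without losing the original wording.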
They can be monolingual, bilingual or multilingual, and vary by size, genre, domain, and time period. General-purpose
Creation involves collecting texts from publishers, websites, and digitized print sources, and preparing the data through cleaning,
Uses include linguistic analysis (morphology, syntax, semantics), lexicography and language resource development, training and evaluating NLP
Examples of well-known corpora include general-language corpora and domain-specific collections, multilingual corpora, and web-scale corpora. In