Verkkokorporaa
Verkkokorporaa is a Finnish term for a systematic approach to building and using large-scale, web-derived corpora for linguistic research and natural language processing (NLP). The concept covers the collection, cleaning, annotation, and governance of text harvested from online sources, in support of empirical studies of Finnish and related NLP applications. The word combines verkko, meaning “net” or “web”, with korpora (from the Latin plural corpora, “corpora”) and the Finnish partitive case ending -a.
The idea emerged in Finnish academia in the early 2010s as researchers sought more diverse, up-to-date language
Data construction in verkkokorporaa involves selecting sources such as news portals, blogs, forums, government and organizational
Uses of verkkokorporaa span lexical and syntactic research, frequency and collocation analyses, language modeling, and the
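As an illustration of the frequency and collocation analyses mentioned above, the following is a minimal sketch rather than an established verkkokorporaa tool. It counts word frequencies and scores adjacent word pairs by pointwise mutual information (PMI). The function names, the `min_count` threshold, the regex tokenizer, and the three sample sentences are all assumptions made for this example; a real pipeline would likely use a dedicated Finnish tokenizer and lemmatizer.

```python
import math
import re
from collections import Counter

# Hypothetical sample documents standing in for cleaned web text.
docs = [
    "Helsingin yliopisto julkaisi uuden tutkimuksen.",
    "Helsingin yliopisto tutkii suomen kieltä.",
    "Uusi tutkimus käsittelee suomen kieltä.",
]


def tokenize(text):
    # Lowercase and keep runs of Unicode letters (so ä, ö, å survive).
    # This is a simplification: no lemmatization or compound splitting.
    return re.findall(r"[^\W\d_]+", text.lower())


def frequency_and_collocations(documents, min_count=2):
    """Return (unigram counts, PMI scores for adjacent word pairs)."""
    unigrams = Counter()
    bigrams = Counter()
    for doc in documents:
        tokens = tokenize(doc)
        unigrams.update(tokens)
        # Pair each token with its successor within the same document.
        bigrams.update(zip(tokens, tokens[1:]))

    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())

    pmi = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # Rare pairs give unstable PMI estimates.
        p_pair = count / total_bigrams
        p_w1 = unigrams[w1] / total_unigrams
        p_w2 = unigrams[w2] / total_unigrams
        pmi[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return unigrams, pmi


if __name__ == "__main__":
    counts, scores = frequency_and_collocations(docs)
    print(counts.most_common(4))
    for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(pair, round(score, 2))
```

PMI is only one of several common association measures; on a small corpus it tends to overrate rare pairs, which is why the sketch filters by `min_count` before scoring.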
See also: corpora, natural language processing, data ethics.