corpuslinguïstisch

Corpuslinguïstisch, or corpus linguistics, is a branch of linguistics that studies language by collecting and analyzing large electronic collections of texts called corpora. It combines quantitative methods, such as frequency counts and statistical analyses, with qualitative analysis of patterns in actual language use. The central idea is that language phenomena are best understood by examining large, authentic samples rather than isolated examples.
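As a rough illustration of the frequency counts mentioned above, the following Python sketch tokenizes a tiny made-up sample and computes raw and relative frequencies. The `tokenize` and `relative_frequencies` helpers, the simple regular-expression tokenizer, and the sample sentence are assumptions made for this example; real corpus work would normally use a proper tokenizer and a much larger text collection.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and extract alphabetic word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def relative_frequencies(tokens, per=1_000_000):
    """Scale raw counts to a rate per `per` tokens so corpora of different sizes can be compared."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count * per / total for word, count in counts.items()}

# Hypothetical toy sample standing in for a corpus.
sample = "The cat sat on the mat. The dog sat on the log."
tokens = tokenize(sample)
counts = Counter(tokens)

print(counts.most_common(3))
print(round(relative_frequencies(tokens, per=1000)["the"], 2))
```

Normalizing to a fixed base, such as per million words, is what makes frequencies comparable between corpora of different sizes.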

Corpora vary in size and scope and can be general, covering broad language use across genres, or specialized, such as spoken dialogue, academic prose, legal texts, or learner language. Notable benchmark corpora include the Brown Corpus, the British National Corpus, and the Corpus of Contemporary American English, with many language-specific and domain-specific corpora, as well as web-derived and parallel corpora for translation studies. Corpora are typically annotated for features such as part of speech, lemmas, morphology, syntax, and semantics, enabling advanced queries and analyses.

Common methods include concordance analysis, keyword and keyness analysis, collocation and dispersion studies, frequency profiling, and distributional semantics. Tools range from corpus query systems to tagging and parsing pipelines and scripting libraries. Applications span lexicography, language teaching, translation studies, sociolinguistics, and natural language processing resource development.

Limitations include representativeness, sampling bias, annotation quality, and the resource demands of less-studied languages. Ethical considerations involve data licensing and privacy of spoken corpora. The field continues to evolve with web and multimodal corpora, annotation schemes, and practices that emphasize reproducibility.
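
Two of the methods named above, concordance analysis and collocation study, can be sketched in a few lines of Python. This is a minimal illustration, not the approach of any particular corpus tool: the `kwic` and `collocates` functions, the window sizes, and the sample sentence are all hypothetical, and the collocate list uses raw co-occurrence counts where a real study would usually apply an association measure.

```python
from collections import Counter

def kwic(tokens, node, width=3):
    """Return (left context, node, right context) for every occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((" ".join(left), tok, " ".join(right)))
    return lines

def collocates(tokens, node, window=2):
    """Count the words occurring within `window` tokens on either side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            span = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(span)
    return counts

# Hypothetical toy text; whitespace splitting stands in for real tokenization.
text = "strong tea and strong coffee are popular but powerful tea is rare"
tokens = text.split()

# Print a keyword-in-context display with the node word aligned in a column.
for left, node, right in kwic(tokens, "tea", width=2):
    print(f"{left:>15} | {node} | {right}")

print(collocates(tokens, "tea", window=1))
```

Aligning the node word in a fixed column is the conventional concordance layout, since it lets recurring patterns to the left and right be read at a glance.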