quanteda

Quanteda is an open-source framework for quantitative text analysis in the R programming environment. It provides tools for the management, processing, and analysis of large text corpora, enabling researchers to prepare and summarize textual data efficiently. The package emphasizes memory-efficient data structures and fast operations on sparse matrices, making it suitable for large collections such as social media posts or historical texts.

Core objects in quanteda are corpus, tokens, and document-feature matrices (DFMs). A corpus stores documents and metadata; tokens represent tokenized units; a DFM records feature frequencies or weights across documents. Typical workflows include creating a corpus from raw text, cleaning and tokenizing with tokens(), applying preprocessing such as lowercasing, stopword removal, stemming or lemmatization, generating n-grams, and building a DFM with dfm(). The tokens and DFM objects can then be used for frequency analysis, keyword-in-context, collocations, and a range of statistical metrics.

Quanteda integrates with companion packages such as quanteda.textstats for text statistics, quanteda.textmodels for scaling models and text classification, and quanteda.textplots for visualization. It supports parallel processing, efficient interoperability with standard R data structures, and conversions to matrix formats compatible with other packages. The package is widely used in digital humanities, political science, sociolinguistics, and social science research.

It is an active open-source project with extensive documentation and community contributions, and it is commonly taught in text mining and digital humanities courses as a core toolkit for reproducible research.
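
The corpus, tokens, and DFM workflow described above can be sketched in a few lines of R. This is a minimal illustration, not a recommended pipeline: the two toy documents, the `source` metadata column, and all variable names are invented for the example, and it assumes a recent quanteda release (version 3 or later) in which `kwic()` takes a tokens object. Lemmatization is left out because it depends on an external lemma table.

```r
library(quanteda)

# Hypothetical toy documents, invented purely for illustration
txt <- c(doc1 = "The quick brown fox jumps over the lazy dog.",
         doc2 = "A quick brown dog barks at the fox.")

# 1. Corpus: documents plus document-level metadata (docvars)
corp <- corpus(txt, docvars = data.frame(source = c("a", "b")))

# 2. Tokens: tokenize, then preprocess step by step
toks <- tokens(corp, remove_punct = TRUE)
toks <- tokens_tolower(toks)                    # lowercasing
toks <- tokens_remove(toks, stopwords("en"))    # stopword removal
toks <- tokens_wordstem(toks)                   # stemming

# Optional: unigrams and bigrams from the cleaned tokens
toks_ng <- tokens_ngrams(toks, n = 1:2)

# 3. Document-feature matrix: sparse counts of features per document
dfmat <- dfm(toks)

# Frequency analysis on the DFM
topfeatures(dfmat, 5)

# Keyword-in-context works on tokens (word order is needed), not on the DFM
kwic(tokens(corp), pattern = "fox", window = 2)

# Conversion to a dense base R matrix for use with other packages
m <- as.matrix(dfmat)
```

Preprocessing is applied to the tokens object before `dfm()` is called because a DFM discards word order, which n-gram generation and keyword-in-context both require.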