Document-term

Document-term is a concept used in information retrieval and text mining to describe the relationship between a document and the terms it contains. It is central to the construction of representations that map text data to mathematical structures, such as vectors or matrices. Depending on context, document-term may refer to a pair (document, term) or to the term's occurrence within a document.

The document-term matrix (DTM) is a common representation. It is a two-dimensional sparse matrix with one row per document and one column per term in the vocabulary. Each entry records a weight for the term in the document, most commonly a raw count or a weighted value such as TF, IDF, or TF-IDF.

Term weighting helps distinguish informative terms from common words. TF counts reflect how often a term appears; IDF downscales terms that appear across many documents; TF-IDF combines both effects. Other schemes include binary indicators, log scaling, and sublinear TF.

Construction typically proceeds by collecting a corpus, tokenizing text, normalizing case, removing stopwords, and optionally applying stemming or lemmatization. The resulting vocabulary defines the document-term space, and the chosen weighting scheme determines the numeric matrix used for analysis.

Applications include document classification, clustering, information retrieval, and topic modeling. Limitations include high dimensionality, sparsity, and potential loss of semantic information. Dimensionality reduction, regularization, and more advanced representations such as word embeddings can mitigate these issues.
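
The construction and weighting steps described above can be illustrated with a minimal sketch in plain Python. The toy corpus, the stopword list, and the function names `tokenize`, `build_dtm`, and `tfidf` are all hypothetical and chosen only for illustration. The IDF formula used here, ln(N / df), is one common variant among several, and the matrix is stored as a dense list of lists for readability, whereas practical implementations usually use a sparse format.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus and stopword list, for illustration only.
CORPUS = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "Cats and dogs are pets.",
]
STOPWORDS = {"the", "on", "and", "are"}


def tokenize(text):
    """Lowercase the text, keep runs of letters, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def build_dtm(corpus):
    """Return (vocabulary, matrix), where matrix[i][j] is the raw count
    of vocabulary[j] in document i."""
    docs = [Counter(tokenize(text)) for text in corpus]
    vocabulary = sorted(set().union(*docs))
    # Counter returns 0 for terms absent from a document.
    matrix = [[doc[term] for term in vocabulary] for doc in docs]
    return vocabulary, matrix


def tfidf(matrix):
    """Multiply raw counts by idf = ln(N / df), an assumed IDF variant."""
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    # Document frequency: number of documents containing each term.
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / d) for d in df]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in matrix]


if __name__ == "__main__":
    vocabulary, counts = build_dtm(CORPUS)
    weights = tfidf(counts)
    print(vocabulary)
    for row in weights:
        print([round(w, 3) for w in row])
```

Because this sketch skips stemming and lemmatization, "cat" and "cats" occupy separate columns, which shows how the optional normalization step changes the vocabulary and therefore the shape of the matrix. The term "sat" appears in two of the three documents, so it receives a lower TF-IDF weight than terms confined to a single document.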