Low cardinality

Low cardinality is a term used in statistics and data management to describe a categorical variable that has a relatively small number of distinct values compared with the size of the dataset. Cardinality is the count of unique values in a column. A low-cardinality feature may have a handful of categories such as gender, payment method, or regional codes, whereas high cardinality examples include user identifiers or precise timestamps.
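As a minimal sketch of the definition above, the snippet below counts the distinct values in a column and relates that count to the number of rows. The helper names `cardinality` and `cardinality_ratio`, and the sample columns, are hypothetical and chosen only for illustration. No particular ratio is a standard cutoff for "low".

```python
def cardinality(values):
    """Return the number of distinct values in a column."""
    return len(set(values))


def cardinality_ratio(values):
    """Return distinct values divided by row count (0.0 for an empty column)."""
    values = list(values)
    if not values:
        return 0.0
    return len(set(values)) / len(values)


# Hypothetical sample columns: a categorical field and an identifier field.
payment_method = ["card", "cash", "card", "transfer", "card", "cash"]
user_id = ["u1", "u2", "u3", "u4", "u5", "u6"]

print(cardinality(payment_method), cardinality_ratio(payment_method))
print(cardinality(user_id), cardinality_ratio(user_id))
```

In this toy data the payment column has three distinct values across six rows, while every user identifier is unique. On a realistically sized dataset the ratio for a categorical field would typically be far smaller.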

In machine learning and data analysis, low-cardinality features are generally easier to encode and train with. One-hot encoding is commonly used for such features, producing a matrix with one column per category. This approach works well when there are only a few categories, but it can become unwieldy as cardinality grows. Alternatives for higher-cardinality data include label encoding, target encoding, or hashing tricks, which help manage dimensionality and sparsity.

From a database perspective, low-cardinality columns tend to be less selective, which can reduce index effectiveness for some queries but may enable efficient compression and specialized indexing approaches such as bitmap indexes or histograms used by query optimizers.

Practical handling often involves grouping rare categories into an “Other” bucket, creating derived features from the category, or ensuring consistent encoding across training and deployment to avoid data leakage. The characterization of “low” cardinality is relative and depends on the data size and the modeling or querying task.

See also: high cardinality, categorical encoding, feature engineering.
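The one-hot encoding and rare-category handling described in this article can be sketched in plain Python as follows. This is an illustrative sketch, not a reference implementation: the function names `fit_categories` and `one_hot`, the `min_count` threshold, and the sample data are all assumptions made for the example. The category list is learned from training data only and then reused unchanged at deployment, so that both stages produce vectors with the same layout.

```python
from collections import Counter

OTHER = "Other"  # bucket for rare or previously unseen categories


def fit_categories(train_values, min_count=2):
    """Learn the category list from training data only.

    Categories seen fewer than min_count times are folded into OTHER.
    The min_count default is an arbitrary choice for this example.
    """
    counts = Counter(train_values)
    kept = sorted(c for c, n in counts.items() if n >= min_count)
    return kept + [OTHER]


def one_hot(value, categories):
    """Encode one value as a 0/1 vector with one position per category."""
    target = value if value in categories else OTHER
    return [1 if c == target else 0 for c in categories]


# Hypothetical training column: "voucher" appears once, so it is rare.
train = ["card", "cash", "card", "cash", "voucher", "card"]
categories = fit_categories(train)

encoded_train = [one_hot(v, categories) for v in train]

# At deployment, reuse the same category list; unseen values map to OTHER.
print(categories)
print(one_hot("card", categories))
print(one_hot("crypto", categories))
```

Fixing the category list at training time keeps the number and order of columns stable, so a value that first appears in production falls into the "Other" position instead of changing the shape of the input.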