Clustering

Clustering is a task in unsupervised learning and statistics that groups objects so that those within a cluster are more similar to each other than to objects in other clusters. The goal is to discover structure, patterns, or subgroups without predefined labels.

Clustering relies on a measure of similarity or distance between objects and supports numerical, categorical, or mixed data. Algorithms fall into partitioning, hierarchical, density-based, grid-based, and model-based families, with examples such as k-means, hierarchical clustering, DBSCAN, and Gaussian mixture models.

Partitioning methods assign objects to a predefined number of clusters by optimizing cohesion. Hierarchical clustering builds a cluster tree through merging or splitting using various linkage criteria, while density-based methods identify dense regions separated by sparse areas.

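As a rough illustration of the first three families, the sketch below runs k-means (partitioning), agglomerative clustering with Ward linkage (hierarchical), and DBSCAN (density-based) on the same synthetic data. The use of scikit-learn, the toy dataset, and the parameter values are assumptions made for the example, not prescriptions from the text above.

```python
# Illustrative sketch (assumes scikit-learn and NumPy are available):
# one partitioning, one hierarchical, and one density-based algorithm
# applied to the same synthetic 2-D data.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

# Synthetic data: 300 points drawn from 3 Gaussian blobs (assumed setup).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Partitioning: k-means requires the number of clusters up front.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Hierarchical: agglomerative merging with Ward linkage, cut into 3 clusters.
agglo_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

# Density-based: DBSCAN finds dense regions; eps and min_samples are guesses here.
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

print("k-means clusters:       ", np.unique(kmeans_labels))
print("agglomerative clusters: ", np.unique(agglo_labels))
print("DBSCAN clusters (-1 = noise):", np.unique(dbscan_labels))
```
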
Model-based clustering assumes data arise from a mixture of distributions; parameters are inferred by methods such as Expectation-Maximization. Grid-based methods quantize the feature space into cells and cluster adjacent dense cells.

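The model-based part of this description can be illustrated with a Gaussian mixture fitted by Expectation-Maximization. The sketch below assumes scikit-learn and a synthetic dataset; the component count and covariance setting are illustrative choices, not requirements.

```python
# Illustrative sketch (assumes scikit-learn): a Gaussian mixture model
# whose parameters are estimated with the Expectation-Maximization algorithm.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

# GaussianMixture runs EM internally: the E-step computes responsibilities,
# the M-step re-estimates means, covariances, and mixing weights.
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
labels = gmm.fit_predict(X)

print("Mixing weights:", gmm.weights_.round(3))
print("Component means:\n", gmm.means_.round(2))
# Soft assignments: per-point probabilities of belonging to each component.
print("First point responsibilities:", gmm.predict_proba(X[:1]).round(3))
```
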
The choice of distance measure and feature scaling strongly influences results. Common metrics include Euclidean, Manhattan, cosine, and Jaccard for binary or categorical data.

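A minimal sketch of these metrics, assuming SciPy's distance functions and two toy vectors (binary vectors for the Jaccard case):

```python
# Illustrative sketch (assumes SciPy and NumPy): common distance metrics
# evaluated on small toy vectors.
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, cosine, jaccard

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

print("Euclidean:", euclidean(a, b))     # straight-line distance
print("Manhattan:", cityblock(a, b))     # sum of absolute differences
print("Cosine distance:", cosine(a, b))  # 1 - cosine similarity (0 here: same direction)

# Jaccard dissimilarity is defined on binary (presence/absence) data.
u = np.array([1, 1, 0, 1], dtype=bool)
v = np.array([1, 0, 0, 1], dtype=bool)
print("Jaccard:", jaccard(u, v))         # 1 - |intersection| / |union|
```
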
Evaluation uses internal indices, such as the silhouette coefficient and Davies-Bouldin index, or external indices like the Rand index and mutual information when a ground truth is available.

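A short sketch of both kinds of evaluation, assuming scikit-learn; the adjusted Rand index and adjusted mutual information are used here as commonly implemented variants of the external indices named above.

```python
# Illustrative sketch (assumes scikit-learn): internal and external
# evaluation of a k-means clustering on labeled synthetic data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             adjusted_rand_score, adjusted_mutual_info_score)

X, y_true = make_blobs(n_samples=300, centers=3, random_state=1)
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

# Internal indices: no ground truth needed.
print("Silhouette:    ", silhouette_score(X, labels))        # higher is better
print("Davies-Bouldin:", davies_bouldin_score(X, labels))    # lower is better

# External indices: compare against the known generating labels.
print("Adjusted Rand: ", adjusted_rand_score(y_true, labels))
print("Adjusted MI:   ", adjusted_mutual_info_score(y_true, labels))
```
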
Applications include market segmentation, image and text clustering, bioinformatics, anomaly detection, and social network analysis.

Limitations include choosing the number of clusters, sensitivity to outliers and scaling, difficulties in high-dimensional data, and challenges in interpreting and validating clusters.

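One common, though not definitive, way to cope with the first two limitations is sketched below: standardize the features and compare silhouette scores over a range of candidate cluster counts. The use of scikit-learn, the toy data, and the candidate range are assumptions of the example.

```python
# Illustrative sketch (assumes scikit-learn): selecting a cluster count k
# by comparing silhouette scores on standardized features.
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=7)
X_scaled = StandardScaler().fit_transform(X)  # reduces sensitivity to feature scale

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=7).fit_predict(X_scaled)
    print(f"k={k}  silhouette={silhouette_score(X_scaled, labels):.3f}")
# The k with the highest silhouette is a reasonable, not definitive, choice.
```
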
The field dates to mid-20th century developments, with k-means popularized in the 1950s–1980s.