kMeans

K-means is a widely used unsupervised learning algorithm that partitions a dataset into k non-overlapping clusters. It seeks to minimize the within-cluster sum of squares, producing compact, roughly spherical groups when the data are suitable. The method operates on numeric feature vectors and is sensitive to feature scaling; standardization is typically recommended.
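
For instance, a minimal sketch of that standardization step, assuming scikit-learn is available and using a made-up two-feature matrix whose columns live on very different scales:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical data: two features with wildly different scales.
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])

# Standardize to zero mean and unit variance so the second feature
# does not dominate the Euclidean distances k-means relies on.
X_scaled = StandardScaler().fit_transform(X)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
```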

The algorithm begins by choosing k initial centroids, often at random or with a smarter initialization such as k-means++. Each point is assigned to the nearest centroid using a distance measure (usually Euclidean). Centroids are then updated as the mean of the points assigned to their clusters. The process repeats until convergence, typically when assignments or centroids stabilize or a maximum number of iterations is reached.
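
A minimal NumPy sketch of this loop (Lloyd's algorithm), assuming random initialization rather than k-means++ and purely illustrative parameter names:

```python
import numpy as np

def kmeans(X, k, n_iters=100, tol=1e-6, seed=0):
    """Plain NumPy sketch of Lloyd's algorithm with random initialization."""
    rng = np.random.default_rng(seed)
    # Step 1: choose k initial centroids by sampling distinct points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: update each centroid as the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop when the centroids stabilize or the iteration budget runs out.
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return centroids, labels
```

In practice one would usually reach for scikit-learn's KMeans instead, which implements the same loop with k-means++ initialization and multiple restarts via n_init.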

In practice, the complexity is O(n k d) per iteration for n points in d dimensions, giving O(n k d t) over t iterations. K-means is efficient for large data sets but has limitations: it requires specifying k in advance, is sensitive to initialization, feature scaling, and outliers, and assumes roughly convex, equally sized clusters.

Variants include k-means++ initialization, mini-batch k-means for large data, and alternatives such as k-medoids or kernel k-means.
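
As a rough illustration of two of these variants, assuming scikit-learn (whose KMeans defaults to k-means++ initialization and which provides MiniBatchKMeans) and synthetic placeholder data:

```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 8))  # placeholder data for illustration only

# k-means++ initialization spreads the initial centroids out,
# which usually reduces the chance of a bad local optimum.
km = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=0).fit(X)

# Mini-batch k-means updates centroids from small random batches,
# trading a little accuracy for much lower cost on large data sets.
mbkm = MiniBatchKMeans(n_clusters=5, batch_size=256, n_init=10, random_state=0).fit(X)
```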

Evaluation commonly uses inertia (the within-cluster sum of squares) and metrics like the silhouette score or the Davies-Bouldin index.
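
A short sketch of such an evaluation loop, again assuming scikit-learn and a synthetic blob dataset, comparing inertia, silhouette, and Davies-Bouldin scores across candidate values of k:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Synthetic data with four well-separated blobs, for illustration.
X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(
        f"k={k}  inertia={km.inertia_:.1f}  "
        f"silhouette={silhouette_score(X, km.labels_):.3f}  "
        f"davies_bouldin={davies_bouldin_score(X, km.labels_):.3f}"
    )
```

Inertia always decreases as k grows, so it is typically read via an elbow plot, while silhouette (higher is better) and Davies-Bouldin (lower is better) can be compared directly across values of k.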