coresets

A coreset is a small, weighted subset of a dataset that serves as a proxy for the entire set with respect to a chosen objective. Formally, for a dataset X and a family of queries Q with cost functions f_q, a coreset S with weights w satisfies f_q(S) ∈ (1±ε)·f_q(X) for all q in Q, where f_q(S) is evaluated with the weights w and ε is a prescribed relative error. In clustering, a common focus is the k-means or k-median objective: a coreset preserves the value of the objective for any choice of k centers within a factor of (1±ε).

Coreset constructions are designed to produce small summaries that still yield provable guarantees. Methods include sampling points with probabilities related to their “sensitivity” to the objective, geometric or grid-based approaches, and streaming or distributed techniques that allow mergeable or composable coresets. Coresets can provide strong guarantees, preserving the objective for all center sets, or weak guarantees, preserving only the optimum value for the best centers. The size of a coreset typically increases with the number of centers k, grows as the error tolerance ε shrinks, and may depend on the data’s dimension or intrinsic complexity.

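To make the sensitivity idea concrete, here is a minimal Python sketch of sensitivity-style importance sampling for k-means. It assumes a rough set of centers is already available from some cheap bicriteria step (for example, k-means++ seeding), and it uses one common upper bound on sensitivity, cost(x)/total_cost + 1/|cluster(x)|; the function names and the exact bound are illustrative, not a fixed reference implementation.

    import numpy as np

    def kmeans_cost(X, centers):
        # squared distance from each point to its nearest center, plus assignment
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        return d2[np.arange(len(X)), assign], assign

    def sensitivity_coreset(X, rough_centers, m, rng):
        # importance-sample m points using an upper bound on sensitivity
        cost, assign = kmeans_cost(X, rough_centers)
        sizes = np.bincount(assign, minlength=len(rough_centers))
        # s(x) <= cost(x)/total_cost + 1/|cluster(x)|  (illustrative bound)
        s = cost / cost.sum() + 1.0 / sizes[assign]
        p = s / s.sum()
        idx = rng.choice(len(X), size=m, replace=True, p=p)
        w = 1.0 / (m * p[idx])  # makes the weighted cost an unbiased estimate
        return X[idx], w

Sampling proportionally to sensitivity concentrates the summary on points that can dominate the objective, such as outliers and small clusters, while the 1/(m·p) weights keep the weighted cost an unbiased estimate of the full cost.
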
Applications of coresets span large-scale data analysis and machine learning. They are used to accelerate approximate clustering, regression, and subspace approximation on big data, enabling faster experimentation and modeling. In streaming and distributed computing, coresets facilitate data reduction with mergeable properties, allowing efficient aggregation across partitions or time. By reducing data while preserving essential structure, coresets support scalable analytics without substantial loss of accuracy.

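The mergeable property admits a simple composition scheme: coresets of two partitions can be concatenated, and a merge-and-reduce pass keeps a streaming summary small. The sketch below, continuing the numpy code above, represents a coreset as a (points, weights) pair; the reduce step here is plain weight-proportional resampling, a stand-in for re-running a real weighted coreset construction.

    def merge(c1, c2):
        # union of two weighted coresets summarizes the union of their data
        P1, w1 = c1
        P2, w2 = c2
        return np.vstack([P1, P2]), np.concatenate([w1, w2])

    def reduce_coreset(coreset, m, rng):
        # shrink back to m points by weight-proportional resampling
        P, w = coreset
        p = w / w.sum()
        idx = rng.choice(len(P), size=m, replace=True, p=p)
        return P[idx], np.full(m, w.sum() / m)  # total weight is preserved

    def stream_summary(chunks, m, rng):
        # merge-and-reduce over a stream of raw data chunks
        summary = None
        for chunk in chunks:
            piece = (chunk, np.ones(len(chunk)))
            summary = piece if summary is None else merge(summary, piece)
            if len(summary[0]) > 2 * m:
                summary = reduce_coreset(summary, m, rng)
        return summary

Because errors compound across reduce levels, real streaming constructions tighten ε at each level of the merge tree; this sketch omits that bookkeeping.
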
Example: for k-means, a coreset is a weighted subset such that the sum of squared distances to any set of k centers is preserved within a factor of (1±ε): for every center set C with |C| = k, Σ_{s in S} w(s)·d(s,C)² lies within (1±ε) of Σ_{x in X} d(x,C)², where d(·,C) is the distance to the nearest center in C. This enables near-equivalent clustering results on the smaller summary.

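As a quick, informal check of this guarantee, the snippet below builds a coreset with the hypothetical sensitivity_coreset from the earlier sketch and compares the weighted coreset cost to the full cost on a few random center sets; ratios near 1 are what the (1±ε) property predicts.

    def weighted_cost(P, w, centers):
        # w-weighted sum of squared distances to the nearest center
        d2 = ((P[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return float((w * d2.min(axis=1)).sum())

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20000, 2))
    rough = X[rng.choice(len(X), size=5, replace=False)]  # crude stand-in for k-means++
    S, w = sensitivity_coreset(X, rough, m=500, rng=rng)

    for _ in range(3):
        C = rng.normal(size=(5, 2))  # an arbitrary set of k = 5 centers
        ratio = weighted_cost(S, w, C) / weighted_cost(X, np.ones(len(X)), C)
        print(round(ratio, 3))  # should land close to 1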