MLlib

MLlib is the machine learning library of Apache Spark, designed to run scalable machine learning tasks on large datasets by leveraging distributed computation across a cluster. It provides a broad set of algorithms and utilities for building, training, and evaluating models within the Spark ecosystem.

The library offers two APIs, with the DataFrame-based spark.ml package recommended for new projects. This API emphasizes a pipeline-centric approach, using Estimators and Transformers to build end-to-end machine learning workflows. An older RDD-based API, spark.mllib, remains for backward compatibility but has been in maintenance mode since Spark 2.0: it receives bug fixes but no new features, and spark.ml is the primary API. MLlib supports Java, Scala, Python, and R through Spark’s language bindings, enabling a wide range of developers to use familiar tools.

MLlib covers supervised and unsupervised learning, including linear models (such as logistic and linear regression), tree-based methods (decision trees, random forests, gradient-boosted trees), linear support vector machines, clustering (k-means), and probabilistic models (Gaussian mixtures). It also provides collaborative filtering via alternating least squares and topic modeling with latent Dirichlet allocation. In addition, the library includes feature extraction and transformation utilities for preparing data, such as tokenization, indexing, one-hot encoding, vector assembly, scaling, and dimensionality reduction.

Modeling workflows in MLlib benefit from Spark’s distributed processing, enabling training and evaluation on large-scale data. The library supports model selection and evaluation through cross-validation and train-validation splits, with a range of metrics (accuracy, RMSE, MAE, AUC, R-squared). Models can be saved and reused, and pipelines facilitate reproducible, scalable machine learning workflows within Spark-based data processing environments.
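
The Estimator and Transformer contract behind the pipeline-centric API can be illustrated with a minimal pure-Python sketch. This is not Spark code: every class below (Tokenizer, VocabIndexer, Pipeline, and so on) is a hypothetical stand-in written for this example, and a plain list of dictionaries stands in for a DataFrame. The sketch assumes only the general pattern, in which an Estimator's fit step produces a fitted Transformer and a pipeline chains such stages in order.

```python
from typing import Dict, List

Row = Dict[str, object]  # hypothetical stand-in for one DataFrame row


class Transformer:
    """Maps a dataset to a new dataset, typically by adding a column."""

    def transform(self, rows: List[Row]) -> List[Row]:
        raise NotImplementedError


class Estimator:
    """Learns from a dataset and returns a fitted Transformer (a model)."""

    def fit(self, rows: List[Row]) -> Transformer:
        raise NotImplementedError


class Tokenizer(Transformer):
    """Stateless stage: splits a text column into lowercase words."""

    def __init__(self, input_col: str, output_col: str):
        self.input_col, self.output_col = input_col, output_col

    def transform(self, rows):
        return [{**r, self.output_col: str(r[self.input_col]).lower().split()}
                for r in rows]


class VocabIndexerModel(Transformer):
    """Fitted stage: replaces known words with their learned indices."""

    def __init__(self, vocab, input_col, output_col):
        self.vocab, self.input_col, self.output_col = vocab, input_col, output_col

    def transform(self, rows):
        # Words not seen during fit are dropped.
        return [{**r, self.output_col: [self.vocab[w] for w in r[self.input_col]
                                        if w in self.vocab]}
                for r in rows]


class VocabIndexer(Estimator):
    """Learns a word-to-index vocabulary from the training data."""

    def __init__(self, input_col: str, output_col: str):
        self.input_col, self.output_col = input_col, output_col

    def fit(self, rows):
        words = sorted({w for r in rows for w in r[self.input_col]})
        vocab = {w: i for i, w in enumerate(words)}
        return VocabIndexerModel(vocab, self.input_col, self.output_col)


class PipelineModel(Transformer):
    """A fitted pipeline: applies each fitted stage in order."""

    def __init__(self, stages: List[Transformer]):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline(Estimator):
    """Fits Estimator stages in sequence, passing transformed data forward."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            model = stage.fit(rows) if isinstance(stage, Estimator) else stage
            rows = model.transform(rows)  # later stages see earlier outputs
            fitted.append(model)
        return PipelineModel(fitted)


training = [{"text": "Spark runs on clusters"}, {"text": "MLlib runs on Spark"}]
pipeline = Pipeline([Tokenizer("text", "words"), VocabIndexer("words", "ids")])
model = pipeline.fit(training)
result = model.transform([{"text": "Spark MLlib rocks"}])
print(result[0]["ids"])  # prints [4, 1]
```

Because the fitted pipeline is itself a Transformer, the same preprocessing learned on the training data is applied unchanged to new data, which is what makes pipeline-based workflows reproducible.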