MLlib
MLlib is Apache Spark's machine learning library, designed to run machine learning workloads on large datasets by distributing computation across a cluster. It provides a broad set of algorithms and utilities for building, training, and evaluating models within the Spark ecosystem.
The library offers two APIs: the older RDD-based spark.mllib package, which is in maintenance mode, and the DataFrame-based spark.ml package, which is recommended for new projects. This API
MLlib covers supervised and unsupervised learning, including linear models (such as logistic and linear regression), tree-based
Modeling workflows in MLlib benefit from Spark’s distributed processing, enabling training and evaluation on large-scale data.