Dask

Dask is an open-source Python library designed to scale analytics from a single computer to a cluster. It provides parallel computing capabilities through high-level collections and APIs that resemble the familiar NumPy and pandas interfaces, enabling larger-than-memory (out-of-core) computations with minimal changes to existing code.

The core components of Dask include Dask arrays, Dask dataframes, and Dask bags, which mirror NumPy arrays, pandas DataFrames, and Python objects, respectively. There is also a lower-level delayed API and a futures interface for constructing and executing arbitrary task graphs. Computations in Dask are executed lazily, building a task graph that is processed when a compute() call is made or when results are persisted in memory.

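As a minimal sketch, assuming Dask is installed, the collections and the delayed API can be used as follows; the file path and column names in the dataframe example are hypothetical, and nothing runs until compute() is called.

import dask.array as da
import dask.dataframe as dd
from dask import delayed

# Dask array: a chunked, NumPy-like collection.
x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
mean = x.mean()               # builds a task graph; no work happens yet
print(mean.compute())         # executes the graph

# Dask dataframe: a pandas-like collection spanning many files.
# "data/*.csv", "key", and "value" are hypothetical names for illustration.
df = dd.read_csv("data/*.csv")
totals = df.groupby("key")["value"].sum()
print(totals.compute())

# The delayed API wraps ordinary functions into lazy tasks.
doubled = [delayed(lambda n: 2 * n)(i) for i in range(5)]
print(delayed(sum)(doubled).compute())   # -> 20

Calling .persist() instead of .compute() keeps intermediate results in memory for reuse across later computations.
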
Dask relies on a distributed runtime in which a central scheduler coordinates multiple workers, with a dashboard for monitoring. It can run on a single machine or across a cluster, and supports deployment via local processes, SSH, Kubernetes, YARN, or cloud environments. This flexibility allows scaling from multi-core laptops to large-scale clusters.

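For illustration, a brief sketch of the distributed scheduler running locally; the remote scheduler address in the comment is only a placeholder:

from dask.distributed import Client

# Start a local scheduler and worker processes on this machine.
# To use an existing cluster, connect to its scheduler instead,
# e.g. Client("tcp://scheduler-host:8786") (placeholder address).
client = Client(n_workers=4, threads_per_worker=2)
print(client.dashboard_link)     # URL of the monitoring dashboard

# Futures interface: submit arbitrary functions as tasks on the workers.
future = client.submit(sum, range(100))
print(future.result())           # -> 4950

client.close()
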
Dask integrates with the broader Python data ecosystem, working seamlessly with NumPy, pandas, and scikit-learn. It supports reading and writing various data formats, such as Parquet and CSV, and provides specialized tools through Dask-ML for scalable machine learning workflows. The project emphasizes out-of-core and parallelized data processing, enabling analyses that do not fit into memory or that benefit from distributed computation.

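A short sketch of Parquet I/O with the dataframe collection; the paths and column names are hypothetical, and a Parquet engine such as pyarrow is assumed to be installed:

import dask.dataframe as dd

# Lazily read a directory of Parquet files ("events/" is a hypothetical path).
df = dd.read_parquet("events/")

# Out-of-core, pandas-style aggregation.
daily = df.groupby("date")["amount"].sum()

# Write the result back out as partitioned Parquet files.
daily.to_frame().to_parquet("daily_totals/")
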
Originating as an open-source project led by Matthew Rocklin, Dask is maintained by a community of contributors and is widely used in data science, engineering, and scientific computing for scalable analytics and experimentation.
