Hadoop

Hadoop is an open-source framework for distributed storage and processing of large data sets across clusters of commodity hardware. It is designed to scale from single servers to thousands of machines, handling petabytes of data. Its core components are the Hadoop Distributed File System (HDFS) for storage, the MapReduce programming model for processing, and Yet Another Resource Negotiator (YARN) for resource management. While originally built around MapReduce, Hadoop now supports a variety of data processing engines that run on the same ecosystem.

History: Hadoop grew out of the Apache Nutch web-crawler project and was inspired by Google's published papers on its distributed file system and MapReduce; much of its early development was driven at Yahoo. It was created by Doug Cutting and Mike Cafarella and named after Cutting's son's toy elephant. The project was open-sourced and joined the Apache Software Foundation in 2006. Over time, Hadoop evolved from a two-component system into a broader ecosystem, with Hadoop 2.x introducing YARN to separate resource management from job execution.

Architecture: HDFS stores data on a cluster of DataNodes, managed by a centralized NameNode. Files are split into blocks, and each block is replicated across several nodes to provide fault tolerance.
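
As a rough illustration of how a client sees this layout, the Java sketch below uses the Hadoop FileSystem API to write a small file and then ask the NameNode where its blocks and their replicas are stored. The namenode address and the file path are placeholders, and the program assumes a hadoop-client dependency on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import java.nio.charset.StandardCharsets;

    public class HdfsBlocksExample {
        public static void main(String[] args) throws Exception {
            // Point the client at the cluster; "hdfs://namenode:8020" is a placeholder URI.
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020");
            FileSystem fs = FileSystem.get(conf);

            // Write a small file; HDFS splits larger files into blocks behind the scenes.
            Path path = new Path("/tmp/example.txt");
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("hello hadoop\n".getBytes(StandardCharsets.UTF_8));
            }

            // Ask the NameNode where the file's blocks (and their replicas) live.
            FileStatus status = fs.getFileStatus(path);
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset=" + block.getOffset()
                        + " length=" + block.getLength()
                        + " hosts=" + String.join(",", block.getHosts()));
            }
            fs.close();
        }
    }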

MapReduce provides a model for parallel data processing by distributing work across nodes, while YARN handles cluster resource scheduling and monitoring. The ecosystem includes tools such as Hive for SQL-like queries, Pig for dataflow scripting, and HBase for column-oriented storage, enabling a range of analytics tasks. Hadoop emphasizes data locality: computation is moved to the nodes where the data resides rather than moving the data to the computation.
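
The MapReduce model is conventionally illustrated with a word-count job: the mapper emits a (word, 1) pair for every token in its input split, and the reducer sums the counts for each word. The sketch below uses the org.apache.hadoop.mapreduce API; the input and output directories are passed as arguments and are assumed to be HDFS paths on an already configured cluster (the job would typically be packaged as a JAR and submitted with the hadoop jar command).

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: runs near the input blocks and emits (word, 1) for every token.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: receives all counts for a given word and sums them.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // optional local pre-aggregation
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }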

Usage and scope: Hadoop is well suited to batch processing of very large data sets on inexpensive commodity hardware. Modern deployments often integrate Hadoop with other engines such as Apache Spark for faster in-memory processing, while some organizations migrate to cloud-based storage and processing services. Hadoop remains a foundational layer in many big data architectures.
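
As a sketch of that kind of integration, the Java program below uses Apache Spark to read a file that lives in HDFS, cache it in memory, and run two simple counts, so HDFS remains the storage layer while a different engine performs the processing. The HDFS path is a placeholder, and the example assumes the spark-sql dependency; the cluster manager (for example YARN) would be chosen when the job is submitted.

    import org.apache.spark.api.java.function.FilterFunction;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.SparkSession;

    public class SparkOnHdfsExample {
        public static void main(String[] args) {
            // The master (e.g. YARN) is normally supplied at submit time via spark-submit.
            SparkSession spark = SparkSession.builder()
                    .appName("spark-on-hdfs")
                    .getOrCreate();

            // Load a text file already stored in HDFS (placeholder path) and keep it in memory.
            Dataset<String> lines = spark.read().textFile("hdfs://namenode:8020/data/logs.txt");
            lines.cache();

            long total = lines.count();
            long errors = lines.filter((FilterFunction<String>) line -> line.contains("ERROR")).count();

            System.out.println("total lines: " + total + ", error lines: " + errors);
            spark.stop();
        }
    }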
