Hadoop consists of two core components:
--The Hadoop Distributed File System (HDFS)
--The MapReduce software framework
There are many other projects based around core Hadoop
--Often referred to as the ‘Hadoop Ecosystem’
--Pig, Hive, HBase, Flume, Oozie, Sqoop, etc.
--Many are discussed later in the course
A set of machines running HDFS and MapReduce is known as a Hadoop Cluster
--Individual machines are known as nodes
--A cluster can have as few as one node or as many as several thousand
--More nodes generally mean better performance
Hadoop Components: HDFS
HDFS, the Hadoop Distributed File System, is responsible for storing data on the cluster.
Data files are split into blocks and distributed across multiple nodes in the cluster.
Each block is replicated multiple times
--Default is to replicate each block three times
--Replicas are stored on different nodes
--This ensures both reliability and availability
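The splitting and replication described above can be sketched in a few lines of Python. This is an illustrative simulation only: the tiny block size, node names, and round-robin placement are assumptions made for the example; real HDFS uses a 128 MB default block size and rack-aware replica placement.

```python
BLOCK_SIZE = 4     # toy block size in bytes (HDFS default is 128 MB)
REPLICATION = 3    # HDFS default replication factor
NODES = ["node1", "node2", "node3", "node4"]  # hypothetical cluster nodes

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Split raw bytes into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, nodes=NODES, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes (simple round-robin)."""
    placement = {}
    for idx in range(len(blocks)):
        placement[idx] = [nodes[(idx + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"hello hadoop!")
placement = place_replicas(blocks)
# Each of the 4 blocks now lives on 3 different nodes, so losing any
# single node still leaves two copies of every block available.
```

The key point the sketch shows: because replicas of a block never share a node, the failure of one machine costs at most one copy of each affected block.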
Hadoop Components: MapReduce
MapReduce is the system used to process data in the Hadoop cluster.
It consists of two phases: Map, followed by Reduce.
Each Map task operates on a discrete portion of the overall dataset
--Typically one HDFS data block
After all Map tasks are complete, the MapReduce system distributes the intermediate data to the nodes that perform the Reduce phase.
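The Map/shuffle/Reduce flow described above can be sketched with the classic word-count example. This is a single-process Python simulation, not Hadoop's Java API: each input string stands in for one HDFS block, and the `shuffle` step models the framework grouping intermediate pairs by key before the Reduce phase.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(split):
    """Map: emit an intermediate (word, 1) pair for every word in one split."""
    for word in split.split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle/sort: group intermediate (key, value) pairs by key."""
    pairs = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, [value for _, value in group]

def reduce_phase(key, values):
    """Reduce: combine all values for one key (here, sum the counts)."""
    return key, sum(values)

splits = ["the quick brown fox", "the lazy dog", "the fox"]  # stand-ins for HDFS blocks
intermediate = [pair for split in splits for pair in map_phase(split)]
result = dict(reduce_phase(key, values) for key, values in shuffle(intermediate))
# result == {"brown": 1, "dog": 1, "fox": 2, "lazy": 1, "quick": 1, "the": 3}
```

Note that each Map call sees only its own split, mirroring how a real Map task operates on one HDFS block; only the shuffle brings together values for the same key from different splits.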