MapReduce

What is MapReduce
MapReduce is a method for distributing a task across multiple nodes
Each node processes data stored on that node
--Where possible
Consists of two phases:
--Map
--Reduce

Features of MapReduce

Automatic parallelization and distribution
Fault-tolerance
Status and monitoring tools
A clean abstraction for programmers
--MapReduce programs are usually written in Java
MapReduce abstracts all the ‘housekeeping’ away from the developer 
--Developer can concentrate simply on writing the Map and Reduce functions

MapReduce: The Mapper
Hadoop attempts to ensure that Mappers run on nodes which hold their portion of the data locally, to avoid network traffic
--Multiple Mappers run in parallel, each processing a portion of the input data
The Mapper reads data in the form of key/value pairs
It outputs zero or more key/value pairs:

map(in_key, in_value)->(inter_key, inter_value) list

The Mapper may use or completely ignore the input key
--For example, a standard pattern is to read a line of a file at a time
--The key is the byte offset into the file at which the line starts
--The value is the contents of the line itself
--Typically the key is considered irrelevant
If it writes anything at all out, the output must be in the form of key/value pairs

MapReduce Example: Word Count
Count the number of occurrences of each word in a large amount of input data:
Map(input_key, input_value) 
  foreach word w in input_value:
    emit(w, 1)
Input to the mapper
(3414, 'the cat sat on the mat') 
(3437, 'the aardvark sat on the sofa')
Output from the mapper
('the', 1), ('cat', 1), ('sat', 1), ('on', 1),
('the', 1), ('mat', 1), ('the', 1), ('aardvark', 1),
('sat', 1), ('on', 1), ('the', 1), ('sofa', 1)
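A minimal Java sketch of this Mapper using the Hadoop API (the class name WordCountMapper and the whitespace tokenization are illustrative assumptions):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Word-count Mapper: the input key is the byte offset (ignored),
// the input value is one line of text; emits (word, 1) for every word.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);   // emit(w, 1)
      }
    }
  }
}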

MapReduce: The Reducer
After the Map phase is over, all the intermediate values for a given intermediate key are combined together into a list
This list is given to a Reducer
--There may be a single Reducer, or multiple Reducers
--All values associated with a particular intermediate key are guaranteed to go to the same Reducer
--The intermediate keys, and their value lists, are passed to the Reducer in sorted key order
--This step is known as the ‘shuffle and sort’
The Reducer outputs zero or more final key/value pairs
--These are written to HDFS
--In practice, the Reducer usually emits a single key/value pair for each input key

Example Reducer: Sum Reducer
Add up all the values associated with each intermediate key:
reduce(output_key, intermediate_vals)
  count = 0
  foreach v in intermediate_vals:
    count += v
  emit(output_key, count)
Reducer output
('aardvark',1)
('mat',1)
('on',2)
('sat',2)
('sofa',1)
('the',4)
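A matching Java sketch of this sum Reducer (the class name SumReducer is an illustrative assumption):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sum Reducer: adds up all the 1s emitted for each word and
// writes a single (word, count) pair per key.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int count = 0;
    for (IntWritable v : values) {
      count += v.get();
    }
    context.write(key, new IntWritable(count));
  }
}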



Big Data Analytics with MapReduce

Map-Reduce Example
[Figure: a Map-Reduce example running in parallel using multiple machines]

Applications
Machine Learning
--Apache Mahout
--A scalable machine learning library on top of Hadoop
Scientific calculations
--Apache Hama
--A Bulk Synchronous Parallel framework on top of HDFS for massive scientific calculations such as matrix, graph and network algorithms
Graph processing
--Apache Giraph
--An iterative graph processing system built for high scalability
Image Processing
--HIPI – Hadoop Image Processing Interface
Bioinformatics
--BLAST, SOM

Hadoop Terminology
Job
--A full program – an execution of a Mapper and Reducer across a data set
Task
--An execution of a Mapper or Reducer on a slice of data
Task Attempt
--A particular instance of an attempt to execute a task on a machine

How does Hadoop run a Map-Reduce job?
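A minimal sketch of the client-side driver that configures and submits such a job (WordCountDriver is an illustrative name; the Mapper and Reducer classes are the sketches above, and the input/output paths come from the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Word-count driver: describes the job to Hadoop, then submits it and
// waits for it to finish.
public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);

    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(SumReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Submit the job and poll for progress until it completes
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}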

MR Flow: Key Value pairs


Hadoop MR data flow

Failure
Task Failure
TaskTracker Failure
JobTracker Failure

Features
Combiner (see the sketch after this list)
Speculative Execution
Job Scheduling
--Fair Scheduler
--Capacity Scheduler
Counter
Distributed Cache
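A brief sketch of the Combiner and Counter features, assuming the illustrative word-count classes above (the counter group and names are made up for illustration):

// In the driver: because summing is associative and commutative, the
// reducer can also run as a combiner that pre-aggregates each map task's
// (word, 1) pairs locally before the shuffle and sort.
job.setCombinerClass(SumReducer.class);

// Inside a Mapper or Reducer: increment a custom counter that Hadoop
// aggregates across all tasks and reports back to the client.
context.getCounter("WordCount", "Empty Lines").increment(1);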

Key and Value Types
Utilizes Hadoop’s serialization mechanism for writing data in and out of the network, databases, or files
--Optimized for network serialization
--A set of basic types is provided
--Easy to implement your own
Values must implement the Writable interface
--Framework’s serialization mechanisms
--Defines how to read and write fields
--org.apache.hadoop.io package
Keys must implement WritableComparable interface
--Extends Writable and java.lang.Comparable<T>
--Required because keys are sorted prior to the reduce phase
Hadoop is shipped with many default implementations of WritableComparable<T>
--Wrappers for Java primitives and String (IntWritable, LongWritable, Text, etc.)
--Or you can implement your own

WritableComparable<T> Implementations

Implement Custom WritableComparable<T>
Implement 3 methods:
write(DataOutput)
--Serialize your attributes
readFields(DataInput)
--Deserialize your attributes
compareTo(T)
--Define how to order your objects
If your custom object is used as the key, it will be sorted prior to the reduce phase
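A hedged sketch of a custom key type implementing these three methods (the class name WordYearKey and its fields are purely illustrative):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// A composite (word, year) key: it can be used as a MapReduce key because
// it knows how to serialize itself and how to sort.
public class WordYearKey implements WritableComparable<WordYearKey> {

  private String word = "";
  private int year;

  public WordYearKey() {}                  // no-arg constructor required by the framework

  public WordYearKey(String word, int year) {
    this.word = word;
    this.year = year;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeUTF(word);                    // serialize the attributes
    out.writeInt(year);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    word = in.readUTF();                   // de-serialize in the same order
    year = in.readInt();
  }

  @Override
  public int compareTo(WordYearKey other) {
    int cmp = word.compareTo(other.word);  // order by word, then by year
    return cmp != 0 ? cmp : Integer.compare(year, other.year);
  }
}

In practice you would also override hashCode() and equals(), since the default HashPartitioner uses the key's hashCode() to decide which Reducer receives it.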

Framework’s Usage of InputFormat Implementation

InputFormat


OutputFormat


MR Design patterns
A template for solving a common and general data manipulation problem with MapReduce

Summarization: get a top-level view by summarizing and grouping data
Filtering: view data subsets such as records generated from one user (see the sketch after this list)
Data Organization: reorganize data to work with other systems, or to make MapReduce analysis easier
Join: analyze different datasets together to discover interesting relationships
Metapattern: piece together several patterns to solve multi-stage problems, or to perform several analytics in the same job
Input and Output: customize the way you use Hadoop to load or store data
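As a small illustration of the Filtering pattern above, a map-only job can keep just the records that match a predicate (the class name and the hard-coded user ID are hypothetical):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Filtering pattern: emit only the lines produced by one user, discard the rest.
public class UserFilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

  private static final String TARGET_USER = "user42";   // hypothetical predicate

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    if (value.toString().startsWith(TARGET_USER + ",")) {
      context.write(value, NullWritable.get());          // keep the matching record as-is
    }
  }
}

With job.setNumReduceTasks(0) this runs as a map-only job and the Mapper's output is written straight to HDFS.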
