What is MapReduce?

MapReduce is the heart of Hadoop. It is the programming model that allows a job to scale across hundreds or thousands of servers in a Hadoop cluster.

A MapReduce program is composed of a Map() method that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() method that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies).

The term MapReduce actually refers to two distinct tasks that Hadoop programs perform. The first is the map job, which takes a set of data and converts it into another set of data in which individual elements are broken down into tuples (key/value pairs). The second is the reduce job, which takes the output of a map as its input and combines those tuples into a smaller set of tuples. The reduce job is always performed after the map job.
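
To make the model concrete before bringing Hadoop into it, here is a toy, single-process Java sketch of the students-by-name example above. The student names are made up, and the groupingBy call stands in for the shuffle step that the framework performs between map and reduce:

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class NameFrequency {
    public static void main(String[] args) {
        // Input records: student first names (hypothetical data).
        List<String> students =
                List.of("Asha", "Ravi", "Asha", "Meena", "Ravi", "Asha");

        // Map + shuffle: each name becomes a (name, 1) tuple, and
        // groupingBy collects the tuples into one queue per name.
        // Reduce: counting() summarizes each queue into a frequency.
        Map<String, Long> frequencies = students.stream()
                .collect(Collectors.groupingBy(name -> name, Collectors.counting()));

        frequencies.forEach((name, count) ->
                System.out.println(name + " -> " + count));  // e.g. Asha -> 3
    }
}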

Example

Let’s look at a simple example. Assume you have five files, and each file contains two columns (a key and a value, in Hadoop terms) that represent a city and the temperature recorded in that city on various measurement days. In this example, city is the key and temperature is the value.

Mumbai, 20
Chennai, 25
New Delhi, 22
Hyderabad, 32
Mumbai, 4
Hyderabad, 33
New Delhi, 18

We want to find the maximum temperature for each city across all five files.

Using the MapReduce framework, we can break this down into five map tasks, one for each file. Each mapper goes through its file and returns the maximum temperature it finds for each city (a sketch of such a mapper, using Hadoop's Java API, follows the intermediate results below). For example, the results produced by one mapper task for the data above would look like this:

(Mumbai, 20) (Chennai, 25) (New Delhi, 22) (Hyderabad, 33)

Let’s assume the other four mapper tasks (working on the other four files not shown here) produced the following intermediate results:

(Mumbai, 18) (Chennai, 27) (New Delhi, 32) (Hyderabad, 37)
(Mumbai, 32) (Chennai, 20) (New Delhi, 33) (Hyderabad, 38)
(Mumbai, 22) (Chennai, 19) (New Delhi, 20) (Hyderabad, 31)
(Mumbai, 31) (Chennai, 22) (New Delhi, 19) (Hyderabad, 30)
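
Each of these per-file maximum lists could be produced by a mapper like the following sketch. This is one plausible way to write it against Hadoop's Java MapReduce API; the class name is illustrative, and the simpler textbook variant would instead emit every (city, temperature) record and let a combiner compute the per-file maximums:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Tracks a running maximum per city and emits one (city, max) pair
// per input split when the mapper finishes.
public class MaxTemperatureMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final Map<String, Integer> maxPerCity = new HashMap<>();

    @Override
    protected void map(LongWritable offset, Text line, Context context) {
        String[] parts = line.toString().split(",");  // e.g. "Mumbai, 20"
        if (parts.length == 2) {
            String city = parts[0].trim();
            int temp = Integer.parseInt(parts[1].trim());
            maxPerCity.merge(city, temp, Math::max);
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        for (Map.Entry<String, Integer> e : maxPerCity.entrySet()) {
            context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
        }
    }
}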

All five of these per-mapper output streams would then be fed into the reduce tasks, which combine the intermediate results and output a single value for each city, producing a final result set as follows:

(Mumbai, 32) (Chennai, 27) (New Delhi, 33) (Hyderabad, 38)
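
The reduce side is simpler: for each city, the framework hands the reducer all of the intermediate values for that key, and the reducer keeps the largest. A sketch against the same Hadoop API:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives every intermediate (city, max) pair for one city
// and keeps the overall maximum.
public class MaxTemperatureReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text city, Iterable<IntWritable> temps, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable t : temps) {
            max = Math.max(max, t.get());
        }
        context.write(city, new IntWritable(max));
    }
}

To run the whole job, a small driver class wires the mapper and reducer together. The class name and the input/output paths (passed as the two command-line arguments) are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "max temperature");
        job.setJarByClass(MaxTemperature.class);
        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}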



Gopikrishna
