Introduction

The Hadoop ecosystem is a framework for Big Data. It originated at Yahoo in 2005. A major benefit of the Hadoop ecosystem is that its components are open source.

Major goals of Hadoop ecosystem

  1. Enable Scalability
  2. Handle Fault Tolerance
  3. Optimized for a Variety of Data Types
  4. Facilitate a Shared Environment
  5. Provide Value

 

Figure 1. The Hadoop ecosystem (Source: https://www.coursera.org/learn/big-data-introduction/lecture/BpHNu/the-hadoop-ecosystem-welcome-to-the-zoo)

Three main components of Hadoop:

HDFS (Hadoop Distributed File System)

  • Foundation of many Big Data frameworks
  • Allows scalable and reliable storage

YARN

  • Flexible scheduling and resource management

MapReduce

  • Programming model that simplifies parallel computing
  • Instead of dealing with the complexity of synchronization and scheduling, you only need to provide two functions: Map -> apply() and Reduce -> summarize() (see the sketch after this list)
  • Used, for example, in indexing websites
  • MapReduce assumes a limited data model, so not every computation can be expressed in it
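
To make the Map -> apply() and Reduce -> summarize() idea concrete outside of Hadoop, here is a minimal sketch using Java's standard Stream API (the input strings are made up for illustration): map applies a function to every record, and reduce summarizes the mapped values into a single result.

import java.util.List;

public class MapReduceIdea {
  public static void main(String[] args) {
    // Made-up input: one "record" per line of text.
    List<String> lines = List.of("my apple", "my lettuce is green");

    // Map -> apply(): transform each record into its word count.
    // Reduce -> summarize(): combine all mapped values into a single total.
    int totalWords = lines.stream()
        .map(line -> line.split("\\s+").length) // apply a function to each element
        .reduce(0, Integer::sum);               // summarize the mapped values

    System.out.println("Total words: " + totalWords); // prints 6
  }
}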

 

MapReduce

Here, we will focus just on MapReduce and its functionalities. MapReduce is a programming framework for processing distributed data: the code is distributed across the network so that it works on the data locally. In simpler terms, MapReduce can be compared to the act of ‘delegating’ a large task to a group of people and then combining the result of each person’s effort to produce the final outcome. For example, let’s say you are preparing a salad for a party. You have tomatoes, lettuce, and onions to chop, and four friends to help you chop them.

  1. Map stage: Each friend is a process that uses their “compute” power to chop the vegetables and weigh each type.
  2. Shuffle stage: Each friend puts the chopped vegetables in the right places.
  3. Reduce stage: Collect the items, put them in a large bowl, and label the bowl with its weight.

MapReduce relies on YARN to schedule and execute parallel processing over the distributed file blocks in HDFS. Traditionally, parallel computing required expertise in a number of computing and systems concepts, such as synchronization mechanisms like locks and monitors. Using them incorrectly may lead to crashes or performance issues.

MapReduce simplifies this process of running code in parallel. Instead of worrying about multiple threads, synchronization, or concurrency issues, you only need to create map and reduce tasks.

Map and Reduce are two concepts based on functional programming.
Map – The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem and passes the answer back to its master node.
Shuffle – The process by which the system performs the sort and transfers the map outputs to the reducers.
Reduce – The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.
(Source: http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=/netahtml/PTO/srchnum.htm&r=1&f=G&l=50&s1=7,650,331.PN.&OS=PN/7,650,331&RS=PN/7,650,331)

MapReduce for WordCount

  1. Run a Map on each node
    1. Map generates key-value pairs
      1. my, my -> (my, 1), (my, 1)
      2. apple -> (apple, 1)
      3. is, is -> (is, 1), (is, 1)
  2. Sort and Shuffle
    1. Pairs with same keys moved to the same node
      1. Node 1: (word1, 1), (word1, 1)
      2. Node 2: (word2, 1)
      3. Node 3: (word3, 1), (word3, 1) …
      4. Node N: (wordN, 1), (wordN, 1), (wordN, 1), (wordN, 1)
    2. Nodes can have different words
    3. The number of nodes can be extended as the application demands
  3. Reduce
    1. Add values of the keys
    2. (you, 1) -> (you, 1)
    3. (apple, 1), (apple, 1) -> (apple, 2)
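
The same three steps can be simulated on a single machine in plain Java to show the dataflow; this is only an illustration (the input lines are made up), not Hadoop code:

import java.util.*;

public class WordCountSimulation {
  public static void main(String[] args) {
    List<String> lines = List.of("my apple", "my apple is red"); // made-up input blocks

    // 1. Map: emit a (word, 1) pair for every word in every line.
    List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
    for (String line : lines)
      for (String word : line.split("\\s+"))
        pairs.add(Map.entry(word, 1));

    // 2. Sort and Shuffle: group the pairs by key, as if all pairs with the
    //    same word were routed to the same node.
    Map<String, List<Integer>> grouped = new TreeMap<>();
    for (Map.Entry<String, Integer> p : pairs)
      grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());

    // 3. Reduce: add the values of each key.
    grouped.forEach((word, counts) ->
        System.out.println(word + " -> " + counts.stream().mapToInt(Integer::intValue).sum()));
    // Output: apple -> 2, is -> 1, my -> 2, red -> 1
  }
}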

Search Engine Application

Same idea as WordCount, except that instead of keys pointing to counts, the keys point to URLs.

At the Sort and Shuffle stage:

  1. Node 1: (word1, http://you1.fake), (word2, http://apple1.fake), (word2, http://apple2.fake)
  2. Node 2: (word3, http://apple2.fake), (word3, http://apple3.fake)
  3. Node 3: (word4, http://apple2.fake), (word5, http://apple2.fake)

At the Reduce stage:

  1. Reduce results for word2
    1. (word2 -> http://apple1.fake, http://apple2.fake)

This is, at a high level, how a search engine like Google works: you search for the word “word2”, and the engine returns the URLs associated with “word2”.
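
A minimal Hadoop-style sketch of this inverted index is shown below. It assumes, hypothetically, that the input format hands the mapper the page URL as the key and the page text as the value (e.g., via KeyValueTextInputFormat); the class names are illustrative, and it uses the same Hadoop classes and imports as the full WordCount example at the end of this post.

  // Mapper: emit (word, URL) instead of (word, 1).
  public static class InvertedIndexMapper
       extends Mapper<Text, Text, Text, Text> {
    public void map(Text url, Text pageText, Context context
                    ) throws IOException, InterruptedException {
      for (String token : pageText.toString().split("\\s+")) {
        context.write(new Text(token), url);
      }
    }
  }

  // Reducer: collect every URL whose page contains the word.
  public static class InvertedIndexReducer
       extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text word, Iterable<Text> urls, Context context
                       ) throws IOException, InterruptedException {
      StringBuilder urlList = new StringBuilder();
      for (Text url : urls) {
        if (urlList.length() > 0) urlList.append(", ");
        urlList.append(url.toString());
      }
      context.write(word, new Text(urlList.toString()));
    }
  }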

Conclusion

There is parallelization in each step: over the input in the Map step (we must decide the data granularity of each parallel computation; in WordCount, it is a line), over the intermediate key-value pairs in the Shuffle & Sort step, and over the data groups in the Reduce step. As shown, MapReduce hides the complexity of parallel programming and greatly simplifies building parallel applications.

MapReduce is bad for:

  • Frequently changing data. MapReduce is slow because it reads the entire data set each time.
  • Dependent tasks. Computations with dependencies between steps cannot easily be expressed as a single MapReduce job.
  • Interactive analysis. MapReduce does not return any results until the whole job is finished.

MapReduce for Word Count in Java Code

Source: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Tokenize the input and emit a (word, 1) pair for every token.
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    // Sum all the counts emitted for the same word.
    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    // The reducer also serves as a combiner, so partial sums are computed on the map side.
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input path
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output path (must not already exist)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
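
To try the job, you would typically compile this class against the Hadoop client libraries, package it into a jar, and launch it with the hadoop command; the jar name and HDFS paths below are placeholders:

hadoop jar wordcount.jar WordCount /user/<you>/wordcount/input /user/<you>/wordcount/output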