Wednesday, August 28, 2013

MapReduce with Hadoop 1.2.1 Part One - Install Hadoop on Mac, Run MapReduce job to count words and produce sorted output

I’ve done a bit of work with Hadoop and MapReduce over the past few years, and I’d like to use this blog post to share my experience. 

MapReduce is a software framework introduced by Google to support distributed computing on large data sets across clusters of computers. MapReduce offers an attractive, simple computational model that scales with the data. As Google's original MapReduce paper puts it: "Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system."

Apache Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license. It supports the running of applications on large clusters of commodity hardware. 

Here is an illustration of how MapReduce works:

A Mapper
    Accepts (key,value) pairs from the input
    Produces intermediate (key,value) pairs, which are then shuffled
A Reducer
    Accepts intermediate (key,value) pairs
    Produces final (key,value) pairs for the output
A driver
    Specifies which inputs to use, where to put the outputs
    Chooses the mapper and the reducer to use
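The flow above can be sketched in plain Java, with no Hadoop required. This MiniMapReduce class is purely illustrative (not part of the project below): it counts words by running the map, shuffle, and reduce steps in-process.

```java
import java.util.*;

public class MiniMapReduce {
    // Mapper: emit an intermediate (word, 1) pair for each token in a line
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.split("\\s+")) {
            if (!w.isEmpty()) out.add(new AbstractMap.SimpleEntry<>(w, 1));
        }
        return out;
    }

    // Shuffle: group intermediate values by key (TreeMap also sorts the keys)
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reducer: sum the grouped values to produce the final (word, count) pairs
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> out = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            out.put(e.getKey(), sum);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = reduce(shuffle(map("to be or not to be")));
        System.out.println(counts); // {be=2, not=1, or=1, to=2}
    }
}
```

Hadoop does exactly this, except that map and reduce tasks run in parallel on many machines and the shuffle moves data between them.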

Install Hadoop 1.2.1 on Mac

Step 1: Enable SSH to localhost

Go to System Preferences > Sharing.
Make sure “Remote Login” is checked.

$ ssh-keygen -t rsa
$ cat ~/.ssh/ >> ~/.ssh/authorized_keys

Make sure that in the end you can ssh to localhost without a password: 
$ ssh localhost

Step 2: Install Hadoop

Download the latest Hadoop core (hadoop-1.2.1.tar.gz), untar it, and move it to /usr/local: 

$ tar xvfz hadoop-1.2.1.tar.gz 
$ mv hadoop-1.2.1 /usr/local/

Most likely, you will need root privileges for the mv: 
$ sudo mv hadoop-1.2.1 /usr/local/ 

Step 3: Configure Hadoop

Change directory to where Hadoop is installed and configure it. 

$ cd /usr/local/hadoop-1.2.1
$ mkdir /usr/local/hadoop-1.2.1/dfs/

Add the following to conf/core-site.xml (the standard pseudo-distributed setting, pointing the default filesystem at HDFS on localhost):

<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

Add the following to conf/hdfs-site.xml (the directory values below assume the dfs/ directory created above):

<configuration>
    <property>
        <name>dfs.name.dir</name>
        <value>/usr/local/hadoop-1.2.1/dfs/name</value>
        <description>Path on the local filesystem where the NameNode stores the namespace and transactions logs persistently</description>
    </property>
    <property>
        <name>dfs.data.dir</name>
        <value>/usr/local/hadoop-1.2.1/dfs/data</value>
        <description>Comma separated list of paths on the local filesystem of a DataNode where it should store its blocks.</description>
    </property>
</configuration>


Add the following to conf/mapred-site.xml (the standard single-node JobTracker address):

<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
    </property>
</configuration>

Add the following to conf/hadoop-env.sh:

export HADOOP_OPTS=""
export JAVA_HOME=/Library/Java/Home

Step 4: Configure environment variables for Hadoop

Edit ~/.bash_profile to add the following (putting $HADOOP_HOME/bin on the PATH so the hadoop command used below is found): 

## set Hadoop environment 
export HADOOP_HOME=/usr/local/hadoop-1.2.1
export HADOOP_CONF_DIR=/usr/local/hadoop-1.2.1/conf
export PATH=$PATH:$HADOOP_HOME/bin

Step 5: Format Hadoop filesystem

Run the hadoop command to format HDFS. 

$ hadoop namenode -format

Step 6: Start Hadoop

Run the command to start Hadoop: 

$ start-all.sh

To verify that all Hadoop processes are running:

$ jps
9460 Jps
9428 TaskTracker
9344 JobTracker
9198 DataNode
9113 NameNode
9282 SecondaryNameNode

To run a sample Hadoop MapReduce job (calculating Pi) that comes from hadoop-examples-1.2.1.jar:
$ hadoop jar /usr/local/hadoop-1.2.1/hadoop-examples-1.2.1.jar pi 10 100

To take a look at hadoop logs:
$ ls -altr /usr/local/hadoop-1.2.1/logs/

To stop hadoop:

$ stop-all.sh

Web UI for Hadoop NameNode: http://localhost:50070/
Web UI for Hadoop JobTracker: http://localhost:50030/

Run MapReduce job to count words and produce sorted output 

Use mvn to start a Hadoop MapReduce Java project:
$ mvn archetype:generate -DgroupId=com.lei.hadoop -DartifactId=MapReduceWithHadoop -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false 

$ cd MapReduceWithHadoop/

Add the following Hadoop dependency to pom.xml: 

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>1.2.1</version>
</dependency>

To build the target jar: 
$ mvn clean compile package 

To generate eclipse project: 
$ mvn eclipse:eclipse 

To run MapReduce job that counts words in text files at local folder wordcount_input, and generate result in sorted format: 
$ hadoop jar ./target/MapReduceWithHadoop-1.0-SNAPSHOT.jar com.lei.hadoop.countword.CountWordsV2 ./wordcount_input ./wordcount_output 

To see the result from output folder wordcount_output: 
$ tail wordcount_output/part-00000
5 Apache
5 a
6 distributed
8 data
8 to
8 of
10 and
10 Hadoop
11 for
12 A
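Why does the output come out ascending by count? Hadoop sorts intermediate keys before they reach the reducers, so the sort job simply swaps each pair to (count, word) and lets the framework's own key sort do the work. Here is a plain-Java sketch of that idea (the SortByCount class and byCount method are illustrative, not part of the project):

```java
import java.util.*;

public class SortByCount {
    // Group words under their counts in a sorted map, mimicking
    // Hadoop's ascending sort on intermediate (count, word) keys.
    static TreeMap<Integer, List<String>> byCount(Map<String, Integer> counts) {
        TreeMap<Integer, List<String>> sorted = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            sorted.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        return sorted;
    }

    public static void main(String[] args) {
        // counts taken from the sample output above
        Map<String, Integer> counts = new HashMap<>();
        counts.put("Apache", 5);
        counts.put("data", 8);
        counts.put("Hadoop", 10);
        System.out.println(byCount(counts)); // {5=[Apache], 8=[data], 10=[Hadoop]}
    }
}
```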

Mapper for WordCount (following the classic WordCount v2 example; the enclosing class declares the caseSensitive, patternsToSkip, word, one, numRecords, and inputFile fields and the Counters enum):
  // Mapper to send <word, 1> to the OutputCollector
  public void map(LongWritable key, Text value, 
    OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
   String line = (caseSensitive) ? value.toString() : value.toString().toLowerCase();
   for (String pattern : patternsToSkip) {
    line = line.replaceAll(pattern, "");
   }
   StringTokenizer tokenizer = new StringTokenizer(line);
   while (tokenizer.hasMoreTokens()) {
    word.set(tokenizer.nextToken());
    output.collect(word, one);
    reporter.incrCounter(Counters.INPUT_WORDS, 1);
   }
   if ((++numRecords % 100) == 0) {
    reporter.setStatus("Finished processing " + numRecords + " records "
      + "from the input file: " + inputFile);
   }
  }

Reducer for WordCount:
 public static class CountWordsV2Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
  // Reducer to sum up the word counts and send them to the OutputCollector
  public void reduce(Text key, Iterator<IntWritable> values, 
    OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
   int sum = 0;
   while (values.hasNext()) {
    sum +=;
   }
   output.collect(key, new IntWritable(sum));
  }
 }

Mapper for WordSort (reads each "word count" line of the word-count output and emits (count, word), so Hadoop's key sort orders the words by count):

 /**
  * @author stones333
  */
 public static class SortWordsMap extends MapReduceBase 
  implements Mapper<LongWritable, Text, IntWritable, Text> {
  public void map(LongWritable key, Text value, 
    OutputCollector<IntWritable, Text> collector, 
    Reporter reporter) throws IOException {
   String line = value.toString();
   StringTokenizer stringTokenizer = new StringTokenizer(line);

   int number = 0; 
   String word = "";

   if (stringTokenizer.hasMoreElements()) {
    String strWord = stringTokenizer.nextToken();
    word = strWord.trim();
   }

   if (stringTokenizer.hasMoreElements()) {
    String strNum = stringTokenizer.nextToken();
    number = Integer.parseInt(strNum.trim());
   }

   collector.collect(new IntWritable(number), new Text(word));
  }
 }


Reducer for WordSort (the keys arrive already sorted, so the reduce body simply passes each pair through):

 /**
  * @author stones333
  */
 public static class SortWordsReduce extends MapReduceBase implements Reducer<IntWritable, Text, IntWritable, Text> {
  public void reduce(IntWritable key, Iterator<Text> values, OutputCollector<IntWritable, Text> output, Reporter reporter) throws IOException {
   while (values.hasNext()) {
    output.collect(key, values.next());
   }
  }
 }


You can access the code from

 $ git clone

Have fun and enjoy the journey.

