Introduction to MapReduce with Hadoop

Adam Monsen

February 20, 2014

Welcome!

You'll laugh and you'll cry during this awe-inspiring introduction to big data processing with MapReduce and Hadoop. I only tested my code on Ubuntu GNU/Linux, and the specific instructions below are for that platform, but all of the code is cross-platform.

How do you scale when you must?

Bigger hardware

In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers. --Grace Hopper

Grace Hopper wrote the first compiler, popularized the term "debugging" for fixing software mistakes, and led the charge toward machine-independent programming languages.

Try "smarter"

Scale out

Hadoop and MapReduce

Diagram: MapReduce overview

[image: MapReduce diagram]

The above diagram is a work of Wikipedia user Poposhka, used here in accordance with the Creative Commons Attribution-ShareAlike 3.0 License.

Example: Log crunching

  1. shell version, small input file
    • map only, then add reduce
  2. streaming: local (single-node) hadoop
  3. Dumbo local
  4. Dumbo hadoop

These demos are in example/log-crunch. Before you start, download and install Java and Hadoop. I used OpenJDK 7 (installed via apt-get) and Hadoop 1.2.1 (downloaded the release tarball and untarred it) when testing these examples. I configured my environment like so:

export HADOOP_PREFIX=/heap/tmp/hadoop-1.2.1
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

Install dumbo if you want to run the dumbo examples. I used easy_install, following the directions on their wiki.
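For the record, the install amounted to something like:

easy_install dumbo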

Here is how to run the examples for each of the numbered steps above:

  1. ./map.py < log.txt, then make shell
  2. make hadoop
  3. make dumbo-local
  4. make dumbo-hadoop
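The make hadoop target presumably wraps a Hadoop Streaming invocation along these lines (a sketch: the jar path matches the Hadoop 1.2.1 tarball layout, and the out directory name is illustrative):

$HADOOP_PREFIX/bin/hadoop jar $HADOOP_PREFIX/contrib/streaming/hadoop-streaming-1.2.1.jar \
  -input log.txt -output out \
  -mapper map.py -reducer reduce.py \
  -file map.py -file reduce.py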

reduce.py assumes valid input (key=hostname, a tab, then an integer value), already sorted and grouped by key. Hadoop Streaming hands the reducer plain lines rather than grouped values, so it must manually total up the values for each consecutive run of the same hostname.
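Here is a minimal sketch of that reducer logic (not necessarily the exact code in example/log-crunch; it assumes map.py emitted one hostname<TAB>1 line per log entry):

#!/usr/bin/env python
# reduce.py sketch: sum values for each consecutive run of identical
# keys on stdin. Input must be sorted "hostname<TAB>count" lines.
import sys

current_host, current_total = None, 0
for line in sys.stdin:
    host, count = line.rstrip("\n").split("\t", 1)
    if host != current_host:
        if current_host is not None:
            print("%s\t%d" % (current_host, current_total))
        current_host, current_total = host, 0
    current_total += int(count)
if current_host is not None:
    print("%s\t%d" % (current_host, current_total))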

The dumbo example is simpler because dumbo's reducer function is guaranteed to receive one key at a time, along with a generator of that key's values.
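A sketch of the whole job in dumbo style (splitting the hostname out of the first whitespace-delimited field is my assumption about the log format):

# Log crunching, dumbo style. dumbo groups values per key for us,
# so there is no manual run-tracking as in the reduce.py sketch above.
def mapper(key, value):
    # value is one log line; assume the hostname is the first field
    parts = value.split()
    if parts:
        yield parts[0], 1

def reducer(key, values):
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer)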

map.py and reduce.py borrowed heavily from Michael Noll's post on writing a Hadoop MapReduce program in Python.

Log crunching, more data


In a 10-meter race between a rocket and a scooter, the scooter finishes before the rocket's engines start.

I like the scooter/rocket analogy for another reason. Consider that the scooter will never overcome wind resistance. The rocket will eventually leave the atmosphere, where it can accelerate indefinitely.

I generated the large data files like so (each loop concatenates many copies of the previous file):

for i in {1..100}; do cat log.txt >> /tmp/data23k; done
for i in {1..100}; do cat /tmp/data23k >> /tmp/data2.3M; done
for i in {1..30}; do cat /tmp/data2.3M >> /tmp/data67M; done
for i in {1..10}; do cat /tmp/data67M >> /tmp/data667M; done
for i in {1..20}; do cat /tmp/data67M >> /tmp/data1.4G; done

Example: Create a book index
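A book index maps naturally onto MapReduce: map emits a (word, page) pair for every word, and reduce collects each word's pages. A sketch in the same streaming style, assuming input lines of the form page-number<TAB>page-text (the format and the script names are my assumptions, not necessarily the demo's):

#!/usr/bin/env python
# index_map.py (hypothetical name): emit one "word<TAB>page" pair
# for every word on every page.
import sys

for line in sys.stdin:
    page, _, text = line.rstrip("\n").partition("\t")
    for word in text.lower().split():
        w = word.strip(".,;:!?\"'()")
        if w:
            print("%s\t%s" % (w, page))

#!/usr/bin/env python
# index_reduce.py (hypothetical name): collect sorted, de-duplicated
# page numbers for each consecutive run of identical words.
import sys

current_word, pages = None, set()
for line in sys.stdin:
    word, page = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print("%s\t%s" % (current_word, ",".join(sorted(pages, key=int))))
        current_word, pages = word, set()
    pages.add(page)
if current_word is not None:
    print("%s\t%s" % (current_word, ",".join(sorted(pages, key=int))))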

The End