Introduction to MapReduce with Hadoop

Adam Monsen

February 20, 2014

Welcome!

You'll laugh and you'll cry during this awe-inspiring introduction to big data processing with MapReduce and Hadoop. I only tested my code on Ubuntu GNU/Linux, and the specific instructions below are for that platform, but all of the code is cross-platform.

How do you scale when you must?

Bigger hardware

In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers. --Grace Hopper

Grace Hopper wrote the first compiler, popularized the term "debugging" for fixing software mistakes, and led the charge toward machine-independent programming languages.

Try "smarter"

Scale out

Hadoop and MapReduce

Diagram: MapReduce overview

[image: MapReduce diagram]

The above diagram is a work of Wikipedia user Poposhka, used here in accordance with the Creative Commons Attribution-ShareAlike 3.0 License.

Example: Log crunching

  1. shell version, small input file
    • map only, then add reduce
  2. streaming: local (single-node) hadoop
  3. Dumbo local
  4. Dumbo hadoop

These demos are in example/log-crunch. Before you start, download and install Java and Hadoop. I used OpenJDK 7 (installed via apt-get) and Hadoop 1.2.1 (downloaded the release tarball and untarred it) when testing these examples. I configured my environment like so:

export HADOOP_PREFIX=/heap/tmp/hadoop-1.2.1
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

Install dumbo if you want to run the dumbo examples. I used easy_install, following the directions on their wiki.
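For the record, the install amounted to something like:

easy_install dumbo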

Here is how to run the examples for each of the numbered steps above:

  1. ./map.py < log.txt, then make shell
  2. make hadoop
  3. make dumbo-local
  4. make dumbo-hadoop
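The make hadoop target presumably wraps a Hadoop Streaming invocation along these lines (a sketch: the jar path matches the Hadoop 1.2.1 tarball layout, and the out directory name is illustrative):

$HADOOP_PREFIX/bin/hadoop jar $HADOOP_PREFIX/contrib/streaming/hadoop-streaming-1.2.1.jar \
  -input log.txt -output out \
  -mapper map.py -reducer reduce.py \
  -file map.py -file reduce.py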

reduce.py assumes valid input (key=hostname, a tab, then an integer value), already sorted and grouped by key. Hadoop Streaming hands the reducer plain lines rather than grouped values, so it must manually total up the values for each consecutive run of the same hostname.
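Here is a minimal sketch of that reducer logic (not necessarily the exact code in example/log-crunch; it assumes map.py emitted one hostname<TAB>1 line per log entry):

#!/usr/bin/env python
# reduce.py sketch: sum values for each consecutive run of identical
# keys on stdin. Input must be sorted "hostname<TAB>count" lines.
import sys

current_host, current_total = None, 0
for line in sys.stdin:
    host, count = line.rstrip("\n").split("\t", 1)
    if host != current_host:
        if current_host is not None:
            print("%s\t%d" % (current_host, current_total))
        current_host, current_total = host, 0
    current_total += int(count)
if current_host is not None:
    print("%s\t%d" % (current_host, current_total))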

The dumbo example is simpler because dumbo's reducer function is guaranteed to receive one key at a time, along with a generator of that key's values.
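A sketch of the whole job in dumbo style (splitting the hostname out of the first whitespace-delimited field is my assumption about the log format):

# Log crunching, dumbo style. dumbo groups values per key for us,
# so there is no manual run-tracking as in the reduce.py sketch above.
def mapper(key, value):
    # value is one log line; assume the hostname is the first field
    parts = value.split()
    if parts:
        yield parts[0], 1

def reducer(key, values):
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer)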

map.py and reduce.py borrowed heavily from Michael Noll's post on writing a Hadoop MapReduce program in Python.

Log crunching, more data


In a 10-meter race between a rocket and a scooter, the scooter finishes before the rocket's engines start.

I like the scooter/rocket analogy for another reason. Consider that the scooter will never overcome wind resistance. The rocket will eventually leave the atmosphere, where it can accelerate indefinitely.

I generated the large data files like so (each loop concatenates many copies of the previous file):

for i in {1..100}; do cat log.txt >> /tmp/data23k; done
for i in {1..100}; do cat /tmp/data23k >> /tmp/data2.3M; done
for i in {1..30}; do cat /tmp/data2.3M >> /tmp/data67M; done
for i in {1..10}; do cat /tmp/data67M >> /tmp/data667M; done
for i in {1..20}; do cat /tmp/data67M >> /tmp/data1.4G; done

Example: Create a book index
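A book index maps naturally onto MapReduce: map emits a (word, page) pair for every word, and reduce collects each word's pages. A sketch in the same streaming style, assuming input lines of the form page-number<TAB>page-text (the format and the script names are my assumptions, not necessarily the demo's):

#!/usr/bin/env python
# index_map.py (hypothetical name): emit one "word<TAB>page" pair
# for every word on every page.
import sys

for line in sys.stdin:
    page, _, text = line.rstrip("\n").partition("\t")
    for word in text.lower().split():
        w = word.strip(".,;:!?\"'()")
        if w:
            print("%s\t%s" % (w, page))

#!/usr/bin/env python
# index_reduce.py (hypothetical name): collect sorted, de-duplicated
# page numbers for each consecutive run of identical words.
import sys

current_word, pages = None, set()
for line in sys.stdin:
    word, page = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print("%s\t%s" % (current_word, ",".join(sorted(pages, key=int))))
        current_word, pages = word, set()
    pages.add(page)
if current_word is not None:
    print("%s\t%s" % (current_word, ",".join(sorted(pages, key=int))))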

The End