In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers. --Grace Hopper
These demos are in example/log-crunch. Before you start, download and install Java and Hadoop. I used OpenJDK 7 (installed via apt-get) and Hadoop 1.2.1 (downloaded the tarball and untarred it) when testing these examples. I configured my environment like so:
export HADOOP_PREFIX=/heap/tmp/hadoop-1.2.1
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
Install dumbo if you want to run the dumbo examples. I used easy_install, following the directions on their wiki.
Here is how to run examples for each of the above bullets:
./map.py < log.txt, then make shell (see the map.py sketch after this list)
make hadoop
make dumbo-local
make dumbo-hadoop
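The first command above feeds log.txt straight into the mapper on the command line. Here is a minimal sketch of what a streaming mapper like map.py might look like; it is an assumption rather than the repo's exact script, and it guesses that the hostname is the first whitespace-separated field of each log line:

#!/usr/bin/env python
# Sketch of a Hadoop-streaming-style mapper: read log lines from stdin
# and emit "hostname<TAB>1" for each request.
import sys

for line in sys.stdin:
    parts = line.split()
    if parts:  # skip blank lines; assume field 0 is the hostname
        print("%s\t%d" % (parts[0], 1))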
reduce.py assumes valid input (key=hostname, tab, value=integer) that is already grouped by key (hostname). It must manually total up the values for each consecutive run of the same hostname.
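Here is a minimal sketch of that kind of streaming reducer; it is not the exact reduce.py from the repo, and it assumes the grouped hostname<TAB>count lines arrive on stdin:

#!/usr/bin/env python
# Sketch of a Hadoop-streaming-style reducer: input lines are
# "hostname<TAB>count", already sorted/grouped by hostname.
import sys

current_host = None
total = 0

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    host, count = line.split("\t", 1)
    count = int(count)
    if host == current_host:
        total += count
    else:
        # New hostname, so the previous group is complete.
        if current_host is not None:
            print("%s\t%d" % (current_host, total))
        current_host = host
        total = count

# Emit the final group.
if current_host is not None:
    print("%s\t%d" % (current_host, total))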
The dumbo example is simpler because our reducer function is guaranteed to receive one key at a time, along with a generator of that key's associated values.
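For comparison, a dumbo-style mapper/reducer pair might look roughly like this; again a sketch rather than the repo's exact code, with the same assumption that the hostname is the first field of each log line:

# Sketch of a dumbo job: dumbo calls mapper() once per input line and
# reducer() once per key with a generator of that key's values.
def mapper(key, value):
    parts = value.split()
    if parts:  # assume field 0 is the hostname
        yield parts[0], 1

def reducer(key, values):
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer)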
map.py and reduce.py borrowed heavily from Michael Noll's post on Hadoop MapReduce in Python.
In a 10-meter race between a rocket and a scooter, the scooter finishes before the rocket's engines start.
I like the scooter/rocket analogy for another reason. Consider that the scooter will never overcome wind resistance. The rocket will eventually leave the atmosphere, where it can accelerate indefinitely.
Generated large data files like so:
for i in {1..100}; do cat log.txt >> /tmp/data23k; done
for i in {1..100}; do cat /tmp/data23k >> /tmp/data2.3M; done
for i in {1..30}; do cat /tmp/data2.3M >> /tmp/data67M; done
for i in {1..10}; do cat /tmp/data67M >> /tmp/data667M; done
for i in {1..20}; do cat /tmp/data67M >> /tmp/data1.4G; done
mapper() and reducer()