Distributed computing with Hadoop Map/Reduce is a great fit for batch processing Call Detail Records (CDRs) dumped from a mobile network, but it can take some time to wrap your head around this way of breaking down problems. So here’s a simple, concrete example:
Let’s calculate the total credit spent by hour over the history of our mobile operator.
1. Splits
The first step is to split the input files into discrete chunks that can be processed in parallel. This step is handled by the framework, and can configured to use a wide variety of input formats– compressed or uncompressed CSV, fixed width, or other, though generally line-based.
Each InputSplit is assigned to a separate computer or “node”.
2. Map
During the second step, the InputSplit is processed by a “Mapper” that translates input records into key/value pairs. Mappers can be written in Java using the Hadoop API, or in any other language and streamed through stdin/stdout.
In our example of calculating revenue, our mapper would take the CDRs as input, line by line, parse the comma separated values, and extract the date and the credit spent.
So given the input split:
6307970921, 2/7/2010 14:35, $0.34,… 6307970921, 2/7/2010 14:36, $9.12,… 6304319961, 2/7/2010 15:01, $14.33
our Mapper would emit the following Key/Value pairs:
2/7/2010 14h -> $0.34 2/7/2010 14h -> $9.12 2/7/2010 15h -> $14.33
3. Shuffle
At this point, the framework sorts all the key/value pairs emitted by the Mappers by key, and then combines all the values into a list. Note while our example uses simple types, both key and values can be of any type — numbers, strings, or an arbitrary collection of bytes.
4. Reduce
Finally the merged key/value lists are fed into our Reducer program. In this example, we receive a key that corresponds to the hour, and a list of all costs incurred during that our.
Our reducer program simply sums all the values and outputs the summary row, whether to another flat file or to a database.
Higher Levels of Abstraction
Map/Reduce abstracts away a lot of the mechanics of coordinating jobs executing in parallel, recovering from partial failure, and intermediate sorting.
It is however, still a pretty low level of programming. There are a number of good solutions emerging that allow analysts to write queries in high level languages. Hive, for example, translates an SQL-like language into a series of Map/Reduce jobs.




your application of MapReduce is great. Besides the CDRs, what other kinds of data can the approach be applied to?
Thanks.