-
June 30th, 2008 07:10 PM #1
MapReduce in Symphony
MapReduce is a programming model that has been around for a while
(especially in functional-style languages), but it was popularized more
recently by Google (Google Research Publication: MapReduce)
and the Hadoop project for processing large data sets on the grid.
Given the popularity and ease of use of the MapReduce programming
model, we decided to implement a C++ MapReduce API layer on top of
Platform Symphony's current C++ API, to give users who already know
the concepts of MapReduce a more familiar programming model. With this
new API, developers do not need to write their own data serialization
classes, as they do with the current SOAM API. They can express their
inputs as key/value pairs and get their outputs back in the same format.
For example, in the Symphony API, when you want to send the message
"Hello Grid", you need to write your own serialization code to create
the message, and create your own connection, session and tasks to
process it. With the new MapReduce API, serialization, connection
management, session management and task processing are all handled
automatically by the MapReduce library, so you can focus on your own
business logic. The code looks something like this:
Client Code:
mapMsg.setKey("Skater");
mapMsg.setValues("Hello Grid!");
jobconf.setMapMSG(&mapMsg);
jobconf.setNumMapTasks(numMaps);
jobconf.setAppName("HelloMapReduceApp");
jobconf.setMapType("maptype");
jobconf.setReduceType("reducetype");
ToolRunner runner;
runner.run(jobconf);
Mapper Code:
class HelloMapper: public MapService
{
public:
    HelloMapper(void) { }
    ~HelloMapper(void) { }
    // Append a response marker to the incoming value and emit it under the same key.
    void runMapper(std::string key, std::string values, OutputCollector& outPut)
    {
        outPut.collect(key, values + "Response!");
    }
};
Additionally, you can write a reducer to process your map output in
the distributed environment, which is useful when the mapper output
is large and you want to take advantage of the grid.
Reducer Code:
class WordCounterReducer: public ReduceService
{
public:
    WordCounterReducer(void) { }
    ~WordCounterReducer(void) { }
    // Count the values received for this key and emit the total.
    void runReducer(std::string key, std::vector<std::string> values, OutputCollector& outPut)
    {
        long sum = static_cast<long>(values.size());
        outPut.collect(key, sum);
    }
};
So in the SOA grid, MapReduce is implemented as two separate services:
Map and Reduce. Their inputs and outputs are always key/value pairs,
which developers process within their own implementations of these
services.
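To make the key/value flow concrete, here is a rough, self-contained sketch of the same map, group and reduce steps run in a single process. It does not use the Symphony MapReduce library at all: LocalCollector, mapWords, groupByKey, reduceCount and countWords are hypothetical names made up for this illustration, and the grouping step only stands in for whatever the real library does between the two services.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an output collector: it only records
// the key/value pairs handed to it. Not part of the Symphony API.
struct LocalCollector {
    std::vector<std::pair<std::string, std::string>> pairs;
    void collect(const std::string& key, const std::string& value) {
        pairs.emplace_back(key, value);
    }
};

// Map step: emit (word, "1") for every whitespace-separated word in the line.
void mapWords(const std::string& line, LocalCollector& out) {
    std::istringstream words(line);
    std::string word;
    while (words >> word) {
        out.collect(word, "1");
    }
}

// Group step: gather all values emitted under the same key.
std::map<std::string, std::vector<std::string>> groupByKey(const LocalCollector& mapped) {
    std::map<std::string, std::vector<std::string>> grouped;
    for (const auto& kv : mapped.pairs) {
        grouped[kv.first].push_back(kv.second);
    }
    return grouped;
}

// Reduce step: emit (key, number of values seen for that key).
void reduceCount(const std::string& key, const std::vector<std::string>& values,
                 LocalCollector& out) {
    out.collect(key, std::to_string(values.size()));
}

// Driver: run map over every line, group by key, then reduce each group.
std::map<std::string, std::string> countWords(const std::vector<std::string>& lines) {
    LocalCollector mapped;
    for (const auto& line : lines) {
        mapWords(line, mapped);
    }
    LocalCollector reduced;
    for (const auto& entry : groupByKey(mapped)) {
        reduceCount(entry.first, entry.second, reduced);
    }
    return std::map<std::string, std::string>(reduced.pairs.begin(), reduced.pairs.end());
}
```

Calling countWords with the two lines "hello grid" and "hello map reduce" gives a map in which "hello" has the value "2" and every other word has "1". On the grid, the map and reduce steps would run as separate services instead of in one loop.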
We have made the MapReduce API and examples available in source format
from the HPC Community download area here.
There are two examples in the source package: the WordCounter shows
the classic MapReduce example from Google's paper, and the PiEstimator
is a more traditional Symphony application that estimates the value
of Pi using the Monte Carlo method.
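For anyone who has not seen the Monte Carlo approach, the idea maps onto the model roughly as follows. This is only a sketch of the general technique, not the PiEstimator code from the package: countHits, estimatePi and runPiEstimate are hypothetical names, and seeding each task with its index is an assumption made to keep the example repeatable.

```cpp
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Map step (one task): draw random points in the unit square and count
// how many fall inside the quarter circle of radius 1.
std::uint64_t countHits(std::uint64_t samples, std::uint32_t seed) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    std::uint64_t hits = 0;
    for (std::uint64_t i = 0; i < samples; ++i) {
        const double x = unit(rng);
        const double y = unit(rng);
        if (x * x + y * y <= 1.0) {
            ++hits;
        }
    }
    return hits;
}

// Reduce step: combine the per-task hit counts. The fraction of hits
// approximates Pi / 4, so multiply by 4.
double estimatePi(const std::vector<std::uint64_t>& hitsPerTask,
                  std::uint64_t samplesPerTask) {
    if (hitsPerTask.empty() || samplesPerTask == 0) {
        return 0.0;  // nothing was sampled, so there is no estimate
    }
    const std::uint64_t totalHits =
        std::accumulate(hitsPerTask.begin(), hitsPerTask.end(), std::uint64_t{0});
    const double totalSamples =
        static_cast<double>(samplesPerTask) * static_cast<double>(hitsPerTask.size());
    return 4.0 * static_cast<double>(totalHits) / totalSamples;
}

// Driver: run numTasks independent map tasks locally, then reduce.
double runPiEstimate(std::uint32_t numTasks, std::uint64_t samplesPerTask) {
    std::vector<std::uint64_t> hitsPerTask;
    for (std::uint32_t task = 0; task < numTasks; ++task) {
        hitsPerTask.push_back(countHits(samplesPerTask, task));  // seed = task index
    }
    return estimatePi(hitsPerTask, samplesPerTask);
}
```

Each map task is independent of the others, which is why this kind of workload spreads across a grid so easily; the reduce step only has to add up a handful of integers.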
We hope that you can use MapReduce for your applications!
-
July 15th, 2008 04:15 PM #2
A big part of Hadoop's implementation of MapReduce is the DFS.
What plays that role in your API? I only see the common data as possibly having a similar role...
-
July 15th, 2008 04:50 PM #3
At this stage, we just wanted to show that the MapReduce programming model could be implemented on top of Symphony. We're also considering how to bring data management functionality to the API. One option would be to use HDFS from Hadoop itself. Another would be to use the Kosmos DFS as the file system layer. This would be key to making MapReduce in Symphony actually usable IMHO, but it hasn't been done yet.