Discussion for Add the MapReduce API for SOAM related issues.
Discussion for Add the MapReduce API for SOAM related issues.
Add the MapReduce API for SOAM
Give a MapReduce API in SOAM.
--------------------------------------------------------------------------------
Overview
MapReduce is a software framework implemented by Google to support parallel computations over large (multiple petabyte) data sets on unreliable clusters of computers. This framework is largely taken from map and reduce functions commonly used in functional programming, although the actual semantics of the framework are not the same.
wikipedia MapReduce
There a lot of developers who are familiar with the MapReduce programming model.
We want to support this model in SOAM.
Work Items
1. Define a MapReduce API in SOAM.
2. Implement the C++ MapReduce API in SOAM.
3. Using the SOAM MapReduce API create some examples for customers.
4. Add the data aware logic in SOAM.
One of the key aspects of the MapReduce model is the file system capabilities. Google has its GFS and Hadoop has its own Hadoop filesystem. For C++, the Kosmos File System (KFS) might be a useful way of addressing the data problem. See the link
http://kosmosfs.sourceforge.net/. Something to consider as you explore this.
Last edited by Khalid; August 14th, 2008 at 05:52 PM.
We have implemented a MapReduce API based on the current Symphony API.
Dowlod Page
The benefit of this new API is developers do not need to write their own data serialization classes with the current SOAM API. They can abstract their inputs as key/value pairs, and get their outputs back in that format as well.
Additionally, the customers do not need to care about managing the session and connection objects too. Thay all be handled in the Symphony Map reduce API.
Client Code
mapMsg.setKey("Skater");
mapMsg.setValues("Hello Grid!");
jobconf.setMapMSG(&mapMsg);
jobconf.setNumMapTasks(numMaps);
jobconf.setAppName("HelloMapReduceApp");
jobconf.setMapType("maptype");
jobconf.setReduceType("reducetype");
ToolRunner runner;
runner.run(jobconf);
Mapper Code
class HelloMapper: public MapService
{
public:
HelloMapper(void) { }
~HelloMapper(void) { }
void runMapper(std::string key, std::string values, OutputCollector&
outPut)
{
outPut.collect(key, values + "Response!");
}
};
Additionally, you can write the reducer to process your map output in the distributed environment, which may become useful when your output from the mapper is large, and you want to enjoy the advantages of using the grid.
Reducer Code
class WordCounterReducer: public ReduceService {
public:
WordCounterReducer(void) { }
~WordCounterReducer(void) { }
void runReducer(std::string key, std::vector<string> values, OutputCollector& outPut)
{
long sum = 0;
while (!values.empty())
{
values.pop_back();
++sum;
}
outPut.collect(key,sum);
}
};
Last edited by Qiang Xu; August 14th, 2008 at 05:50 PM.
Hello guys:
We have done more research in this area.
Target
• Enhance the performance of Symphony processing large data.
– By introducing a DFS in the Symphony, the customers’ data can be maintained in the computing node. This can reduce the time for passing data to SIM by SOAM messages.
• Let the task run the data which it preferred.
– By enhancing the task and session API, the customers can get the computing node according to their data or resource requirement.
Plan
- DFS integration plan
- Look into the Hadoop about how it using HDFS.(1 week)
- Enhance our Map-reduce API for SOAM. Add more utilities like auto splitting jobs. (1 week)
- Implement a DFS proxy for integrating with any DFS in the future. (1 weeks)
- Integrate the HDFS with our new Map-Reduce API.(1 week)
- After integration with HDFS, the existing DFS proxy should be easy to integrate with KFS.
- Enhance our SOAM API to support data aware scheduling.
- It means let the meta-data server tell SSM where to find the data. (We will start it after the DFS integration project)
HDFS VS KFS
No code has to be inserted here.
KFS performance is better than HDFS
http://code.google.com/p/hypertable/wiki/KFSvsHDFS
The problem is KFS is in its early implementation stage, the system is not stable, and there is not enough sample for how to use it.
Last edited by Qiang Xu; August 14th, 2008 at 06:33 PM.
While I am not familiar with the specific DFS you are proposing, I think it is an excellent idea to provide a DFS for the service tasks to access large datasets. Otherwise, the client may become a storage and communication bottleneck for some applications.
Taking this a bit further, how about a data fabric for the service tasks? Something like GemFire would provide better fault-tolernce and provide caching as well.
-- Cheers, Peter
(Peter Strazdins, Australian National University)
Choosing which DFS to support is a difficult question. I think supporting HDFS or KFS is a "natural" since they were built to support MapReduce (in the form of Hadoop) so we know that the data model fits well. For general Symphony applications, data caches like GemFire or Oracle Coherence or Gigaspaces XAP fit very well (as evidenced by Platform customers using these various technologies together).
For my part, I'd be interested in seeing how a CFS like Lustre or GPFS could be used with MapReduce. I think the important characteristic of the data layer is "parallel access" (vs, say, atomic updates of data items, etc), and these filesystems are well tuned for this type of thing.