HPCCommunity.org
 
  #1 (permalink)  
Old August 14th, 2008, 04:28 PM
Member
 
Join Date: March 16th, 2008
Posts: 70
Blog Entries: 2
Default Add the MapReduce API for SOAM

Discussion for Add the MapReduce API for SOAM related issues.
  #2 (permalink)  
Old August 14th, 2008, 04:45 PM
Junior Member
 
Join Date: June 5th, 2008
Posts: 5
Default Add the MapReduce API for SOAM

Add the MapReduce API for SOAM
Provide a MapReduce API in SOAM.


--------------------------------------------------------------------------------

Overview

MapReduce is a software framework implemented by Google to support parallel computations over large (multiple petabyte) data sets on unreliable clusters of computers. This framework is largely taken from map and reduce functions commonly used in functional programming, although the actual semantics of the framework are not the same.
(Source: Wikipedia, "MapReduce")
There are a lot of developers who are familiar with the MapReduce programming model.
We want to support this model in SOAM.
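
For readers new to the model, here is a minimal in-process sketch of the map, group-by-key, and reduce steps, using word counting as the example. It uses only the C++ standard library and is not the SOAM API; the names mapWords, reduceCounts, and wordCount are made up for illustration.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using KeyValue = std::pair<std::string, long>;

// Map step: emit (word, 1) for every whitespace-separated word in one input record.
std::vector<KeyValue> mapWords(const std::string& line) {
    std::vector<KeyValue> out;
    std::istringstream in(line);
    std::string word;
    while (in >> word) out.emplace_back(word, 1);
    return out;
}

// Reduce step: fold all values emitted for one key into a single result.
long reduceCounts(const std::vector<long>& values) {
    long sum = 0;
    for (long v : values) sum += v;
    return sum;
}

// Driver: map every record, group the emitted pairs by key, then reduce each group.
std::map<std::string, long> wordCount(const std::vector<std::string>& lines) {
    std::map<std::string, std::vector<long>> groups;
    for (const auto& line : lines)
        for (const auto& kv : mapWords(line))
            groups[kv.first].push_back(kv.second);

    std::map<std::string, long> result;
    for (const auto& g : groups) {
        assert(!g.second.empty());  // a group exists only if something was emitted for it
        result[g.first] = reduceCounts(g.second);
    }
    return result;
}
```

In a grid setting the map calls and the reduce calls would run as separate tasks on different nodes; here they run in one process only to show the data flow.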

Work Items

1. Define a MapReduce API in SOAM.
2. Implement the C++ MapReduce API in SOAM.
3. Create some examples for customers using the SOAM MapReduce API.
4. Add data-aware logic to SOAM.
  #3 (permalink)  
Old August 14th, 2008, 04:47 PM
Junior Member
 
Join Date: March 25th, 2008
Posts: 7
Blog Entries: 3
Default

One of the key aspects of the MapReduce model is its file system capabilities. Google has GFS, and Hadoop has its own Hadoop filesystem. For C++, the Kosmos File System (KFS) might be a useful way of addressing the data problem. See http://kosmosfs.sourceforge.net/. Something to consider as you explore this.

Last edited by Khalid; August 14th, 2008 at 04:52 PM..
  #4 (permalink)  
Old August 14th, 2008, 04:48 PM
Junior Member
 
Join Date: June 5th, 2008
Posts: 5
Default

We have implemented a MapReduce API based on the current Symphony API.
Download Page
The benefit of this new API is that developers no longer need to write their own data serialization classes, as they must with the current SOAM API. They can abstract their inputs as key/value pairs and get their outputs back in the same format.

Additionally, customers do not need to manage the session and connection objects. These are all handled by the Symphony MapReduce API.

Client Code

// Input key/value pair passed to the map tasks.
mapMsg.setKey("Skater");
mapMsg.setValues("Hello Grid!");

jobconf.setMapMSG(&mapMsg);
jobconf.setNumMapTasks(numMaps);

// Application name and the map/reduce type names for this job.
jobconf.setAppName("HelloMapReduceApp");
jobconf.setMapType("maptype");
jobconf.setReduceType("reducetype");

// Run the job.
ToolRunner runner;
runner.run(jobconf);



Mapper Code


class HelloMapper: public MapService
{
public:
    HelloMapper(void) { }

    ~HelloMapper(void) { }

    void runMapper(std::string key, std::string values, OutputCollector& outPut)
    {
        outPut.collect(key, values + "Response!");
    }
};


Additionally, you can write a reducer to process your map output in the distributed environment. This is useful when the mapper output is large and you want to take advantage of the grid.

Reducer Code


class WordCounterReducer: public ReduceService {
public:
    WordCounterReducer(void) { }

    ~WordCounterReducer(void) { }

    // Emit the number of values collected for this key.
    void runReducer(std::string key, std::vector<std::string> values, OutputCollector& outPut)
    {
        long sum = static_cast<long>(values.size());
        outPut.collect(key, sum);
    }
};
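
To try a mapper and reducer of this shape without a grid, a small local harness can help. The sketch below is only an assumption about how the pieces fit together: the OutputCollector, MapService, and ReduceService definitions are hypothetical stand-ins written for this example (the real classes come with the Symphony MapReduce API and are not shown in this thread), and runLocally and CountingReducer are made-up names.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in: stores every collected pair as strings.
class OutputCollector {
public:
    void collect(const std::string& key, const std::string& value) {
        pairs.emplace_back(key, value);
    }
    void collect(const std::string& key, long value) {
        pairs.emplace_back(key, std::to_string(value));
    }
    std::vector<std::pair<std::string, std::string>> pairs;
};

// Hypothetical stand-ins for the service base classes.
class MapService {
public:
    virtual ~MapService() {}
    virtual void runMapper(std::string key, std::string values, OutputCollector& outPut) = 0;
};

class ReduceService {
public:
    virtual ~ReduceService() {}
    virtual void runReducer(std::string key, std::vector<std::string> values,
                            OutputCollector& outPut) = 0;
};

class HelloMapper : public MapService {
public:
    void runMapper(std::string key, std::string values, OutputCollector& outPut) override {
        outPut.collect(key, values + "Response!");
    }
};

class CountingReducer : public ReduceService {
public:
    void runReducer(std::string key, std::vector<std::string> values,
                    OutputCollector& outPut) override {
        outPut.collect(key, static_cast<long>(values.size()));
    }
};

// Run the mapper on every input, group the map output by key, then reduce each group.
OutputCollector runLocally(MapService& mapper, ReduceService& reducer,
                           const std::vector<std::pair<std::string, std::string>>& inputs) {
    OutputCollector mapOut;
    for (const auto& in : inputs) mapper.runMapper(in.first, in.second, mapOut);

    std::map<std::string, std::vector<std::string>> grouped;
    for (const auto& kv : mapOut.pairs) grouped[kv.first].push_back(kv.second);

    OutputCollector reduceOut;
    for (const auto& g : grouped) reducer.runReducer(g.first, g.second, reduceOut);
    return reduceOut;
}

// Usage: two inputs share the key "Skater", so the reducer reports a count of 2.
void exampleLocalRun() {
    std::vector<std::pair<std::string, std::string>> inputs;
    inputs.emplace_back("Skater", "Hello Grid!");
    inputs.emplace_back("Skater", "Hello again!");

    HelloMapper mapper;
    CountingReducer reducer;
    OutputCollector out = runLocally(mapper, reducer, inputs);
    assert(out.pairs.size() == 1);
    assert(out.pairs[0].first == "Skater");
    assert(out.pairs[0].second == "2");
}
```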

Last edited by Qiang Xu; August 14th, 2008 at 04:50 PM..
  #5 (permalink)  
Old August 14th, 2008, 04:48 PM
Junior Member
 
Join Date: June 5th, 2008
Posts: 5
Default

Hello guys:
We have done more research in this area.

Target
• Enhance the performance of Symphony when processing large data sets.
– By introducing a DFS into Symphony, the customers’ data can be kept on the compute nodes. This reduces the time spent passing data to the SIM in SOAM messages.
• Let each task run on the node that holds its preferred data.
– By enhancing the task and session API, customers can select compute nodes according to their data or resource requirements.
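
As a rough illustration of the second target, the sketch below picks the host that already stores most of the data blocks a task needs. It is only a sketch under assumptions: BlockLocations, pickPreferredHost, and the block and host names are hypothetical, and nothing here reflects the actual Symphony scheduler or any DFS metadata format.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical metadata: for each data block, the hosts that store a replica.
using BlockLocations = std::map<std::string, std::vector<std::string>>;

// Return the host holding the most of the task's blocks, or "" if no location is known.
// Ties go to the host name that sorts first, because std::map iterates in key order.
std::string pickPreferredHost(const BlockLocations& locations,
                              const std::vector<std::string>& taskBlocks) {
    std::map<std::string, int> localBlocks;  // host -> number of task blocks stored there
    for (const auto& block : taskBlocks) {
        auto it = locations.find(block);
        if (it == locations.end()) continue;  // no known location for this block
        for (const auto& host : it->second) ++localBlocks[host];
    }

    std::string best;
    int bestCount = 0;
    for (const auto& entry : localBlocks) {
        if (entry.second > bestCount) {
            best = entry.first;
            bestCount = entry.second;
        }
    }
    return best;
}

// Usage: nodeB stores both blocks and nodeA only one, so nodeB is preferred.
void examplePreferredHost() {
    BlockLocations locations;
    locations["block1"].push_back("nodeA");
    locations["block1"].push_back("nodeB");
    locations["block2"].push_back("nodeB");

    std::vector<std::string> taskBlocks;
    taskBlocks.push_back("block1");
    taskBlocks.push_back("block2");

    assert(pickPreferredHost(locations, taskBlocks) == "nodeB");
}
```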

Plan
  1. DFS integration plan
    1. Investigate how Hadoop uses HDFS. (1 week)
    2. Enhance our MapReduce API for SOAM by adding utilities such as automatic job splitting. (1 week)
    3. Implement a DFS proxy so that any DFS can be integrated in the future. (1 week)
    4. Integrate HDFS with our new MapReduce API. (1 week)
    5. After the HDFS integration, it should be easy to integrate KFS through the existing DFS proxy.
  2. Enhance our SOAM API to support data-aware scheduling.
    1. This means letting the metadata server tell the SSM where to find the data. (We will start this after the DFS integration project.)
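
The DFS proxy and auto-splitting items in the plan could take roughly the following shape. This is a sketch under assumptions, not the planned implementation: DfsProxy, InMemoryDfs, and splitForMapTasks are hypothetical names, the in-memory class only stands in for a real HDFS or KFS adapter, and fixed-size byte chunks are just one possible splitting rule.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical proxy interface. An HDFS or KFS adapter would implement the same
// three calls by delegating to that file system's own client library.
class DfsProxy {
public:
    virtual ~DfsProxy() {}
    virtual bool exists(const std::string& path) const = 0;
    virtual std::string read(const std::string& path) const = 0;
    virtual void write(const std::string& path, const std::string& data) = 0;
};

// In-memory stand-in, useful for testing code that depends only on DfsProxy.
class InMemoryDfs : public DfsProxy {
public:
    bool exists(const std::string& path) const override {
        return files_.count(path) != 0;
    }
    std::string read(const std::string& path) const override {
        auto it = files_.find(path);
        if (it == files_.end()) throw std::runtime_error("no such file: " + path);
        return it->second;
    }
    void write(const std::string& path, const std::string& data) override {
        files_[path] = data;
    }

private:
    std::map<std::string, std::string> files_;
};

// Auto-splitting sketch: cut a file's contents into chunks of at most chunkSize
// bytes, one chunk per map task.
std::vector<std::string> splitForMapTasks(const DfsProxy& dfs, const std::string& path,
                                          std::size_t chunkSize) {
    assert(chunkSize > 0);
    const std::string data = dfs.read(path);
    std::vector<std::string> chunks;
    for (std::size_t pos = 0; pos < data.size(); pos += chunkSize)
        chunks.push_back(data.substr(pos, chunkSize));
    return chunks;
}
```

Because splitForMapTasks sees only the DfsProxy interface, switching from one file system to another would mean supplying a different adapter rather than changing the splitting code.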

HDFS vs. KFS

Feature                          HDFS                    KFS
Storage client buffer strategy   Yes                     Yes
Data integrity                   Yes                     Yes
Cluster rebalancing              No                      Yes
Metadata disk failover           No                      No
File writing                     Write once per file     Multiple writes per file
Documentation                    Good                    Far from ready
Applications                     Hadoop, Nutch …         None
Stability                        Tested on Hadoop        Not enough testing
OS                               Platform independent    FC5 and FC9 64-bit

According to the comparison linked below, KFS performance is better than that of HDFS.

http://code.google.com/p/hypertable/wiki/KFSvsHDFS
The problem is that KFS is at an early stage of implementation: the system is not yet stable, and there are not enough samples showing how to use it.
Attached Files
File Type: zip First step in DFS investigation.zip (918.1 KB, 0 views)

Last edited by Qiang Xu; August 14th, 2008 at 05:33 PM..
  #6 (permalink)  
Old September 19th, 2008, 08:19 AM
Junior Member
 
Join Date: September 17th, 2008
Location: Canberra
Posts: 5
Default

While I am not familiar with the specific DFS you are proposing, I think it is an excellent idea to provide a DFS for the service tasks to access large datasets. Otherwise, the client may become a storage and communication bottleneck for some applications.

Taking this a bit further, how about a data fabric for the service tasks? Something like GemFire would provide better fault tolerance as well as caching.
__________________
-- Cheers, Peter

(Peter Strazdins, Australian National University)
  #7 (permalink)  
Old October 6th, 2008, 06:29 PM
csmith's Avatar
Junior Member
 
Join Date: March 20th, 2008
Posts: 14
Blog Entries: 7
Default

Choosing which DFS to support is a difficult question. I think supporting HDFS or KFS is a "natural" since they were built to support MapReduce (in the form of Hadoop) so we know that the data model fits well. For general Symphony applications, data caches like GemFire or Oracle Coherence or Gigaspaces XAP fit very well (as evidenced by Platform customers using these various technologies together).

For my part, I'd be interested in seeing how a CFS like Lustre or GPFS could be used with MapReduce. I think the important characteristic of the data layer is "parallel access" (vs, say, atomic updates of data items, etc), and these filesystems are well tuned for this type of thing.