
Thread: Add the MapReduce API for SOAM

  1. #1
    vbseo is offline Member

    Add the MapReduce API for SOAM

    Discussion of issues related to adding the MapReduce API for SOAM.

  2. #2
    Qiang Xu is offline Junior Member

    Add the MapReduce API for SOAM

    Provide a MapReduce API in SOAM.


    --------------------------------------------------------------------------------

    Overview

    MapReduce is a software framework implemented by Google to support parallel computations over large (multiple petabyte) data sets on unreliable clusters of computers. This framework is largely taken from map and reduce functions commonly used in functional programming, although the actual semantics of the framework are not the same.
    (Source: Wikipedia, "MapReduce")
    Many developers are familiar with the MapReduce programming model.
    We want to support this model in SOAM.
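
For readers less familiar with the functional-programming roots mentioned above, here is a minimal sketch of plain "map" and "reduce" in standard C++. It is not the SOAM API; the function names `mapLengths` and `reduceSum` are made up for illustration. The map step transforms each element independently (which is what makes it easy to parallelize), and the reduce step folds the results into one value.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// "Map": apply a function to every element independently.
// Here the function is "length of the word" (illustrative choice).
std::vector<std::size_t> mapLengths(const std::vector<std::string>& words)
{
    std::vector<std::size_t> lengths(words.size());
    std::transform(words.begin(), words.end(), lengths.begin(),
                   [](const std::string& w) { return w.size(); });
    return lengths;
}

// "Reduce": fold all mapped values into a single result (here, a sum).
std::size_t reduceSum(const std::vector<std::size_t>& values)
{
    return std::accumulate(values.begin(), values.end(), std::size_t{0});
}
```

Calling `reduceSum(mapLengths(words))` gives the total number of characters in `words`. The MapReduce framework differs from this in that it works on key/value pairs and groups values by key between the two steps.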

    Work Items

    1. Define a MapReduce API in SOAM.
    2. Implement the C++ MapReduce API in SOAM.
    3. Create some examples for customers using the SOAM MapReduce API.
    4. Add data-aware logic to SOAM.

  3. #3
    Khalid is offline Junior Member

    One of the key aspects of the MapReduce model is its file system capabilities. Google has GFS, and Hadoop has its own Hadoop file system (HDFS). For C++, the Kosmos File System (KFS) might be a useful way of addressing the data problem; see
    http://kosmosfs.sourceforge.net/. Something to consider as you explore this.
    Last edited by Khalid; August 14th, 2008 at 05:52 PM.

  4. #4
    Qiang Xu is offline Junior Member

    We have implemented a MapReduce API based on the current Symphony API.
    Download Page
    The benefit of this new API is that developers no longer need to write their own data serialization classes, as they do with the current SOAM API. They can express their inputs as key/value pairs and get their outputs back in the same format.

    Additionally, customers do not need to manage the session and connection objects either. These are all handled by the Symphony MapReduce API.

    Client Code

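    // mapMsg, jobconf, and numMaps are declared earlier (omitted in this excerpt).

    // The input is expressed as a key/value pair.
    mapMsg.setKey("Skater");
    mapMsg.setValues("Hello Grid!");

    jobconf.setMapMSG(&mapMsg);
    jobconf.setNumMapTasks(numMaps);

    // Application name and the mapper/reducer types to run.
    jobconf.setAppName("HelloMapReduceApp");
    jobconf.setMapType("maptype");
    jobconf.setReduceType("reducetype");

    // Submit the job; session and connection objects are managed by the API.
    ToolRunner runner;
    runner.run(jobconf);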

    Mapper Code


    class HelloMapper : public MapService
    {
    public:
        HelloMapper(void) { }

        ~HelloMapper(void) { }

        // Appends "Response!" to the incoming value and emits it under the same key.
        void runMapper(std::string key, std::string values, OutputCollector& outPut)
        {
            outPut.collect(key, values + "Response!");
        }
    };


    You can also write a reducer to process your map output in the distributed environment. This is useful when the mapper output is large and you want to take advantage of the grid.

    Reducer Code


    class WordCounterReducer : public ReduceService
    {
    public:
        WordCounterReducer(void) { }

        ~WordCounterReducer(void) { }

        // Emits the number of values collected for the given key.
        void runReducer(std::string key, std::vector<std::string> values, OutputCollector& outPut)
        {
            long sum = static_cast<long>(values.size());
            outPut.collect(key, sum);
        }
    };
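
To see how a mapper and a counting reducer fit together without a grid, here is a rough local sketch of the map, group-by-key, and reduce steps. It does not use the Symphony classes: `PairCollector`, `mapWords`, `groupByKey`, `countValues`, and `wordCount` are hypothetical names invented for this example, and the real `OutputCollector` may behave differently.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an output collector: it only records pairs.
struct PairCollector {
    std::vector<std::pair<std::string, std::string>> pairs;
    void collect(const std::string& key, const std::string& value) {
        pairs.emplace_back(key, value);
    }
};

// Map step: emit (word, "1") for every whitespace-separated word.
void mapWords(const std::string& line, PairCollector& out) {
    std::istringstream in(line);
    std::string word;
    while (in >> word) {
        out.collect(word, "1");
    }
}

// Group step: gather all values emitted under the same key.
std::map<std::string, std::vector<std::string>> groupByKey(const PairCollector& c) {
    std::map<std::string, std::vector<std::string>> grouped;
    for (const auto& kv : c.pairs) {
        grouped[kv.first].push_back(kv.second);
    }
    return grouped;
}

// Reduce step: count the values for one key.
long countValues(const std::vector<std::string>& values) {
    return static_cast<long>(values.size());
}

// Runs all three steps in one process and returns word -> count.
std::map<std::string, long> wordCount(const std::vector<std::string>& lines) {
    PairCollector mapped;
    for (const auto& line : lines) {
        mapWords(line, mapped);
    }
    std::map<std::string, long> counts;
    for (const auto& entry : groupByKey(mapped)) {
        counts[entry.first] = countValues(entry.second);
    }
    return counts;
}
```

On a grid, the map and reduce calls would run as separate tasks on different hosts; only the grouping logic stays conceptually the same.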
    Last edited by Qiang Xu; August 14th, 2008 at 05:50 PM.

  5. #5
    Qiang Xu is offline Junior Member

    Hello guys,
    We have done more research in this area.

    Target
    • Improve Symphony's performance when processing large data sets.
    – By introducing a distributed file system (DFS) into Symphony, customers’ data can be kept on the compute nodes. This reduces the time spent passing data to the SIM in SOAM messages.
    • Let each task run where its preferred data resides.
    – By enhancing the task and session API, customers can obtain a compute node according to their data or resource requirements.
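
The second target can be pictured as a host-selection rule: prefer an idle host that already holds the task's data, and fall back to any idle host otherwise. The sketch below is only an illustration of that rule under assumed data structures; `BlockLocations` and `pickHost` are hypothetical names and do not exist in the SOAM API.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical lookup table: data block id -> hosts holding a replica.
using BlockLocations = std::map<std::string, std::set<std::string>>;

// Returns the host chosen for a task that reads blockId,
// or an empty string when no host is idle.
std::string pickHost(const std::string& blockId,
                     const BlockLocations& locations,
                     const std::set<std::string>& idleHosts)
{
    auto it = locations.find(blockId);
    if (it != locations.end()) {
        for (const auto& host : it->second) {
            if (idleHosts.count(host) > 0) {
                return host;  // data-local: the block is already on this host
            }
        }
    }
    // No idle host holds the block: fall back to any idle host,
    // which then has to fetch the data remotely.
    return idleHosts.empty() ? std::string() : *idleHosts.begin();
}
```

A real scheduler would also weigh load, priorities, and resource requirements; this sketch covers data placement only.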

    Plan
    1. DFS integration plan
      1. Investigate how Hadoop uses HDFS. (1 week)
      2. Enhance our MapReduce API for SOAM, adding utilities such as automatic job splitting. (1 week)
      3. Implement a DFS proxy so that any DFS can be integrated in the future. (1 week)
      4. Integrate HDFS with our new MapReduce API. (1 week)
      5. Once HDFS is integrated, the existing DFS proxy should make it easy to integrate KFS.
    2. Enhance our SOAM API to support data-aware scheduling.
      1. This means letting the metadata server tell the SSM where to find the data. (We will start this after the DFS integration project.)
    HDFS vs. KFS

    KFS reportedly performs better than HDFS:

    http://code.google.com/p/hypertable/wiki/KFSvsHDFS
    The problem is that KFS is at an early stage of implementation: the system is not yet stable, and there are not enough samples showing how to use it.
    Last edited by Qiang Xu; August 14th, 2008 at 06:33 PM.

  6. #6
    Blackbird is offline Junior Member

    While I am not familiar with the specific DFS you are proposing, I think it is an excellent idea to provide a DFS for the service tasks to access large datasets. Otherwise, the client may become a storage and communication bottleneck for some applications.

    Taking this a bit further, how about a data fabric for the service tasks? Something like GemFire would provide better fault tolerance as well as caching.
    -- Cheers, Peter

    (Peter Strazdins, Australian National University)

  7. #7
    csmith is offline Junior Member

    Choosing which DFS to support is a difficult question. I think supporting HDFS or KFS is a "natural" choice, since they were built to support MapReduce (in the form of Hadoop), so we know that the data model fits well. For general Symphony applications, data caches like GemFire, Oracle Coherence, or GigaSpaces XAP fit very well (as evidenced by Platform customers using these various technologies together).

    For my part, I'd be interested in seeing how a cluster file system (CFS) like Lustre or GPFS could be used with MapReduce. I think the important characteristic of the data layer is "parallel access" (vs., say, atomic updates of data items, etc.), and these file systems are well tuned for this type of thing.
