![]() | ||||
| ||||||
| Research Topics and Sponsored Projects Discussion on various research topics and sponsored projects |
![]() |
| | LinkBack | Thread Tools | Search this Thread | Display Modes |
| |||
| Using EGO as Resource Manager for Hadoop Applications hadoop on demand integration -------------------------------------------------------------------------------- Hadoop is becoming an interesting programming paradigm and run time system for data-intensive computing applications. The Hadoop-On-Demand (HOD) is a system for provisioning virtual Hadoop clusters over a large physical cluster which uses Torque as a resource manager. See http://hadoop.apache.org/core/docs/r0.16.4/hod.html The intention of this project is to determine the feasibility of using EGO as the underlying resource manager for Hadoop applications, whereby EGO policies could be used to control the dynamic growing/shrinking of Hadoop clusters. The main benefit would be to have a common resource policy engine where Hadoop is yet another application. Some of the tasks for this project include: 1) Install and play with HOD 2) Learn the interfaces for integrating with a Resource Manager in HOD 3) Interface EGO with HOD to allow EGO to select hosts for a Hadoop cluster and manage start up 4) Investigate whether it is possible to extract performance metrics from Hadoop run-time to trigger grow/shrink based on workload performance |
| |||
|
We have discussed the solution of integrating Hadoop with EGO, here is our initial idea: In Hadoop, there are 4 java classes for starting 4 basic Hadoop daemons (name node, data node, job tracker and task tracker), we can make them as 4 EGO services, let egosc start/stop them on different machines. We can use PTM orchestrator to monitor the load info of the Hadoop cluster, and to ask egosc to start/stop Hadoop daemons (i.e. add/remove nodes into/from Hadoop cluster) according to the load info changes dynamically. For example, suppose there are 6 nodes(n1 – n6) ready for two Hadoop clusters(hadoop1 and hadoop2), Initially, hadoop1 contains n1 – n3, hadoop2 contains n4 – n6. After a moment, PTM orchestrator finds hadoop1 is very busy, it need more nodes to run jobs, but hadoop2 is very idle, it has free nodes. So PTM orchestrator for Hadoop1 asks egosc to start one more Hadoop data node (e.g. n4), and PTM orchestrator for Hadoop2 asks egosc to stop one Hadoop data node (e.g. n6). In this way, both Hadoop clusters will have proper number of nodes to run jobs. In this solution, we have an issue to make clear: If one node is added into a Hadoop cluster dynamically, can this cluster use this new node to run jobs right away? We will find the answer of this issue ASAP. Last edited by qzhang; August 14th, 2008 at 05:04 PM.. |
| |||
|
We looked through HOD project which uses the Torque to do node allocation. On the allocated nodes, it can start Hadoop daemons. It automatically generates the appropriate configuration files for the Hadoop daemons and client. HOD also has the capability to distribute Hadoop to the nodes in the virtual cluster that it allocates. In short, the advantage of HOD is that it makes it easy for users to quickly setup and use Hadoop cluster. The shortcoming of HOD is, when a Hadoop cluster has been set up and in use, we can not add new nodes into it dynamically, i.e., the Hadoop cluster set up by HOD is hard to expand. But our solution can provide this feature to make Hadoop cluster have good scalability. We make Hadoop daemon as EGO service, and start/stop them on the selected node on demand. We have found that the new node added into Hadoop cluster can be immediately used by Hadoop to run Map/Reduce jobs. So in this way, we can tune the performance of Hadoop cluster dynamically. |
| |||
|
Ok, that sounds interesting. Once we get something up an running on EGO, we should probably circle back to the Hadoop community to see what their thoughts are. If there is any way we can leverage some of the code or the interfaces in HOD to provide a similar user experience we should investigate. Initially at this stage, it would be useful to demonstrate the feasibility of dynamically growing/shrinking Hadoop application clusters leveraging EGO.
|
| |||
|
Yes, Hadoop guys think they are suffering from performance issues with HOD and not finding it the right model for running jobs, so they proposed an idea at the end of last month, to extend the scheduling functionality of Hadoop to allow sharing of large clusters without the use of HOD. We took a look at this idea (http://issues.apache.org/jira/browse/HADOOP-3421) today, and found that the main purpose of this idea is to improve the scheduling capability of Hadoop job tracker, so they plan to implement a lots of related concepts, such as: organization, queue, job priority, resource limit for user, etc. It seems they wants to implement some features somewhat like what our LSF scheduler provides, and wants Hadoop to offer better scheduling capability within a cluster, not to enable dynamic growing/shrinking of Hadoop clusters in a large physical cluster which is the objective of our Hadoop/EGO integration. So I think we can keep doing our solution as there is no explicit conflicts between their idea and our solution. |
| |||
|
I think we should have a separate project to do the integration with LSF and Hadoop. There clearly seems to be a need for richer scheduling policies and controls that LSF provides. I expect the mechanisms that we use to integrate with EGO in terms of starting and stopping various Hadoop services to be re-usable. It might be useful to post in Hadoop community and get some feedback.
|
| |||
|
Now we have already wrapped Hadoop daemons as EGO services, and can start Hadoop clusters by EGO service controller. So EGO policies can already be used to control the dynamic growing/shrinking of Hadoop clusters (In my demo, I have two Hadoop clusters, each of them has 2 nodes at the beginning, we can easily change the number of the nodes they have by adjusting the share ratio of related consumers). Next, we want to use PTM orchestrator to control the growing/shrinking of Hadoop clusters automatically according to the load info. |
| |||
|
Now we can use PTM orchestrator to control the growing/shrinking of Hadoop clusters automatically according to the load info. The load info we use is the ratio of the number of Hadoop tasks to the number of Hadoop data nodes. If the load info is higher than the high water we pre-defined, PTM orchestrator will add more data nodes into Hadoop cluster. But when load info is lower than the low water we pre-defined, now we can not remove any data nodes from Hadoop cluster. The reason is each data node may have some intermediate results produced by the previous tasks, these results are needed by the subsequent tasks. So if we remove a data node, the intermediate results on it will never be accessible by any other subsequent tasks. The work around we use here is defining a time which should be long enough to finish the job we run in the Hadoop cluster, when this time expires (the job is done), if the load info is lower than the low water, shrink the Hadoop cluster by removing some data nodes. Please refer to the attached package(HadoopOrchestrator.zip) for this Hadoop-EGO integration. |