
Thread: Search parallelism using Symphony DE - Simple Demonstration

  1. #11
    Ajith is offline Symphony DE Moderator
    Join Date
    February 28th, 2008
    Location
    Markham, Ontario
    Posts
    104
    Blog Entries
    2
    Downloads
    10
    Uploads
    0

    DE and 4.0 schedule tasks the same way.

    I'm not following the use case very well. It seems that you have 3 tasks but want to run all 3 tasks on each of your 3 compute hosts. In the most common case, where no host fails, your tasks will take 3 times as long to complete. Unless your host failure rate is 70%, the expected completion time will be better if you just run the 3 tasks the normal way, without any redundancy.
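
    A rough way to explore this trade-off is to compare expected completion times under a simple model. The sketch below is only illustrative: the function names, the parameters (`task_time`, `p_fail`, `detect_delay`), and the model itself are assumptions, not anything taken from Symphony. It assumes hosts fail independently, a failed task costs one detection delay plus one re-run, and the redundant scheme runs every task serially on every host.

```python
def expected_normal(n_hosts, task_time, p_fail, detect_delay):
    """Expected completion time with one task per host (hypothetical model).

    If any host fails, the job waits for detection and one re-run.
    Simplification: the re-run is assumed to succeed.
    """
    p_any_fail = 1.0 - (1.0 - p_fail) ** n_hosts
    return task_time + p_any_fail * (detect_delay + task_time)


def expected_redundant(n_tasks, task_time):
    """Completion time when every host runs all tasks one after another.

    Simplification: at least one host is assumed to survive, so no
    detection or re-run is ever needed.
    """
    return n_tasks * task_time
```

    Under these assumptions, redundancy only pays off when the detection delay is large relative to the task time and failures are frequent enough; comparing `expected_normal(3, 1.0, p, 15.0)` with `expected_redundant(3, 1.0)` for a few values of `p` shows where the break-even point lies for a given setup.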

    - Ajith

  2. #12
    oags15 is offline Junior Member
    Join Date
    October 20th, 2008
    Location
    Germany
    Posts
    11
    Downloads
    0
    Uploads
    0

    The aim of the project is that the delay when a host goes down is no more than 1 second, so I think this could be a good approach. Normally, when a host fails, the system has to send the input data to another host and start the task there. With the approach I have in mind, the data is already on the other hosts and the task is already being executed there with its output suppressed, so we would only need to unblock the output, and the delay could be less than 1 second.

  3. #13
    Ajith is offline Symphony DE Moderator

    One-second recovery from a host failure is quite difficult. Symphony will automatically deal with host failures, but the time required to detect the failure and reschedule the task will likely be more than 1 second.

    Symphony doesn't allow you to execute the same task on more than one host with any of its current scheduling policies. You may want to try MPI, which allows the client app to send tasks directly to compute hosts.

    - Ajith

  4. #14
    oags15 is offline Junior Member

    I want 1 second because the final application will be sensitive to delays if a host fails; that is why I want no more than 1 second, or at most 4 seconds. Do you know how I can reduce the delay when a host goes down? I made some changes in the XML file and sent a task, then unplugged the LAN cable of one host to simulate it being completely down, and it took about 15 seconds to get output again.
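
    One way to put a number on that pause is to timestamp each result as the client receives it and look for the longest gap. This is a minimal sketch; `largest_gap` is a hypothetical helper, and how the timestamps are collected (for example with `time.monotonic()` in the client's result handler) is left as an assumption.

```python
def largest_gap(timestamps):
    """Return (gap_seconds, start, end) for the longest pause between results.

    `timestamps` are the arrival times of task results, in seconds.
    With fewer than two results there is no gap to report.
    """
    ordered = sorted(timestamps)
    if len(ordered) < 2:
        return (0.0, None, None)
    # Pair each arrival with the next one and keep the widest interval.
    gaps = [(later - earlier, earlier, later)
            for earlier, later in zip(ordered, ordered[1:])]
    return max(gaps)
```

    Repeating the cable-pull test a few times and comparing the reported gaps gives a more reliable figure than a single stopwatch reading.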

    Another way could be to store the task on all hosts without executing it, so that when host 1 goes down I only need to activate the task. Is that possible?

    Thank you,

    OAGS

  5. #15
    Ajith is offline Symphony DE Moderator

    Host down situations can be difficult to detect. There are several conditions and they are handled differently.

    1. Host power-off in EGO mode - detected after 60 seconds
    2. Host power-off in DE mode - detected after the TCP KEEP_ALIVE interval that can be configured as short as 3 minutes
    3. Host clean shutdown - detected immediately
    4. Host O/S crash - unknown detection time
    5. Network error - detected immediately
    6. LAN cable disconnected - detected immediately
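
    For context on the keepalive case in item 2, TCP keepalive timers can be set per socket at the operating system level. The sketch below shows the general mechanism in Python only; it is not Symphony configuration, `enable_keepalive` is a hypothetical helper, and whether Symphony DE exposes these settings is not something this thread confirms. The per-socket timer options are platform-specific, so each one is applied only if the local `socket` module defines it.

```python
import socket


def enable_keepalive(sock, idle_s=60, interval_s=10, probes=3):
    """Enable TCP keepalive and shorten its timers where the platform allows.

    Returns the approximate worst-case time, in seconds, before a silent
    peer is declared dead: idle time plus all unanswered probes.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These option names exist on Linux; other platforms may lack some.
    for name, value in (("TCP_KEEPIDLE", idle_s),
                        ("TCP_KEEPINTVL", interval_s),
                        ("TCP_KEEPCNT", probes)):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return idle_s + interval_s * probes
```

    Even with aggressive timers, detection still takes at least the idle time plus the probe window, which is why keepalive alone is unlikely to give sub-second detection.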

    Even if host-down is detected immediately, SSM will take some time to find another free slot, schedule the task and then send the task to the compute host. This runs as fast as possible.

    There is no way to send the same task to multiple hosts in 4.0. There are some advancements in 4.1 (sending data directly from the client to the service, bypassing the SSM), but implementing your use case will still be difficult.

  6. #16
    oags15 is offline Junior Member

    Thank you for your help, Ajith. I will try to find a solution or another way to do it.

    OAGS

  7. #17
    oags15 is offline Junior Member

    Hi Ajith,

    I am following your advice and trying to use MPI with Symphony DE. Do you think it is possible to use MPI to detect that a host has failed and then send a message to Symphony so that it can resend the task faster? That way Symphony would not be responsible for detecting the host failure, only for sending the task again, which might take less time.

    OAGS

  8. #18
    Ajith is offline Symphony DE Moderator

    Quote: Originally posted by oags15
    Hi Ajith,

    I am following your advice and trying to use MPI with Symphony DE. Do you think it is possible to use MPI to detect that a host has failed and then send a message to Symphony so that it can resend the task faster? That way Symphony would not be responsible for detecting the host failure, only for sending the task again, which might take less time.

    OAGS

    Symphony doesn't give the application any control over the scheduling of tasks. The basic assumption is that host failures are a rare event.

    In your use case, it is not possible to build in the high level of redundancy you require without coding the task distribution logic yourself, which defeats the purpose of using Symphony.
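
    To give a sense of what coding that logic yourself would involve, here is a minimal sketch of an application-level heartbeat monitor with failover to a standby host. Everything in it is hypothetical (the `HeartbeatMonitor` class, its methods, and the timeout value); it uses no Symphony or MPI API, and how heartbeats actually reach the monitor is left open.

```python
import time


class HeartbeatMonitor:
    """Track the last heartbeat per host and fail over when the active one goes silent."""

    def __init__(self, hosts, timeout_s=1.0, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        now = clock()
        # Every host starts out as "just seen"; the first host is active.
        self.last_seen = {host: now for host in hosts}
        self.active = hosts[0]

    def heartbeat(self, host):
        """Record that `host` has just reported in."""
        self.last_seen[host] = self.clock()

    def alive(self, host):
        return self.clock() - self.last_seen[host] <= self.timeout_s

    def check(self):
        """Return the host whose output should be used, switching if needed."""
        if not self.alive(self.active):
            for host in self.last_seen:
                if self.alive(host):
                    self.active = host
                    break
            # If no host is alive, the previous active host is kept.
        return self.active
```

    The failover time in such a scheme is bounded by the heartbeat timeout plus the polling interval of `check()`, so a sub-second target would need heartbeats several times per second, with the standby hosts already holding the data.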

    - Ajith

  9. #19
    oags15 is offline Junior Member

    Hi Ajith,
    Following up on the previous posts, I would like to clarify our case.
    I am going to explain how we want to use the Platform grid computing solution for a redundant system, and I would appreciate your comments.

    1.- In our case we are not going to send jobs to different resources. The jobs are already located on the Platform resources, and Platform is used only to control, run, detect, and monitor them. In other words, all jobs already exist (passively) on every resource, and by sending a job we only want to activate the selected one. The idea is that as soon as any resource fails, Platform runs the failed job on another resource (or resends it to any other available resource). Jobs only need to be activated (run) on a specific machine.

    2.- Regarding the time required, what do you mean by more than 1 second?

    Quote: Originally posted by Ajith
    One-second recovery from a host failure is quite difficult. Symphony will automatically deal with host failures, but the time required to detect the failure and reschedule the task will likely be more than 1 second.

    Symphony doesn't allow you to execute the same task on more than one host with any of its current scheduling policies. You may want to try MPI, which allows the client app to send tasks directly to compute hosts.

    - Ajith

    3.- What is the fastest possibility?

    Quote: Originally posted by Ajith
    Host down situations can be difficult to detect. There are several conditions and they are handled differently.

    1. Host power-off in EGO mode - detected after 60 seconds
    2. Host power-off in DE mode - detected after the TCP KEEP_ALIVE interval that can be configured as short as 3 minutes
    3. Host clean shutdown - detected immediately
    4. Host O/S crash - unknown detection time
    5. Network error - detected immediately
    6. LAN cable disconnected - detected immediately

    Even if host-down is detected immediately, SSM will take some time to find another free slot, schedule the task and then send the task to the compute host. This runs as fast as possible.

    There is no way to send the same task to multiple hosts in 4.0. There are some advancements in 4.1 (sending data directly from the client to the service, bypassing the SSM), but implementing your use case will still be difficult.

    4.- Even if failures are rare, our goal is to cover that small percentage of cases when they do happen.

    Quote: Originally posted by Ajith
    Symphony doesn't give the application any control over the scheduling of tasks. The basic assumption is that host failures are a rare event.

    In your use case, it is not possible to build in the high level of redundancy you require without coding the task distribution logic yourself, which defeats the purpose of using Symphony.

    - Ajith

    Could you please explain what mechanism Platform goes through when a failure happens? For example, when I unplug the network cable of one resource, I can see that Platform detects the failure immediately, but resending the failed job takes around 20 seconds. Why?

    Thank you,

    OAGS

  10. #20
    Ajith is offline Symphony DE Moderator

    1. There's no problem with the proposal; it minimizes the amount of data transferred by Symphony. This will save time if the amount of task input/output data is large (> 1 KB). It won't reduce the time required to detect a host failure.

    2. Depending on the host failure type, Symphony may take up to 2 minutes to detect that the host went down due to a power-off.

    3. I don't have statistics for host failure detection times.

    4. If a failure happens, the SSM will resend the task to another host with the highest priority. If all slots are busy, the SSM will wait until one is freed up. The 20s delay is probably a detection delay. If there is a free slot, the SSM should reschedule the task as soon as the host failure is detected. You can set the SSM log level to debug to get more information about what the SSM is doing. Modify the /conf/ssm.log4j.properties file on the SSM host. The SSM log file can be found in /logs.

    - Ajith
