
Thread: Failure and Recovery in Symphony DE

  1. #1
    ComputerGuy's Avatar
    ComputerGuy is offline Junior Member
    Join Date
    April 24th, 2008
    Posts
    22
    Downloads
    2
    Uploads
    0


    While browsing through the Development Guide, I came across a chapter called "Automatic failure recovery feature".

    It looks like a very comprehensive feature. However, it states at the beginning: "This feature is not applicable in Symphony DE."

    I'd like to know what happens in Symphony DE if "a SOAM process becomes unavailable", as illustrated in the diagram at the beginning of the "Automatic failure recovery feature" document:


  2. #2
    lechen's Avatar
    lechen is offline Junior Member
    Join Date
    March 12th, 2008
    Location
    Toronto, Ontario
    Posts
    71
    Blog Entries
    1
    Downloads
    8
    Uploads
    0


    Excellent question, CG, and one that's currently not covered in the documentation.

    First, regarding the feature not being available in Symphony DE: since Symphony DE is a development environment rather than a production environment, maintaining cluster reliability and availability is not its primary objective. In the event of a hardware or daemon failure, the developer can always restart Symphony DE or the host without serious impact on application development.

    As for what happens if a SOAM process becomes unavailable: the Symphony DE daemons (start_agent, RS, SD, SSM, SIM) are built as fault-resilient, reliable components, and it is uncommon for one of them to crash or hang. If one does become unavailable, for example because it was manually killed, please refer to the table below for the specifics on each Symphony DE process:

    [Table: behavior of each Symphony DE process when it becomes unavailable]

    Symphony DE also provides features to handle Application failure recovery in case of abnormal termination of a client or service process:

    [Table: application failure recovery features for client and service processes]

    Hope it helps.

  3. #3
    ComputerGuy

    Thanks, lechen.

    Yeah, my Symphony DE daemons have been running pretty healthily so far. Only my services are core dumping.

    I just tried manually killing the SSM process of my Java application, and it restarted within a second. Pretty neat.
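
    A rough way to time such a restart is to note the old process ID, kill the process, and poll until a different ID shows up. The sketch below is only that: `get_pid` is a hypothetical callable you would supply yourself (for example, by scanning your platform's process list for the SSM), and nothing here uses a Symphony API.

```python
import time
from typing import Callable, Optional


def time_restart(get_pid: Callable[[], Optional[int]],
                 old_pid: int,
                 timeout: float = 30.0,
                 poll_interval: float = 0.05) -> Optional[float]:
    """Return the seconds elapsed until a new PID appears, or None on timeout.

    get_pid is a caller-supplied (hypothetical) function that returns the
    current PID of the watched process, or None while it is not running.
    Call this immediately after killing old_pid; the result is measured
    from the moment of the call, not from the moment of the kill.
    """
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        pid = get_pid()
        # A PID that exists and differs from the old one means a restart.
        if pid is not None and pid != old_pid:
            return time.monotonic() - start
        time.sleep(poll_interval)
    return None


if __name__ == "__main__":
    # Simulated process: still the old PID, then gone twice, then restarted.
    observed = iter([100, None, None, 200])
    elapsed = time_restart(lambda: next(observed), old_pid=100,
                           timeout=5.0, poll_interval=0.01)
    print("restart detected" if elapsed is not None else "timed out")
```

    Comparing PIDs rather than checking "is it running" avoids missing a restart that completes between two polls.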

  4. #4
    lechen

    Originally posted by ComputerGuy:
    "I just tried manually killing the SSM process of my Java application, and it restarted within a second, pretty neat."
    SSM recovery time depends on the amount of data it has to recover from journaling and paging. In other words, if you had a large number of recoverable sessions open at the time the SSM process was killed, it will take the new SSM a few more seconds to restore them before it is ready to accept new requests.

    Originally posted by ComputerGuy:
    "Yeah, my Symphony DE daemons so far are running pretty healthy. Only my services are core dumping"
    If you do happen to catch lightning in a bottle and capture a Symphony DE daemon crash, send it to us and we'll investigate the cause.

  5. #5
    ComputerGuy

    Originally posted by lechen:
    "SSM recovery time depends on the amount of data it has to recover from Journaling and Paging. In other words, if you had a large amount of sessions (recoverable) open at the time the SSM process was killed, it will take a few more seconds for the new SSM to restore them before it's ready to accept new requests."
    My application is based on SampleAppJava. I checked the profile and it has recoverable="false".

    I guess that since I'm only testing my app, it's not necessary to configure "recoverable". Everything else behaves the same as long as the SSM remains available, correct?

  6. #6
    lechen

    Originally posted by ComputerGuy:
    "I guess that since I'm only testing my app, it's not necessary to configure "recoverable". Everything else behaves the same as long as the SSM remains available, correct?"
    Correct. Performance is actually better for unrecoverable sessions, since the SSM does not have to journal session and task data. All the samples packaged in Symphony DE have recoverable set to false.

  7. #7
    ComputerGuy

    Originally posted by lechen:
    "Correct, performance would actually be better for unrecoverable sessions, since SSM does not have to journal session and task data. All the samples packaged in SymphonyDE have recoverable set to false."
    While going through the samples, I noticed that in the SessionReconnect sample the "RecoverableClient" session type has recoverable set to true. Is there any special reason this session type needs to be recoverable?

    <SessionTypes>
      <Type name="RecoverableClient" priority="1" recoverable="true" abortSessionIfClientDisconnect="false"
            sessionRetryLimit="3" taskRetryLimit="3" abortSessionIfTaskFail="false"
            suspendGracePeriod="100" taskCleanupPeriod="100"
            discardResultsOnDelivery="false"/>

      <Type name="OfflineClient" priority="1" recoverable="false" abortSessionIfClientDisconnect="true"
            sessionRetryLimit="3" taskRetryLimit="3" abortSessionIfTaskFail="false"
            suspendGracePeriod="100" taskCleanupPeriod="100"/>
    </SessionTypes>

  8. #8
    lechen

    Originally posted by ComputerGuy:
    "While going through the samples, I noticed that in the SessionReconnect sample the "RecoverableClient" session type has recoverable set to true. Is there any special reason this session type needs to be recoverable?"
    Setting recoverable to true is not necessary.

    Do note, however, that for the SessionReconnect sample, abortSessionIfClientDisconnect must be set to false so that the client can reconnect. Otherwise the session enters the Abort state when the client disconnects.
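
    To compare the failure-related attributes of the two session types side by side, here is a minimal Python sketch that parses the SessionTypes fragment quoted earlier in this thread. It is not a Symphony tool: it reads only that fragment, the helper name `failure_flags` is made up for illustration, and a full application profile may need extra handling (for example, XML namespaces).

```python
import xml.etree.ElementTree as ET

# The SessionTypes fragment from the SessionReconnect sample, as quoted above.
PROFILE_FRAGMENT = """
<SessionTypes>
  <Type name="RecoverableClient" priority="1" recoverable="true" abortSessionIfClientDisconnect="false"
        sessionRetryLimit="3" taskRetryLimit="3" abortSessionIfTaskFail="false"
        suspendGracePeriod="100" taskCleanupPeriod="100"
        discardResultsOnDelivery="false"/>
  <Type name="OfflineClient" priority="1" recoverable="false" abortSessionIfClientDisconnect="true"
        sessionRetryLimit="3" taskRetryLimit="3" abortSessionIfTaskFail="false"
        suspendGracePeriod="100" taskCleanupPeriod="100"/>
</SessionTypes>
"""

FLAGS = ("recoverable", "abortSessionIfClientDisconnect", "abortSessionIfTaskFail")


def failure_flags(xml_text):
    """Map each session type name to its raw failure-related attribute values.

    Values are kept as the strings found in the XML; an attribute that is
    absent is reported as None rather than guessing a default.
    """
    root = ET.fromstring(xml_text.strip())
    return {
        session_type.get("name"): {flag: session_type.get(flag) for flag in FLAGS}
        for session_type in root.iter("Type")
    }


if __name__ == "__main__":
    for name, flags in failure_flags(PROFILE_FRAGMENT).items():
        print(name, flags)
```

    For this fragment, the two types agree on abortSessionIfTaskFail and differ on recoverable and abortSessionIfClientDisconnect.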

  9. #9
    lechen

    I've posted a new article "Component Failure and Recovery in Symphony DE" in the Articles section.

    The article describes the behavior and recovery steps when Symphony DE components (both system daemons and client applications) or the host machine become unavailable.

    Here's a general overview:




    Comments welcome.
    Last edited by Ajith; July 16th, 2008 at 07:50 PM.

  10. #10
    oags15 is offline Junior Member
    Join Date
    October 20th, 2008
    Location
    Germany
    Posts
    11
    Downloads
    0
    Uploads
    0


    Hi,

    I am new to grid computing and to Symphony as well. I read foundations_sym.pdf and it was very helpful, but I still have some questions.

    I know the process for starting up the cluster (from foundations_sym.pdf). Does anybody know the process for shutting it down? That is not included in the PDF.

    I read that Symphony is fault tolerant: every component in the system has a recovery operation, is monitored by another component, and can automatically recover from a failure (foundations_sym.pdf again, haha). I would like more detailed information about this, for example, how long it takes to restart the system after a fault.

    Thanks a lot
