HPCCommunity.org
 

Installing, Managing and Running Symphony DE Support and troubleshooting questions for Symphony DE.

#1
April 29th, 2008, 09:15 PM
ComputerGuy
Junior Member
Join Date: April 24th, 2008
Posts: 22
Failure and Recovery in Symphony DE

While browsing through the Development Guide, I came across a chapter called "Automatic failure recovery feature".

It looks like a very comprehensive feature. However, it states at the beginning: "This feature is not applicable in Symphony DE."

I'd like to know what happens in Symphony DE if "a SOAM process becomes unavailable", as illustrated in the diagram at the beginning of the "Automatic failure recovery feature" document.


#2
April 29th, 2008, 09:53 PM
lechen
Junior Member
Join Date: March 12th, 2008
Location: Toronto, Ontario
Posts: 71
Blog Entries: 1

Excellent question, CG, and one that's currently not covered in the documentation.

First, regarding the feature not being available in Symphony DE: since Symphony DE is oriented toward development rather than production, maintaining cluster reliability and availability is not the primary objective. In the event of a hardware or daemon failure, the developer can always restart Symphony DE or the host without serious impact on the application development process.

As to your question of what happens if a SOAM process becomes unavailable: the Symphony DE daemons (start_agent, RS, SD, SSM, SIM) are built as fault-resilient and reliable components, and it is uncommon for one of them to crash or hang. If one does become unavailable, for example because it was manually killed, refer to the table below for the specifics on each Symphony DE process:

Process     | Unavailable Result       | Recovery
start_agent | Symphony DE shutdown     | Restart Symphony DE
RS          | Symphony DE shutdown     | Manually restart Symphony DE
SD          | Symphony DE shutdown     | Manually restart Symphony DE
SSM         | SD restarts another SSM  | Automatic
SIM         | SSM restarts another SIM | Automatic

Symphony DE also provides features to handle Application failure recovery in case of abnormal termination of a client or service process:

Application      | Unavailable Result              | Recovery                      | Reference
Service Instance | SIM restarts Service Instance   | Automatic                     | Service error handling feature
Client           | Client disconnects from session | Relaunch client and reconnect | Disconnect and reconnect to a session
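
As a rough illustration, the two tables above can be captured in a small lookup structure. This is a minimal Python sketch, not part of Symphony DE: the `RECOVERY` dictionary and the `needs_manual_action` helper are hypothetical names, and the entries only restate the table rows.

```python
# Hypothetical lookup built from the two tables in this post.
# Keys are the component names; values are (unavailable result, recovery).
RECOVERY = {
    "start_agent": ("Symphony DE shutdown", "Restart Symphony DE"),
    "RS": ("Symphony DE shutdown", "Manually restart Symphony DE"),
    "SD": ("Symphony DE shutdown", "Manually restart Symphony DE"),
    "SSM": ("SD restarts another SSM", "Automatic"),
    "SIM": ("SSM restarts another SIM", "Automatic"),
    "Service Instance": ("SIM restarts Service Instance", "Automatic"),
    "Client": ("Client disconnects from session", "Relaunch client and reconnect"),
}


def needs_manual_action(component: str) -> bool:
    """Return True when the table lists anything other than automatic recovery."""
    _result, recovery = RECOVERY[component]
    return recovery != "Automatic"


# Components a developer has to restart or relaunch by hand.
manual = [name for name in RECOVERY if needs_manual_action(name)]
print(manual)
```

Read this way, the components that are supervised by another daemon (SSM, SIM, Service Instance) recover on their own, while the top-level daemons and the client need a manual restart or relaunch.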

Hope it helps.

#3
April 29th, 2008, 10:03 PM
ComputerGuy
Junior Member
Join Date: April 24th, 2008
Posts: 22

Thanks lechen

Yeah, my Symphony DE daemons so far are running pretty healthy. Only my services are core dumping.

I just tried manually killing the SSM process of my Java application, and it restarted within a second, pretty neat.

#4
April 29th, 2008, 10:13 PM
lechen
Junior Member
Join Date: March 12th, 2008
Location: Toronto, Ontario
Posts: 71
Blog Entries: 1

Quote:
Originally Posted by ComputerGuy
I just tried manually killing the SSM process of my Java application, and it restarted within a second, pretty neat.

SSM recovery time depends on the amount of data it has to recover from journaling and paging. In other words, if you had a large number of recoverable sessions open when the SSM process was killed, the new SSM would take a few more seconds to restore them before it is ready to accept new requests.

Quote:
Originally Posted by ComputerGuy
Yeah, my Symphony DE daemons so far are running pretty healthy. Only my services are core dumping.

If you do happen to catch lightning in a bottle and capture a Symphony DE daemon crash, send it to us and we'll investigate the cause.

#5
April 29th, 2008, 10:20 PM
ComputerGuy
Junior Member
Join Date: April 24th, 2008
Posts: 22

Quote:
Originally Posted by lechen
SSM recovery time depends on the amount of data it has to recover from journaling and paging. In other words, if you had a large number of recoverable sessions open when the SSM process was killed, the new SSM would take a few more seconds to restore them before it is ready to accept new requests.

My application is based on SampleAppJava. I checked the profile, and it has recoverable="false".

I guess that since I'm only testing my app, it's not necessary to configure "recoverable". Everything else behaves the same as long as the SSM remains available, correct?

#6
April 29th, 2008, 11:00 PM
lechen
Junior Member
Join Date: March 12th, 2008
Location: Toronto, Ontario
Posts: 71
Blog Entries: 1

Quote:
Originally Posted by ComputerGuy
I guess that since I'm only testing my app, it's not necessary to configure "recoverable". Everything else behaves the same as long as the SSM remains available, correct?

Correct. Performance would actually be better for unrecoverable sessions, since the SSM does not have to journal session and task data. All the samples packaged in Symphony DE have recoverable set to false.

#7
May 7th, 2008, 07:07 PM
ComputerGuy
Junior Member
Join Date: April 24th, 2008
Posts: 22

Quote:
Originally Posted by lechen
Correct. Performance would actually be better for unrecoverable sessions, since the SSM does not have to journal session and task data. All the samples packaged in Symphony DE have recoverable set to false.

While going through the samples, I noticed that in the SessionReconnect sample, the "RecoverableClient" session type has recoverable set to true. Is there any special reason this session type needs to be recoverable?

<SessionTypes>
<Type name="RecoverableClient" priority="1" recoverable="true" abortSessionIfClientDisconnect="false"
sessionRetryLimit="3" taskRetryLimit="3" abortSessionIfTaskFail="false"
suspendGracePeriod="100" taskCleanupPeriod="100"
discardResultsOnDelivery="false"/>

<Type name="OfflineClient" priority="1" recoverable="false" abortSessionIfClientDisconnect="true"
sessionRetryLimit="3" taskRetryLimit="3" abortSessionIfTaskFail="false"
suspendGracePeriod="100" taskCleanupPeriod="100"/>
</SessionTypes>
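
To compare the two session types at a glance, the fragment above can be read with Python's standard `xml.etree.ElementTree` module. This is a minimal sketch under stated assumptions: the fragment is copied from this post rather than loaded from a real profile file, and `summarize_session_types` is a hypothetical helper name, not part of any Symphony API.

```python
import xml.etree.ElementTree as ET

# The <SessionTypes> fragment quoted above, embedded as a string for the sketch.
PROFILE_FRAGMENT = """
<SessionTypes>
  <Type name="RecoverableClient" priority="1" recoverable="true" abortSessionIfClientDisconnect="false"
        sessionRetryLimit="3" taskRetryLimit="3" abortSessionIfTaskFail="false"
        suspendGracePeriod="100" taskCleanupPeriod="100"
        discardResultsOnDelivery="false"/>
  <Type name="OfflineClient" priority="1" recoverable="false" abortSessionIfClientDisconnect="true"
        sessionRetryLimit="3" taskRetryLimit="3" abortSessionIfTaskFail="false"
        suspendGracePeriod="100" taskCleanupPeriod="100"/>
</SessionTypes>
"""


def summarize_session_types(xml_text: str) -> dict:
    """Map each session type name to its recoverable and abort-on-disconnect flags."""
    root = ET.fromstring(xml_text.strip())
    summary = {}
    for session_type in root.findall("Type"):
        summary[session_type.get("name")] = {
            "recoverable": session_type.get("recoverable") == "true",
            "abort_on_disconnect": session_type.get("abortSessionIfClientDisconnect") == "true",
        }
    return summary


for name, flags in summarize_session_types(PROFILE_FRAGMENT).items():
    print(name, flags)
```

The summary makes the difference between the two types explicit: "RecoverableClient" is recoverable and is not aborted on client disconnect, while "OfflineClient" is the opposite on both counts.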

#8
May 7th, 2008, 07:38 PM
lechen
Junior Member
Join Date: March 12th, 2008
Location: Toronto, Ontario
Posts: 71
Blog Entries: 1

Quote:
Originally Posted by ComputerGuy
While going through the samples, I noticed that in the SessionReconnect sample, the "RecoverableClient" session type has recoverable set to true. Is there any special reason this session type needs to be recoverable?

Recoverable is not necessary.

But do note that for the SessionReconnect sample, abortSessionIfClientDisconnect must be set to false to allow the client to reconnect. Otherwise the session enters the Abort state when the client disconnects.

#9
May 25th, 2008, 05:54 AM
lechen
Junior Member
Join Date: March 12th, 2008
Location: Toronto, Ontario
Posts: 71
Blog Entries: 1

I've posted a new article "Component Failure and Recovery in Symphony DE" in the Articles section.

The article describes the behavior and recovery steps when Symphony DE components (both system daemons and client applications) or the host machine become unavailable.

Here's a general overview:




Comments welcomed.

Last edited by Ajith; July 16th, 2008 at 06:50 PM.

#10
October 22nd, 2008, 06:01 PM
Junior Member
Join Date: October 20th, 2008
Location: Germany
Posts: 6
hi

Hi,

I am new to grid computing and to Symphony as well. I read foundations_sym.pdf and it was very helpful, but I still have some questions.

I know the process to start up the cluster (from foundations_sym.pdf). Does somebody know the process to shut down the cluster? That is not included in the PDF file.

I read that Symphony has fault tolerance: every component in the system has a recovery operation, every component is monitored by another component, and each can automatically recover from a failure (foundations_sym.pdf again, haha). But I want more detailed information about it. For example, how long does it take to restart the system after a fault?

Thanks a lot

All times are GMT.