HA - EVENTS

Event Scripts - Canceling the remaining events


PowerHA can reach a state known as Event Script Failure when an event script encounters an error from which it cannot recover. This failure can put the local node into an RP_FAILED state, and an RP_FAILED message is broadcast to all nodes in the cluster.

In this case an administrator is unable to move RGs or stop cluster services. What they can do is discover what caused the error, recover the affected resources, and then resume event processing by issuing a Recover from Script Failure on the affected node through SMIT or by running the clruncmd command. (Some administrators resort to restarting the affected node, an option that is far from ideal, particularly in production environments.)

To recover from a script failure, the Cluster Manager resumes event processing at the next step in the rules file. In many scenarios, it is hard for the Cluster Manager to accurately determine which resources were successfully managed before the failure, which resources need manual intervention, and which steps can be skipped when processing resumes.

In PowerHA V7.2.3, there is a new option to Cancel remaining event processing, which clears all the queued events on all nodes and skips any remaining steps in the current event. Any RGs whose state the Cluster Manager cannot determine are set to ERROR.

To use this option: smitty sysmirror --> Problem Determination Tools --> Recover From PowerHA SystemMirror Script Failure


------------------------------

CONFIG_TOO_LONG and EVENT SCRIPTS:

The Cluster Manager logs its information to /tmp/clstrmgr.debug; the script output (e.g. application stop/start) is sent to /tmp/hacmp.out. Until the event scripts start processing, no output is sent to hacmp.out.



WARNING: Cluster TIBCO_DB_CL has been running recovery program 'TE_DARE_CONFIGURATION' for 360 seconds. Please check cluster status.
WARNING: Cluster TIBCO_DB_CL has been running recovery program 'TE_DARE_CONFIGURATION' for 390 seconds. Please check cluster status.
WARNING: Cluster TIBCO_DB_CL has been running recovery program 'TE_DARE_CONFIGURATION' for 420 seconds. Please check cluster status.
WARNING: Cluster TIBCO_DB_CL has been running recovery program 'TE_DARE_CONFIGURATION' for 450 seconds. Please check cluster status.
...
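
Warnings of this form can be condensed with a small filter. The following is a minimal sketch, not part of PowerHA: the function name summarize_too_long is made up for illustration, and it assumes the warning text has exactly the wording shown above. It reads log text on standard input and prints each recovery program with the largest elapsed time reported for it.

```shell
# Summarize config_too_long warnings: one line per recovery program with the
# largest elapsed time seen. Reads log text on standard input.
# (summarize_too_long is a hypothetical helper, not a PowerHA utility.)
summarize_too_long() {
    awk '
        /has been running recovery program/ {
            # The program name follows the word "program"; the elapsed
            # time is the field just before "seconds.".
            for (i = 1; i <= NF; i++) {
                if ($i == "program")  { prog = $(i + 1) }
                if ($i == "seconds.") { secs = $(i - 1) + 0 }
            }
            gsub(/\047/, "", prog)   # strip the surrounding quote characters
            if (secs > max[prog]) { max[prog] = secs }
        }
        END {
            for (p in max) { printf "%s %d\n", p, max[p] }
        }
    '
}
```

Fed the four warnings above, it prints a single line: TE_DARE_CONFIGURATION 450.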

The config_too_long event is an informational event raised whenever a cluster event:
    - is taking too long (longer than a preset time)
    - is hanging or failed

HACMP stops processing cluster events until the situation is resolved.

EVENT TAKING TOO LONG:
An event script does not complete within 6 minutes (a limit that can be extended), but it will complete without intervention.
For example, varying on a large number of disks (or running fsck) takes a long time, but once it completes, the warning message stops being generated.


EVENT IS HANGING OR FAILED:
If the script is hanging or has failed, manual intervention is needed:

0. check the status of the cluster and resources:
    clstat (or snmpinfo -m get -o /usr/es/sbin/cluster/hacmp.defs -v clusterSubState.0), lsvg -o, netstat, ps -ef

1. get a high level look at what the cluster was trying to do

    look into: /usr/es/adm/cluster.log
    establish the time when the "Config Too Long" event started
    subtract the 6 minutes to find the start of the event that had difficulty
    (As an error does not always produce a config_too_long, it may be necessary to search for the strings FAILED or ERROR)
    (When you find the most recent error, read backwards to the beginning so the source of the problem can be found)
    When you have found the problem: record the time and script name.

2. go into more details
    In /tmp/hacmp.out move to the point in time recorded in step 1.
    Follow the trace in the /tmp/hacmp.out file until you find a useful message.
       (If there was an unrecoverable error, an "EVENT FAILED" indication should be in hacmp.out)
    (e.g. a filesystem could not be unmounted, but this did not give a nonzero exit value; the RC=1 occurred when varyoffvg then failed)
    (HACMP checks return codes, so when it received the nonzero code, the event failed)

3. Manually correct the problem so that the event is completed.
    it may be necessary to display the script to understand what was supposed to happen.
    (e.g. there is probably a hanging process which needs to be killed)
    (e.g. varyoffvg failed due to a running process; even after the process was killed, I had to manually umount the fs and vary off the vg)
    (e.g. the script was unable to set the hostname, so it can be set from the command line)

4. After correcting the issue, HACMP should be told to continue processing from the failed script
    Problem Determination Tools -> Recover From HACMP Script Failure
    /usr/es/adm/cluster.log will show the CM_CONTINUE request and the subsequent events
    (HACMP was looking for a good return code to continue, and "Recover from HACMP ..." provides RC=0 to allow HACMP to continue.)
    (the CONFIG TOO LONG message will be turned off as well)
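
The time arithmetic and log search from steps 1 and 2 can be sketched in shell. This is an illustrative sketch under assumptions: the function names event_start_time and find_failures are made up here, the timestamp is expected to be copied by hand from the log in HH:MM:SS form (the exact log timestamp layout is not assumed), and 360 seconds is the default 6-minute limit mentioned above.

```shell
# Print the time N seconds (default 360, i.e. 6 minutes) before an
# HH:MM:SS timestamp, wrapping around midnight if needed.
# Usage: event_start_time HH:MM:SS [seconds_back]
event_start_time() {
    echo "$1" | awk -F: -v back="${2:-360}" '{
        t = ($1 * 3600 + $2 * 60 + $3 - back) % 86400
        if (t < 0) { t += 86400 }
        printf "%02d:%02d:%02d\n", int(t / 3600), int((t % 3600) / 60), t % 60
    }'
}

# Print, with line numbers, every line of a log file that mentions
# FAILED or ERROR, so the most recent one can be located and read backwards.
# Usage: find_failures /path/to/logfile
find_failures() {
    grep -n -E 'FAILED|ERROR' "$1"
}
```

For example, if the first "Config Too Long" entry appeared at 14:03:10, then event_start_time 14:03:10 prints 13:57:10, which is roughly where to start reading. Running find_failures /tmp/hacmp.out lists candidate lines for step 2.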

ADDITIONAL NOTE:

Events may consist of several steps, and if we don't correct all of them the event can fail again.
example:
    1. node_down: (stop_server, release_takeover_addr, release_vg_fs, release_service_addr)
    2. node_down_complete (node_down_local_complete)
    ...

    If the first event (node_down) failed at step stop_server,
    we correct it and tell HACMP to continue.
    The continue information is sent to the event manager only, so the event manager will continue with step 2 (node_down_complete).
    This may fail again if we did not execute the functions of step 1.
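
To keep track of what still has to be done by hand before continuing, a checklist can be generated from the step list. The sketch below is only an illustration: steps_to_complete is a made-up helper, and the step names are the ones from the example above, passed in by the caller rather than read from PowerHA.

```shell
# List the failed step and every step after it within the same event;
# these are the functions to complete manually before telling HACMP to continue.
# Usage: steps_to_complete FAILED_STEP STEP1 STEP2 ...
steps_to_complete() {
    failed=$1
    shift
    found=0
    for step in "$@"; do
        [ "$step" = "$failed" ] && found=1
        [ "$found" -eq 1 ] && echo "$step"
    done
    return 0
}
```

With the node_down example, steps_to_complete stop_server stop_server release_takeover_addr release_vg_fs release_service_addr prints all four steps, since the failure happened at the first one.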

------------------------------

To avoid config_too_long in the future, we can change the time span:
Extended Configuration -> Extended Event Configuration -> Change/Show Time Until Warning

------------------------------
