
CONFIG_TOO_LONG and EVENT SCRIPTS:

The cluster manager logs its information to /tmp/clstrmgr.debug; the event script output (e.g. application stop/start) is sent to /tmp/hacmp.out. Until the event scripts start processing, nothing is written to hacmp.out.



WARNING: Cluster TIBCO_DB_CL has been running recovery program 'TE_DARE_CONFIGURATION' for 360 seconds. Please check cluster status.
WARNING: Cluster TIBCO_DB_CL has been running recovery program 'TE_DARE_CONFIGURATION' for 390 seconds. Please check cluster status.
WARNING: Cluster TIBCO_DB_CL has been running recovery program 'TE_DARE_CONFIGURATION' for 420 seconds. Please check cluster status.
WARNING: Cluster TIBCO_DB_CL has been running recovery program 'TE_DARE_CONFIGURATION' for 450 seconds. Please check cluster status.
...
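
When many of these warnings pile up, it can help to pull the cluster name, recovery program and elapsed seconds out of each line. The following is a minimal sketch, assuming the line layout is exactly as in the warnings quoted above; the function name parse_too_long_warning is made up for this illustration and is not part of HACMP.

```python
import re

# Pattern modeled on the WARNING lines quoted above (an assumption about the
# exact layout; adjust it if your log lines differ).
WARNING_RE = re.compile(
    r"WARNING: Cluster (?P<cluster>\S+) has been running recovery program "
    r"'(?P<program>[^']+)' for (?P<seconds>\d+) seconds\."
)


def parse_too_long_warning(line):
    """Return (cluster, program, seconds), or None if the line does not match."""
    match = WARNING_RE.search(line)
    if match is None:
        return None
    return match.group("cluster"), match.group("program"), int(match.group("seconds"))
```

The smallest seconds value found for a given recovery program points at the first warning, which is the timestamp needed later when looking for the start of the event.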

config_too_long is an informational event that is raised whenever a cluster event:
    - is taking too long (longer than a preset time)
    - is hanging or has failed

HACMP stops processing cluster events until the situation is resolved.

EVENT TAKING TOO LONG:
An event script does not complete within 6 minutes (this limit can be extended), but it will finish without intervention.
For example, varying on a lot of disks (or running fsck) takes a long time, but once it completes, the warning message stops being generated.
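
The timing can be sketched as simple arithmetic. This assumes the 6 minute (360 second) limit mentioned above and a 30 second repeat interval; the interval is only read off the sample warnings quoted earlier (360, 390, 420, 450), it is not a documented constant. The function name warning_offsets is hypothetical.

```python
def warning_offsets(threshold=360, interval=30, count=4):
    """Seconds after event start at which the first `count` warnings appear.

    threshold: time until the first warning (6 minutes by default)
    interval:  assumed gap between repeated warnings, inferred from the sample log
    """
    return [threshold + i * interval for i in range(count)]


print(warning_offsets())  # [360, 390, 420, 450]
```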


EVENT IS HANGING OR FAILED:
If the script is hanging or has failed, manual intervention is needed:

0. check the status of the cluster and resources:
    clstat (or snmpinfo -m get -o /usr/es/sbin/cluster/hacmp.defs -v clusterSubState.0), lsvg -o, netstat, ps -ef

1. get a high level look at what the cluster was trying to do

    look into: /usr/es/adm/cluster.log
    establish the time when the "Config Too Long" event started
    subtract the 6 minutes to find the start of the event that had difficulty
    (As an error does not always produce a config_too_long, it may be necessary to search for the string FAILED or ERROR)
    (When you find the most recent error, read backwards to the beginning so the source of the problem can be found)
    When you have found the problem, record the time and the script name.

2. go into more details
    In /tmp/hacmp.out move to the point in time recorded in step 1.
    Follow the trace in /tmp/hacmp.out file until you find some useful message.
    (If there was an unrecoverable error, an "EVENT FAILED" indication should be in hacmp.out)
    (e.g. a filesystem could not be unmounted, but this did not give a nonzero exit value; the RC=1 occurred only when varyoffvg failed)
    (HACMP checks return codes, so the event failed once it received the nonzero one)

3. Manually correct the problem so that you complete the event.
    it may be necessary to display the script to understand what was supposed to happen.
    (e.g. there is probably a hanging process which needs to be killed)
    (e.g. varyoffvg failed due to a running process; even after the process was killed, I still had to unmount the filesystems and vary off the VG manually)
    (e.g. the script was unable to set the hostname, so it can be set from the command line)

4. After correcting the issue, HACMP should be told to continue processing from the failed script
    Problem Determination Tools -> Recover From HACMP Script Failure
    /usr/es/adm/cluster.log will show the CM_CONTINUE request and the subsequent events
    (HACMP was looking for a good return code to continue, and "Recover from HACMP ..." provides RC=0 to allow HACMP to continue.)
    (the CONFIG TOO LONG message will be turned off as well)
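
The log digging in steps 1 and 2 can be sketched in a few helper functions: subtract the warning limit from the time of the first warning, find the most recent line containing FAILED or ERROR, and read the lines leading up to it. This is only a sketch under assumptions: the log is already read into a list of lines, timestamps are parsed by you into datetime objects (the timestamp format of cluster.log and hacmp.out is deliberately not assumed here), and all function names are hypothetical.

```python
from datetime import datetime, timedelta


def estimate_event_start(first_warning, threshold_seconds=360):
    """Step 1: subtract the warning limit (6 minutes by default) from the
    time of the first "Config Too Long" warning."""
    return first_warning - timedelta(seconds=threshold_seconds)


def last_error_index(lines, markers=("FAILED", "ERROR")):
    """Index of the most recent line containing any marker, or None.

    "EVENT FAILED" is matched too, since it contains "FAILED".
    """
    for i in range(len(lines) - 1, -1, -1):
        if any(marker in lines[i] for marker in markers):
            return i
    return None


def context_before(lines, index, count=5):
    """Step 2: the `count` lines leading up to lines[index], plus that line,
    so the trace can be read backwards from the error."""
    start = max(0, index - count)
    return lines[start:index + 1]
```

A possible use: read /tmp/hacmp.out with open(...).read().splitlines(), call last_error_index on the result, then print context_before around that index and compare the time with the value from estimate_event_start.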

ADDITIONAL NOTE:

Events may consist of several steps, and if we don't correct all of them the event can fail again.
example:
    1. node_down: (stop_server, release_takeover_addr, release_vg_fs, release_service_addr)
    2. node_down_complete (node_down_local_complete)
    ...

    Suppose the first event (node_down) failed at step stop_server.
    We correct it, and tell HACMP to continue.
    The continue request is sent to the event manager only, so the event manager will continue with step 2 (node_down_complete).
    This may fail again if we did not carry out the remaining functions of step 1 ourselves.
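
As a simplified model of the note above (not how HACMP is implemented): given the ordered sub-steps of an event and the one that failed, the steps after it are the ones that get skipped when the event manager moves on, so they have to be done by hand before continuing. The step names come from the node_down example; the function name steps_left_undone is hypothetical.

```python
# Sub-steps of node_down, in the order given in the example above.
NODE_DOWN_STEPS = [
    "stop_server",
    "release_takeover_addr",
    "release_vg_fs",
    "release_service_addr",
]


def steps_left_undone(steps, failed_step):
    """Steps that follow the failed one within the same event.

    These are not re-run when HACMP is told to continue, so they need to be
    carried out manually (in addition to fixing the failed step itself).
    Raises ValueError if failed_step is not in steps.
    """
    position = steps.index(failed_step)
    return steps[position + 1:]


print(steps_left_undone(NODE_DOWN_STEPS, "stop_server"))
# ['release_takeover_addr', 'release_vg_fs', 'release_service_addr']
```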

------------------------------

To avoid config_too_long warnings in the future, we can change the time limit:
Extended Configuration -> Extended Event Configuration -> Change/Show Time Until Warning

------------------------------
