HA - Application

Application server  (Application controller scripts)

An application server is a cluster resource that controls an application that must be kept highly available. It includes start and stop scripts.

Applications are defined in PowerHA as application controllers with these attributes:
Start script: Starts the application; it must work after both a clean and an unexpected shutdown. PowerHA monitors the exit code. (With set -x in the script, its output is logged in hacmp.out.)
Stop script: Must be able to stop the application successfully. Its output is also logged and its exit code is monitored.

Application monitors
To keep applications highly available, PowerHA can monitor the application too, not just the required resources.

Application startup mode
Introduced in PowerHA v7.1, this mode specifies how the application start script is called. With background (the default), the script is started in the background and event processing continues even if the start script has not completed. Select foreground if you want event processing to be suspended until the start script exits.

As the exit codes from the application scripts are monitored, PowerHA assumes that a non-zero return code from the script means that the script failed and therefore starting or stopping the application was not successful. If this is the case, the resource group will go into error state and a config_too_long message is recorded.
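
The exit-code contract described above can be sketched as a small start-script skeleton. This is only an illustration: the function name start_app and the APP_START_CMD variable are hypothetical placeholders, not PowerHA names, and the real start command has to be substituted.

```shell
#!/bin/sh
# Hypothetical start-script skeleton (names are assumptions, not PowerHA's).
# APP_START_CMD stands in for the real command that launches the application.
APP_START_CMD=${APP_START_CMD:-true}

start_app() {
    # Run the start command; its return code becomes the function's.
    $APP_START_CMD
    rc=$?
    if [ "$rc" -ne 0 ]; then
        # A non-zero code is what PowerHA treats as a failed start.
        echo "start failed with rc=$rc" >&2
    fi
    return "$rc"
}
```

A real script would end with "start_app; exit $?" so that the result of the start command is the exit code PowerHA sees.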

During application maintenance, you will often want to take only the application offline rather than stop cluster services. If application monitoring is in use, it must be suspended before the application is stopped.

----------------------------------

Application Monitor

PowerHA needs to know the state of a running application so that it can provide high availability when a problem occurs. With the Unmanaged Resource Groups option (which leaves applications running when cluster services are stopped), application monitors are even more important. After the "unmanage" period, when cluster services are started again (to resume managing the resource groups), the application monitor checks whether the application is online. If it is online, acquiring that resource is skipped. If no application monitor is defined, the cluster manager runs the application server start script. This can cause problems for applications that cannot tolerate a second instance being started while the application is already running. (A workaround is to temporarily replace the start scripts with dummy start scripts that contain only one line: exit 0)
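
The double-start problem can also be reduced inside the start script itself. The following is a minimal sketch, not a PowerHA facility: the helper names list_procs, is_running and guarded_start are hypothetical, and it assumes that the command name printed by "ps -e -o comm=" identifies the application.

```shell
# Print one command name per line (assumption: this matches the name
# you would look for in the ps -el output).
list_procs() {
    ps -e -o comm=
}

# Succeeds if a process named exactly $1 is in the list.
is_running() {
    list_procs | grep -Fqx -- "$1"
}

# Run the start command only when the application is not already up.
# $1 = process name, remaining arguments = start command (both hypothetical).
guarded_start() {
    name=$1
    shift
    if is_running "$name"; then
        echo "$name already running, start skipped"
        return 0
    fi
    "$@"
}
```

For example, "guarded_start x11d /usr/local/scripts/cluster/X11_start.ksh" would run the start script only if no x11d process is found.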

For each PowerHA application server, you can configure up to 128 application monitors, but the total number of application monitors in a cluster cannot exceed 128.

Configure an application monitor using SMIT:
smitty hacmp --> Cluster Applications and Resources --> Resources --> Configure User Applications (Scripts and Monitors) --> Application Monitors


There are 2 types of application monitoring:
- Process monitoring: Detects the termination of a process, using RSCT Resource Monitoring and Control (RMC) capability.
- Custom monitoring: Uses a user-written script to monitor the health of an application.

Application monitors can be configured to run in these modes:
- Long-running mode: after the stabilization interval has expired, it periodically checks the application based on the Monitor Interval setting
- Startup mode: checks during the stabilization interval until the process is active or the stabilization interval expires.
- Both modes: checks for the successful startup of the application server and periodically checks that the application is running successfully


----------------------------------

Process monitoring

Process monitors detect the termination of a process. Any process that appears in the output of ps -el can be monitored by a process monitor. It uses RMC, so no script is needed. It detects only process termination, not any other malfunction of the application. When PowerHA finds that the monitored application process has terminated, it tries to restart the application on the current node until the restart count is exhausted.

The stabilization interval is one of the most critical values in the monitor configuration. It must be set to a value that is long enough. If the application is in the process of a successful start and the stabilization interval expires, cleanup will be attempted and the resource group will be placed into ERROR state. 

* Monitor Name                                       [X11_monitor]
* Application Server(s) to Monitor                    X11
* Monitor Mode                                       [Long-running monitoring]
* Processes to Monitor                               [x11d x11color]
* Process Owner                                      [root]
  Instance Count                                     [1]
* Stabilization Interval                             [30]
* Restart Count                                      [3]
  Restart Interval                                   [60]
* Action on Application Failure                      [fallover]  
  Notify Method                                      []
  Cleanup Method                                     [/usr/local/scripts/cluster/X11_stop.ksh]
  Restart Method                                     [/usr/local/scripts/cluster/X11_start.ksh]


Processes to Monitor    <--the name of the process from the output of ps -el (not ps -ef)
Instance count          <--how many processes should be running in the ps -el output
Stabilization Interval  <--the length of time the monitor will wait before resuming monitoring
Restart Count           <--the maximum times the application will be (re)started
Restart Interval        <--the elapsed time the application must run before the Failure Count is reset to 0
                           (the period during which attempts will be made to restart the application)


If the application fails, the Restart Method is run to recover the application. If the application fails to recover to a running state after the number of restart attempts (Restart Count), the Action on Application Failure is taken. The action can be notify or fallover. If notify is selected, no further action is taken after running the Notify Method. If fallover is selected, the resource group containing the monitored application moves to the next available node in the resource group.

The Cleanup Method and Restart Method define the scripts for stopping and restarting the application after a failure is detected. The default values are the stop and start scripts, respectively, as defined in the application server configuration.


----------------------------------

Custom monitoring

A custom application monitor makes it possible to use custom scripts. Based on the exit code of the script, the monitor decides whether the application is available. If the script exits with return code 0, the application is available; any other return code means that the application is not available. PowerHA 7.1 does not pass arguments to the script. The logs can be found in /var/hacmp/log/clappmond.application_name.resource_group_name.monitor.log

Additional SMIT fields for custom application monitors:
Monitor Method: The full path to the script that checks the application status. 
Monitor Interval: Defines the time (in seconds) between each occurrence of Monitor Method being run.
Hung Monitor Signal: The signal that is sent to stop the Monitor Method if it doesn't return within Monitor Interval seconds. The default signal is SIGKILL (9).
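
As an illustration of a Monitor Method, here is a minimal sketch of a custom monitor check. Everything specific in it is an assumption: the PID file location, the variable PIDFILE and the function name check_app are hypothetical, and a real monitor would test whatever actually defines the health of the application.

```shell
#!/bin/sh
# Hypothetical custom monitor method. Only the exit code matters:
# 0 = application available, anything else = not available.
PIDFILE=${PIDFILE:-/var/run/myapp.pid}   # assumed location, adjust as needed

check_app() {
    # Fail if the PID file is missing or empty.
    [ -s "$PIDFILE" ] || return 1
    pid=$(cat "$PIDFILE")
    # kill -0 sends no signal; it only tests that the process exists.
    kill -0 "$pid" 2>/dev/null
}
```

A real script would end with "check_app; exit $?". The check should be quick enough to finish well within the Monitor Interval, otherwise the Hung Monitor Signal is sent to it.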

----------------------------------

Application monitoring general processing

Once the application is started, there is a stabilization period. For long-running monitors, application monitoring does not attempt to determine whether the application is alive during the stabilization period. The startup monitor behaves differently: it checks the application during the stabilization period, and if that check fails, the resource group goes into ERROR state. Once the stabilization period has expired in long-running mode, the monitor periodically checks that the application is running successfully.

If the application fails, the retry counter is examined. If it is not zero, it is decremented and an application cleanup and restart is attempted. This process continues until the retry counter is zero. If the retry counter has reached 0 and the application fails again, the Action on Application Failure setting (fallover or notify) is applied.
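
The retry logic can be walked through with a small simulation. This is a sketch of the rule as described here, not PowerHA code: the function name simulate is hypothetical, and the reset of the counter after the Restart Interval is deliberately left out.

```shell
# Simulate consecutive application failures.
# $1 = Restart Count, $2 = Action on Application Failure (notify|fallover),
# $3 = number of failures to simulate.
# Note: the Restart Interval reset of the counter is not modelled.
simulate() {
    retries=$1
    action=$2
    failures=$3
    i=0
    while [ "$i" -lt "$failures" ]; do
        if [ "$retries" -gt 0 ]; then
            # Retries left: decrement, then cleanup and restart.
            retries=$((retries - 1))
            echo "failure $((i + 1)): cleanup and restart, $retries retries left"
        else
            # Counter is zero and the application failed again.
            echo "failure $((i + 1)): $action"
            return 0
        fi
        i=$((i + 1))
    done
}
```

With a Restart Count of 3, "simulate 3 fallover 4" reports a cleanup and restart for the first three failures and the fallover action for the fourth.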

As a protection mechanism, prior to invoking the application server start script, the cluster manager uses an application monitor to determine the status of the application. If no application monitor is defined or if the application monitor returns a failure status (RC!=0 for custom monitors, processes not running via RMC for process monitors), the application server start script is invoked. If the application monitor returns a success status (RC=0 for custom monitors, processes running via RMC for process monitors), the application server start script is not run.

If more than one application monitor is defined, the selection priority is based on the Monitor Type (custom or process) and the Invocation (both, long-running or startup). The ranking of the combinations of these two is the following:
 Both, Process
 Long-running, Process
 Both, Custom
 Long-running, Custom
 Startup, Process
 Startup, Custom

The highest-priority application monitor found is used to test the state of the application. To check which monitor will be used, run /usr/es/sbin/cluster/utilities/cl_app_startup_monitor:

# /usr/es/sbin/cluster/utilities/cl_app_startup_monitor -s testmonApp -a
 Mon: Custom, Long-running
 bothuser_testmon: Both, Custom
 longproctestmon: Process, Long-running
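
The ranking can be expressed as a small lookup to see which of several monitors would win. This is a sketch that only encodes the priority list given above; the function names monitor_rank and pick_monitor and the "name invocation type" input format are hypothetical and are not part of cl_app_startup_monitor.

```shell
# Rank from the priority list: lower number = higher priority.
# $1 = invocation (Both|Long-running|Startup), $2 = type (Process|Custom)
monitor_rank() {
    case "$1,$2" in
        "Both,Process")         echo 1 ;;
        "Long-running,Process") echo 2 ;;
        "Both,Custom")          echo 3 ;;
        "Long-running,Custom")  echo 4 ;;
        "Startup,Process")      echo 5 ;;
        "Startup,Custom")       echo 6 ;;
        *)                      return 1 ;;
    esac
}

# Read "name invocation type" lines on stdin, print the winning name.
pick_monitor() {
    while read -r name invocation type; do
        # Skip lines whose combination is not in the ranking.
        rank=$(monitor_rank "$invocation" "$type") || continue
        echo "$rank $name"
    done | sort -n | head -n 1 | cut -d' ' -f2
}
```

Fed with the three monitors from the example output (Mon as Long-running Custom, bothuser_testmon as Both Custom, longproctestmon as Long-running Process), the ranking places longproctestmon first.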


----------------------------------

19 comments:

  1. thanks that was a good piece of learning i did today.

    regards
    rahul

  2. Great info it is very useful ...

  3. Nice definistion for Application Monitor..Thank you sir

  4. Hi Balazs,

    Plz do mention command line to start/stop application monitoring...

    Replies
    1. Hi,

      if I would know, I would definitely mention it :)

  5. Nice piece of info
    ..
    I have a little question , do we need to have customize shall script to monitor HACMP ?

    Regards
    Manoj

    Replies
    1. to monitor "application server" not HACMP, type error !

      in the Example provided by you i could see there is no option of defining application monitoring script !

      Please clarify..

    2. Hi, yes, with custom monitoring you can use any scripts to check application availability.
      From HACMP Redbook:
      "Custom monitors check the health of an application with a user-written custom monitor method... This gives the administrator the freedom to check for anything that can be defined as a determining factor in an application’s health.... A return code from the user-written monitor of zero (0) indicates that application is healthy, no further action is taken. A non-zero return code indicates that the application is not healthy and recovery actions are to take place."

  6. If the last topic, u mentioned Resume/Suspend Monitoring, that means if you suspend the monitoring then it will simply failover the application to other node I guess?

    Replies
    1. If you suspend the monitoring, you can do anything on the server (kill processes, stop network,,,), cluster will not failover.

  7. can we have 2 application servers (SAP & MQ FTE) defined for Single Resource Group
