AIX for System Administrators: HA

HA - COMMANDS

Commands:

odmget HACMPlogs                           shows where the log files are
odmget HACMPcluster                        shows cluster version
odmget HACMPnode                           shows info from nodes (cluster version)
/etc/es/objrepos                           HACMP ODM files

Log files:
(changing log files path: C-SPOC > Log Viewing and Management)

/var/hacmp/adm/cluster.log                 main PowerHA log file (errors, events, messages; older releases: /usr/es/adm/cluster.log)
/var/hacmp/adm/history/cluster.mmddyyyy    shows only the EVENTS, generated daily (older releases: /usr/es/sbin/cluster/history/cluster.mmddyyyy)

/var/hacmp/log/clinfo.log                  records the activity of the clinfo daemon
/var/hacmp/log/clstrmgr.debug              debug info about the cluster (clstrmgr.debug.long also exists); mainly used by IBM support
/var/hacmp/log/clutils.log                 summary of nightly verification
/var/hacmp/log/cspoc.log                   more detail on commands run through smitty C-SPOC (good place to look if a command fails)
/var/hacmp/log/hacmp.out                   similar to cluster.log, but more detailed (with all the output of the scripts)
/var/hacmp/log/loganalyzer/loganalyzer.log the log analyzer (clanalyze) stores its output here

/var/hacmp/clverify                        shows the results of the verifications (verification errors are logged here)
/var/log/clcomd/clcomd.log                 contains every connect request between the nodes and the return status of the requests

RSCT Logs:
/var/ha/log                                RSCT logs are here
/var/ha/log/nim.topsvcs...                 the heartbeats are logged here (comm. is OK between the nodes)

clRGinfo            Shows the state of RGs (in earlier HACMP clfindres was used)
clRGinfo -p         shows the node that temporarily has the highest priority (POL)
clRGinfo -t         shows the delayed timer information
clRGinfo -m         shows the status of the application monitors of the cluster
            resource groups state can be: online, offline, acquiring, releasing, error, unknown

cldump (or clstat -o)    detailed info about the cluster (realtime, shows cluster status) (clstat requires a running clinfo)
cldisp               detailed general info about the cluster (not realtime) (cldisp | egrep 'start|stop', lists start/stop scripts)
cltopinfo            Detailed information about the network of the cluster (this shows the data in DCD not in ACD)
cltopinfo -i       good overview, same as cllsif: this also lists cluster interfaces (cllsif was used prior to HACMP 5.1)
cltopinfo -m       shows heartbeat statistics, missed heartbeats (-m is no longer available on PowerHA 7.1)
clshowres            Detailed information about the resource group(s)
cllsserv           Shows which scripts will be run in case of a takeover

clrgdependency -t PARENT_CHILD -sl shows parent child dependencies of resource groups

clshowsrv -v        shows status of the cluster daemons (very good overview!!!)
lssrc -g cluster    lists the running cluster daemons
ST_STABLE: cluster services running with resources online
NOT_CONFIGURED: cluster is not configured or node is not synced
ST_INIT: cluster is configured but not active on this node
ST_JOINING: cluster node is joining the cluster
ST_VOTING: cluster nodes are voting to decide event execution
ST_RP_RUNNING: cluster is running a recovery program
RP_FAILED: a recovery program event script has failed
ST_BARRIER: clstrmgr is in between events waiting at the barrier
ST_CBARRIER: clstrmgr is exiting a recovery program
ST_UNSTABLE: cluster is unstable usually due to an event error
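
For scripting, the state can be pulled out of the lssrc -ls clstrmgrES output. This is only a sketch: it assumes the output contains a line of the form "Current state: ST_STABLE", and the function names cluster_state and cluster_is_stable are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print the cluster manager state found in
# "lssrc -ls clstrmgrES" output read from stdin.
# Assumption: the output has a line like "Current state: ST_STABLE".
cluster_state() {
    sed -n 's/^Current state:[[:space:]]*\([A-Z_]*\).*$/\1/p' | head -n 1
}

# Hypothetical helper: succeed only when the state read from stdin is ST_STABLE.
cluster_is_stable() {
    [ "$(cluster_state)" = "ST_STABLE" ]
}

# On a cluster node this would be used like this (not executed here):
#   lssrc -ls clstrmgrES | cluster_state
#   lssrc -ls clstrmgrES | cluster_is_stable || echo "cluster is not stable"
```

Reading from stdin keeps the parsing separate from the lssrc call, so it can be tried on saved output from another node as well.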

lssrc -ls clstrmgrES shows if cluster is STABLE or not, cluster version, Dynamic Node Priority (pgspace free, disk busy, cpu idle)
lssrc -ls topsvcs     shows the status of individual diskhb devices, heartbeat intervals, failure cycle (missed heartbeats)
lssrc -ls grpsvcs     gives info about connected clients and the number of groups
lssrc -ls emsvcs      shows the resource monitors known to the event management subsystem
lssrc -ls snmpd       shows info about snmpd
halevel -s            shows PowerHA level (from 6.1)

lscluster            lists CAA cluster configuration information
    -c               cluster configuration
    -d               disk (storage) configuration
    -i               interfaces configuration
    -m               node configuration
mkcluster            creates a CAA cluster
chcluster            changes a CAA cluster configuration
rmcluster            removes a CAA cluster configuration
clcmd <command>      runs the given <command> on all cluster nodes (for example: clcmd date)

cl_ping               pings all the adapters of the given list (e.g.: cl_ping -w 2 aix21 aix31 (-w: wait 2 seconds))
cldiag              HACMP troubleshooting tool (e.g.: cldiag debug clstrmgr -l 5 <--shows clstrmgr heartbeat infos)
                  cldiag vgs -h nodeA nodeB <--this checks the shared vg definitions on the given nodes for inconsistencies

clmgr offline cluster WHEN=now MANAGE=offline STOP_CAA=yes stop cluster and CAA as well (after maintenance start with START_CAA=yes)
clmgr view report cluster TYPE=html FILE=/tmp/powerha.report create the HTML report
clanalyze -a -p "diskfailure" analyzes PowerHA logs for applicationfailure, interfacefailure, networkfailure, nodefailure...
lvlstmajor lists the available major numbers for each node in the cluster

------------------------------------------------------
/usr/es/sbin/cluster/utilities/get_local_nodename    shows the name of this node within the HACMP
/usr/es/sbin/cluster/utilities/clexit.rc             this script halts the node if the cluster manager daemon stopped incorrectly
------------------------------------------------------

Remove HACMP:

1. stop cluster on both nodes
2. remove the cluster configuration (smitty hacmp) on both nodes
3. remove cluster filesets (starting with cluster.*)

------------------------------------------------------

If you are planning a crash test, do it with halt -q or reboot -q.
shutdown -Fr will not work, because it stops HACMP and the resource groups gracefully (rc.shutdown), so no takeover will occur.

------------------------------------------------------

clhaver - clcomd problem:

If there are problems while starting up a cluster or during synchronization and verification, and you see something like this:

1800-106 An error occurred:
connectconnect: : Connection refusedConnection refused
clhaver[113]: cl_socket(aix20)clhaver[113]: cl_socket(aix04): : Connection refusedConnection refused

Probably there is a problem with clcomd.

1. check if it is running: clshowsrv -v or lssrc -a | grep clcomd
    refresh or start it: refresh -s clcomdES or startsrc -s clcomdES

2. check log file: /var/hacmp/clcomd/clcomd.log
    you can see something like this: CONNECTION: REJECTED(Invalid address): aix10: 10.10.10.100->10.10.10.139

    for me the solution was:
        -update /usr/sbin/cluster/etc/rhosts file on both nodes (I added all ip's of both servers (except service ip + service backup ip))
        -refresh -s clcomdES
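
To see quickly which addresses are missing, the rejected connections in clcomd.log can be compared with the rhosts file. The sketch below assumes log lines look like the REJECTED example above and that the rhosts file holds one address per line; the function names are invented for this example.

```shell
#!/bin/sh
# Hypothetical helper: read clcomd.log lines from stdin and print every
# address that appears in a "CONNECTION: REJECTED" line (both sides of "->").
rejected_addrs() {
    grep 'CONNECTION: REJECTED' |
        sed -n 's/^.*[[:space:]]\([0-9.]*\)->\([0-9.]*\).*$/\1 \2/p' |
        tr ' ' '\n' | sort -u
}

# Hypothetical helper: read addresses from stdin and print those that are
# not present as a full line in the file given as $1.
not_in_rhosts() {
    while read -r addr; do
        grep -Fqx "$addr" "$1" || printf '%s\n' "$addr"
    done
}

# On a cluster node (paths as used in this article, not executed here):
#   rejected_addrs < /var/hacmp/clcomd/clcomd.log |
#       not_in_rhosts /usr/sbin/cluster/etc/rhosts
```

Whatever this prints is only a candidate list; service addresses were deliberately left out of the rhosts file in my case.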

------------------------------------------------------

When trying to bring up a resource group in HACMP, I got the following errors in the hacmp.out log file:

    cl_disk_available[187] cl_fscsilunreset fscsi0 hdiskpower1 false
    cl_fscsilunreset[124]: openx(/dev/hdiskpower1, O_RDWR, 0, SC_NO_RESERVE): Device busy
    cl_fscsilunreset[400]: ioctl SCIOLSTART id=0X11000 lun=0X1000000000000 : Invalid argument

To resolve this, you will have to make sure that the SCSI reset disk method is configured in HACMP. For example, when using EMC storage:

Make sure emcpowerreset is present in /usr/lpp/EMC/Symmetrix/bin/emcpowerreset.

Then add new custom disk method:
smitty hacmp -> Ext. Conf. -> Ext. Res. Conf. -> HACMP Ext. Resources Conf. -> Conf. Custom Disk Methods -> Add Cust. Disk

    * Disk Type (PdDvLn field from CuDv)                 [disk/pseudo/power]
    * Method to identify ghost disks                     [SCSI3]
    * Method to determine if a reserve is held           [SCSI_TUR]
    * Method to break a reserve                          [/usr/lpp/EMC/Symmetrix/bin/emcpowerreset]
    Break reserves in parallel                          true
    * Method to make the disk available                  [MKDEV]

------------------------------------------------------

Once I had a problem with commands 'cldump' and 'clstat -o' (version 5.4.1 SP3)

cldump: Waiting for the Cluster SMUX peer (clstrmgrES)
to stabilize...

Can not get cluster information.

Solution was:
-checked all the daemons mentioned below (clinfo, clcomd, snmpd...) and started what was missing
-after that I ran: refresh -s clstrmgrES (cldump and clstat were OK only after this refresh)
-once I had a problem with clstat -a (although clinfo was running); after refresh -s clinfoES it was OK again
(This can also help: stopsrc -s clinfoES && sleep 2 && startsrc -s clinfoES )

Things that can be checked regarding SNMP:

-clinfoES and clcomdES:
clshowsrv -v

-snmpd and mibd daemons (if not active startsrc can start it)
root@aix20: / # lssrc -a | egrep 'snm|mib'
snmpmibd         tcpip            552998       active
aixmibd          tcpip            524418       active
hostmibd         tcpip            430138       active
snmpd            tcpip            1212632      active

(hostmibd does not always need to be active)
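
The same check can be scripted. The sketch below reads lssrc -a style output from stdin and prints the subsystems from a given list that are either not "active" or not listed at all. It assumes the status is the last column, as in the listing above; the function name is invented.

```shell
#!/bin/sh
# Hypothetical helper: arguments are the subsystem names to check,
# stdin is "lssrc -a" style output. Prints the names that are listed with a
# status other than "active", followed by the names that do not appear at all.
inactive_subsystems() {
    awk -v want="$*" '
        BEGIN {
            n = split(want, names, " ")
            for (i = 1; i <= n; i++) need[names[i]] = 1
        }
        ($1 in need) {
            seen[$1] = 1
            if ($NF != "active") print $1    # status is the last column
        }
        END {
            for (i = 1; i <= n; i++)
                if (!(names[i] in seen)) print names[i]
        }
    '
}

# On a cluster node (not executed here):
#   lssrc -a | inactive_subsystems snmpd snmpmibd aixmibd clinfoES clcomdES
```

Anything printed can then be started with startsrc -s, as described above.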

-snmpd conf and log files

root@aix20: / # ls -l /etc | grep snmp
-rw-r-----    1 root     system         2302 Aug 16 2005 clsnmp.conf
-rw-r--r--    1 root     system           37 Jun 16 16:18 snmpd.boots
-rw-r-----    1 root     system        10135 Aug 11 2009 snmpd.conf
-rw-r-----    1 root     system         2693 Aug 11 2009 snmpd.peers
-rw-r-----    1 root     system        10074 Jun 16 16:22 snmpdv3.conf
drwxrwxr-x    2 root     system          256 Aug 11 2009 snmpinterfaces
-rw-r-----    1 root     system         1816 Aug 11 2009 snmpmibd.conf

root@aix20: / # ls -l /var/tmp | grep snmp
-rw-r--r--    1 root     system        83130 Jun 16 20:32 snmpdv3.log
-rw-r--r--    1 root     system       100006 Oct 01 2008 snmpdv3.log.1
-rw-r--r--    1 root     system        16417 Jun 16 16:19 snmpmibd.log

------------------------------------------------------

During PowerHA upgrade from 5.4.1 to 6.1 received these errors:
(it was an upgrade during which we put the resource groups into an unmanaged state)

grep: can't open aixdb1
./cluster.es.cspoc.rte.pre_rm: ERROR
Cluster services are active on this node. Please stop all
cluster services prior to installing this software.

...

grep: can't open aixdb1
./cluster.es.client.rte.pre_rm: ERROR
Cluster services are active on this node. Please stop all
cluster services prior to installing this software.

Failure occurred during pre_rm.
Failure occurred during rminstal.
installp: An internal error occurred while attempting
        to access the Software Vital Product Data.
        Use local problem reporting procedures.

We checked where to find that script at the first ERROR:
root@aixdb1: / # find /usr -name cluster.es.client.rte.pre_rm -ls
145412    5 -rwxr-x--- 1 root      system        4506 Feb 26 2009 /usr/lpp/cluster.es/inst_root/cluster.es.client.rte.pre_rm

Looking through the script, found these 2 lines:
LOCAL_NODE=$(odmget HACMPcluster 2>/dev/null | sed -n '/nodename = /s/^.* "\(.*\)".*/\1/p')
LC_ALL=C lssrc -ls clstrmgrES | grep "Forced down" | grep -qw $LOCAL_NODE

Checking these, after running the second line, the original error could be successfully recreated:
root@aixdb1: / # lssrc -ls clstrmgrES | grep "Forced down" | grep -qw $LOCAL_NODE
grep: can't open aixdb1

There were 2 entries in this variable, and that caused the error:
root@aixdb1: / # echo $LOCAL_NODE
aixdb1 aixdb1

root@aixdb1: / # odmget HACMPcluster
HACMPcluster:
        id = 1315338110
        name = "DFWEAICL"
        nodename = "aixdb1"    <--grep finds this entry
        sec_level = "Standard"
        sec_level_msg = ""
        ...
        rg_distribution_policy = "node"
        noautoverification = 1
        clvernodename = "aixdb1"        <--grep finds this entry as well (this is causing the trouble)
        clverhour = 0
        clverstartupoptions = 0
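
The double match is easy to reproduce without a cluster. The sketch below feeds a shortened copy of the stanza to two sed expressions: one with the loose /nodename = / address (capture group written out), and one anchored so that only the nodename attribute matches. The anchored version is just my assumption of how the match could be narrowed, not the fix IBM shipped, and the function names are invented. With two words in $LOCAL_NODE, the unquoted grep -qw $LOCAL_NODE takes the second word as a file name, which explains the "grep: can't open aixdb1" message.

```shell
#!/bin/sh
# Loose match: any attribute whose name contains "nodename = " is accepted,
# so both nodename and clvernodename are printed.
loose_nodename() {
    sed -n '/nodename = /s/^.* "\(.*\)".*/\1/p'
}

# Anchored match (assumed alternative): only a line that starts with
# optional whitespace followed by "nodename = " is accepted.
strict_nodename() {
    sed -n 's/^[[:space:]]*nodename = "\([^"]*\)".*$/\1/p'
}

# Shortened sample of the odmget HACMPcluster output shown above.
sample='HACMPcluster:
        name = "DFWEAICL"
        nodename = "aixdb1"
        clvernodename = "aixdb1"'

echo "loose : $(printf '%s\n' "$sample" | loose_nodename | tr '\n' ' ')"
echo "strict: $(printf '%s\n' "$sample" | strict_nodename)"
```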

After Googling what clvernodename is, we found out that this field is set by "Automatic Cluster Configuration Verification"; setting that option to Disabled removes the additional entry from the ODM:

We checked in smitty hacmp -> HACMP Verification -> Automatic..:

* Automatic cluster configuration verification     Enabled        <--we changed it to disabled
* Node name                               aixdb1
* HOUR (00 - 23)                                    [00]
Debug                                             no

After this correction, we ran smitty update_all again. We received some similar errors (grep: can't open...), but when we retried smitty update_all, it was all successful. (All the earlier Broken filesets were corrected, and we had the new PowerHA version without errors.)

------------------------------------------------------

Manual cluster switch:
1. varyonvg
2. mount FS (mount -t sapX11)(mount -t nfs)
3. check nfs: clshowres
    if there are exported fs: exportfs -a
    go to the nfs client: mount -t nfs
4. IP configure (ifconfig)
    grep ifconfig /tmp/hacmp.out -> it will show the command:
        IPAT via IP replacement: ifconfig en1 inet 10.10.110.11 netmask 255.255.255.192 up mtu 1500
        IPAT via IP aliasing: ifconfig en3 alias 10.10.90.254 netmask 255.255.255.192
    netmask can be found from ifconfig or cltopinfo -i
    (removing ip: ifconfig en3 delete 10.10.90.254)
5. check routing (extra routes could be necessary)
    (removing route: route delete -host 10.10.90.192 10.10.90.254 or route delete -net 10.10.90.192/26 10.10.90.254)
6. start applications
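
Step 4 uses a dotted netmask (255.255.255.192) while the route example in step 5 uses a prefix length (/26). A small sketch for converting one into the other is below; mask_to_prefix is a made-up helper name, and it only sums the bits per octet without checking that the mask is contiguous.

```shell
#!/bin/sh
# Hypothetical helper: convert a dotted netmask ($1) to a prefix length.
# Note: bits, old_ifs and octet are plain (global) shell variables.
# Assumption: the mask is contiguous; 255.0.255.0 would not be rejected.
mask_to_prefix() {
    bits=0
    old_ifs=$IFS
    IFS=.
    set -- $1            # split the mask into its four octets
    IFS=$old_ifs
    for octet in "$@"; do
        case $octet in
            255) bits=$((bits + 8)) ;;
            254) bits=$((bits + 7)) ;;
            252) bits=$((bits + 6)) ;;
            248) bits=$((bits + 5)) ;;
            240) bits=$((bits + 4)) ;;
            224) bits=$((bits + 3)) ;;
            192) bits=$((bits + 2)) ;;
            128) bits=$((bits + 1)) ;;
            0)   ;;
            *)   echo "invalid netmask octet: $octet" >&2; return 1 ;;
        esac
    done
    echo "$bits"
}

mask_to_prefix 255.255.255.192    # prints 26
```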

------------------------------------------------------

27 comments:

baski said...: This is a very good learning page. Thanks Author; January 26, 2012 at 4:17 PM
dzodzo said...: Greetings, have you experienced following error? Cluster is stable but only one node can read the information. On other node clstrmgrES is unresponsive:

lssrc -ls clstrmgrES
0513-014 The request could not be passed to the clstrmgrES subsystem.
The System Resource Controller is experiencing problems with
the subsystem's communication socket.

If you search for 0513-014 in the rsct documentation you don't find satisfactory solution how to proceed: http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=%2Fcom.ibm.cluster.rsct.v3r2.rsct400.doc%2F0513-014.htm

User response
Contact your system administrator.

is not really a good help if you are the system administrator =) Any clues what can be checked next? all snmp related daemons are running.; September 18, 2012 at 12:05 PM
aix said...: If I remember correctly, once I had some comm. problem on a cluster and I did "refresh -s clstrmgrES" which solved that issue. (Be careful if it is a productive system, for me it was a test system so it was not important if cluster goes down.)

Other idea is restarting cluster services.
Other idea: The error message says SRC has some problem with clstrmgrES, so probably you can look around there as well:

This is what my process tree looks like; maybe you can spot a difference:

# lssrc -s clstrmgrES
Subsystem Group PID Status
clstrmgrES cluster 151816 active

# ps -ef | grep srcmstr
root 110802 1 0 Jul 28 - 1:27 /usr/sbin/srcmstr
# ps -fT 110802
UID PID PPID C STIME TTY TIME CMD
root 110802 1 0 Jul 28 - 1:27 /usr/sbin/srcmstr
root 106674 110802 0 Jul 28 - 0:00 |\--/usr/sbin/portmap
root 114824 110802 0 Jul 28 - 0:00 |\--/usr/sbin/nimsh -s
...
...
root 135662 110802 0 Jul 28 - 44:17 |\--/usr/sbin/rsct/bin/IBM.ServiceRMd
root 151816 110802 0 Jul 28 - 250:06 |\--/usr/es/sbin/cluster/clstrmgr
root 676408 151816 0 Nov 05 - 0:00 | \--run_rcovcmd; September 18, 2012 at 1:23 PM
Siva said...: Hi,

Could please share the step by step process of increasing filesize in hacmp

Regards,
Siva; February 12, 2013 at 6:09 PM
aix said...: Hi, almost all system administration on HACMP cluster should be started with "smitty hacmp".
This is the case here: smitty hacmp -> System Management (C-SPOC) -> HACMP Logical Volume Management -> Shared File Systems -> ... Change/Show characteristic
In the size field you can give the new size of the filesystem, or with a + sign, how much additional space is needed for the filesystem. Be prepared that, by default, it counts the size in 512-byte blocks, so if you would like to add +1 MB, you should write 2048 in the field.

Regards,
Balazs; February 12, 2013 at 7:18 PM
Anonymous said...: Hi,

can you please tell me how to see the information in hacmp.out file
what happen if it remove

Regards,; June 24, 2013 at 7:40 AM
aix said...: Hi, it is a plain text file, you can read it with cat or tail ....
If it is removed, create a new empty file and HACMP will use that.; June 24, 2013 at 7:45 AM
Unknown said...: Hi,

Can you pls share the procedure to convert a normal sharedvg to BigVG under cluster. else is it the same procedure to follow what we would do for a normal VG in AIX lvm, ensure atleast 1 free PP on all of the vg disks and then fire the command which can be done online.
I have datavg as shared vg created as Normal which can accommodate 32 disks max. I have a requirement of extending existing shared FS under datavg, for which I would require more luns (> 32) to be added to fulfill this requirement.
Converting datavg from Normal to Bigvg is what I'm looking for as an option. is it a disruptive and does it requires DB2 apps to be down.

Your response is highly appreaciated.; June 27, 2013 at 4:36 PM
aix said...: Hi, in smitty HACMP --> in Storage section --> in Change/Show characteristics of a Volume Group, there is a possibility to do this: "Change to big VG format?" I think if this option is implemented in SMIT, it is a safe way to do this way...however I've never tried this, and if there is a possibility try first on a test system.; June 28, 2013 at 10:10 PM
Manoj Suyal said...: Hi Balazs,

We have one PowerHA running on OS 6.1 which is in production. Two node are are participating in HACMP .

And clinfoES src is not active on both the node due to which we are not able to use clstat.

Is there any consequences if we start clinfoES src manually on both the nodes. If yes best possible way to start the same.

Regard
Manoj Suyal; August 9, 2013 at 10:40 AM
Manoj Suyal said...: bash-3.2# ./clshowsrv -v
Status of the RSCT subsystems used by HACMP:
Subsystem Group PID Status
topsvcs topsvcs 11075752 active
grpsvcs grpsvcs 11862222 active
grpglsm grpsvcs inoperative
emsvcs emsvcs 10879104 active
emaixos emsvcs inoperative
ctrmc rsct 6095092 active

Status of the HACMP subsystems:
Subsystem Group PID Status
clcomdES clcomdES 7077964 active
clstrmgrES cluster 7733272 active

Status of the optional HACMP subsystems:
Subsystem Group PID Status
clinfoES cluster inoperative; August 9, 2013 at 10:46 AM
aix said...: Hi, if you read above, you can see I had some issues with clinfo as well. startsrc -s clinfoES solved my issues and I had no problem at all starting it.; August 12, 2013 at 8:47 AM
Anonymous said...: using CLI /usr/es/sbin/cluster/cspoc/cli_chfs -a size=+(amount of space to increase in GB or MB ); October 5, 2013 at 11:48 AM
Safi said...: Anyone could tell me, how to find the RG failover date and time information?; November 6, 2013 at 12:57 PM
Unknown said...: does anyone know:

dbserver:
#get_local_nodename
---------------> but the output is empty
#

apserver:
#get_local_nodename
apserver
#; January 14, 2014 at 5:04 PM
Linux4Admins said...: Give " mount " command in the cluster server where RG is up and running...
Check the cluster file system mounted date and time..It gives you the RG failover date and time info.

-------------------------------
/dev/lvfs1 /fs1 jfs2 Nov 04 19:04 rw,log=/dev/lvlog1502
/dev/lvfsrg /fs_RG1 jfs2 Nov 04 19:04 rw,log=/dev/lvlog1502

Regards,
Ramya; November 4, 2014 at 8:43 PM
Linux4Admins said...: root /usr/es/sbin/cluster/cspoc/cli_extendlv PP's
root /usr/es/sbin/cluster/cspoc/cli_chfs -a size=5GB /filesystem; November 4, 2014 at 8:46 PM
Linux4Admins said...: Hi Manoj,

The following is the hierarchy to stop and start the clinfoES

root stopsrc -s clinfoES
root stopsrc -s snmpd
root startsrc -s snmpd
root startsrc -s clinfoES

Regards,
Ramya; November 4, 2014 at 8:56 PM
Anonymous said...: HI Belaz ,I want to ask you something about removing HACMP, if I remove both nodes from the cluster after stopping it and then remove the filesets , Can I continue use the nodes without any trouble , Do I have to remove the resource group or cluster network before removing the filesets? I am assigned a task to remove HACMP because they want to implement Oracle RAC cluster instead.Because they prefer active/active cluster instead of active/passive. Do I have to change those VGs after removing the HACMP (chvg -l vgname) . Do you have any suggestions, I will really appreciate it. Thanks; January 21, 2015 at 4:30 PM
aix said...: Hi, I cannot give you a step-by-step info, but at this link there are some hints: http://www-01.ibm.com/support/docview.wss?uid=isg3T1000444
Basically I would document all necessary info (network, vg/lv/fs/NFS...), stop the cluster, remove the cluster config (smitty hacmp), and remove the cluster filesets.
While doing this Oracle should be stopped, after checking the system and doing necessary actions. chvg -l is also a good idea and there are some other hints at the IBM link above.; January 22, 2015 at 7:34 AM
Anonymous said...: Thanks Belaz, I wasn't looking for step by step info , that will be like you are doing my job , you already done more that enough in this site, I was just looking for your opinion on HACMP to RAC, I know in RAC (which i don't know much about) they need to have private network between to LPARs , (little different than HACMP private network) which is easy to do if it is a virtual servers, I can just create another virtual switch and configure a ip for these two nodes only, but I will have to think about how to do it between one physical lpar and one virtual lpar. If there is any thoughts , I will appreciate that . Thanks again; January 22, 2015 at 2:57 PM
aix said...: Hi, usually ip should come from Network team, they should give you subnet mask, VLAN id (if there is any), etc. So I just need to configure those IPs (either creating new virtual ethernet adapters, or on physical ones (or as alias)). Creating new virtual switch by your own for virtual servers is OK if they are on 1 Physical Machine, but if you want to do LPM that will not work. So, best is to discuss it with Network team.; January 23, 2015 at 8:33 AM
Anonymous said...: Hi , I newbie HACMP
could you help me ?
There are node 1(primary) and node 2 (secondary) in the same cluster, i can moving the RG from node 1 to node 2 (online and also offline) , but when i try moving RG from node 2 to node 1 the RG cannot move, there was one Filesystem (application) can't move, so application be error...could you tell me how to solve about it ?, or where i can know error-re; August 26, 2015 at 12:15 AM
Unknown said...: Hi Blaze,
I got a requirement to change the boot and service IP in both nodes.Could you please let me know the process to do.; November 10, 2015 at 4:17 PM
Unknown said...: Hacmp 6.1; November 10, 2015 at 4:17 PM
