Commands:


Important files:

odmget HACMPlogs                           shows where the log files are located
odmget HACMPcluster                        shows cluster version
odmget HACMPnode                           shows info about the nodes (cluster version)
(changing the location of the log files: C-SPOC > Log Viewing and Management)
/etc/es/objrepos                           HACMP ODM files


HACMP Logs: (location differs on newer and older HACMP versions)

/var/hacmp/adm/cluster.log                 main PowerHA log file: errors, events, messages (older versions: /usr/es/adm/cluster.log)
/var/hacmp/adm/history/cluster.mmddyy      shows only the EVENTS, generated daily (older versions: /usr/es/sbin/cluster/history/cluster.mmddyyyy)

/var/hacmp/log/clinfo.log                  records the activity of the clinfo daemon
/var/hacmp/log/clstrmgr.debug              debug info about the cluster (clstrmgr.debug.long also exists); mainly used by IBM support
/var/hacmp/log/clutils.log                 summary of nightly verification
/var/hacmp/log/cspoc.log                   shows more info of the smitty c-spoc command (good place to look if a command fails)
/var/hacmp/log/hacmp.out                   similar to cluster.log, but more detailed (with all the output of the scripts)

/var/hacmp/clverify                        shows the results of the verifications (verification errors are logged here)
/var/hacmp/clcomd/clcomd.log               contains every connect request between the nodes and the return status of the requests


RSCT Logs:
/var/ha/log                                RSCT logs are here
/var/ha/log/nim.topsvcs...                 the heartbeats are logged here (comm. is OK between the nodes)


clRGinfo            Shows the state of RGs (in earlier HACMP clfindres was used)
clRGinfo -p         shows the node that temporarily has the highest priority (POL)
clRGinfo -t         shows the delayed timer information
clRGinfo -m         shows the status of the application monitors of the cluster
                    resource groups state can be: online, offline, acquiring, releasing, error, unknown

cldump (or clstat -o)    detailed info about the cluster (realtime, shows cluster status) (clstat requires a running clinfo)
cldisp               detailed general info about the cluster (not realtime) (cldisp | egrep 'start|stop', lists start/stop scripts)
cltopinfo            Detailed information about the network of the cluster (this shows the data in DCD not in ACD)
cltopinfo -i         good overview of the cluster interfaces; same as cllsif, which was used prior to HACMP 5.1
cltopinfo -m         shows heartbeat statistics, missed heartbeats (-m is no longer available on PowerHA 7.1)
clshowres            Detailed information about the resource group(s)               
cllsserv             Shows which scripts will be run in case of a takeover

clrgdependency -t PARENT_CHILD -sl    shows parent/child dependencies of resource groups

clshowsrv -v          shows status of the cluster daemons (very good overview!!!)       
lssrc -g cluster      lists the running cluster daemons

lssrc -ls clstrmgrES  shows if cluster is STABLE or not, cluster version, Dynamic Node Priority (pgspace free, disk busy, cpu idle)
                ST_STABLE: cluster services running with resources online
                NOT_CONFIGURED: cluster is not configured or node is not synced
                ST_INIT: cluster is configured but not active on this node
                ST_JOINING: cluster node is joining the cluster
                ST_VOTING: cluster nodes are voting to decide event execution
                ST_RP_RUNNING: cluster is running a recovery program
                RP_FAILED: a recovery program event script has failed
                ST_BARRIER: clstrmgr is in between events waiting at the barrier
                ST_CBARRIER: clstrmgr is exiting a recovery program
                ST_UNSTABLE: cluster is unstable usually due to an event error
lssrc -ls topsvcs     shows the status of individual diskhb devices, heartbeat intervals, failure cycle (missed heartbeats)
lssrc -ls grpsvcs     gives info about connected clients and the number of groups
lssrc -ls emsvcs      shows the resource monitors known to the event management subsystem
lssrc -ls snmpd       shows info about snmpd
halevel -s            shows PowerHA level (from 6.1)
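
The clstrmgrES states listed above can also be checked from a script. The following is a minimal sketch, assuming the `lssrc -ls clstrmgrES` output contains a line of the form "Current state: ST_STABLE"; the function names cluster_state and is_stable are made up for this example and are not part of HACMP.

```shell
# Sketch only: cluster_state and is_stable are hypothetical helper names.
# Assumption: the lssrc output has a line like "Current state: ST_STABLE".

cluster_state() {
    # Reads `lssrc -ls clstrmgrES` output on stdin and prints the state name
    # (prints nothing if no "Current state:" line is found).
    sed -n 's/^Current state:[[:space:]]*\([A-Z_]*\).*/\1/p' | head -n 1
}

is_stable() {
    # Succeeds only if the state read from stdin is ST_STABLE.
    [ "$(cluster_state)" = "ST_STABLE" ]
}

# Sample line (not real cluster output), so the sketch runs without a cluster:
printf 'Current state: ST_RP_RUNNING\n' | cluster_state
```

On a cluster node the input would come from the real command, for example: lssrc -ls clstrmgrES | is_stable && echo "cluster is stable"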

lscluster             lists CAA cluster configuration information
    -c                cluster configuration
    -d                disk (storage) configuration
    -i                interfaces configuration
    -m                node configuration
mkcluster             create a CAA cluster
chcluster             change a CAA cluster configuration
rmcluster             remove a CAA cluster configuration
clcmd <command>       runs the given <command> on all nodes of the cluster (for example: clcmd date)

cl_ping               pings all the adapters of the given list (e.g.: cl_ping -w 2 aix21 aix31 (-w: wait 2 seconds))
cldiag                HACMP troubleshooting tool (e.g.: cldiag debug clstrmgr -l 5 <--shows clstrmgr heartbeat info)
                      cldiag vgs -h nodeA nodeB  <--checks the shared VG definitions on the given nodes for inconsistencies

------------------------------------------------------
/usr/es/sbin/cluster/utilities/get_local_nodename    shows the name of this node within the HACMP
/usr/es/sbin/cluster/utilities/clexit.rc             this script halts the node if the cluster manager daemon stopped incorrectly
------------------------------------------------------

Remove HACMP:

1. stop the cluster on both nodes
2. remove the cluster configuration (smitty hacmp) on both nodes
3. remove the cluster filesets (starting with cluster.*)

------------------------------------------------------

If you are planning a crash test, do it with halt -q or reboot -q.
shutdown -Fr will not work, because it stops HACMP and the resource groups gracefully (rc.shutdown), so no takeover will occur.

------------------------------------------------------

clhaver - clcomd problem:

If there are problems while starting the cluster or during synchronization and verification, and you see something like this:

  1800-106 An error occurred:
  connectconnect: : Connection refusedConnection refused
  clhaver[113]: cl_socket(aix20)clhaver[113]: cl_socket(aix04): : Connection refusedConnection refused

Probably there is a problem with clcomd.

1. check if it is running: clshowsrv -v or lssrc -a | grep clcomd
    refresh or start it: refresh -s clcomdES or startsrc -s clcomdES

2. check log file: /var/hacmp/clcomd/clcomd.log
    you can see something like this: CONNECTION: REJECTED(Invalid address): aix10: 10.10.10.100->10.10.10.139

    for me the solution was:
        -update /usr/sbin/cluster/etc/rhosts file on both nodes (I added all  ip's of both servers (except service ip + service backup ip))
        -refresh -s clcomdES
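
The check above (rejected connections in clcomd.log versus the entries in the rhosts file) can be sketched as a small script. It assumes that rejected lines look like the sample shown in step 2 and that the rhosts file has one address per line; the function names rejected_addrs and missing_in_rhosts are made up for this example.

```shell
# Sketch only: rejected_addrs and missing_in_rhosts are hypothetical helpers.
# Assumption: rejected lines have the form
#   CONNECTION: REJECTED(Invalid address): host: SRC_IP->DST_IP

rejected_addrs() {
    # $1 = clcomd log file; prints every address from REJECTED lines, once each
    grep 'CONNECTION: REJECTED' "$1" |
        sed -n 's/.*: \([0-9.]*\)->\([0-9.]*\).*/\1 \2/p' |
        tr ' ' '\n' | sort -u
}

missing_in_rhosts() {
    # $1 = clcomd log file, $2 = rhosts file (one address per line)
    # prints the rejected addresses that have no exact matching line in rhosts
    rejected_addrs "$1" | while read -r addr; do
        grep -qxF "$addr" "$2" || echo "$addr"
    done
}
```

Possible usage with the paths mentioned above: missing_in_rhosts /var/hacmp/clcomd/clcomd.log /usr/sbin/cluster/etc/rhosts
Any address it prints is a candidate to review before adding it to rhosts and refreshing clcomdES.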

------------------------------------------------------

When trying to bring up a resource group in HACMP, I got the following errors in the hacmp.out log file:

    cl_disk_available[187] cl_fscsilunreset fscsi0 hdiskpower1 false
    cl_fscsilunreset[124]: openx(/dev/hdiskpower1, O_RDWR, 0, SC_NO_RESERVE): Device busy
    cl_fscsilunreset[400]: ioctl SCIOLSTART id=0X11000 lun=0X1000000000000 : Invalid argument


To resolve this, you will have to make sure that the SCSI reset disk method is configured in HACMP. For example, when using EMC storage:

Make sure emcpowerreset is present in /usr/lpp/EMC/Symmetrix/bin/emcpowerreset.

Then add new custom disk method:
smitty hacmp -> Ext. Conf. -> Ext. Res. Conf. -> HACMP Ext. Resources Conf. -> Conf. Custom Disk Methods -> Add Cust. Disk

    * Disk Type (PdDvLn field from CuDv)                 [disk/pseudo/power]
    * Method to identify ghost disks                     [SCSI3]
    * Method to determine if a reserve is held           [SCSI_TUR]
    * Method to break a reserve                          [/usr/lpp/EMC/Symmetrix/bin/emcpowerreset]
        Break reserves in parallel                          true
    * Method to make the disk available                  [MKDEV]

------------------------------------------------------

Once I had a problem with commands 'cldump' and 'clstat -o' (version 5.4.1 SP3)

cldump: Waiting for the Cluster SMUX peer (clstrmgrES)
to stabilize...


Can not get cluster information.


Solution was:
-checked all the daemons mentioned below (clinfo, clcomd, snmpd...) and started what was missing
-after that I ran: refresh -s clstrmgrES (cldump and clstat were OK only after this refresh)
-once I had a problem with clstat -a (although clinfo was running); after refresh -s clinfoES it was OK again
(This can also help: stopsrc -s clinfoES && sleep 2 && startsrc -s clinfoES )


Things that can be checked regarding snmp:

-clinfoES and clcomdES:
clshowsrv -v

-snmpd and mibd daemons (if not active startsrc can start it)
root@aix20: / # lssrc -a | egrep 'snm|mib'
 snmpmibd         tcpip            552998       active
 aixmibd          tcpip            524418       active
 hostmibd         tcpip            430138       active
 snmpd            tcpip            1212632      active

(hostmibd does not always need to be active)
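
The daemon check above can be sketched as a filter over the `lssrc -a` output. It assumes the column layout shown in the listing (subsystem name first, status last); the function name inactive_snmp_subsystems is made up for this example.

```shell
# Sketch only: inactive_snmp_subsystems is a hypothetical helper name.
# Assumption: each lssrc row starts with the subsystem name and ends with
# its status; rows of inoperative subsystems simply have no PID column.

inactive_snmp_subsystems() {
    # Reads `lssrc -a` output on stdin and prints the SNMP-related
    # subsystems whose status is anything other than "active".
    awk '$1 ~ /^(snmpd|snmpmibd|aixmibd|hostmibd)$/ && $NF != "active" { print $1 }'
}
```

Possible usage: lssrc -a | inactive_snmp_subsystems
Each name it prints could then be started with startsrc -s <name>.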

-snmpd conf and log files

root@aix20: / # ls -l /etc | grep snmp
-rw-r-----    1 root     system         2302 Aug 16 2005  clsnmp.conf
-rw-r--r--    1 root     system           37 Jun 16 16:18 snmpd.boots
-rw-r-----    1 root     system        10135 Aug 11 2009  snmpd.conf
-rw-r-----    1 root     system         2693 Aug 11 2009  snmpd.peers
-rw-r-----    1 root     system        10074 Jun 16 16:22 snmpdv3.conf
drwxrwxr-x    2 root     system          256 Aug 11 2009  snmpinterfaces
-rw-r-----    1 root     system         1816 Aug 11 2009  snmpmibd.conf

root@aix20: / # ls -l /var/tmp | grep snmp
-rw-r--r--    1 root     system        83130 Jun 16 20:32 snmpdv3.log
-rw-r--r--    1 root     system       100006 Oct 01 2008  snmpdv3.log.1
-rw-r--r--    1 root     system        16417 Jun 16 16:19 snmpmibd.log

------------------------------------------------------

During PowerHA upgrade from 5.4.1 to 6.1 received these errors:
(it was an upgrade where the resource groups were put into unmanaged state)

grep: can't open aixdb1
./cluster.es.cspoc.rte.pre_rm: ERROR

Cluster services are active on this node.  Please stop all
cluster services prior to installing this software.

...

grep: can't open aixdb1
./cluster.es.client.rte.pre_rm: ERROR

Cluster services are active on this node.  Please stop all
cluster services prior to installing this software.

Failure occurred during pre_rm.
Failure occurred during rminstal.
installp: An internal error occurred while attempting
        to access the Software Vital Product Data.
        Use local problem reporting procedures.


We checked where the script named in the ERROR message is located:
root@aixdb1: / # find /usr -name cluster.es.client.rte.pre_rm -ls
145412    5 -rwxr-x---  1 root      system        4506 Feb 26  2009 /usr/lpp/cluster.es/inst_root/cluster.es.client.rte.pre_rm

Looking through the script, found these 2 lines:
LOCAL_NODE=$(odmget HACMPcluster 2>/dev/null | sed -n '/nodename = /s/^.* "\(.*\)".*/\1/p')
LC_ALL=C lssrc -ls clstrmgrES | grep "Forced down" | grep -qw $LOCAL_NODE


Checking these, after running the second line, the original error could be successfully recreated:
root@aixdb1: / # lssrc -ls clstrmgrES | grep "Forced down" | grep -qw $LOCAL_NODE
grep: can't open aixdb1


There were 2 entries in this variable, and that caused the error:
root@aixdb1: / # echo $LOCAL_NODE
aixdb1 aixdb1



root@aixdb1: / # odmget HACMPcluster
HACMPcluster:
        id = 1315338110
        name = "DFWEAICL"
        nodename = "aixdb1"            <--grep finds this entry
        sec_level = "Standard"
        sec_level_msg = ""
        ...
        rg_distribution_policy = "node"
        noautoverification = 1
        clvernodename = "aixdb1"        <--grep finds this entry as well (this is causing the trouble)
        clverhour = 0
        clverstartupoptions = 0
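
The effect can be reproduced without a cluster. The sketch below feeds a shortened copy of the stanza above through the sed expression from the pre_rm script, and then through an anchored variant that matches only the field named exactly "nodename". The anchored variant is only an illustration, not a fix shipped by IBM; the variable names are made up for this example.

```shell
# Shortened copy of the odmget HACMPcluster output shown above.
stanza='HACMPcluster:
        name = "DFWEAICL"
        nodename = "aixdb1"
        clvernodename = "aixdb1"'

# Expression from the pre_rm script: "/nodename = /" also matches clvernodename,
# so two lines are printed.
loose=$(printf '%s\n' "$stanza" | sed -n '/nodename = /s/^.* "\(.*\)".*/\1/p')

# Anchored variant (illustration only): the field name must start the line,
# after optional whitespace, so clvernodename no longer matches.
strict=$(printf '%s\n' "$stanza" | sed -n '/^[[:space:]]*nodename = /s/^.* "\(.*\)".*/\1/p')

# $loose is left unquoted on purpose: word splitting turns the two lines into
# two words, which is how "grep -qw $LOCAL_NODE" ends up with an extra file argument.
echo "loose:  $(echo $loose)"
echo "strict: $strict"
```

With the unquoted two-word value, grep takes the first word as the pattern and the second word as a file name, which explains the "grep: can't open aixdb1" message.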


After searching for what clvernodename is, we found out that this field is set by "Automatic Cluster Configuration Verification"; if that is set to Disabled, the additional entry is removed from the ODM.

We checked in smitty hacmp -> HACMP verification -> Automatic..:

* Automatic cluster configuration verification      Enabled        <--we changed it to disabled
* Node name                                         aixdb1
* HOUR (00 - 23)                                    [00]
  Debug                                             no       

After this correction, we ran smitty update_all again. We received some similar errors (grep: can't open...), but when we retried smitty update_all, it completed successfully. (All the earlier Broken filesets were corrected, and we had the new PowerHA version without errors.)

------------------------------------------------------

27 comments:

  1. This is a very good learning page. Thanks Author

  2. Greetings, have you experienced following error? Cluster is stable but only one node can read the information. On other node clstrmgrES is unresponsive:

    lssrc -ls clstrmgrES
    0513-014 The request could not be passed to the clstrmgrES subsystem.
    The System Resource Controller is experiencing problems with
    the subsystem's communication socket.

    If you search for 0513-014 in the rsct documentation you don't find satisfactory solution how to proceed: http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=%2Fcom.ibm.cluster.rsct.v3r2.rsct400.doc%2F0513-014.htm

    User response
    Contact your system administrator.

    is not really a good help if you are the system administrator =) Any clues what can be checked next? all snmp related daemons are running.

    Replies
    1. If I remember correctly, once I had some comm. problem on a cluster and I did "refresh -s clstrmgrES" which solved that issue. (Be careful if it is a productive system, for me it was a test system so it was not important if cluster goes down.)

      Other idea is restarting cluster services.
      Other idea: The error message says SRC has some problem with clstrmgrES, so probably you can look around there as well:

      This is how my tree looks like, probably you can see some difference:

      # lssrc -s clstrmgrES
      Subsystem Group PID Status
      clstrmgrES cluster 151816 active

      # ps -ef | grep srcmstr
      root 110802 1 0 Jul 28 - 1:27 /usr/sbin/srcmstr
      # ps -fT 110802
      UID PID PPID C STIME TTY TIME CMD
      root 110802 1 0 Jul 28 - 1:27 /usr/sbin/srcmstr
      root 106674 110802 0 Jul 28 - 0:00 |\--/usr/sbin/portmap
      root 114824 110802 0 Jul 28 - 0:00 |\--/usr/sbin/nimsh -s
      ...
      ...
      root 135662 110802 0 Jul 28 - 44:17 |\--/usr/sbin/rsct/bin/IBM.ServiceRMd
      root 151816 110802 0 Jul 28 - 250:06 |\--/usr/es/sbin/cluster/clstrmgr
      root 676408 151816 0 Nov 05 - 0:00 | \--run_rcovcmd

  3. Hi,

    Could please share the step by step process of increasing filesize in hacmp

    Regards,
    Siva

    Replies
    1. Hi, almost all system administration on HACMP cluster should be started with "smitty hacmp".
      This is the case here: smitty hacmp -> System Management (C-SPOC) -> HACMP Logical Volume Management -> Shared File Systems -> ... Change/Show characteristic
      In the size field you can give the new size of the filesystem, or with a + sign, you say how much additional space is needed for the filesystem. Be prepared that, by default, it counts the size in 512-byte blocks, so if you would like to add +1 MB, you should write 2048 in the field.

      Regards,
      Balazs
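
The block arithmetic in the reply above (1 MB = 1024 * 1024 / 512 = 2048 blocks) can be sketched as a tiny helper; the function name mb_to_blocks is made up for this example.

```shell
# Sketch only: mb_to_blocks is a hypothetical helper name.
# Converts a size in megabytes to 512-byte blocks (1 MB = 2048 blocks).
mb_to_blocks() {
    echo $(( $1 * 2048 ))
}

mb_to_blocks 1      # prints 2048
mb_to_blocks 500    # prints 1024000
```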

    2. using CLI /usr/es/sbin/cluster/cspoc/cli_chfs -a size=+(amount of space to increase in GB or MB )

    4. root /usr/es/sbin/cluster/cspoc/cli_extendlv PP's
      root /usr/es/sbin/cluster/cspoc/cli_chfs -a size=5GB /filesystem

  4. Hi,

    can you please tell me how to see the information in hacmp.out file
    what happen if it remove

    Regards,

    Replies
    1. Hi, it is a plain text file, you can read it with cat or tail ....
      If it is removed, create a new empty file and HACMP will use that.

  5. Hi,

    Can you pls share the procedure to convert a normal sharedvg to BigVG under cluster. else is it the same procedure to follow what we would do for a normal VG in AIX lvm, ensure atleast 1 free PP on all of the vg disks and then fire the command which can be done online.
    I have datavg as shared vg created as Normal which can accommodate 32 disks max. I have a requirement of extending existing shared FS under datavg, for which I would require more luns (> 32) to be added to fulfill this requirement.
    Converting datavg from Normal to Bigvg is what I'm looking for as an option. is it a disruptive and does it requires DB2 apps to be down.

    Your response is highly appreaciated.

    Replies
    1. Hi, in smitty HACMP --> in Storage section --> in Change/Show characteristics of a Volume Group, there is a possibility to do this: "Change to big VG format?" I think if this option is implemented in SMIT, it is a safe way to do this way...however I've never tried this, and if there is a possibility try first on a test system.

  6. Hi Balazs,


    We have one PowerHA running on OS 6.1 which is in production. Two node are are participating in HACMP .

    And clinfoES src is not active on both the node due to which we are not able to use clstat.

    Is there any consequences if we start clinfoES src manually on both the nodes. If yes best possible way to start the same.

    Regard
    Manoj Suyal

    Replies
    1. Hi Manoj,

      The following is the hierarchy to stop and start the clinfoES

      root stopsrc -s clinfoES
      root stopsrc -s snmpd
      root startsrc -s snmpd
      root startsrc -s clinfoES

      Regards,
      Ramya

  7. bash-3.2# ./clshowsrv -v
    Status of the RSCT subsystems used by HACMP:
    Subsystem Group PID Status
    topsvcs topsvcs 11075752 active
    grpsvcs grpsvcs 11862222 active
    grpglsm grpsvcs inoperative
    emsvcs emsvcs 10879104 active
    emaixos emsvcs inoperative
    ctrmc rsct 6095092 active

    Status of the HACMP subsystems:
    Subsystem Group PID Status
    clcomdES clcomdES 7077964 active
    clstrmgrES cluster 7733272 active

    Status of the optional HACMP subsystems:
    Subsystem Group PID Status
    clinfoES cluster inoperative

    Replies
    1. Hi, if you read above, you can see I had some issues with clinfo as well. startsrc -s clinfoES solved my issues and I had no problem at all starting it.

  8. Anyone could tell me, how to find the RG failover date and time information?

    Replies
    1. Give " mount " command in the cluster server where RG is up and running...
      Check the cluster file system mounted date and time..It gives you the RG failover date and time info.

      -------------------------------
      /dev/lvfs1 /fs1 jfs2 Nov 04 19:04 rw,log=/dev/lvlog1502
      /dev/lvfsrg /fs_RG1 jfs2 Nov 04 19:04 rw,log=/dev/lvlog1502

      Regards,
      Ramya

  9. does anyone know:

    dbserver:
    #get_local_nodename
    ---------------> but the output is empty
    #

    apserver:
    #get_local_nodename
    apserver
    #

  10. HI Belaz ,I want to ask you something about removing HACMP, if I remove both nodes from the cluster after stopping it and then remove the filesets , Can I continue use the nodes without any trouble , Do I have to remove the resource group or cluster network before removing the filesets? I am assigned a task to remove HACMP because they want to implement Oracle RAC cluster instead.Because they prefer active/active cluster instead of active/passive. Do I have to change those VGs after removing the HACMP (chvg -l vgname) . Do you have any suggestions, I will really appreciate it. Thanks

    Replies
    1. Hi, I cannot give you a step-by-step info, but at this link there are some hints: http://www-01.ibm.com/support/docview.wss?uid=isg3T1000444
      Basically I would document all necessary info (network, vg/lv/fs/NFS...), stop the cluster, remove the cluster config (smitty hacmp), and remove the cluster filesets.
      While doing this Oracle should be stopped, after checking the system and doing necessary actions. chvg -l is also a good idea and there are some other hints at the IBM link above.

    2. Thanks Belaz, I wasn't looking for step by step info , that will be like you are doing my job , you already done more that enough in this site, I was just looking for your opinion on HACMP to RAC, I know in RAC (which i don't know much about) they need to have private network between to LPARs , (little different than HACMP private network) which is easy to do if it is a virtual servers, I can just create another virtual switch and configure a ip for these two nodes only, but I will have to think about how to do it between one physical lpar and one virtual lpar. If there is any thoughts , I will appreciate that . Thanks again

    3. Hi, usually ip should come from Network team, they should give you subnet mask, VLAN id (if there is any), etc. So I just need to configure those IPs (either creating new virtual ethernet adapters, or on physical ones (or as alias)). Creating new virtual switch by your own for virtual servers is OK if they are on 1 Physical Machine, but if you want to do LPM that will not work. So, best is to discuss it with Network team.

  11. Hi , I newbie HACMP
    could you help me ?
    There are node 1(primary) and node 2 (secondary) in the same cluster, i can moving the RG from node 1 to node 2 (online and also offline) , but when i try moving RG from node 2 to node 1 the RG cannot move, there was one Filesystem (application) can't move, so application be error...could you tell me how to solve about it ?, or where i can know error-re

  13. Hi Blaze,
    I got a requirement to change the boot and service IP in both nodes.Could you please let me know the process to do.
