odmget HACMPlogs shows where the log files are
odmget HACMPcluster shows the cluster version
odmget HACMPnode shows node info (including the cluster version)
/etc/es/objrepos HACMP ODM files are located here
Log files:
(changing log files path: C-SPOC > Log Viewing and Management)
/var/hacmp/adm/cluster.log main PowerHA log file (errors,events,messages ... /usr/es/adm/cluster.log)
/var/hacmp/adm/history/cluster.mmddyy shows only the EVENTS, generated daily (/usr/es/sbin/cluster/history/cluster.mmddyyyy)
/var/hacmp/log/clinfo.log records the activity of the clinfo daemon
/var/hacmp/log/clstrmgr.debug debug info about the cluster (clstrmgr.debug.long also exists); IBM support uses these
/var/hacmp/log/clutils.log summary of nightly verification
/var/hacmp/log/cspoc.log shows details of the C-SPOC (smitty cspoc) commands (a good place to look if a command fails)
/var/hacmp/log/hacmp.out similar to cluster.log, but more detailed (with all the output of the scripts)
/var/hacmp/log/loganalyzer/loganalyzer.log log analyzer (clanalyze) stores outputs there
/var/hacmp/clverify shows the results of the verifications (verification errors are logged here)
/var/log/clcomd/clcomd.log contains every connect request between the nodes and return status of the requests
RSCT Logs:
/var/ha/log RSCT logs are here
/var/ha/log/nim.topsvcs... the heartbeats are logged here (to check that communication between the nodes is OK)
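During a takeover or any other cluster event, the easiest way to watch what is happening is to follow the main logs. A minimal sketch, assuming the default log locations listed above:
tail -f /var/hacmp/log/hacmp.out                 # full event processing in real time
grep EVENT /var/hacmp/log/hacmp.out | tail -20   # just the EVENT START / EVENT COMPLETED lines
tail -f /var/hacmp/adm/cluster.log               # the shorter error/event/message view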
clRGinfo Shows the state of RGs (in earlier HACMP clfindres was used)
clRGinfo -p shows the node that temporarily has the highest priority (priority override location, POL)
clRGinfo -t shows the delayed timer information
clRGinfo -m shows the status of the application monitors of the cluster
resource groups state can be: online, offline, acquiring, releasing, error, unknown
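For scripting, a hedged sketch: it assumes clRGinfo -s prints colon-delimited lines whose first three fields are group, state and node (the field order can differ between versions, so verify it against plain clRGinfo output first):
# print where each resource group is ONLINE
clRGinfo -s 2>/dev/null | awk -F: '$2 == "ONLINE" {print $1 " is ONLINE on " $3}'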
cldump (or clstat -o) detailed info about the cluster (realtime, shows cluster status) (clstat requires a running clinfo)
cldisp detailed general info about the cluster (not realtime) (cldisp | egrep 'start|stop', lists start/stop scripts)
cltopinfo Detailed information about the network of the cluster (this shows the data in DCD not in ACD)
cltopinfo -i good overview, same as cllsif: this also lists the cluster interfaces (cllsif was used prior to HACMP 5.1)
cltopinfo -m shows heartbeat statistics, missed heartbeats (-m is no longer available on PowerHA 7.1)
clshowres Detailed information about the resource group(s)
cllsserv Shows which scripts will be run in case of a takeover
clrgdependency -t PARENT_CHILD -sl shows parent child dependencies of resource groups
clshowsrv -v shows status of the cluster daemons (very good overview!!!)
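Before a change window it can be handy to dump these overviews into one file as cluster documentation. A minimal sketch using the commands above (the output file name is just an example):
export PATH=$PATH:/usr/es/sbin/cluster/utilities
for c in cltopinfo "cltopinfo -i" clshowres cllsserv "clRGinfo -p" "clshowsrv -v"; do
    echo "===== $c ====="
    $c
done > /tmp/cluster_doc.$(date +%Y%m%d) 2>&1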
lssrc -g cluster lists the running cluster daemons
ST_STABLE: cluster services running with resources online
NOT_CONFIGURED: cluster is not configured or node is not synced
ST_INIT: cluster is configured but not active on this node
ST_JOINING: cluster node is joining the cluster
ST_VOTING: cluster nodes are voting to decide event execution
ST_RP_RUNNING: cluster is running a recovery program
RP_FAILED: a recovery program event script has failed
ST_BARRIER: clstrmgr is in between events waiting at the barrier
ST_CBARRIER: clstrmgr is exiting a recovery program
ST_UNSTABLE: cluster is unstable usually due to an event error
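The current state can be read straight from the cluster manager; a minimal check (the "Current state" line is present on recent HACMP/PowerHA levels):
lssrc -ls clstrmgrES | grep -i "Current state"   # ST_STABLE is the normal state with resources online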
lssrc -ls topsvcs shows the status of individual diskhb devices, heartbeat intervals, failure cycle (missed heartbeats)
lssrc -ls grpsvcs gives info about connected clients and the number of groups
lssrc -ls emsvcs shows the resource monitors known to the event management subsystem
lssrc -ls snmpd shows info about snmpd
halevel -s shows PowerHA level (from 6.1)
lscluster list CAA cluster configuration information
-c cluster configuration
-d disk (storage) configuration
-i interfaces configuration
-m node configuration
mkcluster create a CAA cluster
chcluster change a CAA cluster configuration
rmcluster remove a CAA cluster configuration
clcmd <command> it will run given <command> on both nodes (for example: clcmd date)
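Because clcmd prefixes each node's output with a NODE header, it is handy for quick side-by-side checks; a small hedged example:
clcmd date                                               # compare the clocks of the nodes
clcmd lssrc -ls clstrmgrES | egrep "NODE|Current state"  # cluster manager state on every node
clcmd netstat -i | egrep "NODE|en[0-9]"                  # interface overview per node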
cl_ping pings all the adapters of the given list (e.g.: cl_ping -w 2 aix21 aix31 (-w: wait 2 seconds))
cldiag HACMP troubleshooting tool (e.g.: cldiag debug clstrmgr -l 5 <--shows clstrmgr heartbeat info)
cldiag vgs -h nodeA nodeB <--this checks the shared VG definitions on the given nodes for inconsistencies
clmgr offline cluster WHEN=now MANAGE=offline STOP_CAA=yes stop cluster and CAA as well (after maintenance start with START_CAA=yes)
clmgr view report cluster TYPE=html FILE=/tmp/powerha.report create the HTML report
clanalyze -a -p "diskfailure" analyzes PowerHA logs for applicationfailure, interfacefailure, networkfailure, nodefailure...
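A few more clmgr examples as a hedged sketch (verify the attribute names against the clmgr help on your level; <rg_name> and <node_name> are placeholders):
clmgr query cluster                                     # basic cluster attributes
clmgr online cluster WHEN=now MANAGE=auto               # start cluster services on all nodes
clmgr move resource_group <rg_name> NODE=<node_name>    # move a resource group to another node
clmgr sync cluster                                      # verify and synchronize the cluster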
lvlstmajor lists the available major numbers for each node in the cluster
/usr/es/sbin/cluster/utilities/get_local_nodename shows the name of this node within the HACMP cluster
/usr/es/sbin/cluster/utilities/clexit.rc this script halts the node if the cluster manager daemon stops incorrectly
------------------------------------------------------
Remove HACMP:
1. stop the cluster on both nodes
2. remove the cluster configuration (smitty hacmp) on both nodes
3. remove the cluster filesets (starting with cluster.*), as sketched below
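A hedged sketch for step 3 (the exact fileset names depend on the version and installed options, so check the lslpp output first):
lslpp -l "cluster.*"                      # list the installed HACMP/PowerHA filesets
installp -ug cluster.es cluster.license   # example removal; remove every fileset that lslpp listed
lslpp -l "cluster.*"                      # should return nothing when the removal is complete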
------------------------------------------------------
If you are planning to do a crash test, do it with halt -q or reboot -q.
shutdown -Fr will not work, because it stops HACMP and the resource groups gracefully (rc.shutdown), so no takeover will occur.
------------------------------------------------------
clhaver - clcomd problem:
If there are problems during cluster startup or during synchronization and verification, and you see something like this:
1800-106 An error occurred:
connectconnect: : Connection refusedConnection refused
clhaver[113]: cl_socket(aix20)clhaver[113]: cl_socket(aix04): : Connection refusedConnection refused
Probably there is a problem with clcomd.
1. check if it is running: clshowsrv -v or lssrc -a | grep clcomd
refresh or start it: refresh -s clcomdES or startsrc -s clcomdES
2. check log file: /var/hacmp/clcomd/clcomd.log
you can see something like this: CONNECTION: REJECTED(Invalid address): aix10: 10.10.10.100->10.10.10.139
for me the solution was:
-update the /usr/sbin/cluster/etc/rhosts file on both nodes (I added all IPs of both servers, except the service IP and the service backup IP)
-refresh -s clcomdES
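A quick hedged health-check sequence for clcomd, using the paths mentioned above:
lssrc -a | grep clcomd                    # clcomdES should be active
cat /usr/sbin/cluster/etc/rhosts          # should contain the base IPs/hostnames of both nodes
tail -20 /var/hacmp/clcomd/clcomd.log     # look for CONNECTION: REJECTED lines
refresh -s clcomdES                       # re-read the rhosts file after editing it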
------------------------------------------------------
When trying to bring up a resource group in HACMP, got the following errors in the hacmp.out log file.
cl_disk_available[187] cl_fscsilunreset fscsi0 hdiskpower1 false
cl_fscsilunreset[124]: openx(/dev/hdiskpower1, O_RDWR, 0, SC_NO_RESERVE): Device busy
cl_fscsilunreset[400]: ioctl SCIOLSTART id=0X11000 lun=0X1000000000000 : Invalid argument
To resolve this, you will have to make sure that the SCSI reset disk method is configured in HACMP. For example, when using EMC storage:
Make sure emcpowerreset is present in /usr/lpp/EMC/Symmetrix/bin/emcpowerreset.
Then add new custom disk method:
smitty hacmp -> Ext. Conf. -> Ext. Res. Conf. -> HACMP Ext. Resources Conf. -> Conf. Custom Disk Methods -> Add Cust. Disk
* Disk Type (PdDvLn field from CuDv) [disk/pseudo/power]
* Method to identify ghost disks [SCSI3]
* Method to determine if a reserve is held [SCSI_TUR]
* Method to break a reserve [/usr/lpp/EMC/Symmetrix/bin/emcpowerreset]
Break reserves in parallel true
* Method to make the disk available [MKDEV]
------------------------------------------------------
Once I had a problem with commands 'cldump' and 'clstat -o' (version 5.4.1 SP3)
cldump: Waiting for the Cluster SMUX peer (clstrmgrES)
to stabilize...
Can not get cluster information.
Solution was:
-checked all the daemons mentioned below (clinfo, clcomd, snmpd...) and started whatever was missing
-and after that I did: refresh -s clstrmgrES (cldump and clstat were OK only after this refresh was done)
-once I had a problem with clstat -a (but clinfo was running); after refresh -s clinfoES it was OK again
(This can also be good: stopsrc -s clinfoES && sleep 2 && startsrc -s clinfoES )
things that can be checked regarding snmp:
-clinfoES and clcomdES:
clshowsrv -v
-snmpd and mibd daemons (if not active, startsrc can start them)
root@aix20: / # lssrc -a | egrep 'snm|mib'
snmpmibd tcpip 552998 active
aixmibd tcpip 524418 active
hostmibd tcpip 430138 active
snmpd tcpip 1212632 active
(hostmibd does not always need to be active)
-snmpd conf and log files
root@aix20: / # ls -l /etc | grep snmp
-rw-r----- 1 root system 2302 Aug 16 2005 clsnmp.conf
-rw-r--r-- 1 root system 37 Jun 16 16:18 snmpd.boots
-rw-r----- 1 root system 10135 Aug 11 2009 snmpd.conf
-rw-r----- 1 root system 2693 Aug 11 2009 snmpd.peers
-rw-r----- 1 root system 10074 Jun 16 16:22 snmpdv3.conf
drwxrwxr-x 2 root system 256 Aug 11 2009 snmpinterfaces
-rw-r----- 1 root system 1816 Aug 11 2009 snmpmibd.conf
root@aix20: / # ls -l /var/tmp | grep snmp
-rw-r--r-- 1 root system 83130 Jun 16 20:32 snmpdv3.log
-rw-r--r-- 1 root system 100006 Oct 01 2008 snmpdv3.log.1
-rw-r--r-- 1 root system 16417 Jun 16 16:19 snmpmibd.log
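Based on the above, a hedged restart sequence when cldump/clstat hang on the SMUX peer (use it with care on a production cluster):
stopsrc -s clinfoES
stopsrc -s snmpd ; sleep 2 ; startsrc -s snmpd
startsrc -s clinfoES
refresh -s clstrmgrES                 # this was the step that made cldump/clstat work again for me
lssrc -a | egrep 'clinfo|snm|mib'     # everything needed should show as active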
------------------------------------------------------
During PowerHA upgrade from 5.4.1 to 6.1 received these errors:
(it was an upgrade where we put the resource groups into unmanaged state)
grep: can't open aixdb1
./cluster.es.cspoc.rte.pre_rm: ERROR
Cluster services are active on this node. Please stop all
cluster services prior to installing this software.
...
grep: can't open aixdb1
./cluster.es.client.rte.pre_rm: ERROR
Cluster services are active on this node. Please stop all
cluster services prior to installing this software.
Failure occurred during pre_rm.
Failure occurred during rminstal.
installp: An internal error occurred while attempting
to access the Software Vital Product Data.
Use local problem reporting procedures.
We checked where to find the script mentioned in the ERROR:
root@aixdb1: / # find /usr -name cluster.es.client.rte.pre_rm -ls
145412 5 -rwxr-x--- 1 root system 4506 Feb 26 2009 /usr/lpp/cluster.es/inst_root/cluster.es.client.rte.pre_rm
Looking through the script, we found these 2 lines:
LOCAL_NODE=$(odmget HACMPcluster 2>/dev/null | sed -n '/nodename = /s/^.* "\(.*\)".*/\1/p')
LC_ALL=C lssrc -ls clstrmgrES | grep "Forced down" | grep -qw $LOCAL_NODE
Checking these, after running the second line, the original error could be successfully recreated:
root@aixdb1: / # lssrc -ls clstrmgrES | grep "Forced down" | grep -qw $LOCAL_NODE
grep: can't open aixdb1
There were 2 entries in this variable (the sed pattern 'nodename = ' also matches the clvernodename line), and that caused the error:
root@aixdb1: / # echo $LOCAL_NODE
aixdb1 aixdb1
root@aixdb1: / # odmget HACMPcluster
HACMPcluster:
id = 1315338110
name = "DFWEAICL"
nodename = "aixdb1" <--the sed pattern matches this entry
sec_level = "Standard"
sec_level_msg = ""
...
rg_distribution_policy = "node"
noautoverification = 1
clvernodename = "aixdb1" <--the sed pattern matches this entry as well (this is what causes the trouble)
clverhour = 0
clverstartupoptions = 0
After googling what clvernodename is, it turned out this field is set by "Automatic Cluster Configuration Verification", and if that is set to Disabled, the additional entry is removed from the ODM:
We checked in smitty hacmp -> HACMP Verification -> Automatic...:
* Automatic cluster configuration verification Enabled <--we changed it to disabled
* Node name aixdb1
* HOUR (00 - 23) [00]
Debug no
After this correction, smitty update_all was issued again. We received some similar errors (grep: can't open...), but when we retried smitty update_all it completed successfully. (All the earlier Broken filesets were corrected, and we had the new PowerHA version without errors.)
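A hedged way to double-check before retrying update_all (after disabling automatic verification, only the real nodename entry should be left):
odmget HACMPcluster | grep 'nodename ='      # should now match a single line
lssrc -ls clstrmgrES | grep 'Forced down'    # this is the string the pre_rm script greps for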
------------------------------------------------------
Manual cluster switch:
1. varyonvg
2. mount FS (mount -t sapX11)(mount -t nfs)
3. check nfs: clshowres
if there are exported fs: exportfs -a
go to the nfs client: mount -t nfs
4. IP configure (ifconfig)
grep ifconfig /tmp/hacmp.out -> it will show the command:
IPAT via IP replacement: ifconfig en1 inet 10.10.110.11 netmask 255.255.255.192 up mtu 1500
IPAT via IP aliasing: ifconfig en3 alias 10.10.90.254 netmask 255.255.255.192
netmask can be found from ifconfig or cltopinfo -i
(removing ip: ifconfig en3 delete 10.10.90.254)
5. check routing (extra routes could be necessary)
(removing route: route delete -host 10.10.90.192 10.10.90.254 or route delete -net 10.10.90.192/26 10.10.90.254)
6. start applications
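Putting the steps together, a hedged sketch of a manual takeover on the standby node, assuming the other node is down and the disks are free (the VG, filesystem and IP values are examples from above; the application start script is hypothetical):
varyonvg sharedvg                                          # 1. activate the shared VG
mount /shared/fs1                                          # 2. mount the shared filesystems
exportfs -a                                                # 3. only if the RG exports NFS filesystems
ifconfig en3 alias 10.10.90.254 netmask 255.255.255.192    # 4. IPAT via aliasing (command taken from hacmp.out)
route add -net 10.10.90.192/26 10.10.90.254                # 5. only if an extra route is needed
/usr/local/bin/start_app.sh                                # 6. hypothetical application start script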
------------------------------------------------------
Comments:
This is a very good learning page. Thanks Author
Greetings, have you experienced the following error? The cluster is stable but only one node can read the information. On the other node clstrmgrES is unresponsive:
lssrc -ls clstrmgrES
0513-014 The request could not be passed to the clstrmgrES subsystem.
The System Resource Controller is experiencing problems with
the subsystem's communication socket.
If you search for 0513-014 in the RSCT documentation you don't find a satisfactory solution on how to proceed: http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=%2Fcom.ibm.cluster.rsct.v3r2.rsct400.doc%2F0513-014.htm
User response
Contact your system administrator.
is not really a good help if you are the system administrator =) Any clues what can be checked next? all snmp related daemons are running.
If I remember correctly, once I had some comm. problem on a cluster and I did "refresh -s clstrmgrES", which solved that issue. (Be careful if it is a production system; for me it was a test system, so it was not important if the cluster went down.)
Other idea is restarting cluster services.
Other idea: the error message says SRC has some problem with clstrmgrES, so probably you can look around there as well:
This is how my tree looks; probably you can see some difference:
# lssrc -s clstrmgrES
Subsystem Group PID Status
clstrmgrES cluster 151816 active
# ps -ef | grep srcmstr
root 110802 1 0 Jul 28 - 1:27 /usr/sbin/srcmstr
# ps -fT 110802
UID PID PPID C STIME TTY TIME CMD
root 110802 1 0 Jul 28 - 1:27 /usr/sbin/srcmstr
root 106674 110802 0 Jul 28 - 0:00 |\--/usr/sbin/portmap
root 114824 110802 0 Jul 28 - 0:00 |\--/usr/sbin/nimsh -s
...
...
root 135662 110802 0 Jul 28 - 44:17 |\--/usr/sbin/rsct/bin/IBM.ServiceRMd
root 151816 110802 0 Jul 28 - 250:06 |\--/usr/es/sbin/cluster/clstrmgr
root 676408 151816 0 Nov 05 - 0:00 | \--run_rcovcmd
Hi,
Could please share the step by step process of increasing filesize in hacmp
Regards,
Siva
Hi, almost all system administration on a HACMP cluster should be started with "smitty hacmp".
This is the case here: smitty hacmp -> System Management (C-SPOC) -> HACMP Logical Volume Management -> Shared File Systems -> ... Change/Show characteristic
In the size field you can give the new size of the filesystem, or with a + sign you say how much additional space is needed for the filesystem. Be prepared that, by default, it counts the size in 512-byte blocks, so if you would like to add 1 MB, you should write 2048 in the field.
Regards,
Balazs
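The same change can also be done from the command line with the C-SPOC CLI wrapper; a hedged example (/fs1 is just an example filesystem; a size without a unit is in 512-byte blocks, units like M or G are accepted on recent levels):
/usr/es/sbin/cluster/cspoc/cli_chfs -a size=+2048 /fs1   # add 1 MB (2048 x 512-byte blocks)
/usr/es/sbin/cluster/cspoc/cli_chfs -a size=+1G /fs1     # add 1 GB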
Hi,
can you please tell me how to see the information in the hacmp.out file,
and what happens if it is removed?
Regards,
Hi, it is a plain text file, you can read it with cat or tail ....
If it is removed, create a new empty file and HACMP will use that.
Hi,
Can you please share the procedure to convert a normal shared VG to a Big VG under the cluster? Or is it the same procedure we would follow for a normal VG in AIX LVM: ensure at least 1 free PP on all of the VG disks and then fire the command, which can be done online?
I have datavg as a shared VG created as Normal, which can accommodate 32 disks max. I have a requirement to extend an existing shared FS under datavg, for which I would need more LUNs (>32) to be added to fulfill this requirement.
Converting datavg from Normal to Big VG is what I'm looking at as an option. Is it disruptive, and does it require the DB2 apps to be down?
Your response is highly appreciated.
Hi, in smitty hacmp --> in the Storage section --> in Change/Show characteristics of a Volume Group, there is a possibility to do this: "Change to big VG format?". I think if this option is implemented in SMIT, it is a safe way to do it... however I've never tried this, and if there is a possibility, try it first on a test system.
Hi Balazs,
We have one PowerHA cluster running on OS 6.1 which is in production. Two nodes are participating in HACMP.
The clinfoES subsystem is not active on either node, due to which we are not able to use clstat.
Are there any consequences if we start clinfoES manually on both nodes? If yes, what is the best possible way to start it?
Regard
Manoj Suyal
bash-3.2# ./clshowsrv -v
Status of the RSCT subsystems used by HACMP:
Subsystem Group PID Status
topsvcs topsvcs 11075752 active
grpsvcs grpsvcs 11862222 active
grpglsm grpsvcs inoperative
emsvcs emsvcs 10879104 active
emaixos emsvcs inoperative
ctrmc rsct 6095092 active
Status of the HACMP subsystems:
Subsystem Group PID Status
clcomdES clcomdES 7077964 active
clstrmgrES cluster 7733272 active
Status of the optional HACMP subsystems:
Subsystem Group PID Status
clinfoES cluster inoperative
Hi, if you read above, you can see I had some issues with clinfo as well. startsrc -s clinfoES solved my issues and I had no problem at all starting it.
using the CLI: /usr/es/sbin/cluster/cspoc/cli_chfs -a size=+(amount of space to increase, in GB or MB)
Anyone could tell me, how to find the RG failover date and time information?
does anyone know:
dbserver:
#get_local_nodename
---------------> but the output is empty
#
apserver:
#get_local_nodename
apserver
#
Give the "mount" command on the cluster node where the RG is up and running...
Check the mounted date and time of the cluster filesystems.. it gives you the RG failover date and time info.
-------------------------------
/dev/lvfs1 /fs1 jfs2 Nov 04 19:04 rw,log=/dev/lvlog1502
/dev/lvfsrg /fs_RG1 jfs2 Nov 04 19:04 rw,log=/dev/lvlog1502
Regards,
Ramya
root# /usr/es/sbin/cluster/cspoc/cli_extendlv <LV name> <number of PPs>
root# /usr/es/sbin/cluster/cspoc/cli_chfs -a size=5G /filesystem
Hi Manoj,
The following is the order to stop and start clinfoES:
root stopsrc -s clinfoES
root stopsrc -s snmpd
root startsrc -s snmpd
root startsrc -s clinfoES
Regards,
Ramya
Hi Balazs, I want to ask you something about removing HACMP. If I remove both nodes from the cluster after stopping it and then remove the filesets, can I continue to use the nodes without any trouble? Do I have to remove the resource group or the cluster network before removing the filesets? I am assigned a task to remove HACMP because they want to implement an Oracle RAC cluster instead, since they prefer an active/active cluster instead of active/passive. Do I have to change those VGs after removing HACMP (chvg -l vgname)? Do you have any suggestions? I will really appreciate it. Thanks
Hi, I cannot give you a step-by-step info, but at this link there are some hints: http://www-01.ibm.com/support/docview.wss?uid=isg3T1000444
Basically I would document all necessary info (network, vg/lv/fs/NFS...), stop the cluster, remove the cluster config (smitty hacmp), and remove the cluster filesets.
While doing this Oracle should be stopped; afterwards check the system and do any necessary actions. chvg -l is also a good idea, and there are some other hints at the IBM link above.
Thanks Balazs, I wasn't looking for step-by-step info, that would be like you doing my job; you have already done more than enough on this site. I was just looking for your opinion on HACMP to RAC. I know in RAC (which I don't know much about) they need a private network between the two LPARs (a little different from the HACMP private network), which is easy to do if they are virtual servers: I can just create another virtual switch and configure an IP for these two nodes only, but I will have to think about how to do it between one physical LPAR and one virtual LPAR. If there are any thoughts, I will appreciate that. Thanks again
Hi, usually the IPs should come from the Network team; they should give you the subnet mask, VLAN id (if there is any), etc. So you just need to configure those IPs (either creating new virtual ethernet adapters, or on physical ones, or as an alias). Creating a new virtual switch on your own for virtual servers is OK if they are on 1 physical machine, but if you want to do LPM that will not work. So, it is best to discuss it with the Network team.
Hi, I am a newbie to HACMP,
could you help me?
There are node 1 (primary) and node 2 (secondary) in the same cluster. I can move the RG from node 1 to node 2 (online and also offline), but when I try moving the RG from node 2 to node 1, the RG cannot move; one filesystem (application) can't move, so the application gets an error. Could you tell me how to solve this, or where I can check the error?
Hi Balazs,
I got a requirement to change the boot and service IPs on both nodes. Could you please let me know the process to do this?
HACMP 6.1