Enhanced Concurrent VG
PowerHA uses highly available filesystems on a special type of VG, called an "enhanced concurrent volume group" (ECVG). This is basically a regular VG that can be accessed on all nodes. Any disk which will be used by PowerHA on multiple nodes must be placed in an enhanced concurrent volume group, which can be used in either a concurrent or a non-concurrent setup:
- Concurrent setup: An application runs on all cluster nodes at the same time. In this setup ECVG enables concurrent access to the VG.
- Non-concurrent setup: An application runs on one node at a time. The volume groups are not accessed concurrently; they are accessed by one node at a time.
An ECVG can be varied on in two modes:
- Active state: The vg behaves the same way as the traditional varyon. Operations can be performed on the vg, and logical volumes and file systems can be mounted.
- Passive state: The passive state allows limited read only access to the volume group descriptor area (VGDA) and the logical volume control block (LVCB).
This type of VG can be created in "smitty hacmp" under C-SPOC --> Storage --> Volume Groups --> Create a Volume Group and then choose both nodes... (In PowerHA we need to create "enhanced concurrent mode volume groups", but this does not necessarily mean that we will use them for concurrent access)
Ensure that the automatic varyon attribute is set to No for shared volume groups.
So, in non-concurrent mode, the node which owns the VG (where the appl. is running) can do read-write operations, but on the other node only limited read operations are possible (passive mode). Passive mode is the LVM equivalent of disk fencing: LVM restricts higher-level actions, such as NFS and JFS mounts, and it does not allow read/write access to files. These different modes can be checked with the lsvg command:
# lsvg bb_vg
VOLUME GROUP: bb_vg
VG STATE: active
VG PERMISSION: read/write <--this is the active node
# lsvg bb_vg
VOLUME GROUP: bb_vg
VG STATE: active
VG PERMISSION: passive-only <--this is the standby node
(Also "lsvg -o" will report only those volume groups which are varied on in active mode.)
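The check above can be scripted. This is only a minimal sketch: vg_permission and vg_role are hypothetical helper names, and it assumes the "VG PERMISSION:" line looks like the lsvg output shown above.

```shell
#!/bin/sh
# Sketch: classify a VG from `lsvg <vgname>` output read on stdin.
# vg_permission and vg_role are hypothetical helper names, not AIX commands.

# Print the value that follows "VG PERMISSION:" (for example read/write or passive-only).
vg_permission() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i == "VG" && $(i + 1) == "PERMISSION:") { print $(i + 2); exit }
    }'
}

# Map the permission to the role it implies on this node.
vg_role() {
    perm=$(vg_permission)
    case "$perm" in
        read/write)   echo "active" ;;
        passive-only) echo "passive" ;;
        *)            echo "unknown" ;;
    esac
}

# Usage on a cluster node (not run here):
#   lsvg bb_vg | vg_role
```

Anything other than the two permissions shown above is reported as "unknown" rather than guessed.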
Historically, because there was no SCSI locking, a partitioned cluster could quickly vary on a VG, which could cause data corruption. Enhancements in AIX 6.1 and 7.1 introduced JFS2 mountguard. This option prevents a file system from being mounted on more than one system at the same time. PowerHA v7.1 and later automatically enable this feature.
During cluster configuration these volume groups are activated in passive mode. When the resource group comes online on the node, the VG is varied on in active mode. When the resource group goes offline, the volume group is returned to passive mode.
--------------------------
Fast disk takeover
To enable fast disk takeover, PowerHA activates enhanced concurrent volume groups in active or passive states. Fast disk takeover reduces total fallover time by providing faster acquisition of the disks without having to break SCSI reserves. During fast disk takeover, PowerHA skips the extra processing needed to break the disk reserves or update and synchronize the LVM information. It uses enhanced concurrent volume groups, and additional LVM enhancements provided by AIX. When a node fails, the other node changes the volume group state from passive mode to active mode. This change takes approximately 10 seconds and it is at the volume group level. This time impact is minimal compared to the previous method of breaking SCSI reserves.
The active and passive mode flags of the varyonvg command are not documented because they should not be used outside a PowerHA environment. However, you can find them in the hacmp.out log.
Active mode varyon command: varyonvg -n -c -A app2vg
Passive mode varyon command: varyonvg -n -c -P app2vg
When a resource group is brought online, PowerHA checks a disk with lqueryvg to determine if it is an enhanced concurrent volume group or not:
# lqueryvg -p hdisk0 -X
0 <---return code of 0 (zero) indicates a regular non-concurrent volume group (like rootvg)
# lqueryvg -p hdisk8 -X
32 <--return code of 32 indicates an enhanced concurrent volume group
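A small wrapper can turn this value into something readable. This is a sketch only: vg_kind_from_lqueryvg is a hypothetical name, and only the two values shown above (0 and 32) are mapped.

```shell
#!/bin/sh
# Sketch: interpret the value reported by `lqueryvg -p <hdisk> -X`.
# vg_kind_from_lqueryvg is a hypothetical helper; only the two values
# mentioned in the text are mapped, everything else is "unknown".
vg_kind_from_lqueryvg() {
    case "$1" in
        0)  echo "non-concurrent" ;;
        32) echo "enhanced-concurrent" ;;
        *)  echo "unknown" ;;
    esac
}

# Usage on a cluster node (not run here):
#   vg_kind_from_lqueryvg "$(lqueryvg -p hdisk8 -X)"
```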
----------------------------
Lazy update
If a volume group under PowerHA is updated directly (that is, without C-SPOC), the information on the other nodes will be updated only when PowerHA brings the VG online on those nodes, but not before.
The volume group time stamp is maintained in the VGDA and in the local ODM. When PowerHA varies on the VG, it compares the time stamp in the ODM with the one in the VGDA. If the values differ, PowerHA updates the ODM with the information in the VGDA.
In normal circumstances PowerHA does not require lazy update processing for enhanced concurrent volume groups, as it keeps all cluster nodes updated with the LVM information.
----------------------------
Disk fencing
In PowerHA 6.1 and previous releases, when using enhanced concurrent volume groups in a fast disk takeover mode, the VGs are in full read/write (active) mode on the node owning the resource group. Any standby node has the VGs varied on in read only (passive) mode. The passive mode is the LVM equivalent of disk fencing.
Passive mode allows readability only of the volume group descriptor area and the first 4 KB of each logical volume. It does not allow read/write access to file systems or logical volumes. It also does not support LVM operations. However, low-level commands, such as dd, can bypass LVM and write directly to the disk. And when the enhanced concurrent volume group is varied off, there is no write-protect restriction on the disks any more.
PowerHA SystemMirror 7.1 uses the storage framework disk fencing, which prevents disk write access at the AIX SCSI disk driver layer. This rules out lower-level operations, such as dd. This new feature prevents data corruption, and it protects the disks even after the enhanced concurrent volume groups are varied off.
PowerHA creates a fence group and the AIX SCSI disk driver rejects disk write access and returns the EWPROTECT error when disks are fenced with read only access.
Disk fencing when a node and an RG come online:
1. When a PowerHA node comes up, a fence group is created for the enhanced concurrent VG and its permissions (fence height) are set to read only. This prevents write access to the disks.
2. Prior to varying on the VG, the fence height is set to read write, so that the varyonvg to passive mode can work, and when it is done the fence height is returned to 'read only'.
3. When the PowerHA resource groups come online, the VG is varied on in active mode, the fence height is set to read write, and write access to the disks is granted.
----------------------------
NodeA has changes, and we want nodeB to be aware of them:
(for example an fs in a shared vg has been increased without HACMP)
I. TRY C-SPOC FIRST: smitty hacmp -> c-spoc -> HACMP Logical Volume Management -> Synchronize a Shared Volume Group Definition
II. MANUAL METHOD:
1. NodeA:
-check lv ownerships: ls -l /dev/<lvname>
-if varyoffvg is possible: varyoffvg <vgname>
-if varyoffvg is not possible: varyonvg -bun <vgname>
2. NodeB:
-if vg was not exported on the remote node (nodeA), learning import is possible (this preserves lv ownerships under /dev):
importvg -L <vgname> -n <hdiskname>
-if learning import not possible and vg already exists, first should be exported:
exportvg <vgname>
-import vg from disk: importvg -y <vgname> -n -V<maj.numb.> <hdisk>
-!!!!if necessary autovaryon should be set back to no on the standby node: varyonvg -bun <vg> -> chvg -an <vgname> -> varyoffvg
-if vg was exported/imported lv ownerships will be reset to root.system, check if ok under /dev
3. NodeA:
-if varyonvg -bun ... was used, vg should be set back to normal use:
varyonvg <vgname>
III. IMPORT VG DEFINITIONS TO HACMP:
On the node with the incorrect data and the VG NOT varied on export VG: exportvg shared_vg
On the node with the VG varied on:
smitty hacmp -> Ext. Conf. -> Ext. Res. Conf. -> HACMP Ext. Res. Group -> Change Show Res. and Attr.:
"Automatically Import Volume Groups" set to "true", and the VG information will be synchronized immediately.
--------------------------
Repository disk
PowerHA sends heartbeat packets between all communication interfaces (adapters) in the network to determine the status of the adapters and nodes. Cluster Aware AIX (CAA) provides an extra heartbeat path over SAN or Fibre Channel (FC) adapters. This SAN heartbeating (also known as sancomm) is not mandatory, however it provides an additional heartbeat path for redundancy.
Cluster Aware AIX maintains cluster related configurations (such as node list, cluster tunables etc.) on a special disk device called the repository disk. If a node loses access to the repository disk, the cluster continues operating, but the affected node is considered to be in degraded mode. If in this situation a network problem also occurs on this node, then PowerHA does not allow this node to operate, because a split brain situation could happen (partitioned cluster). In other situations where only the network interfaces have failed, the CAA and the storage framework use the repository disk to do all the necessary communication. (The tie breaker is an optional feature you can use to prevent a partitioned cluster, also known as split brain. If specified as an “arbitrator” (in split and merge policy), the tie breaker decides which partition of the cluster survives. A node that succeeds in placing a SCSI reserve on the tie breaker disk wins, and hence survives. The loser is rebooted.)
# /usr/lib/cluster/clras dumprepos <--displaying the content of the repository disk (including the CAA tunables and cluster configuration)
HEADER
CLUSRECID: 0xa9c2d4c2
Name: caa_cl
...
...
Multicast: 228.1.1.48
config_timeout : 240
node_down_delay : 10000
node_timeout : 20000
link_timeout : 0
deadman_mode : a
repos_mode : e
hb_src_lan : 1
hb_src_san : 2
hb_src_disk : 1
site_up : e
site_down : e
DISKS none
NODES numcl numz uuid shid flags name
1 0 25bc9128-784a-11e1-a79d-b6fcc11f846f 1 00000000 caa1
1 0 25e7a7dc-784a-11e1-a79d-b6fcc11f846f 2 00000000 caa2
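To pick a single tunable out of a dump like the one above, something along these lines may help. It is a sketch that assumes the "name : value" layout shown in the dump; caa_tunable is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: extract one "name : value" field from `clras dumprepos` output on stdin.
# caa_tunable is a hypothetical helper; it assumes the layout shown in the dump above.
caa_tunable() {
    awk -v key="$1" -F':' '{
        name = $1; value = $2
        gsub(/^[ \t]+|[ \t]+$/, "", name)     # trim blanks around the name
        gsub(/^[ \t]+|[ \t]+$/, "", value)    # trim blanks around the value
        if (name == key) { print value; exit }
    }'
}

# Usage on a cluster node (not run here):
#   /usr/lib/cluster/clras dumprepos | caa_tunable node_timeout
```

Only the first matching line is printed, so the helper is meant for single-value fields, not for the NODES table.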
A backup repository disk can also be configured. This is an empty disk to be used for rebuilding the cluster repository if the current repository disk fails. To see which disk is reserved as a backup disk, use the clmgr -v query repository command or the odmget HACMPsircol command.
Since PowerHA 7.1.1, some cluster maintenance tasks are still possible after the failure of a repository disk. This feature is called "repository disk resilience": moving Resource Groups and bringing a Resource Group online/offline are still possible, even with a failed repository disk. Because all repository configuration is also maintained in memory, CAA can recreate the configuration information when a new repository disk is provided, which can be done online with no cluster impact.
Replacing a repository disk from the command line:
1. # clmgr modify cluster caa_cl REPOSITORY=hdiskX <-- replace a repository disk
2. # clmgr verify cluster CHANGES_ONLY=yes SYNC=yes <--verify and synchronize PowerHA configuration
3. # /usr/sbin/lscluster -d <--after sync completed, verify that the new repository disk works
(A repository disk can be replaced using smitty hacmp as well: Problem Determination Tools --> Select a new Cluster repository disk)
When an additional node joins the cluster, CAA uses the ODM information to locate the repository disk. If the ODM entry is missing, it can be repopulated and the node forced to join the cluster using an undocumented option of clusterconf. This assumes the administrator knows the hdisk name of the repository disk: # clusterconf -r hdiskX
--------------------------
FS extension after new storage has been added
1. cfgmgr - on both nodes
2. add PVID - check LUNs are the same
3. add LUN to the vg in smitty hacmp (it will do necessary actions on both nodes)
(smitty hacmp -> system man. -> hacmp log. vol. man. -> shared vol. gr. -> set charact. of a ... -> add a volume ...)
4. lv, fs extension in smitty hacmp:
1. usual fs extension: hacmp is counting in 512byte blocks. 1MB=2048
or
2. if we want to choose which disks should be used:
-first increase lv (choose disks)
-lsfs -q (it will show new lv size in 512byte blocks)
-increase fs by the output of lsfs -q
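The 512-byte block arithmetic above (1MB=2048) can be sketched as two tiny helpers. mb_to_blocks and blocks_to_mb are hypothetical names, used here only to illustrate the conversion.

```shell
#!/bin/sh
# Sketch: convert between megabytes and 512-byte blocks (1 MB = 2048 blocks).
# mb_to_blocks and blocks_to_mb are hypothetical helper names.
mb_to_blocks() {
    echo $(( $1 * 2048 ))
}

blocks_to_mb() {
    echo $(( $1 / 2048 ))    # integer division, any remainder is dropped
}

# Example: a 512 MB increase expressed in 512-byte blocks
#   mb_to_blocks 512
```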
--------------------------
If PVIDs are not consistent (on nodeB):
nodeA: varyoffvg sharedvg
nodeB: rmdev -dl hdisk3 (this will not delete data on the disk, only removes ODM definitions)
rmdev all inconsistent disks
cfgmgr
importvg -V123 -ysharedvg hdisk3 (ODM will be updated with new values)
chvg -an sharedvg
varyoffvg sharedvg
nodeA: varyonvg sharedvg
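Before going through the procedure above, it can help to confirm that the PVIDs really differ. The sketch below compares the PVIDs that lspv reports for one VG on two nodes. It assumes the usual lspv column order (hdisk name, PVID, VG name); pvids_in_vg and pvids_consistent are hypothetical helper names, and how the output of the other node is collected (ssh in the usage comment) is also an assumption.

```shell
#!/bin/sh
# Sketch: compare the PVIDs of one VG as seen by two nodes.
# Assumes lspv columns are: hdisk name, PVID, VG name (helper names are hypothetical).

# Read lspv output on stdin and print the sorted PVIDs that belong to VG $1.
pvids_in_vg() {
    awk -v vg="$1" '$3 == vg { print $2 }' | sort
}

# $1 = lspv output of nodeA, $2 = lspv output of nodeB, $3 = VG name.
# Succeeds only if both nodes list the same, non-empty set of PVIDs for the VG.
pvids_consistent() {
    a=$(printf '%s\n' "$1" | pvids_in_vg "$3")
    b=$(printf '%s\n' "$2" | pvids_in_vg "$3")
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Usage (not run here; ssh access to nodeB is an assumption):
#   pvids_consistent "$(lspv)" "$(ssh nodeB lspv)" sharedvg || echo "PVIDs differ"
```

The comparison is done on PVIDs, not on hdisk names, because the same LUN can have a different hdisk number on each node.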
-----------------------------------
Error: timestamp is different for a VG on the 2 nodes:
This could be because timestamp on the disk and ODM is different for the vg (at least on 1 node)
You can check and compare the timestamps from ODM and disk:
-ODM:
lsattr -El <vgname> <--there will be a line for the timestamp
odmget CuAt | grep -p timestamp | grep -p <vgname> <--there will be a line for the timestamp
-disk:
lqueryvg -Tp <hdisk> <--it can be any disk from the vg
You can check on both nodes.
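The comparison itself can be sketched as below. vg_timestamps_match is a hypothetical helper that just compares two strings after removing whitespace. The usage comment assumes that lsattr with -a timestamp -F value and lqueryvg -Tp each print only the timestamp, which should be verified on the system before relying on it.

```shell
#!/bin/sh
# Sketch: compare the VG timestamp stored in the ODM with the one on disk.
# vg_timestamps_match is a hypothetical helper; it succeeds only when both
# values are non-empty and identical once whitespace is removed.
vg_timestamps_match() {
    odm_ts=$(printf '%s' "$1" | tr -d ' \t\n')
    disk_ts=$(printf '%s' "$2" | tr -d ' \t\n')
    [ -n "$odm_ts" ] && [ "$odm_ts" = "$disk_ts" ]
}

# Usage on a cluster node (not run here; output formats are an assumption):
#   vg_timestamps_match "$(lsattr -El sharedvg -a timestamp -F value)" \
#                       "$(lqueryvg -Tp hdisk3)" || echo "timestamps differ"
```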
Correction:
-increase in smitty hacmp
or
-importvg:
node1: varyonvg -bun ... <--break the reservation
node2: importvg -L ... <--updates the timestamp
node2: lsattr -El ... <--verify the values
node1: varyonvg -n ... <--restores the reservation
-----------------------------
Importing vgs with ESS:
-if the hdisks have a pvid:
importvg -y vg_name -V major# hdisk1
hd2vp vg_name
-if only the vpath has a pvid:
importvg -y vg_name -V major# vpath0
-if neither the hdisks nor the vpath have a pvid:
chdev -l vpath0 -a pv=yes
importvg -y vg_name -V major# vpath0
-----------------------------
Can we increase the PP size of a VG which is under HACMP?
How do we reflect the same on two nodes?
If PP size needs to be changed, you need to recreate the VG. (backing up the VG, recreating the VG with correct PP size, restoring the VG). On HACMP you have shared disks, so when you recreate the VG , you have to do it in "smitty hacmp", so both servers will know about the changes. The restore will happen on the online node only. Hope this helps.
How can we export the filesystems after we add them to the RG?
The NFS filesystems which are exported are attributes of the resource group:
Extended Conf. --> Ext. Resource Conf. --> HACMP Ext. Res. Group Conf. --> Change/Show Res. and Attr....
How can we export the NFS filesystem to particular nodes in HACMP? My understanding is that first we need to add it to the resource group and then manually edit /usr/es/sbin/cluster/etc/exports?
If the /usr/es/sbin/cluster/etc/exports file is missing, HACMP will export NFS filesystems with default options (root access for everyone). If special options are needed, you can set them in the /usr/es/sbin/cluster/etc/exports file.
I need to add a new LUN to an existing volume group in HACMP.
Could you please tell me the step by step procedure?
Thanks in Advance
The steps are right on this page, just a little above:
FS extension after new storage has been added
1. cfgmgr - on both nodes
2. add PVID - check LUNs are the same
3. add LUN to the vg in smitty hacmp (it will do necessary actions on both nodes)
...
PVID is needed on both nodes, otherwise in smitty HACMP there is no option to choose the disk.
thanks :-)
If I wish to configure mirrorvg using 2 different SAN storage systems of the same model as my shared storage,
what are the proper steps to implement it?
You need to add the LUNs to the nodes and to the vg (see my previous reply on how to configure disks in HACMP: cfgmgr, pvid...), then: smitty hacmp -> system man. -> hacmp log. vol. man. -> shared vol. gr. -> set charact. of a ... -> add a volume ...
When you have the disks there, you need to mirror it in smitty hacmp:
smitty hacmp -> system man. -> hacmp log. vol. man. -> shared vol. gr. -> mirror a shared vol. gr.
Thanks for the reply. Let's say one of the storage systems goes down and one of the disks in the mirror goes into stale state. When my storage recovers and is up again, how do I sync the disk back?
Hi, in smitty hacmp you can do almost everything which is cluster related.
For syncing back: smitty hacmp -> system management -> hacmp logical volume management -> synchronize shared lvm mirrors ...
Hi,
What is the best option if chvg -t vgname needs to be applied to a Concurrent VG in order to increase the max PV limit from 16 to 32?
1- varyoff the VG, then apply the change?
2- create a new VG with the B option and then import the existing VG?
How can it be applied on an Active/Active node?
Hi,
"chvg -t" on a normal VG can be applied online, however I have never used this on Concurrent VG.
To be on the safe side, I would do it while VG is varied off, and do cluster synchronization as well.
If you need official answer, probably the best to ask IBM support.
If you find a good solution and you share that here, that would be great!
Balazs
hi,
I have added one filesystem using an OS command in an enhanced concurrent VG which is managed by HACMP. When I run "lsvg -l vgname" on the passive node I am able to see the volume in the output, but the mount point is not showing on the passive node. I have tried to sync hacmp and to sync the volume group definition as well, but there was no difference. I tried "varyonvg -bu vgname" and "importvg -L vgname" on the 2nd node and it works fine.
Is it the only way to get the mount point name reflected on the 2nd node?
Hi, I would add a filesystem in HACMP with "smitty HACMP" and not by OS command. If you use smitty HACMP, it will take care to implement the changes on the other node as well, and do many other things which is necessary to have a synchronized cluster. (But your workaround looks good as well :-))
Hi,
What is the purpose of using a major number to create a VG in HACMP? Any specific reason?
Regards,
Siva
Hi,
IBM says this:
"When creating shared volume groups, typically you can leave the Major Number field blank and let the system provide a default. However, NFS uses volume group major numbers to help uniquely identify exported file systems. Therefore, all nodes to be included in a resource group containing an NFS-exported file system must have the same major number for the volume group on which the file system resides."
Hope this helps,
Balazs
Hi,
Please tell me:
In which cases do we need to bring the RG offline while synchronizing the cluster?
And in which cases can the RG stay online while synchronizing?
thanks,
sathish.
Hi, there are some cluster configuration settings which can be synchronized only if the RG is in offline state, however most of the settings can be synchronized when RG is online. (More info can be found in PowerHA Redbooks.)
I understand this task, but I have a doubt: do we need to update the entry in /usr/es/sbin/cluster/etc/exports, or will it be updated automatically? Kindly reply to me.
my id: johnsoncls@google.com
I guess it will not be updated automatically and you need to add it manually. Otherwise it will be exported with default options, like rw to everyone.
Hi Sir,
This is Kishor
how to find rg state? and how to know which rg using in cluster?
How to extend the filesystem in hacmp using cmdline?
How to extend the vg and lv in hacmp using cmdline?
How can I remove a VG from HACMP?
Is PowerHA able to export GPFS filesystems through an NFSv4 server using sec=krb5 (Kerberos 5)?
I'm working to install and set up this environment:
GPFS filesystems to export by NFSv4 using Kerberos 5 authentication.
I have an IBM Kerberos 5 NAS system using an ITM LDAP Tivoli Directory Server backend.
I would like to configure IBM PowerHA to have an NFSv4 server in High Availability for exporting my GPFS filesystems (so not a JFS or JFS2 shared volume group in HACMP, but a GPFS filesystem).
So right now, my LDAP and Kerberos 5 servers are configured and working well.
My PowerHA servers and resource group are configured and working "pretty well".
I run kinit host/hostname, and I verified with klist that my krb5 creds are ok.
I'm able to mount my nfsv4_service and list the nfsv4 mount point.
But when I fail over to the 2nd NFSv4 server (from PowerHA), I lose access to the nfsv4 mount point. I'm not able to list the GPFS filesystems anymore. But if I come back to the 1st server, I'm able to list my mount point.
Has anybody configured something like this? How to make an NFSv4 server using Kerberos 5 authentication in an HA environment, to be able to do maintenance and patching without impact to clients?
Regards,
Eric Dubé
Can anyone tell me what cross mounting a file system is in PowerHA 7.1, and how to do that?