
HMC - RMC


RMC (Resource Monitoring and Control):

RMC is a distributed framework and architecture that allows the HMC to communicate with a managed logical partition. The RMC daemons must be running on the AIX partition in order to perform DLPAR operations from the HMC.

For example "Dynamic LPAR Resource Manager" is an RMC daemon that runs inside the AIX (and VIO server). The HMC uses this capability to remotely execute partition specific commands.

The daemons in the LPARs and the daemons on the HMC must be able to communicate through an external network, not through the Service Processor: an external network that both the partition and the HMC have access to.

For example, if the HMC has a connection to a 9.x.x.x network and I put my AIX partition on that same 9.x.x.x network, then as long as there is network connectivity (the HMC is allowed to communicate with that partition over that network) and the RMC daemon is running on the partition, DLPAR operations are available.

In order for RMC to work, port 657 udp/tcp must be open in both directions between the HMC public interface and the LPAR.
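
A quick way to verify this from the AIX side (a minimal check; 9.57.24.139 is just an example HMC address taken from the output further below, substitute your own):

# netstat -an | grep 657           <--the RMC daemon should be listening on *.657 (both tcp and udp)
# telnet 9.57.24.139 657           <--a successful connect means tcp/657 towards the HMC is not blocked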

The RMC daemons are part of the Reliable, Scalable Cluster Technology (RSCT) and are controlled by the System Resource Controller (SRC). These daemons run in all LPARs and communicate with equivalent RMC daemons running on the HMC. The daemons start automatically when the operating system starts and synchronize with the HMC RMC daemons.
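
On AIX you can check both pieces of this quickly (the inittab line shown here is the typical default entry, it may differ slightly between releases):

# lsitab ctrmc                     <--shows the inittab entry that starts RMC at boot
ctrmc:2:once:/usr/bin/startsrc -s ctrmc > /dev/null 2>&1
# lssrc -s ctrmc                   <--shows the status of the RMC subsystem under SRC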

Note: Apart from rebooting, there is no way to stop and start the RMC daemons on the HMC!

----------------------------------------

HMC and LPAR authentication (RSCT authentication)
(RSCT authentication is used to ensure the HMC is communicating with the correct LPAR.)

Authentication is the process of ensuring that another party is who it claims to be.
Authorization is the process by which a cluster software component grants or denies resources based on certain criteria.
The RSCT component that implements authorization is RMC. It uses access control list (ACL) files to control user access to resources.
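
On the partition, the ACL file RMC uses is normally /var/ct/cfg/ctrmc.acls (if that file does not exist, the shipped default /usr/sbin/rsct/cfg/ctrmc.acls is used); you can simply view it to see which identities have been granted access:

# cat /var/ct/cfg/ctrmc.acls       <--RMC access control list (the HMC entries are added automatically during the handshake described below)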


The RSCT authorization process in detail:
1. On the HMC: DMSRM pushes down a secret key and the HMC IP address to the NVRAM of the managed system where the AIX LPAR exists.

2. On the AIX LPAR: CSMAgentRM reads the key and HMC IP address from NVRAM. It then authenticates the HMC. This process is repeated every five minutes on an LPAR to detect new HMCs.

3. On the AIX LPAR: After authenticating the HMC, CSMAgentRM contacts the DMSRM on the HMC to create a ManagedNode resource, and then it creates a ManagementServer resource on AIX.

4. On the AIX LPAR: After the creation of these resources on the HMC and AIX, CSMAgentRM grants the HMC permission to access the necessary resources on the LPAR and changes its ManagedNode Status to 1 on the HMC.

5. On the HMC: After the ManagedNode Status is changed to 1, a session is established with the LPAR to query operating system information and DLPAR capabilities; the HMC then waits for DLPAR commands from users.
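
You can verify the result of this handshake from the AIX side: the HMC should show up as a management server resource (on newer RSCT levels the resource class is IBM.MCP, on older levels IBM.ManagementServer):

# lsrsrc IBM.MCP                   <--each resource represents an HMC that manages this LPAR (newer RSCT levels)
# lsrsrc IBM.ManagementServer      <--the same information on older RSCT levels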

----------------------------------------

RMC Domain Status

When partitions have active RMC connections, they become managed nodes in a Management Domain. The HMC is then the Management Control Point (MCP) of that Management Domain. You can then use the rmcdomainstatus command to check the status of those managed nodes (i.e. your partitions).

As root on the HMC or on the AIX LPAR you can execute the rmcdomainstatus command as follows:

# /usr/sbin/rsct/bin/rmcdomainstatus -s ctrmc

From HMC: You should get a list of all the partitions that the HMC server can reach on the public network on port 657.

Management Domain Status: Managed Nodes
  O a  0xc8bc2c9647c1cef3  0003  9.2.5.241
  I a  0x96586cb4b5fc641c  0002  9.2.5.33


From LPAR: You should get a list of all the Management Control Points

Management Domain Status: Management Control Points
   I A  0xef889c809d9617c7 0001  9.57.24.139


First column:
-I: Indicates that the partition is "Up" as determined by the RMC heartbeat mechanism (i.e. an active RMC connection exists).
-O: Indicates that the RMC connection is "Down", as determined by the RMC heartbeat mechanism.

Second column:
-A: Indicates that there are no messages queued to the specified node.
-a: Same as A, but the specified node is running a version of the RMC daemon at a lower code level than the local RMC daemon.

more info: https://www-304.ibm.com/support/docview.wss?uid=isg3T1011508

----------------------------------------

If rmcdomainstatus shows "i" in the first column:

It indicates that the partition is "Pending Up": communication has been established, but the initial handshake between the two RMC daemons has not been completed (message authentication is most likely failing).
Authentication problems occur when the partition identities do not match each other's trusted host lists:

# /usr/sbin/rsct/bin/ctsvhbal        <--list the current identities for the HMC and the logical partition (run this command on both)
# /usr/sbin/rsct/bin/ctsthl -l       <--list the trusted host list on the partition

In the trusted host list on the HMC there is an entry for the partition, and on the partition there is an entry for the HMC. The HOST_IDENTITY value in each entry must match one of the identities listed in the ctsvhbal output of the other side.
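
For illustration only (host name and IP taken from the lspartition example later in this post), the ctsvhbal output on the partition could look roughly like this; the HMC's trusted host list must contain one of these identities, and the partition's trusted host list must contain one of the identities shown by ctsvhbal on the HMC:

# /usr/sbin/rsct/bin/ctsvhbal
ctsvhbal: The Host Based Authentication (HBA) mechanism identities for the local system are:
              Identity:  aix10.domain.com
              Identity:  10.10.50.18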

----------------------------------------

Things to check at the HMC:

- checking the status of the managed nodes: /usr/sbin/rsct/bin/rmcdomainstatus -s ctrmc  (you must be root on the HMC)

- checking connection between HMC and LPAR:
hscroot@hmc10:~> lspartition -dlpar
<#0> Partition:<2*8204-E8A*0680E12 aix10.domain.com, 10.10.50.18>
       Active:<1>, OS:<AIX, 6.1, 6100-03-01-0921>, DCaps:<0x4f9f>, CmdCaps:<0x1b, 0x1b>, PinnedMem:<1452>
<#1> Partition:<4*8204-E8A*0680E32 aix20.domain.com, 10.10.50.71>
       Active:<0>, OS:<AIX, 6.1, 6100-04-05-1015>, DCaps:<0x0>, CmdCaps:<0x1b, 0x1b>, PinnedMem:<656>

For correct DLPAR function:
- the partition must return with the correct IP of the LPAR,
- the active value (Active:...) must be higher than zero,
- the DCaps value (DCaps:...) must be higher than 0x0.

(The first entry shows a DLPAR capable LPAR, the second entry is a non-working LPAR.)

- another way to check RMC connection: lssyscfg -r lpar -F lpar_id,name,state,rmc_state,rmc_ipaddr -m p750
(It should list "active" for the LPARs with active RMC connection.)
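
Illustrative output (the fields follow the -F list above; the values are examples only):
1,aix10,Running,active,10.10.50.18
2,aix20,Running,inactive,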


----------------------------------------

Things to check at the LPAR:

- checking the status of the managed nodes: /usr/sbin/rsct/bin/rmcdomainstatus -s ctrmc

- Checking RMC status:
# lssrc -a | grep rsct
 ctrmc            rsct             8847376      active          <--the RMC subsystem itself
 IBM.DRM          rsct_rm          6684802      active          <--executes the DLPAR commands on the partition
 IBM.DMSRM        rsct_rm          7929940      active          <--tracks the statuses of partitions
 IBM.ServiceRM    rsct_rm          10223780     active
 IBM.CSMAgentRM   rsct_rm          4915254      active          <--handles the handshaking between the partition and the HMC
 ctcas            rsct                          inoperative     <--security verification
 IBM.ERRM         rsct_rm                       inoperative
 IBM.AuditRM      rsct_rm                       inoperative
 IBM.LPRM         rsct_rm                       inoperative
 IBM.HostRM       rsct_rm                       inoperative     <--obtains OS information

You will see some subsystems active and some inoperative (the key one for DLPAR is IBM.DRM).
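
To check just that one subsystem:

# lssrc -s IBM.DRM                 <--should report "active" for DLPAR to work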

- Stopping and starting RMC without erasing configuration:

# /usr/sbin/rsct/bin/rmcctrl -z    <--it stops the daemons
# /usr/sbin/rsct/bin/rmcctrl -A    <--adds entry to /etc/inittab and it starts the daemons
# /usr/sbin/rsct/bin/rmcctrl -p    <--enables the daemons for remote client connections

(This is the correct method to stop and start RMC without erasing the configuration.)
Do not use stopsrc and startsrc for these daemons; use the rmcctrl commands instead!

- recfgct: deletes the RMC database, does a discovery, and recreates the RMC configuration
# /usr/sbin/rsct/install/bin/recfgct
(Wait several minutes)
# lssrc -a | grep rsct

(If you see IBM.DRM active, then you have probably resolved the issue)

- lsrsrc "IBM.ManagementServer"    <--it shows HMCs via RMC

33 comments:

Unknown said...

How will we know whether a managed machine has DLPAR capability or not?

aix said...

On the HMC, if you issue the command "lspartition -dlpar", it will show the DLPAR capable partitions.
(It will show whether the RMC connection is OK between the HMC and the LPAR.)

Unknown said...

Hi..

How do the HMC and LPAR communicate?

What is the answer:
- Through an ethernet connection?
- Through the service processor?
- Through RMC?
Your article made it very clear to me, but I just want a single-line answer, can you give me one, please?

aix said...

Hi, a single-line answer would be: all of them.
Right at this moment, you and I are communicating through "all of them" as well. An ethernet line is needed where the communication can travel, a processor is needed which processes our communication, and an application layer is needed which transfers the communication.
The same is happening between the HMC and the LPAR (Managed System): the physical layer is ethernet, the communication is processed by the Service Processor, and the RMC daemon is the application which sends and receives the information.

You can read more on this link: http://aix4admins.blogspot.hu/2012/09/hmc-and-ibm-power7-systems-install-hmc.html

For example:
"The HMC is connected to each managed system's FSP (Flexible Service Processor). On most systems, the FSP provides two Ethernet ports labeled HMC1 and HMC2. This allows you to connect up to two HMCs."

Unknown said...

Thanks..
What I love about this blog is two things:
1.) It's very informative, and
2.) It's the admin who responds immediately and spends his valuable time to respond to us.
We hope for the same in the future...

Anyway, thanks... thanks a lot

Unknown said...

I got it...Thank you

Anonymous said...

Very informative & much appreciated.

Unknown said...

Can I duplicate the conditions and responses from one LPAR to another?

aix said...

I don't fully understand what you mean...can you give a specific example?

Unknown said...

I have set up Resource Monitoring conditions and responses on an LPAR. I would like to duplicate these to another LPAR. I am trying not to rekey them multiple times.

aix said...

no idea...perhaps someone else

Anonymous said...

I have three questions:
1) My LPAR OS version on the HMC shows "unknown". How do I fix it?
2) Is it safe to run the replacepv command while middleware is running? Does the middleware need to be stopped?
3) How do I reject a TL/ML in AIX?
Any help will be greatly appreciated. Nice blog :)

zidane said...

Is there a way I can run the lparstat command on the LPAR remotely through the HMC and get the output? Is it possible with RMC?

Anonymous said...

Very, very informative, thanks to the admin.

aix said...

Hi, with RMC, I guess not... but if you can change to root on the HMC, you can ssh to the LPAR and run lparstat. This is not usual, and IBM does not support being root on the HMC. If you figure out other ways, please let me know.

Anonymous said...

Have a look at the -C parameter of lscondition and lsresponse. It displays the command that can be used to create the object.

Anonymous said...

I have some AIX 6.1 LPARs on a P5 system, controlled by an HMC.

There are files on the AIX systems that grow rapidly, with error messages repeated every 5 seconds.
The filename is e.g. /var/ct/3423108940/log/mc/IBM.MgmtDomainRM.default (the number varies)
and the errors inside that log look like this:

Tue Mar 4 09:55:44 CUT 2014(804070) ../../../../../src/rsct/rm/MgmtDomainRM/MCP_cfg.c/00850/1.24 2613-027 MDC detected a protocol error with the HMC specified in RTAS slot number 4 due to prior error 2613-021
Tue Mar 4 09:55:49 CUT 2014(813481) ../../../../../src/rsct/rm/MgmtDomainRM/MCP_cfg.c/00546/1.24 2613-023 The data in RTAS slot number 0 is invalid. The error number is 2.
Tue Mar 4 09:55:49 CUT 2014(813862) ../../../../../src/rsct/rm/MgmtDomainRM/MCP_cfg.c/00546/1.24 2613-023 The data in RTAS slot number 1 is invalid. The error number is 2.
Tue Mar 4 09:55:49 CUT 2014(813985) ../../../../../src/rsct/rm/MgmtDomainRM/MCP_cfg.c/00546/1.24 2613-023 The data in RTAS slot number 2 is invalid. The error number is 2.
Tue Mar 4 09:55:49 CUT 2014(814162) ../../../../../src/rsct/rm/MgmtDomainRM/MCP_cfg.c/00546/1.24 2613-023 The data in RTAS slot number 5 is invalid. The error number is 2.
Tue Mar 4 09:55:49 CUT 2014(822175) ../../../../../src/rsct/rm/MgmtDomainRM/MCP_cfg.c/02047/1.24 2613-021 MDC received an error response for getSupportedProtocols from 9.149.149.74: 262154 2610-415 Cannot execute the command. The resource manager IBM.DMSRM is not available.

Tue Mar 4 09:55:49 CUT 2014(822353) ../../../../../src/rsct/rm/MgmtDomainRM/MCP_cfg.c/00850/1.24 2613-027 MDC detected a protocol error with the HMC specified in RTAS slot number 4 due to prior error 2613-021
Tue Mar 4 09:55:54 CUT 2014(831151) ../../../../../src/rsct/rm/MgmtDomainRM/MCP_cfg.c/00546/1.24 2613-023 The data in RTAS slot number 0 is invalid. The error number is 2.
Tue Mar 4 09:55:54 CUT 2014(831339) ../../../../../src/rsct/rm/MgmtDomainRM/MCP_cfg.c/00546/1.24 2613-023 The data in RTAS slot number 1 is invalid. The error number is 2.
Tue Mar 4 09:55:54 CUT 2014(831460) ../../../../../src/rsct/rm/MgmtDomainRM/MCP_cfg.c/00546/1.24 2613-023 The data in RTAS slot number 2 is invalid. The error number is 2.
Tue Mar 4 09:55:54 CUT 2014(831638) ../../../../../src/rsct/rm/MgmtDomainRM/MCP_cfg.c/00546/1.24 2613-023 The data in RTAS slot number 5 is invalid. The error number is 2.
Tue Mar 4 09:55:54 CUT 2014(842821) ../../../../../src/rsct/rm/MgmtDomainRM/MCP_cfg.c/02047/1.24 2613-021 MDC received an error response for getSupportedProtocols from 9.149.149.74: 262154 2610-415 Cannot execute the command. The resource manager IBM.DMSRM is not available.

Tue Mar 4 09:55:54 CUT 2014(843000) ../../../../../src/rsct/rm/MgmtDomainRM/MCP_cfg.c/00850/1.24 2613-027 MDC detected a protocol error with the HMC specified in RTAS slot number 4 due to prior error 2613-021

Unfortunately our AIX Expert is on vacation. Where should I start investigating? Google brought me here, hopefully someone can help (please).
Joachim

Anonymous said...

Similar issue here with one AIX 7.1 TL2 SP3 system on P6. We have close to 100 other LPARs without this issue.

Wed Apr 9 08:45:45 EDT 2014(676210) ../../../../../src/rsct/rm/MgmtDomainRM/MCP_cfg.c/01913/1.22 2613-024 MDC could not start a session with 172.17.0.2, from RTAS slot number 2. The mc_timed_start_session function returned 52.
2610-652 The specified time limit has been exceeded.
Wed Apr 9 08:45:45 EDT 2014(676802) ../../../../../src/rsct/rm/MgmtDomainRM/MCP_cfg.c/01913/1.22 2613-024 MDC could not start a session with 172.16.0.1, from RTAS slot number 3. The mc_timed_start_session function returned 52.
2610-652 The specified time limit has been exceeded.
Wed Apr 9 08:50:58 EDT 2014(664238) ../../../../../src/rsct/rm/MgmtDomainRM/PKE_client.c/02166/1.13 2613-016 PKE timed out with the node at IP address 192.168.155.198.
Wed Apr 9 08:50:58 EDT 2014(664411) ../../../../../src/rsct/rm/MgmtDomainRM/PKE_client.c/02170/1.13 2613-027 MDC detected a protocol error with the HMC specified in RTAS slot number 2 due to prior error 2613-016
Wed Apr 9 08:50:58 EDT 2014(664614) ../../../../../src/rsct/rm/MgmtDomainRM/PKE_client.c/02166/1.13 2613-016 PKE timed out with the node at IP address 192.168.155.31.
Wed Apr 9 08:50:58 EDT 2014(664677) ../../../../../src/rsct/rm/MgmtDomainRM/PKE_client.c/02170/1.13 2613-027 MDC detected a protocol error with the HMC specified in RTAS slot number 3 due to prior error 2613-016
Wed Apr 9 08:51:30 EDT 2014(677424) ../../../../../src/rsct/rm/MgmtDomainRM/MCP_cfg.c/01913/1.22 2613-024 MDC could not start a session with 172.16.0.1, from RTAS slot number 3. The mc_timed_start_session function returned 52.
2610-652 The specified time limit has been exceeded.
Wed Apr 9 08:51:30 EDT 2014(678131) ../../../../../src/rsct/rm/MgmtDomainRM/MCP_cfg.c/01913/1.22 2613-024 MDC could not start a session with 172.17.0.2, from RTAS slot number 2. The mc_timed_start_session function returned 52.
2610-652 The specified time limit has been exceeded.
Wed Apr 9 08:51:30 EDT 2014(774061) ../../../../../src/rsct/rm/MgmtDomainRM/PKE_client.c/01637/1.13 2613-008 PKE received an error response for pTwoPublicKeyExchange from 192.168.155.198: 108 2613-034 Error number 108 was returned when attempting to define an IBM.MngNode resource.

Wed Apr 9 08:51:32 EDT 2014(176824) ../../../../../src/rsct/rm/MgmtDomainRM/PKE_client.c/01637/1.13 2613-008 PKE received an error response for pTwoPublicKeyExchange from 192.168.155.31: 108 2613-034 Error number 108 was returned when attempting to define an IBM.MngNode resource.

Wed Apr 9 08:53:46 EDT 2014(692834) ../../../../../src/rsct/rm/MgmtDomainRM/MCP_cfg.c/01913/1.22 2613-024 MDC could not start a session with 172.16.0.1, from RTAS slot number 3. The mc_timed_start_session function returned 52.
2610-652 The specified time limit has been exceeded.
Wed Apr 9 08:53:46 EDT 2014(693554) ../../../../../src/rsct/rm/MgmtDomainRM/MCP_cfg.c/01913/1.22 2613-024 MDC could not start a session with 172.17.0.2, from RTAS slot number 2. The mc_timed_start_session function returned 52.
2610-652 The specified time limit has been exceeded.
...

Anonymous said...

Instead of running /usr/sbin/rsct/bin/rmcdomainstatus -s ctrmc on the HMC, you can run lspartition -dlpar

Anonymous said...

I have an HMC with an FQDN on a 9.x1 IP address, and a VIOS having only admin IPs on a 192.x network, connecting to the HMC with NATed IPs/FQDN through a firewall (admin IP 192.x NATed to a 9.x2 IP registered with the FQDN). Is there any issue with this connection? Can the VIOS create a connection back to the HMC, or will it have an issue with the 192.x IP while connecting back? Please clarify. Thank you.

Anonymous said...

this is very informative

Anonymous said...

While I am executing the command lsrsrc "IBM.ManagementServer", it throws an error saying there is no persistent attribute set, and it does not show the HMC IP address on the LPAR. How can we overcome this issue? Your input will be highly appreciated.

Prabhanjan Gururaj said...

To copy a condition from an existing condition:
mkcondition [-h] -c Existing_condition[:Node_name] [-r Resource_class] [-e "Event_expression"]
[-E "Rearm_expression"] [-d Event_description] [-D Rearm_description]
[-s "Selection_string"] [-n Node_name[,Node_name...]] [-S c|w|i] [-m l|m|p]
[-g 0|1|2] [-p Node_name] [--qtoggle|--qnotoggle] [-b interval[,max_events] [,retention_period] [,max_totalsize]] [-TV] Condition
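
Based on that synopsis, a minimal (hypothetical) example that copies the predefined condition "/var space used" from nodeA to the local node under a new name could look like this:

# mkcondition -c "/var space used":nodeA "/var space used copy"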

Unknown said...

My HMC is in a 1.1.2.X network and my LPAR is in a 1.1.3.X network. How can I do DLPAR in this case?

Anonymous said...

Hello, how do I toggle the RMC daemon on?

Pilou said...

Be careful, the link https://www-304.ibm.com/support/docview.wss?uid=isg3T1011508 is dead.
Thanks for your web site.

Salman Khalid said...

Please do the following steps on this VIOS:

#ps -ef | egrep "accessprocess|lparmgr"
#kill -9 < pid of /usr/ios/lpm/sbin/accessprocess>
#kill -9 <pid of /usr/bin/ksh /usr/ios/lpm/sbin/lparmgr all start>

Restart the lparmgr:
#/usr/bin/ksh /usr/ios/lpm/sbin/lparmgr all start

Check if the apsock socket was created:
#ls -lt /var/adm/lpm/apsock
If it was created:
#/usr/ios/lpm/bin/lsivm
# echo $?
If you get return code 1, check the RMC connection, or you can recycle RMC by:
#/usr/sbin/rsct/bin/rmcctrl -z
#/usr/sbin/rsct/bin/rmcctrl -A
#/usr/sbin/rsct/bin/rmcctrl -p

Thanks,
Salman Khalid

Salman Khalid said...

Check for general RMC connection issues from HMC:
diagrmc

Check for general RMC connection issues and for RMC connection issues to partition mylpar with IP address e.g. 10.10.10.110

diagrmc -m my_managed_system -p mylpar --ip 10.10.10.110

Check for general RMC connection issues and for RMC connection issues to partition ID 5 with IP address 10.10.10.110 , and automatically correct the issues found:

diagrmc -m my_managed_system --id 5 --ip 10.10.10.110 --autocorrect

* my_managed_system = IBM frame name
* id = LPAR partition ID (use "uname -L" to find the partition ID on the LPAR console)
* ip = the LPAR's IP which is failing the RMC connection between the HMC and itself

Anonymous said...

Do you maybe know how to solve an inactive RMC connection for systems which are in a CAA environment? According to IBM, the recfgct command should not be used for systems under CAA... I have this situation after replacing a failed HMC. The cluster nodes don't have RMC communication with the new HMC, and the output of lsrsrc IBM.MCP gives the NodeID and HMCName of the failed HMC, not the new one...
Best regards,

Dileep said...

I have used recfgct multiple times on systems under CAA. First, you have to bring down CAA, otherwise it will cause multiple issues with the running cluster. To bring down CAA, you can use the command:
clmgr offline cluster "cluster_name" STOP_CAA=yes

To start it, use the command:
clmgr online cluster "cluster_name" START_CAA=yes

Ildar said...

Badly formatted text: the wide text is very hard to read, and the fixed font prevents line breaks.

Anonymous said...

How do I start the RMC daemon for DLPAR on a cluster server? We should not break the running cluster (PowerHA).

Anonymous said...

Irrespective of whether a PowerHA cluster is set up or not, the RMC daemon should already be up (started via inittab). You need to check whether the LPAR has an RMC connection from the HMC or not to do DLPAR.