
PERF. - CPU VIRTUALIZATION

VIO - POWERVM

VP: Virtual Processor
EC: Entitled Capacity

VP/Core ratio in the pool:
Sum of the VPs in the pool / cores in the pool --> should be around 2 (or less). The higher the ratio, the less uncapped capacity is available.
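(Worked example with made-up numbers: a shared pool of 16 cores hosting LPARs with 40 VPs in total gives a ratio of 40/16 = 2.5, which is already on the high side; reducing the total VP count to 32 or fewer brings the ratio back to around 2.)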

EC/VP ratio of an LPAR:
If the EC/VP ratio is below 0.6, the core is spent mostly on dispatch cycles (overhead) rather than on running programs.
EC/VP for a VIOS must be at least 0.6-0.8, so it can process incoming network and storage requests.
(An LPAR will not perform well if the VIOS is not able to process its data.)

Some recommendations say to set VP = ROUNDUP(EC), i.e. EC rounded up to the next whole integer.
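(For example, EC = 1.3 would give VP = 2, an EC/VP ratio of 0.65, just above the 0.6 threshold mentioned above; a VIOS with EC = 0.7 and VP = 1 has a ratio of 0.7, inside the recommended 0.6-0.8 range.)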

-------------------------

Some considerations regarding Uncapped Pool, Hypervisor work...:

Performance coming from uncapped capacity is less predictable. (We don't know what is happening in the uncapped area, so capacity from there is not guaranteed; it also creates additional work for the Hypervisor and you may lose processor affinity.)

If you give work to the Hypervisor, the LPAR pays for it (for example too many VPs, or deactivated processor folding).

Uncapped processor cycles mean work for the Hypervisor, and that is valuable dispatch time paid for by the LPARs. (It is not good practice to count fully on the uncapped area: it is extra work for the Hypervisor and it is not guaranteed.)

For a critical LPAR it is reasonable to take 80% of its capacity from its own resources (entitled capacity) and 20% from the uncapped area (shared pool). For a test system you can take 90% from the shared pool.

If a shared LPAR is using more than 80% of its entitlement (EC) around the clock, it is worth thinking about changing it to a dedicated LPAR.

Dedicated vs. shared processors: wherever possible, use partitions in dedicated processing mode instead of shared mode. (No other partition can use its processors, so the partition's data remains in the cache for a longer duration, resulting in faster memory access.)

We should consider rounding up some partitions to the nearest whole processor (for example 1.00, 2.00, etc.), as this reduces the effect of partitions sharing the same physical processor. A partition that is not sized in whole processors shares the L2 cache with at least one other partition, so there is a chance of the L2 cache contents getting flushed; reading the data back in takes time and degrades performance.

-------------------------

Virtual Processor number by runqueue:

Check the runqueue (processes that are currently running or waiting (queued) to run: topas or vmstat -Iwt 2):
    SMT off: runqueue should be less than Virtual Processors X 2
    SMT 2: runqueue should be less than Virtual Processors X 2 X 2
    SMT 4: runqueue should be less than Virtual Processors X 2 X 4

(This is the conventional rule of thumb of twice the number of CPUs: one process can be on the CPU while a second one waits for something else, like disk I/O or paging. A quick way to collect these numbers is sketched below.)
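A rough sketch of collecting the numbers for this check (the grep patterns assume the usual lparstat -i and smtctl output texts, which can differ slightly between AIX levels; the prompt/hostname is just illustrative):

root@aix1: / # lparstat -i | grep "Online Virtual CPUs"     <-- number of VPs
root@aix1: / # smtctl | grep "SMT threads"                  <-- SMT threads per processor
root@aix1: / # vmstat -Iwt 2 5                              <-- the "r" column is the runqueue

Then compare the observed "r" values with VP X 2 X (SMT threads per processor).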

-------------------------

L2 Cache and performance:

L2 cache is a fast memory which stores copies of the data from the most frequently used main memory locations.

(Diagrams omitted here: the first showed a Power5 chip, where the two cores share a single L2 cache; the second a Power6/Power7 chip, where each core has its own L2 cache.)

In Power5 systems there is a single L2 cache on the processor chip, which is shared by both cores on the chip. Later servers (Power6 and Power7) have a separate L2 cache for each core on the chip. The partition's performance depends on how efficiently it uses the processor cache.

If L2 cache interference from other partitions is minimal, performance is much better for the partition. Sharing the L2 cache with other partitions means there is a chance that the processor's most frequently accessed data will be flushed out of the cache, and fetching it again from the L3 cache or from main memory takes more time.


Dedicated Processor vs Shared Processor

Partitions with dedicated processors perform best. In dedicated mode no other partition can use the processor, so the partition's data remains in the cache for a longer duration, resulting in faster memory access.

Processors in multiples of 2
Power5 servers have a common L2 cache on each chip, shared by two cores. If we assign processors in multiples of 2, it minimizes the L2 cache contention with other partitions. (Otherwise each partition has to share its L2 cache with other partitions.) On Power6 and Power7 servers using multiples of two processors is not required, as the cores have private L2 caches.

Whole Processors in sharing mode
In shared-processor mode consider rounding up some partitions to the nearest whole processor (for example 1.00, 2.00, etc.), as this reduces the effect of partitions sharing the same physical processor. A partition that is not sized in whole processors shares the L2 cache with at least one other partition, so there is a chance of the L2 cache contents getting flushed; reading the data again takes additional time.

Number of Virtual Processors
The number of virtual processors can be thought of as a "spreading factor": it tells across how many physical processors the partition's work will be spread.
(While spreading across multiple processors allows more work to be done in parallel, it can potentially reduce the effectiveness of the L2 cache.)

Given a well-threaded application and 1.6 processing units in total, four virtual processors each with 0.4 processing units are likely a better choice than two virtual processors each with 0.8 processing units. An application that isn't well-threaded would likely run faster with the two 0.8-sized processors.

-------------------------

Some considerations about the correct EC and VP ratio:

(EC: Entitled Capacity, VP: Virtual Processor)

If EC=0.4 and VP=4, the LPAR can consume up to 10 times its entitlement. This extra resource comes from the uncapped area. Do we really need this huge amount of spare capacity? (If the whole machine is busy we could be forced down to 0.4 plus a little, based on our weight.)
Such a configuration is bad when EC is lower than the actual CPU usage.

The other drawback of this type of configuration: when the LPAR is activated, the Hypervisor tries to assign it memory from the chip where its CPU resides (local memory). Later, when this LPAR grows and needs more CPU, the extra cores will probably be found on another chip (or CEC or book), so reaching the memory from these new CPUs will take much longer.

Recommendation:
Monitor the physical core usage of the LPAR; for an important LPAR, EC should be set to cover the CPU peaks. (For example, if the LPAR uses 1.2 cores most of the time, EC should be around 1.5.) So set EC to the normal peak and VP correctly (a little higher), to have some spare resource.
Reducing the VP number should be considered as well (too many VPs are "shredding" the cores); low-utilized partitions should be configured with a minimal number of virtual processors.
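A simple way to collect this data (a rough sketch; the prompt/hostname is illustrative and the grep pattern assumes the usual lparstat -i field names):

root@aix1: / # lparstat 60 60                          <-- one-minute samples; watch the physc column during busy hours
root@aix1: / # lparstat -i | grep "Entitled Capacity"  <-- current entitlement setting

Compare the typical physc peaks with the Entitled Capacity value and adjust EC/VP as described above.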


Dedicated and Shared Capped LPARs:

80% for normal work, 20% extra capacity

Shared Uncapped LPARs:
EC: should cover the regular busy peaks, i.e. the amount we think should be guaranteed
VP: allows some extra headroom, around 25-50% more (see the example after the table below)

EC: 0.05 - 0.6 -->  VP=1
EC: 0.7 - 1.4  -->  VP=2
EC: 1.5 - 2.3  -->  VP=3
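
(For example, EC=0.8 with VP=2 guarantees 0.8 cores and lets the LPAR burst up to at most 2.0 cores from the shared pool, if spare capacity is available there.)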

-------------------------

Finding spare capacity on a machine:


1.
- raise EC to peak use and lower VP (force higher SMT thread use)
- monitor the CPU pool for unused CPU (lparstat; see the example after this list)

2.
- for important LPARs do the same as above
- for the other LPARs use a standard, for example VP = round(EC + 1) CPUs
- monitor and wait for complaints

3.
- fix a few large LPARs (as above)
- remove 1 CPU from the pool each day until it hurts (add it to a dedicated LPAR which is doing nothing)
- when performance problems arise, DLPAR a CPU back into the pool
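
A quick way to watch the pool while doing this (a rough sketch; the app column is only populated if "Allow performance information collection" is enabled for the LPAR on the HMC):

root@aix1: / # lparstat 5 12        <-- watch the app column: idle cores left in the shared pool

If app stays well above 0 even during peak hours, there is spare capacity in the pool.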

-----------------------

Spiky workload on a system with lots of busy LPARs not getting resources:
Sometimes disabling CPU folding (with schedo, vpm_xvcpus=-1) can help.
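
A sketch of checking and changing this tunable (the -p flag makes it persistent across reboots; test the effect before leaving folding disabled permanently):

root@aix1: / # schedo -o vpm_xvcpus          <-- show the current value (default 0 = folding enabled)
root@aix1: / # schedo -p -o vpm_xvcpus=-1    <-- disable processor folding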

More info:
http://www.powershow.com/view4/4834f6-ZTU3Y/vpm_xvcpus_powerpoint_ppt_presentation

-----------------------

topas:

CPU  User%  Kern%  Wait%  Idle%  Physc   Entc
ALL   46.0   47.8    0.0    6.2   0.61  203.9

Physc: Physical consumption, the amount of processing units currently consumed (the number of physical processors consumed).
Entc: Entitled capacity, the processing units currently consumed as a percentage of the processing units allocated to the partition. (Consequently, uncapped shared partitions can have an entitlement consumption that exceeds 100%.)
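
(In the sample above the LPAR is consuming 0.61 cores, which is 203.9% of its entitlement; that implies an EC of roughly 0.61 / 2.039 ≈ 0.3 processing units, so about half of what it is currently using comes from the uncapped shared pool.)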

Runqueue    1.0
Waitqueue   0.0

Runqueue: On average over 2s, only one thread is ready to run.
Waitqueue: On average over 2s, no thread is waiting for paging to complete.


topas -L:
%usr %sys %wait %idle physc  %entc %lbusy    app    Vcsw    phint   %hypv   hcalls
  40   58     0     2   0.3 104.13   9.00   3.54     758       18    23.9    41667

%lbusy: the percentage of the logical processors that is effectively in use

-----------------------

lparstat:

root@vios1: / # lparstat -h 1

%user  %sys  %wait  %idle physc %entc  lbusy   app  vcsw phint  %hypv hcalls
----- ----- ------ ------ ----- ----- ------   --- ----- ----- ------ ------
 48.9  15.2    0.0   35.9  1.96 178.3   14.0 19.88   869   159   60.5  10222
 48.8  12.4    0.0   38.8  2.73 247.9   15.2 19.65  6138   164   51.7  16838
 51.0  10.3    0.0   38.7  1.81 164.5   10.1 21.90  3128   121   45.5  15001 

physc (pc): Physical cores consumed by the partition.

%entc (ec): Percentage of entitled capacity consumed by the partition. (Uncapped shared LPARs can exceed 100%.)

lbusy: shows the percentage of logical processor utilization that occurs while executing in user and system mode. If this value approaches 100%, it may indicate that the partition could make use of additional virtual processors.

app: Indicates the available physical processors in the shared pool. (Shows if there are free CPUs.)
If "app" shows 0.0, it can still happen that only the 1st SMT thread of each core is busy, and there is actually room on the unused 2nd, 3rd and 4th SMT threads.

vcsw: Indicates the number of virtual context switches

%hypv: Indicates the percentage of physical processor consumption spent making hypervisor calls.
       (Too many virtual CPUs may result in a high %hypv.)

-------------------------

6 comments:

Unknown said...

Hi,

If we have dual VIO servers in the environment, how do we know which VIO server is currently serving the LPARs, i.e. through which VIO server the LPARs are accessing their virtual devices?

aix said...

Hi, you should check your configuration:
- regarding disks: if NPIV is used check the VFC mapping, if VSCSI check the disk mappings as well (lsmap) and on the client the paths (lspath)
- regarding network: the entstat command for the SEA will show which adapters are active on the VIOS.

Unknown said...

Hi,
It's absolutely amazing how in only one page you can find tons of information. Every word here is important. Thanks. ;)
Now the question:
You say here the "combination" of EC/VP as:

EC: 0.05 - 0.6 --> VP=1
EC: 0.7 - 1.4 --> VP=2
EC: 1.5 - 2.3 --> VP=3

But you did not include SMT. Do you recommend ALWAYS using SMT2/4?

aix said...

Hi, these ratios are taken from IBM sources (e.g. Nigel Griffiths), so I am just communicating those here as well. Regarding SMT I did not find any generic rule of thumb. I wrote some info about intelligent SMT threads, which probably helps you to answer your question: http://aix4admins.blogspot.hu/2011/08/commands-and-processes-process-you-use.html

-Balazs

Unknown said...

Hi Balazs, thanks for your quick response. I asked because I have not been able to understand well when to use SMT (or not) in combination with EC and VP. I'm going to keep searching and read your link above.
Keep up the great blog. Believe me, it's a big source of knowledge for those who start with this wonderful OS.

gantoki said...

Every word here is important. Thanks. ;)