IO QUEUE
AIX disk and adapter drivers use two queues to handle IO: a service queue and a wait queue.
IO requests in the service queue are sent to the storage, and the service queue slot is freed when the IO is complete. IO requests in the wait queue stay there until a service queue slot is free. (A read operation is complete when AIX receives the data. A write operation is complete when AIX receives an acknowledgement.)
The service queue size of an hdisk is set by the queue_depth attribute:
$ lsattr -El hdisk16
...
q_err yes Use QERR bit True
q_type simple Queuing TYPE True
qfull_dly 2 delay in seconds for SCSI TASK SET FULL True
queue_depth 20 Queue DEPTH True
recoverDEDpath no Recover DED Failed Path True
reserve_policy no_reserve Reserve Policy True
pcmpath query devstats 16 --> the Maximum column in the I/O section shows the highest number of IOs sent to the LUN at one time; this will not exceed queue_depth.
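If the Maximum value regularly reaches queue_depth and iostat shows non-zero sqfull, queue_depth can be increased. A minimal sketch (hdisk16 and the value 32 are only examples; check the storage vendor's recommended maximum first):
# chdev -l hdisk16 -a queue_depth=32 -P     <---P writes the change to the ODM only, it takes effect at the next reboot
# chdev -l hdisk16 -a queue_depth=32        <--online change, works only if the disk is not in use (device busy otherwise)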
---------------------------------------
The service queue size of an FC adapter is set by the num_cmd_elems attribute:
$ lsattr -El fcs4
...
intr_priority 3 Interrupt priority False
lg_term_dma 0x800000 Long term DMA True
max_xfer_size 0x100000 Maximum Transfer Size True
num_cmd_elems 500 Maximum number of COMMANDS to queue to the adapter True
pref_alpa 0x1 Preferred AL_PA True
sw_fc_class 2 FC Class for Fabric True
tme no Target Mode Enabled True
In fcstat output (for example: fcstat fcs4) the "No Command Resource Count" value shows how many times an IO was temporarily blocked waiting for a free command element because num_cmd_elems was too low.
Non-zero (and growing) values indicate that increasing num_cmd_elems may help improve IO service times.
pcmpath query adaptstats --> the Maximum column in the I/O section shows the highest number of IOs submitted to the adapter at one time.
This value can exceed num_cmd_elems, so the adapter maximum is a good indication of what num_cmd_elems should be set to.
IBM doc: https://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/TD105745
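If "No Command Resource Count" keeps growing and the adapter Maximum exceeds num_cmd_elems, the attribute can be raised. A sketch (fcs4 and the value 1024 are only examples; the allowed maximum depends on the adapter and driver level):
# chdev -l fcs4 -a num_cmd_elems=1024 -P    <--takes effect after a reboot (or after the adapter is reconfigured)
# fcstat fcs4 | grep "No Command Resource"  <--recheck the counter under load after the change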
---------------------------------------
HOW TO MEASURE IO: IOPS (XFER, TPS)
Commands such as iostat and nmon show statistics in "IO per second" or "transfers per second".
IOPS, tps and xfer refer to the same thing. (bps is bytes per second.)
Performance can be measured by this equation:
Q / T = R (IOPS or bandwidth)
Q = the number of parallel IO requests in flight (bounded by queue_depth or num_cmd_elems)
T = the IO request service time (in seconds)
R = the rate, measured in IOPS (or bandwidth, when multiplied by the block size)
For example:
20 / 0.005 (5 ms service time) = 4,000 IOPS
40 / 0.005 = 8,000 IOPS
20 / 0.0001 (0.1 ms service time) = 200,000 IOPS
IOPS can be increased by increasing the queue depth or by reducing the service time.
(The transfer rate (MB/s) can be calculated from IOPS, based on the block size used.)
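For example, the conversion can be done quickly in the shell (a sketch; the 4,000 IOPS and the 8 KB block size are only illustrative values):
$ echo "scale=2; 4000 * 8 / 1024" | bc      <--4,000 IOPS with 8 KB blocks is about 31.25 MB/s
31.25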
---------------------------------------
IOPS: iostat, nmon
$ iostat -RDTld hdisk10
System configuration: lcpu=88 drives=23 paths=184 vdisks=0
Disks: xfers read write queue time
-------------- -------------------------------- ------------------------------------ ------------------------------------ --------------------------------------
%tm bps tps bread bwrtn rps avg min max time fail wps avg min max time fail avg min max avg avg serv
act serv serv serv outs serv serv serv outs time time time wqsz sqsz qfull
hdisk10 0.5 939.5K 10.5 838.8K 100.7K 8.2 2.7 11.7 11.7 0 0 2.3 0.6 0.0 0.0 0 0 14.9 0.0 0.0 1.0 0.0 3.7
bps - amount of data transferred (read or written) per second
tps - number of transfers per second that were issued to the disk. A transfer is an I/O request to the physical disk.
avg serv - average service time in milliseconds
avg time - average time spent in the wait queue in ms
avg wqsz - average wait queue size
avg sqsz - average service queue size
serv qfull - rate of IOs per second submitted to a full service queue
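To see how these values change over time, iostat can be run with an interval and a count (a sketch; the disk name, the 5 second interval and the 12 samples are only examples):
$ iostat -DRTl hdisk10 5 12                 <--watch avg serv, avg time and serv qfull across the intervals (-R resets min/max each interval)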
---------------------------------------
nmon: press the D key (3 times) to reach the disk service time view:
│ Disk Service Read Service Write Service Wait ServQ WaitQ ServQ
│ Name milli-seconds milli-seconds milli-seconds milli-seconds Size Size Full
│hdisk0 0.4 0.0 0.4 0.0 0.0 0.0 0.0
│hdisk9 0.0 0.0 0.0 0.0 0.0 0.0 0.0
│hdisk6 0.0 0.0 0.0 0.0 0.0 0.0 0.0
│hdisk11 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Wait - average wait time in the queue in ms
ServQ Size - Average service queue size
WaitQ Size - Average wait queue size
ServQ Full - Number of IO requests sent to a full queue for the interval
If you have a lot of hdisks, you can press the "." subcommand to show only the busy disks.
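The same statistics can also be collected unattended with nmon in recording mode (a sketch; the interval and count are only examples, and the -d flag is assumed to add the disk service time section on this nmon level - check nmon -h):
# nmon -f -d -s 60 -c 60                    <--records 60 snapshots, 60 seconds apart, into a .nmon file in the current directory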
---------------------------------------
SPEED TEST TIPS:
- to get high I/O rates, many threads need to run in parallel (20, 32 or more)
- smaller block sizes result in higher transaction rates (IOPS), but it is not suggested to go below 4k
- larger block sizes result in higher bandwidth (MB/s); 128k or 1m can be used
- most databases generate random I/O (with smaller block sizes, 4k or 8k)
- consider testing the 4k block size as well, because it demonstrates the maximum capabilities of the storage server
- avoid the file system cache: use logical volumes or disks directly (raw IO is recommended: rlv, rhdisk) or mount -o cio to bypass the file cache
- gather at least the following metrics from iostat -D: IOPS (tps), response times (avgserv's) and throughput
- for reliable results run the ndisk test for at least 5 minutes: ./ndisk64 -R -t 300 -f /dev/rhdisk0 -M 20 -b 4k -s 400G -r100
- write tests will destroy the data on the target device
In general, the I/O activity can be:
- Random: smaller blocks (4-32KB), sensitive to latency. It is demanding for the storage server because it isn't cache-friendly. Typically, the storage vendors use it for their benchmarks.
- Sequential: large I/O requests (64KB to 128KB or even more) and the data is read in order. Normally, the throughput is important and the latency is less of an issue, since latency increases with larger I/O sizes. Good for testing the throughput of new HBA or SAN switch implementations. (Both profiles are sketched below.)
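As a sketch, the two profiles can be driven with ndisk64 like this (the device name, size and run time are only examples; the flags are detailed in the ndisk64 section below, and -S is assumed to select sequential IO in the nstress build used):
./ndisk64 -R -t 300 -f /dev/rhdisk10 -M 32 -b 4k -s 100G -r80     <--random, small blocks, latency-sensitive (80% read / 20% write)
./ndisk64 -S -t 300 -f /dev/rhdisk10 -M 20 -b 1m -s 100G -r100    <--sequential, large blocks, throughput-oriented (100% read)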
Baseline for IBM System Storage DS8000:
read average service time below 15 ms (if larger, it might indicate the bottleneck is in a lower layer: HBA, SAN, or the storage)
write average service time below 3 ms (if larger, it might indicate the write cache is full and there is a bottleneck at the disk)
Baseline for IBM FlashSystem V9000:
For small I/O size workloads (8 KB - 32 KB) the goal is to stay under 1 millisecond: the average response time for read or write operations is typically between 0.1 ms and 0.5 ms, and a 1 ms response time could already be considered high, but is still OK. For large I/O size workloads (64 KB - 128 KB) the response time should stay under about 3 milliseconds.
https://ibmsystemsmag.com/Power-Systems/11/2018/storage-recommendations-aix-performance
https://ibmsystemsmag.com/Power-Systems/04/2019/san-performance-problems
---------------------------------------
SPEED TEST WITH: dd
READ: if a file is cached in memory, the read measurement is not valid; umount/mount is the best way to flush it from the cache.
WRITE: writes to a file are done in memory (unless direct, synchronous or asynchronous I/O is used), so syncd (or an explicit sync) is needed during the test to write the data out to disk.
synchronous I/O: the application waits until the disk operation completes (the data is read into memory or written out before control returns)
asynchronous I/O: allows applications to initiate read or write operations without being blocked, since the I/O is done in the background
disk write speed:
sync; date; dd if=/dev/zero of=/export/1000m bs=1m count=1024; date; sync; date
disk read speed: (sequential read throughput, a raw device was used)
timex dd if=/dev/rhdisk0 of=/dev/null bs=1m count=1024 <--reads 1024 MB directly from the hdisk, bypassing LVM and the file system cache
1024+0 records in
1024+0 records out
real 25.30 <--1024/25=40MB/s is the reading speed
user 0.00
sys 0.45
test a disk:
1. dd if=/dev/hdisk10 of=/dev/null & <--creates io activity on hdisk10 in the background
2. iostat -D hdisk10 2 10 <--shows io activity of hdisk10
3. iostat -a 2 10 <--shows the io activity of the adapters
-----------------------------------
SPEED TEST WITH: ndisk64
ndisk64 is part of nstress tools - https://www.ibm.com/developerworks/community/wikis/home?lang=en#/wiki/Power%20Systems/page/nstress
./ndisk64 -R -t 60 -f /dev/rlv_bb -M 20 -b 4k -s 10G -r100
-R (random IO)
-t 60 (for quick test 60 sec, for longer test -t 300)
-f (it can be a file, or a raw lv/disk to avoid the fs cache; here a raw device in /dev was used)
-M 20 (multiple processes; queue_depth was 20, so 20 was used, but it can be tried with 32 or more)
-b 4k (-b 1m: larger block sizes give higher bandwidth)
-s 10G (for logical volumes the size has to be specified)
-r100 (r100: read 100%, r0: write 100%, r80: read 80% and write 20%)
An example run:
# ./ndisk64 -R -t 60 -f /dev/rfslv00 -M 20 -b 1m -s 10G -r100
Command: ./ndisk64 -R -t 60 -f /dev/rfslv00 -M 20 -b 1m -s 10G -r100
Synchronous Disk test (regular read/write)
No. of processes = 20
I/O type = Random
Block size = 1048576
Read-Write = Read Only
Sync type: none = just close the file
Number of files = 1
File size = 10737418240 bytes = 10485760 KB = 10240 MB
Run time = 60 seconds
Snooze % = 0 percent
----> Running test with block Size=1048576 (1024KB) ....................
Proc - <-----Disk IO----> | <-----Throughput------> RunTime
Num - TOTAL IO/sec | MB/sec KB/sec Seconds
1 - 9219 153.7 | 153.67 157362.94 59.99
2 - 9229 153.9 | 153.86 157555.99 59.98
3 - 9201 153.4 | 153.40 157084.05 59.98
4 - 9213 153.6 | 153.58 157263.69 59.99
5 - 9241 154.0 | 154.04 157737.37 59.99
6 - 9186 153.1 | 153.11 156783.04 60.00
7 - 9246 154.1 | 154.10 157802.31 60.00
8 - 9267 154.6 | 154.56 158268.18 59.96
9 - 9290 154.9 | 154.93 158645.83 59.96
10 - 9234 154.0 | 153.99 157689.96 59.96
11 - 9245 154.2 | 154.18 157878.06 59.96
12 - 9257 154.4 | 154.37 158072.55 59.97
13 - 9197 153.4 | 153.36 157038.53 59.97
14 - 9285 154.8 | 154.81 158521.14 59.98
15 - 9312 155.2 | 155.24 158967.19 59.98
16 - 9210 153.6 | 153.59 157272.09 59.97
17 - 9223 153.8 | 153.80 157487.53 59.97
18 - 9260 154.3 | 154.34 158047.39 60.00
19 - 9300 155.0 | 154.99 158711.25 60.00
20 - 9271 154.5 | 154.50 158204.44 60.01
TOTALS 184886 3082.4 | 3082.42 3156393.53
- Random procs= 20 read=100% bs=1024KB
-----------------------------------
-----------------------------------
-----------------------------------
LVM - FILESYSTEM RELATED PERFORMANCE
lvmstat reports input/output statistics for logical volumes
1. lvmstat -v <vgname> -e <--enables lvmstat
2. lvmstat -v <vgname> <--lists lvmstat
3. lvmstat -v <vgname> -d <--disables lvmstat
If an lv (or only some of its lps) is heavily used, it can be migrated to another disk (migratepv/migratelp), as in the sketch below.
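A minimal sketch of the migration (the lv, lp and disk names are only examples), once lvmstat has shown which partitions are hot:
# migratelp datalv/11 hdisk5                <--moves logical partition 11 of datalv to hdisk5
# migratepv -l datalv hdisk4 hdisk5         <--moves all partitions of datalv from hdisk4 to hdisk5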
-----------------------------------
lvmo:
$ for VG in `lsvg`; do lvmo -a -v $VG; echo; done
vgname = rootvg
pv_pbuf_count = 512
total_vg_pbufs = 512
max_vg_pbuf_count = 16384
pervg_blocked_io_count = 62
pv_min_pbuf = 512
global_blocked_io_count = 62
This shows how much IO was blocked because of a pbuf shortage (pervg_blocked_io_count per VG, global_blocked_io_count system-wide).
$ ioo -a | grep pbuf
pv_min_pbuf = 512
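If pervg_blocked_io_count / global_blocked_io_count keep increasing, the number of pbufs added per PV can be raised for the affected VG (a sketch; the VG name and the value 1024 are only examples, increase in moderate steps and re-check the counters):
# lvmo -v datavg -o pv_pbuf_count=1024      <--sets the per-PV pbuf count for datavg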
-----------------------------------
defragfs -s /filesystem shows percent of fragmentation
lslv -p hdisk0 lv01 shows the lv fragmentation:
USED: that pp is in use, but by another lv (not the one specified)
FREE: that pp is not used by any lv
STALE: that pp is not consistent with other partitions (the partition number with a question mark means stale)
Numbers: the logical partition number of the specified lv that occupies that pp
-----------------------------------
filemon
filemon command shows performance statistics on 4 layers: files, virtual memory segments, logical volumes and physical volumes. This command is very useful in a situation where IO issues are suspected. It should be executed during peak load of the system for a short period of time (usually less than 60s).
To run filemon for 10 seconds and generate a report (fmon.out):
# filemon -v -o fmon.out -O all ; sleep 10 ; trcstop
The command utilizes AIX trace functionality which stores the collected data temporarily in the circular kernel trace buffer. If the load on the system is very high, these records in the buffer may be overwritten and "TRACEBUFFER WRAPAROUND error" will be included in the output file. In that case, the size of the buffer should be increased or the period of time for the data collection should be decreased.
# filemon -u -v -o output.out -O all -T 100000000; sleep 60; trcstop
-u: reports on the files that are opened before the trace daemon is started
-v: includes extra information in the report
-o: defines the name of the output file (instead of stdout)
-O levels: monitors specified levels of the file system (all 4 layers included in this sample)
-T: size of the kernel trace buffer in bytes
sleep 60: data collection process is active for 60s
trcstop: command used to stop the tracing process
Don't forget to execute "trcstop", otherwise the trace will keep running, which can cause other issues.
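Once the report exists, the "Most Active ..." summary sections give a quick overview of where the IO goes (a sketch; fmon.out is the output file used above, and grep -p prints the whole paragraph on AIX):
# grep -p "Most Active Logical Volumes" fmon.out
# grep -p "Most Active Physical Volumes" fmon.out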
-----------------------------------
LTG (Logical Track Group) size:
(maximum allowed transfer size)
1. lsvg <vgname> | grep LTG <--shows LTG size of the vg
2. lquerypv -M <hdiskX> <--shows the maximum transfer size of the disk (lspv <hdiskX> shows the same value at MAX REQUEST)
3. varyonvg -M512K tmpvg <--changes the LTG size of the vg (choose a value suitable for the disks)
From the chvg man page: for volume groups created on AIX 5.3, the -L flag is ignored; when the volume group is varied on, the logical track group size is set to the common max transfer size of the disks.
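The maximum transfer size supported by a disk can also be checked as an ODM attribute (a sketch; hdisk16 is only an example):
$ lsattr -El hdisk16 -a max_transfer        <--compare this with the LTG size reported by lsvg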
-----------------------------------
Fragmentation:
fileplace -v smit.log
...
254 frags over space of 275 frags: space efficiency = 92.4% <--spread across 275 fragments, but only 254 are used (space efficiency 92.4%)
4 extents out of 254 possible: sequentiality = 98.8% <--shows how sequentially the fragments are placed (98.8%)
defragfs can improve it, but it does not defragment the files, it defragments only the free space.
(if a file is very fragmented, the solution is: 1. back up the file, 2. run defragfs (to get enough contiguous free space), 3. restore the file)
the reorgvg command reorganizes the partitions according to the allocation policy of the logical volumes
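A sketch of the related commands (the filesystem, vg and lv names are only examples):
# defragfs -r /filesystem                   <--reports the current state and the state defragfs could achieve, without changing anything
# reorgvg datavg datalv                     <--reorganizes only datalv inside datavg according to its allocation policy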
-----------------------------------
SOME HINTS FOR POOR IO:
AT DISK LEVEL:
-fragmentation
-queue_depth
AT LV level:
-MWC (mirror write consistency) writes (can be checked with lslv, see the sketch after this list)
-write verify enabled
-inter/intra policy settings (intra policy=center is good)
-hot spots (too many busy lvs on one disk)
-if many io operations occur and the jfslog is on the same disk (solution: place the jfslog on a separate disk or add additional jfslogs)
-inode lock: while a file (one inode) in an fs is being written, its inode is locked, so no read operation on that file can occur at the same time
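Whether MWC or write verify is active on an lv can be checked with lslv (a sketch; datalv is only an example name):
# lslv datalv | grep -E "MIRROR WRITE CONSISTENCY|WRITE VERIFY"   <--shows the MWC and write verify settings of the lv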
9 comments:
Hello!
I'd like to know at how much disk busy% the end user will notice that the server is slow?
thanks!
Hi, there are many factors which can influence disk performance, but if a disk is constantly at 100% busy, that can be a sign of a problem. (As long as it stays below 100%, in my opinion there are still free resources on the disk side.)
Hello,
I am trying to run the "trace" utility to log all I/O to a scsi disk (reads and writes), but I can't make sense of the data - I use fio to generate 8/16/64 kb sized iops but when going over the log (logged with trace -a -J diskdd and parsed with trcrpt) I see various b_bcount sizes like 1000/5000/11000 but they do not correspond to the sizes I expect to see... Is this related to I/O coalescing? Am I missing something here? Any help will be greatly appreciated...
Thanks
Hi, I'm afraid I can't help you in that...I have never worked with this utility...but perhaps someone else...
(If you find out something and you could share with us...that would be great.)
Hi,
We have AIX servers connected to shared DS800 storage. Recently we replaced a faulty Cisco SAN switch module. After reconnecting we are getting errors in errpt, although there is no path failure and all the hardware is fine. I guess these errors are coming from memory cache... How can we fix it?
below is one of the error from errpt
LABEL: SC_DISK_ERR4
IDENTIFIER: DCB47997
Date/Time: Fri Aug 23 20:58:34 GMT+05:30 2013
Sequence Number: 109790
Machine Id: 00F6C61E4C00
Node Id: *******
Class: H
Type: TEMP
WPAR: Global
Resource Name: hdisk3
Resource Class: disk
Resource Type: mpioosdisk
Description
DISK OPERATION ERROR
Probable Causes
MEDIA
DASD DEVICE
User Causes
MEDIA DEFECTIVE
Recommended Actions
FOR REMOVABLE MEDIA, CHANGE MEDIA AND RETRY
PERFORM PROBLEM DETERMINATION PROCEDURES
Failure Causes
MEDIA
DISK DRIVE
Recommended Actions
FOR REMOVABLE MEDIA, CHANGE MEDIA AND RETRY
PERFORM PROBLEM DETERMINATION PROCEDURES
This is how the system behaves when a disk disappears. If you replaced a faulty device (a controller or anything other than the disk itself on the storage system), the disk should have kept working.
But if you cleaned up the unknown disks with exportvg and rmdev -dl hdiskX (inactive disks without FC connections), you need to run cfgmgr and then importvg hdiskX, or cfgmgr + smitty importvg (see man importvg).
For this you need to know which disk was associated with which VG name, so that the export runs without errors.
Typically this information can be displayed with the lspv command, running it before and after any change to the LVM:
#lspv
hdisk0 00c4afe7c75c9db5 rootvg active
This assumes that everything at the storage and FC fabric level is in place (zoning, etc.) and that you are sure the correct disk is assigned to your server.
Hello AIX,
this is a great topic which I found very useful.
I have a question about IO on disks: how can I know the current IO requests on a disk?