
AIX disk and adapter drivers use two queues to handle IO: a service queue and a wait queue.
IO requests in the service queue have been sent to the storage, and a service queue slot is freed when the IO completes. IO requests in the wait queue stay there until a service queue slot is free. (A read operation is complete when AIX receives the data; a write operation is complete when AIX receives an acknowledgement.)
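The two-queue mechanism can be sketched as a toy model (a minimal Python illustration, not actual driver code; the queue names and the submit/complete helpers are my own):

```python
from collections import deque

QUEUE_DEPTH = 20  # hdisk service queue size (the queue_depth attribute)

service_queue = []    # IOs currently outstanding to the storage
wait_queue = deque()  # IOs waiting for a free service queue slot

def submit_io(io):
    """Place an IO in the service queue if a slot is free, else in the wait queue."""
    if len(service_queue) < QUEUE_DEPTH:
        service_queue.append(io)
    else:
        wait_queue.append(io)

def complete_io(io):
    """On completion the slot is freed and the next waiting IO moves in."""
    service_queue.remove(io)
    if wait_queue:
        service_queue.append(wait_queue.popleft())

# 25 IOs arrive at once: 20 fill the service queue, 5 must wait
for i in range(25):
    submit_io(i)
print(len(service_queue), len(wait_queue))  # 20 5
complete_io(0)
print(len(service_queue), len(wait_queue))  # 20 4
```

This is why a too-small queue_depth hurts: completed IOs free slots one at a time, so excess requests sit in the wait queue accumulating latency.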

The service queue size for an hdisk is called queue_depth:

$ lsattr -El hdisk16
q_err           yes                   Use QERR bit                            True
q_type          simple                Queuing TYPE                            True
qfull_dly       2                     delay in seconds for SCSI TASK SET FULL True
queue_depth     20                    Queue DEPTH                             True
recoverDEDpath  no                    Recover DED Failed Path                 True
reserve_policy  no_reserve            Reserve Policy                          True

pcmpath query devstats 16 --> the Maximum column for I/O shows the maximum number of IOs sent to the LUN at once, and this won't exceed queue_depth.



The service queue size for an adapter is called num_cmd_elems:

$ lsattr -El fcs4
intr_priority 3          Interrupt priority                                 False
lg_term_dma   0x800000   Long term DMA                                      True
max_xfer_size 0x100000   Maximum Transfer Size                              True
num_cmd_elems 500        Maximum number of COMMANDS to queue to the adapter True
pref_alpa     0x1        Preferred AL_PA                                    True
sw_fc_class   2          FC Class for Fabric                                True
tme           no         Target Mode Enabled                                True

In fcstat output (for example: fcstat fcs4), the "No Command Resource Count" value shows how many times an IO was temporarily blocked waiting for resources because num_cmd_elems is too low.
Non-zero values indicate that increasing num_cmd_elems may improve IO service times.

pcmpath query adaptstats --> the Maximum column for I/O shows the maximum number of IOs submitted to the adapter over time.
This can exceed num_cmd_elems, so the adapter maximum tells us the value we should assign to num_cmd_elems.

IBM doc: https://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/TD105745



Commands such as iostat and nmon show statistics in "IO per second" or "transactions per second".
IOPS, tps and xfer all refer to the same thing (bps is bytes per second).

Performance can be described by this equation:
 Q/T = Rate (IOPS or bandwidth)

Q = the number of parallel IO requests (queue_depth or num_cmd_elems)
T = the IO request service time
Rate = the rate, measured in IOPS (or bandwidth)

For example:
20 / 0.005 (5 ms)    = 4,000 IOPS
40 / 0.005           = 8,000 IOPS
20 / 0.0001 (0.1 ms) = 200,000 IOPS

IOPS can be increased by increasing the queue or reducing the service time.
(Transfer rate (MB/s) can be calculated from IOPS, based on the block size we use.)
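The Q/T arithmetic above is plain division and can be checked directly (the helper names iops and mb_per_s are my own):

```python
def iops(queue_depth, service_time_s):
    """Rate = number of parallel IOs / service time per IO."""
    return queue_depth / service_time_s

def mb_per_s(iops_value, block_size_bytes):
    """Transfer rate follows from IOPS and the block size in use."""
    return iops_value * block_size_bytes / (1024 * 1024)

print(iops(20, 0.005))       # 4000.0   queue of 20, 5 ms service time
print(iops(40, 0.005))       # 8000.0   doubling the queue doubles the rate
print(iops(20, 0.0001))      # 200000.0 same queue, 0.1 ms service time
print(mb_per_s(4000, 4096))  # 15.625   MB/s at 4,000 IOPS with 4k blocks
```

The last line shows the IOPS-to-throughput conversion mentioned in the text: the same 4,000 IOPS is 15.6 MB/s with 4k blocks but 500 MB/s with 128k blocks.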


IOPS: iostat, nmon

$ iostat -RDTld hdisk10

System configuration: lcpu=88 drives=23 paths=184 vdisks=0

Disks:                     xfers                                read                                write                                  queue                    time
-------------- -------------------------------- ------------------------------------ ------------------------------------ --------------------------------------
                 %tm    bps   tps  bread  bwrtn   rps    avg    min    max time fail   wps    avg    min    max time fail    avg    min    max   avg   avg  serv
                 act                                    serv   serv   serv outs              serv   serv   serv outs        time   time   time  wqsz  sqsz qfull
hdisk10          0.5 939.5K  10.5 838.8K 100.7K   8.2   2.7   11.7   11.7     0    0   2.3   0.6    0.0    0.0     0    0  14.9    0.0    0.0    1.0   0.0   3.7

bps - amount of data transferred (read or written) per second
tps - number of transfers per second that were issued to the disk.  A transfer is an I/O request to the physical disk.
avg serv - average service time in milliseconds
avg time - average time spent in the wait queue in ms
avg wqsz - average wait queue size
avg sqsz - average service queue size
serv qfull - rate of IOs per second submitted to a full queue


nmon: after D (3 times)

│ Disk        Service     Read Service   Write Service      Wait      ServQ  WaitQ  ServQ  
│ Name     milli-seconds  milli-seconds  milli-seconds  milli-seconds  Size   Size   Full  
│hdisk0        0.4            0.0            0.4            0.0         0.0    0.0    0.0  
│hdisk9        0.0            0.0            0.0            0.0         0.0    0.0    0.0  
│hdisk6        0.0            0.0            0.0            0.0         0.0    0.0    0.0  
│hdisk11       0.0            0.0            0.0            0.0         0.0    0.0    0.0  

Wait - average wait time in the queue in ms
ServQ Size - Average service queue size
WaitQ Size - Average wait queue size
ServQ Full - Number of IO requests sent to a full queue for the interval
If you have a lot of hdisks, you can press the "." subcommand to show only the busy disks.



- to get high IO rates, many threads need to run in parallel (20, 32 or more)
- smaller block sizes result in higher transaction rates (IOPS), but going below 4k is not suggested
- larger block sizes result in higher bandwidth (MB/s); 128k or 1m can be used
- most databases generate random IO (with smaller block sizes, 4k or 8k)
- consider testing the 4K block size as well, because it demonstrates the maximum capabilities of the storage server
- avoid the file system cache: use logical volumes or disks directly (raw IO is recommended: rlv, rhdisk) or mount -o cio to bypass the file cache
- gather at least the following metrics from iostat -D: IOPS (tps), response times (avgserv) and throughput
- for reliable results run the ndisk test for at least 5 minutes: ./ndisk64 -R -t 300 -f /dev/rhdisk0 -M 20 -b 4k -s 400G -r100
- write tests will destroy the data on the target device

In general, IO activity can be:
- Random: smaller blocks (4-32KB), sensitive to latency. It is demanding for the storage server because it isn't cache-friendly. Typically, storage vendors use it for their benchmarks.
- Sequential: large IO requests (64KB to 128KB or even more), with the data read in order. Normally throughput is what matters and latency is not an issue, since latency increases with larger IO sizes anyway. Good for testing the throughput of new HBA or SAN switch implementations.

Baseline for IBM System Storage DS8000:
read average service time below 15 ms (larger values might indicate a bottleneck in a lower layer: HBA, SAN, or the storage)
write average service time below 3 ms (larger values might indicate the write cache is full and there is a bottleneck in the disk)

Baseline for IBM FlashSystem V9000:
For small IO size workloads (8 KB - 32 KB) the goal is to stay under 1 millisecond. Average response time for read or write operations is between 0.1 ms and 0.5 ms; a 1 ms response time could be considered high, but it's still OK. For large IO size workloads (64 KB - 128 KB) it should be under 3 milliseconds.
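When eyeballing iostat avgserv values, the baselines above can be turned into a quick lookup (a sketch only; the thresholds come from the text, the function name and table layout are my own):

```python
def check_service_time(storage, workload, avg_serv_ms):
    """Compare an average service time (ms) against the baselines above."""
    limits = {
        ("DS8000", "read"):  15.0,  # larger may mean HBA/SAN/storage bottleneck
        ("DS8000", "write"):  3.0,  # larger may mean the write cache is full
        ("V9000",  "small"):  1.0,  # 8-32 KB IO sizes
        ("V9000",  "large"):  3.0,  # 64-128 KB IO sizes
    }
    limit = limits[(storage, workload)]
    return "OK" if avg_serv_ms <= limit else f"high (baseline {limit} ms)"

# avgserv values taken from the iostat example earlier in this page
print(check_service_time("DS8000", "read", 2.7))    # OK
print(check_service_time("DS8000", "write", 14.9))  # high (baseline 3.0 ms)
```

Baselines like these are rules of thumb per storage model, not hard limits; sustained values above them are the signal to dig deeper.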




READ: if a file is cached in memory, the read measurement is not valid; umount/mount is the best way to handle this.
WRITE: writes to a file are done in memory (unless direct, synchronous or asynchronous IO is used), so syncd is needed during tests to write the data out.

synchronous IO: the application waits until the disk operation has read the data into memory
asynchronous IO: allows applications to initiate read or write operations without being blocked, since all IO is done in the background

disk write speed:
sync; date; dd if=/dev/zero of=/export/1000m bs=1m count=1024; date; sync; date
disk read speed: (sequential read throughput; a raw device was used)
timex dd if=/dev/rhdisk0 of=/dev/null bs=1m count=1024    <--reads 1024MB directly from the hdisk, bypassing LVM

1024+0 records in
1024+0 records out

real 25.30                                                <--1024MB/25.3s ≈ 40MB/s read speed
user 0.00
sys  0.45
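The read-speed figure in the annotation above is just the data volume divided by the real time (a plain calculation; the helper name is my own):

```python
def dd_throughput_mb_s(count, bs_mb, real_seconds):
    """dd moved `count` blocks of `bs_mb` megabytes in `real_seconds`."""
    return count * bs_mb / real_seconds

# values from the timex dd run above: bs=1m count=1024, real 25.30
print(round(dd_throughput_mb_s(1024, 1, 25.30), 1))  # 40.5
```

So the rounded "40MB/s" in the annotation is really about 40.5 MB/s; use the real time from timex, not user or sys.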

test a disk:
1. dd if=/dev/hdisk10 of=/dev/null &    <--creates IO activity on hdisk10 in the background
2. iostat -D hdisk10 2 10               <--shows IO activity of hdisk10
3. iostat -a 2 10                       <--shows IO activity of the adapters



ndisk64 is part of nstress tools - https://www.ibm.com/developerworks/community/wikis/home?lang=en#/wiki/Power%20Systems/page/nstress

./ndisk64 -R -t 60 -f /dev/rlv_bb -M 20 -b 4k -s 10G -r100

-R (random IO)
-t 60 (60 sec for a quick test; for a longer test use -t 300)
-f (can be a file; I used a raw LV to avoid the fs cache - the raw device in /dev)
-M 20 (multiple processes; queue_depth was 20, so I used 20, but 32 or more can be tried)
-b 4k (-b 1m: larger block sizes give higher bandwidth)
-s 10G (for logical volumes the size has to be specified)
-r100 (r100: read 100%, r0: write 100%, r80: read 80% and write 20%)

An example run:
# ./ndisk64 -R -t 60 -f /dev/rfslv00 -M 20 -b 1m -s 10G -r100
Command: ./ndisk64 -R -t 60 -f /dev/rfslv00 -M 20 -b 1m -s 10G -r100
        Synchronous Disk test (regular read/write)
        No. of processes = 20
        I/O type         = Random
        Block size       = 1048576
        Read-Write       = Read Only
        Sync type: none  = just close the file
        Number of files  = 1
        File size        = 10737418240 bytes = 10485760 KB = 10240 MB
        Run time         = 60 seconds
        Snooze %         = 0 percent
----> Running test with block Size=1048576 (1024KB) ....................
Proc - <-----Disk IO----> | <-----Throughput------> RunTime
Num -     TOTAL   IO/sec |    MB/sec       KB/sec  Seconds
   1 -      9219    153.7 |    153.67    157362.94  59.99
   2 -      9229    153.9 |    153.86    157555.99  59.98
   3 -      9201    153.4 |    153.40    157084.05  59.98
   4 -      9213    153.6 |    153.58    157263.69  59.99
   5 -      9241    154.0 |    154.04    157737.37  59.99
   6 -      9186    153.1 |    153.11    156783.04  60.00
   7 -      9246    154.1 |    154.10    157802.31  60.00
   8 -      9267    154.6 |    154.56    158268.18  59.96
   9 -      9290    154.9 |    154.93    158645.83  59.96
  10 -      9234    154.0 |    153.99    157689.96  59.96
  11 -      9245    154.2 |    154.18    157878.06  59.96
  12 -      9257    154.4 |    154.37    158072.55  59.97
  13 -      9197    153.4 |    153.36    157038.53  59.97
  14 -      9285    154.8 |    154.81    158521.14  59.98
  15 -      9312    155.2 |    155.24    158967.19  59.98
  16 -      9210    153.6 |    153.59    157272.09  59.97
  17 -      9223    153.8 |    153.80    157487.53  59.97
  18 -      9260    154.3 |    154.34    158047.39  60.00
  19 -      9300    155.0 |    154.99    158711.25  60.00
  20 -      9271    154.5 |    154.50    158204.44  60.01
TOTALS    184886   3082.4 |   3082.42   3156393.53
- Random procs= 20 read=100% bs=1024KB
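As a cross-check, the TOTALS line is consistent with the run parameters (plain arithmetic; with -b 1m one IO moves 1 MB, so aggregate IOPS and MB/s coincide):

```python
total_ios = 184886  # TOTALS column from the ndisk64 run above
runtime_s = 60      # -t 60
block_mb = 1        # -b 1m

aggregate_iops = total_ios / runtime_s
print(round(aggregate_iops))             # 3081
print(round(aggregate_iops * block_mb))  # 3081 MB/s
```

This lands within a fraction of a percent of the reported 3082.4 IO/sec and 3082.42 MB/sec; the small difference is because ndisk64 divides by each process's actual runtime (59.96-60.01 s) rather than a flat 60 s.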



lvmstat            reports input/output statistics for logical volumes

1. lvmstat -v <vgname> -e             <--enables lvmstat
2. lvmstat -v <vgname>                <--lists lvmstat
3. lvmstat -v <vgname> -d             <--disables lvmstat

If an LV (or only some of its LPs) is heavily used, you can migrate it to another disk (migratepv/migratelp).



$ for VG in `lsvg`; do lvmo -a -v $VG; echo; done

vgname = rootvg
pv_pbuf_count = 512
total_vg_pbufs = 512
max_vg_pbuf_count = 16384
pervg_blocked_io_count = 62
pv_min_pbuf = 512
global_blocked_io_count = 62

This shows how much IO is blocked because of pbufs.

$ ioo -a | grep pbuf
pv_min_pbuf = 512


defragfs -s /filesystem    shows the percentage of fragmentation
lslv -p hdisk0 lv01        shows the LV fragmentation:
    USED: that PP is used by another LV, not this one
    FREE: that PP is not used by any LV
    STALE: that PP is not consistent with the other partitions (a partition number with a question mark means stale)
    Numbers: the logical partition number of this LV



The filemon command shows performance statistics on 4 layers: files, virtual memory segments, logical volumes and physical volumes. It is very useful when IO issues are suspected. It should be executed during peak load for a short period of time (usually less than 60s).

To run filemon for 10 seconds and generate a report (fmon.out):
# filemon -v -o fmon.out -O all ; sleep 10 ; trcstop

The command utilizes AIX trace functionality which stores the collected data temporarily in the circular kernel trace buffer. If the load on the system is very high, these records in the buffer may be overwritten and "TRACEBUFFER WRAPAROUND error" will be included in the output file. In that case, the size of the buffer should be increased or the period of time for the data collection should be decreased.

# filemon -u -v -o output.out -O all -T 100000000; sleep 60; trcstop
-u: reports on the files that are opened before the trace daemon is started
-v: includes extra information in the report
-o: defines the name of the output file (instead of stdout)
-O levels: monitors specified levels of the file system (all 4 layers included in this sample)
-T: size of the kernel trace buffer in bytes
sleep 60: data collection process is active for 60s
trcstop: command used to stop the tracing process

Don't forget to execute "trcstop", otherwise trace will be running forever, which can cause other issues.


LTG (Logical Track Group) size:
(maximum allowed transfer size)
1. lsvg <vgname> | grep LTG             <--shows LTG size of the vg
2. lquerypv -M <hdiskX>                 <--shows LTG size of the disk (lspv <hdiskX> will show the same value at MAX REQUEST)
3. varyonvg -M512K tmpvg                <--this will change the LTG value which is suitable for the disk

From the chvg documentation: for volume groups created on AIX 5.3, the -L flag is ignored. When the volume group is varied on, the logical track group size will be set to the common max transfer size of the disks.



fileplace -v smit.log
254 frags over space of 275 frags:   space efficiency = 92.4%    <--spread across 275 fragments, but only 254 are used (space efficiency 92%)
 4 extents out of 254 possible:   sequentiality = 98.8%          <--shows how sequentially the fragments are placed (98%)

defragfs can improve this, but it will not defragment the files, only the free space.
(if a file is very fragmented, the solution is: 1. back up the file, 2. run defragfs (to get enough contiguous free space), 3. restore the file)
the reorgvg command will reorganize the partitions according to the allocation policy




At LV level:
    -MWC writes
    -write verify enabled
    -inter/intra policy settings (intra policy=center is good)
    -hot spots (too many LVs on a disk)
    -many IO operations while the jfslog is on the same disk (solution: place the jfslog separately or add additional jfslogs)
    -inode lock: a file in an fs is one inode; while it is being written the inode is locked, so no read operation can occur


Douglas said...


I'd like to know at how much disk busy% the end user will notice the server being slow?


aix said...

Hi, there are many factors which can influence disk performance, but if a disk is constantly at 100% it can be a sign of some problems. (As long as it is below 100%, in my opinion there are free resources on the disk side.)

Unknown said...

I am trying to run the "trace" utility to log all I/O to a scsi disk (reads and writes), but I can't make sense of the data - I use fio to generate 8/16/64 kb sized iops but when going over the log (logged with trace -a -J diskdd and parsed with trcrpt) I see various b_bcount sizes like 1000/5000/11000 but they do not correspond to the sizes I expect to see... Is this related to I/O coalescing? Am I missing something here? Any help will be greatly appreciated...

aix said...

Hi, I'm afraid I can't help you in that...I have never worked with this utility...but perhaps someone else...
(If you find out something and you could share with us...that would be great.)

abhilashreddy said...

We have AIX servers connected to shared storage DS800. Recently we replaced a faulty Cisco SAN switch module. After reconnecting we are getting errors in errpt. There is no path failure, all the hardware is fine. I guess these errors come from memory cache... How to fix it?
below is one of the error from errpt


Date/Time: Fri Aug 23 20:58:34 GMT+05:30 2013
Sequence Number: 109790
Machine Id: 00F6C61E4C00
Node Id: *******
Class: H
Type: TEMP
WPAR: Global
Resource Name: hdisk3
Resource Class: disk
Resource Type: mpioosdisk


Probable Causes

User Causes

Recommended Actions

Failure Causes

Recommended Actions

Unknown said...

Recommended Actions:

This is how the system behaves when a disk disappears. If you were replacing a faulty device (a controller or anything other than the disk itself on the storage system), the drive should have kept working.

But if you are cleaning up unknown drives with exportvg and rmdev -dl hdiskX (inactive drives without FC connections), you need to run cfgmgr and importvg hdiskX, or cfgmgr + smitty importvg (see man importvg).

(For this action we need to know which drive was associated with which VG name, so the export goes without errors.
Typically this information can be displayed with the lspv command, run before and after any changes to the LVM:
hdisk0 00c4afe7c75c9db5 rootvg active)

Unknown said...

This assumes that everything at the storage and FC transport level is set up (zoning, etc.) and you are sure the correct drive is assigned to your server.

Saker said...

Hello AIX,
this is a great topic which I found really useful.

I have a question about IO on disks: how can I know the current IO requests on a disk?