IOPS (I/O operations per second) for a disk is limited by queue_depth / (average IO service time). With a queue_depth of 3 and an average IO service time of 10 ms, that is 3 / 0.010 s = 300 IOPS for the hdisk, which may not be enough throughput for many applications.
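The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the formula; the function name max_iops is made up for this example, and the 10 ms service time is the same assumed value as in the text.

```python
def max_iops(queue_depth: int, avg_service_time_ms: float) -> float:
    """Upper bound on IOPS for one hdisk: queue_depth / average service time."""
    if avg_service_time_ms <= 0:
        raise ValueError("average service time must be positive")
    # Service time is given in milliseconds, so scale by 1000 to get per-second.
    return queue_depth * 1000.0 / avg_service_time_ms


# Same assumed 10 ms service time, different queue depths.
for depth in (3, 8, 16, 32):
    print(f"queue_depth={depth:>2}  max IOPS at 10 ms: {max_iops(depth, 10):.0f}")
```

The bound only holds if the service time stays the same as the queue gets deeper, which in practice it often does not.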

When an application issues an IO request, the request is queued at each layer between the application and the disk:
-fs: filesystem buffer (fsbuf)
-lvm: volume group buffer (pbuf)
-multipath driver (optional)
-hdisk: disk driver queue (queue_depth)
-adapter: FC adapter driver queue (num_cmd_elems)
-SAN fabric devices buffer/cache
-Storage server cache



We can collect data using iostat during the peak load of the system (for 15-30 mins using 15-30s intervals). It can be executed in parallel with additional data collection tools (nmon …) to establish the performance across all the layers.
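As a rough sketch of how the interval and count arguments relate to the collection window, the helper below builds an iostat command line. The helper name iostat_command is hypothetical; the -DRTl flags are the ones described in this article.

```python
def iostat_command(duration_min: int, interval_s: int, flags: str = "-DRTl") -> str:
    """Build an 'iostat <flags> <interval> <count>' line covering duration_min minutes."""
    if duration_min <= 0 or interval_s <= 0:
        raise ValueError("duration and interval must be positive")
    # Number of reports needed to cover the whole window.
    count = (duration_min * 60) // interval_s
    return f"iostat {flags} {interval_s} {count}"


print(iostat_command(30, 30))  # iostat -DRTl 30 60
print(iostat_command(15, 15))  # iostat -DRTl 15 60
```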

iostat -s            print the System throughput report since boot
iostat -d hdisk1 5   display only 1 disk statistics (hdisk1)
iostat -a 5          shows adapter statistics as well

iostat -DRTl 5 2     shows 2 reports at 5-second intervals
            -D       extended report
            -R       reset min. and max. values at each interval
            -T       adds timestamp
            -l       long listing mode

Disks:              xfers                            read                            write                           queue           
------  -------------------------------- ------------------------------  ----------------------------- ---------------------------------
         %tm  bps     tps  bread  bwrtn   rps    avg    max   time fail  wps   avg    max  time  fail  avg   min   max   avg   avg  serv
         act                                     serv   serv  outs             serv   serv  outs       time  time  time  wqsz  sqsz qfull
hdisk4  13.9  212.2K  13.9 156.7K  55.5K   9.6   14.9   40.4   0    0    4.4   0.6    1.1     0    0   0.0   0.0   0.0   0.0   0.0   0.0
hdisk5  20.1  246.4K  16.5 204.8K  41.6K  12.4   16.2   49.8   0    0    4.2   0.6    1.3     0    0   0.0   0.0   0.0   0.0   0.0   0.0
hdisk6  19.7  282.4K  17.1 257.9K  24.5K  14.7   13.6   37.3   0    0    2.4   0.6    1.1     0    0   0.0   0.0   0.0   0.0   0.0   0.0
hdisk7  18.7  300.3K  20.3 215.4K  84.9K  13.1   13.9   39.7   0    0    7.2   0.7    5.6     0    0   0.0   0.0   0.0   0.0   0.0   0.0


xfers (transfers):

%tm act:     percent of time the device was active (shows whether the load is balanced across the disks, or one disk is used heavily while the others are not)
bps:         amount of data transferred (read or written) per second (default is in bytes per second)
tps:         number of transfers per second
             (A transfer is an I/O request; multiple logical requests can be combined into one I/O request, so transfers vary in size)
bread:       amount of data read per second (default is in bytes per second)
bwrtn:       amount of data written per second (default is in bytes per second)

The OS has a sensor that regularly asks the disk whether it is busy. If the disk answers "I'm busy" half of the time, %tm_act is 50%; if it answers "I'm busy" every time, %tm_act is 100%, and so on. A disk answers "busy" when there are requested operations, read or write, that have not yet been fulfilled. If many very small requests go to the disk, the chance that the sensor asks while one such operation is still open goes up, much more so than the real activity of the disk.

So "100% busy" does not necessarily mean the disk is at the limit of its transfer bandwidth. It could mean that the disk is getting relatively few but large requests (for example, streaming I/O), but it could also mean that the disk is getting a lot of relatively small requests, so that it is occupied most of the time without using its full transfer bandwidth.
To find out which is the case, analyse the corresponding "bread" and "bwrtn" columns from iostat.
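One way to tell the two cases apart is to divide the bytes per second by the transfers per second, which gives the average transfer size. The sketch below does this for the hdisk4 sample values; the helper names are made up, and treating the K suffix as 1024 bytes is an assumption about how iostat scales its output.

```python
# Assumption: iostat suffixes are binary multiples (K = 1024 bytes).
UNIT = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def to_bytes(value: str) -> float:
    """Convert an iostat value such as '212.2K' to a number of bytes."""
    value = value.strip()
    if value and value[-1].upper() in UNIT:
        return float(value[:-1]) * UNIT[value[-1].upper()]
    return float(value)


def avg_transfer_size(bps: str, tps: float) -> float:
    """Average bytes per transfer; 0.0 when the disk did no transfers."""
    if tps == 0:
        return 0.0
    return to_bytes(bps) / tps


# hdisk4 from the sample report: bps=212.2K, tps=13.9
size = avg_transfer_size("212.2K", 13.9)
print(f"hdisk4: about {size / 1024:.1f} KiB per transfer")
```

A high %tm_act together with a small average transfer size points to many small requests rather than a saturated transfer bandwidth.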

As a general rule of thumb, if %tm_act is greater than 70%, it is probably better to migrate some of the data to other disks; the more drives your data hits, the better.



rps/wps:     number of read/write transfers per second.
avgserv:     average service time per read/write transfer (default is in milliseconds)
timeouts:    number of read/write timeouts per second
fail:        number of failed read/write requests per second



avgtime:     average time spent in the wait queue (waiting to be sent to the disk because the disk's service queue is full) (default is in milliseconds)
avgwqsz:     average wait queue size (waiting to be sent to the disk)
avgsqsz:     average service queue size (this can't exceed queue_depth for the disk)
sqfull:      number of times per second the service queue becomes full (the rate at which I/O requests are submitted to a full queue)
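For scripted analysis, a data row of the report can be split into named fields. The sketch below assumes exactly the column layout of the sample report above (21 values after the disk name); the field names are labels chosen for this example rather than official identifiers, the K suffix is assumed to mean 1024, and rows with a different number of columns (for example with timestamp columns appended) are rejected rather than guessed at.

```python
# Labels follow the sample header above: xfers, read, write, queue.
FIELDS = [
    "tm_act", "bps", "tps", "bread", "bwrtn",
    "rps", "r_avgserv", "r_maxserv", "r_timeouts", "r_fail",
    "wps", "w_avgserv", "w_maxserv", "w_timeouts", "w_fail",
    "q_avgtime", "q_mintime", "q_maxtime", "avgwqsz", "avgsqsz", "sqfull",
]

# Assumption: suffixes are binary multiples.
SUFFIX = {"K": 1024.0, "M": 1024.0**2, "G": 1024.0**3, "T": 1024.0**4}


def to_number(token: str) -> float:
    """Convert '212.2K' or '14.9' to a float."""
    if token[-1] in SUFFIX:
        return float(token[:-1]) * SUFFIX[token[-1]]
    return float(token)


def parse_row(line: str):
    """Return (disk name, dict of field -> value) for one data row."""
    parts = line.split()
    name, tokens = parts[0], parts[1:]
    if len(tokens) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} values, got {len(tokens)}")
    return name, {field: to_number(tok) for field, tok in zip(FIELDS, tokens)}


sample = ("hdisk4  13.9  212.2K  13.9 156.7K  55.5K   9.6   14.9   40.4   0    0"
          "    4.4   0.6    1.1     0    0   0.0   0.0   0.0   0.0   0.0   0.0")
name, row = parse_row(sample)
print(name, row["r_avgserv"], row["w_avgserv"], row["sqfull"])
```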


Tuning considerations

read avgserv: should be less than 10ms
write avgserv: should be less than 3ms
avgtime, serv qfull, avgwqsz: should be 0

If avgtime, serv qfull and avgwqsz are often above 0, the queue is overloaded and the queue_depth attribute should be increased. (If I/O service times (avgserv) are poor, increasing queue_depth does not help, because I/Os will wait at the storage rather than in the queue.)
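The thresholds above can be turned into a small check. This is a minimal sketch: the function name and the message wording are made up, and the limits are simply the rule-of-thumb values from this section, not hard limits.

```python
def check_disk(read_avgserv, write_avgserv, avgtime, avgwqsz, sqfull):
    """Return a list of findings based on the rule-of-thumb thresholds."""
    findings = []
    if read_avgserv >= 10:
        findings.append(f"read avgserv {read_avgserv} ms is not below 10 ms")
    if write_avgserv >= 3:
        findings.append(f"write avgserv {write_avgserv} ms is not below 3 ms")
    if avgtime > 0 or avgwqsz > 0 or sqfull > 0:
        if findings:
            # Poor service times: a deeper queue would only move the wait to the storage.
            findings.append("queue is filling, but service times are poor: check the storage first")
        else:
            findings.append("queue is filling: consider increasing queue_depth")
    return findings


# hdisk4 from the sample report: read avgserv 14.9, write avgserv 0.6, queue values all 0.0
for finding in check_disk(14.9, 0.6, 0.0, 0.0, 0.0):
    print("hdisk4:", finding)
```

A single interval above the limits is not conclusive; the text says "often", so the check is best applied over many intervals.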

There is an "in process" queue and a "wait" queue at each layer; the "in process" queue is sometimes referred to as the "service" queue. Once the queue limit is reached, IOs go into the wait queue until an IO completes and a slot is freed up in the service queue. From the application's point of view, the time an IO takes is its service time plus the time it waits in the hdisk wait queue (avgserv + avgtime). Time spent in the wait queue indicates that increasing queue_depth may improve performance.
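The effect of the wait queue can be illustrated with a deliberately simple model, sketched below. It assumes a burst of IOs that all arrive at the same moment, a constant service time, and a service time that does not change with queue depth; none of these hold exactly on a real system, so the numbers only show the shape of the effect.

```python
def simulate_burst(n_ios: int, queue_depth: int, service_ms: float):
    """Return (average wait time, average response time) in ms for a burst of IOs.

    The first queue_depth IOs go straight into the service queue; every later
    batch waits one more service time in the wait queue.
    """
    waits = [(i // queue_depth) * service_ms for i in range(n_ios)]
    avg_wait = sum(waits) / n_ios
    return avg_wait, avg_wait + service_ms


for depth in (3, 6, 12):
    wait, response = simulate_burst(12, depth, 10.0)
    print(f"queue_depth={depth:>2}: avg wait {wait:.1f} ms, avg response {response:.1f} ms")
```

In this model the average wait corresponds to avgtime and the constant service time to avgserv, so the response time is their sum, as described above.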

During tuning it is very useful to run "iostat -D", which shows statistics since system boot (for this, the history statistics attribute of sys0 should be iostat=true; check with lsattr -El sys0).

When you increase the queue_depths (so more IOs are sent to the disk subsystem), the IO service times (avgserv) are likely to increase, but throughput will also increase. If IO service times start approaching the disk timeout value, then you're submitting more IOs than the disk subsystem can handle. If you start seeing IO timeouts and IO completion errors in the error log, then this is the time to look for hardware problems or to make the pipe smaller.

# lsattr -El hdisk400 | grep timeout
rw_timeout    40                 READ/WRITE time out value        True

If the read/write max service times are 30 or 60 seconds, or close to the read/write timeout value, this likely indicates a command timeout or some type of error recovery in the disk driver.
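A comparison like this is easy to get wrong because of the units: max serv is reported in milliseconds by default, while rw_timeout is a value in seconds. The sketch below makes the conversion explicit. The function name and the 75% margin are assumptions chosen for illustration, not values from AIX.

```python
def near_timeout(maxserv_ms: float, rw_timeout_s: float, margin: float = 0.75) -> bool:
    """True if the max service time is at or above margin * rw_timeout.

    maxserv_ms   : max serv from iostat, in milliseconds
    rw_timeout_s : rw_timeout from lsattr, in seconds
    margin       : hypothetical fraction of the timeout treated as "close"
    """
    return maxserv_ms >= margin * rw_timeout_s * 1000.0


print(near_timeout(40.4, 40))     # False: 40.4 ms is nowhere near a 40 s timeout
print(near_timeout(30000.0, 40))  # True: 30 s reaches 75% of the 40 s timeout
```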

A general rule for tuning queue_depths is that one can increase queue_depth until IO service times start exceeding ~10-15 ms for small random reads or writes, or until the queues are no longer being filled. Once IO service times start increasing, we've pushed the bottleneck from the AIX disk and adapter queues to the disk subsystem.

ndisk can be used to test where the limits are. Caches and IO service times will affect test results. Read cache hit rates typically increase the second time you run a test, so the cache should be flushed (umount) after the first test. Write cache helps performance until it fills up, at which point performance goes down, so longer running tests with high write rates can show a drop in performance over time. For write caches, consider monitoring the cache to see if it fills up, and run your tests long enough to see whether the cache continues to fill up faster than the data can be offloaded to disk.

The downside of setting queue depths too high is that the disk subsystem won't be able to handle the IO requests in a timely fashion, and may even reject an IO or just ignore it. This can result in an IO timeout, and the IO error recovery code will be called.



%iowait is the percentage of time the CPU is idle AND there is at least one I/O in progress (all CPUs averaged together). A system with four CPUs and one thread doing I/O will report a maximum of 25% iowait. A system with 12 CPUs and one thread doing I/O will report a maximum of 8.3% iowait. High I/O wait does not necessarily mean there is an I/O bottleneck, and zero I/O wait does not mean there is no I/O bottleneck. If AIX is waiting on a write and has nothing else to do, it keeps looking for the incoming I/O completion and books all of that time as I/O wait, because that is all it is trying to do. In the opposite situation, an application may be busy processing other requests while its IOs are taking a long time to complete, and in this case we will see a low iowait percentage.
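The two maximum figures above follow from averaging over all CPUs, which the sketch below reproduces. The function name is made up, and the model assumes each I/O thread can keep at most one otherwise idle CPU in the I/O wait state.

```python
def max_iowait_percent(ncpus: int, io_threads: int = 1) -> float:
    """Highest %iowait that can be reported when io_threads threads do I/O."""
    if ncpus <= 0 or io_threads < 0:
        raise ValueError("ncpus must be positive and io_threads non-negative")
    # At most one CPU per I/O thread can be idle-and-waiting, averaged over all CPUs.
    return 100.0 * min(io_threads, ncpus) / ncpus


print(f"{max_iowait_percent(4):.1f}%")   # 25.0%
print(f"{max_iowait_percent(12):.1f}%")  # 8.3%
```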



Anonymous said...


Could you explain GPFS concept

Unknown said...

Can you please help me to find how many eth cards & fc cards i have in each vio server based on below info?

ent0 Available 02-00 4-Port Gigabit Ethernet PCI-Express Adapter (e414571614102004)
ent1 Available 02-01 4-Port Gigabit Ethernet PCI-Express Adapter (e414571614102004)
ent2 Available 02-02 4-Port Gigabit Ethernet PCI-Express Adapter (e414571614102004)
ent3 Available 02-03 4-Port Gigabit Ethernet PCI-Express Adapter (e414571614102004)

fcs0 Available 03-00 8Gb PCI Express Dual Port FC Adapter (df1000f114108a03)
fcs1 Available 03-01 8Gb PCI Express Dual Port FC Adapter (df1000f114108a03)

fcs0 Available 01-00 8Gb PCI Express Dual Port FC Adapter (df1000f114108a03)
fcs1 Available 01-01 8Gb PCI Express Dual Port FC Adapter (df1000f114108a03)

Unknown said...

Check the following link and also google.

Unknown said...

You've following

VIOS Server1: You have 1 Quad port Ethernet Card and 1 Dual port FC Card

VIOS Server2: You have only 1 Dual port FC Card

Unknown said...

Great blog. Like this blog . Thank you so much ^^^

Kapalin said...

Nicely explained, Thanks a lot for this information

DAC said...

Hello, very good article... In fact, I just solve my problems...
One note on "max_xfer_size": If you're using NPIV, YOU HAVE TO CHANGE THE VALUE ON THE VIOS SERVER FIRST, then reboot (or rmdev -Rl and cfgmgr), and then, change it on the Client LPAR. If you fail to do this in this order, you may find yourself with "cfgmgr" errors or LED 554 if you boot the LPAR.

Thanks, Chris's AIX Blog (https://www.ibm.com/developerworks/community/blogs/cgaix/entry/multibos_saved_my_bacon1?lang=es).

aix said...

Hi, thanks for your valuable info.

Unknown said...

Very good deep dive infos. Thank you.

aix said...

You're welcome :)

Durga said...


I have query, I'm having 2 VIOS & 8 LPAR's in frame

2 VIOS - updating the lg_term_dma, max_xfer_size & rebooting

7 LPAR's - updating the lg_term_dma, max_xfer_size & rebooting

1 LPAR - NOT UPDATING the values of lg_term_dma, max_xfer_size

So, the updated values of VIOS will not effect the LPAR, which was not updated.

Thanks & Regards,

Tusar said...

Excellent Explanation....Grt Blog

Unknown said...

How can i take the IOstat data for last one hour

Unknown said...

how can we count IOPS