
IOSTAT - FCSTAT:

IOPS (I/Os per second) for a disk is limited by queue_depth / (average IO service time). Assuming a queue_depth of 3 and an average IO service time of 10 ms, this comes to 300 IOPS for the hdisk, which may not be enough throughput for many applications.
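
As a quick check of the arithmetic above, the ceiling can be computed directly. This is only a minimal Python sketch; the function name max_iops is made up for illustration and is not part of any AIX tool.

```python
def max_iops(queue_depth: int, avg_service_ms: float) -> float:
    """Upper bound on IOPS for one hdisk: queue_depth / average service time."""
    if avg_service_ms <= 0:
        raise ValueError("average service time must be positive")
    # service time is given in milliseconds, so scale to one second
    return queue_depth * 1000.0 / avg_service_ms


print(max_iops(3, 10))   # queue_depth 3, 10 ms average service time
```

Doubling queue_depth doubles this ceiling only if the service time stays the same, which it usually does not once the disk gets busier.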

IO is queued at each layer it travels through:
-filesystem - filesystem buffer
-hdisk - queue_depths
-adapter - num_cmd_elems

-----------------------------------

Adapter I/O:

There are no adapter stats in AIX. They are derived from the disk stats. The adapter busy% is simply the sum of the disk busy%.
So if the adapter busy% is, for example, 350% then you have 3.5 disks busy on that adapter. Or it could be 7 disks at 50% busy or 14 disks at 25% or ....

There is no way to determine the real adapter busy%, and in fact it is not clear what it would really mean. The adapter has a dedicated on-board CPU that is always busy (probably no real OS), and we don't run nmon on these adapter CPUs to find out what they are really doing.
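
The derived adapter figure can be reproduced by hand. The sketch below is illustrative only: it uses the %tm_act values from the iostat sample in this article and assumes, purely for the example, that all four disks sit behind the same adapter.

```python
def adapter_busy_pct(disk_tm_act):
    """Derived adapter busy%: the sum of %tm_act over the adapter's disks."""
    return sum(disk_tm_act.values())


# %tm_act per disk (assumption: all four disks share one adapter)
sample = {"hdisk4": 13.9, "hdisk5": 20.1, "hdisk6": 19.7, "hdisk7": 18.7}
busy = adapter_busy_pct(sample)
print(f"adapter busy: {busy:.1f}% (about {busy / 100:.2f} fully busy disks)")
```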

-----------------------------------

IOSTAT:

iostat -s            print the System throughput report since boot
iostat -d hdisk1 5   display only 1 disk statistics (hdisk1)
iostat -a 5          shows adapter statistics as well


iostat -DRTl 5 2     shows 2 reports at 5-second intervals
            -D       extended report
            -R       resets min and max values at each interval
            -T       adds timestamp
            -l       long listing mode


Disks:              xfers                            read                         write                                  queue                
------  -------------------------------- -------------------------------- ----------------------------- --------------------------------------
          %tm    bps   tps  bread  bwrtn   rps    avg     max  time fail   wps    avg    max time fail    avg    min    max   avg   avg  serv
          act                                    serv    serv  outs              serv   serv outs        time   time   time  wqsz  sqsz qfull
hdisk4   13.9 212.2K  13.9 156.7K  55.5K   9.6   14.9    40.4     0    0   4.4   0.6    1.1     0    0   0.0    0.0    0.0    0.0   0.0   0.0
hdisk5   20.1 246.4K  16.5 204.8K  41.6K  12.4   16.2    49.8     0    0   4.2   0.6    1.3     0    0   0.0    0.0    0.0    0.0   0.0   0.0
hdisk6   19.7 282.4K  17.1 257.9K  24.5K  14.7   13.6    37.3     0    0   2.4   0.6    1.1     0    0   0.0    0.0    0.0    0.0   0.0   0.0
hdisk7   18.7 300.3K  20.3 215.4K  84.9K  13.1   13.9    39.7     0    0   7.2   0.7    5.6     0    0   0.0    0.0    0.0    0.0   0.0   0.0

-----------------------------------

xfers (transfers):

%tm_act:     percent of time the device was active (shows whether the load is balanced across disks, or one disk is used heavily while the others are not)
bps:         amount of data transferred (read or written) per second (default is in bytes per second)
tps:         number of transfers per second
             (a transfer is an I/O request to the device; multiple logical requests can be combined into one I/O request, so transfers vary in size)
bread:       amount of data read per second (default is in bytes per second)
bwrtn:       amount of data written per second (default is in bytes per second)

%tm_act:
The OS has a sensor that regularly asks the disk whether it is busy. If the disk answers "I'm busy" half of the time, %tm_act will be 50%. If it answers "I'm busy" every time, %tm_act will be 100%, and so on. A disk answers "busy" when there are requested read or write operations not yet fulfilled. If many very small requests go to the disk, the chance that the sensor asks exactly while one such operation is still open goes up, much more so than the real activity of the disk.

So "100% busy" does not necessarily mean the disk is at the limit of its transfer bandwidth. It could mean that the disk is getting relatively few but big requests (example: stream I/O), but it could also mean that the disk is getting a lot of relatively small requests, so that it is occupied most of the time without using its full transfer bandwidth.
To find out which is the case, analyse the corresponding "bread" and "bwrtn" columns from iostat.
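
One way to tell the two cases apart is the average size of a transfer, i.e. bps divided by tps. The following is a rough sketch, not an official method; the helper names are invented, and it assumes the K/M/G suffixes in iostat output are binary multiples.

```python
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}  # assumption: binary multiples


def to_bytes(value: str) -> float:
    """Convert an iostat figure such as '212.2K' to a plain number."""
    value = value.strip()
    if value and value[-1] in UNITS:
        return float(value[:-1]) * UNITS[value[-1]]
    return float(value)


def avg_transfer_bytes(bps: str, tps: float) -> float:
    """Average bytes moved per transfer (0 when the disk is idle)."""
    return to_bytes(bps) / tps if tps > 0 else 0.0


# hdisk4 from the sample: bps=212.2K, tps=13.9
print(round(avg_transfer_bytes("212.2K", 13.9) / 1024, 1), "KiB per transfer")
```

A busy disk with a small average transfer size is busy with many small requests; a busy disk with a large average is closer to streaming I/O.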

-----------------------------------

read/write:

rps/wps:     number of read/write transfers per second.
avgserv:     average service time per read/write transfer (default is in milliseconds)
timeouts:    number of read/write timeouts per second
fail:        number of failed read/write requests per second

-----------------------------------

queue (wait queue):

avgtime:     average time spent in the wait queue (waiting to be sent to the disk because the disk's service queue is full) (default is in milliseconds)
avgwqsz:     average wait queue size (waiting to be sent to the disk)
avgsqsz:     average service queue size (this can't exceed queue_depth for the disk)
sqfull:      number of times the service queue becomes full per second (that is, the disk is not accepting any more service requests)
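
To work with these columns in a script, a data row can be split into named fields. This is a minimal sketch under one loud assumption: the column order is exactly the one printed in the sample header in this article. Other AIX levels or flags (for example a timestamp column from -T) may print more columns, so the field list would need adjusting. The field names are made up for readability.

```python
# Column order as printed in the sample header of this article (assumption).
FIELDS = [
    "tm_act", "bps", "tps", "bread", "bwrtn",
    "rps", "r_avgserv", "r_maxserv", "r_timeouts", "r_fail",
    "wps", "w_avgserv", "w_maxserv", "w_timeouts", "w_fail",
    "q_avgtime", "q_mintime", "q_maxtime", "avgwqsz", "avgsqsz", "sqfull",
]


def parse_row(line):
    """Split one data row into (disk name, {field: raw string value})."""
    parts = line.split()
    if len(parts) != len(FIELDS) + 1:
        raise ValueError(f"expected {len(FIELDS) + 1} columns, got {len(parts)}")
    return parts[0], dict(zip(FIELDS, parts[1:]))


row = ("hdisk4   13.9 212.2K  13.9 156.7K  55.5K   9.6   14.9    40.4     0    0"
       "   4.4   0.6    1.1     0    0   0.0    0.0    0.0    0.0   0.0   0.0")
disk, stats = parse_row(row)
print(disk, stats["r_avgserv"], stats["w_avgserv"], stats["sqfull"])
```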

-----------------------------------

From the application's point of view, the length of time to do an IO is its service time plus the time it waits in the hdisk wait queue.
Time spent in the wait queue indicates that increasing queue_depth may improve performance. For correct tuning, check the maximum values as well as the averages.
If avgwqsz is often > 0, then increase queue_depth
If sqfull in the first report is high, then increase queue_depth
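
The two rules above can be sketched as a small check. This is illustrative only: the function names are invented, "often" is interpreted here as more than half of the samples, and sqfull_limit is an arbitrary placeholder for "high". Neither threshold comes from AIX documentation.

```python
def app_io_time_ms(avgserv_ms, avg_queue_ms):
    """IO time seen by the application: service time plus wait-queue time."""
    return avgserv_ms + avg_queue_ms


def queue_depth_hint(avgwqsz_samples, sqfull, sqfull_limit=1.0):
    """Return True when the rules above point to a larger queue_depth.

    Assumptions: "often" means more than half of the samples, and
    sqfull_limit stands in for "high". Tune both for your own site.
    """
    waiting = sum(1 for v in avgwqsz_samples if v > 0)
    often_waiting = waiting * 2 > len(avgwqsz_samples)
    return often_waiting or sqfull > sqfull_limit


print(app_io_time_ms(14.9, 0.0))                      # hdisk4 reads in the sample
print(queue_depth_hint([0.0, 0.0, 0.0], sqfull=0.0))  # nothing waits in the queue
```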

When you increase queue_depth (so more IOs are sent to the disk), the IO service times are likely to increase, but throughput will also increase. If IO service times start approaching the disk timeout value, then you're submitting more IOs than the disk can handle:

# lsattr -El hdisk400 | grep timeout
rw_timeout    40                 READ/WRITE time out value        True

If read/write max service times are 30 or 60 seconds or close to what the read/write time out value is set to, this likely indicates a command timeout or some type of error recovery the disk driver experienced. 
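
A comparison like this is easy to get wrong because of units: iostat reports service times in milliseconds, while rw_timeout is normally given in seconds. The sketch below makes the conversion explicit. The function name and the 0.75 "close to" fraction are illustrative choices, not AIX values.

```python
def near_timeout(max_serv_ms, rw_timeout_s, fraction=0.75):
    """True when the worst service time is within `fraction` of rw_timeout.

    max_serv_ms is in milliseconds (iostat), rw_timeout_s in seconds (lsattr).
    The 0.75 fraction is an arbitrary illustration of "close to".
    """
    return max_serv_ms >= rw_timeout_s * 1000.0 * fraction


print(near_timeout(49.8, 40))      # hdisk5 max read service time in the sample
print(near_timeout(30000.0, 40))   # a 30-second maximum against rw_timeout=40
```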

-----------------------------------

A good general rule for tuning queue_depth is to tune until these rates are achieved:
average read service time ~ 15 ms
with write cache, write average ~ 2 ms (writes typically go to write cache first)

Typically for large disk subsystems that aren't overloaded, IO service times will average around 5-10 ms. When small random reads start averaging greater than 15 ms, this indicates the storage is getting busy.

-----------------------------------

For tuning, we can set up these categories:

1. We're filling up the queues and IOs are waiting in the hdisk or adapter drivers
2. We're not filling up the queues, and IO service times are good
3. We're not filling up the queues, and IO service times are poor
4. We're not filling up the queues, and we're sending IOs to the storage faster than it can handle and it loses the IOs

#1: queues are too small for the load (consider increasing queue_depth or num_cmd_elems)
#2: we want to reach this
#3: indicates a bottleneck beyond the hdisk (probably in the adapter, the SAN fabric or on the storage box side)
#4: should be avoided (if the storage loses IOs, they time out at the host and the recovery code resubmits them; in the meantime the application waits for the IO)
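
The four categories can be expressed as a small decision function. This is a sketch only: the function and parameter names are invented, and the 15 ms / 2 ms limits are simply the rule-of-thumb values quoted earlier, used here as defaults.

```python
def service_times_good(read_avg_ms, write_avg_ms, read_limit=15.0, write_limit=2.0):
    """Compare averages with the ~15 ms read / ~2 ms cached-write rule of thumb."""
    return read_avg_ms <= read_limit and write_avg_ms <= write_limit


def classify(queues_filling, times_ok, ios_lost):
    """Map three observations onto the four categories listed above."""
    if queues_filling:
        return 1   # IOs wait in the hdisk or adapter driver
    if ios_lost:
        return 4   # storage drops IOs, the host sees timeouts
    return 2 if times_ok else 3


# hdisk4 from the sample: no queueing, no timeouts, 14.9 ms reads, 0.6 ms writes
print(classify(False, service_times_good(14.9, 0.6), False))
```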

-----------------------------------

As a general rule of thumb, if %tm_act is greater than 70%, it is probably better to migrate some of the data to other disks.
Moving data to less busy drives can obviously help ease this burden. Generally speaking, the more drives that your data hits, the better.

%iowait: percentage of time the CPU is idle AND there is at least one I/O in progress (all CPUs averaged together)
(High I/O wait does not necessarily mean there is an I/O bottleneck. Zero I/O wait does not mean there is no I/O bottleneck.)
If %iowait > 25%, the system is probably I/O bound.
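
Both rules of thumb are simple threshold checks. The sketch below is illustrative only; the function names are invented, and hdisk9 with its value is a made-up entry added to show a hit.

```python
def hot_disks(tm_act_by_disk, limit=70.0):
    """Disks whose %tm_act exceeds the 70% rule of thumb."""
    return sorted(d for d, pct in tm_act_by_disk.items() if pct > limit)


def probably_io_bound(iowait_pct, limit=25.0):
    """Apply the %iowait > 25% rule of thumb."""
    return iowait_pct > limit


# hdisk9 is a hypothetical entry, not taken from the sample output
print(hot_disks({"hdisk4": 13.9, "hdisk5": 20.1, "hdisk9": 85.0}))
print(probably_io_bound(12.0))
```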


-----------------------------------
-----------------------------------
-----------------------------------


FCSTAT:


# fcstat fcs0

FIBRE CHANNEL STATISTICS REPORT: fcs0
...
World Wide Node Name: 0x20000000C9F170E8 <--WWNN
World Wide Port Name: 0x10000000C9F170E8 <--WWPN
...
Port Speed (supported): 8 GBIT           <--FC adapter speed
Port Speed (running):   8 GBIT
...
Error Frames:  0                         <--both of them affect IO, when frames are damaged or
Dumped Frames: 0                         <--frames are discarded
...
FC SCSI Adapter Driver Information
No DMA Resource Count: 37512             <--IOs queued at the adapter due to lack of resources (max_xfer_size)
No Adapter Elements Count: 104848        <--number of times since boot, an IO was temporarily blocked due to an inadequate num_cmd_elems value
No Command Resource Count: 13915968      <--same as above, non-zero values indicate that increasing num_cmd_elems may help improve IO service times
...
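
The three counters above can be pulled out of saved fcstat output and turned into hints. This is a rough sketch, not a supported tool: the function names are invented, the counter labels are taken from the sample in this article, and the hint texts only restate the annotations above.

```python
import re

COUNTERS = ("No DMA Resource Count", "No Adapter Elements Count",
            "No Command Resource Count")


def resource_counters(fcstat_text):
    """Extract the three resource-shortage counters from fcstat output."""
    found = {}
    for name in COUNTERS:
        m = re.search(rf"^\s*{re.escape(name)}:\s*(\d+)", fcstat_text, re.MULTILINE)
        if m:
            found[name] = int(m.group(1))
    return found


def hints(counters):
    """Turn non-zero counters into the tuning hints described above."""
    out = []
    if counters.get("No DMA Resource Count", 0) > 0:
        out.append("consider a larger max_xfer_size")
    if (counters.get("No Adapter Elements Count", 0) > 0
            or counters.get("No Command Resource Count", 0) > 0):
        out.append("consider a larger num_cmd_elems")
    return out


sample = """\
FC SCSI Adapter Driver Information
No DMA Resource Count: 37512
No Adapter Elements Count: 104848
No Command Resource Count: 13915968
"""
print(hints(resource_counters(sample)))
```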


# lsattr -El fcs0 | egrep 'xfer|num|dma'
lg_term_dma   0x800000   Long term DMA                                      True
max_xfer_size 0x100000   Maximum Transfer Size                              True
num_cmd_elems 2048       Maximum number of COMMANDS to queue to the adapter True

When the default value is used (max_xfer_size=0x100000) the memory area is 16 MB in size. When setting this attribute to any other allowable value (say 0x200000) then the memory area is 128 MB in size. At AIX 6.1 TL2 or later a change was made for virtual FC adapters so the DMA memory area is always 128 MB even with the default max_xfer_size. This memory area is a DMA memory area, but it is different than the DMA memory area controlled by the lg_term_dma attribute (which is used for IO control). The default value for lg_term_dma of 0x800000 is usually adequate.

So for heavy IO and especially for large IOs (such as for backups) it's recommended to set max_xfer_size=0x200000.
Like the hdisk queue_depth attribute, changing the num_cmd_elems value requires stopping use of the resources or a reboot.
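
The relationship between max_xfer_size and the DMA memory area described above can be written down as a tiny lookup. This sketch only encodes what the text states; the function name and the virtual_fc flag are invented for illustration, and it does not check whether a value is actually allowed for a given adapter.

```python
def dma_area_mb(max_xfer_size, virtual_fc=False):
    """DMA memory area size implied by max_xfer_size, per the text above.

    virtual_fc=True models a virtual FC adapter on AIX 6.1 TL2 or later,
    where the area is 128 MB regardless of the setting.
    No validation of allowable max_xfer_size values is done here.
    """
    if virtual_fc or max_xfer_size != 0x100000:
        return 128
    return 16


for value in (0x100000, 0x200000):
    print(hex(value), f"= {value // (1024 * 1024)} MiB per transfer,",
          f"{dma_area_mb(value)} MB DMA area")
```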

11 comments:

  1. HI

    Could you explain GPFS concept

    Replies
    1. Check the following link and also google.
      http://www-03.ibm.com/systems/software/gpfs/

  2. Hi,
    Can you please help me find how many eth cards and fc cards I have in each VIO server, based on the info below?

    VioS1:
    ent0 Available 02-00 4-Port Gigabit Ethernet PCI-Express Adapter (e414571614102004)
    ent1 Available 02-01 4-Port Gigabit Ethernet PCI-Express Adapter (e414571614102004)
    ent2 Available 02-02 4-Port Gigabit Ethernet PCI-Express Adapter (e414571614102004)
    ent3 Available 02-03 4-Port Gigabit Ethernet PCI-Express Adapter (e414571614102004)

    fcs0 Available 03-00 8Gb PCI Express Dual Port FC Adapter (df1000f114108a03)
    fcs1 Available 03-01 8Gb PCI Express Dual Port FC Adapter (df1000f114108a03)

    VioS2:
    fcs0 Available 01-00 8Gb PCI Express Dual Port FC Adapter (df1000f114108a03)
    fcs1 Available 01-01 8Gb PCI Express Dual Port FC Adapter (df1000f114108a03)

    Replies
    1. You have the following:

      VIOS Server1: You have 1 Quad port Ethernet Card and 1 Dual port FC Card

      VIOS Server2: You have only 1 Dual port FC Card

  3. Great blog. Like this blog . Thank you so much ^^^

  4. Nicely explained, Thanks a lot for this information

  5. Hello, very good article... In fact, it just solved my problems...
    One note on "max_xfer_size": If you're using NPIV, YOU HAVE TO CHANGE THE VALUE ON THE VIOS SERVER FIRST, then reboot (or rmdev -Rl and cfgmgr), and then, change it on the Client LPAR. If you fail to do this in this order, you may find yourself with "cfgmgr" errors or LED 554 if you boot the LPAR.

    Thanks, Chris's AIX Blog (https://www.ibm.com/developerworks/community/blogs/cgaix/entry/multibos_saved_my_bacon1?lang=es).

    Replies
    1. Hi, thanks for your valuable info.

    2. Hi,

      I have a query. I have 2 VIOS and 8 LPARs in the frame:

      2 VIOS - updating the lg_term_dma, max_xfer_size & rebooting

      7 LPAR's - updating the lg_term_dma, max_xfer_size & rebooting

      1 LPAR - NOT UPDATING the values of lg_term_dma, max_xfer_size

      So will the updated values on the VIOS not affect the LPAR that was not updated?

      Thanks & Regards,

  6. Very good deep dive infos. Thank you.
