
I/O LVM


IO QUEUE

AIX disk and adapter drivers use two queues to handle IO: a service queue and a wait queue.
IO requests in the service queue are sent to the storage, and the service queue slot is freed when the IO is complete. IO requests in the wait queue stay there until a service queue slot is free. (A read operation is complete when AIX receives the data. A write operation is complete when AIX receives an acknowledgement.)
 
The service queue size for an hdisk is called queue_depth:

$ lsattr -El hdisk16
...
q_err           yes                   Use QERR bit                            True
q_type          simple                Queuing TYPE                            True
qfull_dly       2                     delay in seconds for SCSI TASK SET FULL True
queue_depth     20                    Queue DEPTH                             True
recoverDEDpath  no                    Recover DED Failed Path                 True
reserve_policy  no_reserve            Reserve Policy                          True

pcmpath query devstats 16 --> the Maximum column for I/O shows the maximum number of IOs sent to the LUN, and this won't exceed queue_depth.
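If the observed Maximum regularly reaches queue_depth, the attribute can be raised with chdev; a hedged sketch (hdisk16 and the value 32 are only example choices, check the storage vendor's recommendation first):

# chdev -l hdisk16 -a queue_depth=32 -P     <--the -P flag only updates the ODM; the new value takes effect at the next reboot (or after the disk is removed and reconfigured)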

For adapters (see the next section), the maximum value represents the maximum number of IOs submitted to the adapter over time, and this can exceed num_cmd_elems; the adapter maximum therefore tells us the value we should assign to num_cmd_elems.

---------------------------------------

The service queue size for an adapter is called num_cmd_elems:

$ lsattr -El fcs4
...
intr_priority 3          Interrupt priority                                 False
lg_term_dma   0x800000   Long term DMA                                      True
max_xfer_size 0x100000   Maximum Transfer Size                              True
num_cmd_elems 500        Maximum number of COMMANDS to queue to the adapter True
pref_alpa     0x1        Preferred AL_PA                                    True
sw_fc_class   2          FC Class for Fabric                                True
tme           no         Target Mode Enabled                                True

In fcstat output (for example: fcstat fcs4), the "No Command Resource Count" value shows how many times an IO was temporarily blocked waiting for resources because num_cmd_elems was too low.
Non-zero values indicate that increasing num_cmd_elems may help improve IO service times.

pcmpath query adaptstats --> the Maximum column for I/O shows the maximum number of IOs submitted to the adapter.
This can exceed num_cmd_elems, so the adapter maximum tells us the value we should assign to num_cmd_elems.
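If "No Command Resource Count" keeps increasing, num_cmd_elems can be raised the same way with chdev; a hedged sketch (1024 is only an example value, the allowed maximum depends on the adapter and driver):

# chdev -l fcs4 -a num_cmd_elems=1024 -P    <--takes effect at the next reboot (or after the adapter and its child devices are unconfigured and cfgmgr is run)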

IBM doc: https://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/TD105745

---------------------------------------

HOW TO MEASURE IO: IOPS (XFER, TPS)

Commands such as iostat and nmon show statistics in "IOs per second" or "transfers per second".
iops, tps and xfer all refer to the same thing. (bps is bytes per second.)

Performance can be measured by this equation:
 Q/T = R (the rate, in IOPS or bandwidth)

Q = the number of parallel IO requests (queue_depth or num_cmd_elems)
T = the I/O request service time
R = the rate, measured in IOPS (or bandwidth)

For example:
20/0.005 (5 ms service time) = 4,000 IOPS
40/0.005 = 8,000 IOPS
20/0.0001 (0.1 ms) = 200,000 IOPS

IOPS can be increased by increasing the queue depth or by reducing the service time.
(The transfer rate (MB/s) can be calculated from IOPS, based on the block size we use.)
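For example (hypothetical numbers): 4,000 IOPS with an 8 KB block size means 4,000 x 8 KB = 32,000 KB/s, roughly 31 MB/s.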


---------------------------------------

IOPS: iostat, nmon

$ iostat -RDTld hdisk10

System configuration: lcpu=88 drives=23 paths=184 vdisks=0

Disks:                     xfers                                read                                write                                  queue                    time
-------------- -------------------------------- ------------------------------------ ------------------------------------ --------------------------------------
                 %tm    bps   tps  bread  bwrtn   rps    avg    min    max time fail   wps    avg    min    max time fail    avg    min    max   avg   avg  serv
                 act                                    serv   serv   serv outs              serv   serv   serv outs        time   time   time  wqsz  sqsz qfull
hdisk10          0.5 939.5K  10.5 838.8K 100.7K   8.2   2.7   11.7   11.7     0    0   2.3   0.6    0.0    0.0     0    0  14.9    0.0    0.0    1.0   0.0   3.7


bps - amount of data transferred (read or written) per second
tps - number of transfers per second that were issued to the disk.  A transfer is an I/O request to the physical disk.
avg serv - average service time in milliseconds
avg time - average time spent in the wait queue in ms
avg wqsz - average wait queue size
avg sqsz - average service queue size
serv qfull - rate of IOs per second submitted to a full service queue
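For continuous monitoring an interval and a count can be added; a hedged example (5-second samples, 3 of them, all disks):

$ iostat -RDTl 5 3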


---------------------------------------

nmon: press "D" (3 times, to reach the disk service time view)

│ Disk        Service     Read Service   Write Service      Wait      ServQ  WaitQ  ServQ    
│ Name     milli-seconds  milli-seconds  milli-seconds  milli-seconds  Size   Size   Full    
│hdisk0        0.4            0.0            0.4            0.0         0.0    0.0    0.0    
│hdisk9        0.0            0.0            0.0            0.0         0.0    0.0    0.0    
│hdisk6        0.0            0.0            0.0            0.0         0.0    0.0    0.0    
│hdisk11       0.0            0.0            0.0            0.0         0.0    0.0    0.0    


Wait - average wait time in the queue in ms
ServQ Size - Average service queue size
WaitQ Size - Average wait queue size
ServQ Full - Number of IO requests sent to a full queue for the interval
If you have a lot of hdisks, you can press the "." subcommand to show only the busy disks.
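nmon can also record to a file for later analysis; a hedged example (60-second snapshots, 60 of them, about an hour of data):

# nmon -f -s 60 -c 60          <--writes a .nmon file into the current directory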


---------------------------------------

SPEED TESTS TIPS:

- to reach high I/O rates, many threads need to run in parallel (20, 32 or more processes)
- smaller block sizes result in higher transaction rates (IOPS), but it is not suggested to go below 4k
- larger block sizes result in higher bandwidth (MB/s); 128k or 1m can be used
- most databases generate random I/O (with smaller block sizes, 4k or 8k)
- avoid the AIX file system cache; use logical volumes or disks directly (raw I/O is recommended: rlv, rhdisk)


---------------------------------------

SPEED TEST WITH: dd


READ: If a file is cached in memory, the read measurement is not valid. umount/mount is the best way to handle this.
WRITE: Writes to a file are done in memory (unless direct, synchronous or asynchronous I/O is used), so syncd is needed during tests to write the data out.

synchronous I/O: the application waits until the disk operation completes (the data has to be read into memory before control returns)
asynchronous I/O: allows applications to initiate read or write operations without being blocked, since all I/O is done in the background


disk write speed:
sync; date; dd if=/dev/zero of=/export/1000m bs=1m count=1024; date; sync; date
                               
disk read speed: (sequential read throughput, a raw device was used)
timex dd if=/dev/rhdisk0 of=/dev/null bs=1m count=1024    <--it is reading directly 1024MB from hdisk, bypassing LVM

1024+0 records in
1024+0 records out

real 25.30                                                <--1024/25=40MB/s is the reading speed
user 0.00
sys  0.45


test a disk:
1. dd if=/dev/hdisk10 of=/dev/null &    <--creates IO activity on hdisk10 in the background
2. iostat -D hdisk10 2 10               <--shows IO activity of hdisk10
3. iostat -a 2 10                       <--shows IO activity of the adapters


-----------------------------------

SPEED TEST WITH: ndisk64

ndisk64 is part of nstress tools - https://www.ibm.com/developerworks/community/wikis/home?lang=en#/wiki/Power%20Systems/page/nstress


./ndisk64 -R -t 60 -f /dev/rlv_bb -M 20 -b 4k -s 10G -r100

-R (random IO)
-t 60 (60 seconds for a quick test; use -t 300 for a longer test)
-f (a file can be given; here a raw lv was used to avoid the fs cache (the raw device under /dev))
-M 20 (number of parallel processes; queue_depth was 20, so I used 20, but 32 or more can be tried)
-b 4k (block size; larger block sizes, e.g. -b 1m, give higher bandwidth)
-s 10G (for logical volumes the size has to be specified)
-r100 (r100: read 100%, r0: write 100%, r80: read 80% and write 20%)


An example run:
# ./ndisk64 -R -t 60 -f /dev/rfslv00 -M 20 -b 1m -s 10G -r100
Command: ./ndisk64 -R -t 60 -f /dev/rfslv00 -M 20 -b 1m -s 10G -r100
        Synchronous Disk test (regular read/write)
        No. of processes = 20
        I/O type         = Random
        Block size       = 1048576
        Read-Write       = Read Only
        Sync type: none  = just close the file
        Number of files  = 1
        File size        = 10737418240 bytes = 10485760 KB = 10240 MB
        Run time         = 60 seconds
        Snooze %         = 0 percent
----> Running test with block Size=1048576 (1024KB) ....................
Proc - <-----Disk IO----> | <-----Throughput------> RunTime
Num -     TOTAL   IO/sec |    MB/sec       KB/sec  Seconds
   1 -      9219    153.7 |    153.67    157362.94  59.99
   2 -      9229    153.9 |    153.86    157555.99  59.98
   3 -      9201    153.4 |    153.40    157084.05  59.98
   4 -      9213    153.6 |    153.58    157263.69  59.99
   5 -      9241    154.0 |    154.04    157737.37  59.99
   6 -      9186    153.1 |    153.11    156783.04  60.00
   7 -      9246    154.1 |    154.10    157802.31  60.00
   8 -      9267    154.6 |    154.56    158268.18  59.96
   9 -      9290    154.9 |    154.93    158645.83  59.96
  10 -      9234    154.0 |    153.99    157689.96  59.96
  11 -      9245    154.2 |    154.18    157878.06  59.96
  12 -      9257    154.4 |    154.37    158072.55  59.97
  13 -      9197    153.4 |    153.36    157038.53  59.97
  14 -      9285    154.8 |    154.81    158521.14  59.98
  15 -      9312    155.2 |    155.24    158967.19  59.98
  16 -      9210    153.6 |    153.59    157272.09  59.97
  17 -      9223    153.8 |    153.80    157487.53  59.97
  18 -      9260    154.3 |    154.34    158047.39  60.00
  19 -      9300    155.0 |    154.99    158711.25  60.00
  20 -      9271    154.5 |    154.50    158204.44  60.01
TOTALS    184886   3082.4 |   3082.42   3156393.53
- Random procs= 20 read=100% bs=1024KB


-----------------------------------
-----------------------------------
-----------------------------------

LVM - FILESYSTEM RELATED PERFORMANCE



lvmstat            reports input/output statistics for logical volumes

1. lvmstat -v <vgname> -e             <--enables lvmstat
2. lvmstat -v <vgname>                <--lists lvmstat
3. lvmstat -v <vgname> -d             <--disables lvmstat

If an lv (or only some of its lps) is heavily used, you can migrate it to another disk (migratepv/migratelp), as shown below.
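A hedged example of moving a single hot logical partition (lv01, LP 5 and hdisk2 are hypothetical names):

# migratelp lv01/5 hdisk2               <--moves logical partition 5 of lv01 to hdisk2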

-----------------------------------

lvmo:

$ for VG in `lsvg`; do lvmo -a -v $VG; echo; done

vgname = rootvg
pv_pbuf_count = 512
total_vg_pbufs = 512
max_vg_pbuf_count = 16384
pervg_blocked_io_count = 62
pv_min_pbuf = 512
global_blocked_io_count = 62


This shows how much IO has been blocked because of a pbuf shortage (pervg_blocked_io_count per VG, global_blocked_io_count system-wide).

$ ioo -a | grep pbuf
pv_min_pbuf = 512
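If pervg_blocked_io_count keeps growing, the pbuf count for that VG can be increased with lvmo; a hedged sketch (datavg and the value 1024 are just examples):

# lvmo -v datavg -o pv_pbuf_count=1024   <--sets the number of pbufs added per physical volume for this VG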

-----------------------------------

defragfs -s /filesystem    shows the percentage of fragmentation
lslv -p hdisk0 lv01        shows the lv fragmentation:
    USED: that pp is used, but by another lv (not this one)
    FREE: that pp is not used by any lv
    STALE: that pp is not consistent with the other partitions (a partition number with a question mark means stale)
    Numbers: the logical partition number of this lv on that pp

-----------------------------------

filemon -O all -o fmon.out; sleep 10; trcstop    creates the fmon.out file with IO statistics for 10 seconds (in this case)
                                                 it contains: Most Active Files, Logical Volumes, Physical Volumes
                                                 (rblk/wblk: number of 512-byte blocks read/written)
!!!filemon can take up almost 60 percent of the CPU, so you need to be very careful when using it!!!

-----------------------------------

LTG (Logical Track Group) size:
(maximum allowed transfer size)
1. lsvg <vgname> | grep LTG             <--shows the LTG size of the vg
2. lquerypv -M <hdiskX>                 <--shows the LTG size of the disk (lspv <hdiskX> shows the same value at MAX REQUEST)
3. varyonvg -M512K tmpvg                <--this changes the LTG value to one suitable for the disks

This is written in the chvg documentation: for volume groups created on AIX 5.3, the -L flag is ignored; when the volume group is varied on, the logical track group size is set to the common max transfer size of the disks.

-----------------------------------

Fragmentation:

fileplace -v smit.log
...
254 frags over space of 275 frags:   space efficiency = 92.4%    <--spread across 275 fragments, but only 254 are used (space efficiency 92%)
 4 extents out of 254 possible:   sequentiality = 98.8%          <--shows how sequentially the fragments are placed (98.8%)

defragfs can improve this, but it does not defragment the files, it defragments only the free space.
(if a file is very fragmented, the solution is: 1. back up the file, 2. run defragfs (to get enough contiguous free space), 3. restore the file)
the reorgvg command will reorganize the partitions according to the allocation policy

-----------------------------------

SOME HINTS FOR POOR IO:

AT DISK LEVEL:
    -fragmentation
    -queue_depth

AT LV LEVEL:
    -MWC writes (a quick check with lslv is shown below this list)
    -Write verify enabled
    -inter/intra policy settings (intra policy=center is good)
    -hot spots (too many lvs on a disk)
    -many IO operations while the jfslog is on the same disk (solution: place the jfslog on a separate disk or add additional jfslogs)
    -inode lock: a file in an fs is a single inode; while the file is being written the inode is locked, so no read operation on it can occur
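To check the MWC and write verify settings mentioned above, lslv can be used; a hedged example (lv01 is a hypothetical name):

# lslv lv01 | grep -E "MIRROR WRITE CONSISTENCY|WRITE VERIFY"    <--both settings add write overhead when enabled on a busy lv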

9 comments:

  1. Hello!

    I'd like to know at what I/O disk busy% the end user will notice the server being slow?

    thanks!

    Replies
    1. Hi, there are many factors which can influence disk performance, but if a disk is constantly at 100% this can be a sign of problems. (As long as it is below 100%, in my opinion there are still free resources on the disk side.)

  2. Hello,
    I am trying to run the "trace" utility to log all I/O to a scsi disk (reads and writes), but I can't make sense of the data - I use fio to generate 8/16/64 kb sized iops but when going over the log (logged with trace -a -J diskdd and parsed with trcrpt) I see various b_bcount sizes like 1000/5000/11000 but they do not correspond to the sizes I expect to see... Is this related to I/O coalescing? Am I missing something here? Any help will be greatly appreciated...
    Thanks

    Replies
    1. Hi, I'm afraid I can't help you with that... I have never worked with this utility... but perhaps someone else can.
      (If you find out something and could share it with us, that would be great.)

  3. Hi,
    We have AIX servers connected to shared storage (DS800). Recently we replaced a faulty Cisco SAN switch module. After reconnecting we are getting errors in errpt. There is no path failure, all the hardware is fine. I guess these errors are generated by the memory cache... How can this be fixed?
    Below is one of the errors from errpt:

    LABEL: SC_DISK_ERR4
    IDENTIFIER: DCB47997

    Date/Time: Fri Aug 23 20:58:34 GMT+05:30 2013
    Sequence Number: 109790
    Machine Id: 00F6C61E4C00
    Node Id: *******
    Class: H
    Type: TEMP
    WPAR: Global
    Resource Name: hdisk3
    Resource Class: disk
    Resource Type: mpioosdisk


    Description
    DISK OPERATION ERROR

    Probable Causes
    MEDIA
    DASD DEVICE

    User Causes
    MEDIA DEFECTIVE

    Recommended Actions
    FOR REMOVABLE MEDIA, CHANGE MEDIA AND RETRY
    PERFORM PROBLEM DETERMINATION PROCEDURES

    Failure Causes
    MEDIA
    DISK DRIVE

    Recommended Actions
    FOR REMOVABLE MEDIA, CHANGE MEDIA AND RETRY
    PERFORM PROBLEM DETERMINATION PROCEDURES


    Replies
    1. Recommended Actions:
      FOR REMOVABLE MEDIA, CHANGE MEDIA AND RETRY
      PERFORM PROBLEM DETERMINATION PROCEDURES

      So the system behaves as if the disk had disappeared. If you replaced a faulty device (a controller or anything other than the disk itself on the storage system), the drive should have kept working.

      But if you clean up the unknown drives with exportvg and rmdev -dl hdiskX (inactive drives without FC connections), you need to run cfgmgr and importvg hdiskX, or cfgmgr + smitty importvg (see man importvg).

      (For this you need to know which drive was associated with which VG name, so that the export runs without errors.
      Typically this information can be displayed with lspv, running it before and after any LVM changes.
      #lspv
      hdisk0 00c4afe7c75c9db5 rootvg active)

    2. This is on the condition that everything on the storage side and the FC fabric is set up (zoning, etc.) and you are sure the correct drive is assigned to your server.

  4. Hello AIX,
    this is a great topic which I found very useful.

    I have a question about IO on disks: how can I know the current IO requests on a disk?
