

Network - SEA, Virtualization

viostat            monitors storage performance
seastat            statistics about the SEA (accounting must first be enabled on the SEA: chdev -dev ent10 -attr accounting=enabled,
                   then: seastat -d ent10)


topas -E:

#  ifconfig <SEA/en8> up            <--SEA interface must be in up state

# topas -E                          <--topas -E will show detailed info
Topas Monitor for host: aix10-vios1    Interval: 2    Thu Nov 15 10:51:27 2012
Network                                KBPS   I-Pack   O-Pack    KB-In   KB-Out
ent8 (SEA)                              2.3     13.5      6.0      1.5      0.9
  |\--ent3 (VETH CTRL)                  2.4      3.5      3.5      1.2      1.2
  |\--ent2 (EC PHYS)                    1.3      9.0      1.5      0.9      0.4
  |  |\--ent1 (PRIM)                    0.9      4.5      1.5      0.5      0.4
  |   \--ent0 (PRIM)                    0.4      4.5      0.0      0.4      0.0
  |\--ent4 (VETH)                       0.6      1.0      4.5      0.1      0.5
   \--ent5 (VETH)                       0.4      3.5      0.0      0.4      0.0
en6                                     0.0      0.0      0.0      0.0      0.0
lo0                                     0.0      0.0      0.0      0.0      0.0


Important details about Virt. Ethernet Adapters:

- Virtual Ethernet by default cannot provide 10Gb. A quote from the VIO Wiki: "The transmission speed of Virtual Ethernet adapters is in the range of 1-3 Gigabits per second, depending on the transmission (MTU) size." It supports MTU sizes 1500, 9000 and additionally 65280. A SEA (on top of a 2x10Gb LACP Eth. Channel) with 1 Virt. Trunk Adapter (with default settings) will not provide 20Gb/s bandwidth.

- On a Power 8 server with Jumbo Frames, I could reach max. 13 Gb/s on a SEA with 1 Virt. Adapter. (Without tuning, network speed was 2-3 Gb/s.) If you want to use 20Gb/s (Eth. Channel with 2x10Gb LACP) you need at least 2 active Virt. Adapters in a SEA. (In SEA load-sharing mode at least 4 trunk adapters, with 2 active on each VIOS.)

- When using Virt. Ethernet and SEA, the server must have enough free CPU (with 10Gb it will use extra CPU).
(Virtual Ethernet traffic is generally heavier than virtual SCSI traffic, and Virtual Ethernet connections take up more CPU cycles than physical Ethernet adapters. The reason is that modern physical Ethernet adapters offload some work from the system CPU, for example checksum computation and verification, and packet reassembly.)


Network Tuning:
(These values are recommended by Gareth Coats)

- SEA returns the maximum aggregate bandwidth when two virtual network adapters are configured to it rather than one. It is not possible to achieve the line speed of 10Gbit Ethernet through a SEA with a single attached virtual network adapter; it is possible with two virtual network adapters.
- An individual process communicating with a remote process will see an average of 450MB/sec bandwidth through VIO servers using 10Gbit Ethernet
- For an LPAR to achieve the maximum throughput over virtual networks the LPAR must communicate with 2 (or more) virtual networks and use more than one process on each virtual network.

Using the correct configuration options is essential, in each LPAR:
- enable largesend
- use the maximum available mtu size (64k)
- ensure each 10Gbit Ethernet adapter has largesend, flow_control, large_receive and jumbo_frames enabled
- Etherchannel adapters should use 8023ad mode, src_dst_port hash mode and have jumbo frames enabled
- SEA adapter itself needs to have largesend, large_receive and jumbo_frames enabled
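A sketch of the checklist above as chdev commands. All device names here (ent0/ent1 physical 10Gb ports, ent5 Eth. Channel, ent6 SEA on the VIOS, en0 interface in the client LPAR) are placeholders, and some attribute names vary by adapter driver, so verify everything with lsdev/lsattr before running:

```shell
# Placeholder device names; verify with lsdev/lsattr first.
# -P defers the change until the device is reconfigured or the system is rebooted.
chdev -l ent0 -a large_send=yes -a large_receive=yes -a flow_ctrl=yes -a jumbo_frames=yes -P
chdev -l ent1 -a large_send=yes -a large_receive=yes -a flow_ctrl=yes -a jumbo_frames=yes -P
# Etherchannel: 802.3ad mode, src_dst_port hashing, jumbo frames
chdev -l ent5 -a mode=8023ad -a hash_mode=src_dst_port -a use_jumbo_frame=yes -P
# SEA itself: largesend, large_receive, jumbo frames
chdev -l ent6 -a largesend=1 -a large_receive=yes -a jumbo_frames=yes -P
# In each client LPAR: largesend (mtu_bypass) and the maximum available MTU
chdev -l en0 -a mtu_bypass=on -a mtu=65390 -P
```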


Jumbo frames:

With Jumbo frames turned on, every device involved in the network communication has to be able to handle the increased packet size (network switch, physical adapters, Eth. Channel, client LPAR). You cannot mix MTU sizes within a network unless you have a suitable router which can fragment large packets to a smaller MTU size if needed (consult with the network team).

Virtual Ethernet adapters support an MTU of 64K as well, which can be a huge benefit with 10Gbit adapters. These packets will automatically be divided into the physical MTU before leaving the hardware. (The VIO layer will handle the conversion to and from the MTU of 9000.)

The usable MTU size can depend on the device, so if 65280 does not work, try 65536 (the full 64K), 65394 (64K minus overhead), 65390 (64K minus VLAN overhead), or the usual jumbo frame MTU=9000.
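One way to find out which of these MTU values the device accepts is simply to try them in order; a sketch (en0 is a placeholder interface name, and this should be done in a maintenance window as the interface settings change):

```shell
# Try the candidate 64K MTU values in turn; stop at the first one
# the driver accepts. en0 is a placeholder interface name.
for mtu in 65280 65536 65394 65390 9000; do
    chdev -l en0 -a mtu=$mtu && break
done
```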

When mixed MTU size (1500 and 9000) traffic goes through a Virt. adapter in a SEA, bandwidth drops down to MTU 1500. (So make sure LPARs which use the same Virt. Adapter in a SEA have the same MTU size setting.)


Flow Control:

With 10 Gbit Ethernet it is very useful to turn on flow control. Flow control prevents re-transmissions (and the resulting timeouts and delays) by stopping transmissions at the source when any buffer on the path approaches overflow. (Otherwise at high bandwidth it is very easy to completely fill buffers on switches and adapters so that transmitted packets are dropped.)

Check the 10G adapters, and ask the network team about this feature at the switch side.

$ lsdev -dev ent4 -attr | grep flow
flow_ctrl       yes               Request flow control
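If flow_ctrl shows no, it can be enabled on the physical adapter (ent4 as in the example above; the switch port needs flow control enabled too):

```shell
# Enable flow control on the physical 10G adapter; -P defers the change
# until the adapter is reconfigured or the system is rebooted.
chdev -l ent4 -a flow_ctrl=yes -P
```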


Large Send, Large Receive:

If large send and large receive are turned on, the TCP stack can build a message of up to 64 KB and send it in one call (with 1500-byte packets that would take 44 calls).
Turning these on can increase network throughput massively.
(One test showed 1Gb/s without largesend and 3.8Gb/s with largesend, with much lower CPU usage in the sender LPAR and in the sending VIOS, all with MTU 1500 and without jumbo frames.)
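The "44 calls" figure is just the 64 KB message divided by the 1500-byte MTU, rounded up:

```shell
# Sends needed for a 64 KB message at MTU 1500 (vs. a single call with largesend)
msg=$((64 * 1024))                   # 65536-byte message
mtu=1500
calls=$(( (msg + mtu - 1) / mtu ))   # integer division, rounded up
echo $calls                          # prints 44
```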

Physical adapters should have large_send (and large_receive) on by default:
#lsattr -El ent0 |grep large
large_receive yes              Enable receive TCP segment aggregation
large_send    yes              Enable hardware Transmit TCP segmentation

For a SEA, large_receive is turned off by default, which disables this feature on the underlying physical adapters as well.
(To use this feature, it should be turned on at the SEA too):
# lsattr -El ent11 | grep large
large_receive yes        Enable receive TCP segment aggregation
largesend     1         Enable Hardware Transmit TCP Resegmentation

On the AIX LPAR side, largesend is called mtu_bypass:
# lsattr -El en0 | grep mtu_bypass
mtu_bypass    on              Enable/Disable largesend for virtual Ethernet
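Putting the above together, a sketch of turning the feature on at each layer (ent11 and en0 are the placeholder names from the examples above):

```shell
# On the VIOS: turn on large_receive (off by default) and largesend on the SEA
chdev -l ent11 -a large_receive=yes -a largesend=1
# On the AIX LPAR: largesend is the mtu_bypass attribute of the interface
chdev -l en0 -a mtu_bypass=on
```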

Some cautions:
- large send and large receive provide benefits only when using large packets (>1500 bytes).
- as large receive aggregates multiple incoming packets, it can have a negative impact on latency (higher application latency, more re-transmissions, or packet drops as an extreme example).
(Normally large receive works fine in mixed workload environments with moderate network demands.)
- a receiver with large receive enabled can, for example, communicate with a sender where large send is disabled, but at a lower throughput and with more CPU overhead on the sender side.


ISNO (Interface Specific Network Options)

Some parameters (tcp_sendspace, tcp_recvspace, ...) have been added for each network interface and are effective only for TCP (not UDP) connections. AIX sets default values for these, for both MTU 1500 and jumbo frame mode (MTU 9000), which provide good performance. Values set manually for an individual interface take precedence over the systemwide values set with the no command.

Default values set by AIX:

# ifconfig -a
        inet netmask 0xffffff00 broadcast
         tcp_sendspace 262144 tcp_recvspace 262144 rfc1323 1

Default TCP settings are usually sufficient, but if the TCP send, receive and/or rfc1323 values are set, they should be changed to match the defaults above, unless the settings on the adapter are larger.

If you are setting the tcp_recvspace value to greater than 65536, set the rfc1323 value to 1 on each side of the connection. If you do not set the rfc1323 value on both sides of the connection, the effective value for the tcp_recvspace tunable will be 65536.
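As a sketch, the per-interface (ISNO) and systemwide ways of setting these values (en0 is a placeholder; where both are set, the ISNO value wins):

```shell
# Per-interface (ISNO) settings; these override the systemwide "no" values
chdev -l en0 -a tcp_sendspace=262144 -a tcp_recvspace=262144 -a rfc1323=1
# Systemwide defaults, made persistent across reboots with -p
no -p -o tcp_sendspace=262144 -o tcp_recvspace=262144 -o rfc1323=1
```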


No Resource Error:

For high-speed traffic the necessary resources should be available. Examples of these resources are the buffer spaces of different sizes: tiny, small, medium... If these buffers are not available during network traffic, we can get "No Resource Errors" and other errors as well (Packets Dropped, Hypervisor Receive Failures).

# entstat -d ent1

ETHERNET STATISTICS (ent1) :                                          
Device Type: Virtual I/O Ethernet Adapter (l-lan)                    
Hardware Address: 41:ca:14:e7:26:9b                                  
Elapsed Time: 12 days 2 hours 3 minutes 31 seconds                  
Transmit Statistics:       Receive Statistics:                        
--------------------       -------------------                        
Packets: 5912589961        Packets: 26139812411                      
Bytes: 712365989202        Bytes: 712351516630458                    
Interrupts: 0              Interrupts: 6812561727                    
Transmit Errors: 0         Receive Errors: 0                          
Packets Dropped: 0         Packets Dropped: 81212309      <--attention needed      
Max Collision Errors: 0    No Resource Errors: 16113801   <--attention needed
Hypervisor Send Failures: 0                                          
  Receiver Failures: 0                                                
  Send Errors: 0                                                      
Hypervisor Receive Failures: 16113801                      <--attention needed

For Virtual Ethernet adapters the above errors can be caused by incorrect buffer allocation:
Min Buffers: the number of buffers initially provided by the server (number of pre-allocated buffers)
Max Buffers: the absolute maximum (upper limit) of allocated buffers
Max Allocated: the highest number of buffers ever allocated in the past

The buffer allocation history can be checked at the end of the entstat output:

Receive Information                                                  
  Receive Buffers                                                    
    Buffer Type              Tiny    Small   Medium    Large     Huge
    Min Buffers               512      512      128       24       24
    Max Buffers              2048     2048      256       64       64
    Allocated                 513      535      148       28       64
    Registered                512      510      127       24       13
      Max Allocated           576      951      133       64       64   <--all of them are above Min; Large and Huge also reached Max
      Lowest Registered       502      502       64       12       11

Initially the server provides the value of Min Buffers (pre-allocated), and if network traffic later demands more, a so-called "post-allocation" will occur.
This action takes time and can negatively affect response time (especially at high-speed workloads). After this, Max Allocated will show a new, higher value.
If Max Allocated reaches the value of Max Buffers, it is a hint of a latency and throughput bottleneck. (It probably means the network traffic would need more buffers, but the server cannot provide more, as it cannot go above the value of Max Buffers.)

In short:
- it is not optimal if Max Allocated is above Min Buffers
- it is even worse if Max Allocated has reached the value of Max Buffers

The situation can be solved by tuning the buffers on all bridging Virt. Adapters configured for the SEA.

Commands to change these:
chdev -l entX -a max_buf_huge=128 -P
chdev -l entX -a min_buf_huge=64 -P
chdev -l entX -a max_buf_large=128 -P
chdev -l entX -a min_buf_large=64 -P
chdev -l entX -a max_buf_medium=512 -P
chdev -l entX -a min_buf_medium=256 -P
chdev -l entX -a max_buf_small=4096 -P
chdev -l entX -a min_buf_small=2048 -P
chdev -l entX -a max_buf_tiny=4096 -P
chdev -l entX -a min_buf_tiny=2048 -P


Some additional considerations:

-Data Cache Block Flush (dcbflush): This allows the virtual Ethernet device driver to flush the processor's data cache of any data after it has been received. It increases CPU utilization but also increases throughput: # chdev -l entX -a dcbflush_local=yes -P (needs a reboot to take effect)

-Dog thread (Thread): By enabling the dog threads feature, the driver queues the incoming packet to the thread and the thread handles calling IP, TCP, and the socket code. Enable this parameter as your LPAR grows in CPU resources. # ifconfig enX thread or # chdev -l enX -a thread=on

-Disabled Threading on SEA: Threaded mode helps ensure that virtual SCSI and the Shared Ethernet Adapter can share the processor resource appropriately. IBM documentation mentions this only when using VSCSI, so if you are only using NPIV, consider disabling this parameter. The performance gain can be between 16-20 percent for MTU 1500 and 31-38 percent for jumbo frames. To disable threading, use chdev on the SEA with the -attr thread=0 option.


Anonymous said...

Dear Team, Balazs,

We use a virtual server with shared uncapped CPU, but with an entitlement 2 times higher than peak usage, and with vpm_xvcpus=2 to avoid too much folding.
The buffer Maximum and Minimum values are set to the maximum possible values, like chdev -l ent2 -a buf_mode=max_min

Even though min/max buffers are at the maximum on both the VIO and LPAR side, large send is used, and the VIO and LPAR are over-configured with CPU and memory, we still have Packets Dropped / No Resource Errors / Receive Failures:

Transmit Statistics:           Receive Statistics:
--------------------           -------------------
Transmit Errors: 0             Receive Errors: 0
Packets Dropped: 0             Packets Dropped: 590611
Max Collision Errors: 0        No Resource Errors: 590611
Hypervisor Send Failures: 126
  Receiver Failures: 126
  Send Errors: 0
Hypervisor Receive Failures: 590611

Receive Information
  Receive Buffers
    Buffer Type              Tiny    Small   Medium    Large     Huge
    Min Buffers              4096     4096      512      128      128
    Max Buffers              4096     4096      512      128      128
    Allocated                4096     4096      512      128      128
    Registered               4096     4096      512      128      128
    Max Allocated            4096     4096      512      128      128

What is your suggestion?

aix said...

It looks like your historical Max Allocated has reached the Max Buffers values, which could be a sign of a bottleneck. I would tune these buffers further.

Anonymous said...

Hi, okay, but 4096 is the maximum limit possible to set for the tiny and small buffers. What should we do if we reach those limits?

aix said...

Hi, in this case I have no idea. Would it be possible to ask IBM? (I am curious what their recommendation is in this case.)

Anonymous said...

Same here. If someone knows what to do when the limits are reached...