Network Tuning and Testing

On September 18, 2007, Dimitri Katramatos, Kunal Shroff and Shawn McKee tried to test and tune the following machines at BNL and Michigan:

  • BNL
    • dct00.usatlas.bnl.gov
  • UMich
    • umfs02.grid.umich.edu
    • dq2.aglt2.org
    • umfs05.aglt2.org

We used the following tools to help test/debug things:
  • iperf (v2.0.2) --- Network testing tool which is client/server based. Can test TCP or UDP
  • ethtool (v3) --- Network Interface Card (NIC) tool which allows standardized probing and setting of NIC parameters
  • ifconfig --- Used to configure network interfaces in Linux
  • tracepath --- Used to determine the network path and MTU size allowed between two hosts
  • sysctl --- Used to set kernel parameters in Linux
  • modinfo --- Used to get module information in Linux

Initial Information Gathering

We started our testing by gathering some information about the end system configurations and the network path. At the Michigan end we started with umfs05.aglt2.org. The tracepath to dct00.usatlas.bnl.gov shows:
[umfs05:~]# tracepath dct00.usatlas.bnl.gov
 1:  umfs05.aglt2.org (192.41.230.25)                       0.065ms pmtu 9000
 1:  vl4001-nile.aglt2.or.230.41.192.in-addr.arpa (192.41.230.2)   0.958ms 
 2:  r04chi-te-1-4-ptp-umich.ultralight.org (192.84.86.229)   6.469ms 
 3:  chi-ultralight.es.net (198.125.140.205)                6.648ms 
 4:  chislsdn1-chislmr1.es.net (134.55.219.25)            asymm  5   6.208ms 
 5:  chiccr1-chislsdn1.es.net (134.55.207.33)             asymm  6   6.370ms 
 6:  aofacr1-chicsdn1.es.net (134.55.218.94)              asymm  7  33.318ms 
 7:  bnlmr1-aoacr1.es.net (134.55.217.57)                  35.958ms 
 8:  bnlsite-bnlmr1.es.net (198.124.216.178)               35.700ms 
 9:  bnlsite-bnlmr1.es.net (198.124.216.178)              asymm  8  35.736ms pmtu 1500
10:  dct00.usatlas.bnl.gov (192.12.15.8)                  asymm  9  34.851ms reached
     Resume: pmtu 1500 hops 10 back 9
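
When checking several paths it can help to pull just the path MTU out of the tracepath output. The following is a minimal sketch, assuming the "Resume: pmtu N" summary line format shown above; the helper name path_mtu is made up for illustration and is not part of tracepath.

```shell
#!/bin/sh
# Hypothetical helper: print the path MTU from tracepath's "Resume:" line.
# Reads tracepath output on stdin.
path_mtu() {
    awk '/Resume:/ {
        for (i = 1; i < NF; i++)
            if ($i == "pmtu") { print $(i + 1); exit }
    }'
}

# Sample summary line copied from the umfs05 -> dct00 run above.
printf '     Resume: pmtu 1500 hops 10 back 9\n' | path_mtu   # prints 1500
```

Against a live path this would be used as: tracepath dct00.usatlas.bnl.gov | path_mtu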

The NIC used on umfs05 was eth2 (a Myricom 10GE copper (CX4) NIC). We can get the OS-level info about this network device with ifconfig:
[umfs05:~]# ifconfig eth2

eth2    Link encap:Ethernet  HWaddr 00:60:DD:47:7D:71  
          inet addr:192.41.230.25  Bcast:192.41.230.255  Mask:255.255.255.0
          inet6 addr: fe80::260:ddff:fe47:7d71/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
          RX packets:417188 errors:0 dropped:0 overruns:0 frame:0
          TX packets:338501 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:55153448 (52.5 MiB)  TX bytes:36248433 (34.5 MiB)
          Interrupt:246 

Notice there are no errors, dropped, overruns, frame or carrier failures shown. It is important to watch for non-zero values in these counters because they can indicate hardware/software problems which will impact your network performance.
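
Watching these counters by eye is error-prone on a busy host. The sketch below lists only the non-zero error counters. It assumes the older net-tools ifconfig layout shown on this page ("RX packets:N errors:N dropped:N ..."); newer ifconfig versions print a different layout and would need a different pattern. The function name nonzero_counters is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print every non-zero RX/TX error counter found in
# old-style ifconfig output read from stdin. The packet counts themselves
# are skipped; only errors, dropped, overruns, frame and carrier are checked.
nonzero_counters() {
    awk '$1 == "RX" || $1 == "TX" {
        if ($2 !~ /^packets:/) next          # skip the "RX bytes:" line
        for (i = 3; i <= NF; i++) {
            split($i, kv, ":")
            if (kv[2] + 0 != 0) print $1, kv[1], kv[2]
        }
    }'
}

# Sample lines copied from the dct00 eth1 output further down this page.
printf '%s\n' \
    '          RX packets:265555257 errors:0 dropped:178067 overruns:0 frame:0' \
    '          TX packets:410537836 errors:0 dropped:0 overruns:0 carrier:0' |
    nonzero_counters                         # prints: RX dropped 178067
```

On a live system: ifconfig eth2 | nonzero_counters (no output means all error counters are zero).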

To get more details we can use the ethtool utility (available on most Linux systems or installable via RPM (YUM or Up2date)). The ethtool utility has quite a few options. Run it with no arguments to get a list. The primary ones we are interested in are the -i, -k, -g and -S options (driver information, offload parameters, ring buffer settings and statistics). For umfs05 eth2 we get:

[umfs05:~]# ethtool -k eth2
Offload parameters for eth2:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on

[umfs05:~]# ethtool -i eth2
driver: myri10ge
version: 1.3.0
firmware-version: 1.4.14 -- 2007/03/20 22:07:22 m
bus-info: 0000:10:00.0

[umfs05:~]# ethtool -g eth2
Ring parameters for eth2:
Pre-set maximums:
RX:             512
RX Mini:        512
RX Jumbo:       0
TX:             512
Current hardware settings:
RX:             512
RX Mini:        512
RX Jumbo:       0
TX:             512

This lets us know the offload settings (all enabled), the driver version and firmware version of the NIC (up to date), and the NIC buffer settings (at maximum). There are no needed changes for these settings.

Before we begin testing it is important to get the current NIC statistics from ethtool and save them to a file:
ethtool -S eth2 >initial_eth2_stats.log

We can then compare them after our tests to see if any specific error counters were incremented.
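
The before/after comparison can be sketched as follows, assuming both files hold "name: value" lines as printed by ethtool -S. The function name changed_counters and the counter names in the sample data are invented for illustration; real counter names depend on the driver.

```shell
#!/bin/sh
# Hypothetical helper: show counters whose value differs between two saved
# "ethtool -S" dumps. $1 = file saved before the test, $2 = file saved after.
changed_counters() {
    awk -F': *' '
        NR == FNR { before[$1] = $2; next }      # first file: remember values
        ($1 in before) && $2 != before[$1] {     # second file: report changes
            name = $1
            sub(/^[ \t]+/, "", name)             # strip leading indentation
            print name ": " before[$1] " -> " $2
        }' "$1" "$2"
}

# Illustrative data only; these counter names are made up.
before=$(mktemp); after=$(mktemp)
printf 'NIC statistics:\n     rx_packets: 100\n     rx_dropped: 0\n' > "$before"
printf 'NIC statistics:\n     rx_packets: 250\n     rx_dropped: 7\n' > "$after"
changed_counters "$before" "$after"
rm -f "$before" "$after"
```

With the log saved above, a second dump taken after the tests (for example to a file such as final_eth2_stats.log, a name chosen here for illustration) would be compared with: changed_counters initial_eth2_stats.log final_eth2_stats.log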

The same type of information was gathered for dct00.usatlas.bnl.gov and is shown below. For tracepath:

[root@dct00 .ssh]# tracepath umfs05.aglt2.org
 1:  dct00.usatlas.bnl.gov (192.12.15.8)                    0.161ms pmtu 1500
 1:  hsrp.usatlas.bnl.gov (192.12.15.24)                    0.428ms 
 2:  bnlmr1-bnlsite.es.net (198.124.216.177)                0.665ms 
 3:  aoacr1-bnlmr1.es.net (134.55.217.58)                 asymm  4   2.292ms 
 4:  chiccr1-aofacr1.es.net (134.55.218.93)               asymm  5  29.164ms 
 5:  chislsdn1-chicr1.es.net (134.55.207.34)              asymm  6  29.404ms 
 6:  chislmr1-chislsdn1.es.net (134.55.219.26)             29.536ms 
 7:  198.125.140.206 (198.125.140.206)                     29.663ms 
 8:  192.84.86.230 (192.84.86.230)                         35.169ms 
 9:  umfs05.aglt2.org (192.41.230.25)                      34.915ms !H
     Resume: pmtu 1500

Note the 1500 MTU limitation on this path. Also, the line for umfs05.aglt2.org shows !H (host unreachable). This indicates a firewall is interfering with the ICMP packets.

The ifconfig info for eth1 on dct00 is:
[root@dct00 .ssh]# ifconfig eth1
eth1      Link encap:Ethernet  HWaddr 00:04:23:E1:8E:A2  
          inet addr:192.12.15.8  Bcast:192.12.15.255  Mask:255.255.255.0
          inet6 addr: fe80::204:23ff:fee1:8ea2/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:265555257 errors:0 dropped:178067 overruns:0 frame:0
          TX packets:410537836 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:10000 
          RX bytes:4060269971 (3.7 GiB)  TX bytes:2201437530 (2.0 GiB)
          Base address:0xdcc0 Memory:dfbe0000-dfc00000 

Here we note there are a large number of dropped received (RX) packets. Also the txqueuelen has been increased to 10000 from a default of 1000.

The ethtool info from eth1:
[root@dct00 .ssh]# ethtool -k eth1
Offload parameters for eth1:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on

[root@dct00 .ssh]# ethtool -i eth1
driver: e1000
version: 7.0.33-k2-NAPI
firmware-version: N/A
bus-info: 0000:03:07.0

[root@dct00 ~]# ethtool -G eth1
no ring parameters changed, aborting
[root@dct00 ~]# ethtool -g eth1
Ring parameters for eth1:
Pre-set maximums:
RX:             4096
RX Mini:        0
RX Jumbo:       0
TX:             4096
Current hardware settings:
RX:             256
RX Mini:        0
RX Jumbo:       0
TX:             256

Two comments here: the version of the e1000 driver is a few minor versions back from the current driver (7.0.33 vs 7.3.15) and the ring buffer settings are relatively small versus the maximum allowable (256 vs 4096).
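
The ring buffer headroom can be read straight from the ethtool -g output. The sketch below summarizes current versus maximum RX and TX ring sizes; the function name ring_usage is hypothetical. The commented-out ethtool -G line shows one possible way to raise the rings to their maximum; it is a suggestion only and is not a change recorded on this page.

```shell
#!/bin/sh
# Hypothetical helper: print "RX current/max" and "TX current/max" from
# "ethtool -g" output read on stdin.
ring_usage() {
    awk '
        /^Pre-set maximums:/          { section = "max" }
        /^Current hardware settings:/ { section = "cur" }
        $1 == "RX:" || $1 == "TX:"    { val[section, substr($1, 1, 2)] = $2 }
        END {
            print "RX", val["cur", "RX"] "/" val["max", "RX"]
            print "TX", val["cur", "TX"] "/" val["max", "TX"]
        }'
}

# Output copied from the dct00 eth1 listing above.
ring_usage <<'EOF'
Ring parameters for eth1:
Pre-set maximums:
RX:             4096
RX Mini:        0
RX Jumbo:       0
TX:             4096
Current hardware settings:
RX:             256
RX Mini:        0
RX Jumbo:       0
TX:             256
EOF

# Possible follow-up (not run here, requires root):
#   ethtool -G eth1 rx 4096 tx 4096
```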

Initial iperf Tests

Our initial iperf tests used umfs05 as the server and dct00 as the client. On the umfs05 side:
[umfs05:~]# iperf -s -w4M -i5
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size:   256 KByte (WARNING: requested 4.00 MByte)
------------------------------------------------------------
[  4] local 192.41.230.25 port 5001 connected with 192.12.15.8 port 39932
[  4]  0.0- 5.0 sec  23.3 MBytes  39.1 Mbits/sec
[  4]  5.0-10.0 sec  25.2 MBytes  42.2 Mbits/sec
[  4] 10.0-15.0 sec  25.1 MBytes  42.1 Mbits/sec
[  4] 15.0-20.0 sec  25.1 MBytes  42.2 Mbits/sec
[  4] 20.0-25.0 sec  25.1 MBytes  42.0 Mbits/sec
[  4] 25.0-30.0 sec  25.1 MBytes  42.1 Mbits/sec
[  4] 30.0-35.0 sec  25.2 MBytes  42.3 Mbits/sec
[  4]  0.0-38.5 sec    192 MBytes  41.8 Mbits/sec

Note the very poor performance (umfs05 has a 10GE NIC, the path is fully 10GE and the remote host has a 1GE NIC). The clue is in the TCP window size: 256 KByte. TCP can have at most one window of data in flight per round trip, so with a round-trip time of about 35 ms this window limits the achievable bandwidth to roughly 60 Mbits/sec. We need to explore the stack settings on umfs05.
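
The arithmetic behind that limit is simply window size divided by round-trip time. A minimal sketch, assuming a 35 ms RTT (taken from the tracepath output above) and a window of 256 KBytes; the function name is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: upper bound on single-stream TCP throughput.
# $1 = TCP window in bytes, $2 = round-trip time in milliseconds.
window_limit_mbps() {
    awk -v w="$1" -v rtt="$2" \
        'BEGIN { printf "%.1f\n", w * 8 / (rtt / 1000) / 1e6 }'
}

window_limit_mbps 262144 35    # 256 KBytes over 35 ms: prints 59.9
```

The measured 42 Mbits/sec sits below this ceiling, which is consistent with the window being the bottleneck.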

We can use the sysctl command to see/set kernel parameters. The file /etc/sysctl.conf can persist settings across reboots. For umfs05 it was seemingly set OK:
[umfs05:~]# sysctl -p
net.ipv4.ip_forward = 0
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.default.accept_source_route = 0
kernel.sysrq = 0
kernel.core_uses_pid = 1
kernel.pid_max = 4194303
net.ipv4.tcp_rmem = 4096 87380 20000000
net.ipv4.tcp_wmem = 4096 87380 20000000

The -p option tells sysctl to (re)apply the settings in /etc/sysctl.conf. The maximum allowed TCP buffer sizes are 20 MBytes, so why are we limited to 256 KBytes? It turns out we were limited by two other parameters, the net.core socket buffer maximums, which needed to be increased. I added the following to /etc/sysctl.conf and reran sysctl -p:

# maximum receive socket buffer size, default 131071
net.core.rmem_max = 20000000
# maximum send socket buffer size, default 131071
net.core.wmem_max = 20000000
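
Why these two parameters matter can be sketched numerically. The assumptions here are that iperf's -w option requests the buffer through setsockopt(), that Linux caps such a request at net.core.rmem_max / net.core.wmem_max, and that the kernel then doubles the stored value to allow for bookkeeping overhead. The function name effective_buffer is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: buffer size an application ends up with.
# $1 = bytes requested via setsockopt(), $2 = net.core.rmem_max or wmem_max.
effective_buffer() {
    awk -v req="$1" -v max="$2" \
        'BEGIN { v = (req < max) ? req : max; print v * 2 }'
}

effective_buffer 4194304 131071      # old default maximum: prints 262142
effective_buffer 4194304 20000000    # raised maximum:      prints 8388608
```

Under these assumptions a 4 MByte request against the 131071 default yields about 256 KBytes, and the same request against the raised limit yields 8 MBytes.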

Rerunning iperf on umfs05 then gives a much better result:
[umfs05:~]# iperf -s -w4M -i5
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 8.00 MByte (WARNING: requested 4.00 MByte)
------------------------------------------------------------
[  4] local 192.41.230.25 port 5001 connected with 192.12.15.8 port 39933
[  4]  0.0- 5.0 sec    143 MBytes    239 Mbits/sec
[  4]  5.0-10.0 sec    269 MBytes    451 Mbits/sec
[  4] 10.0-15.0 sec    458 MBytes    769 Mbits/sec
[  4] 15.0-20.0 sec    561 MBytes    942 Mbits/sec
[  4] 20.0-25.0 sec    561 MBytes    942 Mbits/sec
[  4] 25.0-30.0 sec    561 MBytes    942 Mbits/sec
[  4] 30.0-35.0 sec    561 MBytes    942 Mbits/sec
[  4] 35.0-40.0 sec    561 MBytes    942 Mbits/sec
[  4] 40.0-45.0 sec    561 MBytes    942 Mbits/sec
[  4] 45.0-50.0 sec    561 MBytes    942 Mbits/sec
[  4] 50.0-55.0 sec    561 MBytes    942 Mbits/sec
[  4] 55.0-60.0 sec    561 MBytes    942 Mbits/sec
[  4]  0.0-60.0 sec  5.78 GBytes    828 Mbits/sec

Once the window has ramped up, the steady 942 Mbits/sec shows we are now fully using the 1GE link. Note that the reported window (8.00 MByte) is double the 4 MByte requested: the Linux kernel doubles the socket buffer value passed to setsockopt() to allow for bookkeeping overhead, so you get double the buffer you ask for. This is actually just about right: doubling the calculated TCP buffer size based upon the round-trip time (RTT) seems to give the best performance.
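
The "calculated TCP buffer size" is the bandwidth-delay product: link rate multiplied by RTT. A minimal sketch, assuming a 1 Gbit/sec bottleneck and the 35 ms RTT seen in the tracepath output; the function name is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: bandwidth-delay product in bytes.
# $1 = link rate in Mbits/sec, $2 = round-trip time in milliseconds.
bdp_bytes() {
    awk -v mbps="$1" -v rtt="$2" \
        'BEGIN { printf "%d\n", mbps * 1e6 / 8 * rtt / 1000 }'
}

bdp_bytes 1000 35    # prints 4375000, i.e. about 4.4 MBytes
```

So the 4 MByte request is close to the calculated value for this path, and the 8 MByte buffer actually granted is roughly double it.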

We can now test using umfs05 as a client in iperf:
[umfs05:~]# iperf -c dct00.usatlas.bnl.gov -w4M -i2 -t60
------------------------------------------------------------
Client connecting to dct00.usatlas.bnl.gov, TCP port 5001
TCP window size: 8.00 MByte (WARNING: requested 4.00 MByte)
------------------------------------------------------------
[  3] local 192.41.230.25 port 59404 connected with 192.12.15.8 port 5001
[  3]  0.0- 2.0 sec  15.1 MBytes  63.3 Mbits/sec
[  3]  2.0- 4.0 sec  19.2 MBytes  80.5 Mbits/sec
[  3]  4.0- 6.0 sec  27.3 MBytes    115 Mbits/sec
[  3]  6.0- 8.0 sec  46.8 MBytes    196 Mbits/sec
[  3]  8.0-10.0 sec  66.7 MBytes    280 Mbits/sec
[  3] 10.0-12.0 sec  95.6 MBytes    401 Mbits/sec
[  3] 12.0-14.0 sec    135 MBytes    565 Mbits/sec
[  3] 14.0-16.0 sec  98.0 MBytes    411 Mbits/sec
[  3] 16.0-18.0 sec    131 MBytes    549 Mbits/sec
[  3] 18.0-20.0 sec    154 MBytes    647 Mbits/sec
[  3] 20.0-22.0 sec    161 MBytes    675 Mbits/sec
[  3] 22.0-24.0 sec    168 MBytes    705 Mbits/sec
[  3] 24.0-26.0 sec    176 MBytes    737 Mbits/sec
[  3] 26.0-28.0 sec    179 MBytes    750 Mbits/sec
[  3] 28.0-30.0 sec    184 MBytes    773 Mbits/sec
[  3] 30.0-32.0 sec    187 MBytes    784 Mbits/sec
[  3] 32.0-34.0 sec    191 MBytes    800 Mbits/sec
[  3] 34.0-36.0 sec    199 MBytes    833 Mbits/sec
[  3] 36.0-38.0 sec    202 MBytes    849 Mbits/sec
[  3] 38.0-40.0 sec    206 MBytes    863 Mbits/sec
[  3] 40.0-42.0 sec    214 MBytes    899 Mbits/sec
[  3] 42.0-44.0 sec    220 MBytes    923 Mbits/sec
[  3] 44.0-46.0 sec    223 MBytes    935 Mbits/sec
[  3] 46.0-48.0 sec    223 MBytes    935 Mbits/sec
[  3] 48.0-50.0 sec    226 MBytes    947 Mbits/sec
[  3] 50.0-52.0 sec    223 MBytes    935 Mbits/sec
[  3] 52.0-54.0 sec    130 MBytes    544 Mbits/sec
[  3] 54.0-56.0 sec    113 MBytes    472 Mbits/sec
[  3] 56.0-58.0 sec    159 MBytes    668 Mbits/sec
[  3] 58.0-60.0 sec    168 MBytes    705 Mbits/sec
[  3]  0.0-60.0 sec  4.43 GBytes    634 Mbits/sec

This is OK but shows some fluctuations. It is possible other "real" traffic was interfering with this test. In any case you can see that single-stream testing is able to fully utilize a 1GE path for these end-hosts after some tuning.

UDP Testing with iperf

We also did a quick test using UDP instead of TCP with iperf. There was a separate iperf instance running UDP on dct00 on port 5002 (instead of the default 5001). We used umfs05 as a client to send UDP to dct00; the rate was requested with -b950M and iperf reported sending 980 Mbits/sec:
[umfs05:~]# iperf -c dct00.usatlas.bnl.gov -w4M -i2 -t60 -b950M -u -p5002
WARNING: option -b implies udp testing
------------------------------------------------------------
Client connecting to dct00.usatlas.bnl.gov, UDP port 5002
Sending 1470 byte datagrams
UDP buffer size: 8.00 MByte (WARNING: requested 4.00 MByte)
------------------------------------------------------------
[  3] local 192.41.230.25 port 32887 connected with 192.12.15.8 port 5002
[  3]  0.0- 2.0 sec    234 MBytes    980 Mbits/sec
[  3]  2.0- 4.0 sec    234 MBytes    980 Mbits/sec
[  3]  4.0- 6.0 sec    234 MBytes    980 Mbits/sec
[  3]  6.0- 8.0 sec    234 MBytes    980 Mbits/sec
[  3]  8.0-10.0 sec    234 MBytes    980 Mbits/sec
[  3] 10.0-12.0 sec    234 MBytes    980 Mbits/sec
[  3] 12.0-14.0 sec    234 MBytes    980 Mbits/sec
[  3] 14.0-16.0 sec    234 MBytes    980 Mbits/sec
[  3] 16.0-18.0 sec    234 MBytes    980 Mbits/sec
[  3] 18.0-20.0 sec    234 MBytes    980 Mbits/sec
[  3] 20.0-22.0 sec    234 MBytes    980 Mbits/sec
[  3] 22.0-24.0 sec    234 MBytes    980 Mbits/sec
[  3] 24.0-26.0 sec    234 MBytes    980 Mbits/sec
[  3] 26.0-28.0 sec    234 MBytes    980 Mbits/sec
[  3] 28.0-30.0 sec    234 MBytes    980 Mbits/sec
[  3] 30.0-32.0 sec    234 MBytes    980 Mbits/sec
[  3] 32.0-34.0 sec    234 MBytes    980 Mbits/sec
[  3] 34.0-36.0 sec    234 MBytes    980 Mbits/sec
[  3] 36.0-38.0 sec    234 MBytes    980 Mbits/sec
[  3] 38.0-40.0 sec    234 MBytes    980 Mbits/sec
[  3] 40.0-42.0 sec    234 MBytes    980 Mbits/sec
[  3] 42.0-44.0 sec    234 MBytes    980 Mbits/sec
[  3] 44.0-46.0 sec    234 MBytes    980 Mbits/sec
[  3] 46.0-48.0 sec    234 MBytes    980 Mbits/sec
[  3] 48.0-50.0 sec    234 MBytes    980 Mbits/sec
[  3] 50.0-52.0 sec    234 MBytes    980 Mbits/sec
[  3] 52.0-54.0 sec    234 MBytes    980 Mbits/sec
[  3] 54.0-56.0 sec    234 MBytes    980 Mbits/sec
[  3] 56.0-58.0 sec    234 MBytes    980 Mbits/sec
[  3] 58.0-60.0 sec    234 MBytes    980 Mbits/sec
[  3]  0.0-60.0 sec  6.84 GBytes    980 Mbits/sec
[  3] Sent 4999799 datagrams
[  3] Server Report:
[  3]  0.0-60.0 sec  6.62 GBytes    948 Mbits/sec  0.015 ms 164192/4999798 (3.3%)
[  3]  0.0-60.0 sec  1 datagrams received out-of-order

Note that we achieved 948 Mbits/sec (the difference between the sent 980 Mbits/sec and the received 948 Mbits/sec is shown as the 3.3% loss on the second to the last line of the test above). This shows the path is able to support that level of traffic.
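
The relationship between the sent rate, the loss and the received rate can be checked with a short calculation. This is a sketch using the numbers from the server report above; the function name udp_loss is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: loss percentage and approximate delivered rate.
# $1 = datagrams lost, $2 = datagrams expected, $3 = send rate in Mbits/sec.
udp_loss() {
    awk -v lost="$1" -v total="$2" -v rate="$3" 'BEGIN {
        frac = lost / total
        printf "loss %.1f%%, delivered about %.0f Mbits/sec\n",
               frac * 100, rate * (1 - frac)
    }'
}

udp_loss 164192 4999798 980    # prints: loss 3.3%, delivered about 948 Mbits/sec
```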

-- ShawnMcKee - 19 Sep 2007
Topic revision: r9 - 16 Oct 2009, TomRockwell