Network Tuning and Testing
On September 18, 2007, Dimitri Katramatos, Kunal Shroff, and Shawn McKee tested and tuned the following machines at BNL and Michigan:
- BNL
  - dct00.usatlas.bnl.gov
- UMich
  - umfs02.grid.umich.edu
  - dq2.aglt2.org
  - umfs05.aglt2.org
We used the following tools to help test/debug things:
- iperf (v2.0.2) --- Client/server-based network testing tool; can test either TCP or UDP
- ethtool (v3) --- Network Interface Card (NIC) tool which allows standardized probing and setting of NIC parameters
- ifconfig --- Used to configure network interfaces in Linux
- tracepath --- Used to determine the network path and MTU size allowed between two hosts
- sysctl --- Used to set kernel parameters in Linux
- modinfo --- Used to get module information in Linux
We started out our testing by gathering some information about the end system configurations and the network path. At the Michigan end we started with umfs05.aglt2.org. The tracepath to dct00.usatlas.bnl.gov shows:
[umfs05:~]# tracepath dct00.usatlas.bnl.gov
1: umfs05.aglt2.org (192.41.230.25) 0.065ms pmtu 9000
1: vl4001-nile.aglt2.or.230.41.192.in-addr.arpa (192.41.230.2) 0.958ms
2: r04chi-te-1-4-ptp-umich.ultralight.org (192.84.86.229) 6.469ms
3: chi-ultralight.es.net (198.125.140.205) 6.648ms
4: chislsdn1-chislmr1.es.net (134.55.219.25) asymm 5 6.208ms
5: chiccr1-chislsdn1.es.net (134.55.207.33) asymm 6 6.370ms
6: aofacr1-chicsdn1.es.net (134.55.218.94) asymm 7 33.318ms
7: bnlmr1-aoacr1.es.net (134.55.217.57) 35.958ms
8: bnlsite-bnlmr1.es.net (198.124.216.178) 35.700ms
9: bnlsite-bnlmr1.es.net (198.124.216.178) asymm 8 35.736ms pmtu 1500
10: dct00.usatlas.bnl.gov (192.12.15.8) asymm 9 34.851ms reached
Resume: pmtu 1500 hops 10 back 9
The NIC used on umfs05 was eth2 (A Myricom 10GE copper (CX4) NIC). We can get the OS level info about this network device with ifconfig:
[umfs05:~]# ifconfig eth2
eth2 Link encap:Ethernet HWaddr 00:60:DD:47:7D:71
inet addr:192.41.230.25 Bcast:192.41.230.255 Mask:255.255.255.0
inet6 addr: fe80::260:ddff:fe47:7d71/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
RX packets:417188 errors:0 dropped:0 overruns:0 frame:0
TX packets:338501 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:55153448 (52.5 MiB) TX bytes:36248433 (34.5 MiB)
Interrupt:246
Notice there are no errors, dropped, overruns, frame or carrier failures shown. It is important to watch for non-zero values in these counters because they can indicate hardware/software problems that will impact your network performance.
To get more details we can use the ethtool utility (available on most Linux systems or installable via RPM (yum or up2date)). The ethtool utility has quite a few options; run it with no arguments to get a list. The primary ones we are interested in are the -i, -k, -g and -S options (driver information, offload parameters, ring buffer settings and statistics). For umfs05 eth2 we get:
[umfs05:~]# ethtool -k eth2
Offload parameters for eth2:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
[umfs05:~]# ethtool -i eth2
driver: myri10ge
version: 1.3.0
firmware-version: 1.4.14 -- 2007/03/20 22:07:22 m
bus-info: 0000:10:00.0
[umfs05:~]# ethtool -g eth2
Ring parameters for eth2:
Pre-set maximums:
RX: 512
RX Mini: 512
RX Jumbo: 0
TX: 512
Current hardware settings:
RX: 512
RX Mini: 512
RX Jumbo: 0
TX: 512
This lets us know the offload settings (all enabled), the driver version and firmware version of the NIC (up to date), and the NIC buffer settings (at maximum). There are no needed changes for these settings.
Before we begin testing it is important to get the current NIC statistics from ethtool and save them to a file:
ethtool -S eth2 >initial_eth2_stats.log
We can then compare them after our tests to see if any specific error counters were incremented.
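The comparison itself can be sketched as below. The counter names and values here are illustrative sample data, not actual myri10ge statistics; in practice the first file would come from the ethtool -S capture above and the second from a capture taken after the test:

```shell
# Create two sample "ethtool -S" snapshots (illustrative data only).
cat > initial_eth2_stats.log <<'EOF'
     rx_packets: 417188
     rx_dropped: 0
     rx_crc_errors: 0
EOF
cat > final_eth2_stats.log <<'EOF'
     rx_packets: 5417188
     rx_dropped: 0
     rx_crc_errors: 0
EOF
# Show only the counters whose values changed during the test:
diff initial_eth2_stats.log final_eth2_stats.log | grep '^[<>]'
```

Only the changed counters (here, rx_packets) are printed; any error counter appearing in this diff after a test run deserves attention.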
The same type of information was gathered for dct00.usatlas.bnl.gov and is shown below. For tracepath:
[root@dct00 .ssh]# tracepath umfs05.aglt2.org
1: dct00.usatlas.bnl.gov (192.12.15.8) 0.161ms pmtu 1500
1: hsrp.usatlas.bnl.gov (192.12.15.24) 0.428ms
2: bnlmr1-bnlsite.es.net (198.124.216.177) 0.665ms
3: aoacr1-bnlmr1.es.net (134.55.217.58) asymm 4 2.292ms
4: chiccr1-aofacr1.es.net (134.55.218.93) asymm 5 29.164ms
5: chislsdn1-chicr1.es.net (134.55.207.34) asymm 6 29.404ms
6: chislmr1-chislsdn1.es.net (134.55.219.26) 29.536ms
7: 198.125.140.206 (198.125.140.206) 29.663ms
8: 192.84.86.230 (192.84.86.230) 35.169ms
9: umfs05.aglt2.org (192.41.230.25) 34.915ms !H
Resume: pmtu 1500
Note the 1500 MTU limitation on this path. Also, the last line shows !H for umfs05.aglt2.org; this indicates a firewall is interfering with the ICMP packets.
The ifconfig info for eth1 on dct00 is:
[root@dct00 .ssh]# ifconfig eth1
eth1 Link encap:Ethernet HWaddr 00:04:23:E1:8E:A2
inet addr:192.12.15.8 Bcast:192.12.15.255 Mask:255.255.255.0
inet6 addr: fe80::204:23ff:fee1:8ea2/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:265555257 errors:0 dropped:178067 overruns:0 frame:0
TX packets:410537836 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:10000
RX bytes:4060269971 (3.7 GiB) TX bytes:2201437530 (2.0 GiB)
Base address:0xdcc0 Memory:dfbe0000-dfc00000
Here we note there are a large number of dropped received (RX) packets. Also, the txqueuelen has been increased to 10000 from the default of 1000.
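For scale, those drops can be expressed as a fraction of the total RX packets (a quick sanity check; note that even a small loss fraction can significantly hurt TCP throughput on a long-RTT path):

```shell
# Fraction of received packets dropped on dct00 eth1, using the
# ifconfig counters above: 178067 drops out of 265555257 RX packets.
awk 'BEGIN { printf "%.3f%%\n", 178067 / 265555257 * 100 }'
```

The absolute count looks alarming, but it works out to well under a tenth of a percent of the received packets.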
The ethtool info from eth1:
[root@dct00 .ssh]# ethtool -k eth1
Offload parameters for eth1:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
[root@dct00 .ssh]# ethtool -i eth1
driver: e1000
version: 7.0.33-k2-NAPI
firmware-version: N/A
bus-info: 0000:03:07.0
[root@dct00 ~]# ethtool -G eth1
no ring parameters changed, aborting
[root@dct00 ~]# ethtool -g eth1
Ring parameters for eth1:
Pre-set maximums:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
Current hardware settings:
RX: 256
RX Mini: 0
RX Jumbo: 0
TX: 256
Two comments here: the version of the e1000 driver is a few minor versions behind the current release (7.0.33 vs 7.3.15), and the ring buffer settings are small relative to the maximum allowable (256 vs 4096).
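If we wanted to enlarge those ring buffers to their pre-set maximums, ethtool's -G option is the mechanism; shown here as a sketch only, since this change was not applied during these tests:

```shell
# Raise the e1000 RX/TX ring buffers to their hardware maximums.
# NOTE: requires root, may briefly reset the link when applied, and
# does not persist across reboots unless added to a startup script.
ethtool -G eth1 rx 4096 tx 4096
```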
Initial iperf Tests
Our initial iperf tests used umfs05 as the server and dct00 as the client. On the umfs05 side:
[umfs05:~]# iperf -s -w4M -i5
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 256 KByte (WARNING: requested 4.00 MByte)
------------------------------------------------------------
[ 4] local 192.41.230.25 port 5001 connected with 192.12.15.8 port 39932
[ 4] 0.0- 5.0 sec 23.3 MBytes 39.1 Mbits/sec
[ 4] 5.0-10.0 sec 25.2 MBytes 42.2 Mbits/sec
[ 4] 10.0-15.0 sec 25.1 MBytes 42.1 Mbits/sec
[ 4] 15.0-20.0 sec 25.1 MBytes 42.2 Mbits/sec
[ 4] 20.0-25.0 sec 25.1 MBytes 42.0 Mbits/sec
[ 4] 25.0-30.0 sec 25.1 MBytes 42.1 Mbits/sec
[ 4] 30.0-35.0 sec 25.2 MBytes 42.3 Mbits/sec
[ 4] 0.0-38.5 sec 192 MBytes 41.8 Mbits/sec
Note the very poor performance (this is a 10GE NIC, a full 10GE path, and the remote host has a 1GE NIC). The clue is in the "TCP window size: 256 KByte" line: this window size will limit the achievable bandwidth. We need to explore the TCP stack settings on umfs05.
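A window-limited TCP connection can move at most one window of data per round trip, so with the ~35 ms RTT seen in the tracepath output the 256 KByte window caps throughput as follows (a back-of-the-envelope check that ignores protocol overhead and slow start):

```shell
# Maximum throughput of a window-limited TCP connection:
#   throughput <= window_size / RTT
# 256 KBytes (262144 bytes) over a ~35 ms round trip, in Mbits/sec:
awk 'BEGIN { printf "%.1f Mbits/sec\n", 262144 * 8 / 0.035 / 1e6 }'
```

This gives roughly 60 Mbits/sec, the same order of magnitude as the ~42 Mbits/sec actually measured above.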
We can use the sysctl command to see/set kernel parameters. The file /etc/sysctl.conf can persist settings across reboots. For umfs05 it was seemingly set OK:
[umfs05:~]# sysctl -p
net.ipv4.ip_forward = 0
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.default.accept_source_route = 0
kernel.sysrq = 0
kernel.core_uses_pid = 1
kernel.pid_max = 4194303
net.ipv4.tcp_rmem = 4096 87380 20000000
net.ipv4.tcp_wmem = 4096 87380 20000000
The -p option tells sysctl to (re)apply the settings in /etc/sysctl.conf. The maximum allowed buffer sizes are up to 20 MBytes, so why are we limited to 256 KBytes? It turns out we were limited by two other parameters which needed to be increased. I added the following to /etc/sysctl.conf and reran sysctl -p:
# maximum receive socket buffer size, default 131071
net.core.rmem_max = 20000000
# maximum send socket buffer size, default 131071
net.core.wmem_max = 20000000
Rerunning iperf on umfs05 then gives a much better result:
[umfs05:~]# iperf -s -w4M -i5
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 8.00 MByte (WARNING: requested 4.00 MByte)
------------------------------------------------------------
[ 4] local 192.41.230.25 port 5001 connected with 192.12.15.8 port 39933
[ 4] 0.0- 5.0 sec 143 MBytes 239 Mbits/sec
[ 4] 5.0-10.0 sec 269 MBytes 451 Mbits/sec
[ 4] 10.0-15.0 sec 458 MBytes 769 Mbits/sec
[ 4] 15.0-20.0 sec 561 MBytes 942 Mbits/sec
[ 4] 20.0-25.0 sec 561 MBytes 942 Mbits/sec
[ 4] 25.0-30.0 sec 561 MBytes 942 Mbits/sec
[ 4] 30.0-35.0 sec 561 MBytes 942 Mbits/sec
[ 4] 35.0-40.0 sec 561 MBytes 942 Mbits/sec
[ 4] 40.0-45.0 sec 561 MBytes 942 Mbits/sec
[ 4] 45.0-50.0 sec 561 MBytes 942 Mbits/sec
[ 4] 50.0-55.0 sec 561 MBytes 942 Mbits/sec
[ 4] 55.0-60.0 sec 561 MBytes 942 Mbits/sec
[ 4] 0.0-60.0 sec 5.78 GBytes 828 Mbits/sec
Much better result! We are now fully using the 1GE link. Note that the Linux kernel doubles any requested socket buffer size (a bit shift, i.e. multiply by 2), so you get twice the buffer you ask for; that is why iperf reports an 8 MByte window for a 4 MByte request. This is actually just about right: doubling the TCP buffer size calculated from the bandwidth and round-trip time (RTT) seems to give the best performance.
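As a worked example of that sizing rule, using the ~35 ms RTT from the tracepath output (the exact RTT varies a bit between runs):

```shell
# Bandwidth-delay product for a 1 Gbit/s path with a 35 ms RTT,
# plus the doubled value suggested above:
awk 'BEGIN {
    bdp = 1e9 * 0.035 / 8          # bytes in flight needed to fill the pipe
    printf "BDP: %.1f MBytes, doubled: %.1f MBytes\n", bdp/1e6, 2*bdp/1e6
}'
```

The BDP comes out near 4.4 MBytes and the doubled value near 8.8 MBytes, consistent with the ~8 MByte effective window that achieved line rate.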
We can now test using umfs05 as a client in iperf:
[umfs05:~]# iperf -c dct00.usatlas.bnl.gov -w4M -i2 -t60
------------------------------------------------------------
Client connecting to dct00.usatlas.bnl.gov, TCP port 5001
TCP window size: 8.00 MByte (WARNING: requested 4.00 MByte)
------------------------------------------------------------
[ 3] local 192.41.230.25 port 59404 connected with 192.12.15.8 port 5001
[ 3] 0.0- 2.0 sec 15.1 MBytes 63.3 Mbits/sec
[ 3] 2.0- 4.0 sec 19.2 MBytes 80.5 Mbits/sec
[ 3] 4.0- 6.0 sec 27.3 MBytes 115 Mbits/sec
[ 3] 6.0- 8.0 sec 46.8 MBytes 196 Mbits/sec
[ 3] 8.0-10.0 sec 66.7 MBytes 280 Mbits/sec
[ 3] 10.0-12.0 sec 95.6 MBytes 401 Mbits/sec
[ 3] 12.0-14.0 sec 135 MBytes 565 Mbits/sec
[ 3] 14.0-16.0 sec 98.0 MBytes 411 Mbits/sec
[ 3] 16.0-18.0 sec 131 MBytes 549 Mbits/sec
[ 3] 18.0-20.0 sec 154 MBytes 647 Mbits/sec
[ 3] 20.0-22.0 sec 161 MBytes 675 Mbits/sec
[ 3] 22.0-24.0 sec 168 MBytes 705 Mbits/sec
[ 3] 24.0-26.0 sec 176 MBytes 737 Mbits/sec
[ 3] 26.0-28.0 sec 179 MBytes 750 Mbits/sec
[ 3] 28.0-30.0 sec 184 MBytes 773 Mbits/sec
[ 3] 30.0-32.0 sec 187 MBytes 784 Mbits/sec
[ 3] 32.0-34.0 sec 191 MBytes 800 Mbits/sec
[ 3] 34.0-36.0 sec 199 MBytes 833 Mbits/sec
[ 3] 36.0-38.0 sec 202 MBytes 849 Mbits/sec
[ 3] 38.0-40.0 sec 206 MBytes 863 Mbits/sec
[ 3] 40.0-42.0 sec 214 MBytes 899 Mbits/sec
[ 3] 42.0-44.0 sec 220 MBytes 923 Mbits/sec
[ 3] 44.0-46.0 sec 223 MBytes 935 Mbits/sec
[ 3] 46.0-48.0 sec 223 MBytes 935 Mbits/sec
[ 3] 48.0-50.0 sec 226 MBytes 947 Mbits/sec
[ 3] 50.0-52.0 sec 223 MBytes 935 Mbits/sec
[ 3] 52.0-54.0 sec 130 MBytes 544 Mbits/sec
[ 3] 54.0-56.0 sec 113 MBytes 472 Mbits/sec
[ 3] 56.0-58.0 sec 159 MBytes 668 Mbits/sec
[ 3] 58.0-60.0 sec 168 MBytes 705 Mbits/sec
[ 3] 0.0-60.0 sec 4.43 GBytes 634 Mbits/sec
This is OK but shows some fluctuations. It is possible other "real" traffic was interfering with this test. In any case, you can see that after some tuning a single stream is able to fully utilize a 1GE path between these end-hosts.
UDP Testing with iperf
We also did a quick test using UDP instead of TCP with iperf. A separate iperf instance was running UDP on dct00 on port 5002 (instead of the default 5001). We used umfs05 as a client to send UDP at a target rate of 950 Mbits/sec to dct00:
[umfs05:~]# iperf -c dct00.usatlas.bnl.gov -w4M -i2 -t60 -b950M -u -p5002
WARNING: option -b implies udp testing
------------------------------------------------------------
Client connecting to dct00.usatlas.bnl.gov, UDP port 5002
Sending 1470 byte datagrams
UDP buffer size: 8.00 MByte (WARNING: requested 4.00 MByte)
------------------------------------------------------------
[ 3] local 192.41.230.25 port 32887 connected with 192.12.15.8 port 5002
[ 3] 0.0- 2.0 sec 234 MBytes 980 Mbits/sec
[ 3] 2.0- 4.0 sec 234 MBytes 980 Mbits/sec
[ 3] 4.0- 6.0 sec 234 MBytes 980 Mbits/sec
[ 3] 6.0- 8.0 sec 234 MBytes 980 Mbits/sec
[ 3] 8.0-10.0 sec 234 MBytes 980 Mbits/sec
[ 3] 10.0-12.0 sec 234 MBytes 980 Mbits/sec
[ 3] 12.0-14.0 sec 234 MBytes 980 Mbits/sec
[ 3] 14.0-16.0 sec 234 MBytes 980 Mbits/sec
[ 3] 16.0-18.0 sec 234 MBytes 980 Mbits/sec
[ 3] 18.0-20.0 sec 234 MBytes 980 Mbits/sec
[ 3] 20.0-22.0 sec 234 MBytes 980 Mbits/sec
[ 3] 22.0-24.0 sec 234 MBytes 980 Mbits/sec
[ 3] 24.0-26.0 sec 234 MBytes 980 Mbits/sec
[ 3] 26.0-28.0 sec 234 MBytes 980 Mbits/sec
[ 3] 28.0-30.0 sec 234 MBytes 980 Mbits/sec
[ 3] 30.0-32.0 sec 234 MBytes 980 Mbits/sec
[ 3] 32.0-34.0 sec 234 MBytes 980 Mbits/sec
[ 3] 34.0-36.0 sec 234 MBytes 980 Mbits/sec
[ 3] 36.0-38.0 sec 234 MBytes 980 Mbits/sec
[ 3] 38.0-40.0 sec 234 MBytes 980 Mbits/sec
[ 3] 40.0-42.0 sec 234 MBytes 980 Mbits/sec
[ 3] 42.0-44.0 sec 234 MBytes 980 Mbits/sec
[ 3] 44.0-46.0 sec 234 MBytes 980 Mbits/sec
[ 3] 46.0-48.0 sec 234 MBytes 980 Mbits/sec
[ 3] 48.0-50.0 sec 234 MBytes 980 Mbits/sec
[ 3] 50.0-52.0 sec 234 MBytes 980 Mbits/sec
[ 3] 52.0-54.0 sec 234 MBytes 980 Mbits/sec
[ 3] 54.0-56.0 sec 234 MBytes 980 Mbits/sec
[ 3] 56.0-58.0 sec 234 MBytes 980 Mbits/sec
[ 3] 58.0-60.0 sec 234 MBytes 980 Mbits/sec
[ 3] 0.0-60.0 sec 6.84 GBytes 980 Mbits/sec
[ 3] Sent 4999799 datagrams
[ 3] Server Report:
[ 3] 0.0-60.0 sec 6.62 GBytes 948 Mbits/sec 0.015 ms 164192/4999798 (3.3%)
[ 3] 0.0-60.0 sec 1 datagrams received out-of-order
Note that we achieved 948 Mbits/sec (the difference between the 980 Mbits/sec sent and the 948 Mbits/sec received shows up as the 3.3% loss on the second-to-last line of the report above). This shows the path is able to support that level of traffic.
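The reported loss percentage follows directly from the datagram counts in the server report (a quick consistency check):

```shell
# Loss percentage from the iperf server report:
# 164192 datagrams lost out of 4999798 sent.
awk 'BEGIN { printf "%.1f%%\n", 164192 / 4999798 * 100 }'
```

This reproduces the 3.3% figure iperf printed, and 980 Mbits/sec less 3.3% matches the 948 Mbits/sec delivered.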
--
ShawnMcKee - 19 Sep 2007