
TCP performance tuning - 10G NIC with high RTT in Linux

Thank you for visiting this page. This page has been updated at another link: TCP performance tuning


I just finished TCP tuning for SL6, kernel 2.6.32-358.14.1.el6.x86_64, so I am sharing the experience here.

In the past I have played with several types of 10G NIC, all on SL5, and only some of them survived my tests; the others failed with either poor performance or data corruption during multi-stream transfers. Note that my test is a multi-stream test for storage nodes that receive and deliver data over a large range of RTTs (0.1 to 300 ms), with clients mixed between 1G and 10G NICs.
In my recent test I used a node with 32GB of memory, a Mellanox 10G NIC and 12 CPUs; it was an SL5 node that had just been upgraded to SL6. I mounted two LUNs so it had enough I/O bandwidth for the test. The first driver I tested was version 2.0, which came with SL6.4. It was not successful: it crashed the kernel within 3 minutes, with the following error in the kernel log.
kernel: swapper: page allocation failure. order:2, mode:0x4020
kernel: Pid: 0, comm: swapper Not tainted 2.6.32-358.14.1.el6.x86_64 #1
kernel: Call Trace:
kernel: <IRQ>  [<ffffffff8112c197>] ? __alloc_pages_nodemask+0x757/0x8d0
kernel: [<ffffffff8147fa38>] ? ip_local_deliver+0x98/0xa0
kernel: [<ffffffff811609ea>] ? alloc_pages_current+0xaa/0x110
kernel: [<ffffffffa01efaa7>] ? mlx4_en_alloc_frags+0x57/0x330 [mlx4_en]
kernel: [<ffffffff8144aada>] ? napi_frags_finish+0x9a/0xb0
kernel: [<ffffffffa01f02df>] ? mlx4_en_process_rx_cq+0x55f/0x990 [mlx4_en]
kernel: [<ffffffffa01f074f>] ? mlx4_en_poll_rx_cq+0x3f/0x80
...

Then I tried version 1.5.10, which also generated some memory allocation errors, but with some further tuning it passed my stress tests. Performance is also very good.
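If you want to confirm which mlx4_en driver version is actually installed and loaded before and after switching, commands like the following work; the interface name eth2 is only a placeholder for your 10G port.
# modinfo mlx4_en | grep -i version     # module version on disk, if the driver reports one
# ethtool -i eth2                       # driver, version and firmware of the running interface
# dmesg | grep mlx4_en                  # messages the driver logged when it loaded
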
Here is my sysctl.conf:
# sysctl -p
net.ipv4.ip_forward = 0
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.default.accept_source_route = 0
kernel.sysrq = 0
kernel.core_uses_pid = 1
kernel.msgmnb = 65536
kernel.msgmax = 65536
kernel.shmmax = 68719476736
kernel.shmall = 4294967296
vm.min_free_kbytes = 131072
net.ipv4.tcp_sack = 0
net.ipv4.tcp_timestamps = 0
net.core.netdev_max_backlog = 250000
net.ipv4.tcp_low_latency = 1
net.ipv4.tcp_max_syn_backlog = 8192
net.core.optmem_max = 33554432
net.core.rmem_max = 33554432
net.core.wmem_max = 33554432
net.core.rmem_default = 33554432
net.core.wmem_default = 33554432
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 65536 33554432
net.ipv4.tcp_mem = 6672016 6682016 7185248
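
As a sanity check on the 33554432 byte (32 MB) maximum buffers above, a rough bandwidth-delay-product estimate helps: a socket needs roughly bandwidth/8 * RTT bytes of buffer to keep the pipe full, and 32 MB corresponds to about a 10 Gb/s path at 27 ms RTT, or a 1 Gb/s path at 270 ms RTT, which fits my mix of 1G/10G clients over 0.1 to 300 ms. A quick shell sketch (the numbers are only illustrative):
# echo $(( 10*1000*1000*1000/8 * 27/1000 ))    # 10 Gb/s at 27 ms RTT  -> ~33.7 MB
# echo $(( 1*1000*1000*1000/8 * 270/1000 ))    # 1 Gb/s at 270 ms RTT  -> ~33.7 MB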

Note that I increased tcp_mem to give TCP more memory, since my data server is mainly used for data transfer. If your server also does something else, you should probably lower these numbers to leave memory for the other applications. But I noticed that, under a heavy stress test, the default configuration could cause memory allocation errors.
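Also keep in mind that tcp_mem is counted in pages, not bytes; with 4 KiB pages the low threshold above (6672016 pages) comes to roughly 25 GiB on this 32GB node. To watch how much memory TCP actually allocates under load, the mem field on the TCP line of /proc/net/sockstat is reported in the same page units:
# getconf PAGE_SIZE        # usually 4096 bytes on x86_64
# cat /proc/net/sockstat   # "TCP: ... mem N" is in pages; N * page size is the bytes TCP holds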

For SACK and timestamps, I disabled them both to save CPU resources. After kernel 2.6.25 there were a lot of patches to SACK handling to avoid excessive CPU usage; I switched them off mainly because I did not see a significant difference under the stress test.
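Both switches can also be flipped at runtime if you want to compare before committing them to sysctl.conf, and netstat's protocol statistics give a rough idea of how much SACK activity the node sees (the exact counter names vary between kernels):
# sysctl -w net.ipv4.tcp_sack=0
# sysctl -w net.ipv4.tcp_timestamps=0
# netstat -s | grep -i sack    # SACK-related counters, if the kernel exposes any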

I set syn_backlog and max_backlog to higher values mainly because there can be short periods of high-rate data taking, but I did not raise txqueuelen (the default is 1000), and I did not turn the Mellanox adaptive-rx off. You could try these if your server's traffic pattern changes all the time, for example sometimes quiet and then sometimes very busy.
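For reference, this is how those two knobs could be changed; eth2 is again only a placeholder, and whether adaptive-rx is available depends on the driver:
# ifconfig eth2 txqueuelen 10000    # or: ip link set dev eth2 txqueuelen 10000
# ethtool -C eth2 adaptive-rx off   # disable adaptive interrupt coalescing on receive
# ethtool -c eth2                   # show the current coalescing settings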


Here are some references that I think are all useful, with very good explanations.

http://www.acc.umu.se/~maswan/linux-netperf.txt
http://fasterdata.es.net/host-tuning/linux/
http://en.wikipedia.org/wiki/TCP_window_scale_option
http://www.psc.edu/index.php/networking/641-tcp-tune

http://man7.org/linux/man-pages/man7/tcp.7.html
https://www.frozentux.net/ipsysctl-tutorial/ipsysctl-tutorial.html#TCPVARIABLES
http://en.wikipedia.org/wiki/Transmission_Control_Protocol

http://www.linuxvox.com/2009/11/what-is-the-linux-kernel-parameter-tcp_low_latency
http://www.ibm.com/developerworks/library/l-tcp-sack/

 More references for other platforms

[AIX] For more information, see section 4.6 in the http://www.redbooks.ibm.com/redbooks/SG247347/wwhelp/wwhimpl/js/html/wwhelp.htm document. In addition, see the http://publib.boulder.ibm.com/infocenter/pseries/v5r3/index.jsp?topic=/com.ibm.aix.prftungd/doc/prftungd/tcp_streaming_workload_tuning.htm document.

[HP-UX] For more information, see the ndd command information in the following documents:
  • http://docs.hp.com/en/B2355-91020/B2355-91020.pdf
  • http://docs.hp.com/en/TKP-90203/index.html

[HP-UX] Also, see the _recv_hiwater_def and tcp_xmit_hiwater_def parameter information in the following document: http://docs.hp.com/en/11890/perf-whitepaper-tcpip-v1_1.pdf

[Linux] For more information, see the following documents:
  • http://www.ibm.com/developerworks/linux/library/l-hisock.html
  • http://fasterdata.es.net/TCP-tuning/linux.html
  • http://www.onlamp.com/pub/a/onlamp/2005/11/17/tcp_tuning.html?page=2

[Solaris] For more information, see section 2.2 in the following document: http://www.redbooks.ibm.com/redbooks/SG247584/wwhelp/wwhimpl/java/html/wwhelp.htm

[Windows] For information on tuning TCP/IP buffer sizes, see the "TCP window size" section of the http://support.microsoft.com/kb/224829 document. Consider setting the TcpWindowSize value to either 8388608 or 16777216.



