TCP Tuning Tutorial
Brian L. Tierney
http://gridmon.dl.ac.uk/nfnn/
Wizard Gap
Slide from Matt Mathis, PSC
Outline
TCP Overview
TCP Tuning Techniques (focus on Linux)
TCP Issues
Network Monitoring Tools
Current TCP Research
TCP Overview
Congestion avoidance
additive increase: starting from the rough estimate, linearly increase the congestion window size to probe for additional available bandwidth
multiplicative decrease: cut congestion window size aggressively if a timeout occurs
[Figure: CWND vs. time: slow start (exponential increase), then congestion avoidance (linear increase); after a packet loss and retransmission timeout, slow start begins again]
TCP Overview
Fast Retransmit: retransmit after 3 duplicate ACKs (got 3 additional packets without getting the one you are waiting for)
this prevents expensive timeouts
no need to go into “slow start” again
At steady state, CWND oscillates around the optimal window size
With a retransmission timeout, slow start is triggered again (see the toy simulation below)
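To make the behaviour described above concrete, here is a toy simulation of how the window evolves: exponential growth in slow start, linear growth in congestion avoidance, halving on a fast retransmit, and a reset to slow start on a timeout. This is an illustrative sketch only, not how any kernel implements it; the loss points are invented for the example.

#include <stdio.h>

/* Toy model of TCP window evolution, in segments (illustrative only) */
int main(void)
{
    double cwnd = 1, ssthresh = 64;

    for (int rtt = 1; rtt <= 40; rtt++) {
        int fast_retransmit = (rtt == 15);  /* pretend 3 duplicate ACKs arrive here */
        int timeout         = (rtt == 30);  /* pretend a retransmission timeout here */

        if (timeout) {
            ssthresh = cwnd / 2;            /* multiplicative decrease ... */
            cwnd = 1;                       /* ... and back to slow start */
        } else if (fast_retransmit) {
            ssthresh = cwnd / 2;            /* halve the window ... */
            cwnd = ssthresh;                /* ... but skip slow start */
        } else if (cwnd < ssthresh) {
            cwnd *= 2;                      /* slow start: exponential increase */
        } else {
            cwnd += 1;                      /* congestion avoidance: linear increase */
        }
        printf("RTT %2d: cwnd = %5.1f  ssthresh = %5.1f\n", rtt, cwnd, ssthresh);
    }
    return 0;
}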
Terminology
[Diagram: path from source to sink, marking the Narrow Link (the hop with the smallest capacity) and the Tight Link (the hop with the least available bandwidth)]
More Terminology
TCP Performance Tuning Issues
Importance of TCP Tuning
[Bar chart: throughput (Mbits/sec) with TCP buffers tuned for the LAN, tuned for the WAN, and tuned for both; results range from 44 to 264 Mbits/sec depending on the tuning]
TCP Buffer Tuning: Application
Must adjust buffer size in your applications:
int skt;
int sndsize = 2 * 1024 * 1024;
err = setsockopt(skt, SOL_SOCKET, SO_SNDBUF,
(char *)&sndsize,(int)sizeof(sndsize));
and/or
err = setsockopt(skt, SOL_SOCKET, SO_RCVBUF,
(char *)&sndsize,(int)sizeof(sndsize));
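A minimal sketch (not from the original slides) that wraps both calls and then reads back what the kernel actually granted with getsockopt(); the helper name set_tcp_buffers is made up, and note that Linux typically reports roughly twice the value you requested:

#include <stdio.h>
#include <sys/socket.h>

/* Hypothetical helper: request send/receive buffer sizes and print what the
 * kernel actually granted (the request may be clamped to the configured max).
 * Call it before connect()/listen() so the window scale is negotiated for
 * the larger buffer. */
int set_tcp_buffers(int skt, int sndsize, int rcvsize)
{
    socklen_t len = sizeof(int);
    int actual;

    if (setsockopt(skt, SOL_SOCKET, SO_SNDBUF, &sndsize, sizeof(sndsize)) < 0)
        perror("setsockopt SO_SNDBUF");
    if (setsockopt(skt, SOL_SOCKET, SO_RCVBUF, &rcvsize, sizeof(rcvsize)) < 0)
        perror("setsockopt SO_RCVBUF");

    if (getsockopt(skt, SOL_SOCKET, SO_SNDBUF, &actual, &len) == 0)
        printf("send buffer: requested %d, got %d\n", sndsize, actual);
    len = sizeof(int);
    if (getsockopt(skt, SOL_SOCKET, SO_RCVBUF, &actual, &len) == 0)
        printf("recv buffer: requested %d, got %d\n", rcvsize, actual);
    return 0;
}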
Buffer Size Example
ping time = 50 ms
Narrow link = 500 Mbps (62 Mbytes/sec)
e.g.: the end-to-end network consists of all 1000BT Ethernet and OC-12 (622 Mbps)
TCP buffers should be the bandwidth-delay product (see the sketch below):
.05 sec * 62 Mbytes/sec = 3.1 Mbytes
UK to...
UK (RTT = 5 ms, narrow link = 1000 Mbps): 625 KB
Europe (RTT = 25 ms, narrow link = 500 Mbps): 1.56 MB
US (RTT = 150 ms, narrow link = 500 Mbps): 9.4 MB
Japan (RTT = 260 ms, narrow link = 150 Mbps): 4.9 MB
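A tiny illustrative sketch (not from the slides) that reproduces the figures above from the bandwidth-delay product:

#include <stdio.h>

/* bandwidth-delay product: buffer (bytes) = bottleneck bandwidth * RTT */
static double tcp_buffer_bytes(double mbps, double rtt_ms)
{
    return (mbps * 1e6 / 8.0) * (rtt_ms / 1000.0);
}

int main(void)
{
    printf("UK:     %.0f KB\n", tcp_buffer_bytes(1000, 5)  / 1e3); /* ~625 KB  */
    printf("Europe: %.2f MB\n", tcp_buffer_bytes(500, 25)  / 1e6); /* ~1.56 MB */
    printf("US:     %.1f MB\n", tcp_buffer_bytes(500, 150) / 1e6); /* ~9.4 MB  */
    printf("Japan:  %.1f MB\n", tcp_buffer_bytes(150, 260) / 1e6); /* ~4.9 MB  */
    return 0;
}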
More Problems: TCP congestion control
Path = LBL to CERN (Geneva), OC-3 (in 2000), RTT = 150 ms, average BW = 30 Mbps
Tuned Buffers vs. Parallel Streams
[Bar chart: throughput (Mbits/sec, scale 0 to 30) for four cases: no tuning; tuned TCP buffers; 10 parallel streams, no tuning; tuned TCP buffers with 3 parallel streams]
Potentially unfair
Places more load on the end hosts
But they are necessary when you don’t have root access, and can’t convince the sysadmin to increase the max TCP buffers (a rough sketch of the technique follows after this slide)
graph from Tom Dunigan, ORNL
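A rough sketch of the parallel-streams idea (illustrative only; this is not the tool that produced the graph): one thread per stream, each with its own TCP connection, splitting the data between them. The address 192.0.2.1, port 5001, the stream count, and all helper names are placeholders.

#include <pthread.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

#define NSTREAMS 4

struct stream { const char *data; size_t len; };

/* Send one stream's share of the data over its own TCP connection */
static void *send_one(void *arg)
{
    struct stream *s = arg;
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port   = htons(5001) };   /* placeholder port */
    inet_pton(AF_INET, "192.0.2.1", &addr.sin_addr);           /* placeholder address */

    int skt = socket(AF_INET, SOCK_STREAM, 0);
    if (connect(skt, (struct sockaddr *)&addr, sizeof(addr)) == 0) {
        size_t off = 0;
        while (off < s->len) {
            ssize_t n = write(skt, s->data + off, s->len - off);
            if (n <= 0)
                break;
            off += (size_t)n;
        }
    }
    close(skt);
    return NULL;
}

int main(void)
{
    static char block[4 * 1024 * 1024];       /* 4 MB of test data */
    pthread_t tid[NSTREAMS];
    struct stream part[NSTREAMS];
    size_t chunk = sizeof(block) / NSTREAMS;

    for (int i = 0; i < NSTREAMS; i++) {      /* one thread per stream */
        part[i].data = block + i * chunk;
        part[i].len  = chunk;
        pthread_create(&tid[i], NULL, send_one, &part[i]);
    }
    for (int i = 0; i < NSTREAMS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}

Build with -pthread. A real tool would also tune each socket's buffers as shown earlier and coordinate reassembly on the receiving side.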
traceroute
>traceroute pcgiga.cern.ch
traceroute to pcgiga.cern.ch (192.91.245.29), 30 hops max, 40 byte packets
1 ir100gw-r2.lbl.gov (131.243.2.1) 0.49 ms 0.26 ms 0.23 ms
2 er100gw.lbl.gov (131.243.128.5) 0.68 ms 0.54 ms 0.54 ms
3 198.129.224.5 (198.129.224.5) 1.00 ms *d9* 1.29 ms
4 lbl2-ge-lbnl.es.net (198.129.224.2) 0.47 ms 0.59 ms 0.53 ms
5 snv-lbl-oc48.es.net (134.55.209.5) 57.88 ms 56.62 ms 61.33 ms
6 chi-s-snv.es.net (134.55.205.102) 50.57 ms 49.96 ms 49.84 ms
7 ar1-chicago-esnet.cern.ch (198.124.216.73) 50.74 ms 51.15 ms 50.96 ms
8 cernh9-pos100.cern.ch (192.65.184.34) 175.63 ms 176.05 ms 176.05 ms
9 cernh4.cern.ch (192.65.185.4) 175.92 ms 175.72 ms 176.09 ms
10 pcgiga.cern.ch (192.91.245.29) 175.58 ms 175.44 ms 175.96 ms
Can often learn about the network from the router names:
ge = Gigabit Ethernet
oc48 = 2.4 Gbps (oc3 = 155 Mbps, oc12=622 Mbps)
Iperf
iperf : very nice tool for measuring end-to-end TCP/UDP performance
http://dast.nlanr.net/Projects/Iperf/
Can be quite intrusive to the network
Example:
Server: iperf -s -w 2M
Client: iperf -c hostname -i 2 -t 20 -l 128K -w 2M
Client connecting to hostname
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 2.0 sec 66.0 MBytes 275 Mbits/sec
[ 3] 2.0- 4.0 sec 107 MBytes 451 Mbits/sec
[ 3] 4.0- 6.0 sec 106 MBytes 446 Mbits/sec
[ 3] 6.0- 8.0 sec 107 MBytes 443 Mbits/sec
[ 3] 8.0-10.0 sec 106 MBytes 447 Mbits/sec
[ 3] 10.0-12.0 sec 106 MBytes 446 Mbits/sec
[ 3] 12.0-14.0 sec 107 MBytes 450 Mbits/sec
[ 3] 14.0-16.0 sec 106 MBytes 445 Mbits/sec
[ 3] 16.0-24.3 sec 58.8 MBytes 59.1 Mbits/sec
[ 3] 0.0-24.6 sec 871 MBytes 297 Mbits/sec
pathrate / pathload
pipechar
pipechar output
dpsslx04.lbl.gov(59)>pipechar firebird.ccs.ornl.gov
PipeChar statistics: 82.61% reliable
From localhost: 827.586 Mbps GigE (1020.4638 Mbps)
1: ir100gw-r2.lbl.gov (131.243.2.1 )
| 1038.492 Mbps GigE <11.2000% BW used>
2: er100gw.lbl.gov (131.243.128.5)
| 1039.246 Mbps GigE <11.2000% BW used>
3: lbl2-ge-lbnl.es.net (198.129.224.2)
| 285.646 Mbps congested bottleneck <71.2000% BW used>
4: snv-lbl-oc48.es.net (134.55.209.5)
| 9935.817 Mbps OC192 <94.0002% BW used>
5: orn-s-snv.es.net (134.55.205.121)
| 341.998 Mbps congested bottleneck <65.2175% BW used>
6: ornl-orn.es.net (134.55.208.62)
| 298.089 Mbps congested bottleneck <70.0007% BW used>
7: orgwy-ext.ornl.gov (192.31.96.225)
| 339.623 Mbps congested bottleneck <65.5502% BW used>
8: ornlgwy-ext.ens.ornl.gov (198.124.42.162)
| 232.005 Mbps congested bottleneck <76.6233% BW used>
9: ccsrtr.ccs.ornl.gov (160.91.0.66 )
| 268.651 Mbps GigE (1023.4655 Mbps)
10: firebird.ccs.ornl.gov (160.91.192.165)
tcpdump / tcptrace
Sample use:
tcpdump -s 100 -w /tmp/tcpdump.out host hostname
tcptrace -Sl /tmp/tcpdump.out
xplot /tmp/a2b_tsg.xpl
Zoomed In View
Green Line: ACK values received from the receiver
Yellow Line: the receive window advertised by the receiver
Green Ticks: duplicate ACKs received
Yellow Ticks: window advertisements that were the same as the last advertisement
White Arrows: segments sent
Red Arrows (R): retransmitted segments
Other Tools
Other TCP Issues
host issues
Memory copy speed
I/O Bus speed
Disk speed
Duplex Mismatch Issues
A common source of trouble with Ethernet
networks is that the host is set to full duplex, but
the Ethernet switch is set to half-duplex, or visa
versa.
Most newer hardware will auto-negotiate this, but
with some older hardware, auto-negotiation
sometimes fails
result is a working but very slow network (typically only
1-2 Mbps)
best for both to be in full duplex if possible, but some
older 100BT equipment only supports half-duplex
NDT is a good tool for finding duplex issues:
http://e2epi.internet2.edu/ndt/
Jumbo Frames
Linux Autotuning
ssthresh caching
The value of CWND at the time of the loss gets cached, and is reused for later connections to the same destination
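If stale cached values keep pulling new connections down, the caching can be turned off. A hedged sketch, assuming the net.ipv4.tcp_no_metrics_save sysctl (present on 2.6-era and later Linux kernels); the helper name is made up, and it must run as root:

#include <stdio.h>

/* Assumption: /proc/sys/net/ipv4/tcp_no_metrics_save exists on this kernel.
 * Writing "1" stops Linux from caching ssthresh/cwnd per destination. */
static int disable_tcp_metrics_cache(void)
{
    FILE *f = fopen("/proc/sys/net/ipv4/tcp_no_metrics_save", "w");
    if (!f) {
        perror("tcp_no_metrics_save");
        return -1;
    }
    fputs("1\n", f);
    fclose(f);
    return 0;
}

int main(void)
{
    return disable_tcp_metrics_cache();   /* needs root */
}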
Proposed TCP Modifications
XCP:
XCP rapidly converges on the optimal congestion window using a completely new router paradigm
This makes it very difficult to deploy and test
http://www.ana.lcs.mit.edu/dina/XCP/
FAST TCP:
http://netlab.caltech.edu/FAST/
TCP: Reno vs. BIC
[Plots: TCP-Reno (Linux 2.4) vs. BIC-TCP (Linux 2.6)]
BIC-TCP recovers from loss more aggressively than TCP-Reno
Sample Results
From Doug Leith, Hamilton Institute, http://www.hamilton.ie/net/eval/
[Plots: fairness between flows, and link utilization]
Linux 2.6 Issues
Linux 2.6.12-rc3 Results
[Plots for paths with RTT = 67 ms and RTT = 83 ms; but on some paths BIC still seems to have problems…]
Use Asynchronous I/O
[Diagram: sequential approach (each I/O is followed by processing, and the next I/O starts only when processing ends) vs. overlapped I/O and processing (the previous block is processed while the next I/O is in progress)]
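A minimal sketch of the overlapped pattern (not from the slides) using POSIX AIO: start reading block n+1 with aio_read() while processing block n. The file name "datafile" and process_block() are placeholders; older glibc may need -lrt when linking.

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLOCK (1 << 20)                     /* 1 MB blocks */

static void process_block(char *buf, ssize_t n) { (void)buf; (void)n; /* real work here */ }

int main(void)
{
    static char buf[2][BLOCK];              /* double buffer */
    int fd = open("datafile", O_RDONLY);    /* placeholder file name */
    if (fd < 0) { perror("open"); return 1; }

    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf[0];
    cb.aio_nbytes = BLOCK;
    cb.aio_offset = 0;
    aio_read(&cb);                          /* start the first read */

    for (int cur = 0;; cur ^= 1) {
        while (aio_error(&cb) == EINPROGRESS)
            ;                               /* or use aio_suspend() instead of spinning */
        ssize_t n = aio_return(&cb);
        if (n <= 0)
            break;

        off_t next = cb.aio_offset + n;
        memset(&cb, 0, sizeof(cb));         /* kick off the next read ... */
        cb.aio_fildes = fd;
        cb.aio_buf    = buf[cur ^ 1];
        cb.aio_nbytes = BLOCK;
        cb.aio_offset = next;
        aio_read(&cb);

        process_block(buf[cur], n);         /* ... while processing the current block */
    }
    close(fd);
    return 0;
}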
scp Issues
Conclusions
The wizard gap is starting to close (slowly)
If max TCP buffers are increased
Tuning TCP is not easy!
no single solution fits all situations
need to be careful TCP buffers are not too big or too small
sometimes parallel streams help throughput, sometimes they hurt
Linux 2.6 helps a lot
Design your network application to be as flexible as possible
make it easy for clients/users to set the TCP buffer sizes
make it possible to turn on/off parallel socket transfers
probably off by default
Design your application for the future
even if your current WAN connection is only 45 Mbps (or less), some day it will be much higher, and these issues will become even more important
For More Information
http://dsd.lbl.gov/TCP-tuning/
links to all network tools mentioned here
sample TCP buffer tuning code, etc.