
Congestion Control in the Internet
Part 2: Implementation

Contents
6. TCP Reno
7. TCP Cubic
8. ECN and AQM, DC-TCP
9. New Directions
6. Congestion Control in the Internet was initially only in TCP
Why?
Easy to add end-to-end congestion control to TCP, as TCP already maintains an end-to-end connection

How? In addition to the TCP mechanisms we already discussed:


• a TCP source adjusts its “rate” to the network-congestion status by:

- adjusting the sliding window

- using techniques like: additive increase/multiplicative decrease (AIMD), and slow start

- leveraging implicit or explicit feedback from the network (following the DECbit principle)

‣ this avoids congestion collapses and ensures some sort of fairness

Many congestion control algorithms


Popular ones are:
- TCP Reno (the most mature and well explored; widely used until recently)
- TCP Cubic (widespread today in Linux servers)
- Data Center TCP (Microsoft and Linux servers in data centers)
- TCP BBR (Google, WhatsApp, etc.)
TCP Reno uses ∼AIMD, Slow Start, implicit feedback
• Negative feedback = loss detected
  - multiplicative decrease
• Positive feedback = new (non-duplicate) ACK received
  - multiplicative or additive increase, depending on the phase
• 3 phases or states:
  - Slow Start = (approx.) the same as theoretical slow start,
    with multiplicative increase factor w0 ≈ 2 per RTT
  - Congestion Avoidance = additive increase with
    term v0 ≈ +1 MSS per RTT
  - Fast Recovery [see next]
• Transitions between states:
  - Initial state is Slow Start
  - Slow Start → Congestion Avoidance via a threshold
  - Any state → Slow Start if loss is detected via timeout
  - Any state → Fast Recovery if loss is detected via fast retransmit

[State diagram: the three states, with transitions labeled “new ACK”, “cwnd ≥ threshold”, “timeout” and “fast retransmit (e.g. 3 dupl. ACKs)”.]
What exactly does TCP Reno do to adjust its sending rate?
• TCP Reno adjusts the sliding-window size
• based on the approximation: rate ≈ W / RTT

W = min (cwnd, offeredWindow)

offeredWindow = window advertised by the other end in TCP’s window field
cwnd = controlled by TCP congestion control
Slow Start by adjusting cwnd…
…multiplicative increase (Slow Start):
• For the initial slow start, the target window ssthresh (for the target rate) is set to 64 KB

• At each new (non-duplicate) ACK received during slow start:

cwnd = cwnd + MSS (in bytes)

- if counted in packets, this would be: cwnd = cwnd + 1 (in packets)
- i.e. a multiplicative increase with factor w0 ≈ 2 per RTT, approximately

[Figure: cwnd = 1 segment, then 2, then 4, then 8 segments over successive RTTs — an exponential increase of cwnd.]

• if cwnd ≥ ssthresh, then go to congestion avoidance (a minimal sketch of this update follows)
AIMD by adjusting cwnd…
additive increase (Congestion Avoidance):
- for every new (non-duplicate) ACK received:

cwnd = cwnd + MSS × MSS / cwnd

- if counted in packets, this would be: cwnd += 1/cwnd,
  slightly less than an additive increase of 1 MSS per RTT
- other implementations also exist:
  - e.g. wait until cwnd bytes are ACKed and then increment cwnd by 1 MSS

[Figure: example evolution of cwnd/(1 MSS) under this rule: 2, 2.5, 2.9, …, 3.83.]

multiplicative decrease (after detecting a loss — negative feedback):

- ssthresh = 0.5 × cwnd
- cwnd = 1 MSS (if timeout) or something else (if fast retransmit) [see Fast Recovery]

(A minimal sketch of these two rules follows.)
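Again as an illustration (not part of the original slides), a minimal Python sketch of the per-ACK congestion-avoidance update and of the multiplicative decrease; the constants follow the slide, the function structure is an assumption.

```python
MSS = 1460  # bytes, assumed

def on_new_ack_congestion_avoidance(cwnd):
    # cwnd = cwnd + MSS * MSS / cwnd: roughly one MSS of growth per RTT in total,
    # because about cwnd/MSS such ACKs arrive per RTT.
    return cwnd + MSS * MSS / cwnd

def on_loss(cwnd, detected_by_timeout):
    # Multiplicative decrease: the target window is halved.
    ssthresh = 0.5 * cwnd
    if detected_by_timeout:
        cwnd = 1 * MSS       # restart from slow start
    else:
        cwnd = ssthresh      # fast retransmit: see Fast Recovery for the exact rule
    return cwnd, ssthresh
```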
Example of congestion-window evolution without Fast Recovery

[Figure: cwnd as a function of time t. An initial slow start grows cwnd exponentially up to the initial ssthresh; “congestion avoidance” then increases cwnd additively; each loss triggers a multiplicative decrease (reduction of the target window by 1/2) followed by a new slow start.]

Recall:
• there is a slow start phase initially and after every packet loss detected by timeout

• the target window of slow start is called ssthresh («slow start threshold»)
Fast Recovery

Why?
• Loss detected by fast retransmit is not severe—we just want to apply a multiplicative decrease with u1 = 0.5
• but halving cwnd is not a good approach;
  - the formula “rate ≈ W / RTT” is not true when there is a single isolated packet loss;
  - sliding window operation may even stop sending, if the first packet of a batch is lost

What?
• the target window is halved: ssthresh = 0.5 × cwnd
• but the congestion window is allowed to increase beyond the target window until the loss is repaired—it is increased by the value of duplicate ACKs
  ‣ an artificial increase to keep sending segments

Algorithm (sketched in code below):
When loss is detected by 3 duplicate ACKs, at any phase:
  ssthresh = 0.5 × current cwnd
  ssthresh = max (ssthresh, 2 × MSS)
  cwnd = ssthresh + 3 × MSS
  cwnd = min (cwnd, 64K)
  Go to Fast Recovery

When in Fast Recovery, for each duplicate ACK received:
  cwnd = cwnd + MSS (exp. increase)
  cwnd = min (cwnd, 64K)
If the loss is repaired:
  cwnd = ssthresh
  Go to Congestion Avoidance
else (timeout):
  Go to Slow Start
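A minimal, illustrative Python sketch of the fast-retransmit / fast-recovery rules listed above; the constants and the 64 KB cap follow the slide, the event-driven structure is an assumption.

```python
MSS = 1460           # bytes, assumed
WINDOW_CAP = 64_000  # 64 KB cap used on the slide

def enter_fast_recovery(cwnd):
    # Called when 3 duplicate ACKs are received, in any phase.
    ssthresh = max(0.5 * cwnd, 2 * MSS)
    cwnd = min(ssthresh + 3 * MSS, WINDOW_CAP)
    return cwnd, ssthresh

def on_duplicate_ack_in_fast_recovery(cwnd):
    # Each further duplicate ACK inflates cwnd by one MSS: an artificial increase
    # that lets the sliding window keep sending segments.
    return min(cwnd + MSS, WINDOW_CAP)

def on_loss_repaired(ssthresh):
    # A new ACK covering the lost segment arrives: deflate cwnd back to the
    # target window and return to congestion avoidance.
    return ssthresh
```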
Fast Recovery Example (MSS = 100 bytes, TcpMaxDupACKs = 3)

[Time-sequence diagram between sender and receiver; the numbers 1–20 label successive instants. Sender-side state is shown on the left, ACKs arriving from the receiver on the right:]
 1: ssthresh = cwnd = 800
 2: send seq=201:300 (this segment is lost)
 3: send seq=301:350;  Ack=201, win=1'000
 4: send seq=351:400
 5: send seq=401:500;  Ack=201, win=1'000
 6: ssthresh = cwnd = 813 (cwnd ← cwnd + MSS²/cwnd);  Ack=201, win=1'000
 7: send seq=501:600;  Ack=201, win=1'000
 9: send seq=601:700;  Ack=201, win=1'000
11: send seq=701:800;  Ack=201, win=1'000
12: ssthresh=407, cwnd=707;  Ack=201, win=1'000
13: retransmit seq=201:300
14: send seq=801:900
15: ssthresh=407, cwnd=807
16: ssthresh=407, cwnd=907;  Ack=901, win=1'000
17: send seq=901:1000
18: ssthresh=407, cwnd=1007
19: ssthresh=407, cwnd=407
20:
At time 1, the sender is in “congestion avoidance” mode. The congestion window increases with every received non-
duplicate ack (as at time 6). The target window (ssthresh) is equal to the congestion window.

The second packet is lost.

At time 12, its loss is detected by fast retransmit, i.e. reception of 3 duplicate acks. The sender goes into “fast recovery”
mode. The target window is set to half the value of the congestion window; the congestion window is set to the target
window plus 3 packets (one for each duplicate ack received).

At time 13 the source retransmits the lost packet. At time 14 it transmits a fresh packet. This is possible because the
window is large enough. The window size, which is the minimum of the congestion window and the advertised window,
is equal to 707. Since the last acked byte is 201, it is possible to send up to 907.

At times 15, 16 and 18, the congestion window is increased by 1 MSS, i.e. 100 bytes, by application of the fast recovery
algorithm. At time 15, this allows the sender to send one fresh packet, which occurs at time 17.

At time 19 the lost packet is acked, the source exits the fast recovery mode and enters congestion avoidance. The
congestion window is set to the target window.
How many new segments of size 100 bytes can the source send at time 20?

[Same time-sequence diagram as on the previous slide.]

A. 1
B. 2
C. 3
D. 4
E. ≥ 5
F. 0
G. I don’t know
Solution
Answer C
The congestion window is 407, the advertised window is 1000, and the last ack
received is 901.
The source can send bytes 901 to 1308; the segment 901:1000 was already sent,
i.e. the source can send 3 new segments of 100 bytes each.
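For illustration only, the same arithmetic as a few lines of Python (the window and sequence numbers are taken from the example above):

```python
cwnd = 407                  # congestion window at time 19/20
advertised = 1000           # window advertised by the receiver
last_ack = 901              # highest cumulative ACK received
already_sent_up_to = 1001   # bytes 901:1000 are already in flight
mss = 100

window = min(cwnd, advertised)
highest_allowed = last_ack + window            # can send up to byte 1308
new_bytes = highest_allowed - already_sent_up_to
print(new_bytes // mss)                        # -> 3 new segments
```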
TCP Reno — recap

[Figure from our textbook: “Computer Networking: A Top-Down Approach” by J. Kurose and K. Ross.]
Assume a TCP flow uses WiFi with high loss ratio. Assume some packets are
still lost in spite of WiFi retransmissions. When a packet is lost on the WiFi
link…

A. The TCP source knows it is a loss due to channel errors and not congestion,
therefore does not reduce the window
B. The TCP source thinks it is a congestion loss and reduces its window
C. It depends if the MAC layer uses retransmissions
D. I don’t know

Solution
Answer B: the TCP source does not know the cause of a loss.

Side-effect:
A flow that experiences accidental losses on its wireless access link may never manage to get
its fair share on another bottleneck link down its path, because it will be constantly reducing its
sending rate “thinking that it experiences congestion”.

Solutions:
Explicit Congestion Notification from the network [see later]
Dynamic (more sophisticated) coding at the physical layer to avoid errors on the wireless link
Fairness of TCP Reno
For long-lived flows, the rates obtained with TCP Reno are as if they were distributed according to utility
fairness, with the utility of flow i given by

Ui(xi) = (2/τi) arctan( xi τi / √2 )

with xi = rate (in MSS/s) = W/τi, τi = RTT (see “Rate adaptation, Congestion Control and Fairness: A Tutorial”).

For flows that have the same RTT, the fairness of TCP is between max-min and proportional fairness, closer to
proportional fairness:

[Figure: rescaled utility functions for RTT = 100 ms, comparing Reno, AIMD, proportional fairness and a max-min approximation U(x) = 1 − x⁻⁵; Reno lies between max-min and proportional fairness, closer to proportional fairness.]
TCP Reno and RTT
TCP Reno tends to distribute rates so as to maximize the utility of source i given by

Ui(xi) = (2/τi) arctan( xi τi / √2 )

The utility depends on the round-trip time τi.

[Figure: Ui plotted for several RTT values; the utility is a decreasing function of the RTT.]

What does this imply?
S1 and S2 send to the destination using one TCP connection each; the RTTs are 60 ms and 140 ms.
The bottleneck is the link «router–destination». Who gets more?

A. S1 gets a higher throughput
B. S2 gets a higher throughput
C. Both get the same
D. I don’t know

[Figure: S1 → router over a 10 Mb/s, 20 ms link; S2 → router over a 10 Mb/s, 60 ms link; router → destination over a 1 Mb/s, 10 ms link.]
Solution
For long-lived flows, the rates obtained with TCP are as if they were distributed according to utility fairness,
with the utility of flow i given by Ui(xi) = (2/τi) arctan( xi τi / √2 ).

S1 has a smaller RTT than S2.
The utility is lower when the RTT is large; therefore, TCP tries less hard to give a
high rate to sources with a large RTT.

So, S2 gets less.

Answer A.
The RTT Bias of TCP Reno
With TCP Reno, two competing flows with different RTTs are not treated equally
- flow with large RTT obtains less

- a (practical) explanation: additive increase is one packet per RTT (instead of one packet
per constant time interval); so a flow with a smaller RTT can “open” the window faster.

A flow that uses many hops obtains a lower rate because of two combined factors:
1. If the flow goes over many congested links, it uses more resources. The mechanics of
TCP Reno, which are close to proportional fairness, lead to this source getting a lower rate,
which is desirable in view of the theory of fairness.
2. If the flow simply has a larger RTT, things are different. The mechanics of
additive increase lead to this source getting a lower rate,
which is an undesired bias in the design of TCP Reno.
TCP Reno
Loss-Throughput Formula
Consider a TCP flow of large size (many bytes to transmit).
Assume we observe that, on average, a fraction q of packets is lost (or marked with ECN).

The throughput should be close to   θ = 1.22 × MSS / (RTT × √q).

The formula assumes:
- transmission time is negligible compared to RTT,
- losses are rare and occur periodically,
- time spent in Slow Start and Fast Recovery is negligible.

[see “Rate adaptation, Congestion Control and Fairness: A Tutorial”]
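A small numerical illustration (mine, not from the slides) of the loss-throughput formula; the MSS, RTT and loss values are arbitrary examples:

```python
from math import sqrt

def reno_throughput(mss_bytes, rtt_s, q):
    """TCP Reno loss-throughput formula: ~1.22 * MSS / (RTT * sqrt(q)), in bytes/s."""
    return 1.22 * mss_bytes / (rtt_s * sqrt(q))

# Example: MSS = 1460 B, RTT = 100 ms, loss ratio 1%
print(reno_throughput(1460, 0.1, 0.01) * 8 / 1e6, "Mb/s")   # ~1.4 Mb/s
```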
Guess the ratio between the throughputs θ1 and θ2 of S1 and S2
(assume: same MSS, same loss probability, and negligible transmission/processing delays for both flows).
Recall: θ = 1.22 × MSS / (RTT × √q).

A. θ1 = (3/7) θ2
B. θ1 = θ2
C. θ1 = (7/3) θ2
D. θ1 = (10/3) θ2
E. None of the above
F. I don’t know

[Figure: same topology as before: S1 → router (10 Mb/s, 20 ms), S2 → router (10 Mb/s, 60 ms), router → destination (1 Mb/s, 10 ms).]
Solution

[Figure: ACK numbers versus time for S1 and S2.]

If processing time is negligible and the router drops packets in the same
proportion for all flows, then the throughput is proportional to 1/RTT, thus

θ1 / θ2 = τ2 / τ1 = 140/60 = 7/3,   i.e.   θ1 = (7/3) θ2

Answer C.
TCP Reno — shortcomings
• RTT bias – not nice for users far away from the source

• Periodic losses must occur, which is not nice for applications (e.g. video streaming).

• TCP controls the window, not the rate. Large bursts typically occur when packets are
released by the host following e.g. a window increase; this is not nice for queues in the Internet
and makes the behavior non-smooth.

• Self inflicted delay: if network buffers (in routers and switches) are large, TCP first fills buffers
before adapting the rate. The RTT is increased unnecessarily. Buffers are constantly full,
which reduces their usefulness (bufferbloat syndrome) and increases delay for all users.
Interactive, short flows experience large latency when buffers are large and full.
Congestion control in UDP Applications
UDP applications that can adapt their rate have to implement congestion control.

One method is to use the congestion control module of TCP:


e.g. QUIC, which is over UDP, uses Cubic’s congestion control (in its original version) or
Reno’s congestion control (in the standard version).

Another method (e.g. for a videoconferencing application) is to control the rate by
computing the rate that TCP Reno would obtain.
E.g.: the TFRC (TCP-Friendly Rate Control) protocol:
- the application adapts the sending rate (by modifying the coding rate for audio and video)
- feedback is received in the form of a count of lost packets, used by the source to estimate the drop probability q
- the source sets its rate to x = 1.22 × MSS / (RTT × √q) (TCP Reno’s loss-throughput formula)

(An illustrative sketch of such a rate controller follows.)
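As a rough illustration only (not the actual TFRC specification, which is more involved), a sketch of a TFRC-style sender that estimates the drop probability from loss-count feedback and applies the Reno formula; all names, the feedback format and the initial rate are assumptions.

```python
from math import sqrt

class TfrcLikeSender:
    """Equation-based rate control: send at the rate TCP Reno would obtain."""
    def __init__(self, mss_bytes, rtt_s):
        self.mss = mss_bytes
        self.rtt = rtt_s
        self.rate = mss_bytes / rtt_s   # arbitrary initial rate: one MSS per RTT

    def on_feedback(self, packets_sent, packets_lost):
        # Estimate the drop probability q from the receiver's loss count.
        q = max(packets_lost / max(packets_sent, 1), 1e-6)
        # Reno loss-throughput formula, in bytes/s; the codec rate is then set to this value.
        self.rate = 1.22 * self.mss / (self.rtt * sqrt(q))
        return self.rate

sender = TfrcLikeSender(mss_bytes=1200, rtt_s=0.05)
print(sender.on_feedback(packets_sent=1000, packets_lost=10))  # ~292800 bytes/s
```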
7. TCP Cubic: Improving performance in Long Fat Networks (LFNs)
• In an LFN, additive increase can be too slow

[Figure (from the presentation “Congestion Control on High-Speed Networks”, Injong Rhee, Lisong Xu, slide 7): on a 10 Gb/s path, a packet loss halves cwnd from 100,000 to 50,000 packets (“fast decrease”, cwnd = cwnd × 0.5); the additive increase of one packet per RTT (“slow increase”, cwnd = cwnd + 1) then needs about 1.4 hours to bring the rate back from 5 Gb/s to 10 Gb/s, so 1.4 hours elapse between consecutive losses.]

The figure assumes: congestion avoidance implements a strict additive increase of 1 MSS per RTT,
losses are detected by fast retransmit, but the “fast recovery” phase is not used,
MSS = 1250 B, RTT = 100 ms.
TCP Cubic modifies Reno
Why? increase the TCP rate faster on LFNs
How? Cubic keeps the same slow start and fast recovery phases as Reno, but:
• during congestion avoidance, the increase is not additive but cubic
• multiplicative decrease with factor ×0.7 (instead of ×0.5)

Say congestion avoidance is entered at time t = 0 and let
Wmax = the value of cwnd when the loss was detected. Let

W(t) = Wmax + 0.4 (t − K)³

with K such that W(0) = 0.7 Wmax.
Then the window increases like W(t) until a loss occurs again.

Units are: data = 1 MSS; time = 1 s.

[Figure: W(t) for Cubic, compared with additive increase (≈Reno) with RTT = 0.1 s.]

(A minimal sketch of W(t) in code follows.)
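A minimal Python sketch (illustration only) of the cubic window function defined above, including the computation of K from the condition W(0) = 0.7 Wmax; the example value of Wmax is an assumption.

```python
def cubic_window(t, w_max, c=0.4, beta=0.7):
    """Cubic congestion-avoidance window, in MSS; t in seconds since the last loss."""
    # K is chosen so that W(0) = beta * w_max, i.e. w_max - c*K^3 = beta * w_max.
    k = ((1 - beta) * w_max / c) ** (1 / 3)
    return w_max + c * (t - k) ** 3

w_max = 100  # MSS, example value
for t in (0.0, 2.0, 4.0, 6.0):
    print(t, round(cubic_window(t, w_max), 1))
# W starts at 0.7*Wmax, grows concavely towards Wmax (reached at t = K ~ 4.2 s here),
# then grows convexly beyond Wmax, probing for more bandwidth.
```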
How does this compare to Reno?
Cubic increases the window in a concave way until it reaches Wmax, then increases it in a convex way.
W(t) is independent of the RTT, but
- it opens faster than Reno when the RTT is large (long networks),
- but may be slower when the RTT is small (non-LFNs).

[Figure: W(t) for Cubic versus additive increase (≈Reno), shown for RTT = 0.01 s and RTT = 1 s.]
Cubic’s Window Increase
Cubic is always at least as fast as a hypothetical Reno (i.e. AIMD) with additive-increase term
r_cubic MSS per RTT (instead of 1) and multiplicative decrease β = 0.7.

Formally:
W_CUBIC(t) = max { W(t), W_AIMD(t) },
where

W_AIMD(t) = W_AIMD(0) + r_cubic × t / RTT,   and

r_cubic is computed such that this hypothetical Reno has the same loss-throughput formula as standard
Reno:  ⇒  r_cubic = 3 (1 − β) / (1 + β) = 0.529

➡ Cubic’s throughput ≥ Reno’s throughput

with equality when the RTT or the bandwidth-delay product is small (i.e. in non-LFNs)
Cubic’s Loss-Throughput Formula
Given the same assumptions as for TCP Reno:

θ ≈ max ( 1.054 / (RTT^0.25 × q^0.75) ,  1.22 / (RTT × q^0.5) ),   in MSS per second.

So:
• Cubic’s formula is the same as Reno’s for small RTTs and small bandwidth-delay products
• but a TCP Cubic connection gets more throughput than TCP Reno when the bit rate and the RTT are large

[Figure: throughput (Mb/s) versus loss ratio q for Reno and Cubic, at RTT = 12.5 ms, 100 ms and 800 ms.]

Other details: the computation of Wmax uses a more complex mechanism called “fast convergence”; see the latest IETF Cubic RFC / Internet-Draft or http://elixir.free-electrons.com/linux/latest/source/net/ipv4/tcp_cubic.c

(A short numerical comparison of the two formulas follows.)
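A short numerical comparison (illustrative, with an assumed MSS of 1460 bytes) of the Reno and Cubic loss-throughput formulas; it shows that the max() picks the Reno term for small RTT and the Cubic term for large RTT:

```python
from math import sqrt

MSS_MB = 1460 * 8 / 1e6   # one MSS in Mb, assuming 1460-byte segments

def reno_mbps(rtt, q):
    return 1.22 / (rtt * sqrt(q)) * MSS_MB

def cubic_mbps(rtt, q):
    return max(1.054 / (rtt ** 0.25 * q ** 0.75), 1.22 / (rtt * sqrt(q))) * MSS_MB

q = 1e-4
for rtt in (0.0125, 0.1, 0.8):       # 12.5 ms, 100 ms, 800 ms
    print(rtt, round(reno_mbps(rtt, q), 1), round(cubic_mbps(rtt, q), 1))
# At 12.5 ms both formulas give the same value; at 100 ms and 800 ms the Cubic term dominates.
```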
8. Tackling the Bufferbloat Syndrome with ECN and AQM
Using loss as congestion feedback has a major drawback = self-inflicted delay:
increased latencies, and buffers that are not well utilized, due to the bufferbloat syndrome.

[Figure (from N. Cardwell, Y. Cheng, C. S. Gunn, S. H. Yeganeh, and V. Jacobson, “BBR: Congestion-Based Congestion Control”, ACM Queue, vol. 14, no. 5, pp. 50:20–50:53, Oct. 2016): delivery rate and round-trip time as functions of the window size (= amount of in-flight data). The optimal operating point (A) is where the in-flight data just fills bottleneck link capacity × RTTmin; loss-based congestion control operates at point (B), where the bottleneck buffer is full.]
from [Hock et al, 2017] Mario Hock, Roland Bless, Martina Zitterbart, “Experimental Evaluation
of BBR Congestion Control”, ICNP 2017:

The previous figure illustrates that if the amount of inflight data (i.e. the window size) is just
large enough to fill the available bottleneck link capacity, the bottleneck link is fully utilized and
the queuing delay is zero or close to zero. This is the optimal operating point (A), because the
bottleneck link is already fully utilized at this point. If the amount of inflight data is increased
any further, the bottleneck buffer gets filled with the excess data. The delivery rate, however,
does not increase anymore. The data is not delivered any faster since the bottleneck does not
serve packets any faster and the throughput stays the same for the sender: the amount of
inflight data is larger, but the round-trip time increases by the corresponding amount. Excess
data in the buffer is useless for throughput gain and a queuing delay is caused that rises with
an increasing amount of inflight data. Loss-based congestion controls shift the point of
operation to (B) which implies an unnecessary high end-to-end delay, leading to “bufferbloat”
in case the buffer sizes are large.
ECN - Explicit Congestion Notification…
…aims at avoiding bufferbloat

What? Network signals congestion without dropping packets (similarly to DECbit)

How?
• IP router experiencing congestion marks packet instead of dropping
• TCP destination echoes back the mark to the source
• TCP source interprets an echoed marked packet as if there was a loss detected by fast retransmit
Example

[Figure: a TCP source S, a congested IP router and a TCP receiver, with numbered steps 1–7 showing how a Congestion Experienced mark propagates; the steps are listed on the next slide.]

After step 6: the source applies multiplicative decrease to cwnd,
as if there was a loss detected by fast retransmit.
Steps in the previous slide:
1. S sends a packet using TCP
2. Packet is received at congested router buffer; router marks the Congestion Experienced (CE)
bit in IP header
3. Receiver sees CE in the received packet and sets the ECN Echo (ECE) flag in the TCP header of
packets sent in the reverse direction
4. Packets with ECE are forwarded towards the source
5. Packets with ECE are forwarded towards the source
6. Packets with ECE are received by source.
7. Source applies multiplicative decrease of the congestion window.
Source sets the Congestion Window Reduced (CWR) flag in TCP header.
The receiver continues to set the ECE flag until it receives a packet with CWR set.
Multiplicative decrease is applied only once per window of data (typically, multiple packets are
received with ECE set inside one window of data).
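For illustration only, a sketch of the sender-side ECE/CWR logic described in steps 6–7; the once-per-window rule is modeled with a simple flag, and all names are assumptions.

```python
class EcnSender:
    def __init__(self, cwnd, mss):
        self.cwnd = cwnd
        self.mss = mss
        self.cwnd_reduced_this_window = False

    def on_ack(self, ece_flag):
        if ece_flag and not self.cwnd_reduced_this_window:
            # React to the echoed congestion mark as to a loss detected by fast
            # retransmit: multiplicative decrease, applied at most once per window of data.
            self.cwnd = max(self.cwnd * 0.5, 2 * self.mss)
            self.cwnd_reduced_this_window = True
            return True   # set CWR on the next outgoing segment
        return False

    def on_new_window_of_data_acked(self):
        # A full window of data has been acknowledged: a new reduction is allowed.
        self.cwnd_reduced_this_window = False
```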
Assume TCP with ECN is used and there is no packet loss.
Put the correct labels…

[Figure: window size as a function of time t; the rising parts of the curve are labeled “1”, and the drops, each annotated “ECE received”, are labeled “2”.]

A. 1 = CA, 2 = SS     (CA: congestion avoidance)
B. 1 = SS, 2 = MD     (SS: slow start = multiplicative increase)
C. 1 = CA, 2 = MD     (MD: multiplicative decrease)
D. I don’t know
Solution
Answer C

[Figure: the same window-size curve; the increasing segments are congestion avoidance (1), and each time an ECE is received the target window is reduced by 1/2 (2 = multiplicative decrease).]

Recall: Slow start’s multiplicative increase results in an exponential growth of the cwnd.
So, no slow start phase is shown in this figure.
ECN flags in IP and TCP headers
2 bits in IP header, 4 possible codewords:
00 = non ECN Capable (non ECT)
01 or 10 = ECN capable ECT(0) and ECT(1)
historically used at random; today used to
differentiate congestion control (TCP Cubic vs DCTCP)
11 = used by routers to signal that congestion is experienced (CE)
If congested, router marks ECT(0) or ECT(1) packets; but discards non ECT packets

2 bits in TCP header but as separate flags:


• ECE (echo) is set by R to inform S about congestion.
• CWR (congestion window reduced) is set by S to inform R that ECE was received and the window was reduced.
• When receiving ECE, S reduces its window only once per RTT and sets the CWR flag in TCP headers.
R sets the ECE flag in all TCP headers until CWR is received, or again if a new CE packet is received.
ECN requires Active Queue Management
Why? decide when to mark a packet with ECN and, more generally, avoid the bufferbloat syndrome
How? E.g. with a RED (Random Early Detection) queue:
• The queue estimates its average queue length:
  avg ← α × measured + (1 − α) × avg
• An incoming packet is marked with probability q given by the RED curve (see figure);
  a uniformization procedure is also applied to prevent bursts of marking events

[Figure: the RED curve, marking probability q versus avg (average queue length): q = 0 below th-min, rises linearly up to max-p at th-max, and is 1 above th-max.]

See the difference from passive queue management = drop a packet only when the queue is full = “Tail Drop”.
(An illustrative sketch of the RED marking decision follows.)
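A minimal Python sketch (illustration only) of the RED logic described above: an EWMA of the queue length plus the piecewise-linear marking probability. The parameter values are arbitrary examples, and the uniformization step mentioned on the slide is omitted.

```python
import random

ALPHA, TH_MIN, TH_MAX, MAX_P = 0.002, 5, 15, 0.1   # example RED parameters

avg = 0.0

def red_mark(measured_queue_len):
    """Return True if the incoming packet should be ECN-marked (or dropped)."""
    global avg
    avg = ALPHA * measured_queue_len + (1 - ALPHA) * avg   # EWMA of the queue length
    if avg < TH_MIN:
        p = 0.0
    elif avg < TH_MAX:
        p = MAX_P * (avg - TH_MIN) / (TH_MAX - TH_MIN)     # linear ramp up to max-p
    else:
        p = 1.0
    return random.random() < p
```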
But…Active Queue Management does not require ECN
AQM can also be applied even if ECN is not supported
e.g. with RED, a packet is dropped with probability computed by the RED curve
- packet may be discarded even if there is some space available!

Expected benefits in this case:

- mitigate bufferbloat and reduce latency
- avoid irregular drop patterns, as the drop probability affects all flows in the same way

In the context of packet dropping (instead of ECN), RED can be replaced by the more recent variant called CoDel (RFC 8289).
In a network where all flows use TCP with ECN and all routers
support ECN, we expect that …

A. there is no packet loss


B. there is significantly less packet loss due to congestion in both switches
and routers
C. there is significantly less packet loss due to congestion in routers
D. none of the above
E. I don’t know

Solution
Answer C
We expect that routers (almost) do not drop packets due to congestion if all
TCP sources use ECN

However, there might be congestion losses in switches (especially those in large networks or
Internet exchange points, IXPs), and there might be non-congestion losses (transmission errors).
Data Centers and congestion control
What is a data center?
a room with lots of racks of PCs and switches
where many distributed apps are running: e.g. youtube, CFF.ch, switchdrive, etc

What is special about data centers?


• most traffic is TCP

• very small latencies (10–100 µs)

• lots of bandwidth

• various traffic patterns coexist:

- internal traffic (distributed computing) and

- external traffic (user requests and their responses)

- many short flows with low latency requirements (user queries, mapReduce communication)

- some jumbo flows with huge volume (backup, synchronizations) may use an entire link
Given what you have learnt so far,
what would you choose
for TCP flows inside a data center ?

A. TCP Reno, no ECN no RED


B. TCP Reno and ECN
C. TCP Cubic, no ECN no RED
D. TCP Cubic and ECN
E. I don’t know
Solution
Answer D (also B could work)
• Cubic has better performance than Reno when the bandwidth-delay product is large,
which may occur in data centers.
Also, Cubic performs at least as well as Reno in any case.
• Without ECN there will be bufferbloat, which means high latency for short flows

Standard operation of ECN (e.g. with Reno or Cubic) still has drawbacks for jumbo
flows in data-center settings: a multiplicative decrease by 50% or 30% is still abrupt
⇒ throughput inefficiency.
Data Center TCP
Why? Improve performance for jumbo flows when ECN is used
How?
Avoid the brutal multiplicative decrease of 50% (Reno) or 30% (Cubic).

Instead, the TCP source estimates the probability of congestion p from the ECN echoes:

• the ECN echo is modified so that:
  the proportion of CE-marked ACKs ≈ the probability of congestion p
• the multiplicative decrease is × β_DCTCP = (1 − p/2)

(An illustrative sketch of this rule follows.)
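As an illustration only, a sketch of the DCTCP-style decrease rule above; real DCTCP smooths the marked-ACK fraction with an EWMA, and the gain g used below, as well as all names, are assumptions.

```python
class DctcpLikeSender:
    def __init__(self, cwnd, mss, g=1/16):
        self.cwnd = cwnd
        self.mss = mss
        self.g = g        # EWMA gain for the congestion estimate (assumed value)
        self.p = 0.0      # estimated probability of congestion

    def on_window_of_acks(self, acks_total, acks_with_ce):
        # Fraction of CE-marked ACKs observed over roughly one window of data.
        frac = acks_with_ce / max(acks_total, 1)
        self.p = (1 - self.g) * self.p + self.g * frac
        if acks_with_ce > 0:
            # Gentle multiplicative decrease: the smaller p, the smaller the cut.
            self.cwnd = max(self.cwnd * (1 - self.p / 2), 2 * self.mss)
        return self.cwnd
```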
In a data center: two large TCP flows compete for a
bottleneck link; one uses DCTCP, the other uses Cubic/ECN.
Both have same RTT.
A. Both get roughly the same throughput
B. DCTCP gets much more throughput
C. Cubic gets much more throughput
D. I don’t know

Solution
Answer B.
If the latency is very small, Cubic with ECN has the same throughput performance as Reno with ECN, i.e.
the same as AIMD with multiplicative decrease × 0.5 and a window increase of 1 packet per RTT
during congestion avoidance.

DCTCP is similar, in particular it has the same window increase, but its multiplicative decrease is
× (1 − p/2), so the decrease is always smaller.
DCTCP decreases less and increases the same, therefore it is more aggressive.

In other words, DCTCP competes unfairly with other TCPs; this is why it cannot be deployed
outside data centers (or other controlled environments).
Inside data centers, care must be taken to separate the DCTCP flows (i.e. the internal flows) from
other flows. This can be done with class-based queuing [see next].
9. Beyond Loss/ECN Based Congestion Control
TCP-BBR
Per Class Queuing
Evolution of Buffer Drain Time in the Internet
Buffer Drain Time = buffer capacity / link rate
To keep buffer drain time constant, the product (memory speed × memory size)
should scale faster than link rate, which is technologically not feasible.

• Access network (1 Gb/s): the buffer drain time is tens of seconds = the buffer is “large” w.r.t. the rate
⇒ bufferbloat unless ECN is used
But
• on Internet core links (100 Gb/s, 1 Tb/s):
the buffer drain time decreases and is now a fraction of a millisecond, much less than the RTT = the buffer is “small”
⇒ impossible to react correctly within a round-trip time
⇒ feedback control may be inadequate
(A small numeric example of buffer drain time follows.)
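A tiny numeric illustration of the drain-time definition above; the buffer size and link rates are assumed example values, not taken from the slides.

```python
def drain_time_s(buffer_bytes, link_rate_bps):
    """Buffer drain time = buffer capacity / link rate."""
    return buffer_bytes * 8 / link_rate_bps

print(drain_time_s(256e6, 1e9))    # 256 MB at 1 Gb/s   -> ~2 s     ("large" buffer)
print(drain_time_s(256e6, 1e12))   # same memory at 1 Tb/s -> ~2 ms ("small" vs the RTT)
```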
TCP-BBR
Bottleneck Bandwidth and RTT
TCP-BBR, published by Google in 2016 [Cardwell et al 2016]
What ? Avoid per packet feedback, target maximum throughput with minimal delay
How ? BBR-TCP source:
1. estimates the bottleneck bandwidth and the min RTT separately
2. controls directly the rate (not the window) using pacing (= implementing a packet spacer)
that tries to keep amount of inflight data close to
bottleneck bandwidth × minRTT (optimal operating point)

N. Cardwell, Y. Cheng, C. S. Gunn, S. H. Yeganeh, and V.


Jacobson, “BBR: Congestion-Based Congestion
Control,” ACM Queue, vol. 14, no. 5, pp. 50:20–50:53,
Oct. 2016
BBRv1 operation — simplified
• the source views the network as a single link (the bottleneck link)
• it estimates the minimum RTT, RTprop, by taking the minimum over the last 10 seconds
• it estimates the bottleneck rate (bandwidth), btlbw, as the maximum delivery rate observed
  over the last 10 RTTs; delivery rate = amount of ACKed data per unit of time
• it sends data at rate c(t) × btlbw,
  where c(t) = 1.25; 0.75; 1; 1; 1; 1; 1; 1 is the pacing gain; i.e., c(t) is 1.25 during one RTprop,
  then 0.75 during one RTprop, then 1 during 6 RTprops (“probe bandwidth” followed by
  “drain excess” followed by steady state)
• if no new RTprop value is seen for 10 seconds, the source enters the Probe RTT state: it sends only 4
  packets, to drain any possible queue and get a real estimate of RTprop
• for safety, the maximum data in flight is limited to 2 × btlbw × RTprop and by the offered window
• there is also a startup phase (similar to Cubic and Reno) with an exponential increase of the rate
• no reaction to losses or ECN

[Figure from: Ware, R., Mukerjee, M.K., Seshan, S. and Sherry, J., 2019, October. Modeling BBR’s interactions with loss-based congestion control. In Proceedings of the Internet Measurement Conference (pp. 137-143).]

(A simplified sketch of these estimators follows.)
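A highly simplified Python sketch (illustration only, far from the real BBR implementation) of the two estimators and the pacing-rate computation described above; the sample-window sizes are rough approximations.

```python
from collections import deque

GAIN_CYCLE = [1.25, 0.75, 1, 1, 1, 1, 1, 1]   # pacing gain, one value per RTprop

class BbrLikeEstimator:
    def __init__(self):
        self.delivery_samples = deque(maxlen=10)  # ~last 10 RTTs of delivery-rate samples
        self.rtt_samples = deque(maxlen=1000)     # RTT samples from roughly the last 10 s
        self.phase = 0

    def on_ack(self, delivery_rate_bps, rtt_s):
        self.delivery_samples.append(delivery_rate_bps)
        self.rtt_samples.append(rtt_s)

    def btlbw(self):
        return max(self.delivery_samples)   # bottleneck bandwidth = max delivery rate

    def rtprop(self):
        return min(self.rtt_samples)        # propagation RTT = min observed RTT

    def pacing_rate(self):
        return GAIN_CYCLE[self.phase % 8] * self.btlbw()

    def inflight_cap(self):
        # Safety cap on data in flight: twice the estimated bandwidth-delay product.
        return 2 * self.btlbw() * self.rtprop()
```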
…BBRv1 in more detail
1) Overview: The main objective of BBR is to ensure that the bottleneck remains
saturated but not congested, resulting in maximum throughput with minimal delay.
Therefore, BBR estimates bandwidth as maximum observed delivery rate BtlBw and
propagation delay RTprop as minimum observed RTT over certain intervals. Both values
cannot be measured simultaneously, as probing for more bandwidth increases the delay
through the creation of a queue at the bottleneck and vice-versa. Consequently, they are
measured separately.
To control the amount of data sent, BBR uses pacing gain. This parameter, most of the
time set to one, is multiplied with BtlBw to represent the actual sending rate.

2) Phases: The BBR algorithm has four different phases: Startup, Drain, Probe
Bandwidth, and Probe RTT. The first phase adapts the exponential Startup behavior
from CUBIC by doubling the sending rate with each round-trip. Once the measured
bandwidth does not increase further, BBR assumes it has reached the bottleneck
bandwidth. Since this observation is delayed by one RTT, a queue has already been created
at the bottleneck. BBR tries to Drain it by temporarily reducing the pacing gain.
Afterwards, BBR enters the Probe Bandwidth phase in which it probes for more available
bandwidth. This is performed in eight cycles, each lasting RTprop: First, pacing gain is
set to 1.25, probing for more bandwidth, followed by 0.75 to drain created queues. For
the remaining six cycles BBR sets the pacing gain to 1. BBR continuously samples the
bandwidth and uses the maximum as BtlBw estimator, whereby values are valid for the
timespan of ten RTprop. After not measuring a new RTprop value for ten seconds, BBR
stops probing for bandwidth and enters the Probe RTT phase. During this phase the
bandwidth is reduced to four packets to drain any possible queue and get a real
estimation of the RTT. This phase is kept for 200 ms plus one RTT. If a new minimum
value is measured, RTprop is updated and valid for ten seconds.
Performance of BBRv1
Google and other data-center companies report an improvement in throughput (green curve);
the latency measurement here is irrelevant and should be ignored.

http://blog.cerowrt.org/post/bbrs_basic_beauty/
Performance of BBRv1
But… BBRv1 takes no feedback from the network: no reaction to loss or ECN.

• [Hock et al, 2017] find that BBRv1’s estimated bottleneck bandwidth ignores how many flows are competing
  → fairness issues with:
  - BBR flows of different RTTs, and
  - BBR versus other TCPs
  (Hock, M., Bless, R. and Zitterbart, M., 2017, October. Experimental evaluation of BBR congestion control. In 2017 IEEE 25th International Conference on Network Protocols (ICNP) (pp. 1-10). IEEE.)

• [Ware et al, 2019] find that the in-flight cap, designed as a safety mechanism, is determinant
  (Ware, R., Mukerjee, M.K., Seshan, S. and Sherry, J., 2019, October. Modeling BBR’s interactions with loss-based congestion control. In Proceedings of the Internet Measurement Conference (pp. 137-143).)

Google proposed BBRv2 and BBRv3 to address these and other shortcomings…
Per-class Queuing
Routers classify packets (using an access list);
each class is guaranteed a dedicated queue and a weight, hence a rate;
classes may exceed the guaranteed rate by borrowing from other classes if spare capacity exists.

It is implemented in routers with dedicated queues for every class and a scheduler
such as Weighted Round Robin (WRR) or Deficit Round Robin (DRR).
WRR and DRR have one queue per class.
At every round, queues are visited in sequence.
WRR serves w_i packets of class i in one round. DRR serves q_i bits of class i in one round.
(An illustrative sketch of DRR follows.)

Used in
- enterprise or industrial networks, to support non-congestion-controlled flows (e.g. real-time flows);
- provider networks, to separate customers / isolate suspicious flows (network virtualization).
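An illustrative Python sketch of Deficit Round Robin as described above (one FIFO per class, q_i bits of credit added per round); the packet representation as a plain bit count is an assumption.

```python
from collections import deque

class DeficitRoundRobin:
    def __init__(self, quanta_bits):
        # quanta_bits[i] = q_i, the credit (in bits) class i receives per round
        self.quanta = quanta_bits
        self.queues = [deque() for _ in quanta_bits]
        self.deficit = [0] * len(quanta_bits)

    def enqueue(self, class_id, packet_bits):
        self.queues[class_id].append(packet_bits)

    def next_round(self):
        """Visit all classes once; return the list of (class_id, packet_bits) served."""
        served = []
        for i, q in enumerate(self.queues):
            if not q:
                continue
            self.deficit[i] += self.quanta[i]
            while q and q[0] <= self.deficit[i]:
                pkt = q.popleft()
                self.deficit[i] -= pkt
                served.append((i, pkt))
            if not q:
                self.deficit[i] = 0   # an empty class does not accumulate credit
        return served
```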
Example of Class-Based Queuing

Class 1 for PMUs (power measurement units) is guaranteed a rate of 2.5 Mb/s; it can exceed
this rate by borrowing capacity available from the total 10 Mb/s if class 2 does not need it.
Class 2 for PCs is guaranteed a rate of 7.5 Mb/s; it can exceed this rate by borrowing capacity
available from the total 10 Mb/s if class 1 does not need it.
Suppose PMUs behave properly as expected.
Which rates will PC1 and PC2 achieve, if their RTTs are equal?

A. 5 Mb/s each
B. 4 Mb/s each
C. PC1: 5 Mb/s, PC2: 3 Mb/s
D. I don’t know
Solution

[Figure: the path from PC1 and PC2 towards S1 crosses three 10 Mb/s links; class 2 (low priority) is guaranteed 7.5 Mb/s on each link, and after subtracting the PMU traffic the capacities available to class 2 are 9 Mb/s, 8 Mb/s and 8 Mb/s.]

PC1 and PC2 see this network.

Since PMU1 and PMU2 stream at 1 Mb/s and class 2 may borrow the remaining
capacity, the available capacities for class 2 are: 9 Mb/s, 8 Mb/s and 8 Mb/s.
Solution

[Same figure as on the previous slide.]

TCP allocates rates x1 and x2 so as to maximize U(x1) + U(x2), where U is the utility function of TCP;
the function is the same for PC1 and PC2 because the RTTs are the same.
The constraints are x1 ≤ 9 Mb/s, x1 ≤ 8 Mb/s, x1 + x2 ≤ 8 Mb/s.
Thus TCP solves a utility-optimization problem: maximize U(x1) + U(x2) subject to x1 + x2 ≤ 8 Mb/s.
By symmetry, x1 = x2 = 4 Mb/s.
You can also check the max-min fair allocation (x1 = x2 = 4 Mb/s) and the proportionally fair allocation
(x1 = x2 = 4 Mb/s).

Answer B.
The Future of Congestion Control
In the past, most TCP versions have relied on loss or ECN as the negative signal. Some versions also
relied on delay only (TCP Vegas), or use delay as well as loss (PCC).

Congestion control today also wants to achieve “per-flow fairness”. But each flow may use a different
congestion control algorithm.
Is fairness achieved? Is every flow “TCP friendly”?
Is the “flow” the right abstraction/fairness-actor?
What are the alternatives?
(Brown, L., Ananthanarayanan, G., Katz-Bassett, E., Krishnamurthy, A., Ratnasamy, S., Schapira, M. and Shenker, S., 2020, November. On the future of congestion control for the public internet. In Proceedings of the 19th ACM Workshop on Hot Topics in Networks (pp. 30-37).)

Traffic isolation (e.g. with per-class traffic shapers or per-class queuing) is a possible future
alternative; packet dropping / ECN marking becomes a function of the traffic aggregate/class a
packet belongs to.
But does this comply with network-neutrality regulations (= ISPs provide no competitive advantage to
specific apps/services, either through pricing or QoS)? How could network neutrality be maintained?
Conclusion
Congestion control is in TCP or in QUIC (a congestion-controlled transport running over UDP).

Traditional TCP uses:


• the window to control the amount of traffic: additive increase or cubic (as long as no loss);
multiplicative decrease (following a loss).
• loss as congestion signal.

Too much buffer is as bad as too little buffer—bufferbloat causes large latency for interactive flows.
• ECN can avoid this – it replaces loss by an explicit congestion signal; but it is only partly deployed in the
Internet, although it is part of Data Center TCP.
• TCP-BBR aims at avoiding this by pacing traffic:
it estimates the available bottleneck bandwidth and the min RTT
and it tries to keep amount of inflight data close to bottleneck bandwidth × minRTT

Per-Class-based queuing can separate flows in enterprise networks or classes of flows in provider networks.
