
Understanding Host Network Stack Overheads

Qizhe Cai, Shubham Chaudhary, Midhul Vuppalapati, Jaehyun Hwang, and Rachit Agarwal
Cornell University

ABSTRACT
Traditional end-host network stacks are struggling to keep up with rapidly increasing datacenter access link bandwidths due to their unsustainable CPU overheads. Motivated by this, our community is exploring a multitude of solutions for future network stacks: from Linux kernel optimizations to partial hardware offload to clean-slate userspace stacks to specialized host network hardware. The design space explored by these solutions would benefit from a detailed understanding of CPU inefficiencies in existing network stacks. This paper presents measurement and insights for Linux kernel network stack performance for 100Gbps access link bandwidths. Our study reveals that such high-bandwidth links, coupled with relatively stagnant technology trends for other host resources (e.g., core speeds and count, cache sizes, NIC buffer sizes, etc.), mark a fundamental shift in host network stack bottlenecks. For instance, we find that a single core is no longer able to process packets at line rate, with data copy from kernel to application buffers at the receiver becoming the core performance bottleneck. In addition, increases in bandwidth-delay products have outpaced the increase in cache sizes, resulting in an inefficient DMA pipeline between the NIC and the CPU. Finally, we find that the traditional loosely-coupled design of network stack and CPU schedulers in existing operating systems becomes a limiting factor in scaling network stack performance across cores. Based on insights from our study, we discuss implications for the design of future operating systems, network protocols, and host hardware.

CCS CONCEPTS
• Networks → Transport protocols; Network performance analysis; Data center networks; • Hardware → Networking hardware;

KEYWORDS
Datacenter networks, Host network stacks, Network hardware

ACM Reference Format:
Qizhe Cai, Shubham Chaudhary, Midhul Vuppalapati, Jaehyun Hwang, and Rachit Agarwal. 2021. Understanding Host Network Stack Overheads. In ACM SIGCOMM 2021 Conference (SIGCOMM '21), August 23–27, 2021, Virtual Event, USA. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3452296.3472888

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
SIGCOMM '21, August 23–27, 2021, Virtual Event, USA
© 2021 Association for Computing Machinery. ACM ISBN 978-1-4503-8383-7/21/08...$15.00
https://doi.org/10.1145/3452296.3472888

1 INTRODUCTION
The slowdown of Moore's Law, the end of Dennard scaling, and the rapid adoption of high-bandwidth links have brought traditional host network stacks to the brink of a breakdown—while datacenter access link bandwidths (and the resulting computing needs for packet processing) have increased by 4−10× over the past few years, technology trends for essentially all other host resources (including core speeds and counts, cache sizes, NIC buffer sizes, etc.) have largely been stagnant. As a result, the problem of designing CPU-efficient host network stacks has come to the forefront, and our community is exploring a variety of solutions, including Linux network stack optimizations [11, 12, 21, 24, 32, 41], hardware offloads [3, 6, 9, 16], RDMA [31, 34, 43], clean-slate userspace network stacks [4, 27, 30, 33, 36], and even specialized host network hardware [2]. The design space explored by these solutions would benefit from a detailed understanding of CPU inefficiencies of the traditional Linux network stack. Building such an understanding is hard because the Linux network stack is not only large and complex, but also comprises many components that are tightly integrated into an end-to-end packet processing pipeline.
Several recent papers present a preliminary analysis of Linux network stack overheads for short flows [21, 30, 32, 38, 40]. This fails to provide a complete picture for two reasons. First, for datacenter networks, it is well known that an overwhelmingly large fraction of data is contained in long flows [1, 5, 28]; thus, even if there are many short flows, most of the CPU cycles may be spent in processing packets from long flows. Second, datacenter workloads contain not just short flows or long flows in exclusion, but a mixture of different flow sizes composed in a variety of traffic patterns; as we will demonstrate, CPU characteristics change significantly with varying traffic patterns and mixtures of flow sizes.
This paper presents measurement and insights for Linux kernel network stack performance for 100Gbps access link bandwidths. Our key findings are:

High-bandwidth links result in performance bottlenecks shifting from protocol processing to data copy. The modern Linux network stack can achieve ∼42Gbps throughput-per-core by exploiting all commonly available features in commodity NICs, e.g., segmentation and receive offload, jumbo frames, and packet steering. While this throughput is for the best-case scenario of a single long flow, the dominant overhead is consistent across a variety of scenarios—data copy from kernel buffers to application buffers (e.g., > 50% of total CPU cycles for a single long flow). This is in sharp contrast to previous studies on short flows and/or low-bandwidth links, where protocol processing was shown to be the main bottleneck. We also observe receiver-side packet processing to become a bottleneck much earlier than the sender-side.

• Implications. Emerging zero-copy mechanisms from the Linux networking community [11, 12] may alleviate data copy overheads, and may soon allow the Linux network stack to process as much as 100Gbps worth of data using a single core. Integration of other hardware offloads like I/OAT [37] that transparently mitigate data copy overheads could also lead to performance improvements. Hardware offloads of transport protocols [3, 43] and userspace network stacks [21, 27, 30] that do not provide zero-copy interfaces may improve throughput in microbenchmarks, but will require additional mechanisms to achieve CPU efficiency when integrated into an end-to-end system.

The shrinking gap between bandwidth-delay product (BDP) and cache sizes leads to suboptimal throughput. Modern CPU support for Direct Cache Access (DCA) (e.g., Intel DDIO [25]) allows NICs to DMA packets directly into the L3 cache, reducing data copy overheads; given its promise, DDIO is enabled by default in most systems. While DDIO is expected to improve performance during data copy, rather surprisingly, we observe that it suffers from high cache miss rates (49%) even for a single flow, thus providing limited performance gains. Our investigation revealed that the reason for this is quite subtle: host processing becoming a bottleneck results in increased host latencies; combined with increased access link bandwidths, BDP values increase. This increase outpaces the increase in L3 cache sizes—data is DMAed from the NIC to the cache, and for larger BDP values, the cache is rapidly overwritten before the application performs data copy of the cached data. As a result, we observe as much as a 24% drop in throughput-per-core.
• Implications. We need better orchestration of host resources among contending connections to minimize latency incurred at the host, and to minimize cache miss rates during data copy. In addition, window size tuning should take into account not only traditional metrics like latency and throughput, but also L3 cache sizes.

Host resource sharing considered harmful. We observe as much as a 66% difference in throughput-per-core across different traffic patterns (single flow, one-to-one, incast, outcast, and all-to-all) due to undesirable effects of multiple flows sharing host resources. For instance, multiple flows on the same NUMA node (thus sharing the same L3 cache) make cache performance even worse—the data DMAed by the NIC into the cache for one flow is polluted by the data DMAed by the NIC for other flows, before the application for the first flow can perform data copy. Multiple flows sharing host resources also results in packets belonging to different flows arriving interleaved at the NIC; this, in turn, makes packet processing overheads worse since existing optimizations (e.g., coalescing packets using generic receive offload) lose the chance to aggregate a larger number of packets. This increases per-byte processing overheads, and eventually scheduling overheads.
• Implications. In the Internet and in early-generation datacenter networks, performance bottlenecks were in the network core; thus, multiple flows "sharing" host resources did not have performance implications. However, for high-bandwidth networks, such is no longer the case—if the goal is to design CPU-efficient network stacks, one must carefully orchestrate host resources so as to minimize contention between active flows. Recent receiver-driven transport protocols [18, 35] can be extended to reduce the number of concurrently scheduled flows, potentially enabling high CPU efficiency for future network stacks.

The need to revisit host layering and packet processing pipelines. We observe as much as a ∼43% reduction in throughput-per-core compared to the single flow case when applications generating long flows share CPU cores with those generating short flows. This is due both to increased scheduling overheads and to high CPU overheads for short flow processing. In addition, short flows and long flows suffer from very different performance bottlenecks—the former have high packet processing overheads while the latter have high data copy overheads; however, today's network stacks use the same packet processing pipeline independent of the type of the flow. Finally, we observe a ∼20% additional drop in throughput-per-core when applications generating long flows are running on CPU cores that are not in the same NUMA domain as the NIC (due to additional data copy overheads).
• Implications. The design of CPU schedulers independent of the network layer was beneficial for the independent evolution of the two layers; however, with performance bottlenecks shifting to hosts, we need to revisit such a separation. For instance, application-aware CPU scheduling (e.g., scheduling applications that generate long flows on the NIC-local NUMA node, scheduling long-flow and short-flow applications on separate CPU cores, etc.) is required to improve CPU efficiency. We should also rethink host packet processing pipelines—unlike today's designs that use the same pipeline for short and long flows, achieving CPU efficiency requires application-aware packet processing pipelines.

Our study¹ not only corroborates many exciting ongoing activities in the systems, networking, and architecture communities on designing CPU-efficient host network stacks, but also highlights several interesting avenues for research in designing future operating systems, network protocols, and network hardware. We discuss them in §4.
Before diving deeper, we outline several caveats of our study. First, our study uses one particular host network stack (the Linux kernel) running atop one particular host hardware platform. While we focus on identifying trends and drawing general principles rather than individual data points, other combinations of host network stacks and hardware may exhibit different performance characteristics. Second, our study focuses on CPU utilization and throughput; host network stack latency is another important metric, but requires exploring many additional bottlenecks in the end-to-end system (e.g., network topology, switches, congestion, etc.); a study that establishes latency bottlenecks in host network stacks, and their contribution to end-to-end latency, remains an important and relatively less explored space. Third, kernel network stacks evolve rapidly; any study of our form must fix a version to ensure consistency across results and observations; nevertheless, our preliminary exploration [7] suggests that the most recent Linux kernel exhibits performance very similar to our results. Finally, our goal is not to take a position on how future network stacks will evolve (in-kernel, userspace, hardware), but rather to obtain a deeper understanding of a highly mature and widely deployed network stack.

¹ All Linux instrumentation code and scripts, along with all the documentation needed to reproduce our results, are available at https://github.com/Terabit-Ethernet/terabit-network-stack-profiling.

[Figure 1: sender- and receiver-side data path diagram]
Figure 1: Sender and receiver-side data path in the Linux network stack. See §2.1 for description.

Table 1: CPU usage taxonomy. The components are mapped into layers as shown in Fig. 1.
  Component            Description
  Data copy            From user space to kernel space, and vice versa.
  TCP/IP               All the packet processing at TCP/IP layers.
  Netdevice subsystem  Netdevice and NIC driver operations (e.g., NAPI polling, GSO/GRO, qdisc, etc.).
  skb management       Functions to build, split, and release skbs.
  Memory de-/alloc     skb de-/allocation and page-related operations.
  Lock/unlock          Lock-related operations (e.g., spin locks).
  Scheduling           Scheduling/context-switching among threads.
  Others               All the remaining functions (e.g., IRQ handling).

2 PRELIMINARIES
The Linux network stack tightly integrates many components into an end-to-end pipeline. We start this section by reviewing these components (§2.1). We also discuss commonly used optimizations, and corresponding hardware offloads supported by commodity NICs. A more detailed description is presented in [7]. We then summarize the methodology used in our study (§2.2).

2.1 End-to-End Data Path
The Linux network stack has slightly different data paths for the sender-side (application to NIC) and the receiver-side (NIC to application), as shown in Fig. 1. We describe them separately.

Sender-side. When the sender-side application executes a write system call, the kernel initializes socket buffers (skbs). For the data referenced by the skbs, the kernel then performs data copy from the userspace buffer to the kernel buffer. The skbs are then processed by the TCP/IP layer. When ready to be transmitted (e.g., congestion control window/rate limits permitting), the data is processed by the network subsystem; here, among other processing steps, skbs are segmented into Maximum Transmission Unit (MTU) sized chunks by Generic Segmentation Offload (GSO) and are enqueued in the NIC driver Tx queue(s). Most commodity NICs also support hardware offload of packet segmentation, referred to as TCP segmentation offload (TSO); see more details in [7]. Finally, the driver processes the Tx queue(s), creating the necessary mappings for the NIC to DMA the data from the kernel buffer referenced by the skbs. Importantly, almost all sender-side processing in today's Linux network stack is performed on the same core as the application.

Receiver-side. The NIC has a number of Rx queues and a per-Rx-queue page-pool from which DMA memory is allocated (backed by the kernel pageset). The NIC also has a configurable number of Rx descriptors, each of which contains a memory address that the NIC can use to DMA received frames. Each descriptor is associated with enough memory for one MTU-sized frame.
Upon receiving a new frame, the NIC uses one of the Rx descriptors, and DMAs the frame to the kernel memory associated with the descriptor. Ordinarily, the NIC DMAs the frame to DRAM; however, modern CPUs have support for Direct Cache Access (DCA) (e.g., using Intel's Data Direct I/O (DDIO) technology [25]), which allows the NIC to DMA frames directly to the L3 cache. DCA enables applications to avoid going to DRAM to access the data.
Asynchronously, the NIC generates an Interrupt ReQuest (IRQ) to inform the driver of new data to be processed. The CPU core that processes the IRQ is selected by the NIC using one of the hardware steering mechanisms; see Table 2 for a summary, and [7] for details on how receiver-side flow steering techniques work. Upon receiving an IRQ, the driver triggers NAPI polling [17], which provides an alternative to purely interrupt-based network layer processing—the system busy polls on incoming frames until a certain number of frames are received or a timer expires². This reduces the number of IRQs, especially for high-speed networks where the incoming data rate is high. While busy polling, the driver allocates an skb for each frame, and makes a cross reference between the skb and the kernel memory where the frame has been DMAed. If the NIC has written enough data to consume all Rx descriptors, the driver allocates more DMA memory using the page-pool and creates new descriptors.
The network subsystem then attempts to reduce the number of skbs by merging them using Generic Receive Offload (GRO), or its corresponding hardware offload, Large Receive Offload (LRO); see discussion in [7]. Next, TCP/IP processing is scheduled on one of the CPU cores using the flow steering mechanism enabled in the system (see Table 2).

² These NAPI parameters can be tuned via the net.core.netdev_budget and net.core.netdev_budget_usecs kernel parameters, which are set to 300 and 2ms by default in our Linux distribution.

Table 2: Receiver-side flow steering techniques.
  Mechanism                       Description
  Receive Packet Steering (RPS)   Use the 4-tuple hash for core selection.
  Receive Flow Steering (RFS)     Find the core that the application is running on.
  Receive Side Steering (RSS)     Hardware version of RPS supported by NICs.
  Accelerated RFS (aRFS)          Hardware version of RFS supported by NICs.

[Figure 2; panels: (a) Single, (b) One-to-one, (c) Incast, (d) Outcast, (e) All-to-all]
Figure 2: Traffic patterns used in our study. (a) Single flow from one sender core to one receiver core. (b) One flow from each sender core to a unique receiver core. (c) One flow from each sender core, all to a single receiver core. (d) One flow to each receiver core, all from a single sender core. (e) One flow between every pair of sender and receiver cores.

Importantly, with aRFS enabled, all the processing (the IRQ handler, TCP/IP and application) is performed on the same CPU core. Once scheduled, the TCP/IP layer processing is performed and all in-order skbs are appended to the socket's receive queue. Finally, the application thread performs data copy of the payload in the skbs in the socket receive queue to the userspace buffer. Note that at both the sender-side and the receiver-side, data copy of packet payloads is performed only once (when the data is transferred between userspace and kernel space). All other operations within the kernel are performed using metadata and pointer manipulations on skbs, and do not require data copy.
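To make the two copy points of §2.1 concrete, the sketch below shows them from the application's perspective: write() triggers the userspace-to-kernel copy at the sender, and read() triggers the kernel-to-userspace copy at the receiver. This is an illustrative sketch, not the benchmark code used in the paper; the 64KB chunk size and the function names are arbitrary choices.

    /* Hedged sketch of the single per-payload copy on each side. */
    #include <stdlib.h>
    #include <unistd.h>

    #define CHUNK (64 * 1024)

    /* Sender: each write() copies CHUNK bytes from the userspace
     * buffer into kernel skbs (the sender-side "Data copy" in Table 1). */
    void sender_loop(int fd)
    {
        char *buf = malloc(CHUNK);
        for (;;)
            if (write(fd, buf, CHUNK) <= 0)
                break;
        free(buf);
    }

    /* Receiver: each read() copies payload from in-order skbs on the
     * socket receive queue into the userspace buffer; this is the copy
     * that dominates receiver CPU usage in §3.1. */
    void receiver_loop(int fd)
    {
        char *buf = malloc(CHUNK);
        while (read(fd, buf, CHUNK) > 0)
            ;
        free(buf);
    }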

2.2 Measurement Methodology
In this subsection, we briefly describe our testbed setup, experimental scenarios, and measurement methodology.

Testbed setup. To ensure that bottlenecks are in the network stack, we set up a testbed with two servers directly connected via a 100Gbps link (without any intervening switches). Both of our servers have a 4-socket NUMA-enabled Intel Xeon Gold 6128 3.4GHz CPU with 6 cores per socket, 32KB/1MB/20MB L1/L2/L3 caches, 256GB RAM, and a 100Gbps Mellanox ConnectX-5 Ex NIC connected to one of the sockets. Both servers run Ubuntu 16.04 with Linux kernel 5.4.43. Unless specified otherwise, we enable DDIO, and disable hyperthreading and IOMMU in our experiments.

Experimental scenarios. We study network stack performance using five standard traffic patterns (Fig. 2)—single flow, one-to-one, incast, outcast, and all-to-all—using workloads that comprise long flows, short flows, and even a mix of long and short flows. For generating long flows, we use a standard network benchmarking tool, iPerf [14], which transmits a flow from sender to receiver; for generating short flows, we use netperf [22], which supports ping-pong style RPC workloads. Both of these tools perform minimal application-level processing, which allows us to focus on performance bottlenecks in the network stack (rather than those arising due to complex interactions between applications and the network stack); many of our results may have different characteristics if applications were to perform additional processing. We also study the impact of in-network congestion, of DDIO, and of IOMMU. We use Linux's default congestion control algorithm, TCP CUBIC, but also study the impact of different congestion control protocols. For each scenario, we describe the setup inline.

Performance metrics. We measure total throughput, total CPU utilization across all cores (using sysstat [19], which includes kernel and application processing), and throughput-per-core—the ratio of total throughput to total CPU utilization at the bottleneck (sender or receiver). To perform CPU profiling, we use the standard sampling-based technique to obtain a per-function breakdown of CPU cycles [20]. We take the top functions that account for ∼95% of the CPU utilization. By examining the kernel source code, we classify these functions into 8 categories, as described in Table 1.

3 LINUX NETWORK STACK OVERHEADS
We now evaluate the Linux network stack overheads for a variety of scenarios, and present detailed insights on the observed performance.

3.1 Single Flow
We start with the case of a single flow between the two servers, each running an application on a CPU core in the NIC-local NUMA node. We find that, unlike the Internet and early incarnations of datacenter networks where the throughput bottlenecks were primarily in the core of the network (since a single CPU was sufficient to saturate the access link bandwidth), high-bandwidth networks introduce new host bottlenecks even for the simple case of a single flow.
Before diving deeper, we make a note on our experimental configuration for the single flow case. When aRFS is disabled, obtaining stable and reproducible measurements is difficult since the default RSS mechanism uses a hash of the 4-tuple to determine the core for IRQ processing (§2.1). Since the 4-tuple can change across runs, the core that performs IRQ processing could be: (1) the application core; (2) a core on the same NUMA node; or (3) a core on a different NUMA node. The performance in each of these three cases is different, resulting in non-determinism. To ensure deterministic measurements, when aRFS is disabled, we model the worst-case scenario (case 3): we explicitly map the IRQs to a core on a NUMA node different from the application core. For a more detailed analysis of other possible IRQ mapping scenarios, see [7].

A single core is no longer sufficient. For 10−40Gbps access link bandwidths, a single thread was able to saturate the network bandwidth. However, such is no longer the case for high-bandwidth networks: as shown in Fig. 3(a), even with all optimizations enabled, the Linux network stack achieves a throughput-per-core of ∼42Gbps³. Both Jumbo frames⁴ and TSO/GRO reduce the per-byte processing overhead as they allow each skb to carry larger payloads (up to 9000B and 64KB, respectively). Jumbo frames are useful even when GRO is enabled, because the number of skbs to merge is reduced with a larger MTU size, thus reducing the processing overhead for packet aggregation in software.

³ We observe a maximum throughput-per-core of up to 55Gbps, either by tuning NIC Rx descriptors and TCP Rx buffer size carefully (see Fig. 3(e)), or by using LRO instead of GRO (see [7]). However, such parameter tuning is very sensitive to the hardware setup, and so we leave these at their default values for all other experiments. Moreover, the current implementation of LRO causes problems in some scenarios as it might discard important header data, and so is often disabled in the real world [10]. Thus we use GRO as the receive offload mechanism for the rest of our experiments.
⁴ Using a larger MTU size (9000 bytes) as opposed to the normal (1500 bytes).
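The short-flow workload described under "Experimental scenarios" (§2.2) is a netperf-style ping-pong RPC. The sketch below is a hypothetical client loop in that style, not the actual tool: a fixed-size request is sent and the full response awaited before the next request, so exactly one RPC is in flight per connection. The 4KB RPC size mirrors the smallest size used in §3.7.

    /* Hedged sketch of a ping-pong (request/response) RPC client. */
    #include <string.h>
    #include <sys/socket.h>

    #define RPC_SIZE 4096

    int pingpong_client(int fd, int iters)
    {
        char req[RPC_SIZE], resp[RPC_SIZE];
        memset(req, 'x', sizeof(req));

        for (int i = 0; i < iters; i++) {
            if (send(fd, req, sizeof(req), 0) != (ssize_t)sizeof(req))
                return -1;
            /* MSG_WAITALL blocks until the full response arrives, so the
             * application repeatedly sleeps and wakes, which is the source
             * of the scheduling overhead discussed for short flows. */
            if (recv(fd, resp, sizeof(resp), MSG_WAITALL) != (ssize_t)sizeof(resp))
                return -1;
        }
        return 0;
    }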

[Figure 3; panels: (a) Throughput-per-core (Gbps), (b) CPU utilization (%), (c) Sender CPU breakdown, (d) Receiver CPU breakdown, (e) Cache miss rate (%), (f) Latency from NAPI to start of data copy (average and 99th-percentile, in µs, vs. TCP Rx buffer size in KB)]
Figure 3: Linux network stack performance for the case of a single flow. (a) Each column shows throughput-per-core achieved for different combinations of optimizations. Within each column, optimizations are enabled incrementally, with each colored bar showing the incremental impact of enabling the corresponding optimization. (b) Sender and receiver total CPU utilization as all optimizations are enabled incrementally. Independent of the optimizations enabled, the receiver-side CPU is the bottleneck. (c, d) With all optimizations enabled, data copy is the dominant consumer of CPU cycles. (e) Increases in NIC ring buffer size and in TCP Rx buffer size result in increased cache miss rates and reduced throughput. (f) Network stack processing latency from NAPI to the start of data copy increases rapidly beyond certain TCP Rx buffer sizes. See §3.1 for description.

aRFS, along with DCA, generally improves throughput by enabling applications on NIC-local NUMA node cores to perform data copy directly from the L3 cache.

Receiver-side CPU is the bottleneck. Fig. 3(b) shows the overall CPU utilization at the sender and receiver sides. Independent of the optimizations enabled, the receiver-side CPU is the bottleneck. There are two dominant overheads that create the gap between sender and receiver CPU utilization: (1) data copy and (2) skb allocation. First, when aRFS is disabled, frames are DMAed to remote NUMA memory at the receiver; thus, data copy is performed across different NUMA nodes, increasing per-byte data copy overhead. This is not an issue on the sender-side since the local L3 cache is warm with the application send buffer data. Enabling aRFS alleviates this issue, reducing receiver-side CPU utilization by as much as 2× (right-most bar in Fig. 3(b)) compared to the case when no optimizations are enabled; however, CPU utilization at the receiver is still higher than at the sender. Second, when TSO is enabled, the sender is able to allocate large-sized skbs. The receiver, however, allocates MTU-sized skbs at the device driver and then the skbs are merged at the GRO layer. Therefore, the receiver incurs higher overheads for skb allocation.

Where are the CPU cycles going? Figs. 3(c) and 3(d) show the CPU usage breakdowns of the sender- and receiver-side for each combination of optimizations. With none of the optimizations, CPU overheads mainly come from TCP/IP processing as the per-skb processing overhead is high (here, skb size is 1500B on both sides⁵). When aRFS is disabled, lock overhead is high at the receiver-side because of socket contention: the application context thread (recv system call) and the interrupt context thread (softirq) attempt to access the same socket instance.
These packet processing overheads are mitigated with several optimizations: TSO allows using large-sized skbs at the sender-side, reducing both TCP/IP processing and netdevice subsystem overheads as segmentation is offloaded to the NIC (Fig. 3(c)). On the receiver-side, GRO reduces CPU usage by reducing the number of skbs passed to the upper layer, so TCP/IP processing and lock/unlock overheads are reduced dramatically, at the cost of increasing the overhead of the network device subsystem where GRO is performed (Fig. 3(d)). This GRO cost can be reduced by 66% by enabling Jumbo frames, as explained above. These reduced packet processing overheads lead to throughput improvement, and the main overhead is now shifted to data copy, which takes almost 49% of total CPU utilization at the receiver-side when GRO and Jumbo frames are enabled.
Once aRFS is enabled, co-location of the application context thread and the IRQ context thread at the receiver leads to improved cache and NUMA locality. The effects of this are two-fold:
(1) Since the application thread runs on the same NUMA node as the NIC, it can now perform data copy directly from the L3 cache (DMAed by the NIC via DCA). This reduces the per-byte data copy overhead, resulting in higher throughput-per-core.
(2) skbs are allocated in the softirq thread and freed in the application context thread (once data copy is done). Since the two are co-located, memory deallocation overhead reduces. This is because page free operations to local NUMA memory are significantly cheaper than those to remote NUMA memory.

Even a single flow experiences high cache misses. Although aRFS allows applications to perform data copy from the local L3 cache, we observe as much as a 49% cache miss rate in this experiment.

⁵ From Linux kernel 4.17 onwards, GSO is enabled by default. We modified the kernel to disable GSO in "no optimization" experiments to evaluate the benefits of skb aggregation.
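The co-location benefit described above can be approximated from userspace: a thread can ask the kernel which CPU last processed its socket's packets and pin itself there. This is only an illustration of the locality argument (aRFS performs the actual steering in NIC hardware, and the paper does not use this helper); SO_INCOMING_CPU has been available since Linux 3.19, and the function name below is ours.

    /* Hedged sketch: pin the application thread to the CPU that last
     * ran softirq processing for this socket, so data copy runs where
     * the payload is likely still warm in cache. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <sys/socket.h>

    int pin_to_incoming_cpu(int fd)
    {
        int cpu;
        socklen_t len = sizeof(cpu);

        /* CPU on which packets for this socket were last processed. */
        if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len) < 0)
            return -1;

        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }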

[Figure 4; panels: Throughput Per Core (Gbps), Receiver Cache Miss Rate (%), for NIC-local vs. NIC-remote NUMA]
Figure 4: Linux network stack performance for the case of a single flow on a NIC-remote NUMA node. When compared to the NIC-local NUMA node case, single flow throughput-per-core drops by ∼20%.

This is surprising since, for a single flow, there is no contention for L3 cache capacity. To investigate this further, we varied various parameters to understand their effect on the cache miss rate. Among our experiments, varying the maximum TCP receive window size and the number of NIC Rx descriptors revealed an interesting trend. Fig. 3(e) shows the variation of throughput and L3 cache miss rate with varying number of NIC Rx descriptors and varying TCP Rx buffer size⁶. We observe that, with an increase in either the number of NIC Rx descriptors or the TCP buffer size, the L3 cache miss rate increases and, correspondingly, the throughput decreases. We have found two reasons for this phenomenon: (1) BDP values being larger than the L3 cache capacity; and (2) suboptimal cache utilization.
To understand the first one, consider an extreme case of large TCP Rx buffer sizes. In such a case, TCP will keep BDP worth of data in flight, where BDP is defined as the product of access link bandwidth and latency (both network and host latency). It turns out that large TCP buffers can cause a significant increase in host latency, especially when the core processing packets becomes a bottleneck. In addition to the scheduling delay of IRQ context and application threads, we observe that each packet experiences large queueing behind previous packets. We measure the delay between frame reception and the start of data copy by logging the timestamp when NAPI processing for an skb happens and the timestamp when its data copy starts, and measuring the difference between the two. Fig. 3(f) shows the average and 99th percentile delays observed with varying TCP Rx buffer size. As can be seen, the delays rise rapidly as the TCP Rx buffer size increases beyond 1600KB. Given that the DCA cache size is limited⁷, this increase in latency has significant impact: since TCP buffers and BDP values are large, the NIC always has data to DMA; thus, since the data DMAed by the NIC is not promptly copied to userspace buffers, it is evicted from the cache when the NIC performs subsequent DMAs (if the NIC runs out of Rx descriptors, the driver replenishes the NIC Rx descriptors during NAPI polling). As a result, cache misses increase and throughput reduces. When TCP buffer sizes are large enough, this problem persists independent of NIC ring buffer sizes.
To understand the second reason, consider the other extreme where TCP buffer sizes are small but NIC ring buffer sizes are large. We believe cache misses in this case might be due to an imperfect cache replacement policy and/or the cache's complex addressing, resulting in suboptimal cache utilization; recent work has observed similar phenomena, although in a different context [15, 39]. When there are a large number of NIC Rx descriptors, there is a correspondingly larger number of memory addresses available for the NIC to DMA the data. Thus, even though the total amount of in-flight data is smaller than the cache capacity, the likelihood of a DCA write evicting some previously written data increases with the number of NIC Rx descriptors. This limits the effective utilization of cache capacity, resulting in high cache miss rates and low throughput-per-core.
Between these two extremes, both factors contribute to the observed performance in Fig. 3(e). Indeed, in our setup, DCA cache capacity is ∼3MB, and hence a TCP buffer size of 3200KB and fewer than 512 NIC Rx descriptors (512 × 9000 bytes ≈ 4MB) delivers the optimal single-core throughput of ∼55Gbps. An interesting observation here is that the default auto-tuning mechanism used in the Linux kernel network stack today is unaware of DCA effects, and ends up overshooting beyond the optimal operating point.

DCA limited to NIC-local NUMA nodes. In our analysis so far, the application was run on a CPU core on the NIC-local NUMA node. We now examine the impact of running the application on a NIC-remote NUMA node for the same single flow experiment. Fig. 4 shows the resulting throughput-per-core and L3 cache miss rate relative to the NIC-local case (with all optimizations enabled in both cases). When the application runs on a NIC-remote NUMA node, we see a significant increase in L3 cache miss rate and a ∼20% drop in throughput-per-core. Since aRFS is enabled, the NIC DMAs frames to the target CPU's NUMA node memory. However, because the target CPU core is on a NIC-remote NUMA node, DCA is unable to push the DMAed frame data into the corresponding L3 cache [25]. As a result, cache misses increase and throughput-per-core drops.

3.2 Increasing Contention via One-to-one
We now evaluate the Linux network stack with higher contention for the network bandwidth. Here, each sender core sends a flow to one unique receiver core, and we increase the number of cores/flows from 1 to 24. While each flow still has an entire host core for itself, this scenario introduces two new challenges compared to the single-flow case: (1) network bandwidth becomes saturated as multiple cores are used; and (2) flows run on both NIC-local and NIC-remote NUMA nodes (our servers have 6 cores on each NUMA node). Similar to §3.1, to obtain deterministic measurements when aRFS is disabled, we explicitly map IRQs for individual applications to a unique core on a different NUMA node.

Host optimizations become less effective with increasing number of flows. Fig. 5(a) shows that, as the number of flows increases, throughput-per-core decreases by 64% (i.e., to 15Gbps at 24 flows), despite each core processing only a single flow. This is because of the reduced effectiveness of all optimizations. In particular, when compared to the single flow case, the effectiveness of aRFS reduces by as much as 75% for the 24-flow case; this is due to reduced L3 cache locality for data copy on NIC-local NUMA node cores (all cores share the L3 cache), and also due to some of the flows running on NIC-remote NUMA nodes (that cannot exploit DCA; see §3.1, Fig. 4). The effectiveness of GRO also reduces: since packets at the receiver are now interleaved across flows, there are fewer opportunities for aggregation; this will become far more prominent in the all-to-all case, and is discussed in more depth in §3.5.

⁶ The kernel uses an auto-tuning mechanism for the TCP Rx socket buffer size with the goal of maximizing throughput. In this experiment, we override the default auto-tuning mechanism by specifying an Rx buffer size.
⁷ DCA can only use 18% (∼3MB) of the L3 cache capacity in our setup.
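Footnote 6 mentions overriding the kernel's Rx buffer auto-tuning; from an application this can be done per socket as sketched below (an equivalent system-wide override via sysctl is also possible). Setting SO_RCVBUF before connecting disables auto-tuning for that socket and caps the advertised receive window. The 1600KB value corresponds to the knee observed in Fig. 3(f) on this particular testbed; it is an illustration, not a general recommendation.

    /* Hedged sketch: cap the TCP receive buffer for one socket,
     * overriding the kernel's auto-tuning (see footnote 6). */
    #include <sys/socket.h>

    int cap_rx_buffer(int fd)
    {
        int bytes = 1600 * 1024;
        /* The kernel doubles this value internally to account for
         * bookkeeping overhead (see socket(7)). */
        return setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes));
    }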

[Figure 5; panels: (a) Throughput-per-core (Gbps) with total throughput, (b) Sender CPU breakdown, (c) Receiver CPU breakdown]
Figure 5: Linux network stack performance for one-to-one traffic pattern. (a) Each column shows throughput-per-core achieved for different numbers of flows. At 8 flows, the network is saturated; however, throughput-per-core decreases with more flows. (b, c) With all optimizations enabled, as the number of flows increases, the fraction of CPU cycles spent in data copy decreases. On the receiver-side, network saturation leads to lower memory management overhead (due to better page recycling) and higher scheduling overhead (due to frequent idling). The overall receiver-side CPU utilizations for x = 1, 8, 16 and 24 are 1, 3.75, 5.21 and 6.58 cores, respectively. See §3.2 for description.

[Figure 6; panels: (a) Throughput-per-core (Gbps) with total throughput, (b) Receiver CPU breakdown, (c) L3 cache miss rate (%)]
Figure 6: Linux network stack performance for incast traffic pattern. (a) Each column shows throughput-per-core for different numbers of flows (the receiver core is bottlenecked in all cases). Total throughput decreases with increase in the number of flows. (b) With all optimizations enabled, the fraction of CPU cycles used by each component does not change significantly with the number of flows. See [7] for sender-side CPU breakdown. (c) Receiver-side cache miss rate increases with the number of flows, resulting in higher per-byte data copy overhead and reduced throughput-per-core. See §3.3 for description.

Processing overheads shift with network saturation. As shown in Fig. 5(a), at 8 flows, the network link becomes the bottleneck, and throughput ends up getting fairly shared among all cores. Fig. 5(c) shows that bottlenecks shift in this regime: scheduling overhead increases and memory management overhead decreases. Intuitively, when the network is saturated, the receiver cores start to become idle at certain times—threads repeatedly go to sleep while waiting for data, and wake up when new data arrives; this results in increased context switching and scheduling overheads. This effect becomes increasingly prominent with increase in the number of flows (Fig. 5(b), Fig. 5(c)), as the CPU utilization per core decreases.
To understand the reduction in memory alloc/dealloc overheads, we observe that the kernel page allocator maintains a per-core pageset that includes a certain number of free pages. Upon an allocation request, pages can be fetched directly from the pageset, if available; otherwise the global free-list needs to be accessed (which is a more expensive operation). When multiple flows share the access link bandwidth, each core serves a relatively smaller amount of traffic compared to the single flow case. This allows used pages to be recycled back to the pageset before it becomes empty, hence reducing the memory allocation overhead (Fig. 5(c)).

3.3 Increasing Receiver Contention via Incast
We now create additional contention at the receiver core using an incast traffic pattern, varying the number of flows from 1 to 24 (each using a unique core at the sender). Compared to previous scenarios, this scenario induces higher contention for (1) CPU resources such as the L3 cache and (2) CPU scheduling among application threads. We discuss how these changes affect the network processing overheads.

Per-byte data copy overhead increases with increasing flows per core. Fig. 6(a) shows that throughput-per-core decreases with increase in the number of flows, observing as much as a ∼19% drop with 8 flows when compared to the single-flow case. Fig. 6(b) shows that the CPU breakdown does not change significantly with increasing number of flows, implying that there is no evident shift in CPU overheads. Fig. 6(c) provides some intuition for the root cause of the throughput-per-core degradation. As the number of flows per core increases at the receiver side, applications for different flows compete for the same L3 cache space, resulting in an increased cache miss rate (the miss rate increases from 48% to 78% as the number of flows goes from 1 to 8). Among other things, this leads to increased per-byte data copy overhead and reduced throughput-per-core. As shown in Fig. 6(c), the increase in L3 cache miss rate with increasing flows correlates well with the degradation in throughput-per-core.

Sender-driven nature of TCP precludes receiver-side scheduling. The higher cache contention observed above is the result of multiple active flows on the same core. While senders could potentially reduce such contention using careful flow scheduling, the issue at the receiver side is fundamental: the sender-driven nature of the TCP protocol precludes the receiver from controlling the number of active flows per core, resulting in unavoidable CPU inefficiency. We believe receiver-driven protocols [18, 35] can provide such control to the receiver, thus enabling CPU-efficient transport designs.
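For reference, the incast receiver in §3.3 boils down to one core draining many long-lived connections. A skeletal version of such a receiver is sketched below (listener setup, error handling, and measurement are omitted, and the code is ours rather than the paper's harness). Because a single thread serves all sockets, the flows contend for the same core and L3 cache, which is exactly the contention this section analyzes.

    /* Hedged sketch: single-threaded receiver draining many TCP flows. */
    #include <stdlib.h>
    #include <sys/epoll.h>
    #include <unistd.h>

    #define CHUNK (64 * 1024)

    void drain_flows(int *fds, int nflows)
    {
        int ep = epoll_create1(0);
        for (int i = 0; i < nflows; i++) {
            struct epoll_event ev = { .events = EPOLLIN, .data.fd = fds[i] };
            epoll_ctl(ep, EPOLL_CTL_ADD, fds[i], &ev);
        }

        char *buf = malloc(CHUNK);
        struct epoll_event events[64];
        for (;;) {
            int n = epoll_wait(ep, events, 64, -1);
            for (int i = 0; i < n; i++) {
                /* The kernel-to-userspace data copy happens here, once per
                 * ready socket; with many flows the DMAed payload is often
                 * already evicted from L3 by the time it is copied (Fig. 6(c)). */
                ssize_t r = read(events[i].data.fd, buf, CHUNK);
                (void)r;
            }
        }
    }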

[Figure 7; panels: (a) Throughput-per-core (Gbps), (b) Sender CPU breakdown, (c) CPU utilization (%)]
Figure 7: Linux network stack performance for outcast traffic pattern. (a) Each column shows throughput-per-sender-core achieved for different numbers of flows, that is, the maximum throughput sustainable using a single sender core (we ignore receiver core utilization here). Throughput-per-sender-core increases from 1 to 8 flows, and then decreases as the number of flows increases. (b) With all optimizations enabled, as the number of flows increases from 1 to 8, data copy overhead increases, but it does not change much when the number of flows is increased further. Refer to [7] for receiver-side CPU breakdown. (c) For 1 flow, the sender-side CPU is underutilized. Sender-side cache miss rate increases slightly as the number of flows increases from 8 to 24, increasing the per-byte data copy overhead, and there is a corresponding decrease in throughput-per-core. See §3.4 for description.

3.4 Increasing Sender Contention via Outcast
All our experiments so far result in the receiver being the bottleneck. To evaluate the sender-side processing pipeline, we now use an outcast scenario where a single sender core transmits an increasing number of flows (1 to 24), each to a unique receiver core. To understand the efficiency of the sender-side processing pipeline, this subsection focuses on throughput-per-sender-core: that is, the maximum throughput achievable by a single sender core.

Sender-side processing pipeline can achieve up to 89Gbps per core. Fig. 7(a) shows that, with increase in the number of flows from 1 to 8, throughput-per-sender-core increases significantly, enabling total throughput as high as ∼89Gbps; in particular, throughput-per-sender-core is 2.1× when compared to throughput-per-receiver-core in the incast scenario (§3.3). This demonstrates that, in today's Linux network stack, the sender-side processing pipeline is much more CPU-efficient than the receiver-side processing pipeline. We briefly discuss some insights below.
The first insight is related to the efficiency of TSO. As shown in Fig. 7(a), TSO in the outcast scenario contributes more to throughput-per-core improvements than GRO does in the incast scenario (§3.3). This is due to two reasons. First, TSO is a hardware offload mechanism supported by the NIC; thus, unlike GRO which is software-based, there are no CPU overheads associated with TSO processing. Second, unlike GRO, the effectiveness of TSO does not degrade noticeably with increasing number of flows since data from applications is always put into 64KB-sized skbs independent of the number of flows. Note that Jumbo frames do not help over TSO as much as in the previous cases, since segmentation is now performed in the NIC.
Second, aRFS continues to provide significant benefits, contributing as much as ∼46% of the total throughput-per-sender-core. This is because, as discussed earlier, the L3 cache at the sender is always warm: while the cache miss rate increases slightly with a larger number of flows, the absolute number remains low (∼11% even with 24 flows); furthermore, the outcast scenario ensures that not too many flows compete for the same L3 cache at the receiver (due to receiver cores being distributed across multiple NUMA nodes). Fig. 7(b) shows that data copy continues to be the dominant CPU consumer, even when the sender is the bottleneck.

3.5 Maximizing Contention with All-to-All
We now evaluate Linux network stack performance for all-to-all traffic patterns, where each of x sender cores transmits a flow to each of the x receiver cores, for x varying from 1 to 24. In this scenario, we were unable to explicitly map IRQs to specific cores because, for the largest number of flows (576), the number of flow steering entries required is larger than what can be installed on our NIC. Nevertheless, even without explicit mapping, we observed reasonably deterministic results for this scenario since the randomness across a large number of flows averages out.
Fig. 8(a) shows that throughput-per-core reduces by ∼67% going from 1 × 1 to 24 × 24 flows, due to reduced effectiveness of all optimizations. The benefits of aRFS drop by ∼64%, almost the same as observed in the one-to-one scenario (§3.2). This is unsurprising, given the lack of cache locality for cores in non-NIC-local NUMA nodes, and given that the cache miss rate is already abysmal (as discussed in §3.2). Increasing the number of flows per core on top of this does not make things worse in terms of cache miss rate.

Per-flow batching opportunities reduce due to large number of flows. Similar to the one-to-one case, the network link becomes the bottleneck in this scenario, resulting in fair-sharing of bandwidth among flows. Since there are a large number of flows (e.g., 24 × 24 with 24 cores), each flow achieves very small throughput (or alternatively, the number of packets received for any flow in a given time window is very small). This results in reduced effectiveness of optimizations like GRO (that operate on a per-flow basis) since they do not have enough packets in each flow to aggregate. As a result, upper layers receive a larger number of smaller skbs, increasing packet processing overheads.
Fig. 8(c) shows the distribution of skb sizes (post-GRO) for varying numbers of flows. We see that as the number of flows increases, the average skb size reduces, supporting our argument above about the reduced effectiveness of GRO. We note that the above phenomenon is not unique to the all-to-all scenario: the number of flows sharing a bottleneck resource also increases in the incast and one-to-one scenarios. Indeed, this effect would also be present in those scenarios; however, the total number of flows in those cases is not large enough to make these effects noticeable (a maximum of 24 flows in incast and one-to-one versus 24 × 24 flows in all-to-all).

[Figure 8; panels: (a) Throughput-per-core (Gbps) with total throughput, (b) Receiver CPU breakdown, (c) skb size distribution (KB)]
Figure 8: Linux network stack performance for all-to-all traffic pattern. (a) Each column shows throughput-per-core achieved for different numbers of flows. With 8 × 8 flows, the network is fully saturated. Throughput-per-core decreases as the number of flows increases. (b) With all optimizations enabled, as the number of flows increases, the fraction of CPU cycles spent in data copy decreases. On the receiver-side, network saturation leads to lower memory management overhead (due to better page recycling) and higher scheduling overhead (due to frequent idling and a greater number of threads per core). TCP/IP processing overhead increases due to smaller skb sizes. The overall receiver-side CPU utilizations for x = 1 × 1, 8 × 8, 16 × 16 and 24 × 24 are 1, 4.07, 5.56 and 6.98 cores, respectively. See [7] for sender-side CPU breakdown. (c) The fraction of 64KB skbs after GRO decreases as the number of flows increases because the larger number of flows prevents effective aggregation of received packets. See §3.5 for description.

[Figure 9; panels: (a) Throughput-per-core (Gbps), (b) CPU utilization (%), (c) Sender CPU breakdown, (d) Receiver CPU breakdown]
Figure 9: Linux network stack performance for the case of a single flow, with varying packet drop rates. (a) Each column shows throughput-per-core achieved for a specific packet drop rate. Throughput-per-core decreases as the packet drop rate increases. (b) As the packet drop rate increases, the gap between sender and receiver CPU utilization decreases because the sender spends more cycles on retransmissions. (c, d) With all optimizations enabled, as the packet drop rate increases, the overhead of TCP/IP processing and the netdevice subsystem increases. See §3.6 for description.

3.6 Impact of In-network Congestion
In-network congestion may lead to packet drops at switches, which in turn impact both sender- and receiver-side packet processing. In this subsection, we study the impact of such packet drops on CPU efficiency. To this end, we add a network switch between the two servers, and program the switch to drop packets randomly. We increase the loss rate from 0 to 0.015 in the single flow scenario from §3.1, and observe the effect on throughput and CPU utilization at both sender and receiver.

Impact on throughput-per-core is minimal. As shown in Fig. 9(a), the throughput-per-core decreases by ∼24% as the drop rate is increased from 0 to 0.015. Fig. 9(b) shows that the receiver-side CPU utilization decreases with increasing loss rate. As a result, the total throughput becomes lower than throughput-per-core, and the gap between the two increases. Interestingly, the throughput-per-core slightly increases when the loss rate goes from 0 to 0.00015. We observe that the corresponding receiver-side cache miss rate is reduced from 48% to 37%. This is because packet loss essentially reduces the TCP sending rate, thus resulting in better cache hit rates at the receiver-side.
Figs. 9(c) and 9(d) show CPU profiling breakdowns for different loss rates. With increasing loss rate, at both sender and receiver, we see that the fraction of CPU cycles spent in TCP, netdevice subsystem, and other (etc.) processing increases, hence leaving fewer available cycles for data copy.

The minimal impact is due to increased ACK processing. Upon detailed CPU profiling, we found increased ACK processing and packet retransmissions to be the main causes of the increased overheads. In particular:
• At the receiver, the fraction of CPU cycles spent in generating and sending ACKs increases by 4.87× (1.52% → 7.4%) as the loss rate goes from 0 to 0.015. This is because, when a packet is dropped, the receiver gets out-of-order TCP segments, and ends up sending duplicate ACKs to the sender. This contributes to an increase in both TCP and netdevice subsystem overheads.
• At the sender, the fraction of CPU cycles spent in processing ACKs increases by 1.45× (5.79% → 8.41%) as the loss rate goes from 0 to 0.015. This is because the sender has to process additional duplicate ACKs. Further, the fraction of CPU spent in packet retransmission operations increases by 1.34%. Both of these contribute to an increase in TCP and netdevice subsystem overheads, while the former contributes to increased IRQ handling (which is classified under "etc." in our taxonomy).

Sender observes higher impact of packet drops. Fig. 9(b) shows the CPU utilization at the sender and the receiver. As drop rates increase, the gap between sender and receiver utilization decreases, indicating that the increase in CPU overheads is higher at the sender side. This is due to the fact that, upon a packet drop, the sender is responsible for doing the bulk of the heavy lifting in terms of congestion control and retransmission of the lost packet.
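The retransmission activity discussed in this section can also be observed per connection from userspace via the standard TCP_INFO socket option, as sketched below. This is a convenience for reproducing the trend, not the kernel-level CPU profiling used in the paper; field availability in struct tcp_info varies somewhat across kernel versions.

    /* Hedged sketch: read retransmission counters for a connected socket. */
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <stdio.h>
    #include <sys/socket.h>

    void print_retrans(int fd)
    {
        struct tcp_info ti;
        socklen_t len = sizeof(ti);

        if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) == 0)
            printf("retrans in flight: %u, total retrans: %u, rtt: %u us\n",
                   ti.tcpi_retrans, ti.tcpi_total_retrans, ti.tcpi_rtt);
    }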

[Figure 10; panels: (a) Throughput-per-core (Gbps), (b) Server CPU breakdown, (c) NIC-remote NUMA effect (4KB)]
Figure 10: Linux network stack performance for short flow, 16:1 incast traffic pattern, with varying RPC sizes. (a) Each column shows throughput-per-core achieved for a specific RPC size. Throughput-per-core increases with increasing RPC size. For small RPCs, optimizations like GRO do not provide much benefit due to fewer aggregation opportunities. (b) With all optimizations enabled, data copy quickly becomes the bottleneck. The server-side CPU was completely utilized in all scenarios. See [7] for client-side CPU breakdown. (c) Unlike long flows, no significant throughput-per-core drop is observed even when the application runs on a NIC-remote NUMA node core at the server. See §3.7 for description.

3.7 Impact of Flow Sizes
We now study the impact of flow sizes on the Linux network stack performance. We start with the case of short flows: a ping-pong style RPC workload, with message sizes for both request/response being equal, and varying from 4KB to 64KB. Since a single short flow is unable to bottleneck CPU at either the sender or the receiver, we consider the incast scenario—16 applications on the sender send ping-pong RPCs to a single application on the receiver (the latter becoming the bottleneck). Following the common deployment scenario, each application uses a long-running TCP connection. We also evaluate the impact of workloads that comprise a mix of both long and short flows. For this scenario, we use a single core at both the sender and the receiver. We run a single long flow, and mix it with a variable number of short flows. We set the RPC size of short flows to 4KB.

DCA does not help much when workloads comprise extremely short flows. Fig. 10(a) shows that, as expected, throughput-per-core increases with increase in flow sizes. We make several observations. First, as shown in Fig. 10(b), data copy is no longer the prominent consumer of CPU cycles for extremely small flows (e.g., 4KB)—TCP/IP processing overhead is higher due to low GRO effectiveness (small flow sizes make it hard to batch skbs), and scheduling overhead is higher due to the ping-pong nature of the workload causing applications to repeatedly block while waiting for data. Second, data copy not being the dominant consumer of CPU cycles for extremely short flows results in DCA not contributing to the overall performance as much as it did in the long-flow case: as shown in Fig. 10(c), while NIC-local NUMA nodes achieve significantly lower cache miss rates when compared to NIC-remote NUMA nodes, the difference in throughput-per-core is only marginal. Third, while DCA benefits reduce for extremely short flows, other cache locality benefits of aRFS still apply: for example, skb accesses during packet processing benefit from cache hits. However, these benefits are independent of the NUMA node on which the application runs. The above three observations suggest interesting opportunities for orchestrating host resources between long and short flows: while executing on NIC-local NUMA nodes helps long flows significantly, short flows can be scheduled on NIC-remote NUMA nodes without any significant impact on performance; in addition, carefully scheduling the core across short flows sharing the core can lead to further improvements in throughput-per-core (a minimal placement sketch follows this subsection's discussion).

(a) Throughput-per-core (Gbps) (b) Server CPU breakdown

Figure 11: Linux network stack performance for workloads that mix long and short flows on a single core. (a) Each column shows throughput-per-core achieved for different numbers of short flows colocated with a long flow. Throughput-per-core decreases with increasing number of short flows. (b) Even with 16 short flows colocated with a long flow, data copy overheads dominate, but TCP/IP processing and scheduling overheads start to consume significant CPU cycles. The server-side CPU was completely utilized for all scenarios; refer to [7] for client-side CPU breakdown. See §3.7 for description.

We note that all the observations above become relatively obsolete even with a slight increase in flow sizes—with just 16KB RPCs, data copy becomes the dominant factor, and with 64KB RPCs, the CPU breakdown becomes very similar to the case of long flows.

Mixing long and short flows considered harmful. Fig. 11(a) shows that, as expected, the overall throughput-per-core drops by ∼43% as the number of short flows colocated with the long flow is increased from 0 to 16. More importantly, while throughput-per-core for a single long flow and 16 short flows is ∼42Gbps (§3.1) and ∼6.15Gbps in isolation (no mixing), it drops to ∼20Gbps and ∼2.6Gbps, respectively, when the two are mixed (48% and 42% reduction for long and short flows). This suggests that CPU-efficient network stacks should avoid mixing long and short flows on the same core.
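Today's kernel already provides the low-level knobs needed to act on these placement observations; what is missing is a policy that uses them. As a minimal illustrative sketch (not our experimental harness), the snippet below pins the calling thread to a chosen core; a long-flow application would pass a core on the NIC-local NUMA node (which can be identified on Linux via /sys/class/net/<iface>/device/numa_node), while short-flow applications could be placed on NIC-remote cores. The core id used here is an assumption for illustration.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to a single core (pid 0 = calling thread). */
static int pin_to_core(int core_id)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}

int main(void)
{
    int nic_local_core = 0;   /* assumption: core 0 sits on the NIC-local NUMA node */
    return pin_to_core(nic_local_core) == 0 ? 0 : 1;
}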
(a) Throughput-per-core (Gbps) (b) Sender CPU breakdown (c) Receiver CPU breakdown

Figure 12: Impact of DCA and IOMMU on Linux network stack performance. (a) Each column shows throughput-per-core achieved for different DCA and IOMMU configurations: Default has DCA enabled and IOMMU disabled. Either disabling DCA or enabling IOMMU leads to a decrease in throughput-per-core. (b, c) Disabling DCA does not cause a significant shift in CPU breakdown. Enabling IOMMU causes a significant increase in memory management overheads at both the sender and the receiver. See §3.8 and §3.9 for description.
(a) Throughput-per-core (Gbps) (b) Sender-side CPU breakdown (c) Receiver CPU breakdown (IOMMU enabled)

Figure 13: Impact of congestion control protocols on Linux network stack performance. (a) Each column shows throughput-per-core achieved for different congestion control protocols. There is no significant change in throughput-per-core across protocols. (b, c) BBR causes a higher scheduling overhead on the sender-side. On the receiver-side, the CPU utilization breakdowns are largely similar. The receiver-side core is fully utilized for all protocols. See §3.10 for description.
3.8 Impact of DCA
All our experiments so far were run with DCA enabled (as is the case by default on Intel Xeon processors). To understand the benefits of DCA, we now rerun the single flow scenario from §3.1, but with DCA disabled. Fig. 12(a) shows the throughput-per-core without DCA relative to the scenario with DCA enabled (Default), as each of the optimizations is incrementally enabled. Unsurprisingly, with all optimizations enabled, we observe a 19% degradation in throughput-per-core when DCA is disabled. In particular, we see a ∼50% reduction in the effectiveness of aRFS; this is expected since disabling DCA reduces the data copy benefits of the NIC DMAing the data directly into the L3 cache. The other benefits of aRFS (§3.1) still apply. Without DCA, the receiver-side remains the bottleneck, and we do not observe any significant shift in the CPU breakdowns at sender and receiver (Figs. 12(b) and 12(c)).

3.9 Impact of IOMMU
The IOMMU (IO Memory Management Unit) is used in virtualized environments to efficiently virtualize fast IO devices. Even in non-virtualized environments, it is useful for memory protection. With an IOMMU, devices specify virtual addresses in DMA requests, which the IOMMU subsequently translates into physical addresses while implementing memory protection checks. By default, the IOMMU is disabled in our setup. In this subsection, we study the impact of the IOMMU on Linux network stack performance for the single flow scenario (§3.1).

The key take-away from this subsection is that the IOMMU, due to increased memory management overheads, results in significant degradation in network stack performance. As seen in Fig. 12(a), enabling the IOMMU reduces throughput-per-core by 26% (compared to Default). Figs. 12(b) and 12(c) show the core reason for this degradation: memory alloc/dealloc becomes more prominent in CPU consumption at both sender and receiver (now consuming 30% of CPU cycles at the receiver). This is because of two additional per-page operations required by the IOMMU: (1) when the NIC driver allocates new pages for DMA, it has to also insert these pages into the device's pagetable (domain) on the IOMMU; (2) once DMA is done, the driver has to unmap those pages. These two additional per-page operations result in increased overheads.
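To make the per-page cost concrete, the fragment below sketches a generic receive-buffer refill path using the kernel DMA API; it is an illustrative sketch, not the code of the NIC driver used in our testbed. With the IOMMU enabled, dma_map_page() must install an IOVA-to-physical translation in the device's IOMMU domain, and dma_unmap_page() must later tear it down; these are exactly the two per-page operations described above.

#include <linux/dma-mapping.h>
#include <linux/gfp.h>
#include <linux/mm.h>

/* Illustrative Rx refill: one page is allocated and DMA-mapped per receive
 * buffer. With the IOMMU on, the map/unmap pair also updates the device's
 * IO page table, adding per-page overhead on the datapath. */
static dma_addr_t rx_refill_one(struct device *dev, struct page **pagep)
{
	struct page *page = alloc_page(GFP_ATOMIC);
	dma_addr_t dma;

	if (!page)
		return DMA_MAPPING_ERROR;

	dma = dma_map_page(dev, page, 0, PAGE_SIZE, DMA_FROM_DEVICE);
	if (dma_mapping_error(dev, dma)) {
		__free_page(page);
		return DMA_MAPPING_ERROR;
	}
	*pagep = page;
	return dma;	/* IOVA posted to the NIC descriptor ring */
}

static void rx_complete_one(struct device *dev, struct page *page, dma_addr_t dma)
{
	/* After the NIC has DMAed the packet, the mapping must be torn down
	 * before the page is recycled or passed up the stack. */
	dma_unmap_page(dev, dma, PAGE_SIZE, DMA_FROM_DEVICE);
	__free_page(page);
}

How aggressively drivers batch or cache such mappings varies; in the configuration measured here, the cost surfaces as the memory alloc/dealloc overhead visible in Figs. 12(b) and 12(c).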
3.10 Impact of Congestion Control Protocols
Our experiments so far use TCP CUBIC, the default congestion control algorithm in Linux. We now study the impact of congestion control algorithms on network stack performance using two other popular algorithms implemented in Linux, BBR [8] and DCTCP [1], again for the single flow scenario (§3.1). Fig. 13(a) shows that the choice of congestion control algorithm has minimal impact on throughput-per-core. This is because, as discussed earlier, the receiver-side is the core throughput bottleneck in high-speed networks; all these algorithms, being "sender-driven", have minimal difference in the receiver-side logic. Indeed, the receiver-side CPU breakdowns are largely the same for all protocols (Fig. 13(c)). BBR has relatively higher scheduling overheads on the sender-side (Fig. 13(b)); this is because BBR uses pacing for rate control (with qdisc) [42], and repeated thread wakeups when packets are released by the pacer result in increased scheduling overhead.
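For completeness, switching among these algorithms requires no application redesign: the congestion control module is selected per socket via the TCP_CONGESTION socket option (or system-wide via the net.ipv4.tcp_congestion_control sysctl), provided the corresponding module is built into or loaded in the kernel. A minimal sketch:

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>   /* TCP_CONGESTION */

/* Select the congestion control algorithm ("cubic", "bbr", "dctcp", ...)
 * for a single socket. */
static int set_cc(int fd, const char *algo)
{
    if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, algo, strlen(algo)) < 0) {
        perror("setsockopt(TCP_CONGESTION)");
        return -1;
    }
    return 0;
}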
4 FUTURE DIRECTIONS
We have already discussed several immediate avenues of future research in individual subsections—e.g., optimizations to today's Linux network stack (e.g., independent scaling of each processing layer in the stack, rethinking TCP auto-tuning mechanisms for receive buffer sizing, window/rate mechanisms incorporating host bottlenecks, etc.), extensions to DCA (e.g., revisiting L3 cache management, support for NIC-remote NUMA nodes, etc.) and, in general, the idea of considering host bottlenecks when designing network stacks for high-speed networks. In this section, we outline a few more forward-looking avenues of future research.

Zero-copy mechanisms. The Linux kernel has recently introduced new mechanisms to achieve zero-copy transmission and reception on top of the TCP/IP stack:
• For zero-copy on the sender-side, the kernel now has the MSG_ZEROCOPY feature [11] (since kernel 4.14), which pins application buffers upon a send system call, allowing the NIC to directly fetch this data through DMA reads (see the sketch after this list).
• For zero-copy on the receiver-side, the kernel now supports a special mmap overload for TCP sockets [12] (since kernel 4.18). This implementation enables applications to obtain a virtual address that is mapped by the kernel to the physical address where the NIC DMAs the data.
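As a concrete illustration of the sender-side path, the sketch below follows the usage pattern documented for MSG_ZEROCOPY: the socket opts in via SO_ZEROCOPY, data is sent with the MSG_ZEROCOPY flag, and a completion notification is later reaped from the socket error queue once the kernel is done with the pinned pages. This is a simplified sketch; parsing of the completion range (struct sock_extended_err) and partial-send handling are omitted.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/errqueue.h>   /* struct sock_extended_err (completion parsing) */

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

/* Send one buffer with zero-copy and wait for its completion notification.
 * The pages backing buf must not be modified or reused until then. */
static int send_zerocopy(int fd, const void *buf, size_t len)
{
    int one = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0)
        return -1;
    if (send(fd, buf, len, MSG_ZEROCOPY) < 0)
        return -1;

    /* Completions are delivered on the socket error queue. */
    for (;;) {
        char control[128];
        struct msghdr msg = { .msg_control = control,
                              .msg_controllen = sizeof(control) };

        if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0) {
            if (errno == EAGAIN || errno == EWOULDBLOCK)
                continue;            /* in practice: epoll/poll on the socket */
            return -1;
        }
        return 0;                    /* notification reaped; buf is reusable */
    }
}

Note that for small messages the kernel may fall back to copying (the notification still arrives), so the mechanism pays off mainly for the large-message, long-flow workloads where §3 shows data copy to dominate.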
Some specialized applications [13, 26] have demonstrated achieving ∼100Gbps of throughput-per-core using the sender-side zero-copy mechanism. However, as we showed in §3, the receiver is likely to be the throughput bottleneck for many applications in today's Linux network stack. Hence, it is more crucial to eliminate data copy overheads on the receiver-side. Unfortunately, the above receiver-side zero-copy mechanism requires changes in the memory management semantics, and thus requires non-trivial application-layer modifications. Linux eXpress Data Path (XDP) [23] offers zero-copy operations for applications that use AF_XDP sockets [29] (introduced in kernel 4.18), but requires reimplementation of the entire network and transport protocols in userspace. It would be interesting to explore zero-copy mechanisms that do not require application modifications and/or reimplementation of network protocols; if feasible, such mechanisms will allow today's Linux network stack to achieve 100Gbps throughput-per-core with minimal or no modifications.

CPU-efficient transport protocol design. The problem of transport design has traditionally focused on designing congestion and flow control algorithms to achieve a multi-objective optimization goal (e.g., a combination of objectives like low latency, high throughput, etc.). This state of affairs is because, for the Internet and for early incarnations of datacenter networks, performance bottlenecks were primarily in the core of the network. Our study suggests that this is no longer the case: the adoption of high-bandwidth links shifts performance bottlenecks to the host. Thus, future protocol designs should explicitly orchestrate host resources (just like they orchestrate network resources today), e.g., by taking into account not just traditional metrics like latency and throughput, but also available cores, cache sizes and DCA capabilities. Recent receiver-driven protocols [18, 35] have the potential to enable such fine-grained orchestration of both the sender and the receiver resources.

Rearchitecting the host stack. We discuss two directions in relatively clean-slate design for future network stacks. First, today's network stacks use a fairly static packet processing pipeline for each connection—the entire pipeline (buffers, protocol processing, host resource provisioning, etc.) is determined at the time of socket creation, and remains unchanged during the socket lifetime, independent of other connections and their host resource requirements. This is one of the core reasons for the many bottlenecks identified in our study: when the core performing data copy becomes the bottleneck for long flows, there is no way to dynamically scale the number of cores performing data copy; even if short flows and long flows have different bottlenecks, the stack uses a completely application-agnostic processing pipeline; and there is no way to dynamically allocate host resources to account for changes in contention upon new flow arrivals. As performance bottlenecks shift to hosts, we should rearchitect the host network stack to achieve a design that is both more dynamic (allows transparent and independent scaling of host resources to individual connections) and more application-aware (exploits characteristics of applications colocated on a server to achieve improved host resource orchestration).

The second direction relates to co-designing CPU schedulers with the underlying network stack. Specifically, CPU schedulers in operating systems have traditionally been designed independently of the network stack. This was beneficial for the independent evolution of the two layers. However, with increasingly many distributed applications and with performance bottlenecks shifting to hosts, we need to revisit such a separation. For instance, our study shows that network-aware CPU scheduling (e.g., scheduling applications that generate long flows on the NIC-local NUMA node, scheduling long-flow and short-flow applications on separate CPU cores, etc.) has the potential to lead to efficient host stacks.

5 CONCLUSION
We have demonstrated that the recent adoption of high-bandwidth links in datacenter networks, coupled with relatively stagnant technology trends for other host resources (e.g., core speeds and counts, cache sizes, etc.), marks a fundamental shift in host network stack bottlenecks. Using measurements and insights for Linux network stack performance for 100Gbps links, our study highlights several avenues for future research in designing CPU-efficient host network stacks. These are exciting times for networked systems research—with the emergence of Terabit Ethernet, the bottlenecks outlined in this study are going to become even more prominent, and it is only by bringing together the operating systems, computer networking and computer architecture communities that we will be able to design host network stacks that overcome these bottlenecks. We hope our work will enable a deeper understanding of today's host network stacks, and will guide the design of not just the future Linux kernel network stack, but also future network and host hardware.

ACKNOWLEDGMENTS
We thank our shepherd, Neil Spring, the SIGCOMM reviewers, Shrijeet Mukherjee, Christos Kozyrakis and Amin Vahdat for their insightful feedback. This work was supported by NSF grants CNS-1704742 and CNS-2047283, a Google faculty research scholar award and a Sloan fellowship. This work does not raise any ethical concerns.
REFERENCES
[1] Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye, Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari Sridharan. 2010. Data Center TCP (DCTCP). In ACM SIGCOMM.
[2] Amazon. 2021. Amazon EC2 F1 Instances. https://aws.amazon.com/ec2/instance-types/f1/. (2021).
[3] Mina Tahmasbi Arashloo, Alexey Lavrov, Manya Ghobadi, Jennifer Rexford, David Walker, and David Wentzlaff. 2020. Enabling Programmable Transport Protocols in High-Speed NICs. In USENIX NSDI.
[4] Adam Belay, George Prekas, Ana Klimovic, Samuel Grossman, Christos Kozyrakis, and Edouard Bugnion. 2014. IX: A Protected Dataplane Operating System for High Throughput and Low Latency. In USENIX OSDI.
[5] Theophilus Benson, Aditya Akella, and David A. Maltz. 2010. Network traffic characteristics of data centers in the wild. In IMC.
[6] Zhan Bokai, Yu Chengye, and Chen Zhonghe. 2005. TCP/IP Offload Engine (TOE) for an SOC System. https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/dc/_3_3-2005_taiwan_3rd_chengkungu-web.pdf. (2005).
[7] Qizhe Cai, Shubham Chaudhary, Midhul Vuppalapati, Jaehyun Hwang, and Rachit Agarwal. 2021. Understanding Host Network Stack Overheads. https://github.com/Terabit-Ethernet/terabit-network-stack-profiling. (2021).
[8] Neal Cardwell, Yuchung Cheng, C. Stephen Gunn, Soheil Hassas Yeganeh, and Van Jacobson. 2016. BBR: Congestion-Based Congestion Control. ACM Queue 14, September-October (2016), 20–53.
[9] Adrian M. Caulfield, Eric S. Chung, Andrew Putnam, Hari Angepat, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, Joo-Young Kim, et al. 2016. A cloud-scale acceleration architecture. In IEEE/ACM MICRO.
[10] Jonathan Corbet. 2009. JLS2009: Generic receive offload. https://lwn.net/Articles/358910/. (2009).
[11] Jonathan Corbet. 2017. Zero-copy networking. https://lwn.net/Articles/726917/. (2017).
[12] Jonathan Corbet. 2018. Zero-copy TCP receive. https://lwn.net/Articles/752188/. (2018).
[13] Patrick Dehkord. 2019. NVMe over TCP Storage with SPDK. https://ci.spdk.io/download/events/2019-summit/(Solareflare)+NVMe+over+TCP+Storage+with+SPDK.pdf. (2019).
[14] Jon Dugan, John Estabrook, Jim Ferbuson, Andrew Gallatin, Mark Gates, Kevin Gibbs, Stephen Hemminger, Nathan Jones, Gerrit Renker, Feng Qin, Ajay Tirumala, and Alex Warshavsky. 2021. iPerf - The ultimate speed test tool for TCP, UDP and SCTP. https://iperf.fr/. (2021).
[15] Alireza Farshin, Amir Roozbeh, Gerald Q. Maguire Jr., and Dejan Kostić. 2020. Reexamining Direct Cache Access to Optimize I/O Intensive Applications for Multi-hundred-gigabit Networks. In USENIX ATC.
[16] Daniel Firestone, Andrew Putnam, Sambhrama Mundkur, Derek Chiou, Alireza Dabagh, Mike Andrewartha, Hari Angepat, Vivek Bhanu, Adrian Caulfield, Eric Chung, et al. 2018. Azure accelerated networking: SmartNICs in the public cloud. In USENIX NSDI.
[17] The Linux Foundation. 2016. Linux Foundation DocuWiki: napi. https://wiki.linuxfoundation.org/networking/napi. (2016).
[18] Peter X. Gao, Akshay Narayan, Gautam Kumar, Rachit Agarwal, Sylvia Ratnasamy, and Scott Shenker. 2015. pHost: Distributed near-optimal datacenter transport over commodity network fabric. In ACM CoNEXT.
[19] Sebastien Godard. 2021. Performance monitoring tools for Linux. https://github.com/sysstat/sysstat. (2021).
[20] Brendan Gregg. 2020. Linux perf Examples. http://www.brendangregg.com/perf.html. (2020).
[21] Sangjin Han, Scott Marshall, Byung-Gon Chun, and Sylvia Ratnasamy. 2012. MegaPipe: A New Programming Interface for Scalable Network I/O. In USENIX OSDI.
[22] HewlettPackard. 2021. Netperf. https://github.com/HewlettPackard/netperf. (2021).
[23] Toke Høiland-Jørgensen, Jesper Dangaard Brouer, Daniel Borkmann, John Fastabend, Tom Herbert, David Ahern, and David Miller. 2018. The eXpress Data Path: Fast Programmable Packet Processing in the Operating System Kernel. In ACM CoNEXT.
[24] Jaehyun Hwang, Qizhe Cai, Ao Tang, and Rachit Agarwal. 2020. TCP ≈ RDMA: CPU-efficient Remote Storage Access with i10. In USENIX NSDI.
[25] Intel. 2012. Intel® Data Direct I/O Technology. https://www.intel.com/content/dam/www/public/us/en/documents/technology-briefs/data-direct-i-o-technology-brief.pdf. (2012).
[26] Intel. 2020. SPDK NVMe-oF TCP Performance Report. https://ci.spdk.io/download/performance-reports/SPDK_tcp_perf_report_2010.pdf. (2020).
[27] EunYoung Jeong, Shinae Woo, Muhammad Jamshed, Haewon Jeong, Sunghwan Ihm, Dongsu Han, and KyoungSoo Park. 2014. mTCP: a Highly Scalable User-level TCP Stack for Multicore Systems. In USENIX NSDI.
[28] Srikanth Kandula, Sudipta Sengupta, Albert Greenberg, Parveen Patel, and Ronnie Chaiken. 2009. The nature of data center traffic: measurements & analysis. In IMC.
[29] Magnus Karlsson and Björn Töpel. 2018. The Path to DPDK Speeds for AF_XDP. In Linux Plumbers Conference.
[30] Antoine Kaufmann, Tim Stamler, Simon Peter, Naveen Kr. Sharma, Arvind Krishnamurthy, and Thomas Anderson. 2019. TAS: TCP Acceleration as an OS Service. In ACM EuroSys.
[31] Yuliang Li, Rui Miao, Hongqiang Liu, Yan Zhuang, Fei Feng, Lingbo Tang, Zheng Cao, Ming Zhang, Frank Kelly, Mohammad Alizadeh, and Minlan Yu. 2019. HPCC: High Precision Congestion Control. In ACM SIGCOMM.
[32] Xiaofeng Lin, Yu Chen, Xiaodong Li, Junjie Mao, Jiaquan He, Wei Xu, and Yuanchun Shi. 2016. Scalable Kernel TCP Design and Implementation for Short-Lived Connections. In ACM ASPLOS.
[33] Ilias Marinos, Robert N. M. Watson, and Mark Handley. 2014. Network stack specialization for performance. ACM SIGCOMM Computer Communication Review 44, 4 (2014), 175–186.
[34] Radhika Mittal, Alexander Shpiner, Aurojit Panda, Eitan Zahavi, Arvind Krishnamurthy, Sylvia Ratnasamy, and Scott Shenker. 2018. Revisiting Network Support for RDMA. In ACM SIGCOMM.
[35] Behnam Montazeri, Yilong Li, Mohammad Alizadeh, and John Ousterhout. 2018. Homa: A Receiver-driven Low-latency Transport Protocol Using Network Priorities. In ACM SIGCOMM.
[36] George Prekas, Marios Kogias, and Edouard Bugnion. 2017. ZygOS: Achieving Low Tail Latency for Microsecond-scale Networked Tasks. In ACM SOSP.
[37] Quoc-Thai V. Le, Jonathan Stern, and Stephen M. Brenner. 2017. Fast memcpy with SPDK and Intel I/OAT DMA Engine. https://software.intel.com/content/www/us/en/develop/articles/fast-memcpy-using-spdk-and-ioat-dma-engine.html. (2017).
[38] Livio Soares and Michael Stumm. 2010. FlexSC: Flexible System Call Scheduling with Exception-Less System Calls. In USENIX OSDI.
[39] Amin Tootoonchian, Aurojit Panda, Chang Lan, Melvin Walls, Katerina Argyraki, Sylvia Ratnasamy, and Scott Shenker. 2018. ResQ: Enabling SLOs in Network Function Virtualization. In USENIX NSDI.
[40] Vijay Vasudevan, David G. Andersen, and Michael Kaminsky. 2011. The Case for VOS: The Vector Operating System. In USENIX HotOS.
[41] Kenichi Yasukata, Michio Honda, Douglas Santry, and Lars Eggert. 2016. StackMap: Low-Latency Networking with the OS Stack and Dedicated NICs. In USENIX ATC.
[42] Neal Cardwell and Yuchung Cheng. [n. d.]. Making Linux TCP Fast. https://netdevconf.info/1.2/papers/bbr-netdev-1.2.new.new.pdf. ([n. d.]).
[43] Yibo Zhu, Haggai Eran, Daniel Firestone, Chuanxiong Guo, Marina Lipshteyn, Yehonatan Liron, Jitendra Padhye, Shachar Raindel, Mohamad Haj Yahia, and Ming Zhang. 2015. Congestion control for large-scale RDMA deployments. ACM SIGCOMM Computer Communication Review 45, 4 (2015), 523–536.