Monitoring and Tuning the Linux Networking Stack: Receiving Data
TL;DR
This blog post explains how computers running the Linux kernel receive packets, as well as how to monitor and tune each component of the networking stack as packets flow from the network toward userland programs.
UPDATE We’ve released the counterpart to this post: Monitoring and Tuning
the Linux Networking Stack: Sending Data.
UPDATE Take a look at the Illustrated Guide to Monitoring and Tuning the
Linux Networking Stack: Receiving Data, which adds some diagrams for the
information presented below.
TL;DR
Special thanks
General advice on monitoring and tuning the Linux networking stack
Overview
Detailed Look
Network Device Driver
Initialization
PCI initialization
PCI probe
A peek into PCI initialization
More Linux PCI driver information
Network device initialization
struct net_device_ops
ethtool registration
IRQs
NAPI
NAPI initialization in the igb driver
Bringing a network device up
Preparing to receive data from the network
Enable NAPI
Register an interrupt handler
Enable Interrupts
The network device is now up
Monitoring network devices
Using ethtool -S
Using sysfs
Using /proc/net/dev
Tuning network devices
Check the number of RX queues being used
Adjusting the number of RX queues
Adjusting the size of the RX queues
Adjusting the processing weight of RX queues
Adjusting the rx hash fields for network flows
dev_gro_receive
napi_skb_finish
Flow limits
Monitoring: Monitor drops due to full input_pkt_queue or flow limit
Tuning
Tuning: Adjusting netdev_max_backlog to prevent drops
Tuning: Adjust the NAPI weight of the backlog poll loop
Tuning: Enabling flow limits and tuning flow limit hash table size
backlog queue NAPI poller
process_backlog
__netif_receive_skb_core delivers data to packet taps and protocol
layers
Packet tap delivery
Protocol layer delivery
Protocol layer registration
IP protocol layer
ip_rcv
netfilter and iptables
ip_rcv_finish
udp_queue_rcv_skb
sk_rcvqueues_full
Special thanks
The information presented here builds upon the work done for Private
Internet Access, which was originally published as a 5 part series starting
here.
General advice on monitoring and tuning the Linux networking stack
The networking stack is complex and there is no one-size-fits-all solution. If
the performance and health of your networking is critical to you or your
business, you will have no choice but to invest a considerable amount of
time, effort, and money into understanding how the various parts of the
system interact.
Ideally, you should consider measuring packet drops at each layer of the network stack. That way you can determine and narrow down which component needs to be tuned.
This is where, I think, many operators go off track: the assumption is made
that a set of sysctl settings or /proc values can simply be reused
wholesale. In some cases, perhaps, but it turns out that the entire system is
so nuanced and intertwined that if you desire to have meaningful
monitoring or tuning, you must strive to understand how the system
functions at a deep level. Otherwise, you can simply use the default
settings, which should be good enough until further optimization (and the
required investment to deduce those settings) is necessary.
Many of the example settings provided in this blog post are used solely for
illustrative purposes and are not a recommendation for or against a certain
configuration or default setting. Before adjusting any setting, you should
develop a frame of reference around what you need to be monitoring to
notice a meaningful change.
Overview
For reference, you may want to have a copy of the device data sheet handy.
This post will examine the Intel I350 Ethernet controller, controlled by the
igb device driver. You can find that data sheet (warning: LARGE PDF) here for your reference.
The high level path a packet takes from arrival to socket receive buffer is as
follows:
The protocol layers examined below are the IP and UDP protocol layers.
Much of the information presented will serve as a reference for other
protocol layers, as well.
Detailed Look
This blog post will be examining the Linux kernel version 3.13.0 with links
to code on GitHub and code snippets throughout this post.
Understanding exactly how packets are received in the Linux kernel is very
involved. We’ll need to closely examine and understand how a network
driver works, so that parts of the network stack later are more clear.
This blog post will look at the igb network driver. This driver is used for a
relatively common server NIC, the Intel Ethernet Controller I350. So, let’s
start by understanding how the igb network driver works.
/**
* igb_init_module - Driver Registration Routine
*
* igb_init_module is the first routine called when the driver is
* loaded. All it does is register with the PCI subsystem.
**/
static int __init igb_init_module(void)
{
int ret;
pr_info("%s - version %s\n", igb_driver_string, igb_driver_version);
pr_info("%s\n", igb_copyright);
/* ... */
ret = pci_register_driver(&igb_driver);
return ret;
}
module_init(igb_init_module);
The bulk of the work to initialize the device happens with the call to
pci_register_driver as we’ll see next.
PCI initialization
The Intel I350 network card is a PCI express device.
Drivers register a table of the PCI vendor and device IDs they are able to drive (exported with the MODULE_DEVICE_TABLE macro). The kernel uses this table to determine which device driver to load to control the device.
That’s how the OS can figure out which devices are connected to the system and which driver should be used to talk to the device.
This table and the PCI device IDs for the igb driver can be found in
drivers/net/ethernet/intel/igb/igb_main.c and
drivers/net/ethernet/intel/igb/e1000_hw.h , respectively:
static DEFINE_PCI_DEVICE_TABLE(igb_pci_tbl) = {
{ PCI_VDEVICE(INTEL, E1000_DEV_ID_I354_BACKPLANE_1GBPS) },
{ PCI_VDEVICE(INTEL, E1000_DEV_ID_I354_SGMII) },
{ PCI_VDEVICE(INTEL, E1000_DEV_ID_I354_BACKPLANE_2_5GBPS) },
{ PCI_VDEVICE(INTEL, E1000_DEV_ID_I211_COPPER), board_82575 },
{ PCI_VDEVICE(INTEL, E1000_DEV_ID_I210_COPPER), board_82575 },
{ PCI_VDEVICE(INTEL, E1000_DEV_ID_I210_FIBER), board_82575 },
{ PCI_VDEVICE(INTEL, E1000_DEV_ID_I210_SERDES), board_82575 },
{ PCI_VDEVICE(INTEL, E1000_DEV_ID_I210_SGMII), board_82575 },
{ PCI_VDEVICE(INTEL, E1000_DEV_ID_I210_COPPER_FLASHLESS), board_82575 },
{ PCI_VDEVICE(INTEL, E1000_DEV_ID_I210_SERDES_FLASHLESS), board_82575 },
/* ... */
};
MODULE_DEVICE_TABLE(pci, igb_pci_tbl);
PCI probe
Once a device has been identified by its PCI IDs, the kernel can then select
the proper driver to use to control the device. Each PCI driver registers a
probe function with the PCI system in the kernel. The kernel calls this
function for devices which have not yet been claimed by a device driver.
Once a device is claimed, other drivers will not be asked about the device.
Most drivers have a lot of code that runs to get the device ready for use. The
exact things done vary from driver to driver.
Let’s take a quick look at some of these operations in the igb driver in the
function igb_probe .
The following code from the igb_probe function does some basic PCI configuration. From drivers/net/ethernet/intel/igb/igb_main.c:
err = pci_enable_device_mem(pdev);
/* ... */
/* ... */
pci_enable_pcie_error_reporting(pdev);
pci_set_master(pdev);
pci_save_state(pdev);
Next, the DMA mask will be set. This device can read and write to 64bit
memory addresses, so dma_set_mask_and_coherent is called with
DMA_BIT_MASK(64) .
Phew.
Going into the full explanation of how PCI devices work is beyond the scope of this post, but this excellent talk, this wiki, and this text file from the Linux kernel are excellent resources.
struct net_device_ops
netdev->netdev_ops = &igb_netdev_ops;
And the functions that this net_device_ops structure holds pointers to are set in the same file. From drivers/net/ethernet/intel/igb/igb_main.c:
.ndo_change_mtu = igb_change_mtu,
.ndo_do_ioctl = igb_ioctl,
/* ... */
As you can see, there are several interesting fields in this struct like ndo_open , ndo_stop , ndo_start_xmit , and ndo_get_stats64 which hold the addresses of functions implemented by the igb driver.
ethtool registration
ethtool is a command line program you can use to get and set various
driver and hardware options. You can install it on Ubuntu by running apt-get install ethtool .
The ethtool program talks to device drivers by using the ioctl system
call. The device drivers register a series of functions that run for the
ethtool operations and the kernel provides the glue.
When an ioctl call is made from ethtool , the kernel finds the ethtool
structure registered by the appropriate driver and executes the functions
registered. The driver’s ethtool function implementation can do anything
from changing a simple software flag in the driver to adjusting how the actual NIC hardware works by writing register values to the device.
igb_set_ethtool_ops(netdev);
From drivers/net/ethernet/intel/igb/igb_ethtool.c :
Above that, you can find the igb_ethtool_ops structure with the ethtool functions the igb driver supports set to the appropriate fields.
From drivers/net/ethernet/intel/igb/igb_ethtool.c :
.get_regs = igb_get_regs,
/* ... */
The monitoring section below will show how to use ethtool to access
these detailed statistics.
IRQs
When a data frame is written to RAM via DMA, how does the NIC tell the
rest of the system that data is ready to be processed?
The New Api (NAPI) was created as a mechanism for reducing the number of
IRQs generated by network devices on packet arrival. While NAPI reduces
the number of IRQs, it cannot eliminate them completely.
We’ll see why that is, exactly, in later sections.
NAPI
NAPI differs from the legacy method of harvesting data in several important
ways. NAPI allows a device driver to register a poll function that the NAPI
subsystem will call to harvest data frames.
The device driver implements a poll function and registers it with NAPI by
calling netif_napi_add . When registering a NAPI poll function with
netif_napi_add , the driver will also specify the weight . Most of the drivers
hardcode a value of 64. This value and its meaning will be described in more detail below.
Let’s take a look at igb_alloc_q_vector to see how the poll callback and
its private data are registered.
From drivers/net/ethernet/intel/igb/igb_main.c:
/* initialize NAPI */
netif_napi_add(adapter->netdev, &q_vector->napi, igb_poll, 64);
/* ... */
The above code is allocating memory for a receive queue and registering
the function igb_poll with the NAPI subsystem. It provides a reference to
the struct napi_struct associated with this newly created RX queue
( &q_vector->napi above). This will be passed into igb_poll when called by
the NAPI subsystem when it comes time to harvest packets from this RX
queue.
This will be important later when we examine the flow of data from drivers up the network stack.
Bringing a network device up
In the case of the igb driver, the function attached to the ndo_open field of the net_device_ops structure is called igb_open .
Preparing to receive data from the network
Most NICs you’ll find today will use DMA to write data directly into RAM where the OS can retrieve the data for processing. The data structure most NICs use for this purpose resembles a queue built on a circular buffer (or a ring buffer).
In order to do this, the device driver must work with the OS to reserve a
region of memory that the NIC hardware can use. Once this region is
reserved, the hardware is informed of its location and incoming data will be
written to RAM where it will later be picked up and processed by the
networking subsystem.
This seems simple enough, but what if the packet rate was high enough
that a single CPU was not able to properly process all incoming packets?
The data structure is built on a fixed length region of memory, so incoming packets would be dropped.
Some devices have the ability to write incoming packets to several different
regions of RAM simultaneously; each region is a separate queue. This allows
the OS to use multiple CPUs to process incoming data in parallel, starting at
the hardware level. This feature is not supported by all NICs.
The Intel I350 NIC does support multiple queues. We can see evidence of
this in the igb driver. One of the first things the igb driver does when it is brought up is to call a function named igb_setup_all_rx_resources . This
function calls another function, igb_setup_rx_resources , once for each RX
queue to arrange for DMA-able memory where the device will write
incoming data.
If you are curious how exactly this works, please see the Linux kernel’s DMA
API HOWTO.
It turns out the number and size of the RX queues can be tuned by using
ethtool . Tuning these values can have a noticeable impact on the number
of frames which are processed vs the number of frames which are dropped.
The NIC uses a hash function on the packet header fields (like source,
destination, port, etc) to determine which RX queue the data should be
directed to.
Some NICs let you adjust the weight of the RX queues, so you can send more traffic to specific queues.
Fewer NICs let you adjust this hash function itself. If you can adjust the hash function, you can send certain flows to specific RX queues for processing or even drop the packets at the hardware level, if desired.
We’ll take a look at how to tune these settings shortly.
Enable NAPI
When a network device is brought up, a driver will usually enable NAPI.
We saw earlier how drivers register poll functions with NAPI, but NAPI is
not usually enabled until the device is brought up.
In the case of the igb driver, NAPI is enabled for each q_vector that was
initialized when the driver was loaded or when the queue count or size are
changed with ethtool .
(The relevant napi_enable calls are in drivers/net/ethernet/intel/igb/igb_main.c.)
After enabling NAPI, the next step is to register an interrupt handler. There
are different methods a device can use to signal an interrupt: MSI-X, MSI,
and legacy interrupts. As such, the code differs from device to device
depending on what the supported interrupt methods are for a particular
piece of hardware.
The driver must determine which method is supported by the device and
register the appropriate handler function that will execute when the
interrupt is received.
Some drivers, like the igb driver, will try to register an interrupt handler
with each method, falling back to the next untested method on failure.
MSI-X interrupts are the preferred method, especially for NICs that support
multiple RX queues. This is because each RX queue can have its own
hardware interrupt assigned, which can then be handled by a specific CPU
(with irqbalance or by modifying /proc/irq/IRQ_NUMBER/smp_affinity ). As
we’ll see shortly, the CPU that handles the interrupt will be the CPU that
processes the packet. In this way, arriving packets can be processed by
separate CPUs from the hardware interrupt level up through the networking
stack.
You can find the code in the driver which attempts each interrupt method in
drivers/net/ethernet/intel/igb/igb_main.c:
err = igb_request_msix(adapter);
if (!err)
goto request_done;
/* fall back to MSI */
/* ... */
}
/* ... */
/* ... */
}
if (err)
dev_err(&pdev->dev, "Error %d getting interrupt\n", err);
request_done:
return err;
}
As you can see in the abbreviated code above, the driver first attempts to
set an MSI-X interrupt handler with igb_request_msix , falling back to MSI
on failure. Next, request_irq is used to register igb_intr_msi , the MSI
interrupt handler. If this fails, the driver falls back to legacy interrupts.
And this is how the igb driver registers a function that will be executed
when the NIC raises an interrupt signaling that data has arrived and is ready
for processing.
Enable Interrupts
At this point, almost everything is setup. The only thing left is to enable
interrupts from the NIC and wait for data to arrive. Enabling interrupts is
hardware specific, but the igb driver does this in __igb_open by calling a
helper function named igb_irq_enable .
Drivers may do a few more things like start timers, work queues, or other hardware-specific setup. Once that is completed, the network device is up and ready for use.
Let’s take a look at monitoring and tuning settings for network device
drivers.
Monitoring network devices
There are several different ways to monitor your network devices offering different levels of granularity and complexity. Let’s start with most granular and move to least granular.
Using ethtool -S
Monitor detailed NIC device statistics (e.g., packet drops) with `ethtool -S`.
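A minimal sketch of what that looks like, assuming a device named eth0 (the counter names and values shown here are illustrative and vary by driver):

$ sudo ethtool -S eth0
NIC statistics:
     rx_packets: 597238832
     rx_bytes: 111047469628
     rx_missed_errors: 0
     rx_fifo_errors: 0
     ...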
Monitoring this data can be difficult. It is easy to obtain, but there is no standardization of the field values. Different drivers, or even different versions of the same driver, might produce different field names that have the same meaning.
You should look for values with “drop”, “buffer”, “miss”, etc in the label. Next,
you will have to read your driver source. You’ll be able to determine which
values are accounted for totally in software (e.g., incremented when there is
no memory) and which values come directly from hardware via a register
read. In the case of a register value, you should consult the data sheet for
your hardware to determine what the meaning of the counter really is;
many of the labels given via ethtool can be misleading.
Using sysfs
sysfs also provides a lot of statistics values, but they are slightly higher
level than the direct NIC level stats provided.
You can find the number of dropped incoming network data frames for, e.g., eth0 by using cat on a file.
$ cat /sys/class/net/eth0/statistics/rx_dropped
2
The counter values will be split into files like collisions , rx_dropped , rx_errors , rx_missed_errors , etc.
If these values are critical to you, you will need to read your driver source to
understand exactly what your driver thinks each of these values means.
Using /proc/net/dev
$ cat /proc/net/dev
Inter-| Receive |
face |bytes packets errs drop fifo frame compressed multicast|by
eth0: 110346752214 597737500 0 2 0 0 0 2096
lo: 428349463836 1579868535 0 0 0 0 0
This file shows a subset of the values you’ll find in the sysfs files mentioned above, but it may serve as a useful general reference.
The caveat mentioned above applies here, as well: if these values are important to you, you will still need to read your driver source to understand exactly when, where, and why they are incremented, to ensure that your understanding of an error, drop, or fifo count matches your driver’s.
Tuning network devices
Check the number of RX queues being used
If your NIC and the device driver loaded on your system support RSS / multiqueue, you can usually adjust the number of RX queues (also called RX channels) by using ethtool .
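A sketch of the check, assuming an interface named eth0 (the output below is illustrative; the actual counts depend on your NIC and driver):

$ sudo ethtool -l eth0
Channel parameters for eth0:
Pre-set maximums:
RX:             0
TX:             0
Other:          0
Combined:       8
Current hardware settings:
RX:             0
TX:             0
Other:          0
Combined:       4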
This output is displaying the pre-set maximums (enforced by the driver and
the hardware) and the current settings.
Note: not all device drivers will have support for this operation.
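If your driver lacks support, you might see an error like this instead (exact wording can vary with your ethtool version):

$ sudo ethtool -l eth0
Channel parameters for eth0:
Cannot get device channel parameters: Operation not supported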
This means that your driver has not implemented the ethtool get_channels
operation. This could be because the NIC doesn’t support adjusting the
number of queues, doesn’t support RSS / multiqueue, or your driver has not
been updated to handle this feature.
Once you’ve found the current and maximum queue count, you can adjust
the values by using sudo ethtool -L .
Note: some devices and their drivers only support combined queues that
are paired for transmit and receive, as in the example in the above
section.
If your device and driver support individual settings for RX and TX and you’d
like to change only the RX queue count to 8, you would run:
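A sketch of that command, assuming the device is named eth0:

# set only the RX queue count; use "combined 8" instead if your NIC only
# supports combined RX/TX queue pairs
$ sudo ethtool -L eth0 rx 8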
Note: making these changes will, for most drivers, take the interface
down and then bring it back up; connections to this interface will be
interrupted. This may not matter much for a one-time change, though.
Some NICs and their drivers also support adjusting the size of the RX queue.
Exactly how this works is hardware speci c, but luckily ethtool provides a
generic way for users to adjust the size. Increasing the size of the RX queue
can help prevent network data drops at the NIC during periods where large
numbers of data frames are received. Data may still be dropped in software, though, and other tuning will be required to eliminate or further reduce drops, as we’ll see later.
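As an illustration, checking the ring sizes might look like this (assuming eth0; the numbers are examples, not recommendations):

$ sudo ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX:             4096
RX Mini:        0
RX Jumbo:       0
TX:             4096
Current hardware settings:
RX:             512
RX Mini:        0
RX Jumbo:       0
TX:             512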
The above output indicates that the hardware supports up to 4096 receive and transmit descriptors, but it is currently only using 512.
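To increase the RX queue size to the maximum shown above (a sketch, assuming eth0 and driver support):

$ sudo ethtool -G eth0 rx 4096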
Note: making these changes will, for most drivers, take the interface
down and then bring it back up; connections to this interface will be
interrupted. This may not matter much for a one-time change, though.
Some NICs support the ability to adjust the distribution of network data among the RX queues by setting a weight.
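You can check the current distribution (the indirection table) with ethtool -x . A sketch for a hypothetical eth0 with 2 RX queues:

$ sudo ethtool -x eth0
RX flow hash indirection table for eth0 with 2 RX ring(s):
    0:      0     1     0     1     0     1     0     1
    8:      0     1     0     1     0     1     0     1
   16:      0     1     0     1     0     1     0     1
   ...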
This output shows packet hash values on the left, with receive queue 0 and
1 listed. So, a packet which hashes to 2 will be delivered to receive queue 0,
while a packet which hashes to 3 will be delivered to receive queue 1.
If you want to set custom weights to alter the number of packets which hit
certain receive queues (and thus CPUs), you can specify those on the
command line, as well:
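For example, to weight the first two RX queues unevenly (a sketch, assuming eth0):

# give RX queue 0 a weight of 6 and RX queue 1 a weight of 2
$ sudo ethtool -X eth0 weight 6 2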
Some NICs will also let you adjust the fields which will be used in the hash algorithm, as we’ll see now.
You can use ethtool to adjust the fields that will be used when computing a hash for use with RSS.
Check which fields are used for UDP RX flow hash with ethtool -n .
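A sketch of that check for UDP over IPv4, assuming eth0 (output format varies by driver):

$ sudo ethtool -n eth0 rx-flow-hash udp4
UDP over IPV4 flows use these fields for computing Hash flow key:
IP SA
IP DA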
For eth0, the fields that are used for computing a hash on UDP flows are the IPv4 source and destination addresses. Let’s include the source and destination ports:
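A sketch of that change, assuming eth0:

# s = IP source address, d = IP destination address,
# f = layer 4 source port bytes, n = layer 4 destination port bytes
$ sudo ethtool -N eth0 rx-flow-hash udp4 sdfn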
The sdfn string is a bit cryptic; check the ethtool man page for an
explanation of each letter.
Adjusting the fields to take a hash on is useful, but ntuple filtering is even more useful for finer grained control over which flows will be handled by which RX queue.
Some NICs support a feature known as “ntuple filtering.” This feature allows the user to specify (via ethtool ) a set of parameters to use to filter incoming network data in hardware and queue it to a particular RX queue.
For example, the user can specify that TCP packets destined to a particular
port should be sent to RX queue 1.
As mentioned, ntuple filtering can be configured with ethtool , but first, you’ll need to ensure that this feature is enabled on your device.
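A sketch of checking for and enabling the feature, assuming eth0:

# check whether ntuple filtering is currently enabled
$ sudo ethtool -k eth0 | grep ntuple
ntuple-filters: off

# enable it (requires driver and hardware support)
$ sudo ethtool -K eth0 ntuple on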
Once you’ve enabled ntuple filters, or verified that it is enabled, you can check the existing ntuple rules by using ethtool :
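A sketch of listing the rules, assuming eth0 (the ring count shown is illustrative):

$ sudo ethtool -u eth0
40 RX rings available
Total 0 rules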
As you can see, this device has no ntuple filter rules. You can add a rule by specifying it on the command line to ethtool . Let’s add a rule to direct all TCP traffic with a destination port of 80 to RX queue 2:
Add ntuple filter to send TCP flows with destination port 80 to RX queue 2
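A sketch of that rule, assuming eth0:

$ sudo ethtool -U eth0 flow-type tcp4 dst-port 80 action 2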
You can also use ntuple filtering to drop packets for particular flows at the hardware level. This can be useful for mitigating heavy incoming traffic from specific IP addresses. For more information about configuring ntuple filter rules, see the ethtool man page.
You can usually get statistics about the success (or failure) of your ntuple
rules by checking values output from ethtool -S [device name] . For
example, on Intel NICs, the statistics fdir_match and fdir_miss calculate
the number of matches and misses for your ntuple filtering rules. Consult
your device driver source and device data sheet for tracking down statistics
counters (if available).
SoftIRQs
Before examining the network stack, we’ll need to take a short detour to
examine something in the Linux kernel called SoftIRQs.
What is a softirq?
The softirq system in the Linux kernel is a mechanism for executing code
outside of the context of an interrupt handler implemented in a driver. This
system is important because hardware interrupts may be disabled during all
or part of the execution of an interrupt handler. The longer interrupts are
disabled, the greater chance that events may be missed. So, it is important
to defer any long running actions outside of the interrupt handler so that it
can complete as quickly as possible and re-enable interrupts from the
device.
There are other mechanisms that can be used for deferring work in the
kernel, but for the purposes of the networking stack, we’ll be looking at
softirqs.
The softirq system can be imagined as a series of kernel threads (one per
CPU) that run handler functions which have been registered for different
softirq events. If you’ve ever looked at top and seen ksoftirqd/0 in the list
of kernel threads, you were looking at the softirq kernel thread running on
CPU 0.
ksoftirqd
Since softirqs are so important for deferring the work of device drivers, you
might imagine that the ksoftirqd process is spawned pretty early in the
life cycle of the kernel and you’d be correct.
register_cpu_notifier(&cpu_nfb);
BUG_ON(smpboot_register_percpu_thread(&softirq_threads));
return 0;
}
early_initcall(spawn_ksoftirqd);
As you can see from the struct smp_hotplug_thread definition above, there
are two function pointers being registered: ksoftirqd_should_run and
run_ksoftirqd .
__do_softirq
So, when you look at graphs of CPU usage and see softirq or si , you now know that this is measuring the amount of CPU usage happening in a deferred work (softirq) context.
Monitoring
/proc/softirqs
The softirq system increments statistic counters which can be read from /proc/softirqs . Monitoring these statistics can give you a sense for the rate at which softirqs for various events are being generated.
$ cat /proc/softirqs
CPU0 CPU1 CPU2 CPU3
HI: 0 0 0 0
TIMER: 2831512516 1337085411 1103326083 1423923272
NET_TX: 15774435 779806 733217 749512
NET_RX: 1671622615 1257853535 2088429526 2674732223
BLOCK: 1800253852 1466177 1791366 634534
BLOCK_IOPOLL: 0 0 0 0
TASKLET: 25 0 0 0
SCHED: 2642378225 1711756029 629040543 682215771
HRTIMER: 2547911 2046898 1558136 1521176
RCU: 2056528783 4231862865 3545088730 844379888
This file can give you an idea of how your network receive ( NET_RX ) processing is currently distributed across your CPUs. If it is distributed unevenly, you will see a larger count value for some CPUs than others. This is one indicator that you might be able to benefit from Receive Packet Steering / Receive Flow Steering described below. Be careful using just this file when monitoring your performance: during periods of high network activity you would expect to see the rate of NET_RX increments increase, but
this isn’t necessarily the case. It turns out that this is a bit nuanced, because
there are additional tuning knobs in the network stack that can affect the
rate at which NET_RX softirqs will fire, which we’ll see soon.
You should be aware of this, however, so that if you adjust the other tuning
knobs you will know to examine /proc/softirqs and expect to see a
change.
Now, let’s move on to the networking stack and trace how network data is
received from top to bottom.
Now that we’ve taken a look in to how network drivers and softirqs work,
let’s see how the Linux network device subsystem is initialized. Then, we
can follow the path of a packet starting with its arrival.
open_softirq(NET_TX_SOFTIRQ, net_tx_action);
open_softirq(NET_RX_SOFTIRQ, net_rx_action);
/* ... */
}
We’ll see soon how the driver’s interrupt handler will “raise” (or trigger) the
net_rx_action function registered to the NET_RX_SOFTIRQ softirq.
Data arrives
At long last: network data arrives!
Assuming that the RX queue has enough available descriptors, the packet is
written to RAM via DMA. The device then raises the interrupt that is
assigned to it (or in the case of MSI-X, the interrupt tied to the rx queue the
packet arrived on).
Interrupt handler
Let’s take a look at the source for the MSI-X interrupt handler; it will really
help illustrate the idea that the interrupt handler does as little work as
possible.
From drivers/net/ethernet/intel/igb/igb_main.c:
napi_schedule(&q_vector->napi);
return IRQ_HANDLED;
}
This interrupt handler is very short and performs 2 very quick operations before returning.
The actual code showing exactly how this works is important; it will guide
our understanding of how network data is processed on multi-CPU systems.
Let’s gure out how the napi_schedule call from the hardware interrupt
handler works.
Remember, NAPI exists specifically to harvest network data without needing interrupts from the NIC to signal that data is ready for processing. As mentioned earlier, the NAPI poll loop is bootstrapped by receiving a hardware interrupt. In other words: NAPI is enabled, but off, until the first packet arrives at which point the NIC raises an IRQ and NAPI is started.
There are a few other cases, as we’ll see soon, where NAPI can be disabled
and will need a hardware interrupt to be raised before it will be started
again.
The NAPI poll loop is started when the interrupt handler in the driver calls napi_schedule . napi_schedule is actually just a wrapper function defined in a header file which calls down to __napi_schedule .
From net/core/dev.c:
/**
* __napi_schedule - schedule for receive
* @n: entry to schedule
*
* The entry's receive function will be scheduled to run
*/
void __napi_schedule(struct napi_struct *n)
{
unsigned long flags;
local_irq_save(flags);
____napi_schedule(&__get_cpu_var(softnet_data), n);
local_irq_restore(flags);
}
EXPORT_SYMBOL(__napi_schedule);
__raise_softirq_irqoff(NET_RX_SOFTIRQ);
}
As we’ll see shortly, the softirq handler function net_rx_action will call the
NAPI poll function to harvest packets.
Note that all the code we’ve seen so far to defer work from a hardware
interrupt handler to a softirq has been using structures associated with the
current CPU.
While the driver’s IRQ handler itself does very little work, the softirq handler will execute on the same CPU as the driver’s IRQ handler.
This is why setting which CPU a particular IRQ will be handled by is important: that CPU will be used not only to execute the interrupt handler in the driver, but the same CPU will also be used when harvesting packets in a softirq via NAPI.
As we’ll see later, things like Receive Packet Steering can distribute some of this work to other CPUs further up the network stack.
$ cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3
0: 46 0 0 0 IR-IO-APIC-edge
1: 3 0 0 0 IR-IO-APIC-edge
30: 3361234770 0 0 0 IR-IO-APIC-fasteoi
64: 0 0 0 0 DMAR_MSI-edge
65: 1 0 0 0 IR-PCI-MSI-edge
66: 863649703 0 0 0 IR-PCI-MSI-edge
67: 986285573 0 0 0 IR-PCI-MSI-edge
68: 45 0 0 0 IR-PCI-MSI-edge
69: 394 0 0 0 IR-PCI-MSI-edge
NMI: 9729927 4008190 3068645 3375402 Non-maskable inte
LOC: 2913290785 1585321306 1495872829 1803524526 Local timer inter
You can monitor the statistics in /proc/interrupts to see how the number
and rate of hardware interrupts change as packets arrive and to ensure that
each RX queue for your NIC is being handled by an appropriate CPU. As
we’ll see shortly, this number only tells us how many hardware interrupts
have happened, but it is not necessarily a good metric for understanding
how much data has been received or processed, as many drivers will disable NIC IRQs as part of their contract with the NAPI subsystem. Further, using
interrupt coalescing will also affect the statistics gathered from this file. Monitoring this file can help you determine if the interrupt coalescing settings you select are actually working.
Interrupt coalescing
Interrupt coalescing is a method of preventing interrupts from being raised by a device until a specific amount of work or number of events is pending.
This can help prevent interrupt storms and can help increase throughput or
latency, depending on the settings used. Fewer interrupts generated result
in higher throughput, increased latency, and lower CPU usage. More
interrupts generated result in the opposite: lower latency, lower throughput,
but also increased CPU usage.
Historically, earlier versions of the igb , e1000 , and other drivers included
support for a parameter called InterruptThrottleRate . This parameter has
been replaced in more recent drivers with a generic ethtool function.
Get the current IRQ coalescing settings with ethtool -c .
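A sketch, assuming eth0 (the fields shown, and which of them your NIC honors, vary by driver):

$ sudo ethtool -c eth0
Coalesce parameters for eth0:
Adaptive RX: off  TX: off
stats-block-usecs: 0
sample-interval: 0
rx-usecs: 3
rx-frames: 0
tx-usecs: 0
tx-frames: 0
...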
One interesting option that some drivers support is “adaptive RX/TX IRQ
coalescing.” This option is typically implemented in hardware. The driver
usually needs to do some work to inform the NIC that this feature is
enabled and some bookkeeping as well (as seen in the igb driver code
above).
You can also use ethtool -C to set several options. Some of the more common options to set are:
rx-usecs: How many microseconds to wait after a frame arrives before raising an RX interrupt.
rx-frames: The maximum number of frames to receive before raising an RX interrupt.
rx-usecs-irq: How many microseconds to delay an RX interrupt while the host is servicing an interrupt.
rx-frames-irq: The maximum number of frames to receive before an RX interrupt while the system is servicing an interrupt.
Analogous tx- options exist for the transmit side.
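For example (a sketch, assuming eth0 and that your driver supports these particular options):

# enable adaptive RX IRQ coalescing, if supported
$ sudo ethtool -C eth0 adaptive-rx on

# or set a fixed RX interrupt delay of 8 microseconds
$ sudo ethtool -C eth0 rx-usecs 8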
Reminder that your hardware and driver may only support a subset of the
options listed above. You should consult your driver source code and your
hardware data sheet for more information on supported coalescing
options.
Unfortunately, the options you can set aren’t well documented anywhere except in a header file. Check the source of include/uapi/linux/ethtool.h to find an explanation of each option supported by ethtool (but not necessarily your driver and NIC).
Adjusting IRQ affinities
Setting specific CPUs allows you to segment which CPUs will be used for processing which IRQs. These changes may affect how upper layers operate, as we’ve seen for the networking stack.
If you do decide to adjust your IRQ affinities, you should first check if you are running the irqbalance daemon. This daemon tries to automatically balance IRQs to CPUs and it may overwrite your settings. If you are running irqbalance , you should either disable irqbalance or use --banirq in conjunction with IRQBALANCE_BANNED_CPUS to let irqbalance know that it shouldn’t touch a set of IRQs and CPUs that you want to assign yourself.
Next, you should check the file /proc/interrupts for a list of the IRQ numbers for each network RX queue for your NIC.
Finally, you can set the CPU that will handle each of those IRQs by modifying /proc/irq/IRQ_NUMBER/smp_affinity for each IRQ number.
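A sketch of pinning one IRQ, assuming a hypothetical IRQ number of 44 and a desire to handle it on CPU 1:

# smp_affinity takes a hexadecimal CPU bitmask; 0x2 == binary 10 == CPU 1
$ echo 2 | sudo tee /proc/irq/44/smp_affinity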
Once the softirq code determines that a softirq is pending, begins processing, and executes net_rx_action , network data processing begins.
The function iterates through the list of NAPI structures that are queued for
the current CPU, dequeuing each structure, and operating on it.
The processing loop bounds the amount of work and execution time that
can be consumed by the registered NAPI poll functions. It does this in two
ways:
From net/core/dev.c:
while (!list_empty(&sd->poll_list)) {
struct napi_struct *n;
int work, weight;
The budget is the total available amount of work that can be spent among each of the available NAPI structures registered to this CPU.
This is another reason why multiqueue NICs should have the IRQ affinity
carefully tuned. Recall that the CPU which handles the IRQ from the device
will be the CPU where the softirq handler will execute and, as a result, will
also be the CPU where the above loop and budget computation runs.
Systems with multiple NICs each with multiple queues can end up in a
situation where multiple NAPI structs are registered to the same CPU. Data
processing for all NAPI structs on the same CPU spend from the same
budget .
If you don’t have enough CPUs to distribute your NIC’s IRQs, you can
consider increasing the net_rx_action budget to allow for more packet
processing for each CPU. Increasing the budget will increase CPU usage
(specifically si time or si in top or other programs), but should reduce
latency as data will be processed more promptly.
Note: the CPU will still be bounded by a time limit of 2 jiffies, regardless of the assigned budget.
Recall that network device drivers use netif_napi_add for registering a poll function. As we saw earlier in this post, the igb driver has a piece of code
like this:
/* initialize NAPI */
netif_napi_add(adapter->netdev, &q_vector->napi, igb_poll, 64);
From net/core/dev.c:
weight = n->weight;
work = 0;
if (test_bit(NAPI_STATE_SCHED, &n->state)) {
work = n->poll(n, weight);
trace_napi_poll(n);
}
budget -= work;
This code obtains the weight which was registered to the NAPI struct ( 64 in
the above driver code) and passes it into the poll function which was also
registered to the NAPI struct ( igb_poll in the above code).
The poll function returns the number of data frames that were processed.
This amount is saved above as work , which is then subtracted from the
overall budget .
So, assuming:
1. You are using a weight of 64 from your driver (all drivers were
hardcoded with this value in Linux 3.13.0), and
2. You have your budget set to the default of 300
Then your system would stop processing data when either the budget is consumed (at most 5 calls into igb_poll , since 5 * 64 > 300) or 2 jiffies of time have elapsed.
One important piece of information about the contract between the NAPI subsystem and device drivers which has not been mentioned yet concerns the requirements around shutting down NAPI.
We’ll see how net_rx_action deals with the first part of that contract now. Then, when the poll function is examined, we’ll see how the second part of that contract is handled.
napi_complete(n);
local_irq_disable();
} else {
if (n->gro_list) {
/* flush too old packets
* If HZ < 1000, flush all packets.
*/
local_irq_enable();
napi_gro_flush(n, HZ >= 1000);
local_irq_disable();
}
list_move_tail(&n->poll_list, &sd->poll_list);
}
}
If the entire work is consumed, there are two cases that net_rx_action
handles:
1. The network device should be shutdown (e.g. because the user ran
ifconfig eth0 down ),
2. If the device is not being shutdown, check if there’s a generic receive offload (GRO) list. If the timer tick rate is >= 1000, all GRO’d network flows that were recently updated will be flushed. We’ll dig into GRO in detail later. Move the NAPI structure to the end of the list for this CPU so the next iteration of the loop will get the next NAPI structure registered.
And that is how the packet processing loop invokes the driver’s registered
poll function to process packets. As we’ll see shortly, the poll function
will harvest network data and send it up the stack to be processed.
The net_rx_action processing loop terminates when:
The poll list registered for this CPU has no more NAPI structures ( !list_empty(&sd->poll_list) ), or
The remaining budget is <= 0, or
The time limit of 2 jiffies has been reached
softnet_break:
sd->time_squeeze++;
__raise_softirq_irqoff(NET_RX_SOFTIRQ);
goto out;
Execution is then transferred to the out label. Execution can also make it to
the out label if there were no more NAPI structures to process, in other
words, there is more budget than there is network activity and all the
drivers have shut NAPI off and there is nothing left for net_rx_action to do.
The out section does one important thing before returning from
net_rx_action : it calls net_rps_action_and_irq_enable . This function
serves an important purpose if Receive Packet Steering is enabled; it wakes
up remote CPUs to start processing network data.
We’ll see more about how RPS works later. For now, let’s see how to monitor
the health of the net_rx_action processing loop and move on to the inner
working of NAPI poll functions so we can progress up the network stack.
NAPI poll
Let’s take a look at how the igb driver does this to get an idea of how this
works in practice.
igb_poll
At long last, we can finally examine our friend igb_poll . It turns out the
code for igb_poll is deceptively simple. Let’s take a look. From
drivers/net/ethernet/intel/igb/igb_main.c:
/**
* igb_poll - NAPI Rx polling callback
* @napi: napi polling structure
* @budget: count of how many packets we should handle
**/
static int igb_poll(struct napi_struct *napi, int budget)
{
struct igb_q_vector *q_vector = container_of(napi,
struct igb_q_vector,
napi);
bool clean_complete = true;
#ifdef CONFIG_IGB_DCA
if (q_vector->adapter->flags & IGB_FLAG_DCA_ENABLED)
igb_update_dca(q_vector);
#endif
/* ... */
if (q_vector->rx.ring)
clean_complete &= igb_clean_rx_irq(q_vector, budget);
igb_clean_rx_irq
Once the loop terminates, the function assigns statistics counters for rx
packets and bytes processed.
Now it’s time to take two detours prior to proceeding up the network stack.
First, let’s see how to monitor and tune the network subsystem’s softirqs.
Next, let’s talk about Generic Receive Offloading (GRO). After that, the rest
of the networking stack will make more sense as we enter
napi_gro_receive .
Monitoring network data processing
/proc/net/softnet_stat
seq_printf(seq,
"%08x %08x %08x %08x %08x %08x %08x %08x %08x %08x %08x\n",
sd->processed, sd->dropped, sd->time_squeeze, 0,
0, 0, 0, 0, /* was fastroute */
sd->cpu_collision, sd->received_rps, flow_limit_count);
$ cat /proc/net/softnet_stat
6dcad223 00000000 00000001 00000000 00000000 00000000 00000000 00000000 ...
6f0e1565 00000000 00000002 00000000 00000000 00000000 00000000 00000000 ...
660774ec 00000000 00000003 00000000 00000000 00000000 00000000 00000000 ...
61c99331 00000000 00000000 00000000 00000000 00000000 00000000 00000000 ...
If you decide to monitor this file and graph the results, you must be extremely careful that the ordering of these fields hasn't changed and that the meaning of each field has been preserved. You will need to read the
kernel source to verify this.
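For a quick look, a small pipeline can label the most useful per-CPU counters. This is just a sketch: it assumes GNU awk (for strtonum) and the eleven-column format produced by the seq_printf shown above, where each line corresponds to a CPU.

$ awk '{ printf "cpu=%d processed=%d dropped=%d time_squeeze=%d\n", NR-1, strtonum("0x"$1), strtonum("0x"$2), strtonum("0x"$3) }' /proc/net/softnet_stat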
You can adjust the net_rx_action budget, which determines how much packet processing time can be spent among all NAPI structures registered to a CPU, by setting a sysctl value named net.core.netdev_budget .
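For example, you could check the current budget and raise it; the default on most kernels is 300, and 600 here is purely illustrative:

$ sysctl net.core.netdev_budget
net.core.netdev_budget = 300
$ sudo sysctl -w net.core.netdev_budget=600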
The main idea behind both methods (LRO and GRO) is that reducing the number of packets passed up the network stack by combining "similar enough" packets together can reduce CPU usage. For example, imagine a case where a large file transfer is occurring and most of the packets contain chunks of data in the file. Instead of sending small packets up the stack one at a time, the
incoming packets can be combined into one packet with a huge payload.
That packet can then be passed up the stack. This allows the protocol layers
to process a single packet’s headers while delivering bigger chunks of data
to the user program.
The problem with this sort of optimization is, of course, information loss. If
a packet had some important option or flag set, that option or flag could be
lost if the packet is coalesced into another. And this is exactly why most
people don’t use or encourage the use of LRO. LRO implementations,
generally speaking, had very lax rules for coalescing packets.
By the way: if you have ever used tcpdump and seen unrealistically large
incoming packet sizes, it is most likely because your system has GRO
enabled. As you’ll see soon, packet capture taps are inserted further up the
stack, after GRO has already happened.
You can use ethtool to check if GRO is enabled and also to adjust the
setting.
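For example, assuming your interface is named eth0:

$ ethtool -k eth0 | grep generic-receive-offload
generic-receive-offload: on
$ sudo ethtool -K eth0 gro off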
Note: making these changes will, for most drivers, take the interface
down and then bring it back up; connections to this interface will be
interrupted. This may not matter much for a one-time change, though.
napi_gro_receive
The function napi_gro_receive deals with processing network data for GRO (if
GRO is enabled for the system) and sending the data up the stack toward
the protocol layers. Much of this logic is handled in a function called
dev_gro_receive .
dev_gro_receive
Each protocol's gro_receive callback handles anything protocol specific that should happen for GRO. For example, the TCP protocol will
need to decide if/when to ACK a packet that is being coalesced into an
existing packet.
skb_set_network_header(skb, skb_gro_offset(skb));
skb_reset_mac_len(skb);
NAPI_GRO_CB(skb)->same_flow = 0;
NAPI_GRO_CB(skb)->flush = 0;
NAPI_GRO_CB(skb)->free = 0;
pp = ptype->callbacks.gro_receive(&napi->gro_list, skb);
break;
}
If the protocol layers indicated that it is time to flush the GRO'd packet, that
is taken care of next. This happens with a call to napi_gro_complete , which
calls a gro_complete callback for the protocol layers and then passes the
packet up the stack by calling netif_receive_skb .
if (pp) {
struct sk_buff *nskb = *pp;
*pp = nskb->next;
nskb->next = NULL;
napi_gro_complete(nskb);
napi->gro_count--;
}
If the packet was not merged and there are fewer than MAX_GRO_SKBS (8)
GRO flows on the system, a new entry is added to the gro_list on the NAPI
structure for this CPU.
napi->gro_count++;
NAPI_GRO_CB(skb)->count = 1;
NAPI_GRO_CB(skb)->age = jiffies;
skb_shinfo(skb)->gso_size = skb_gro_len(skb);
skb->next = napi->gro_list;
napi->gro_list = skb;
ret = GRO_HELD;
And that is how the GRO system in the Linux networking stack works.
napi_skb_finish
Once dev_gro_receive completes, napi_skb_finish is called, which either frees data structures that are no longer needed (because the packet was merged into an existing one) or calls netif_receive_skb to pass the data up the network stack.
Next, it's time for netif_receive_skb to see how data is handed off to the protocol layers. Before this can be examined, we'll need to take a look at Receive Packet Steering (RPS) first.
Recall earlier how we discussed that network device drivers register a NAPI
poll function. Each NAPI poller instance is executed in the context of a
softirq of which there is one per CPU. Further recall that the CPU which the
driver’s IRQ handler runs on will wake its softirq processing loop to process
packets.
In other words: a single CPU processes the hardware interrupt and polls for
packets to process incoming data.
Some NICs (like the Intel I350) support multiple queues at the hardware
level. This means incoming packets can be DMA’d to a separate memory
region for each queue, with a separate NAPI structure to manage polling
this region, as well. Thus multiple CPUs will handle interrupts from the
device and also process packets.
RPS distributes packet processing in software and runs only after the driver has already harvested the data from the NIC. This means that you wouldn't notice a decrease in CPU time spent handling IRQs or the NAPI poll loop, but you can distribute the load for processing the packet after it's been harvested and reduce CPU time from there up the network stack.
RPS works by generating a hash for incoming data to determine which CPU should process the data. The data is then enqueued to the per-CPU receive backlog to be processed.
For RPS to work, it must be enabled in the kernel configuration (it is on Ubuntu for kernel 3.13.0), and a bitmask must be configured describing which CPUs should process packets for a given interface and RX queue.
/sys/class/net/DEVICE_NAME/queues/QUEUE/rps_cpus
So, for eth0 and receive queue 0, you would modify the file:
/sys/class/net/eth0/queues/rx-0/rps_cpus with a hexadecimal number
indicating which CPUs should process packets from eth0 ’s receive queue 0.
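For example, a bitmask of f allows CPUs 0-3 to process packets for eth0's receive queue 0 (eth0 and the CPU choice are illustrative):

$ echo f | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus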
As the documentation points out, RPS may be unnecessary in certain configurations. You can compare before-and-after snapshots of your softirq and CPU usage graphs to confirm that RPS is configured properly to your liking.
Receive Flow Steering (RFS) is used in conjunction with RPS; it steers packets belonging to a flow toward the CPU where the application consuming that flow is running, which improves data locality. For RFS to work, you must have RPS enabled and configured.
RFS keeps track of a global hash table of all flows, and the size of this hash table can be adjusted by setting the net.core.rps_sock_flow_entries sysctl.
Next, you can also set the number of flows per RX queue by writing a value to the sysfs file named rps_flow_cnt for each RX queue.
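A sketch of both settings, with illustrative sizes (32768 total flow entries, 2048 flows for eth0's receive queue 0):

$ sudo sysctl -w net.core.rps_sock_flow_entries=32768
$ echo 2048 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_flow_cnt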
RFS can be sped up with the use of hardware acceleration; the NIC and the kernel can work together to determine which flows should be processed on which CPUs. To use this feature, it must be supported by the NIC and your driver.
Assuming that your NIC and driver support it, you can enable accelerated RFS by enabling and configuring a few additional settings on top of RPS and RFS.
Once the above is configured, accelerated RFS will be used to automatically move data to the RX queue tied to a CPU core that is processing data for that flow, and you won't need to specify an ntuple filter rule manually for each flow.
You can tune when packets will be timestamped after they are received by
adjusting a sysctl named net.core.netdev_tstamp_prequeue :
The default value is 1. Please see the previous section for an explanation as
to what this setting means, exactly.
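For example, setting it to 0 defers timestamping until after the packet has been steered to its target CPU, which spreads the timestamping work when RPS is in use:

$ sudo sysctl -w net.core.netdev_tstamp_prequeue=0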
netif_receive_skb
We'll see precisely how __netif_receive_skb_core works, but first let's see how the RPS enabled code path works, as that code will also call __netif_receive_skb_core .
if (cpu >= 0) {
ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
rcu_read_unlock();
return ret;
}
get_rps_cpu will take into account RFS and aRFS settings as described
above to ensure that the data gets queued to the desired CPU's backlog with
a call to enqueue_to_backlog .
enqueue_to_backlog
qlen = skb_queue_len(&sd->input_pkt_queue);
if (qlen <= netdev_max_backlog && !skb_flow_limit(skb, qlen)) {
The length of input_pkt_queue is first compared to netdev_max_backlog . If the queue is longer than this value, the data is dropped. Similarly, the flow
limit is checked and if it has been exceeded, the data is dropped. In both
cases the drop count on the softnet_data structure is incremented. Note
that this is the softnet_data structure of the CPU the data was going to be
queued to. Read the section above about /proc/net/softnet_stat to learn
how to get the drop count for monitoring purposes.
Note: You need to check the driver you are using. If it calls
netif_receive_skb and you are not using RPS, increasing the
netdev_max_backlog will not yield any performance improvement
because no data will ever make it to the input_pkt_queue .
If the queue is empty: check if NAPI has been started on the remote
CPU. If not, check if an IPI is queued to be sent. If not, queue one
and start the NAPI processing loop by calling ____napi_schedule .
Proceed to queuing the data.
If the queue is not empty, or the previously described operation has
completed, enqueue the data.
The code is a bit tricky with its use of goto , so read it carefully. From
We use cookies to enhance the user experience on packagecloud.
net/core/dev.c:
By using our site, you acknowledge that you have read and understand our
Cookie Policy, Privacy Policy, and our Terms of Service. back to top
https://blog.packagecloud.io/eng/2016/06/22/monitoring-tuning-linux-networking-stack-receiving-data/#using-sysfs 79/114
3/26/2019 Monitoring and Tuning the Linux Networking Stack: Receiving Data - Packagecloud Blog
if (skb_queue_len(&sd->input_pkt_queue)) {
enqueue:
__skb_queue_tail(&sd->input_pkt_queue, skb);
input_queue_tail_incr_save(sd, qtail);
rps_unlock(sd);
local_irq_restore(flags);
return NET_RX_SUCCESS;
}
Flow limits
RPS distributes packet processing load amongst multiple CPUs, but a single large flow can monopolize CPU processing time and starve smaller flows. Flow limits are a feature that can be used to limit the number of packets queued to the backlog for each flow to a certain amount. This can help ensure that smaller flows are processed even though much larger flows are pushing packets in.
The if statement above from net/core/dev.c checks the flow limit with a call
to skb_flow_limit :
This code is checking that there is still room in the queue and that the flow limit has not been reached. By default, flow limits are disabled. In order to enable flow limits, you must specify a bitmap (similar to RPS' bitmap).
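For example, to enable flow limits for packets steered to CPUs 0-3 (the bitmap format is the same as for rps_cpus):

$ sudo bash -c 'echo f > /proc/sys/net/core/flow_limit_cpu_bitmap'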
Tuning
Before adjusting this tuning value, see the note in the previous section.
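If your driver does call enqueue_to_backlog (for example, because RPS is enabled), you can raise net.core.netdev_max_backlog to reduce drops; the default is 1000 and 3000 here is illustrative:

$ sudo sysctl -w net.core.netdev_max_backlog=3000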
You can adjust the weight of the backlog’s NAPI poller by setting the
net.core.dev_weight sysctl. Adjusting this value determines how much of
the overall budget the backlog poll loop can consume (see the section above about adjusting net.core.netdev_budget ):
Example: increase the NAPI poll backlog processing loop with sysctl .
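A sketch with an illustrative value; the default backlog weight is 64:

$ sudo sysctl -w net.core.dev_weight=600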
Tuning: Enabling flow limits and tuning flow limit hash table size
You can control the size of the flow limit hash table by setting the net.core.flow_limit_table_len sysctl. This change only affects newly allocated flow hash tables, so if you'd like to increase the table size, you should do it before you enable flow limits.
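For example (8192 is illustrative; the default table length is 4096):

$ sudo sysctl -w net.core.flow_limit_table_len=8192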
The per-CPU backlog queue plugs into NAPI the same way a device driver
does. A poll function is provided that is used to process packets from the
softirq context. A weight is also provided, just as a device driver would.
sd->backlog.poll = process_backlog;
sd->backlog.weight = weight_p;
sd->backlog.gro_list = NULL;
sd->backlog.gro_count = 0;
The backlog NAPI structure differs from the device driver NAPI structure in
that the weight parameter is adjustable, whereas drivers hardcode their
NAPI weight to 64. We’ll see in the tuning section below how to adjust the
weight using a sysctl .
process_backlog
The process_backlog function is a loop which runs until its weight (as
described in the previous section) has been consumed or no more data
remains on the backlog.
Each piece of data on the backlog queue is removed from the backlog
queue and passed on to __netif_receive_skb . The code path once the data
hits __netif_receive_skb is the same as explained above for the RPS
disabled case. Namely, __netif_receive_skb does some bookkeeping prior
to calling __netif_receive_skb_core to pass network data up to the
protocol layers.
process_backlog follows the same contract with NAPI that device drivers
do, which is: NAPI is disabled if the total weight will not be used. The poller
is restarted with the call to ____napi_schedule from enqueue_to_backlog
as described above.
If such a tap exists, the data is delivered there first, then to the protocol layers next.
If a packet tap is installed (usually via libpcap), the packet is delivered there by code in net/core/dev.c. If you are curious about the path of the data through pcap, read
net/packet/af_packet.c.
Once the taps have been satisfied, __netif_receive_skb_core delivers data to protocol layers. It does this by obtaining the protocol field from the data
and iterating across a list of deliver functions registered for that protocol
type.
type = skb->protocol;
list_for_each_entry_rcu(ptype,
&ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
if (ptype->type == type &&
(ptype->dev == null_or_dev || ptype->dev == skb->dev ||
ptype->dev == orig_dev)) {
if (pt_prev)
ret = deliver_skb(skb, pt_prev, orig_dev);
pt_prev = ptype;
}
}
Each protocol layer adds a filter to a list at a given slot in the hash table, computed with a helper function called ptype_head .
And now you know how network data gets from the NIC to the protocol
layer.
Now that we know how data is delivered to the protocol stacks from the
network device subsystem, let’s see how a protocol layer registers itself.
IP protocol layer
The IP protocol layer plugs itself into the ptype_base hash table so that
data will be delivered to it from the network device layer described in
previous sections.
This happens in the function inet_init from net/ipv4/af_inet.c:
dev_add_pack(&ip_packet_type);
ip_rcv
We can see the code which hands the data over to netfilter at the end of
ip_rcv in net/ipv4/ip_input.c:
The short version is that NF_HOOK_THRESH will check if any filters are installed and attempt to return execution back to the IP protocol layer to avoid going deeper into netfilter and anything that hooks in below that, like iptables and conntrack.
Keep in mind: if you have numerous or very complex netfilter or iptables
rules, those rules will be executed in the softirq context and can lead to
latency in your network stack. This may be unavoidable, though, if you need
to have a particular set of rules installed.
ip_rcv_finish
Once netfilter has had a chance to take a look at the data and decide what to do with it, ip_rcv_finish is called. This only happens if the data is not being dropped by netfilter, of course.
ip_rcv_finish begins with an optimization: if an early_demux function is registered for the higher-level protocol, it is called in an attempt to locate the destination socket (and its cached routing information) early:
ipprot->early_demux(skb);
/* must reload iph, skb->head might have changed */
iph = ip_hdr(skb);
}
}
Once the routing layer completes, statistics counters are updated and the
function ends by calling dst_input(skb) which in turn calls the input
function pointer on the packet's dst_entry structure that was affixed by the routing system.
If the packet's final destination is the local system, the routing system will
attach the function ip_local_deliver to the input function pointer in the
dst_entry structure on the packet.
ip_local_deliver
/*
* Deliver IP Packets to the higher protocol layers.
*/
int ip_local_deliver(struct sk_buff *skb)
{
/*
* Reassemble IP fragments.
*/
if (ip_is_fragment(ip_hdr(skb))) {
if (ip_defrag(skb, IP_DEFRAG_LOCAL_DELIVER))
return 0;
}
return NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_IN, skb, skb->dev, NULL,
ip_local_deliver_finish);
}
Once netfilter has had a chance to take a look at the data, ip_local_deliver_finish will be called, assuming the data is not dropped first by netfilter.
ip_local_deliver_finish
$ cat /proc/net/snmp
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDa
Ip: 1 64 25922988125 0 0 15771700 0 0 25898327616 22789396404 129878
...
This file contains statistics for several protocol layers. The IP protocol layer appears first. The first line contains space-separated names for each of the corresponding values in the next line.
enum
{
IPSTATS_MIB_NUM = 0,
/* frequently written fields in fast path, kept in same cache line */
IPSTATS_MIB_INPKTS, /* InReceives */
IPSTATS_MIB_INOCTETS, /* InOctets */
IPSTATS_MIB_INDELIVERS, /* InDelivers */
IPSTATS_MIB_OUTFORWDATAGRAMS, /* OutForwDatagrams */
IPSTATS_MIB_OUTPKTS, /* OutRequests */
IPSTATS_MIB_OUTOCTETS, /* OutOctets */
/* ... */
The format of /proc/net/netstat is similar to /proc/net/snmp , except the lines are prefixed with IpExt .
This blog post will examine UDP, but the TCP protocol handler is registered
the same way and at the same time as the UDP protocol handler.
These structures are registered in the initialization code of the inet address
family. From net/ipv4/af_inet.c:
/*
* Add all the base protocols.
*/
We’re going to be looking at the UDP protocol layer. As seen above, the
handler function for UDP is called udp_rcv .
This is the entry point into the UDP layer where the IP layer hands data over.
Let’s continue our journey there.
The code for the UDP protocol layer can be found in: net/ipv4/udp.c.
udp_rcv
The code for the udp_rcv function is just a single line which calls directly
into __udp4_lib_rcv to handle receiving the datagram.
__udp4_lib_rcv
The __udp4_lib_rcv function will check to ensure the packet is valid and
obtain the UDP header, UDP datagram length, source address, and
destination address. Next come some additional integrity checks and checksum verification.
sk = skb_steal_sock(skb);
if (sk) {
struct dst_entry *dst = skb_dst(skb);
int ret;
if (unlikely(sk->sk_rx_dst != dst))
udp_sk_rx_dst_set(sk, dst);
In both cases described above, the datagram will be queued to the socket with a call to udp_queue_rcv_skb . If, on the other hand, no socket is listening on the destination port, the datagram is dropped:
/*
* Hmm. We got an UDP packet to a port to which we
* don't wanna listen. Ignore it.
*/
kfree_skb(skb);
return 0;
udp_queue_rcv_skb
Finally, we arrive at the receive queue logic, which begins by checking if the receive queue for the socket is full (from net/ipv4/udp.c ).
sk_rcvqueues_full
The sk_rcvqueues_full function checks the socket’s backlog length and the
socket’s sk_rmem_alloc to determine if the sum is greater than the
sk_rcvbuf for the socket ( sk->sk_rcvbuf in the above code snippet):
/*
* Take into account size of receive queue and backlog queue
* Do not take into account this skb truesize,
* to allow even a single big packet to come.
*/
static inline bool sk_rcvqueues_full(const struct sock *sk, const struct sk_buff *skb,
                                     unsigned int limit)
{
    unsigned int qsize = sk->sk_backlog.len + atomic_read(&sk->sk_rmem_alloc);
Tuning these values is a bit tricky as there are many things that can be
adjusted.
For example, you can increase the default receive buffer size with:
$ sudo sysctl -w net.core.rmem_default=8388608
You can also set the sk->sk_rcvbuf size by calling setsockopt from your
application and passing SO_RCVBUF . The maximum you can set with
setsockopt is net.core.rmem_max .
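A minimal sketch in C of requesting a larger buffer from the application side (error handling omitted; the function name and the 8 MiB value are illustrative, and the kernel caps the request at net.core.rmem_max):

#include <sys/socket.h>

/* Ask for a larger receive buffer on a UDP socket. The kernel doubles the
 * value internally for bookkeeping and caps it at net.core.rmem_max. */
int make_udp_socket_with_big_rcvbuf(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    int rcvbuf = 8 * 1024 * 1024;   /* illustrative: request 8 MiB */

    if (fd >= 0)
        setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf));
    return fd;
}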
udp_queue_rcv_skb
Once it’s been veri ed that the queue is not full, progress toward queuing
the datagram can continue. From net/ipv4/udp.c:
bh_lock_sock(sk);
if (!sock_owned_by_user(sk))
rc = __udp_queue_rcv_skb(sk, skb);
else if (sk_add_backlog(sk, skb, sk->sk_rcvbuf)) {
bh_unlock_sock(sk);
goto drop;
}
bh_unlock_sock(sk);
return rc;
The first step is to determine if the socket currently has any system calls against it from a userland program (the sock_owned_by_user check above). If it does not, the datagram is added to the socket's receive queue with __udp_queue_rcv_skb . If it does, the datagram is queued to the backlog with sk_add_backlog .
The datagrams on the backlog are added to the receive queue when socket
system calls release the socket with a call to release_sock in the kernel.
__udp_queue_rcv_skb
From net/ipv4/udp.c:
rc = sock_queue_rcv_skb(sk, skb);
if (rc < 0) {
int is_udplite = IS_UDPLITE(sk);
Two very useful files for getting UDP protocol statistics are:
/proc/net/snmp
/proc/net/udp
/proc/net/snmp
Much like the detailed statistics found in this file for the IP protocol, you
will need to read the protocol layer source to determine exactly when and
where these values are incremented.
/proc/net/udp
$ cat /proc/net/udp
sl local_address rem_address st tx_queue rx_queue tr tm->when r
515: 00000000:B346 00000000:0000 07 00000000:00000000 00:00000000
558: 00000000:0371 00000000:0000 07 00000000:00000000 00:00000000
588: 0100007F:038F 00000000:0000 07 00000000:00000000 00:00000000
769: 00000000:0044 00000000:0000 07 00000000:00000000 00:00000000
812: 00000000:006F 00000000:0000 07 00000000:00000000 00:00000000
The first line describes each of the fields in the lines following.
And that is how data arrives at a system and traverses the network stack
until it reaches a socket and is ready to be read by a user program.
Extras
There are a few extra things worth mentioning that didn't seem to fit anywhere else.
Timestamping
As mentioned in the above blog post, the networking stack can collect
timestamps of incoming data. There are sysctl values controlling when/how
to collect timestamps when used in conjunction with RPS; see the above
post for more information on RPS, timestamping, and where, exactly, in the
network stack receive timestamping happens. Some NICs even support
timestamping in hardware, too.
This is a useful feature if you’d like to try to determine how much latency
the kernel network stack is adding to receiving packets.
Determine which timestamp modes your driver and device support with
ethtool -T .
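For example, assuming eth0:

$ sudo ethtool -T eth0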
The kernel also supports busy polling of sockets for low latency networking via the SO_BUSY_POLL socket option, which asks the device driver to poll its receive queue for new data instead of waiting for the next interrupt.
IMPORTANT NOTE: For this option to work, your device driver must support it. Linux kernel 3.13.0's igb driver does not support this option. The ixgbe driver, however, does. If your driver has a function set to the ndo_busy_poll field of its struct net_device_ops structure, you can use this option.
A great paper explaining how this works and how to use it is available from
Intel.
When using this socket option for a single socket, you should pass a time
value in microseconds as the amount of time to busy poll in the device
driver’s receive queue for new data. When you issue a blocking read to this
socket after setting this value, the kernel will busy poll for new data.
You can also set the sysctl value net.core.busy_poll to a time value in
microseconds of how long calls with poll or select should busy poll
waiting for new data to arrive, as well.
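A sketch in C of the per-socket option; it assumes kernel 3.11+ headers that define SO_BUSY_POLL, and the 50 microsecond budget used below is illustrative (error handling omitted):

#include <sys/socket.h>

/* Ask the kernel to busy poll the driver's receive queue for up to
 * usecs microseconds when a blocking read finds no data waiting. */
static int enable_busy_poll(int fd, int usecs)
{
    return setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usecs, sizeof(usecs));
}

Calling enable_busy_poll(fd, 50) before reading would then busy poll for up to 50 microseconds before the read sleeps.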
This option can reduce latency, but will increase CPU usage and power
consumption.
The Linux kernel provides a way for device drivers to be used to send and
receive data on a NIC when the kernel has crashed. The API for this is called
Netpoll and it is used by a few things, but most notably: kgdb, netconsole.
The Netpoll checks happen early in most of the Linux network device
subsystem code that deals with transmitting or receiving network data.
If you are interested in using the Netpoll API, you should take a look at the netconsole driver, the Netpoll API header file include/linux/netpoll.h , and this excellent talk.
SO_INCOMING_CPU
The SO_INCOMING_CPU flag was not added until Linux 3.19, but it is useful enough that it should be included in this blog post.
You can use getsockopt with SO_INCOMING_CPU to determine which CPU is processing network packets for a particular socket. Your application can then use this information to hand sockets off to threads running on the desired CPU to help increase data locality and CPU cache hits.
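A sketch of reading it in C; this assumes kernel 3.19+ headers that define SO_INCOMING_CPU (error handling reduced to a sentinel return):

#include <sys/socket.h>

/* Returns the CPU that has been processing packets for this socket,
 * or -1 if the option could not be read. */
static int incoming_cpu(int fd)
{
    int cpu = -1;
    socklen_t len = sizeof(cpu);

    if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len) != 0)
        return -1;
    return cpu;
}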
DMA Engines
A DMA engine is a piece of hardware that allows the CPU to offload large copy operations. This frees the CPU to do other tasks while memory copies are done with hardware. Enabling the use of a DMA engine and running code that takes advantage of it should yield reduced CPU usage.
The Linux kernel has a generic DMA engine interface that DMA engine
driver authors can plug into. You can read more about the Linux DMA
engine interface in the kernel source Documentation.
While there are a few DMA engines that the kernel supports, we’re going to
discuss one in particular that is quite common: the Intel IOAT DMA engine.
Many servers include the Intel I/O AT bundle, which is comprised of a series
of performance changes.
One of those changes is the inclusion of a hardware DMA engine. You can
check your dmesg output for ioatdma to determine if the module is being
loaded and if it has found supported hardware.
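For example:

$ dmesg | grep -i ioatdma
$ lsmod | grep ioatdma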
The DMA offload engine is used in a few places, most notably in the TCP stack.
Support for the Intel IOAT DMA engine was included in Linux 2.6.18, but
was disabled later in 3.13.11.10 due to some unfortunate data corruption
bugs.
Another interesting feature included with the Intel I/O AT bundle is Direct
Cache Access (DCA).
This feature allows network devices (via their drivers) to place network data
directly in the CPU cache. How this works, exactly, is driver specific. For the
igb driver, you can check the code for the function igb_update_dca , as
well as the code for igb_update_rx_dca . The igb driver uses DCA by
writing a register value to the NIC.
To use DCA, you will need to ensure that DCA is enabled in your BIOS, the
dca module is loaded, and that your network card and driver both support
DCA.
If you are using the ioatdma module despite the risk of data corruption
mentioned above, you can monitor it by examining some entries in sysfs .
For example, to get the number of copy operations offloaded by this DMA channel:
$ cat /sys/class/dma/dma0chan0/memcpy_count
123205655
Similarly, to get the number of bytes offloaded by this DMA channel, you'd
run a command like:
$ cat /sys/class/dma/dma0chan0/bytes_transferred
131791916307
The IOAT DMA engine is only used when packet size is above a certain
threshold. That threshold is called the copybreak . This check is in place
because for small copies, the overhead of setting up and using the DMA
engine is not worth the accelerated transfer.
Conclusion
Need some extra help navigating the network stack? Have questions about
anything in this post or related things not covered? Send us an email and
let us know how we can help.