
Linux Performance Tuning

Tuesday, October 31, 2017

Logistics


Tutorial runs from 9:00am to 5:00pm

Morning break at 10:30am

Lunch from 12:30 to 1:30pm

Afternoon break at 3:00-3:30pm

Feel free to ask me questions

But I reserve the right to defer some answers until
later in the session or to the break/end of the class.

Please fill out and return the tutorial evaluation
form!
2
Agenda


Introduction to Performance Tuning

Filesystem and storage tuning

Network tuning

NFS performance tuning

Memory tuning

Application tuning

Introduction to Performance
Tuning

Complex task that requires in-depth
understanding of hardware, software, and
application

If it were easy the OS would do it automatically (and
the OS does a lot automatically to begin with)

Goals of Performance Tuning

Speed up the time to do a single large task (e.g., the
time to perform some large matrix calculation)

Graceful degradation of a web/application server as
it is asked to service a larger and larger number of
requests
4
Stress Testing


What happens when a server is put under a
large amount of stress?

“My web server just got slashdotted!”

Typically the server behaves well until the load
increases beyond a certain critical point; then it
breaks down.

Transaction latencies go through the roof

The server may cease functioning altogether

Measure the system when it is functioning
normally, and then when it is under stress.
5
What changes?

Finding Bottlenecks


Careful tuning of memory usage won't matter if
the problem is caused by a shortage of disk
bandwidth

Performance measurement tools are hugely
important to diagnose what is placing limits on
the scalability or performance of your
application

Start with large areas, then narrow down

Is your application I/O bound? CPU bound?
Network bound?
6
Incremental Tuning


Use the scientific method

Establish a baseline

Define testing parameters which are replicated from
test to test.

Measure the performance given a starting
configuration.

Change one parameter at a time

Record everything

Make sure you get the same results when you
repeat a test!
7

Measurement overhead


Some performance measurement tools may
impact your application's behavior

If you're not familiar with how a particular tool
interacts with your workload, don't assume that a
tool has zero overhead with your application!

Enabling application performance metering or
debugging may also change its baseline
numbers.

8
A basic performance tuning
methodology

Define your baseline configuration and measure
its performance

[ If appropriate, define a stress test workload, and
measure it. ]

Make a single change to the system
configuration. Measure the results of that
change and record it.

Repeat as necessary

Make sure to test single changes as well as
combinations of changes. Sometimes effects are
synergistic.
9

Basic Performance Tools


free

top

iostat

10
The free(1) command


Basic command which shows memory usage

11
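
For example, a minimal sketch (exact flags vary a bit between procps versions):

    free -m       # memory and swap usage in megabytes
    free -h -s 5  # human-readable units, refreshed every 5 seconds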

Questions to ask yourself after looking at free(1) output

Will adding more memory help?

Often the cheapest way to speed up a server

If the system is using paging or swapping,
adding more physical memory may help

Will a larger page cache help?

More sophisticated tools will answer these
questions later...

But asking questions is the beginning of wisdom

12
The top(1) command


Good general place to start

13

Questions to ask yourself when looking at top(1) output

What are the “top” tasks running; should they
be there? Are they running? Waiting for disk?
How much memory are they taking up?

How is the CPU time (overall) being spent?

User time, System time, Niced user time, I/O Wait,
Hardware IRQ, Software IRQ, “Stolen” time

Alternative top command: “htop”

14
The iostat(1) command


Part of the sysstat package; shows I/O statistics

Use -k for kilobytes instead of 512-byte sectors

15
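
For example (the interval and count here are arbitrary):

    iostat -k 5 3   # three reports of per-device kB/s, 5 seconds apart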

Advanced iostat(1)


Many more details with the -x option

rrqm/s, wrqm/s – read/write requests merged per
second

r/s, w/s – read/write requests per second

rkB/s, wkB/s – number of kilobytes of read/write
transfers per second

avgrq-sz – average request size in 512-byte
sectors

avgqu-sz – average request queue length


16
Advanced iostat(1), continued


Still more details revealed with the -x option

await – average time (in ms) between when a
request is issued and when it is completed (time
in queue plus time for device to service the request)

svctm – average service time (in ms) for I/O
requests that were issued to the device

%util – Percentage of CPU time during which the
device was servicing requests. (100% means the
device is fully saturated)

17

Example of iostat -xk 1


Workload “fs_mark -s 10240 -n 1000 -d /mnt”

Creates 1000 files, each 10k, in /mnt, with an fsync
after writing each file

Result: 33.7 files/second

18
Conclusions we can draw from
the iostat results

Utilization: 98.48%

The system is I/O bound

Adding memory or speeding up the CPU clock
won't help

Solution – attack the I/O bottleneck

Add more I/O bandwidth resources (use a faster
disk or use a RAID array)

Or, do less work!

19

Speeding up fs_mark


If we mount the (ext4) file system with -o
barrier=0, files/sec becomes 358.3

But this risks fs corruption after a power fail

Is the fsync() really needed? Without it, files/sec
goes up to 17,010.30

Depends on application requirements

Experimental: -o journal_async_commit

Using journal checksums, it allows ext4 to use only
one barrier per fsync() instead of two. (Caveats:
requires e2fsprogs 1.43 to use safely; risk of stale
data being exposed after a crash)
20
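
As a rough sketch of the two mount options discussed above (device and mount point are hypothetical):

    mount -o barrier=0 /dev/sdb1 /mnt              # faster, but unsafe across power failures
    mount -o journal_async_commit /dev/sdb1 /mnt   # experimental; needs a recent e2fsprogs
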
Using -o journal_async_commit


Using “fs_mark -s 10240 -n 1000 -d /mnt” again

Result: 48.2 files/sec (a 46% improvement over
33.7 files/sec!)

21

Comparing the two results

Default barriers: 33.7 files/sec (2000 barrier ops)

journal_async_commit: 48.2 files/sec (1000 barrier ops)

22
Before we leave fs_mark...


How does fs_mark fare on other file systems?

ext2 (no barriers) – 574.9

ext3 (no barriers) – 348.8 (w/ barriers) – 30.8

ext4 (no barriers) – 358.3 (w/ barriers) – 33.7

XFS (no barriers) – 337.3 (w/ barriers) – 29.0

reiserfs (no barriers) – 210.0 (w/ barriers) – 31.5

Important note: these numbers are specific to
this workload (small files, fsync heavy) and not
a general figure of merit for these file systems
23

Lessons Learned So Far


Measure, analyze, and then tweak

Bottleneck analysis is critical

It is very useful to understand how things work
under the covers

Adding more resources is one way to address a
bottleneck

But so is figuring ways of doing less work!

Sometimes you can achieve your goal by working
smarter, not harder.

24
The snap script


Handy quickie shell script that a colleague and I
developed while working on the Advanced Linux
Response Team at IBM

Collects a lot of statistics: iostat, meminfo,
slabinfo, sar, etc. in a low impact fashion.

Collects system configuration information

Especially useful when I might not have access to
the system for security reasons

Gather information for a day; then analyze for
trends or patterns

https://sites.google.com/site/linuxperftuning/snap
25

Agenda


Introduction to Performance Tuning

Filesystem and storage tuning

Network tuning

NFS performance tuning

Memory tuning

Application tuning

26
File system and storage tuning


Choosing the right storage devices

Hard Drives

SSD

RAID

NFS appliances

File System Tuning

General Tips

File system specific

27

Hard Drives


Disks are probably the biggest potential
bottleneck in your system

Punched cards and paper tape having fallen out of
favor...

Critical performance specs you should examine

Sustained Data Transfer Rate

Rotational speed: 5400rpm, 7200rpm, 10,000 rpm

Areal density (max capacity in that product family)

Seek time (actually 3 numbers: average, track-to-
track, full stroke)
28
Transfer Rates


The important number is the sustained data
transfer rate (aka the “disk-to-buffer” rate)

Typically around 70-100 MB/s; slower for laptop
drives

Much less important: The I/O transfer rate

At least for hard drives, whether you are using
SATA I's 1.5 Gb/s or SATA II's 3.0 Gb/s won't
matter except for rare cases when transferring data
out of the track buffer

SSD's might be a different story, of course...
29

Short stroking hard drives


HDD performance is not uniform across the
platter

Up to 100% performance improvements on the
outer diameter (OD) of the disk

Consider partitioning your disk to take this into
account!

If you don't need the full 1TB of space, partitioning
your disk to only use the first 100GB or 300GB
could speed things up!

Also – when running benchmarks, use the same
partitions for each file system tested.
30
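
As a sketch with parted (the device name and sizes are hypothetical; adjust to your hardware):

    # Use only the first 300GB of a (hypothetical) 1TB drive
    parted /dev/sdb mklabel gpt
    parted /dev/sdb mkpart primary 1MiB 300GiB
    mkfs.ext4 /dev/sdb1
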
What about SSD's?


Advantages of SSD

Fast random access reads

Usually fails when writing, not when reading

Less susceptible to mechanical shock/vibration

Most SSD's use less power than HDD's

31

What about SSD's?


Disadvantage of SSD's

Cost per GB is much higher than for HDD's

Limited number of write cycles

Writes are slower than reads; random writes can be
much slower (up to a ½ sec average, 2 sec worst
case for 4k random writes for really bad SSD's!)

Most SSD's do not have power fail protection

Do your research carefully before buying SSD's

32
Should you use SSD's?


For laptops and desktops, absolutely!

For servers, it depends...

If you need fast random access reads, yes

If you care about TCO, be careful

Careful spreading of IOPS across large numbers of
HDD’s could be better. See Google’s Disks for
Data Centers paper for more details.

For certain workloads, the write endurance problem
of SSD's may be a concern

33

PCIe attached flash


Like SSD's, only more so

Speed achieved by writing to large numbers of
flash chips in parallel

Potentially 100k to 1M 4k random reads / second

Synchronous 4k random writes are just as slow as on SSD's

Very expensive, but the price has dropped

In some cases, they can be cost effective

If you can really use all of their IOPS

34
RAID


Redundant Array of Inexpensive Disks

RAID 0 – Striping

RAID 1 – Mirroring

RAID 5 – 3 or more disks, with a rotating parity
stripe

RAID 6 – 4 or more disks, with two rotating parity
stripes

RAID 10 – Mirroring + striping

35

RAID tuning considerations


Adding more spindles improves performance

RAID 5/6 requires some special care

Writes smaller than the N*stripe size will require a
read/modify/write cycle in order to update the parity
stripe (where N is the number of non-spare disks)

If the RAID device is going to be broken up using
LVM or partitions, make sure the LV or partition is
aligned on a full stripe boundary

36
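
A rough sketch of creating an aligned setup (device names, disk count, and chunk size are hypothetical):

    # 4-disk RAID 5 with a 512KB chunk => full stripe = 3 x 512KB = 1536KB
    mdadm --create /dev/md0 --level=5 --raid-devices=4 --chunk=512 /dev/sd[bcde]
    # Align LVM's data area to the full stripe size
    pvcreate --dataalignment 1536k /dev/md0
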
Filesystem Tuning


Most general purpose file systems work quite
well for most workloads

But some file systems are better for certain
specialized workloads

Reiserfs – small (< 4k) files

XFS – very big RAID arrays, very large files

Ext4 is a good general purpose filesystem that
many people use by default

37

Managing Access-time Updates


POSIX requires that a file's last access time is
updated each time its contents are accessed.

This means a disk write for every single read

The mount options noatime and relatime
(default in 2.6.30+) can reduce this overhead

The relatime option will only update the atime if
the mtime or ctime is newer than the last atime.

Saves nearly as many writes as noatime

Some applications do depend on atime being
updated
38
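
For example (the mount point is hypothetical):

    mount -o remount,noatime /srv    # no atime updates at all
    mount -o remount,relatime /srv   # the 2.6.30+ default behavior
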
Lazy time updates


New mount option added in Linux 4.0

Suppress timestamp-only updates

Batch the updates with mandatory inode updates

Win for random, non-allocating read/write
workloads

Timestamps will be written out after 24 hours,
or on a syncfs(2) or umount(2)

Ext4-specific optimization: when an inode gets
updated, all dirty timestamps in the same itable
block are written out
39
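
For example, on Linux 4.0 or newer (mount point hypothetical):

    mount -o remount,lazytime /srv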

Tuning ext4 journals


Sometimes increasing the journal size can help;
especially if your workload is very metadata-
intensive (lots of small files; lots of file
creates/deletes/renames)

Journal data modes

data=ordered (default) – data is written first before
metadata is committed

data=journal – data is written into the journal

data=writeback – only metadata is logged; after a
crash, uninitialized data can appear in newly
allocated data blocks
40
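
A sketch of both knobs (the device is hypothetical; the journal can only be resized while the filesystem is unmounted):

    tune2fs -J size=400 /dev/sdb1            # grow the ext4 journal to 400MB
    mount -o data=writeback /dev/sdb1 /mnt   # metadata-only journaling
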
Using ionice to control read/write
priorities

Like the nice command but affects the priority of
read/write requests issued by the process

Three scheduling classes

Idle – only if there are no other high priority
requests pending

Best-effort – requests served round-robin (default)

Real time – highest priority request always gets
access

For best-effort and real time classes, there are
8 priorities, with 0 being the highest priority and
7 the lowest priority
41
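
For example (the backup command and PID are hypothetical):

    ionice -c 3 tar czf /backup/home.tgz /home   # run a backup job in the idle class
    ionice -c 2 -n 7 -p 1234                     # demote PID 1234 to best-effort, priority 7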

Brendan Gregg's Scripts


Really useful scripts which use ftrace and perf

iosnoop – more friendly version of blktrace

bitesize – shows distribution of I/O sizes

iolatency – shows I/O latency

opensnoop – shows file opens

“git clone https://github.com/brendangregg/perf-tools.git”

Uses ftrace to collect data from the kernel – most
processing done in userspace

42
Sidebar: eBPF


Started as the Berkeley Packet Filter – then
grew wildly out of control

In-kernel JIT compiler (Linux 3.0+)

Extended to add 10 registers, call and tail_call
functions, arrays, associative arrays, etc.

LLVM 3.7 can compile C code to eBPF byte code

In-kernel code verifier (same guarantees as BPF)

Can be attached to tracepoints or perf so filtering
and data aggregation can be done in the kernel
(Linux 4.2+)

Promising future for Linux performance tracing
43

eBPF versions of Brendan's Tools


Requires Linux 4.1+

Part of IOvisor’s bcc (BPF Compiler Collection)

“git clone https://github.com/iovisor/bcc.git”

eBPF versions:

biosnoop – more friendly version of blktrace

bitesize – shows distribution of I/O sizes

biolatency – shows I/O latency

opensnoop – shows file opens

44
Kernel Version and Distributions


RHEL 7.4 – 3.10 (August 1, 2017)

SLES 12.1 – 3.12 (May 31, 2017)

Ubuntu 16.04 LTS – 4.8 (April 21, 2016)

Debian 9 – 4.9 (July 22, 2017)

Ubuntu 17.10 – 4.13 (October 19, 2017)

45

Agenda


Introduction to Performance Tuning

Filesystem and storage tuning

Network tuning

NFS performance tuning

Memory tuning

Application tuning

46
Network Tuning


Before you do anything else... check the basic
health of the network

Speed, duplex, errors

Tools: ethtool, ifconfig, ping

Check TCP throughput: ttcp or nttcp

Look for “weird stuff” using wireshark / tcpdump

Network is a shared resource

Who else is using it?

What are bottlenecks in the network topology?
47
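
A few quick health checks as a sketch (the interface and host names are hypothetical):

    ethtool eth0 | grep -E 'Speed|Duplex'   # negotiated link speed and duplex
    ip -s link show eth0                    # per-interface error and drop counters
    ping -c 10 fileserver                   # basic latency and packet loss check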

Latency vs Throughput


Latency

When applications need maximum responsiveness

Lockstep protocols (i.e., no sliding window
optimizations)

RPC-based protocols

Throughput

When transferring large data sets

Very often tuning efforts will trade off latency for
throughput or vice versa

48
Interrupt Coalescing


This reduces CPU load by amortizing the cost
of an interrupt over multiple packets; this allows
us to trade off latency for throughput

“ethtool -C ethX rx-usecs 80 rx-frames 20”

This will delay a receive interrupt for 80 μs or until
20 packets are received, whichever comes first

“ethtool -C ethX rx-usecs 0 rx-frames 1”

This will cause an interrupt to be sent for every
packet received

Different NIC's will have different defaults and
may have additional tuning parameters
49

Enable NIC optimizations


Some device drivers don't enable these
features by default

You can check using “ethtool -k eth0”

TCP segment offload

“ethtool -K eth0 tso on”

Checksum off-load

“ethtool -K eth0 tx on rx on”

Large Receive offload (for throughput)

“ethtool -K eth0 lro on”
50
Optimizing TCP


“To optimize TCP, first you must be smarter
than TCP...”

A quick review of TCP

Provides a sequenced, reliable stream service

Flow control via a sliding window

Window: amount of data which has been sent, but
not yet acknowledged

As data is ack'ed, more data can be sent

Congestion avoidance by controlling the window
size (in response to packet loss)

Throughput is small when the window size is small
51

The bandwidth-delay product


Very important when optimizing for throughput,
especially for high speed, long distance links

Represents the amount of data that can be “in
flight” at any particular point in time.

BDP = 2 * bandwidth * delay

BDP = bandwidth * Round Trip Time (RTT)

example:

(100 Mbits/sec / 8 bits/byte) * 50 ms ping time =
625kbytes

52
Why the BDP matters


TCP has to be able to retransmit any dropped
packets; so the kernel has to remember what
data has been sent in case it needs to
retransmit it.

TCP Window

Limits on the size of the TCP window to control
kernel memory consumed by the networking stack

53

Using the BDP


The BDP in bytes plus some overhead room
should be used as [wmax] below when setting
these parameters in /etc/sysctl.conf:

net.core.rmem_max= [wmax]

Maximum Socket Receive Buffer size

net.core.wmem_max= [wmax]

Maximum Socket Send Buffer size

net.core.rmem_max also known as
/proc/sys/net/core/rmem_max

e.g., set via “echo 2097152 >
/proc/sys/net/core/rmem_max”
54
Per-socket /etc/sysctl.conf
settings

net.ipv4.tcp_rmem = [wmin] [wstd] [wmax]

receive buffer sizing in bytes (per socket)

net.ipv4.tcp_wmem = [wmin] [wstd] [wmax]

memory reserved for send buffers in bytes (per
socket)

Modern kernels do automatic tuning of the
receive and send buffers; and the defaults are
better; still, if your BDP is very high, you may
need to boost [wstd] and [wmax]. Keep [wmin]
small for out-of-memory situations.
55
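
As a sketch, /etc/sysctl.conf settings sized for a path with a BDP of roughly 625KB (the exact values are hypothetical; leave headroom above the BDP, and apply with “sysctl -p”):

    net.core.rmem_max = 2097152
    net.core.wmem_max = 2097152
    net.ipv4.tcp_rmem = 4096 87380 2097152
    net.ipv4.tcp_wmem = 4096 65536 2097152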

For large numbers of TCP connections

net.ipv4.tcp_mem = [pmin] [pdef] [pmax]

pages allowed to be used by TCP (for all sockets)

For 32-bit x86 systems, kernel text & data
(including TCP buffers) can only be in the low
896MB.

So on 32-bit x86 systems, do not adjust these
numbers, since they are needed to balance
memory usage with other Lowmem users.

If this is a problem, best bet is to switch to a 64-bit
x86 system first.
56
Increasing Transmit Buffers


For high throughput transfers (assuming use of
all available bandwidth) you want to keep the
packets flowing at all times

… while minimizing interrupt overhead

So the traditional advice was to increase the
transmit buffers

“ifconfig eth0 txqueuelen 50000”

“ethtool -G eth0 tx 4096 rx 4096”

57

Too much buffering → Bufferbloat

What if you don't have full access to the
bandwidth between the source and destination?

TCP depends on timely packet drops for
congestion control

Too much buffering leads to increased latency

Can cause congestive collapse on wireless
networks

If throughput is critical, use ttcp to find the
smallest xmit buffers possible for your setup

58
If you want to be a good citizen...


“modprobe tcp_vegas; echo vegas >
/proc/sys/net/ipv4/tcp_congestion_control”

“ifconfig wlan0 txqueuelen 4”

“ethtool -G eth0 tx 64”

Use the smallest value you can

“ethtool -g eth0” to see your current values

“echo 1 > /proc/sys/net/ipv4/tcp_ecn”

“echo 1 > /proc/sys/net/ipv4/tcp_sack”

“echo 1 > /proc/sys/net/ipv4/tcp_dsack”

… if everyone did this, we might avoid some wifi
meltdowns at conferences
59

New fix for Bufferbloat: CoDel


Stands for “Controlled Delay”

Nichols, K. and Jacobson, V. “Controlling Queue
Delay.” Communications of the ACM, July 2012

In Linux 3.5 and newer kernels

CeroWRT for home routers

“The result is very impressive: I see a 30X reduction in
ping latency on a fully saturated 10Mbps network. For
an interactive ssh session over that same saturated
10Mbps network, fq_codel totally eliminates the laggy
keyboard response — it feels like there’s no other
network traffic at all!” – Kamal Mostafa
60
Optimizing for Low Latency TCP


This can be very difficult, because TCP is not
really designed for low latency applications.

TCP is engineered to worry about congestion
control on wide-area networks, and to optimize for
throughput on large data streams.

If you are writing your own application from
scratch, basing your own protocol on UDP is
often a better bet.

Do you really need a byte-oriented service?

Do you only need automatic retransmission to deal
with lost packets?
61

Nagle Algorithm


Goal: To make networking more efficient by
batching small writes into a bigger packet for
efficiency

When the OS gets a small amount of data (a single
keystroke in a telnet connection), it delays a very
small amount of time to see if more bytes will be
coming → this naturally increases latency!

Disabling Nagle in the application:

int on = 1;

setsockopt(sockfd, SOL_TCP, TCP_NODELAY,
&on, sizeof(on));
62
Delayed Acknowledgements


On the receiver end, wait a small amount of
time before sending a bare acknowledgement
to see if there's more data coming (or if the
program will send a response upon which you
can piggy-back the acknowledgement)

This can interact with TCP slow-start to cause
longer latencies when the send window is
initially small.

After congestion or after the TCP connection has
been idle, the send window (maximum bytes of
unack'ed data) must be set back down to the MSS value
63

Solving the Delayed Ack problem


Disable slow-start algorithm on the sender?

Slow-start is a MUST implement (RFC 2581)

Disable delayed acknowledgments on the
receiver?

Delayed acknowledgments is a SHOULD (RFC
2581)

Some OS's have a way of disabling delayed
acknowledgments; Linux does not

There is a hack that works on a per-packet basis,
though...
64
Enabling QUICKACK

Linux tries to be “clever” and automatically
figure out when to disable delayed
acknowledgments when it believes the other
side is in slow start.

Hack to force “quickack” mode:

int on = 1;

setsockopt(sockfd, SOL_TCP, TCP_QUICKACK,
&on, sizeof(on));

But QUICKACK mode is disabled once the other side
is done with slow start. So you have to re-enable
it any time the connection is idle for longer than
the retransmission time.
65

Agenda


Introduction to Performance Tuning

Filesystem and storage tuning

Network tuning

NFS performance tuning

Memory tuning

Application tuning

66
NFS Performance tuning


Optimize both your network and your filesystem

In addition, various client and server specific
settings that we'll discuss now

General hint: use dedicated NFS servers

NFS file serving uses all parts of your system: CPU
time, memory, disk bandwidth, network bandwidth,
PCI bus bandwidth

Trying to run applications on your NFS servers will
make both NFS and the apps run slowly

67

Tuning a NFS Server


If you only export file system mountpoints, use
the no_subtree_check option in /etc/exports

Subtree checking can burn large amounts of CPU for
metadata-intensive workloads

Bump up the number of NFS threads to a large
number such as 128 (instead of 4 or 8, which is
way too few). How to do this is distro-specific;
look for something like RPCNFSDCOUNT in:

/etc/sysconfig/nfs

/etc/default/nfs-kernel-server
68
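
A rough sketch (the export path and client subnet are hypothetical):

    # /etc/exports
    /export  192.168.1.0/24(rw,no_subtree_check)

    # /etc/sysconfig/nfs or /etc/default/nfs-kernel-server
    RPCNFSDCOUNT=128
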
PCI Bus tuning


NFS serving puts heavy demands on both
networking cards and host bus adapters

If you have a system with multiple PCI buses,
put the networking and storage cards on
different buses

Network cards tend to use lots of small DMA
transfers, which tends to hog the bus

69

NFS client tuning


Make sure you use NFSv3 and not NFSv2

Make sure you use TCP and not UDP

Use the largest rsize/wsize that the client/server
kernels support

Modern client/servers can do a megabyte at a time

Use the hard mount option, and not soft

Use intr so you can recover if an NFS server is down

All of these are the default except for intr

Remove outdated fstab mount options. Just use
“rw,intr”
70
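
A minimal /etc/fstab line as a sketch (server name and paths are hypothetical; rsize/wsize are negotiated automatically on modern kernels):

    fileserver:/export  /mnt/data  nfs  rw,intr  0  0
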
Tuning your network config for
NFS

Tune the network for bulk transfers (throughput)

Use the largest MTU size you can

For ethernets, consider using jumbo frames if all of
the intervening switches/routers support it

71

NFS v4


Use NFSv4 if…

… you use NFSv4.1 and at least Linux 4.0

… you use NFS across higher-latency networks

… few files are modified by different clients

Delegations and caching can be a huge win

But they can make NFSv4 much more “chatty”

Delegation conflicts result in at least 100ms delay

For more information see the article in the June
2015 issue of ;login:

72
Agenda


Introduction to Performance Tuning

Filesystem and storage tuning

Network tuning

NFS performance tuning

Memory tuning

Application tuning

73

Memory Tuning


Memory tuning problems can often look like
other problems

Unneeded I/O caused by excessive paging/swapping

Extra CPU time caused by cache/TLB thrashing

Extra CPU time caused by NUMA-induced memory
access latencies

These subtleties require using more
sophisticated performance measurement tools

74
To measure swapping activity


The top(1) and free(1) commands will both tell
you if any swap space is in use

To a first approximation, if there is any swap in use,
the system can be made faster by adding more
RAM.

To see current swap activity, use the sar(8)
program

First use of a very handy (and rather complicated)
system activity recorder program; reading through
the man page strongly recommended

Part of the sysstat package
75

Using sar to obtain swapping information

Use “sar -W <interval> [<num. of samples>]”

Reports the number of pages written (swapped out)
and read (swapped in) from the swap device per
second.

The first output is the average since system was
started.

76
Optimizing swapping


Use multiple swap devices

Use fast swap devices

Fast devices can be given a higher priority

Add more memory to avoid swapping in the first
place

77
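
For example, /etc/fstab entries that give a faster (hypothetical) device a higher swap priority:

    /dev/nvme0n1p2  none  swap  sw,pri=10  0  0
    /dev/sda3       none  swap  sw,pri=1   0  0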

Swapping vs. Paging


Swap used for anonymous pages

i.e., pages which are not backed by a file

Pages which are backed by a file are subject to
paging

If they have been modified, or made dirty, they are
”cleaned” by being written to their backing store

If a page has not been used recently, it is
“deactivated” by removing it from processes' page
tables

Clean and inactive pages may be repurposed for
other uses on an LRU basis
78
Optimizing Paging


Unlike swapping, some amount of paging is
normal – and unavoidable

So we can't just manage the amount of paging to
zero, like we can with swapping

Goal: to minimize amount of paging in the steady-
state case

Key statistics:

majflt/s – major faults (which result in I/O) / second

pgsteal/s – pages reclaimed from the page and
swap cache per second to satisfy memory demands
79

Using sar to obtain information about paging

Use “sar -B <interval> [<num. of samples>]”

Reports many statistics

pgpgin/s, pgpgout/s – ignore, not useful/misleading

fault/s – # of page faults / sec.

majflt/s – # of page faults that result in I/O / sec.

pgfree/s – # of pages placed on the free list / sec.

pgscank/s – # of pages scanned by kswapd / sec.

pgscand/s – # of pages scanned directly / sec.

pgsteal/s – # of pages reclaimed from cache / sec.

%vmeff – pgsteal/s / (pgscank/s + pgscand/s)
80
Other ways of finding information
about memory utilization

cat /proc/meminfo

Something especially important on 32-bit x86
kernels: Low Memory vs. High Memory

Documentation/filesystems/proc.txt

cat /proc/slabinfo

Useful for seeing how the kernel is using memory

ALT-sysrq-m (or 'echo m > /proc/sysrq-trigger')

Different for different kernel versions and
distributions; /proc/slabinfo may not exist if
CONFIG_SLUB is used instead of CONFIG_SLAB
81

/proc/meminfo

82
Interesting bits from sysrq-m


Per-zone statistics

83

About Memory Caches


2GHz processor → 2 billion cycles per second

Memory is much slower

Solution: use small amounts of fast cache
memory

Typically 32KB of very fast Level 1 cache

Maybe 4-8MB of somewhat slower Level 2 cache

Can see how much cache you have using
dmidecode and x86info

Not much tuning that can be done except by
improving the C/C++ program code
84
TLB Caches


The Translation Lookaside Buffer (TLB) speeds up
translation from a virtual address to a physical
address

Normally requires 2-3 lookups in the page tables

TLB cache short circuits this lookup process

The x86info program will show the TLB cache
layout

Hugepages are a way to avoid consuming too
many TLB cache entries

The perf command can show TLB and cache
hit/miss statistics
85

Using hugepages


Build a kernel that avoids using modules

The core kernel text segment uses huge pages;
modules do not

Modify an application to use hugepages (or
configure an application to use it if it already
has provision to use hugepages).

“mount -t hugetlbfs none /hugepages” then mmap
pages in /hugepages

On new qemu/kvm, you can use the option

“-mem-path /hugepages”

Use shmget(2) with the flag SHM_HUGETLB
86
Configuring hugepages


On older enterprise distro's (RHEL 6 / SLES 11)
this must be done at boot time or shortly after it

Kernel boot option “hugepages=nnn”

/etc/sysctl.conf: “vm.nr_hugepages=nnn”

These pages are reserved for hugepages and can
not be used for anything else

With kernels newer than 2.6.23, things are
more flexible

Kernel boot option “movablecore=nnn[KMG]”

Memory reserved this way can be used for
hugepages and other uses
87
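
A short sketch (the page count and mount point are hypothetical):

    sysctl -w vm.nr_hugepages=512     # reserve 512 huge pages (2MB each on x86_64)
    grep Huge /proc/meminfo           # verify the reservation
    mkdir -p /hugepages
    mount -t hugetlbfs none /hugepages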

Agenda


Introduction to Performance Tuning

Filesystem and storage tuning

Network tuning

NFS performance tuning

Memory tuning

Application tuning

88
Application Tuning


Access to the source code?

Open source vs. Proprietary

Ability/willingness to modify the code?

Even if it's open source, you might not want to
modify the code

Proprietary programs

Read the documentation; find the knobs and find
the application-level statistics you can gather

… but there are still some tricks we can do to figure
out what is going on when you don't have the
source...
89

A quick aside: Java Performance Tuning

I'm not a Java programmer.... but I've worked
with a lot of Java performance tuning experts

First thing to consider is Garbage Collection

The GC is overhead that burns CPU time

GC can cause unpredictable pauses in the program

Collecting GC stats: JVM command-line option
-verbose:gc

Sizing the heap

Larger heap means fewer GC's

… but more time spent GC'ing when you do
90
Generational GC


Observation: objects in Java have a high infant
mortality rate

Temporary objects, etc.

So put them in separate arenas.

An object starts in the nursery (aka eden) space.
The nursery is GC'ed more frequently.

Objects which survive a certain number of GC
passes get promoted from the nursery to a tenured
space (which is GC'ed less frequently)

Need to configure the size of the nursery and
tenured space 91

Reducing GC's by not creating as much Garbage

Requires being able to modify the code

Very often, though, Java programmers can
make extra work for the Java Run-time
Environment without realizing it

Two common examples

Using String and Integer class variables to do
calculations (instead of StringBuffer and the
primitive int type)

Using java.util.Map instead of creating a Class

Yes, using an associative array rather than open-
coding accessor functions to class variables, but...
92
Back to C/C++ applications


Tools for investigating applications

strace/ltrace

valgrind

gprof

oprofile

perf

Most of these tools work better if you have
source access

But sometimes source is not absolutely required
93

strace and ltrace


Useful for seeing what the application is doing

Especially useful when you don't have source

System call tracing: strace

Shared library tracing: ltrace

Run a new command with tracing:

strace /bin/ls /usr

Attach to an already existing process

ltrace -p 12345

Newer systems may have “perf trace” – less
overhead than strace
94
Valgrind


Used for finding memory leaks and other
memory access bugs

Best used with source access (compiled with -g);
but not strictly necessary

Works by emulating x86 in x86 and adding
checks to pointer references and malloc/free
calls

Other architectures supported

Commercial alternative: purify (uses object
code insertion)
95

C/C++ profiling using gprof


To use, compile your code using the -pg option

This will add code to the compiled binary to
track each function call and its caller

In addition the program counter is sampled by
the kernel at some regular interval (e.g., 100Hz
or 1kHz) to find the “hot spots”

Demo time!

96
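
A minimal sketch (the program name is hypothetical):

    gcc -O2 -pg -o myapp myapp.c     # build with profiling hooks
    ./myapp                          # run; writes gmon.out in the current directory
    gprof ./myapp gmon.out | less    # flat profile plus call graph
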
System profiling using
oprofile/perf

Basic operation very similar to gprof

Sample the program counter at regular intervals

Advantages over gprof

Does not require recompiling application with -pg

Can profile multiple processes and the kernel all at
the same time

Older Enterprise distributions will only have
oprofile; perf is the new hotness

Demo time!
97
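
For example, a system-wide sketch with perf (the 10-second window is arbitrary):

    perf record -a -g -- sleep 10    # sample all CPUs, with call graphs, for 10 seconds
    perf report                      # interactive breakdown by symbol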

Debug Information


Many tracing/debugging tools require
debugging information (gdb, perf, oprofile,
systemtap, eBPF, etc.)

Problem: debugging information is large

With debugging info: powertop is 6,796k

Without debugging info: powertop is only 484k

Solution: split out the debugging info into an
optional file

Debuginfo file named by the build-id – a SHA1 hash
of the text/data segments – which is also stored
in the executable.
98
Installing debuginfo packages


Fedora / RHEL

Named <pkg>-debuginfo

Use “debuginfo-install <pkg>” to install

Debian (in Debian stretch+)

Named <pkg>-dbgsym (or legacy: <pkg>-dbg)

Add debug.mirrors.debian.org to
/etc/apt/sources.list

Ubuntu

Named <pkg>-dbgsym.ddeb (or <pkg>-dbg)

Add ddeb.ubuntu.com to /etc/apt/sources.list
99

Userspace Locking


One other application issue which can be a very
big deal: userspace locking

Rip out fancy multi-level locking (e.g., user-
space spinlocks, sched_yield() calls, etc.)

Just use pthread mutexes, and be happy

Linux implements pthread mutexes using the
futex(2) system call. Avoids kernel context switch
except in the contended case

The fast path really is fast! (So no need for
fancy/complex multi-level locking – just rip it out)

Common pitfall when porting apps from Solaris
100
Processor Affinity


Rarely a good idea... but can be used to
improve response time for critical tasks

Set CPU affinity for tasks using taskset(1)

Set CPU affinity for interrupt handlers using
/proc/irq/<nn>/smp_affinity

Strategies

Put producer/consumer processes on the same
CPU

Move interrupt handlers to a different CPU

Use mpstat(1) and /proc/interrupts to get
processor-related statistics
101
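
A sketch (the PID, CPU numbers, and IRQ number are hypothetical):

    taskset -c 2 ./critical_task          # start a task pinned to CPU 2
    taskset -p -c 2,3 1234                # move existing PID 1234 onto CPUs 2-3
    echo 4 > /proc/irq/30/smp_affinity    # route IRQ 30 to CPU 2 (bitmask 0x4)
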
Latency Numbers All
Programmers Should Know

L1 cache reference 0.5 ns

Branch mispredict 5 ns

L2 cache reference 7 ns

Mutex lock/unlock 25 ns

Main memory reference 100 ns

Compress 1k bytes using snappy 3,000 ns = 3 us

Send 2k over 1 Gbps network 20,000 ns = 20 us

SSD Random Read 150,000 ns = 150 us

Read 1MB from main memory 250,000 ns = 250 us

Round-trip in same datacenter 500,000 ns = 500 us

Read 1MB sequentially from SSD 1,000,000 ns = 1 ms

Disk Seek 10,000,000 ns = 10 ms

Read 1MB sequentially from HDD 20,000,000 ns = 20 ms

Send packet CA → Netherlands → CA 150,000,000 ns = 150 ms
102
Conclusion


Performance tuning is fractal

There's always more to tweak

“It's more addictive than pistachios!”

Understanding when to stop

Great way of learning more up and down the
technology stack – from the CPU chip up
through the OS to application tuning

103
