Linux Performance Tools (LinuxCon NA) - Brendan Gregg
[email protected]
@brendangregg
A quick tour of many tools…
• Massive AWS EC2 Linux cloud
  – Tens of thousands of instances
  – Autoscale by ~3k each day
  – CentOS and Ubuntu
• FreeBSD for content delivery
  – Approx 33% of US Internet traffic at night
• Performance is critical
  – Customer satisfaction: >50M subscribers
  – $$$ price/performance
  – Develop tools for cloud-wide analysis; use server tools as needed
• Just launched in Europe!
Brendan Gregg
• Senior Performance Architect, Netflix
  – Linux and FreeBSD performance
  – Performance Engineering team (@coburnw)
• Recent work:
  – Linux perf-tools, using ftrace & perf_events
  – Systems Performance, Prentice Hall
• Previous work includes:
  – USE Method, flame graphs, utilization & latency heat maps, DTrace tools, ZFS L2ARC
• Twitter @brendangregg (these slides)
Agenda
• Methodologies & Tools
• Tool Types:
  – Observability
    • Basic
    • Intermediate
    • Advanced
  – Benchmarking
  – Tuning
  – Static
• Tracing
Aim: to show what can be done

Knowing that something can be done is more important than knowing how to do it.
Methodologies & Tools
• There are dozens of performance tools for Linux
  – Packages: sysstat, procps, coreutils, …
  – Commercial products
• Methodologies can provide guidance for choosing and using tools effectively
Anti-Methodologies
• The lack of a deliberate methodology…
• Street Light Anti-Method:
  – 1. Pick observability tools that are:
    • Familiar
    • Found on the Internet, or at random
  – 2. Run tools
  – 3. Look for obvious issues
• Drunk Man Anti-Method:
  – Tune things at random until the problem goes away
Methodologies
• For example, the USE Method:
  – For every resource, check:
    • Utilization
    • Saturation
    • Errors
• 5 Whys:
  – Ask “why?” 5 times
• Other methods include:
  – Workload characterization, drill-down analysis, event tracing, baseline stats, static performance tuning, …
• Start with the questions, then find the tools
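A USE-method check can often be scripted straight from kernel counters. A minimal sketch of one such check, CPU utilization from two samples of /proc/stat (the jiffy values below are made up for illustration; on a live system you would read the first line of /proc/stat twice, one second apart):

```shell
# USE-method sketch: CPU utilization from two "cpu" lines of /proc/stat.
# Sample counters are made up; substitute live reads in practice.
s0="cpu 1000 0 500 8500"    # user nice system idle, at t0
s1="cpu 1400 0 700 8900"    # the same counters, at t1
util=$(printf '%s\n%s\n' "$s0" "$s1" | awk '
    NR == 1 { b0 = $2 + $3 + $4; t0 = b0 + $5 }    # busy and total at t0
    NR == 2 { b1 = $2 + $3 + $4; t1 = b1 + $5
              printf "%.0f", 100 * (b1 - b0) / (t1 - t0) }')
echo "CPU utilization: ${util}%"    # 60% for these sample counters
```

Saturation and errors need other sources (eg, run-queue length from vmstat, error counters from /sys), but the pattern is the same: difference two counter samples.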
Command Line Tools
• Useful to study even if you never use them: GUIs and commercial products often use the same interfaces

Kernel: /proc, /sys, …
$ vmstat 1
procs -----------memory---------- ---swap-- …
 r b swpd free buff cache si so …
 9 0 0 29549320 29252 9299060 0 …
 2 0 0 29547876 29252 9299332 0 …
 4 0 0 29548124 29252 9299460 0 …
 5 0 0 29548840 29252 9299592 0 …
Tool Types

Type           Characteristic
Observability  Watch activity. Safe, usually, depending on resource overhead.
Benchmarking   Load test. Caution: production tests can cause issues due to contention.
Tuning         Change. Danger: changes could hurt performance, now or later with load.
Static         Check configuration. Should be safe.
Observability Tools

How do you measure these?
Observability Tools: Basic
• uptime
• top (or htop)
• ps
• vmstat
• iostat
• mpstat
• free
• One
way
to
print
load
averages:
$ uptime!
07:42:06 up 8:16, 1 user, load average: 2.27, 2.84, 2.91!
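The three numbers are 1-, 5-, and 15-minute exponentially damped averages, so comparing the first and last shows the trend. A small sketch, parsing the captured line above rather than live output (replace `line` with `"$(uptime)"` on a real system):

```shell
# Sketch: is load rising or falling? Compare the 1- and 15-minute averages.
line='07:42:06 up 8:16, 1 user, load average: 2.27, 2.84, 2.91'
avgs=${line#*load average: }                       # "2.27, 2.84, 2.91"
one=$(echo "$avgs" | cut -d, -f1 | tr -d ' ')
fifteen=$(echo "$avgs" | cut -d, -f3 | tr -d ' ')
trend=$(awk -v a="$one" -v b="$fifteen" \
    'BEGIN { if (a + 0 > b + 0) print "rising"; else print "falling or flat" }')
echo "1 min: $one, 15 min: $fifteen -> $trend"
```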
ps
• Custom fields:

$ ps -eo user,sz,rss,minflt,majflt,pcpu,args
USER SZ RSS MINFLT MAJFLT %CPU COMMAND
root 6085 2272 11928 24 0.0 /sbin/init
[…]
vmstat
• Virtual memory statistics and more:

$ vmstat -Sm 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r b swpd free buff cache si so bi bo in cs us sy id wa
 8 0 0 1620 149 552 0 0 1 179 77 12 25 34 0 0
 7 0 0 1598 149 552 0 0 0 0 205 186 46 13 0 0
 8 0 0 1617 149 552 0 0 0 8 210 435 39 21 0 0
 8 0 0 1589 149 552 0 0 0 0 218 219 42 17 0 0
[…]
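The `r` column (runnable plus running threads) is the CPU saturation signal: values persistently above the CPU count indicate saturation. A hedged sketch using made-up sample lines and an assumed `ncpu=8` (live use would pipe `vmstat 1` in and take the CPU count from `nproc`):

```shell
# Sketch: count vmstat samples whose run-queue length (column r)
# exceeds the CPU count. Sample lines and ncpu are illustrative.
ncpu=8
samples='9 0 0 1620 149 552 0 0 1 179 77 12 25 34 0 0
7 0 0 1598 149 552 0 0 0 0 205 186 46 13 0 0
12 0 0 1617 149 552 0 0 0 8 210 435 39 21 0 0'
saturated=$(echo "$samples" | awk -v n="$ncpu" \
    '$1 + 0 > n + 0 { c++ } END { print c + 0 }')
echo "$saturated of 3 samples have run queue > $ncpu CPUs"
```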
iostat
• Block I/O (disk) statistics; a very useful set of stats:

... \ avgqu-sz await r_await w_await svctm %util
... /     0.00   0.00    0.00    0.00  0.00  0.00
... \   126.09   8.22    8.22    0.00  0.06 86.40
... /    99.31   6.47    6.47    0.00  0.06 86.00
... \     0.00   0.00    0.00    0.00  0.00  0.00

(left columns describe the workload; right columns the resulting performance)
mpstat
• Multi-processor statistics, per-CPU:

$ mpstat -P ALL 1
[…]
08:06:43 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
08:06:44 PM all 53.45 0.00 3.77 0.00 0.00 0.39 0.13 0.00 42.26
08:06:44 PM 0 49.49 0.00 3.03 0.00 0.00 1.01 1.01 0.00 45.45
08:06:44 PM 1 51.61 0.00 4.30 0.00 0.00 2.15 0.00 0.00 41.94
08:06:44 PM 2 58.16 0.00 7.14 0.00 0.00 0.00 1.02 0.00 33.67
08:06:44 PM 3 54.55 0.00 5.05 0.00 0.00 0.00 0.00 0.00 40.40
08:06:44 PM 4 47.42 0.00 3.09 0.00 0.00 0.00 0.00 0.00 49.48
08:06:44 PM 5 65.66 0.00 3.03 0.00 0.00 0.00 0.00 0.00 31.31
08:06:44 PM 6 50.00 0.00 2.08 0.00 0.00 0.00 0.00 0.00 47.92
[…]
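One thing worth scanning this output for is imbalance: a single hot CPU often means a single-threaded bottleneck. A sketch that picks the busiest CPU (%usr + %sys) from a few of the per-CPU lines (live use would pipe `mpstat -P ALL 1 1` through the same awk):

```shell
# Sketch: find the busiest CPU from mpstat -P ALL per-CPU lines
# (fields: time AM/PM cpu %usr %nice %sys ...). Sample lines only.
samples='08:06:44 PM 0 49.49 0.00 3.03 0.00 0.00 1.01 1.01 0.00 45.45
08:06:44 PM 2 58.16 0.00 7.14 0.00 0.00 0.00 1.02 0.00 33.67
08:06:44 PM 5 65.66 0.00 3.03 0.00 0.00 0.00 0.00 0.00 31.31'
hottest=$(echo "$samples" | awk '
    { busy = $4 + $6                           # %usr + %sys
      if (busy > max) { max = busy; cpu = $3 } }
    END { printf "CPU%s %.2f%%", cpu, max }')
echo "busiest: $hottest"
```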
free
• Main memory usage (output fragment):

-/+ buffers/cache:    436   3313
Swap:                   0      0      0
strace
• Eg, -ttt: time (us) since epoch; -T: syscall time (s)
• Translates syscall args
  – Very helpful for solving system usage issues
• Currently has massive overhead (ptrace based)
  – Can slow the target by > 100x. Use extreme caution.
tcpdump
• Sniff network packets for post analysis:

$ tcpdump -i eth0 -w /tmp/out.tcpdump
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
^C7985 packets captured
8996 packets received by filter
1010 packets dropped by kernel
# tcpdump -nr /tmp/out.tcpdump | head
reading from file /tmp/out.tcpdump, link-type EN10MB (Ethernet)
20:41:05.038437 IP 10.44.107.151.22 > 10.53.237.72.46425: Flags [P.], seq 18...
20:41:05.038533 IP 10.44.107.151.22 > 10.53.237.72.46425: Flags [P.], seq 48...
20:41:05.038584 IP 10.44.107.151.22 > 10.53.237.72.46425: Flags [P.], seq 96...
[…]
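After `tcpdump -nr`, the capture is plain text, so standard tools can post-process it. A sketch counting packets per source address (the sample lines mimic the output above; a live pipeline would be `tcpdump -nr /tmp/out.tcpdump | awk …`):

```shell
# Sketch: top source address by packet count from tcpdump -nr output.
# Sample lines modeled on the capture shown on the slide.
samples='20:41:05.038437 IP 10.44.107.151.22 > 10.53.237.72.46425: Flags [P.], seq 18
20:41:05.038533 IP 10.44.107.151.22 > 10.53.237.72.46425: Flags [P.], seq 48
20:41:05.039001 IP 10.53.237.72.46425 > 10.44.107.151.22: Flags [.], ack 96'
top=$(echo "$samples" | awk '
    $2 == "IP" { count[$3]++ }                 # $3 is the source addr.port
    END { for (s in count) if (count[s] > max) { max = count[s]; src = s }
          print src, max }')
echo "top source: $top"
```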
swap status (swapon -s) output fragment:
Filename   Type       Size     Used  Priority
/dev/sda3  partition  5245212  284   -1

perf list output fragment:
rNNN (see 'perf list --help' on how to encode it) [Raw hardware event …]
mem:<addr>[:access] [Hardware breakpoint]
(Flame graph annotations: kernel TCP/IP; GC; locks; epoll; idle thread; broken Java stacks, missing frame pointer)
tiptop / MSR tools
• Reading CPU performance counters and MSRs (eg, real CPU MHz, temperature):

TIME      C0_MCYC     C0_ACYC     UTIL  RATIO  MHz
06:11:35  6428553166  7457384521  51%   116%   2900
06:11:40  6349881107  7365764152  50%   115%   2899
06:11:45  6240610655  7239046277  49%   115%   2899    ← real CPU MHz
[...]

ec2-guest# ./cputemp 1
CPU1 CPU2 CPU3 CPU4
61 61 60 59
60 61 60 60    ← CPU temperature
[...]
More Advanced Tools…
• Some others worth mentioning:

Tool       Description
ltrace     Library call tracer
ethtool    Mostly interface tuning; some stats
snmpget    SNMP network host statistics
lldptool   Can get LLDP broadcast stats
blktrace   Block I/O event tracer
/proc      Many raw kernel counters
pmu-tools  On- and off-core CPU counter tools
Advanced Tracers
• Many options on Linux:
  – perf_events, ftrace, eBPF, SystemTap, ktap, LTTng, dtrace4linux, sysdig
• Most can do static and dynamic tracing
  – Static: pre-defined events (tracepoints)
  – Dynamic: instrument any software (kprobes, uprobes). Custom metrics on-demand. Catch all.
• Many are in development.
  – I’ll summarize their state later…
Linux Observability Tools
Benchmarking Tools
• Multi:
  – UnixBench, lmbench, sysbench, perf bench
• FS/disk:
  – dd, hdparm, fio
• App/lib:
  – ab, wrk, jmeter, openssl
• Networking:
  – ping, hping3, iperf, ttcp, traceroute, mtr, pchar
Active Benchmarking
• Most benchmarks are misleading or wrong
  – You benchmark A, but actually measure B, and conclude that you measured C
• Active Benchmarking:
  1. Run the benchmark for hours
  2. While running, analyze and confirm the performance limiter using observability tools
• We just covered those tools – use them!
lmbench
• CPU, memory, and kernel micro-benchmarks
• Eg, memory latency by stride size:

$ lat_mem_rd 100m 128 > out.latencies
some R processing…

(plot shows latency steps at the L1, L2, and L3 cache sizes, then main memory)
fio
• FS or disk I/O micro-benchmarks:

$ fio --name=seqwrite --rw=write --bs=128k --size=122374m
[…]
seqwrite: (groupid=0, jobs=1): err= 0: pid=22321
  write: io=122374MB, bw=840951KB/s, iops=6569 , runt=149011msec
    clat (usec): min=41 , max=133186 , avg=148.26, stdev=1287.17
     lat (usec): min=44 , max=133188 , avg=151.11, stdev=1287.21
    bw (KB/s) : min=10746, max=1983488, per=100.18%, avg=842503.94, stdev=262774.35
  cpu : usr=2.67%, sys=43.46%, ctx=14284, majf=1, minf=24
  IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w/d: total=0/978992/0, short=0/0/0
     lat (usec): 50=0.02%, 100=98.30%, 250=1.06%, 500=0.01%, 750=0.01%
     lat (usec): 1000=0.01%
     lat (msec): 2=0.01%, 4=0.01%, 10=0.25%, 20=0.29%, 50=0.06%
     lat (msec): 100=0.01%, 250=0.01%
iosnoop
• Block I/O (disk) events, with latency:

# ./iosnoop -h
USAGE: iosnoop [-hQst] [-d device] [-i iotype] [-p PID] [-n name] [duration]
       -d device  # device string (eg, "202,1")
       -i iotype  # match type (eg, '*R*' for all reads)
       -n name    # process name to match on I/O issue
       -p PID     # PID to match on I/O issue
       -Q         # include queueing time in LATms
       -s         # include start time of I/O (s)
       -t         # include completion time of I/O (s)
       -h         # this usage message
       duration   # duration seconds, and use buffers
[…]
iolatency
• Block I/O (disk) latency distributions:

# ./iolatency
Tracing block I/O. Output every 1 seconds. Ctrl-C to end.

  >=(ms) .. <(ms) : I/O |Distribution |
       0 -> 1 : 2104 |######################################|
       1 -> 2 : 280 |###### |
       2 -> 4 : 2 |# |
       4 -> 8 : 0 | |
       8 -> 16 : 202 |#### |

  >=(ms) .. <(ms) : I/O |Distribution |
       0 -> 1 : 1144 |######################################|
       1 -> 2 : 267 |######### |
       2 -> 4 : 10 |# |
       4 -> 8 : 5 |# |
       8 -> 16 : 248 |######### |
      16 -> 32 : 601 |#################### |
      32 -> 64 : 117 |#### |
[…]
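These power-of-two buckets are easy to reproduce for any list of latencies. A sketch that folds made-up millisecond values into the same >= .. < buckets with awk:

```shell
# Sketch: fold latencies (ms, made-up values) into power-of-two buckets,
# mirroring iolatency's distribution output.
lats='0.4 0.9 1.2 3.7 9 12 0.2'
hist=$(echo "$lats" | tr ' ' '\n' | awk '
    { b = 1; while ($1 >= b) b *= 2; c[b]++ }    # b = bucket upper bound
    END { for (i = 1; i <= 16; i *= 2)
              printf "%4d -> %-4d: %d\n", i / 2, i, c[i] + 0 }')
echo "$hist"
```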
opensnoop
• Trace open() syscalls showing filenames:

# ./opensnoop -t
Tracing open()s. Ctrl-C to end.
TIMEs COMM PID FD FILE
4345768.332626 postgres 23886 0x8 /proc/self/oom_adj
4345768.333923 postgres 23886 0x5 global/pg_filenode.map
4345768.333971 postgres 23886 0x5 global/pg_internal.init
4345768.334813 postgres 23886 0x5 base/16384/PG_VERSION
4345768.334877 postgres 23886 0x5 base/16384/pg_filenode.map
4345768.334891 postgres 23886 0x5 base/16384/pg_internal.init
4345768.335821 postgres 23886 0x5 base/16384/11725
4345768.347911 svstat 24649 0x4 supervise/ok
4345768.347921 svstat 24649 0x4 supervise/status
4345768.350340 stat 24651 0x3 /etc/ld.so.cache
4345768.350372 stat 24651 0x3 /lib/x86_64-linux-gnu/libselinux…
4345768.350460 stat 24651 0x3 /lib/x86_64-linux-gnu/libc.so.6
4345768.350526 stat 24651 0x3 /lib/x86_64-linux-gnu/libdl.so.2
4345768.350981 stat 24651 0x3 /proc/filesystems
4345768.351182 stat 24651 0x3 /etc/nsswitch.conf
[…]
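Like the other tracer output, this is plain text and easy to summarize. A sketch counting opens per process name (sample lines taken from the output above; live use would be `./opensnoop -t | awk …`):

```shell
# Sketch: opens per process from opensnoop -t output
# (fields: TIMEs COMM PID FD FILE).
samples='4345768.332626 postgres 23886 0x8 /proc/self/oom_adj
4345768.333923 postgres 23886 0x5 global/pg_filenode.map
4345768.347911 svstat 24649 0x4 supervise/ok
4345768.350340 stat 24651 0x3 /etc/ld.so.cache'
counts=$(echo "$samples" \
    | awk '{ c[$2]++ } END { for (p in c) print c[p], p }' | sort -rn)
echo "$counts"
```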
funcgraph
• Trace a graph of kernel code flow:

# ./funcgraph -Htp 5363 vfs_read
Tracing "vfs_read" for PID 5363... Ctrl-C to end.
# tracer: function_graph
#
# TIME CPU DURATION FUNCTION CALLS
# | | | | | | | |
4346366.073832 | 0) | vfs_read() {
4346366.073834 | 0) | rw_verify_area() {
4346366.073834 | 0) | security_file_permission() {
4346366.073834 | 0) | apparmor_file_permission() {
4346366.073835 | 0) 0.153 us | common_file_perm();
4346366.073836 | 0) 0.947 us | }
4346366.073836 | 0) 0.066 us | __fsnotify_parent();
4346366.073836 | 0) 0.080 us | fsnotify();
4346366.073837 | 0) 2.174 us | }
4346366.073837 | 0) 2.656 us | }
4346366.073837 | 0) | tty_read() {
4346366.073837 | 0) 0.060 us | tty_paranoia_check();
[…]
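The DURATION column sums nicely for quick accounting. A sketch totaling the leaf-function times from a few captured lines (sample lines from the output above; leaf calls carry their own "N.NNN us" field):

```shell
# Sketch: total the leaf-function durations (us) from funcgraph output.
samples='4346366.073835 | 0) 0.153 us | common_file_perm();
4346366.073836 | 0) 0.066 us | __fsnotify_parent();
4346366.073836 | 0) 0.080 us | fsnotify();'
total=$(echo "$samples" | awk '$5 == "us" { t += $4 } END { printf "%.3f", t }')
echo "leaf time: ${total} us"    # 0.299 us for these samples
```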
kprobe
• Dynamically trace a kernel function call or return, with variables, and in-kernel filtering:

# ./kprobe 'p:open do_sys_open filename=+0(%si):string' 'filename ~ "*stat"'
Tracing kprobe myopen. Ctrl-C to end.
postgres-1172 [000] d... 6594028.787166: open: (do_sys_open+0x0/0x220) filename="pg_stat_tmp/pgstat.stat"
postgres-1172 [001] d... 6594028.797410: open: (do_sys_open+0x0/0x220) filename="pg_stat_tmp/pgstat.stat"
postgres-1172 [001] d... 6594028.797467: open: (do_sys_open+0x0/0x220) filename="pg_stat_tmp/pgstat.stat"
^C
Ending tracing...

• Add -s for stack traces; -p for PID filter in-kernel.
• Quickly confirm kernel behavior; eg: did a tunable take effect?
Imagine Linux with Tracing
• These tools aren’t using dtrace4linux, SystemTap, ktap, or any other add-on tracer
• These tools use existing Linux capabilities
  – No extra kernel bits, not even kernel debuginfo
  – Just Linux’s built-in ftrace profiler
  – Demoed on Linux 3.2
• Solving real issues now
ftrace
• Added by Steven Rostedt and others since 2.6.27
• Already enabled on our servers (3.2+)
  – CONFIG_FTRACE, CONFIG_FUNCTION_PROFILER, …
  – Use directly via /sys/kernel/debug/tracing
• My front-end tools to aid usage
  – https://github.com/brendangregg/perf-tools
  – Unsupported hacks: see WARNINGs
  – Also see the trace-cmd front-end, as well as perf
• lwn.net: “Ftrace: The Hidden Light Switch”
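That /sys/kernel/debug/tracing interface can be driven directly with echo and cat. A hedged sketch of the function profiler (requires root and CONFIG_FUNCTION_PROFILER; when run unprivileged it just reports what is missing rather than tracing):

```shell
# Sketch: kernel function call counts via ftrace's function profiler,
# using the tracefs files directly -- no add-on tracer required.
T=/sys/kernel/debug/tracing
if [ -w "$T/function_profile_enabled" ]; then
    echo 1 > "$T/function_profile_enabled"    # start profiling
    sleep 5
    echo 0 > "$T/function_profile_enabled"    # stop
    head "$T/trace_stat/function0"            # call counts for CPU 0
    status="profiled"
else
    status="need root (and CONFIG_FUNCTION_PROFILER) to write $T"
fi
echo "$status"
```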
My perf-tools (so far…)
Tracing Summary
• ftrace
• perf_events
• eBPF
• SystemTap
• ktap
• LTTng
• dtrace4linux
• sysdig
perf_events
• aka “perf” command
• In Linux. Add from linux-tools-common, …
• Powerful multi-tool and profiler
  – interval sampling, CPU performance counter events
  – user and kernel dynamic tracing
  – kernel line tracing and local variables (debuginfo)
  – kernel filtering, and in-kernel counts (perf stat)
• Not very programmable, yet
  – limited kernel summaries. May improve with eBPF.
perf_events Example

# perf record -e skb:consume_skb -ag
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.065 MB perf.data (~2851 samples) ]
# perf report
[...]
74.42% swapper [kernel.kallsyms] [k] consume_skb
        |
        --- consume_skb
            arp_process
            arp_rcv
            __netif_receive_skb_core
            __netif_receive_skb
            netif_receive_skb
            virtnet_poll
            net_rx_action
            __do_softirq
            irq_exit
            do_IRQ
            ret_from_intr
            default_idle
            cpu_idle
            start_secondary
[…]

This summarizes stack traces for a tracepoint. perf_events can do many things – hard to pick just one example.
eBPF
• Extended BPF: programs on tracepoints
  – High performance filtering: JIT
  – In-kernel summaries: maps
• Linux in 3.18? Enhance perf_events/ftrace/…?

# ./bitesize 1
writing bpf-5 -> /sys/kernel/debug/tracing/events/block/block_rq_complete/filter

I/O sizes:
   Kbytes     : Count
   4 -> 7     : 131
   8 -> 15    : 32
   16 -> 31   : 1
   32 -> 63   : 46
   64 -> 127  : 0
   128 -> 255 : 15
[…]

(the histogram is an in-kernel summary)
SystemTap
• Fully programmable, fully featured
• Compiles tracing programs into kernel modules
  – Needs a compiler, and takes time
• “Works great on Red Hat”
  – I keep trying on other distros and have hit trouble in the past; make sure you are on the latest version.
  – I’m liking it a bit more after finding ways to use it without kernel debuginfo (a difficult requirement in our environment). Work in progress.
• Ever be mainline?
ktap
• Sampling, static & dynamic tracing
• Lightweight, simple. Uses bytecode.
• Suited for embedded devices
• Development appears suspended after suggestions to integrate with eBPF (which itself is in development)
• ktap + eBPF would be awesome: easy, lightweight, fast. Likely?
sysdig
• sysdig: Innovative new tracer. Simple expressions:

sysdig fd.type=file and evt.failed=true
sysdig evt.type=open and fd.name contains /etc
sysdig -p"%proc.name %fd.name" "evt.type=accept and proc.name!=httpd"
(Chart: tracers plotted by ease of use vs scope & capability, annotated with stage of development — perf and ftrace (mature); stap; dtrace4L., ktap, sysdig (alpha); eBPF (brutal))
In Summary…
• Plus diagrams for benchmarking, tuning, tracing
• Try to start with the questions (methodology), to help guide your use of the tools
• I hopefully turned some unknown unknowns into known unknowns
References & Links
– Systems Performance: Enterprise and the Cloud, Prentice Hall, 2014
– http://www.brendangregg.com/linuxperf.html
– http://www.brendangregg.com/perf.html#FlameGraphs
– nicstat: http://sourceforge.net/projects/nicstat/
– tiptop: http://tiptop.gforge.inria.fr/
  • Tiptop: Hardware Performance Counters for the Masses, Erven Rohou, Inria Research Report 7789, Nov 2011.
– ftrace & perf-tools
  • https://github.com/brendangregg/perf-tools
  • http://lwn.net/Articles/608497/
– MSR tools: https://github.com/brendangregg/msr-cloud-tools
– pcstat: https://github.com/tobert/pcstat
– eBPF: http://lwn.net/Articles/603983/
– ktap: http://www.ktap.org/
– SystemTap: https://sourceware.org/systemtap/
– sysdig: http://www.sysdig.org/
– http://www.slideshare.net/brendangregg/linux-performance-analysis-and-tools
– Tux by Larry Ewing; Linux® is the registered trademark of Linus Torvalds in the U.S. and other countries.
Thanks
• Questions?
• http://slideshare.net/brendangregg
• http://www.brendangregg.com
[email protected]
• @brendangregg