Linux Performance 2018: Brendan Gregg
2018
Brendan Gregg
Senior Performance Architect
Oct 2018
http://neuling.org/linux-next-size.html
Post frequency:
Application
(retpoline)
Server A: 31353 MySQL queries/sec
serverA# mpstat 1
Linux 4.14.12-virtual (bgregg-c5.9xl-i-xxx) 02/09/2018 _x86_64_ (36 CPU)
01:09:13 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:09:14 AM  all   86.89    0.00   13.08    0.00    0.00    0.00    0.00    0.00    0.00    0.03
01:09:15 AM  all   86.77    0.00   13.23    0.00    0.00    0.00    0.00    0.00    0.00    0.00
01:09:16 AM  all   86.93    0.00   13.02    0.00    0.00    0.00    0.03    0.00    0.00    0.03
[...]
[Diagram: the CPU sends a virtual address to the MMU, which checks the TLB. A hit returns the physical address directly; a miss triggers a page table walk in main memory.]
Server A: TLB miss walks 3.5%
serverA# ./tlbstat 1
K_CYCLES   K_INSTR   IPC   DTLB_WALKS ITLB_WALKS K_DTLBCYC  K_ITLBCYC  DTLB%  ITLB%
95913667   99982399  1.04  86588626   115441706  1507279    1837217    1.57   1.92
95810170   99951362  1.04  86281319   115306404  1507472    1842313    1.57   1.92
95844079   100066236 1.04  86564448   115555259  1511158    1845661    1.58   1.93
95978588   100029077 1.04  86187531   115292395  1508524    1845525    1.57   1.92
[...]
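The 3.5% headline figure appears to be the sum of the DTLB% and ITLB% columns. A minimal sketch that reproduces the first row, assuming K_DTLBCYC and K_ITLBCYC count cycles with an active page walk in the same units as K_CYCLES (an assumption drawn from the column names; the function name is illustrative and not part of tlbstat):

```python
def tlb_walk_percentages(k_cycles, k_instr, k_dtlbcyc, k_itlbcyc):
    """Derive IPC and the share of cycles spent in TLB miss walks."""
    ipc = k_instr / k_cycles
    dtlb_pct = 100.0 * k_dtlbcyc / k_cycles   # data TLB walk cycles
    itlb_pct = 100.0 * k_itlbcyc / k_cycles   # instruction TLB walk cycles
    return ipc, dtlb_pct, itlb_pct


# First sample row from the tlbstat output above.
ipc, dtlb_pct, itlb_pct = tlb_walk_percentages(95913667, 99982399, 1507279, 1837217)
total_pct = dtlb_pct + itlb_pct
print(f"IPC {ipc:.2f}  DTLB% {dtlb_pct:.2f}  ITLB% {itlb_pct:.2f}  total {total_pct:.1f}%")
```

The derived values agree with the printed IPC, DTLB% and ITLB% columns for that row, and the total rounds to 3.5%.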
[Diagram: BPF use cases, runtime, and event targets]
Use cases: SDN Configuration, DDoS Mitigation, Intrusion Detection, Container Security, Observability, Firewalls (bpfilter), Device Drivers, …
Runtime: verifier, BPF, BPF actions
Event targets: sockets, kprobes, uprobes, tracepoints, perf_events
eBPF is solving new things: off-CPU + wakeup analysis
eBPF bcc Linux 4.4+
https://github.com/iovisor/bcc
e.g., identify multimodal disk I/O latency and outliers
with bcc/eBPF biolatency
# biolatency -mT 10
Tracing block device I/O... Hit Ctrl-C to end.
19:19:04
msecs : count distribution
0 -> 1 : 238 |********* |
2 -> 3 : 424 |***************** |
4 -> 7 : 834 |********************************* |
8 -> 15 : 506 |******************** |
16 -> 31 : 986 |****************************************|
32 -> 63 : 97 |*** |
64 -> 127 : 7 | |
128 -> 255 : 27 |* |
19:19:14
msecs : count distribution
0 -> 1 : 427 |******************* |
2 -> 3 : 424 |****************** |
[…]
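The rows above are power-of-two buckets, which is what makes the two modes (4 -> 7 ms and 16 -> 31 ms) and the 128 -> 255 ms outliers visible at a glance. A rough sketch of that bucketing in plain Python, run on made-up latency samples; it mimics the output layout only and is not bcc's implementation:

```python
from collections import Counter


def log2_bucket(value):
    """Index of the power-of-two bucket holding a non-negative integer."""
    return max(value.bit_length() - 1, 0)


def bucket_range(index):
    """Inclusive (low, high) bounds of a bucket: 0 -> 1, 2 -> 3, 4 -> 7, ..."""
    low = 0 if index == 0 else 1 << index
    high = (1 << (index + 1)) - 1
    return low, high


def print_log2_hist(samples, label="msecs", width=40):
    """Print an ASCII histogram with bars scaled to the fullest bucket."""
    counts = Counter(log2_bucket(s) for s in samples)
    if not counts:
        return
    peak = max(counts.values())
    print(f"{label:>10} : count  distribution")
    for index in range(max(counts) + 1):
        low, high = bucket_range(index)
        count = counts.get(index, 0)
        bar = "*" * (width * count // peak)
        print(f"{low:>4} -> {high:<4} : {count:<6}|{bar:<{width}}|")


# Hypothetical latencies in milliseconds: a fast mode, a slow mode, one outlier.
samples = [1, 1, 2, 3, 5, 6, 6, 7, 18, 20, 22, 25, 30, 200]
print_log2_hist(samples)
```

An average over the same samples would hide both the second mode and the outlier, which is the argument for a histogram here.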
bcc/eBPF programs are laborious: biolatency
# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/blkdev.h>

typedef struct disk_key {
    char disk[DISK_NAME_LEN];
    u64 slot;
} disk_key_t;
BPF_HASH(start, struct request *);
STORAGE

// time block I/O
int trace_req_start(struct pt_regs *ctx, struct request *req)
{
    u64 ts = bpf_ktime_get_ns();
    start.update(&req, &ts);
    return 0;
}

// output
int trace_req_completion(struct pt_regs *ctx, struct request *req)
{
    u64 *tsp, delta;

    // fetch timestamp and calculate delta
    tsp = start.lookup(&req);
    if (tsp == 0) {
        return 0;   // missed issue
    }
    delta = bpf_ktime_get_ns() - *tsp;
    FACTOR

    // store as histogram
    STORE

    start.delete(&req);
    return 0;
}
"""

# code substitutions
if args.milliseconds:
    bpf_text = bpf_text.replace('FACTOR', 'delta /= 1000000;')
    label = "msecs"
else:
    bpf_text = bpf_text.replace('FACTOR', 'delta /= 1000;')
    label = "usecs"
if args.disks:
    bpf_text = bpf_text.replace('STORAGE',
        'BPF_HISTOGRAM(dist, disk_key_t);')
    bpf_text = bpf_text.replace('STORE',
        'disk_key_t key = {.slot = bpf_log2l(delta)}; ' +
        'void *__tmp = (void *)req->rq_disk->disk_name; ' +
        'bpf_probe_read(&key.disk, sizeof(key.disk), __tmp); ' +
        'dist.increment(key);')
else:
    bpf_text = bpf_text.replace('STORAGE', 'BPF_HISTOGRAM(dist);')
    bpf_text = bpf_text.replace('STORE',
        'dist.increment(bpf_log2l(delta));')
if debug or args.ebpf:
    print(bpf_text)
    if args.ebpf:
        exit()

# load BPF program
b = BPF(text=bpf_text)
if args.queued:
    b.attach_kprobe(event="blk_account_io_start", fn_name="trace_req_start")
else:
    b.attach_kprobe(event="blk_start_request", fn_name="trace_req_start")
    b.attach_kprobe(event="blk_mq_start_request", fn_name="trace_req_start")
b.attach_kprobe(event="blk_account_io_completion",
    fn_name="trace_req_completion")

print("Tracing block device I/O... Hit Ctrl-C to end.")

# output
exiting = 0 if args.interval else 1
dist = b.get_table("dist")
while (1):
    try:
        sleep(int(args.interval))
    except KeyboardInterrupt:
        exiting = 1

    print()
    if args.timestamp:
        print("%-8s\n" % strftime("%H:%M:%S"), end="")

    dist.print_log2_hist(label, "disk")
    dist.clear()

    countdown -= 1
    if exiting or countdown == 0:
        exit()
… rewritten in bpftrace (launched Oct 2018)!
#!/usr/local/bin/bpftrace

BEGIN
{
    printf("Tracing block device I/O... Hit Ctrl-C to end.\n");
}

kprobe:blk_account_io_start
{
    @start[arg0] = nsecs;
}

kprobe:blk_account_io_completion
/@start[arg0]/
{
    @usecs = hist((nsecs - @start[arg0]) / 1000);
    delete(@start[arg0]);
}
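Both the bcc and bpftrace versions share one pattern: save a timestamp keyed by the request on issue, then look it up on completion and record the difference. A plain-Python sketch of that pairing logic, assuming a made-up event format of (kind, request_id, timestamp_ns) tuples purely for illustration:

```python
def latencies_usecs(events):
    """Pair start/completion events by request ID; return latencies in microseconds."""
    start = {}        # plays the role of the @start map
    latencies = []
    for kind, request_id, ts_ns in events:
        if kind == "start":
            start[request_id] = ts_ns
        elif kind == "completion" and request_id in start:
            # Similar in spirit to the /@start[arg0]/ filter: completions
            # whose start was never seen are skipped.
            latencies.append((ts_ns - start.pop(request_id)) // 1000)
    return latencies


# Hypothetical trace: request 0xC completes without a recorded start.
events = [
    ("start", 0xA, 1_000_000),
    ("start", 0xB, 1_200_000),
    ("completion", 0xC, 1_300_000),
    ("completion", 0xA, 1_450_000),
    ("completion", 0xB, 9_200_000),
]
print(latencies_usecs(events))
```

Removing the entry on completion (pop here, delete() in the script) keeps the map from growing without bound while tracing.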
eBPF bpftrace (aka BPFtrace) Linux 4.9+
…
Good for one-liners & short scripts; bcc is good for complex tools
https://github.com/iovisor/bpftrace
bpftrace Internals
eBPF XDP Linux 4.8+
https://www.netronome.com/blog/frnog-30-faster-networking-la-francaise/
eBPF bpfilter Linux 4.18+
ipfwadm (1.2.1)
ipchains (2.2.10)
iptables
nftables (3.13)
bpfilter (4.18+): jit-compiled, NIC offloading
https://lwn.net/Articles/747551/
Linux 4.9
BBR
TCP congestion control algorithm
Bottleneck Bandwidth and RTT
1% packet loss: we see 3x better throughput
https://twitter.com/amernetflix/status/892787364598132736
https://blog.apnic.net/2017/05/09/bbr-new-kid-tcp-block/
https://queue.acm.org/detail.cfm?id=3022184
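As the name suggests, BBR builds its model of the path from two estimates: the bottleneck bandwidth and the round-trip time. Their product is the bandwidth-delay product, roughly how much data should be in flight. A simplified arithmetic sketch, assuming (as the ACM Queue article describes) a max filter over recent delivery-rate samples and a min filter over recent RTT samples; the sample values are hypothetical and this is not the kernel's implementation:

```python
def bbr_path_model(delivery_rates_bps, rtts_s):
    """Estimate bottleneck bandwidth, propagation RTT, and BDP from samples."""
    btl_bw = max(delivery_rates_bps)     # best delivery rate seen, bits/sec
    rt_prop = min(rtts_s)                # lowest RTT seen, seconds
    bdp_bytes = btl_bw * rt_prop / 8     # bandwidth-delay product in bytes
    return btl_bw, rt_prop, bdp_bytes


# Hypothetical samples for a 100 Mbit/s path with a 40 ms base RTT.
rates = [80e6, 95e6, 100e6, 90e6]
rtts = [0.042, 0.040, 0.055, 0.048]
btl_bw, rt_prop, bdp_bytes = bbr_path_model(rates, rtts)
print(f"BtlBw {btl_bw / 1e6:.0f} Mbit/s, RTprop {rt_prop * 1e3:.0f} ms, BDP {bdp_bytes / 1e3:.0f} kB")
```

Because neither estimate depends on packet loss, a lossy but uncongested link need not force the sender to back off, which fits the throughput gain under 1% loss quoted above.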
Linux 4.12
Kyber
Multiqueue block I/O scheduler
Tune target read & write latency
Up to 300x lower 99th percentile latencies in our testing
[Diagram: Kyber (simplified): completions feed back to adjust queue size]
https://lwn.net/Articles/720675/
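The target latencies are set through sysfs once Kyber is the active scheduler for a device. A minimal sketch, assuming the standard block-layer sysfs layout with Kyber's read_lat_nsec and write_lat_nsec tunables; the device name and latency values are hypothetical, and the demo writes to a scratch directory so it runs without root:

```python
import tempfile
from pathlib import Path


def set_kyber_targets(device, read_lat_nsec, write_lat_nsec, sysfs_root="/sys"):
    """Select Kyber for a block device and set its target latencies (nanoseconds)."""
    queue = Path(sysfs_root) / "block" / device / "queue"
    (queue / "scheduler").write_text("kyber\n")
    (queue / "iosched" / "read_lat_nsec").write_text(f"{read_lat_nsec}\n")
    (queue / "iosched" / "write_lat_nsec").write_text(f"{write_lat_nsec}\n")
    return queue


# Dry run against a scratch tree that mimics sysfs. "nvme0n1" is a
# hypothetical device; 1 ms and 5 ms are example targets, not recommendations.
with tempfile.TemporaryDirectory() as scratch:
    iosched = Path(scratch) / "block" / "nvme0n1" / "queue" / "iosched"
    iosched.mkdir(parents=True)
    queue = set_kyber_targets("nvme0n1", 1_000_000, 5_000_000, sysfs_root=scratch)
    written = {
        "scheduler": (queue / "scheduler").read_text().strip(),
        "read_lat_nsec": (queue / "iosched" / "read_lat_nsec").read_text().strip(),
        "write_lat_nsec": (queue / "iosched" / "write_lat_nsec").read_text().strip(),
    }
print(written)
```

On a real system the same call with the default sysfs_root needs root privileges and a kernel built with Kyber support.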
Linux 4.17
Hist Triggers
ftrace advanced summaries
# cat /sys/kernel/debug/tracing/events/kmem/kmalloc/hist
# trigger info:
hist:keys=stacktrace:vals=bytes_req,bytes_alloc:sort=bytes_alloc:size=2048
[active]
[…]
{ stacktrace:
         __kmalloc+0x11b/0x1b0
         seq_buf_alloc+0x1b/0x50
         seq_read+0x2cc/0x370
         proc_reg_read+0x3d/0x80
         __vfs_read+0x28/0xe0
         vfs_read+0x86/0x140
         SyS_read+0x46/0xb0
         system_call_fastpath+0x12/0x6a
} hitcount: 19133 bytes_req: 78368768 bytes_alloc: 78368768
https://www.kernel.org/doc/html/latest/trace/histogram.html
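The summary line of each histogram entry is easy to post-process. A small sketch that parses the totals from the entry above and derives the average allocation size for that stack; the helper name is illustrative:

```python
import re


def parse_hist_totals(line):
    """Extract the 'name: value' pairs from a hist entry's summary line."""
    return {name: int(value) for name, value in re.findall(r"(\w+):\s+(\d+)", line)}


line = "} hitcount: 19133 bytes_req: 78368768 bytes_alloc: 78368768"
totals = parse_hist_totals(line)
avg_alloc = totals["bytes_alloc"] / totals["hitcount"]
print(totals, avg_alloc)
```

For this entry the average works out to exactly 4096 bytes per kmalloc() call, and bytes_req equals bytes_alloc, so this code path shows no slab rounding overhead.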
Linux 4.?
PSI not merged yet
https://lwn.net/Articles/759781/
More perf 4.4 - 4.19 (2016 - 2018)
● TCP listener lockless (4.4)
● copy_file_range() (4.5)
● madvise() MADV_FREE (4.5)
● epoll multithread scalability (4.5)
● Kernel Connection Multiplexor (4.6)
● Writeback management (4.10)
● Hybrid block polling (4.10)
● BFQ I/O scheduler (4.12)
● Async I/O improvements (4.13)
● In-kernel TLS acceleration (4.13)
● Socket MSG_ZEROCOPY (4.14)
● Asynchronous buffered I/O (4.14)
● Longer-lived TLB entries with PCID (4.14)
● mmap MAP_SYNC (4.15)
● Software-interrupt context hrtimers (4.16)
● Idle loop tick efficiency (4.17)
● perf_event_open() [ku]probes (4.17)
● AF_XDP sockets (4.18)
● Block I/O latency controller (4.19)
● CAKE for bufferbloat (4.19)
● New async I/O polling (4.19)
… and many minor improvements to:
• perf
• CPU scheduling
• futexes
• NUMA
• Huge pages
• Slab allocation
• TCP, UDP
• Drivers
• Processor support
• GPUs
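Several of these features are reachable from userspace with very little code. As one example, copy_file_range() lets the kernel copy data between two files without bouncing it through a userspace buffer. A hedged sketch using Python's os.copy_file_range() wrapper (Python 3.8+), falling back to shutil.copyfile() where the call is missing or refused; the helper name and file names are illustrative:

```python
import os
import shutil
import tempfile


def copy_in_kernel(src_path, dst_path):
    """Copy a file with copy_file_range(); fall back to shutil.copyfile()."""
    if not hasattr(os, "copy_file_range"):
        shutil.copyfile(src_path, dst_path)
        return "fallback"
    try:
        with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
            remaining = os.fstat(src.fileno()).st_size
            while remaining > 0:
                copied = os.copy_file_range(src.fileno(), dst.fileno(), remaining)
                if copied == 0:      # source ended early
                    break
                remaining -= copied
        return "copy_file_range"
    except OSError:
        # Kernel or filesystem does not support the call for this pair of files.
        shutil.copyfile(src_path, dst_path)
        return "fallback"


with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "src.bin")
    dst_path = os.path.join(tmp, "dst.bin")
    payload = os.urandom(64 * 1024)
    with open(src_path, "wb") as f:
        f.write(payload)
    method = copy_in_kernel(src_path, dst_path)
    with open(dst_path, "rb") as f:
        copied_ok = f.read() == payload
print(method, copied_ok)
```

Which path runs depends on the kernel, filesystem, and Python version in use, so the sketch reports the method rather than assuming one.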
Take Aways
1. Run latest
2. Browse major features
e.g., https://kernelnewbies.org/Linux_4.19
Some Linux perf Resources
- http://www.brendangregg.com/linuxperf.html
- https://kernelnewbies.org/LinuxChanges
- https://lwn.net/Kernel
- https://github.com/iovisor/bcc
- http://blog.stgolabs.net/search/label/linux
- http://www.brendangregg.com/blog/2018-02-09/kpti-kaiser-meltdown-performance.html