Friday 6 November 2009

Performance Monitoring using Ingo Molnars' perf tool

The perf tool by Ingo Molnar allows one to do some deep performance using Linux performance counters. It covers a broad range of performance monitoring at the hardware level and software level. In this blog posting I just want to give you a taste of some of the ways to use this powerful tool.

One needs to build this from the kernel source, but it's fairly easy to do:

1) Install libelf-dev, on a Ubuntu system use:

sudo apt-get install libelf-dev

2) Get the kernel source

either from kernel.org or from Ubuntu kernel source package:

apt-get source linux-image-2.6.31-14-generic

3) ..and build the tool..

in the kernel source:

cd tools/perf
make

There is plenty of documentation on this tool in the tools/perf/Documentation directory and I recommend reading this to get a full appreciation of what the tool can do and how to drive it.

My first example is a trivial performance counter example on the dd command:

./perf stat dd if=/dev/zero of=/dev/null bs=1M count=4096
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 0.353498 s, 12.1 GB/s

Performance counter stats for 'dd if=/dev/zero of=/dev/null bs=1M count=4096':

355.148424 task-clock-msecs # 0.998 CPUs
18 context-switches # 0.000 M/sec
0 CPU-migrations # 0.000 M/sec
501 page-faults # 0.001 M/sec
899141721 cycles # 2531.735 M/sec
2212730050 instructions # 2.461 IPC
67433134 cache-references # 189.873 M/sec
6374 cache-misses # 0.018 M/sec

0.355829317 seconds time elapsed


But we can dig deeper than this. How about seeing what's really going on on the application and the kernel? The next command records stats into a file perf.data and then we can then examine these stats using perf report:

./perf record -f dd if=/dev/urandom of=/dev/null bs=1M count=16
16+0 records in
16+0 records out
16777216 bytes (17 MB) copied, 2.39751 s, 7.0 MB/s
[ perf record: Captured and wrote 1.417 MB perf.data (~61900 samples) ]

..and generate a report on the significant CPU consuming functions:

./perf report --sort comm,dso,symbol | grep -v "0.00%"
# Samples: 61859
#
# Overhead Command Shared Object Symbol
# ........ ....... ......................... ......
#
75.52% dd [kernel] [k] sha_transform
14.07% dd [kernel] [k] mix_pool_bytes_extract
3.38% dd [kernel] [k] extract_buf
2.33% dd [kernel] [k] copy_user_generic_string
1.36% dd [kernel] [k] __ticket_spin_lock
0.90% dd [kernel] [k] _spin_lock_irqsave
0.72% dd [kernel] [k] _spin_unlock_irqrestore
0.67% dd [kernel] [k] extract_entropy_user
0.27% dd [kernel] [k] default_spin_lock_flags
0.22% dd [kernel] [k] sha_init
0.11% dd [kernel] [k] __ticket_spin_unlock
0.08% dd [kernel] [k] copy_to_user
0.04% perf [kernel] [k] copy_user_generic_string
0.02% dd [kernel] [k] clear_page_c
0.01% perf [kernel] [k] memset_c
0.01% dd [kernel] [k] page_fault
0.01% dd /lib/libc-2.10.1.so [.] 0x000000000773f6
0.01% perf [kernel] [k] __ticket_spin_lock
0.01% dd [kernel] [k] native_read_tsc
0.01% dd /lib/libc-2.10.1.so [.] strcmp
0.01% perf [kernel] [k] kmem_cache_alloc
0.01% perf [kernel] [k] __block_commit_write
0.01% perf [kernel] [k] ext4_do_update_inode

..showing us where most of the CPU time is being consumed, down to the function names in the kernel, application and shared libraries.

One can drill down deeper, in the previous example strcmp() was using 0.01% of the CPU; we can see where using perf annotate:

./perf annotate strcmp
objdump: 'vmlinux': No such file

------------------------------------------------
Percent | Source code & Disassembly of vmlinux
------------------------------------------------
------------------------------------------------
Percent | Source code & Disassembly of libc-2.10.1.so
------------------------------------------------
:
:
:
: Disassembly of section .text:
:
: 000000000007ee20 :
50.00 : 7ee20: 8a 07 mov (%rdi),%al
0.00 : 7ee22: 3a 06 cmp (%rsi),%al
25.00 : 7ee24: 75 0d jne 7ee33
25.00 : 7ee26: 48 ff c7 inc %rdi
0.00 : 7ee29: 48 ff c6 inc %rsi
0.00 : 7ee2c: 84 c0 test %al,%al
0.00 : 7ee2e: 75 f0 jne 7ee20
0.00 : 7ee30: 31 c0 xor %eax,%eax
0.00 : 7ee32: c3 retq
0.00 : 7ee33: b8 01 00 00 00 mov $0x1,%eax
0.00 : 7ee38: b9 ff ff ff ff mov $0xffffffff,%ecx
0.00 : 7ee3d: 0f 42 c1 cmovb %ecx,%eax
0.00 : 7ee40: c3 retq

Without the debug info in the object code, just the annotated assember is displayed.

To see which events one can trace with, use the perf list command:

./perf list

List of pre-defined events (to be used in -e):

cpu-cycles OR cycles [Hardware event]
instructions [Hardware event]
cache-references [Hardware event]
cache-misses [Hardware event]
branch-instructions OR branches [Hardware event]
branch-misses [Hardware event]
bus-cycles [Hardware event]

cpu-clock [Software event]
task-clock [Software event]
page-faults OR faults [Software event]
minor-faults [Software event]
major-faults [Software event]
context-switches OR cs [Software event]
cpu-migrations OR migrations [Software event]

L1-dcache-loads [Hardware cache event]
L1-dcache-load-misses [Hardware cache event]
...
...
sched:sched_migrate_task [Tracepoint event]
sched:sched_process_free [Tracepoint event]
sched:sched_process_exit [Tracepoint event]
sched:sched_process_wait [Tracepoint event]
sched:sched_process_fork [Tracepoint event]
sched:sched_signal_send [Tracepoint event]


On my system I have over 120 different types of events that can be monitored. One can select events to monitor using the -e event option, e.g.:

./perf stat -e L1-dcache-loads -e instructions dd if=/dev/zero of=/dev/null bs=1M count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.000784806 s, 1.3 GB/s

Performance counter stats for 'dd if=/dev/zero of=/dev/null bs=1M count=1':

1166059 L1-dcache-loads # inf M/sec
4283970 instructions # inf IPC

0.003090599 seconds time elapsed


This is one powerful too! I recommend reading the documentation and trying it out for yourself on a 2.6.31 kernel.

References: http://lkml.org/lkml/2009/8/4/346

3 comments:

  1. Can you use 'perf' on processes already running? 'twould be handy for long-running daemons detached from the terminal.


    Ok, ok. RTFM, right?

    ReplyDelete
  2. Not sure, perhaps read the source code... :-)

    ReplyDelete
  3. I've noticed that the tools/perf directory from
    the ubuntu kernel source package doesn't atually have any source code. Not sure why this is the case
    >ls *
    Makefile

    Documentation:
    Makefile

    util:
    generate-cmdlist.sh

    That said, perf appears to work fine when built from the kernel.org source.

    ReplyDelete