summaryrefslogtreecommitdiff
path: root/simpleperf/README.md
diff options
context:
space:
mode:
authorYabin Cui <yabinc@google.com>2016-10-18 12:26:26 -0700
committerYabin Cui <yabinc@google.com>2016-10-18 12:26:26 -0700
commitcaf8e070afb3d7282121856bb2cce356ef47aaaf (patch)
treef2275f9d254dba4fe84719e6df7cddd958eb4f99 /simpleperf/README.md
parent4ab6af6787d7826e8c43282b5ad78b80b172d107 (diff)
downloadextras-caf8e070afb3d7282121856bb2cce356ef47aaaf.tar.gz
simpleperf: copy document from ndk-master.
Test: None, markdown only Bug: None Change-Id: I4d0ba07127f84119524df5d214c4731a7c9f06eb
Diffstat (limited to 'simpleperf/README.md')
-rw-r--r--simpleperf/README.md736
1 files changed, 736 insertions, 0 deletions
diff --git a/simpleperf/README.md b/simpleperf/README.md
new file mode 100644
index 00000000..3acd33a0
--- /dev/null
+++ b/simpleperf/README.md
@@ -0,0 +1,736 @@
+# Introduction of simpleperf
+## What is simpleperf
+Simpleperf is a native profiling tool for Android. Its command-line interface
+supports broadly the same options as the linux-tools perf, but also supports
+various Android-specific improvements.
+
+Simpleperf is part of the Android Open Source Project. The source code is at
+https://android.googlesource.com/platform/system/extras/+/master/simpleperf/.
+Bugs and feature requests can be submitted at
+http://github.com/android-ndk/ndk/issues.
+
+## How simpleperf works
+Modern CPUs have a hardware component called the performance monitoring unit
+(PMU). The PMU has several hardware counters, counting events like how many cpu
+cycles have happened, how many instructions have executed, or how many cache
+misses have happened.
+
+The Linux kernel wraps these hardware counters into hardware perf events. In
+addition, the Linux kernel also provides hardware independent software events
+and tracepoint events. The Linux kernel exposes all this to userspace via the
+perf_event_open system call, which simpleperf uses.
+
+Simpleperf has three main functions: stat, record and report.
+
+The stat command gives a summary of how many events have happened in the
+profiled processes in a time period. Here’s how it works:
+1. Given user options, simpleperf enables profiling by making a system call to
+linux kernel.
+2. Linux kernel enables counters while scheduling on the profiled processes.
+3. After profiling, simpleperf reads counters from linux kernel, and reports a
+counter summary.
+
+The record command records samples of the profiled process in a time period.
+Here’s how it works:
+1. Given user options, simpleperf enables profiling by making a system call to
+linux kernel.
+2. Simpleperf creates mapped buffers between simpleperf and linux kernel.
+3. Linux kernel enable counters while scheduling on the profiled processes.
+4. Each time a given number of events happen, linux kernel dumps a sample to a
+mapped buffer.
+5. Simpleperf reads samples from the mapped buffers and generates perf.data.
+
+The report command reads a "perf.data" file and any shared libraries used by
+the profiled processes, and outputs a report showing where the time was spent.
+
+## Main simpleperf commands
+Simpleperf supports several subcommands, including list, stat, record, report.
+Each subcommand supports different options. This section only covers the most
+important subcommands and options. To see all subcommands and options,
+use --help.
+
+ # List all subcommands.
+ $simpleperf --help
+
+ # Print help message for record subcommand.
+ $simpleperf record --help
+
+### simpleperf list
+simpleperf list is used to list all events available on the device. Different
+devices may support different events because of differences in hardware and
+kernel.
+
+ $simpleperf list
+ List of hw-cache events:
+ branch-loads
+ ...
+ List of hardware events:
+ cpu-cycles
+ instructions
+ ...
+ List of software events:
+ cpu-clock
+ task-clock
+ ...
+
+### simpleperf stat
+simpleperf stat is used to get a raw event counter information of the profiled program
+or system-wide. By passing options, we can select which events to use, which
+processes/threads to monitor, how long to monitor and the print interval.
+Below is an example.
+
+ # Stat using default events (cpu-cycles,instructions,...), and monitor
+ # process 7394 for 10 seconds.
+ $simpleperf stat -p 7394 --duration 10
+ Performance counter statistics:
+
+ 1,320,496,145 cpu-cycles # 0.131736 GHz (100%)
+ 510,426,028 instructions # 2.587047 cycles per instruction (100%)
+ 4,692,338 branch-misses # 468.118 K/sec (100%)
+ 886.008130(ms) task-clock # 0.088390 cpus used (100%)
+ 753 context-switches # 75.121 /sec (100%)
+ 870 page-faults # 86.793 /sec (100%)
+
+ Total test time: 10.023829 seconds.
+
+#### Select events
+We can select which events to use via -e option. Below are examples:
+
+ # Stat event cpu-cycles.
+ $simpleperf stat -e cpu-cycles -p 11904 --duration 10
+
+ # Stat event cache-references and cache-misses.
+ $simpleperf stat -e cache-references,cache-misses -p 11904 --duration 10
+
+When running the stat command, if the number of hardware events is larger than
+the number of hardware counters available in the PMU, the kernel shares hardware
+counters between events, so each event is only monitored for part of the total
+time. In the example below, there is a percentage at the end of each row,
+showing the percentage of the total time that each event was actually monitored.
+
+ # Stat using event cache-references, cache-references:u,....
+ $simpleperf stat -p 7394 -e cache-references,cache-references:u,cache-references:k,cache-misses,cache-misses:u,cache-misses:k,instructions --duration 1
+ Performance counter statistics:
+
+ 4,331,018 cache-references # 4.861 M/sec (87%)
+ 3,064,089 cache-references:u # 3.439 M/sec (87%)
+ 1,364,959 cache-references:k # 1.532 M/sec (87%)
+ 91,721 cache-misses # 102.918 K/sec (87%)
+ 45,735 cache-misses:u # 51.327 K/sec (87%)
+ 38,447 cache-misses:k # 43.131 K/sec (87%)
+ 9,688,515 instructions # 10.561 M/sec (89%)
+
+ Total test time: 1.026802 seconds.
+
+In the example above, each event is monitored about 87% of the total time. But
+there is no guarantee that any pair of events are always monitored at the same
+time. If we want to have some events monitored at the same time, we can use
+--group option. Below is an example.
+
+ # Stat using event cache-references, cache-references:u,....
+ $simpleperf stat -p 7394 --group cache-references,cache-misses --group cache-references:u,cache-misses:u --group cache-references:k,cache-misses:k -e instructions --duration 1
+ Performance counter statistics:
+
+ 3,638,900 cache-references # 4.786 M/sec (74%)
+ 65,171 cache-misses # 1.790953% miss rate (74%)
+ 2,390,433 cache-references:u # 3.153 M/sec (74%)
+ 32,280 cache-misses:u # 1.350383% miss rate (74%)
+ 879,035 cache-references:k # 1.251 M/sec (68%)
+ 30,303 cache-misses:k # 3.447303% miss rate (68%)
+ 8,921,161 instructions # 10.070 M/sec (86%)
+
+ Total test time: 1.029843 seconds.
+
+#### Select target to monitor
+We can select which processes or threads to monitor via -p option or -t option.
+Monitoring a process is the same as monitoring all threads in the process.
+Simpleperf can also fork a child process to run the new command and then monitor
+the child process. Below are examples.
+
+ # Stat process 11904 and 11905.
+ $simpleperf stat -p 11904,11905 --duration 10
+
+ # Stat thread 11904 and 11905.
+ $simpleperf stat -t 11904,11905 --duration 10
+
+ # Start a child process running `ls`, and stat it.
+ $simpleperf stat ls
+
+#### Decide how long to monitor
+When monitoring existing threads, we can use --duration option to decide how long
+to monitor. When monitoring a child process running a new command, simpleperf
+monitors until the child process ends. In this case, we can use Ctrl-C to stop monitoring
+at any time. Below are examples.
+
+ # Stat process 11904 for 10 seconds.
+ $simpleperf stat -p 11904 --duration 10
+
+ # Stat until the child process running `ls` finishes.
+ $simpleperf stat ls
+
+ # Stop monitoring using Ctrl-C.
+ $simpleperf stat -p 11904 --duration 10
+ ^C
+
+#### Decide the print interval
+When monitoring perf counters, we can also use --interval option to decide the print
+interval. Below are examples.
+
+ # Print stat for process 11904 every 300ms.
+ $simpleperf stat -p 11904 --duration 10 --interval 300
+ # Stop by using Ctrl-C.
+ ^C
+
+ # Print system wide stat at interval of 300ms for 10 seconds (rooted device only).
+ # system wide profiling needs root privilege
+ $su 0 simpleperf stat -a --duration 10 --interval 300
+
+#### Display counters in systrace
+simpleperf can also work with systrace to dump counters in the collected trace.
+Below is an example to do a system wide stat
+
+ # capture instructions (kernel only) and cache misses with interval of 300 milliseconds for 15 seconds
+ $su 0 simpleperf stat -e instructions:k,cache-misses -a --interval 300 --duration 15
+ # on host launch systrace to collect trace for 10 seconds
+ (HOST)$external/chromium-trace/systrace.py --time=10 -o new.html sched gfx view
+ # open the collected new.html in browser and perf counters will be shown up
+
+### simpleperf record
+simpleperf record is used to dump records of the profiled program. By passing
+options, we can select which events to use, which processes/threads to monitor,
+what frequency to dump records, how long to monitor, and where to store records.
+
+ # Record on process 7394 for 10 seconds, using default event (cpu-cycles),
+ # using default sample frequency (4000 samples per second), writing records
+ # to perf.data.
+ $simpleperf record -p 7394 --duration 10
+ simpleperf I 07-11 21:44:11 17522 17522 cmd_record.cpp:316] Samples recorded: 21430. Samples lost: 0.
+
+#### Select events
+In most cases, the cpu-cycles event is used to evaluate consumed cpu time.
+As a hardware event, it is both accurate and efficient. We can also use other
+events via -e option. Below is an example.
+
+ # Record using event instructions.
+ $simpleperf record -e instructions -p 11904 --duration 10
+
+#### Select target to monitor
+The way to select target in record command is similar to that in stat command.
+Below are examples.
+
+ # Record process 11904 and 11905.
+ $simpleperf record -p 11904,11905 --duration 10
+
+ # Record thread 11904 and 11905.
+ $simpleperf record -t 11904,11905 --duration 10
+
+ # Record a child process running `ls`.
+ $simpleperf record ls
+
+#### Set the frequency to record
+We can set the frequency to dump records via the -f or -c options. For example,
+-f 4000 means dumping approximately 4000 records every second when the monitored
+thread runs. If a monitored thread runs 0.2s in one second (it can be preempted
+or blocked in other times), simpleperf dumps about 4000 * 0.2 / 1.0 = 800
+records every second. Another way is using -c option. For example, -c 10000
+means dumping one record whenever 10000 events happen. Below are examples.
+
+ # Record with sample frequency 1000: sample 1000 times every second running.
+ $simpleperf record -f 1000 -p 11904,11905 --duration 10
+
+ # Record with sample period 100000: sample 1 time every 100000 events.
+ $simpleperf record -c 100000 -t 11904,11905 --duration 10
+
+#### Decide how long to monitor
+The way to decide how long to monitor in record command is similar to that in
+stat command. Below are examples.
+
+ # Record process 11904 for 10 seconds.
+ $simpleperf record -p 11904 --duration 10
+
+ # Record until the child process running `ls` finishes.
+ $simpleperf record ls
+
+ # Stop monitoring using Ctrl-C.
+ $simpleperf record -p 11904 --duration 10
+ ^C
+
+#### Set the path to store records
+By default, simpleperf stores records in perf.data in current directory. We can
+use -o option to set the path to store records. Below is an example.
+
+ # Write records to data/perf2.data.
+ $simpleperf record -p 11904 -o data/perf2.data --duration 10
+
+### simpleperf report
+simpleperf report is used to report based on perf.data generated by simpleperf
+record command. Report command groups records into different sample entries,
+sorts sample entries based on how many events each sample entry contains, and
+prints out each sample entry. By passing options, we can select where to find
+perf.data and executable binaries used by the monitored program, filter out
+uninteresting records, and decide how to group records.
+
+Below is an example. Records are grouped into 4 sample entries, each entry is
+a row. There are several columns, each column shows piece of information
+belonging to a sample entry. The first column is Overhead, which shows the
+percentage of events inside current sample entry in total events. As the
+perf event is cpu-cycles, the overhead can be seen as the percentage of cpu
+time used in each function.
+
+ # Reports perf.data, using only records sampled in libsudo-game-jni.so,
+ # grouping records using thread name(comm), process id(pid), thread id(tid),
+ # function name(symbol), and showing sample count for each row.
+ $simpleperf report --dsos /data/app/com.example.sudogame-2/lib/arm64/libsudo-game-jni.so --sort comm,pid,tid,symbol -n
+ Cmdline: /data/data/com.example.sudogame/simpleperf record -p 7394 --duration 10
+ Arch: arm64
+ Event: cpu-cycles (type 0, config 0)
+ Samples: 28235
+ Event count: 546356211
+
+ Overhead Sample Command Pid Tid Symbol
+ 59.25% 16680 sudogame 7394 7394 checkValid(Board const&, int, int)
+ 20.42% 5620 sudogame 7394 7394 canFindSolution_r(Board&, int, int)
+ 13.82% 4088 sudogame 7394 7394 randomBlock_r(Board&, int, int, int, int, int)
+ 6.24% 1756 sudogame 7394 7394 @plt
+
+#### Set the path to read records
+By default, simpleperf reads perf.data in current directory. We can use -i
+option to select another file to read records.
+
+ $simpleperf report -i data/perf2.data
+
+#### Set the path to find executable binaries
+If reporting function symbols, simpleperf needs to read executable binaries
+used by the monitored processes to get symbol table and debug information. By
+default, the paths are the executable binaries used by monitored processes while
+recording. However, these binaries may not exist when reporting or not contain
+symbol table and debug information. So we can use --symfs to redirect the paths.
+Below is an example.
+
+ $simpleperf report
+ # In this case, when simpleperf wants to read executable binary /A/b,
+ # it reads file in /A/b.
+
+ $simpleperf report --symfs /debug_dir
+ # In this case, when simpleperf wants to read executable binary /A/b,
+ # it prefers file in /debug_dir/A/b to file in /A/b.
+
+#### Filter records
+When reporting, it happens that not all records are of interest. Simpleperf
+supports five filters to select records of interest. Below are examples.
+
+ # Report records in threads having name sudogame.
+ $simpleperf report --comms sudogame
+
+ # Report records in process 7394 or 7395
+ $simpleperf report --pids 7394,7395
+
+ # Report records in thread 7394 or 7395.
+ $simpleperf report --tids 7394,7395
+
+ # Report records in libsudo-game-jni.so.
+ $simpleperf report --dsos /data/app/com.example.sudogame-2/lib/arm64/libsudo-game-jni.so
+
+ # Report records in function checkValid or canFindSolution_r.
+ $simpleperf report --symbols "checkValid(Board const&, int, int);canFindSolution_r(Board&, int, int)"
+
+#### Decide how to group records into sample entries
+Simpleperf uses --sort option to decide how to group sample entries. Below are
+examples.
+
+ # Group records based on their process id: records having the same process
+ # id are in the same sample entry.
+ $simpleperf report --sort pid
+
+ # Group records based on their thread id and thread comm: records having
+ # the same thread id and thread name are in the same sample entry.
+ $simpleperf report --sort tid,comm
+
+ # Group records based on their binary and function: records in the same
+ # binary and function are in the same sample entry.
+ $simpleperf report --sort dso,symbol
+
+ # Default option: --sort comm,pid,tid,dso,symbol. Group records in the same
+ # thread, and belong to the same function in the same binary.
+ $simpleperf report
+
+## Features of simpleperf
+Simpleperf works similar to linux-tools-perf, but it has following improvements:
+1. Aware of Android environment. Simpleperf handles some Android specific
+situations when profiling. For example, it can profile embedded shared libraries
+in apk, read symbol table and debug information from .gnu_debugdata section. If
+possible, it gives suggestions when facing errors, like how to disable
+perf_harden to enable profiling.
+2. Support unwinding while recording. If we want to use -g option to record and
+report call-graph of a program, we need to dump user stack and register set in
+each record, and then unwind the stack to find the call chain. Simpleperf
+supports unwinding while recording, so it doesn’t need to store user stack in
+perf.data. So we can profile for a longer time with limited space on device.
+3. Build in static binaries. Simpleperf is a static binary, so it doesn’t need
+supporting shared libraries to run. It means there is no limitation of Android
+version that simpleperf can run on, although some devices don’t support
+profiling.
+
+# Steps to profile native libraries
+After introducing simpleperf, this section uses a simple example to show how to
+profile jni native libraries on Android using simpleperf. The example profiles
+an app called com.example.sudogame, which uses a jni native library
+sudo-game-jni.so. We focus on sudo-game-jni.so, not the java code or system
+libraries.
+
+## 1. Run debug version of the app on device
+We need to run debug version of the app, because we can’t use *run-as* for non
+debuggable apps.
+
+## 2. Download simpleperf to the app’s directory
+Use *uname* to find the architecture on device
+
+ $adb shell uname -m
+ aarch64
+
+"aarch64" means we should download arm64 version of simpleperf to device.
+
+ $adb push device/arm64/simpleperf /data/local/tmp
+ $adb shell run-as com.example.sudogame cp /data/local/tmp/simpleperf .
+ $adb shell run-as com.example.sudogame chmod a+x simpleperf
+ $adb shell run-as com.example.sudogame ls -l
+ -rwxrwxrwx 1 u0_a90 u0_a90 3059208 2016-01-01 10:40 simpleperf
+
+Note that some apps use arm native libraries even on arm64 devices (We can
+verify this by checking /proc/<process\_id\_of\_app>/maps). In that case, we
+should use arm/simpleperf instead of arm64/simpleperf.
+
+## 3. Enable profiling
+Android devices may disable profiling by default, and we need to enable
+profiling.
+
+ $adb shell setprop security.perf_harden 0
+
+## 4. Find the target process/thread to record
+
+ # Use `ps` to get process id of sudogame.
+ $adb shell ps | grep sudogame
+ u0_a102 15869 545 1629824 76888 SyS_epoll_ 0000000000 S com.example.sudogame
+
+ # Use `ps -t` to get thread ids of process 15869.
+ # If this doesn’t work, you can try `ps -eT`.
+ $adb shell ps -t | grep 15869
+ u0_a102 15869 545 1629824 76888 SyS_epoll_ 0000000000 S com.example.sudogame
+ u0_a102 15874 15869 1629824 76888 futex_wait 0000000000 S Jit thread pool
+ ...
+
+## 5. Record perf.data
+
+ # Record process 15869 for 30s, and use the app while recording it.
+ $adb shell run-as com.example.sudogame ./simpleperf record -p 15869 --duration 30
+ simpleperf W 07-12 20:00:33 16022 16022 environment.cpp:485] failed to read /proc/sys/kernel/kptr_restrict: Permission denied
+ simpleperf I 07-12 20:01:03 16022 16022 cmd_record.cpp:315] Samples recorded: 81445. Samples lost: 0.
+
+ $adb shell run-as com.example.sudogame ls -lh perf.data
+ -rw-rw-rw- 1 u0_a102 u0_a102 4.3M 2016-07-12 20:01 perf.data
+
+Now we have recorded perf.data with 81445 records. There is a warning about
+failing to read kptr_restrict. It doesn’t matter in our case, but is a notification that we
+can’t read kernel symbol addresses.
+
+## 6. Report perf.data
+Below are several examples reporting on device.
+
+### Report samples in different binaries
+
+ # Report how samples distribute on different binaries.
+ $adb shell run-as com.example.sudogame ./simpleperf report -n --sort dso
+ simpleperf W 07-12 19:15:10 11389 11389 dso.cpp:309] Symbol addresses in /proc/kallsyms are all zero. `echo 0 >/proc/sys/kernel/kptr_restrict` if possible.
+ Cmdline: /data/data/com.example.sudogame/simpleperf record -p 15869 --duration 30
+ Arch: arm64
+ Event: cpu-cycles (type 0, config 0)
+ Samples: 81445
+ Event count: 34263925309
+
+ Overhead Sample Shared Object
+ 75.31% 58231 [kernel.kallsyms]
+ 8.44% 6845 /system/lib64/libc.so
+ 4.30% 4667 /vendor/lib64/egl/libGLESv2_adreno.so
+ 2.30% 2433 /system/lib64/libhwui.so
+ 1.88% 1952 /system/lib64/libart.so
+ 1.88% 1967 /system/framework/arm64/boot-framework.oat
+ 1.59% 1218 /system/lib64/libcutils.so
+ 0.69% 728 /system/lib64/libskia.so
+ 0.63% 489 /data/app/com.example.sudogame-2/lib/arm64/libsudo-game-jni.so
+ 0.34% 312 /system/lib64/libart-compiler.so
+ ...
+
+According to the report above, most time is spent in kernel, and
+libsudo-game-jni.so costs only 0.63% by itself. It seems libsudo-game-jni.so
+can’t be the bottleneck. However, it is possible we didn’t record long enough
+to hit the hot spot, or code in libsudo-game-jni.so calls other libraries
+consuming most time.
+
+### Report samples in different functions
+
+ # Report how samples distribute inside libsudo-game-jni.so.
+ $adb shell run-as com.example.sudogame ./simpleperf report -n --dsos /data/app/com.example.sudogame-2/lib/arm64/libsudo-game-jni.so --sort symbol
+ ...
+ Overhead Sample Symbol
+ 94.45% 461 unknown
+ 5.22% 26 @plt
+ 0.20% 1 Java_com_example_sudogame_GameModel_findConflictPairs
+ 0.14% 1 Java_com_example_sudogame_GameModel_canFindSolution
+
+In the report above, most samples belong to unknown symbol. It is because the
+libsudo-game-jni.so used on device doesn’t contain symbol table. We need to
+download shared library with symbol table to device. In android studio 2.1.2,
+the binary with symbol table is in
+[app_dir]/app/build/intermediates/binaries/debug/obj/arm64-v8a (for amr64).
+
+ # Make a proper directory to download binary to device. This directory
+ # should be the same as the directory of
+ # /data/app/com.example.sudogame-2/lib/arm64/libsudo-game-jni.so.
+ $adb shell run-as com.example.sudogame mkdir -p data/app/com.example.sudogame-2/lib/arm64
+ # Download binary with symbol table.
+ $adb push [app_dir]/app/build/intermediates/binaries/debug/obj/arm64-v8a/libsudo-game-jni.so /data/local/tmp
+ $adb shell run-as com.example.sudogame cp /data/local/tmp/libsudo-game-jni.so data/app/com.example.sudogame-2/lib/arm64
+
+ # Report how samples distribute inside libsudo-game-jni.so with debug binary
+ # support.
+ $adb shell run-as com.example.sudogame ./simpleperf report -n --dsos /data/app/com.example.sudogame-2/lib/arm64/libsudo-game-jni.so --sort symbol --symfs .
+ ...
+ Overhead Sample Symbol
+ 71.08% 347 checkValid(Board const&, int, int)
+ 15.13% 74 randomBlock_r(Board&, int, int, int, int, int)
+ 7.94% 38 canFindSolution_r(Board&, int, int)
+ 5.22% 26 @plt
+ 0.30% 2 randomBoard(Board&)
+ 0.20% 1 Java_com_example_sudogame_GameModel_findConflictPairs
+ 0.14% 1 Java_com_example_sudogame_GameModel_canFindSolution
+
+With the help of debug version of libsudo-game-jni.so, the report above shows that most
+time in libsudo-game-jni.so is spent in function checkValid. So now we can look
+into it further.
+
+### Report samples in one function
+
+ # Report how samples distribute inside checkValid() function.
+ # adb shell command can’t pass ‘(‘ in arguments, so we run the command
+ # inside `adb shell`.
+ $adb shell
+ device$ run-as com.example.sudogame ./simpleperf report -n --symbols "checkValid(Board const&, int, int)" --sort vaddr_in_file --symfs .
+ ...
+ Overhead Sample VaddrInFile
+ 14.90% 50 0x24d8
+ 8.48% 29 0x251c
+ 5.52% 19 0x2468
+ ...
+
+The report above shows samples hitting different places inside function
+checkValid(). By using objdump to disassemble libsudo-game-jni.so, we can find
+which are the hottest instructions in checkValid() function.
+
+ # Disassemble libsudo-game-jni.so.
+ $aarch64-linux-android-objdump -d -S -l libsudo-game-jni.so >libsudo-game-jni.asm
+
+## 7. Record and report call graph
+### What is a call graph
+A call graph is a tree showing function call relations. For example, a program
+starts at main() function, and main() calls functionOne() and functionTwo(),
+and functionOne() calls functionTwo() and functionThree(). Then the call graph
+is as below.
+
+ main() -> functionOne()
+ | |
+ | |-> functionTwo()
+ | |
+ | -> functionThree()
+ -> functionTwo()
+
+### Record dwarf based call graph
+To generate call graph, simpleperf needs to generate call chain for each record.
+Simpleperf requests kernel to dump user stack and user register set for each
+record, then it backtraces the user stack to find the function call chain. To
+parse the call chain, it needs support of dwarf call frame information, which
+usually resides in .eh_frame or .debug_frame section of the binary. So we need
+to use --symfs to point out where is libsudo-game-jni.so with debug information.
+
+ # Record thread 11546 for 30s, use the app while recording it.
+ $adb shell run-as com.example.sudogame ./simpleperf record -t 11546 -g --symfs . --duration 30
+ simpleperf I 01-01 07:13:08 9415 9415 cmd_record.cpp:336] Samples recorded: 65279. Samples lost: 16740.
+ simpleperf W 01-01 07:13:08 9415 9415 cmd_record.cpp:343] Lost 20.4099% of samples, consider increasing mmap_pages(-m), or decreasing sample frequency(-f), or increasing sample period(-c).
+
+ $adb shell run-as com.example.sudogame ls -lh perf.data
+ -rw-rw-rw- 1 u0_a96 u0_a96 8.3M 2016-01-01 08:49 perf.data
+
+Note that kernel can’t dump user stack >= 64K, so the dwarf based call graph
+doesn’t contain call chains consuming >= 64K stack. So avoiding allocating
+large memory on stack is a good way to improve dwarf based call graph.
+
+### Record stack frame based call graph
+Another way to generate call graph is to rely on the kernel parsing the call
+chain for each record. To make it possible, kernel has to be able to identify
+the stack frame of each function call. This is not always possible, because
+compilers can optimize away stack frames, or use a stack frame style not
+recognized by the kernel. So how well it works depends.
+
+ # Record thread 11546 for 30s, use the app while recording it.
+ $adb shell run-as com.example.sudogame ./simpleperf record -t 11546 --call-graph fp --symfs . --duration 30
+ simpleperf W 01-02 05:43:24 23277 23277 environment.cpp:485] failed to read /proc/sys/kernel/kptr_restrict: Permission denied
+ simpleperf I 01-02 05:43:54 23277 23277 cmd_record.cpp:323] Samples recorded: 95023. Samples lost: 0.
+
+ $adb shell run-as com.example.sudogame ls -lh perf.data
+ -rw-rw-rw- 1 u0_a96 u0_a96 39M 2016-01-02 05:43 perf.data
+
+### Report call graph
+#### Report call graph on device
+ # Report call graph.
+ $adb shell run-as com.example.sudogame ./simpleperf report -n -g --symfs .
+ Cmdline: /data/data/com.example.sudogame/simpleperf record -t 11546 -g --symfs . -f 1000 --duration 30
+ Arch: arm64
+ Event: cpu-cycles (type 0, config 0)
+ Samples: 23840
+ Event count: 41301992088
+
+ Children Self Sample Command Pid Tid Shared Object Symbol
+ 97.98% 0.69% 162 xample.sudogame 11546 11546 /data/app/com.example.sudogame-1/lib/arm64/libsudo-game-jni.so checkValid(Board const&, int, int)
+ |
+ -- checkValid(Board const&, int, int)
+ |
+ |--99.95%-- __android_log_print
+ | |
+ | |--92.19%-- __android_log_buf_write
+ | | |
+ | | |--73.50%-- libcutils.so[+1120c]
+ ...
+
+#### Report call graph in callee mode
+Call graph can be shown in two modes. One is caller mode, showing how functions
+call others. The other is callee mode, showing how functions are called by
+others. We can use *-g callee* option to show call graph in callee mode.
+
+ # Report call graph.
+ $host/simpleperf report -n -g callee --symfs .
+ Cmdline: /data/data/com.example.sudogame/simpleperf record -t 11546 -g --symfs . -f 1000 --duration 30
+ Arch: arm64
+ Event: cpu-cycles (type 0, config 0)
+ Samples: 23840
+ Event count: 41301992088
+
+ Children Self Sample Command Pid Tid Shared Object Symbol
+ 97.58% 0.21% 48 xample.sudogame 11546 11546 /system/lib64/liblog.so __android_log_print
+ |
+ -- __android_log_print
+ |
+ |--99.70%-- checkValid(Board const&, int, int)
+ | |
+ | |--99.31%-- canFindSolution_r(Board&, int, int)
+ ...
+
+#### Report using report.py
+The call graph generated by simpleperf report may be hard to read in text mode.
+Simpleperf provides a python script showing GUI of call graph.
+It can be used as below.
+
+ # Show call graph in GUI.
+ $adb shell run-as com.example.sudogame ./simpleperf report -n -g --symfs . >perf.report
+ $python report.py perf.report
+
+# Steps to profile java code on rooted devices
+Simpleperf only supports profiling native instructions in binaries in ELF
+format. If the java code is executed by interpreter, or with jit cache, it
+can’t be profiled by simpleperf. As Android supports Ahead-of-time compilation,
+it can compile java bytecode into native instructions. We currently need root
+privilege to force Android fully compiling java code into native instructions
+in ELF binaries with debug information (this could be fixed by a
+profileable=”true” in AndroidManifest that causes PackageManager to pass -g to
+dex2oat). We also need root privilege to read compiled native binaries
+(because installd writes them to a directory whose uid/gid is system:install).
+So profiling java code can currently only be done on rooted devices.
+
+## 1. Fully compile java code into native instructions
+### On Android N
+
+ # Restart adb as root. It needs root privilege to setprop below.
+ $adb root
+ # Set the property to compile with debug information.
+ $adb shell setprop dalvik.vm.dex2oat-flags -g
+
+ # Fully compile the app instead of using interpreter or jit.
+ $adb shell cmd package compile -f -m speed com.example.sudogame
+
+ # Restart the app on device.
+
+### On Android M
+
+ # Restart adb as root. It needs root privilege to setprop below.
+ $adb root
+ # Set the property to compile with debug information.
+ $adb shell setprop dalvik.vm.dex2oat-flags -g
+
+ # Reinstall the app.
+ $adb install -r app-debug.apk
+
+### On Android L
+
+ # Restart adb as root. It needs root privilege to setprop below.
+ $adb root
+ # Set the property to compile with debug information.
+ $adb shell setprop dalvik.vm.dex2oat-flags --include-debug-symbols
+
+ # Reinstall the app.
+ $adb install -r app-debug.apk
+
+## 2. Record perf.data
+
+ # Change to the app’s data directory.
+ $ adb root && adb shell
+ device# cd `run-as com.example.sudogame pwd`
+
+ # Record as root as simpleperf needs to read the generated native binary.
+ device#./simpleperf record -t 25636 -g --symfs . -f 1000 --duration 30
+ simpleperf I 01-02 07:18:20 27182 27182 cmd_record.cpp:323] Samples recorded: 23552. Samples lost: 39.
+
+ device#ls -lh perf.data
+ -rw-rw-rw- 1 root root 11M 2016-01-02 07:18 perf.data
+
+## 3. Report perf.data
+ # Report how samples distribute on different binaries.
+ device#./simpleperf report -n --sort dso
+ Cmdline: /data/data/com.example.sudogame/simpleperf record -t 25636 -g --symfs . -f 1000 --duration 30
+ Arch: arm64
+ Event: cpu-cycles (type 0, config 0)
+ Samples: 23552
+ Event count: 40662494587
+
+ Overhead Sample Shared Object
+ 85.73% 20042 [kernel.kallsyms]
+ 9.41% 2198 /system/lib64/libc.so
+ 2.29% 535 /system/lib64/libcutils.so
+ 0.95% 222 /data/app/com.example.sudogame-1/lib/arm64/libsudo-game-jni.so
+ ...
+ 0.04% 16 /system/lib64/libandroid_runtime.so
+ 0.03% 10 /data/app/com.example.sudogame-1/oat/arm64/base.odex
+ ...
+
+As in the report above, there are samples in
+/data/app/com.example.sudogame-1/oat/arm64/base.odex, which is the native binary
+compiled from java code.
+
+ # Report call graph.
+ device#./simpleperf report -n -g --symfs .
+ Cmdline: /data/data/com.example.sudogame/simpleperf record -t 25636 -g --symfs . -f 1000 --duration 30
+ Arch: arm64
+ Event: cpu-cycles (type 0, config 0)
+ Samples: 23552
+ Event count: 40662494587
+
+ Children Self Sample Command Pid Tid Shared Object Symbol
+ 98.32% 0.00% 1 xample.sudogame 25636 25636 /data/app/com.example.sudogame-1/oat/arm64/base.odex void com.example.sudogame.GameModel.reInit()
+ |
+ -- void com.example.sudogame.GameModel.reInit()
+ |
+ |--98.98%-- boolean com.example.sudogame.GameModel.canFindSolution(int[][])
+ | Java_com_example_sudogame_GameModel_canFindSolution
+ | |
+ | |--99.93%-- canFindSolution(Board&)
+ ...
+
+As in the report above, reInit() and canFindSolution() are java
+functions.