<?xml version="1.0" encoding='ISO-8859-1'?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">

<book id="oprofile-internals">
<bookinfo>
	<title>OProfile Internals</title>
 
	<authorgroup>
		<author>
			<firstname>John</firstname>
			<surname>Levon</surname>
			<affiliation>
				<address><email>levon@movementarian.org</email></address>
			</affiliation>
		</author>
	</authorgroup>

	<copyright>
		<year>2003</year>
		<holder>John Levon</holder>
	</copyright>
</bookinfo>

<toc></toc>

<chapter id="introduction">
<title>Introduction</title>

<para>
This document is current for OProfile version <oprofileversion />.
This document provides some details on the internal workings of OProfile for the
interested hacker. It assumes a strong knowledge of C, a working knowledge of C++,
and some familiarity with kernel internals and CPU hardware.
</para>
<note>
<para>
Only the "new" implementation associated with kernel 2.6 and above is covered here. 2.4
uses a very different kernel module implementation and daemon to produce the sample files.
</para>
</note>

<sect1 id="overview">
<title>Overview</title>
<para>
OProfile is a statistical continuous profiler. In other words, profiles are generated by
regularly sampling the current registers on each CPU (from an interrupt handler, the
saved PC value at the time of interrupt is stored), and converting that runtime PC
value into something meaningful to the programmer.
</para>
<para>
OProfile achieves this by taking the stream of sampled PC values, along with the detail
of which task was running at the time of the interrupt, and converting it into a file offset
against a particular binary file. Because applications <function>mmap()</function>
the code they run (be it <filename>/bin/bash</filename>, <filename>/lib/libfoo.so</filename>
or whatever), it's possible to find the relevant binary file and offset by walking
the task's list of mapped memory areas. Each PC value is thus converted into a tuple
of binary-image,offset. This is something that the userspace tools can use directly
to reconstruct where the code came from, including the particular assembly instructions,
symbol, and source line (via the binary's debug information if present).
</para>
<para>
Regularly sampling the PC value like this approximates what was actually executed and
how often - more often than not, this statistical approximation is good enough to
reflect reality. In common operation, the time between each sample interrupt is regulated
by a fixed number of clock cycles. This implies that the results will reflect where
the CPU is spending the most time; this is obviously a very useful information source
for performance analysis.
</para>
<para>
Sometimes though, an application programmer needs different kinds of information: for example,
"which of the source routines cause the most cache misses ?". The rise in importance of
such metrics in recent years has led many CPU manufacturers to provide hardware performance
counters capable of measuring these events on the hardware level. Typically, these counters
increment once per each event, and generate an interrupt on reaching some pre-defined
number of events. OProfile can use these interrupts to generate samples: then, the
profile results are a statistical approximation of which code caused how many of the
given event.
</para>
<para>
Consider a simplified system that only executes two functions A and B. A
takes one cycle to execute, whereas B takes 99 cycles. Imagine we run at
100 cycles a second, and we've set the performance counter to create an
interrupt after a set number of "events" (in this case an event is one
clock cycle). It should be clear that the chance of the interrupt
occurring in function A is 1/100, and 99/100 for function B. Thus, we
statistically approximate the actual relative performance features of
the two functions over time. This same analysis works for other types of
events, provided that the interrupt is tied to the number of events
occurring (that is, after N events, an interrupt is generated).
</para>
<para>
There is typically more than one of these counters, so it's possible to set up profiling
for several different event types. Using these counters gives us a powerful, low-overhead
way of gaining performance metrics. If OProfile, or the CPU, does not support performance
counters, then a simpler method is used: the kernel timer interrupt feeds samples
into OProfile itself.
</para>
<para>
The rest of this document concerns itself with how we get from receiving samples at
interrupt time to producing user-readable profile information.
</para>
</sect1>

<sect1 id="components">
<title>Components of the OProfile system</title>

<sect2 id="arch-specific-components">
<title>Architecture-specific components</title>
<para>
If OProfile supports the hardware performance counters found on
a particular architecture, the code for setting up and managing
these counters can be found in the kernel source
tree in the relevant <filename>arch/<emphasis>arch</emphasis>/oprofile/</filename>
directory. The architecture-specific implementation works by
filling in the <varname>oprofile_operations</varname> structure at init time. This
provides a set of operations such as <function>setup()</function>,
<function>start()</function>, <function>stop()</function>, etc.
that manage the hardware-specific details of fiddling with the
performance counter registers.
</para>
<para>
The other important facility available to the architecture code is
<function>oprofile_add_sample()</function>.  This is where a particular sample
taken at interrupt time is fed into the generic OProfile driver code.
</para>
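<para>
As a rough illustration only (the exact members of
<varname>oprofile_operations</varname> vary between kernel versions), an
architecture driver might fill in the structure along these lines:
</para>
<screen>
/* Sketch, not the exact upstream code: the setup()/start()/stop()
 * members are the ones discussed above; cpu_type tells userspace
 * which event set to use. */
static int  nmi_setup(void) { /* program the counter registers */ return 0; }
static int  nmi_start(void) { /* enable the configured counters */ return 0; }
static void nmi_stop(void)  { /* disable the counters again */ }

static struct oprofile_operations nmi_ops = {
	.setup    = nmi_setup,
	.start    = nmi_start,
	.stop     = nmi_stop,
	.cpu_type = "i386/ppro",
};
</screen>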
</sect2>

<sect2 id="filesystem">
<title>oprofilefs</title>
<para>
OProfile implements a pseudo-filesystem known as "oprofilefs", mounted from
userspace at <filename>/dev/oprofile</filename>. This consists of small
files for reporting and receiving configuration from userspace, as well
as the actual character device from which the OProfile userspace daemon
receives samples. At <function>setup()</function> time, the architecture-specific code
may add further configuration files related to the details of the performance
counters. For example, on x86, one numbered directory for each hardware
performance counter is added, with files in each for the event type,
reset value, etc.
</para>
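<para>
On x86, for example, each per-counter directory typically contains small
files along these lines (the exact set of files depends on the
architecture):
</para>
<screen>
/dev/oprofile/0/enabled
/dev/oprofile/0/event
/dev/oprofile/0/count
/dev/oprofile/0/unit_mask
/dev/oprofile/0/kernel
/dev/oprofile/0/user
</screen>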
<para>
The filesystem also contains a <filename>stats</filename> directory with
a number of useful counters for various OProfile events.
</para>
</sect2>

<sect2 id="driver">
<title>Generic kernel driver</title>
<para>
This lives in <filename>drivers/oprofile/</filename>, and forms the core of
how OProfile works in the kernel. Its job is to take samples delivered
from the architecture-specific code (via <function>oprofile_add_sample()</function>),
and buffer this data, in a transformed form as described later, until releasing
the data to the userspace daemon via the <filename>/dev/oprofile/buffer</filename>
character device.
</para>
</sect2>

<sect2 id="daemon">
<title>The OProfile daemon</title>
<para>
The OProfile userspace daemon's job is to take the raw data provided by the
kernel and write it to the disk. It takes the single data stream from the
kernel and logs sample data against a number of sample files (found in
<filename>$SESSION_DIR/samples/current/</filename>, by default located at 
<filename>/var/lib/oprofile/samples/current/</filename>). For the benefit
of the "separate" functionality, the names/paths of these sample files
are mangled to reflect where the samples were from: this can include
thread IDs, the binary file path, the event type used, and more.
</para>
<para>
After this final step from interrupt to disk file, the data is now
persistent (that is, changes in the running of the system do not invalidate
stored data). So the post-profiling tools can run on this data at any
time (assuming the original binary files are still available and unchanged,
naturally).
</para>
</sect2>

<sect2 id="post-profiling">
<title>Post-profiling tools</title>
<para>
So far, we've collected data, but we've yet to present it in a useful form
to the user. This is the job of the post-profiling tools. In general form,
they collate a subset of the available sample files, load and process each one
correlated against the relevant binary file, and finally produce user-readable
information.
</para>
</sect2>

</sect1>

</chapter>

<chapter id="performance-counters">
<title>Performance counter management</title>

<sect1 id="performance-counters-ui">
<title>Providing a user interface</title>

<para>
The performance counter registers need programming in order to set the
type of event to count, etc. OProfile uses a standard model across all
CPUs for defining these events, as follows:
</para>
<informaltable frame="all">
<tgroup cols='2'> 
<tbody>
<row><entry><option>event</option></entry><entry>The event type, e.g. DATA_MEM_REFS</entry></row>
<row><entry><option>unit mask</option></entry><entry>The sub-events to count (more detailed specification)</entry></row>
<row><entry><option>counter</option></entry><entry>The hardware counter(s) that can count this event</entry></row>
<row><entry><option>count</option></entry><entry>The reset value (how many events before an interrupt)</entry></row>
<row><entry><option>kernel</option></entry><entry>Whether the counter should increment when in kernel space</entry></row>
<row><entry><option>user</option></entry><entry>Whether the counter should increment when in user space</entry></row>
</tbody>
</tgroup>
</informaltable>
<para>
The term "unit mask" is borrowed from the Intel architectures, and can
further specify exactly when a counter is incremented (for example,
cache-related events can be restricted to particular state transitions
of the cache lines).
</para>
<para>
All of the available hardware events and their details are specified in
the textual files in the <filename>events</filename> directory. The
syntax of these files should be fairly obvious. The user specifies the
names and configuration details of the chosen counters via
<command>opcontrol</command>. These are then written to the kernel
module (in numerical form) via <filename>/dev/oprofile/N/</filename>
where N is the physical hardware counter number (some events can only be used
on specific counters; OProfile hides these details from the user when
possible). On IA64, the perfmon-based interface behaves somewhat
differently, as described later.
</para>
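<para>
For illustration, an entry in one of the <filename>events</filename>
files looks roughly like the following (the field layout here is
approximate; see the real files for the authoritative syntax):
</para>
<screen>
event:0x43 counters:0,1 um:zero minimum:500 name:DATA_MEM_REFS : all memory references
</screen>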

</sect1>

<sect1 id="performance-counters-programming">
<title>Programming the performance counter registers</title>

<para>
We have described how the user interface fills in the desired
configuration of the counters and transmits the information to the
kernel. It is the job of the <function>-&gt;setup()</function> method
to actually program the performance counter registers. Clearly, the
details of how this is done are architecture-specific; it is also
model-specific on many architectures. For example, i386 provides methods
for each model type that program the counter registers correctly
(see the <filename>op_model_*</filename> files in
<filename>arch/i386/oprofile</filename> for the details). The method
reads the values stored in the virtual oprofilefs files and programs
the registers appropriately, ready for starting the actual profiling
session.
</para>
<para>
The architecture-specific drivers make sure to save the old register
settings before doing OProfile setup. They are restored when OProfile
shuts down. This is useful, for example, on i386, where the NMI watchdog
uses the same performance counter registers as OProfile; they cannot
run concurrently, but OProfile makes sure to restore the setup that was
in place before it started.
</para>
<para>
In addition to programming the counter registers themselves, other setup
is often necessary. For example, on i386, the local APIC needs
programming in order to make the counter's overflow interrupt appear as
an NMI (non-maskable interrupt). This allows sampling (and therefore
profiling) of regions where "normal" interrupts are masked, enabling
more reliable profiles.
</para>

<sect2 id="performance-counters-start">
<title>Starting and stopping the counters</title>
<para>
Initiating a profiling session is done via writing an ASCII '1'
to the file <filename>/dev/oprofile/enable</filename>. This sets up the
core, and calls into the architecture-specific driver to actually
enable each configured counter. Again, the details of how this is
done are model-specific (for example, the Athlon models can disable
or enable on a per-counter basis, unlike the PPro models).
</para>
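<para>
In other words (this is what <command>opcontrol</command> does on the
user's behalf; writing a '0' disables profiling again):
</para>
<screen>
# echo 1 &gt; /dev/oprofile/enable    # start the profiling session
# echo 0 &gt; /dev/oprofile/enable    # stop it again
</screen>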
</sect2>

<sect2>
<title>IA64 and perfmon</title>
<para>
The IA64 architecture provides a different interface from the other
architectures, using the existing perfmon driver. Register programming
is handled entirely in user-space (see
<filename>daemon/opd_perfmon.c</filename> for the details). A process
is forked for each CPU, which creates a perfmon context and sets the
counter registers appropriately via the
<function>sys_perfmonctl</function> interface. In addition, the actual
initiation and termination of the profiling session is handled via the
same interface using <constant>PFM_START</constant> and
<constant>PFM_STOP</constant>. On IA64, then, there are no oprofilefs
files for the performance counters, as the kernel driver does not
program the registers itself.
</para>
<para>
Instead, the perfmon driver for OProfile simply registers with the
OProfile core with an OProfile-specific UUID. During a profiling
session, the perfmon core calls into the OProfile perfmon driver and
samples are registered with the OProfile core itself as usual (with
<function>oprofile_add_sample()</function>).
</para>
</sect2>

</sect1>

</chapter>

<chapter id="collecting-samples">
<title>Collecting and processing samples</title>

<sect1 id="receiving-interrupts">
<title>Receiving interrupts</title>
<para>
Naturally, how the overflow interrupts are received is specific
to the hardware architecture, unless we are in "timer" mode, where the
logging routine is called directly from the standard kernel timer
interrupt handler.
</para>
<para>
On the i386 architecture, the local APIC is programmed such that when a
counter overflows (that is, it receives an event that causes an integer
overflow of the register value to zero), an NMI is generated. This calls
into the general handler <function>do_nmi()</function>; because OProfile
has registered itself as capable of handling NMI interrupts, this will
call into the OProfile driver code in
<filename>arch/i386/oprofile</filename>. Here, the saved PC value (the
CPU saves the register set at the time of interrupt on the stack
available for inspection) is extracted, and the counters are examined to
find out which one generated the interrupt. Also determined is whether
the system was inside kernel or user space at the time of the interrupt.
These three pieces of information are then forwarded on to the OProfile
core via <function>oprofile_add_sample()</function>. Finally, the
counter values are reset to the chosen count value, to ensure another
interrupt happens after another N events have occurred. Other
architectures behave in a similar manner.
</para>
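<para>
A heavily simplified sketch of this model-specific overflow check is
given below. The helpers <function>counter_overflowed()</function> and
<function>write_counter_reset()</function> are invented for
illustration; the real code lives in the
<filename>op_model_*</filename> files.
</para>
<screen>
/* Sketch only: check each counter, log a sample for any that
 * overflowed, then re-arm it so we interrupt after another N events. */
static int check_ctrs(struct pt_regs * const regs)
{
	unsigned int i;

	for (i = 0; i &lt; NUM_COUNTERS; ++i) {
		if (counter_overflowed(i)) {
			/* saved PC plus which counter fired */
			oprofile_add_sample(regs, i);
			/* write back the reset count */
			write_counter_reset(i);
		}
	}
	return 1;
}
</screen>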
</sect1>
 
<sect1 id="core-structure">
<title>Core data structures</title>
<para>
Before considering what happens when we log a sample, we shall digress
for a moment and look at the general structure of the data collection
system.
</para>
<para>
OProfile maintains a small buffer for storing the logged samples for
each CPU on the system. Only this buffer is altered when we actually log
a sample (remember, we may still be in an NMI context, so no locking is
possible). The buffer is managed by a two-handed system; the "head"
iterator dictates where the next sample data should be placed in the
buffer. Of course, overflow of the buffer is possible, in which case
the sample is discarded.
</para>
<para>
It is critical to remember that at this point, the PC value is an
absolute value, and is therefore only meaningful in the context of which
task it was logged against. Thus, these per-CPU buffers also maintain
details of which task each logged sample is for, as described in the
next section. In addition, we store whether the sample was in kernel
space or user space (on some architectures and configurations, the address
space is not sub-divided neatly at a specific PC value, so we must store
this information).
</para>
<para>
As well as these small per-CPU buffers, we have a considerably larger
single buffer. This holds the data that is eventually copied out into
the OProfile daemon. On certain system events, the per-CPU buffers are
processed and entered (in mutated form) into the main buffer, known in
the source as the "event buffer". The "tail" iterator indicates the
point from which the CPU may be read, up to the position of the "head"
iterator. This provides an entirely lock-free method for extracting data
from the CPU buffers. This process is described in detail later in this chapter.
</para>
<figure><title>The OProfile buffers</title>
<graphic fileref="buffers.png" />
</figure>
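<para>
The rough shape of the per-CPU buffer (the field names here are
indicative only; see <filename>drivers/oprofile/cpu_buffer.h</filename>
for the real definition) is:
</para>
<screen>
struct op_sample {
	unsigned long eip;     /* the sampled PC value */
	unsigned long event;   /* which counter generated the sample */
};

struct oprofile_cpu_buffer {
	struct task_struct * last_task;  /* for logging task switches */
	int last_is_kernel;              /* for logging kernel/user switches */
	unsigned long head_pos;          /* where the interrupt handler writes */
	unsigned long tail_pos;          /* where sync_buffer() reads from */
	unsigned long sample_lost_overflow;  /* samples discarded on overflow */
	struct op_sample * buffer;
};
</screen>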
</sect1>

<sect1 id="logging-sample">
<title>Logging a sample</title>
<para>
As mentioned, the sample is logged into the buffer specific to the
current CPU. The CPU buffer is a simple array of pairs of unsigned long
values; for a sample, they hold the PC value and the counter for the
sample. (The counter value is later used to translate back into the relevant
event type the counter was programmed to).
</para>
<para>
In addition to logging the sample itself, we also log task switches.
This is simply done by storing the address of the last task to log a
sample on that CPU in a data structure, and writing a task switch entry
into the buffer if the value of <function>current()</function> has
changed. Note that later we will directly de-reference this pointer;
this imposes certain restrictions on when and how the CPU buffers need
to be processed.
</para>
<para>
Finally, as mentioned, we log whether we have changed between kernel and
userspace using a similar method. Both of these variables
(<varname>last_task</varname> and <varname>last_is_kernel</varname>) are
reset when the CPU buffer is read.
</para>
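<para>
Putting the above together, the logging path amounts to something like
the following sketch (<function>add_code()</function> and
<function>add_sample()</function> stand in for writing an escape-coded
entry and a PC/counter pair respectively; the real code also handles
buffer overflow):
</para>
<screen>
void oprofile_add_sample(struct pt_regs * const regs, unsigned long event)
{
	struct oprofile_cpu_buffer * cpu_buf = &amp;cpu_buffer[smp_processor_id()];
	unsigned long pc = instruction_pointer(regs);
	int is_kernel = !user_mode(regs);

	if (cpu_buf-&gt;last_is_kernel != is_kernel) {
		cpu_buf-&gt;last_is_kernel = is_kernel;
		add_code(cpu_buf, is_kernel);               /* kernel/user switch */
	}
	if (cpu_buf-&gt;last_task != current) {
		cpu_buf-&gt;last_task = current;
		add_code(cpu_buf, (unsigned long)current);  /* task switch */
	}
	add_sample(cpu_buf, pc, event);                     /* the sample itself */
}
</screen>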
</sect1>

<sect1 id="logging-stack">
<title>Logging stack traces</title>
<para>
OProfile can also provide statistical samples of call chains (on x86). To
do this, at sample time, the frame pointer chain is traversed, recording
the return address for each stack frame. This will only work if the code
was compiled with frame pointers, but we're careful to abort the
traversal if the frame pointer appears bad. We store the set of return
addresses straight into the CPU buffer. Note that, since this traversal
is keyed off the standard sample interrupt, the number of times a
function appears in a stack trace is not an indicator of how many times
the call site was executed: rather, it's related to the number of
samples we took where that call site was involved. Thus, the results for
stack traces are not necessarily proportional to the call counts:
typical programs will have many <function>main()</function> samples.
</para>
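<para>
A minimal sketch of such a frame-pointer walk follows; the layout
matches the standard i386 calling convention, and
<function>frame_pointer_is_valid()</function> is an invented stand-in
for the sanity checks mentioned above:
</para>
<screen>
struct frame_head {
	struct frame_head * ebp;   /* the caller's saved frame pointer */
	unsigned long ret;         /* the return address */
};

static void traverse_frames(struct frame_head * head, unsigned int depth)
{
	while (depth-- &amp;&amp; frame_pointer_is_valid(head)) {
		oprofile_add_trace(head-&gt;ret);  /* record the return address */
		head = head-&gt;ebp;               /* step to the caller's frame */
	}
}
</screen>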
</sect1>

<sect1 id="synchronising-buffers">
<title>Synchronising the CPU buffers to the event buffer</title>
<!-- FIXME: update when percpu patch goes in -->
<para>
At some point, we have to process the data in each CPU buffer and enter
it into the main (event) buffer. The file
<filename>buffer_sync.c</filename> contains the relevant code. We
periodically (currently every <constant>HZ</constant>/4 jiffies) start
the synchronisation process. In addition, we process the buffers on
certain events, such as an application calling
<function>munmap()</function>. This is particularly important for
<function>exit()</function> - because the CPU buffers contain pointers
to the task structure, if we don't process all the buffers before the
task is actually destroyed and the task structure freed, then we could
end up trying to dereference a bogus pointer in one of the CPU buffers.
</para>
<para>
We also add a notification when a kernel module is loaded; this is so
that user-space can re-read <filename>/proc/modules</filename> to
determine the load addresses of kernel module text sections. Without
this notification, samples for a newly-loaded module could get lost or
be attributed to the wrong module.
</para>
<para>
The synchronisation itself works in the following manner: first, mutual
exclusion on the event buffer is taken. Remember, we do not need to do
that for each CPU buffer, as we only read from the tail iterator (interrupts
might still be adding samples to the same buffer, but they write at
the position of the head iterator, leaving previously written entries
intact). Then, we process each CPU buffer in turn. A CPU switch
notification is added to the buffer first (for
<option>--separate=cpu</option> support). Then the processing of the
actual data starts.
</para>
<para>
As mentioned, the CPU buffer consists of task switch entries and the
actual samples. When the routine <function>sync_buffer()</function> sees
a task switch, the process ID and process group ID are recorded into the
event buffer, along with a dcookie (see below) identifying the
application binary (e.g. <filename>/bin/bash</filename>). The
<varname>mmap_sem</varname> for the task is then taken, to allow safe
iteration across the task's list of mapped areas. Each sample is then
processed as described in the next section.
</para>
<para>
After a buffer has been read, the tail iterator is updated to reflect
how much of the buffer was processed. Note that when we determined how
much data there was to read in the CPU buffer, we also called
<function>cpu_buffer_reset()</function> to reset
<varname>last_task</varname> and <varname>last_is_kernel</varname>, as
we've already mentioned. During the processing, more samples may have
been arriving in the CPU buffer; this is OK because we are careful to
only update the tail iterator to how much we actually read - on the next
buffer synchronisation, we will start again from that point.
</para>
</sect1>

<sect1 id="dentry-cookies">
<title>Identifying binary images</title>
<para>
In order to produce useful profiles, we need to be able to associate a
particular PC value sample with an actual ELF binary on the disk. This
leaves us with the problem of how to export this information to
user-space. We create unique IDs that identify a particular directory
entry (dentry), and write those IDs into the event buffer. Later on,
the user-space daemon can call the <function>lookup_dcookie</function>
system call, which looks up the ID and fills in the full path of
the binary image in the buffer user-space passes in. These IDs are
maintained by the code in <filename>fs/dcookies.c</filename>; the
cache lasts for as long as the daemon has the event buffer open.
</para>
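<para>
There is no glibc wrapper for <function>lookup_dcookie</function>, so
the daemon invokes the raw system call; roughly as follows (on 32-bit
ABIs the 64-bit cookie may need to be split across two arguments):
</para>
<screen>
#include &lt;sys/syscall.h&gt;
#include &lt;unistd.h&gt;

/* Returns the length of the path written into buf, or a negative
 * errno value. */
static int resolve_dcookie(unsigned long long cookie, char * buf, size_t size)
{
	return syscall(SYS_lookup_dcookie, cookie, buf, size);
}
</screen>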
</sect1>

<sect1 id="finding-dentry">
<title>Finding a sample's binary image and offset</title>
<para>
We haven't yet described how we process the absolute PC value into
something usable by the user-space daemon. When we find a sample entered
into the CPU buffer, we traverse the list of mappings for the task
(remember, we will have seen a task switch earlier, so we know which
task's lists to look at). When a mapping is found that contains the PC
value, we look up the mapped file's dentry in the dcookie cache. This
gives the dcookie ID that will uniquely identify the mapped file. Then
we alter the absolute value such that it is an offset from the start of
the file being mapped (the mapping need not start at the start of the
actual file, so we have to consider the offset value of the mapping). We
store this dcookie ID into the event buffer; this identifies which
binary the samples following it are against.
In this manner, we have converted a PC value, which has transitory
meaning only, into a static offset value for later processing by the
daemon.
</para>
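<para>
A sketch of that conversion is shown below;
<function>dcookie_for()</function> is a hypothetical helper standing in
for the dcookie lookup described in the previous section:
</para>
<screen>
/* Find the mapping containing pc, note its dcookie, and convert the
 * absolute PC into an offset within the mapped file. */
static unsigned long pc_to_offset(struct mm_struct * mm, unsigned long pc,
                                  unsigned long * cookie)
{
	struct vm_area_struct * vma;

	for (vma = mm-&gt;mmap; vma; vma = vma-&gt;vm_next) {
		if (pc &lt; vma-&gt;vm_start || pc &gt;= vma-&gt;vm_end)
			continue;
		if (!vma-&gt;vm_file)
			continue;                 /* anonymous mapping: no binary */
		*cookie = dcookie_for(vma-&gt;vm_file);
		/* the mapping need not start at offset 0 of the file */
		return pc - vma-&gt;vm_start + (vma-&gt;vm_pgoff &lt;&lt; PAGE_SHIFT);
	}
	return 0;   /* no mapping found for this PC */
}
</screen>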
<para>
We also attempt to avoid the relatively expensive lookup of the dentry
cookie value by storing the cookie value directly into the dentry
itself; then we can simply derive the cookie value immediately when we
find the correct mapping.
</para>
</sect1>

</chapter>

<chapter id="sample-files">
<title>Generating sample files</title>

<sect1 id="processing-buffer">
<title>Processing the buffer</title>

<para>
Now we can move onto user-space in our description of how raw interrupt
samples are processed into useful information. As we described in
previous sections, the kernel OProfile driver creates a large buffer of
sample data consisting of offset values, interspersed with
notification of changes in context. These context changes indicate how
following samples should be attributed, and include task switches, CPU
changes, and which dcookie the sample value is against. By processing
this buffer entry-by-entry, we can determine where the samples should
be attributed to. This is particularly important when using the
<option>--separate</option> option.
</para>
<para>
The file <filename>daemon/opd_trans.c</filename> contains the basic routine
for the buffer processing. The <varname>struct transient</varname>
structure is used to hold changes in context. Its members are modified
as we process each entry; it is passed into the routines in
<filename>daemon/opd_sfile.c</filename> for actually logging the sample
to a particular sample file (which will be held in
<filename>$SESSION_DIR/samples/current</filename>).
</para>
<para>
The buffer format is designed for conciseness, as high sampling rates
can easily generate a lot of data. Thus, context changes are prefixed
by an escape code, identified by <function>is_escape_code()</function>.
If an escape code is found, the next entry in the buffer identifies
what type of context change is being read. These are handed off to
various handlers (see the <varname>handlers</varname> array), which
modify the transient structure as appropriate. If it's not an escape
code, then it must be a PC offset value, and the very next entry will
be the numeric hardware counter. These values are read and recorded
in the transient structure; we then do a lookup to find the correct
sample file, and log the sample, as described in the next section.
</para>
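<para>
In outline, the processing loop looks like this
(<function>pop()</function> and <function>words_left()</function> are
invented helpers for reading the next word of the buffer, the transient
members are simplified, and <function>sfile_log_sample()</function>
names the step of logging into <filename>daemon/opd_sfile.c</filename>;
<function>is_escape_code()</function> and the
<varname>handlers</varname> table are the real entities described
above):
</para>
<screen>
static void process_buffer(struct transient * trans)
{
	while (words_left(trans) &gt; 1) {
		unsigned long code = pop(trans);
		if (is_escape_code(code)) {
			/* the next word identifies the context change */
			handlers[pop(trans)](trans);
		} else {
			/* a PC offset, followed by the hardware counter number */
			trans-&gt;pc    = code;
			trans-&gt;event = pop(trans);
			sfile_log_sample(trans);
		}
	}
}
</screen>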

<sect2 id="handling-kernel-samples">
<title>Handling kernel samples</title>

<para>
Samples from kernel code require a little special handling. Because
the binary text which the sample is against does not correspond to
any file that the kernel directly knows about, the OProfile driver
stores the absolute PC value in the buffer, instead of the file offset.
Of course, we need an offset against some particular binary. To handle
this, we keep a list of loaded modules by parsing
<filename>/proc/modules</filename> as needed. When a module is loaded,
a notification is placed in the OProfile buffer, and this triggers a
re-read. We store the module name, and the loading address and size.
This is also done for the main kernel image, as specified by the user.
The absolute PC value is matched against each address range, and
modified into an offset when the matching module is found. See 
<filename>daemon/opd_kernel.c</filename> for the details.
</para>

</sect2>


</sect1>

<sect1 id="sample-file-generation">
<title>Locating and creating sample files</title>

<para>
We have a sample value and its satellite data stored in a
<varname>struct transient</varname>, and we must locate an
actual sample file to store the sample in, using the context
information in the transient structure as a key. The transient data to
sample file lookup is handled in
<filename>daemon/opd_sfile.c</filename>. A hash is taken of the
transient values that are relevant (depending upon the setting of
<option>--separate</option>, some values might be irrelevant), and the
hash value is used to look up the list of currently open sample files.
Of course, the sample file might not be found, in which case we need
to create and open it.
</para>
<para>
OProfile uses a rather complex scheme for naming sample files, in order
to make selecting relevant sample files easier for the post-profiling
utilities. The exact details of the scheme are given in
<filename>oprofile-tests/pp_interface</filename>, but for now it will
suffice to remember that the filename will include only relevant
information for the current settings, taken from the transient data. A
fully-specified filename looks something like:
</para>
<computeroutput>
/var/lib/oprofile/samples/current/{root}/usr/bin/xmms/{dep}/{root}/lib/tls/libc-2.3.2.so/CPU_CLK_UNHALTED.100000.0.28082.28089.0
</computeroutput>
<para>
It should be clear that this identifies such information as the
application binary, the dependent (library) binary, the hardware event,
and the process and thread ID. Typically, not all this information is
needed, in which case some values may be replaced with the token
<filename>all</filename>.
</para>
<para>
The code that generates this filename and opens the file is found in
<filename>daemon/opd_mangling.c</filename>. You may have realised that
at this point, we do not have the binary image file names, only the
dcookie values. In order to determine a file name, a dcookie value is
looked up in the dcookie cache. This is to be found in
<filename>daemon/opd_cookie.c</filename>. Since dcookies are both
persistent and unique during a sampling session, we can cache the
values. If the value is not found in the cache, then we ask the kernel
to do the lookup from value to file name for us by calling
<function>lookup_dcookie()</function>. This looks up the value in a
kernel-side cache (see <filename>fs/dcookies.c</filename>) and returns
the fully-qualified file name to userspace.
</para>

</sect1>

<sect1 id="sample-file-writing">
<title>Writing data to a sample file</title>

<para>
Each specific sample file is a hashed collection, where the key is
the PC offset from the transient data, and the value is the number of
samples recorded against that offset. The files are
<function>mmap()</function>ed into the daemon's memory space. The code
that actually writes the sample to the sample file can be found in
<filename>libdb/</filename>.
</para>
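<para>
Conceptually, logging one sample is just a keyed increment into that
hashed collection. The sketch below abbreviates the real
<filename>libdb/</filename> interface, whose exact function names and
signatures may differ:
</para>
<screen>
/* open_sample_file() stands in for the sfile lookup/creation described
 * earlier; the insert adds 1 to any count already stored for the key. */
odb_t * file = open_sample_file(sf, counter);
odb_insert(file, (unsigned long)pc_offset, 1);
</screen>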
<para>
For recording stack traces, we have a more complicated sample filename
mangling scheme that allows us to identify cross-binary calls. We use
the same sample file format, where the key is a 64-bit value composed
from the from,to pair of offsets.
</para>

</sect1>

</chapter>

<chapter id="output">
<title>Generating useful output</title>

<para>
All of the tools used to generate human-readable output have to take
roughly the same steps to collect the data for processing. First, the
profile specification given by the user has to be parsed. Next, a list
of sample files matching the specification has to be obtained. Using this
list, we need to locate the binary file for each sample file, and then
use them to extract meaningful data, before a final collation and
presentation to the user.
</para>

<sect1 id="profile-specification">
<title>Handling the profile specification</title>

<para>
The profile specification presented by the user is parsed in
the function <function>profile_spec::create()</function>. This
creates an object representing the specification. Then we
use <function>profile_spec::generate_file_list()</function>
to search for all sample files and match them against the
<varname>profile_spec</varname>.
</para>

<para>
To enable this matching process to work, the attributes of
each sample file are encoded in its filename. This is a low-tech
approach to matching specifications against candidate sample
files, but it works reasonably well. Typical sample file names
look like these:
</para>
<screen>
/var/lib/oprofile/samples/current/{root}/bin/ls/{dep}/{root}/bin/ls/{cg}/{root}/bin/ls/CPU_CLK_UNHALTED.100000.0.all.all.all
/var/lib/oprofile/samples/current/{root}/bin/ls/{dep}/{root}/bin/ls/CPU_CLK_UNHALTED.100000.0.all.all.all
/var/lib/oprofile/samples/current/{root}/bin/ls/{dep}/{root}/bin/ls/CPU_CLK_UNHALTED.100000.0.7423.7424.0
/var/lib/oprofile/samples/current/{kern}/r128/{dep}/{kern}/r128/CPU_CLK_UNHALTED.100000.0.all.all.all
</screen>
<para>
This looks unnecessarily complex, but it's actually fairly simple. First
we have the session directory of the sample, by default
<filename>/var/lib/oprofile/samples/current</filename>. This location
can be changed by specifying the <option>--session-dir</option> option on the command line.
This session could equally well be inside an archive from <command>oparchive</command>.
Next we have one of the tokens <filename>{root}</filename> or
<filename>{kern}</filename>. <filename>{root}</filename> indicates
that the binary is found on a file system, and we will encode its path
in the next section (e.g. <filename>/bin/ls</filename>).
<filename>{kern}</filename> indicates a kernel module - on 2.6 kernels
the path information is not available from the kernel, so we have to
special-case kernel modules like this; we encode merely the name of the
module as loaded.
</para>
<para>
Next there is a <filename>{dep}</filename> token, indicating another
token/path which identifies the dependent binary image. This is used even for
the "primary" binary (i.e. the one that was
<function>execve()</function>d), as it simplifies processing. Finally,
if this sample file is a normal flat profile, the actual file is next in
the path. If it's a call-graph sample file, we need one further
specification, to allow us to identify cross-binary arcs in the call
graph.
</para>
<para>
The actual sample file name is dot-separated, where the fields are, in
order: event name, event count, unit mask, task group ID, task ID, and
CPU number.
</para>
<para>
This sample file name can be reliably parsed (with
<function>parse_filename()</function>) into a
<varname>filename_spec</varname>. Finally, we can check whether to
include the sample file in the final results by comparing this
<varname>filename_spec</varname> against the
<varname>profile_spec</varname> the user specified (for the interested,
see <function>valid_candidate()</function> and
<function>profile_spec::match</function>). Then comes the really
complicated bit...
</para>

</sect1>

<sect1 id="sample-file-collating">
<title>Collating the candidate sample files</title>

<para>
At this point we have a duplicate-free list of sample files we need
to process. But first we need to do some further arrangement: we
need to classify each sample file, and we may also need to "invert"
the profiles.
</para>

<sect2 id="sample-file-classifying">
<title>Classifying sample files</title>

<para>
It's possible for utilities like <command>opreport</command> to show 
data in columnar format: for example, we might want to show the results
of two threads within a process side-by-side. To do this, we need
to classify each sample file into classes - each class corresponds
to an <command>opreport</command> column. The function that handles
this is <function>arrange_profiles()</function>. Each sample file
is added to a particular class. If the sample file is the first in
its class, a template is generated from the sample file. Each template
describes a particular class (thus, in our example above, each template
will have a different thread ID, and this uniquely identifies each
class).
</para>

<para>
Each class has a list of "profile sets" matching that class's template.
A profile set is either a profile of the primary binary image, or any of
its dependent images. After all sample files have been listed in one of
the profile sets belonging to the classes, we have to name each class and
perform error-checking. This is done by
<function>identify_classes()</function>; each class is checked to ensure
that its "axis" is the same as all the others. This is needed because
<command>opreport</command> can't produce results in 3D format: we can
only differ in one aspect, such as thread ID or event name.
</para>

</sect2>

<sect2 id="sample-file-inverting">
<title>Creating inverted profile lists</title>

<para>
Remember that if we're using certain profile separation options, such as
"--separate=lib", a single binary could be a dependent image to many
different binaries. For example, the C library image would be a
dependent image for most programs that have been profiled. As it
happens, this can cause severe performance problems: without some
re-arrangement, these dependent binary images would be opened each
time we need to process sample files for each program.
</para>

<para>
The solution is to "invert" the profiles via
<function>invert_profiles()</function>. We create a new data structure
where the dependent binary is first, and the primary binary images using
that dependent binary are listed as sub-images. This helps our
performance problem, as now we only need to open each dependent image
once, when we process the list of inverted profiles.
</para>

</sect2>

</sect1>

<sect1 id="generating-profile-data">
<title>Generating profile data</title>

<para>
Things don't get any simpler here, unfortunately. At this point
we've collected and classified the sample files into the set of inverted
profiles, as described in the previous section. Now we need to process
each inverted profile and make something of the data. The entry point
for this is <function>populate_for_image()</function>.
</para>

<sect2 id="bfd">
<title>Processing the binary image</title>
<para>
The first thing we do with an inverted profile is attempt to open the
binary image (remember each inverted profile set is only for one binary
image, but may have many sample files to process). The
<varname>op_bfd</varname> class provides an abstracted interface to
this; internally it uses <filename>libbfd</filename>. The main purpose
of this class is to process the symbols for the binary image; this is
also where symbol filtering happens. This is actually quite tricky, but
should be clear from the source.
</para>
</sect2>

<sect2 id="processing-sample-files">
<title>Processing the sample files</title>
<para>
The class <varname>profile_container</varname> is a hold-all that
contains all the processed results. It is a container of
<varname>profile_t</varname> objects. The
<function>add_sample_files()</function> method uses
<filename>libdb</filename> to open the given sample file and add the
key/value types to the <varname>profile_t</varname>. Once this has been
done, <function>profile_container::add()</function> is passed the
<varname>profile_t</varname> plus the <varname>op_bfd</varname> for
processing.
</para>
<para>
<function>profile_container::add()</function> walks through the symbols
collected in the <varname>op_bfd</varname>.
<function>op_bfd::get_symbol_range()</function> gives us the start and
end of the symbol as an offset from the start of the binary image,
then we interrogate the <varname>profile_t</varname> for the relevant samples
for that offset range. We create a <varname>symbol_entry</varname>
object for this symbol and fill it in. If needed, here we also collect
debug information from the <varname>op_bfd</varname>, and possibly
record the detailed sample information (as used by <command>opreport
-d</command> and <command>opannotate</command>).
Finally the <varname>symbol_entry</varname> is added to
a private container of <varname>profile_container</varname> - this
<varname>symbol_container</varname> holds all such processed symbols.
</para>
</sect2>

</sect1>

<sect1 id="generating-output">
<title>Generating output</title>

<para>
After the processing described in the previous section, we've now got
full details of what we need to output stored in the
<varname>profile_container</varname> on a symbol-by-symbol basis. To
produce output, we need to replay that data and format it suitably.
</para>
<para>
<command>opreport</command> first asks the
<varname>profile_container</varname> for a
<varname>symbol_collection</varname> (this is also where thresholding
happens).
This is sorted, then an
<varname>opreport_formatter</varname> is initialised.
This object initialises a set of field formatters as requested. Then
<function>opreport_formatter::output()</function> is called. This
iterates through the (sorted) <varname>symbol_collection</varname>;
for each entry, the selected fields (as set by the
<varname>format_flags</varname> options) are output by calling the
field formatters, with the <varname>symbol_entry</varname> passed in.
</para>

</sect1>

</chapter>

<chapter id="ext">
<title>Extended Feature Interface</title>

<sect1 id="ext-intro">
<title>Introduction</title>

<para>
The Extended Feature Interface is a standard callback interface
designed to allow extension of the OProfile daemon's sample processing.
Each feature defines a set of callback handlers which can be enabled or
disabled through an OProfile daemon command-line option.
This interface can be used to implement support for architecture-specific
features or features not commonly used by general OProfile users. 
</para>

</sect1>

<sect1 id="ext-name-and-handlers">
<title>Feature Name and Handlers</title>

<para>
Each extended feature has an entry in the <varname>ext_feature_table</varname>
in <filename>opd_extended.cpp</filename>. Each entry contains a feature name,
and a corresponding set of handlers. The feature name is a unique string, which is
used to identify the feature in the table. Each feature provides a set
of handlers, which will be executed by the OProfile daemon from pre-determined
locations to perform certain tasks. At runtime, the OProfile daemon calls a feature
handler wrapper from one of the predetermined locations to check whether
an extended feature is enabled, and whether a particular handler exists.
Only the handlers of the enabled feature will be executed.
</para>
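<para>
The table and its handler sets have roughly the following shape (the
member names here are illustrative, not necessarily the exact ones in
<filename>opd_extended.cpp</filename>):
</para>
<screen>
struct opd_ext_handlers {
	int (*ext_init)(char const * args);      /* --ext-feature args */
	int (*ext_print_stats)(void);
	struct opd_ext_sfile_handlers * ext_sfile;
};

struct opd_ext_feature {
	char const * feature;                    /* e.g. "ibs" */
	struct opd_ext_handlers * handlers;
};

static struct opd_ext_feature ext_feature_table[] = {
	{ "ibs", &amp;ibs_handlers },
	{ NULL, NULL }
};
</screen>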

</sect1>

<sect1 id="ext-enable">
<title>Enabling Features</title>

<para>
Each feature is enabled using the OProfile daemon (oprofiled) command-line
option "--ext-feature=&lt;extended-feature-name&gt;:[args]". The
"extended-feature-name" is used to determine the feature to be enabled.
The optional "args" is passed into the feature-specific initialization handler
(<function>ext_init</function>). Currently, only one extended feature can be
enabled at a time.
</para>

</sect1>

<sect1 id="ext-types-of-handlers">
<title>Types of Handlers</title>

<para>
Each feature is responsible for providing its own set of handlers.
Types of handler are:
</para>

<sect2 id="ext_init">
<title>ext_init Handler</title>

<para>
"ext_init" handles initialization of an extended feature. It takes
"args" parameter which is passed in through the "oprofiled --ext-feature=&lt;
extended-feature-name&gt;:[args]". This handler is executed in the function
<function>opd_options()</function> in the file <filename>daemon/oprofiled.c
</filename>.
</para>

<note>
<para>
The ext_init handler is required for all features.
</para>
</note>

</sect2>

<sect2 id="ext_print_stats">
<title>ext_print_stats Handler</title>

<para>
"ext_print_stats" handles the extended feature statistics report. It adds
a new section to the OProfile daemon statistics report, which is normally
output to the file
<filename>/var/lib/oprofile/samples/oprofiled.log</filename>.
This handler is executed in the function <function>opd_print_stats()</function>
in the file <filename>daemon/opd_stats.c</filename>.
</para>

</sect2>

<sect2 id="ext_sfile_handlers">
<title>ext_sfile Handler</title>

<para>
"ext_sfile" contains a set of handlers related to operations on the extended
sample files (sample files for events related to the extended feature).
These operations include <function>create_sfile()</function>,
<function>sfile_dup()</function>, <function>close_sfile()</function>,
<function>sync_sfile()</function>, and <function>get_file()</function>
as defined in <filename>daemon/opd_sfile.c</filename>.
An additional field, <varname>odb_t * ext_file</varname>, is added to the 
<varname>struct sfile</varname> for storing extended sample file
information.

</para>

</sect2>

</sect1>

<sect1 id="ext-implementation">
<title>Extended Feature Reference Implementation</title>

<sect2 id="ext-ibs">
<title>Instruction-Based Sampling (IBS)</title>

<para>
An example of extended feature implementation can be seen by
examining the AMD Instruction-Based Sampling support.
</para>

<sect3 id="ibs-init">
<title>IBS Initialization</title>

<para>
Instruction-Based Sampling (IBS) is a new performance measurement technique
available on AMD Family 10h processors. Enabling IBS profiling is done simply
by specifying IBS performance events through the "--event=" option.
</para>

<screen>
opcontrol --event=IBS_FETCH_XXX:&lt;count&gt;:&lt;um&gt;:&lt;kernel&gt;:&lt;user&gt;
opcontrol --event=IBS_OP_XXX:&lt;count&gt;:&lt;um&gt;:&lt;kernel&gt;:&lt;user&gt;

Note: * Count and unitmask for all IBS fetch events must be the same,
	as must those for IBS op events.
</screen>

<para>
IBS performance events are listed by <command>opcontrol --list-events</command>.
When users specify these events, opcontrol verifies them using ophelp, which
checks for the <varname>ext:ibs_fetch</varname> or <varname>ext:ibs_op</varname>
tag in the <filename>events/x86-64/family10/events</filename> file.
Then, it configures the driver interface (/dev/oprofile/ibs_fetch/... and
/dev/oprofile/ibs_op/...) and starts the OProfile daemon as follows.
</para>

<screen>
oprofiled \
    --ext-feature=ibs:\
	fetch:&lt;IBS_FETCH_EVENT1&gt;,&lt;IBS_FETCH_EVENT2&gt;,...,:&lt;IBS fetch count&gt;:&lt;IBS Fetch um&gt;|\
	op:&lt;IBS_OP_EVENT1&gt;,&lt;IBS_OP_EVENT2&gt;,...,:&lt;IBS op count&gt;:&lt;IBS op um&gt;
</screen>

<para>
Here, the OProfile daemon parses the <varname>--ext-feature</varname>
option and checks the feature name ("ibs") before calling
the initialization function to handle the string
containing IBS events, counts, and unitmasks.
Then, it stores each event in the IBS virtual-counter table
(<varname>struct opd_event ibs_vc[OP_MAX_IBS_COUNTERS]</varname>) and
stores the event index in the IBS Virtual Counter Index (VCI) map
(<varname>ibs_vci_map[OP_MAX_IBS_COUNTERS]</varname>) with IBS event value
as the map key.
</para>
</sect3>

<sect3 id="ibs-data-processing">
<title>IBS Data Processing</title>

<para>
During a profile session, the OProfile daemon identifies IBS samples in the 
event buffer using the <varname>"IBS_FETCH_CODE"</varname> or 
<varname>"IBS_OP_CODE"</varname>. These codes trigger the handlers 
<function>code_ibs_fetch_sample()</function> or 
<function>code_ibs_op_sample()</function> listed in the
<varname>handler_t handlers[]</varname> vector in 
<filename>daemon/opd_trans.c </filename>. These handlers are responsible for
processing IBS samples and translating them into IBS performance events.
</para>

<para>
Unlike traditional performance events, each IBS sample can be decoded into
multiple IBS performance events. For each event that the user specifies,
a combination of bits from Model-Specific Registers (MSR) are checked
against the bitmask defining the event. If the condition is met, the event
will then be recorded. The derivation logic is in the files
<filename>daemon/opd_ibs_macro.h</filename> and
<filename>daemon/opd_ibs_trans.[h,c]</filename>. 
</para>

</sect3>

<sect3 id="ibs-sample-file">
<title>IBS Sample File</title>

<para>
Traditionally, sample file information <varname>(odb_t)</varname> is stored
in the <varname>struct sfile::odb_t file[OP_MAX_COUNTER]</varname>.
Currently, <varname>OP_MAX_COUNTER</varname> is 8 on non-alpha, and 20 on
alpha-based systems. The event index (the counter number on which the event
is configured) is used to access the corresponding entry in the array.
Unlike traditional performance events, IBS does not use the actual
counter registers (i.e. <filename>/dev/oprofile/0,1,2,3</filename>).
Also, the number of performance events generated by IBS could be larger than
<varname>OP_MAX_COUNTER</varname> (currently up to 13 IBS-fetch and 46 IBS-op
events). Therefore IBS requires a special data structure and sfile
handlers (<varname>struct opd_ext_sfile_handlers</varname>) for managing
IBS sample files. IBS sample file information is stored in memory
allocated by the handler <function>ibs_sfile_create()</function>, which can
be accessed through <varname>struct sfile::odb_t * ext_files</varname>.
</para>

</sect3>

</sect2>

</sect1>

</chapter>

<glossary id="glossary">
<title>Glossary of OProfile source concepts and types</title>

<glossentry><glossterm>application image</glossterm>
<glossdef><para>
The primary binary image used by an application. This is derived
from the kernel and corresponds to the binary started upon running
an application: for example, <filename>/bin/bash</filename>.
</para></glossdef></glossentry>

<glossentry><glossterm>binary image</glossterm>
<glossdef><para>
An ELF file containing executable code: this includes kernel modules,
the kernel itself (a.k.a. <filename>vmlinux</filename>), shared libraries,
and application binaries.
</para></glossdef></glossentry>

<glossentry><glossterm>dcookie</glossterm>
<glossdef><para>
Short for "dentry cookie". A unique ID that can be looked up to provide
the full path name of a binary image.
</para></glossdef></glossentry>

<glossentry><glossterm>dependent image</glossterm>
<glossdef><para>
A binary image that is dependent upon an application, used with
per-application separation. Most commonly, shared libraries. For example,
if <filename>/bin/bash</filename> is running and we take
some samples inside the C library itself due to <command>bash</command>
calling library code, then the image <filename>/lib/libc.so</filename>
would be dependent upon <filename>/bin/bash</filename>.
</para></glossdef></glossentry>

<glossentry><glossterm>merging</glossterm>
<glossdef><para>
This refers to the ability to merge several distinct sample files
into one set of data at runtime, in the post-profiling tools. For example,
per-thread sample files can be merged into one set of data, because
they are compatible (i.e. the aggregation of the data is meaningful),
but it's not possible to merge sample files for two different events,
because there would be no useful meaning to the results.
</para></glossdef></glossentry>

<glossentry><glossterm>profile class</glossterm>
<glossdef><para>
A collection of profile data that has been collected under the same
class template. For example, if we're using <command>opreport</command>
to show results after profiling with two performance counters enabled,
counting <constant>DATA_MEM_REFS</constant> and <constant>CPU_CLK_UNHALTED</constant>,
there would be two profile classes, one for each event. Or if we're on
an SMP system and doing per-cpu profiling, and we request
<command>opreport</command> to show results for each CPU side-by-side,
there would be a profile class for each CPU.
</para></glossdef></glossentry>

<glossentry><glossterm>profile specification</glossterm>
<glossdef><para>
The parameters the user passes to the post-profiling tools that limit
what sample files are used. This specification is matched against
the available sample files to generate a selection of profile data.
</para></glossdef></glossentry>

<glossentry><glossterm>profile template</glossterm>
<glossdef><para>
The parameters that define what goes in a particular profile class.
This includes a symbolic name (e.g. "cpu:1") and the code-usable
equivalent.
</para></glossdef></glossentry>

</glossary>

</book>