gcc profiler internals
I occasionally encounter profiler bugs in gcc. Two related examples I remember are tauthon (python2 fork) and python ICEs in PGO build mode. PGO stands for profile-guided optimization.

I keep forgetting gcc internals around profiling handling and keep re-tracing the same steps every time it comes up. I decided to write a few words on it. I’ll cover a few generic details first and then we’ll explore today’s python bug.
using gcc profiler: -fprofile-generate / -fprofile-use

Let’s start from a trivial example. To build our program we run gcc on it:
$ gcc $CFLAGS prog.c -o prog
To build an optimized program using profile feedback we need to perform 3 steps:
$ gcc $CFLAGS -fprofile-generate prog.c -o prog
$ ./prog some-training-input-data
$ gcc $CFLAGS -fprofile-use prog.c -o prog
These steps are:
- build instrumented program
- run built program to get a profile
- build optimized program against generated profile
PGO build mode is not the only user of profile instrumentation. GCC allows gathering runtime execution stats using different types and formats (instrumentation options). A few frequent uses are:

- call graph style profiling (-pg option): used to explore the call graph in gprof format.
- execution statistics coverage (-ftest-coverage option): used to extract close to line-by-line execution coverage in gcno format.
- execution statistics profiles (-fprofile-generate): used to guide optimization in future compilations (PGO) in gcda format.
All these modes are not fundamentally different: they share the implementation and the in-memory/on-disk format. I’ll look at the gcda form. We’ll use the following sample code as our running example:
#include <stddef.h>
/* Keep the functions around to have real indirect calls. */
#define DECL_F(fn) static void fn(void) __attribute__((noipa))
#define DEF_F(fn) static void fn(void) {}
DECL_F(f0); DEF_F(f0);
DECL_F(f1); DEF_F(f1);
DECL_F(f2); DEF_F(f2);

int main (int argc, char *argv[]) {
    if (argc < 2) return 0;

    static const void(*ft[])(void) = { &f0, &f1, &f2, };
    size_t sz = sizeof (ft) / sizeof (ft[0]);

    for (const char * p = argv[1]; *p; p++)
        ft[*p % sz]();

    return 0;
}
This program accepts one string argument (argv[1]) and calls function f0(), f1(), or f2() somewhat randomly (based on char modulo) against each byte of the input string. The idea is that it’s not immediately obvious which of the functions is called most frequently.
This program has a few conditional branches and indirect calls. Let’s
profile it and see what stats we can collect. Building:
$ gcc a.c -o a -fprofile-generate -fprofile-arcs -ftest-coverage -O2
Running our program 5 times:
$ ./a
$ ./a 123456789
$ ./a aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
$ ./a aa
$ ./a aaaa
$ ls
a a.c a.gcda a.gcno
Note that each ./a run creates or updates a.gcda (the a.gcno file was produced at compile time by -ftest-coverage).
We can dump profiling statistics with gcov:
$ gcov -a -b -c -f a
$ cat a.c.gcov
-: 0:Source:a.c
-: 0:Graph:a.gcno
-: 0:Data:a.gcda
-: 0:Runs:5
-: 1:#include <stddef.h>
-: 2:
-: 3:/* Keep the functions around to have real indirect calls. */
-: 4:#define DECL_F(fn) static void fn(void) __attribute__((noipa))
-: 5:#define DEF_F(fn) static void fn(void) {}
-: 6:
function f0 called 3 returned 100% blocks executed 100%
3: 7:DECL_F(f0); DEF_F(f0);
function f1 called 53 returned 100% blocks executed 100%
53: 8:DECL_F(f1); DEF_F(f1);
function f2 called 3 returned 100% blocks executed 100%
3: 9:DECL_F(f2); DEF_F(f2);
-: 10:
function main called 5 returned 100% blocks executed 100%
5: 11:int main (int argc, char *argv[]) {
5: 12: if (argc < 2) return 0;
5: 12-block 0
branch 0 taken 4 (fallthrough)
branch 1 taken 1
-: 13:
-: 14: static const void(*ft[])(void) = { &f0, &f1, &f2, };
-: 15: size_t sz = sizeof (ft) / sizeof (ft[0]);
-: 16:
63: 17: for (const char * p = argv[1]; *p; p++)
4: 17-block 0
63: 17-block 1
branch 0 taken 59
branch 1 taken 4 (fallthrough)
59: 18: ft[*p % sz]();
59: 18-block 0
call 0 returned 59
-: 19:
-: 20: return 0;
-: 21:}
Here we see that:

- ft[*p % sz]() gets called 59 times: 3 (f0) + 53 (f1) + 3 (f2)
- function main() gets called 5 times: once per ./a invocation

We also see the frequency of the if (argc < 2) return 0; branches:

- taken 1 time (the ./a case, no arguments)
- not taken 4 times (all the other ./a ... calls)
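As a sanity check of those numbers, here is a small throwaway counter I wrote (it is not part of the original example): it reuses the same *p % sz index expression as a.c, but only tallies which slot of ft[] would be selected.

#include <stdio.h>
#include <stddef.h>

/* Toy re-implementation of the dispatch in a.c: instead of calling
   f0/f1/f2 it only tallies which slot of ft[] would be selected.
   It assumes plain ASCII inputs like the training strings above. */
int main (int argc, char *argv[])
{
    size_t sz = 3;                  /* sizeof (ft) / sizeof (ft[0]) in a.c */
    long hits[3] = { 0, 0, 0 };

    for (int i = 1; i < argc; i++)
        for (const char *p = argv[i]; *p; p++)
            hits[*p % sz]++;        /* same index expression as ft[*p % sz]() */

    printf ("f0: %ld, f1: %ld, f2: %ld\n", hits[0], hits[1], hits[2]);
    return 0;
}

Passing it the four non-empty training strings in one run prints f0: 3, f1: 53, f2: 3, matching the a.c.gcov report above.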
After profile collection we can build a new binary using the .gcda files by changing -fprofile-generate to -fprofile-use in our gcc calls:
$ gcc a.c -o a -fprofile-use -O2
Let’s compare the result against an unprofiled build:
$ gcc a.c -o a-unprof -O2
I’ll skip disassembly as instructions do not change materially (the program is too simple). The code order is slightly different in the two binaries. Let’s have a look:
$ nm -n a-unprof
...
0000000000400390 r __abi_tag
0000000000401000 T _init
0000000000401020 T main
0000000000401090 T _start
00000000004010c0 T _dl_relocate_static_pie
00000000004010d0 t deregister_tm_clones
0000000000401100 t register_tm_clones
0000000000401140 t __do_global_dtors_aux
0000000000401170 t frame_dummy
0000000000401180 t f0
0000000000401190 t f1
00000000004011a0 t f2
...
$ nm -n a
...
0000000000400390 r __abi_tag
0000000000401000 T _init
0000000000401020 T main
0000000000401080 t f1
0000000000401090 T _start
...
nm -n orders symbols by their in-memory addresses, so we can get an idea of what goes where. One can spot that in the unprofiled build main() is quite far away from the most frequently called function, f1(), while in the profiled build f1() immediately follows main().
In more practical scenarios the effects are more pronounced:

- likely-to-execute code is laid out in a way that does not require branching
- inlining can rely on execution statistics instead of function size heuristics
- -flto can perform global program code reordering to speed up binary loading from disk
- and many other effects
The main takeaway here is that PGO requires two full builds (-fprofile-generate and -fprofile-use) and a program training run.

In our example the training data is biased towards inputs with 'a' symbols and thus our program is probably slightly more efficient at handling those.
In PGO builds the trick is to find suitable training input data. A typical rookie mistake is to use ./prog --help as a training run. Do not do that: you will get a program optimized for printing its help text. Everything else will perform worse than a typical unprofiled build.
internals: gcov-dump tool

What exactly do these .gcda files contain? Let’s have a peek with the gcov-dump tool shipped with gcc:
$ gcov-dump a.gcda
a.gcda:data:magic `gcda':version `B12*'
a.gcda:stamp 1137774494
a.gcda: a1000000: 2:OBJECT_SUMMARY runs=5, sum_max=60
a.gcda: 01000000: 3:FUNCTION ident=108032747, lineno_checksum=0x0a7a17ea, cfg_checksum=0xc835c602
a.gcda: 01a10000: 8:COUNTERS arcs 4 counts
a.gcda: 01a90000: 16:COUNTERS indirect_call 8 counts
a.gcda: 01af0000: 2:COUNTERS time_profiler 1 counts
a.gcda: 01000000: 3:FUNCTION ident=1567133468, lineno_checksum=0xdadb6f0d, cfg_checksum=0xa43083b8
a.gcda: 01a10000: 2:COUNTERS arcs 1 counts
a.gcda: 01a90000: 0:COUNTERS indirect_call 0 counts
a.gcda: 01af0000: 2:COUNTERS time_profiler 1 counts
a.gcda: 01000000: 3:FUNCTION ident=1025457522, lineno_checksum=0x745daa69, cfg_checksum=0xa43083b8
a.gcda: 01a10000: 2:COUNTERS arcs 1 counts
a.gcda: 01a90000: 0:COUNTERS indirect_call 0 counts
a.gcda: 01af0000: 2:COUNTERS time_profiler 1 counts
a.gcda: 01000000: 3:FUNCTION ident=1634904005, lineno_checksum=0x432c0dd4, cfg_checksum=0xa43083b8
a.gcda: 01a10000: 2:COUNTERS arcs 1 counts
a.gcda: 01a90000: 0:COUNTERS indirect_call 0 counts
a.gcda: 01af0000: 2:COUNTERS time_profiler 1 counts
Here we see something about a.gcda contents. It has:

- 5 program runs (OBJECT_SUMMARY runs=5)
- 4 FUNCTIONs, each of which contains information about:
  - arcs: conditional or unconditional branches
  - indirect_call: indirect branch targets
  - time_profiler: count of times a basic block was executed
Note that functions are identified not by name, but by a triple:

- ident: name hash, calculated with coverage_compute_profile_id()
- lineno_checksum: source file name and line number hash, calculated with coverage_compute_lineno_checksum()
- cfg_checksum: control flow graph hash, calculated with coverage_compute_cfg_checksum()

This way the identifiers can survive minor source code changes. You need a .gcno file to resolve these obscure identifiers back to source line numbers.
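To build some intuition for why such identifiers are stable across rebuilds, here is a toy sketch (entirely my own simplification; gcc's real coverage_compute_profile_id()/coverage_compute_lineno_checksum() are more involved) that derives an id from the function name and a checksum from the file name plus definition line:

#include <stdint.h>
#include <stdio.h>

/* Toy FNV-1a hash; gcc's coverage_compute_*() functions use different,
   more involved algorithms, this only illustrates the idea. */
static uint32_t toy_hash (const char *s, uint32_t h)
{
    for (; *s; s++)
      {
        h ^= (uint8_t) *s;
        h *= 16777619u;
      }
    return h;
}

int main (void)
{
    /* "ident": derived from the function name only */
    uint32_t ident = toy_hash ("main", 2166136261u);

    /* "lineno_checksum": derived from the file name and definition line */
    uint32_t lineno_checksum = toy_hash ("a.c:11", 2166136261u);

    /* As long as the name and location do not change, a rebuilt object
       gets the same identifiers, so an old .gcda profile still matches. */
    printf ("ident=%u lineno_checksum=0x%08x\n", ident, lineno_checksum);
    return 0;
}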
Let’s look into the individual counters of .gcda files. We can extract them with the -l option:
$ gcov-dump -l a.gcda
a.gcda:data:magic `gcda':version `B12*'
a.gcda:stamp 1137774494
a.gcda: a1000000: 2:OBJECT_SUMMARY runs=5, sum_max=60
a.gcda: 01000000: 3:FUNCTION ident=108032747, lineno_checksum=0x0a7a17ea, cfg_checksum=0xc835c602
a.gcda: 01a10000: 8:COUNTERS arcs 4 counts
a.gcda: 0: 1 4 59 4
a.gcda: 01a90000: 16:COUNTERS indirect_call 8 counts
a.gcda: 0: 59 3 1025457522 53 1567133468 3 1634904005 3
a.gcda: 01af0000: 2:COUNTERS time_profiler 1 counts
a.gcda: 0: 1
a.gcda: 01000000: 3:FUNCTION ident=1567133468, lineno_checksum=0xdadb6f0d, cfg_checksum=0xa43083b8
a.gcda: 01a10000: 2:COUNTERS arcs 1 counts
a.gcda: 0: 3
a.gcda: 01a90000: 0:COUNTERS indirect_call 0 counts
a.gcda: 01af0000: 2:COUNTERS time_profiler 1 counts
a.gcda: 0: 3
a.gcda: 01000000: 3:FUNCTION ident=1025457522, lineno_checksum=0x745daa69, cfg_checksum=0xa43083b8
a.gcda: 01a10000: 2:COUNTERS arcs 1 counts
a.gcda: 0: 53
a.gcda: 01a90000: 0:COUNTERS indirect_call 0 counts
a.gcda: 01af0000: 2:COUNTERS time_profiler 1 counts
a.gcda: 0: 2
a.gcda: 01000000: 3:FUNCTION ident=1634904005, lineno_checksum=0x432c0dd4, cfg_checksum=0xa43083b8
a.gcda: 01a10000: 2:COUNTERS arcs 1 counts
a.gcda: 0: 3
a.gcda: 01a90000: 0:COUNTERS indirect_call 0 counts
a.gcda: 01af0000: 2:COUNTERS time_profiler 1 counts
a.gcda: 0: 4
Counters are harder to interpret without the context. For example arcs just lists the taken count of each branch known to the current function. The string FUNCTION ident=108032747: arcs 4 counts: 1 4 59 4 means that the first arc was taken once and the third was taken 59 times. Looking at the above a.c.gcov I would guess that FUNCTION ident=108032747 is our main() function, the first counter is our if (argc < 2) return 0; branch and the rest are arcs around the indirect calls to f0(), f1(), f2().
indirect_call is more complicated: the string indirect_call 8 counts: 59 3 1025457522 53 1567133468 3 1634904005 3 means that there were 59 indirect calls in total and the top 3 targets are listed: the first target had 53 calls (with ident=1025457522 as a target, probably f1()), the second and third were taken 3 times each (probably f0() and f2() are the targets).
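My reading of that record layout: one “total” counter, one “number of tracked targets” counter, then (target ident, count) pairs. A tiny decoder for exactly that layout (my own sketch, fed with the numbers from the dump above):

#include <stdint.h>
#include <stdio.h>

typedef int64_t gcov_type; /* gcov counters are 64-bit signed values */

/* Decode a TOPN-style indirect_call record laid out as:
   [total, n, value0, count0, value1, count1, ...] */
static void decode_indirect_call (const gcov_type *rec)
{
    gcov_type total = rec[0];
    gcov_type n     = rec[1];

    printf ("total indirect calls: %lld, tracked targets: %lld\n",
            (long long) total, (long long) n);
    for (gcov_type i = 0; i < n; i++)
        printf ("  target ident=%lld called %lld times\n",
                (long long) rec[2 + 2 * i], (long long) rec[3 + 2 * i]);
}

int main (void)
{
    /* the 8 counters dumped for main() above */
    const gcov_type rec[] = { 59, 3, 1025457522, 53, 1567133468, 3, 1634904005, 3 };
    decode_indirect_call (rec);
    return 0;
}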
time_profiler is the simplest one: it says how many times the function itself was called.

Quiz question: why does ident=108032747 (the main() function) have a time_profiler value of only 1 and not 5?
internals: libgcov
.gcda files are read at program startup and written at program shutdown. Multiple program runs have the effect of merging profile statistics from each run.

Some statistics like runs are easy to merge: just sum the values together. But some, like the indirect_call counters, are trickier: how do we handle indirect calls with an unbounded target fan-out? Should we store all of them? Or set a static or dynamic limit?
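To make the easy case concrete, here is a self-contained toy sketch of additive merging (my own simplified code, not libgcov's; the real entry point for this policy is __gcov_merge_add(), listed below): the in-memory counters simply absorb the values already stored on disk.

#include <stdint.h>
#include <stdio.h>

typedef int64_t gcov_type; /* gcov counters are 64-bit signed values */

/* Toy additive merge: sum the counters collected by this run with the
   counters already present in the .gcda file (here just another array). */
static void toy_merge_add (gcov_type *mem, const gcov_type *disk, unsigned n)
{
    for (unsigned i = 0; i < n; i++)
        mem[i] += disk[i];
}

int main (void)
{
    /* A made-up split between "this run" and "earlier runs"; the sums end
       up matching the "arcs 4 counts: 1 4 59 4" record dumped above. */
    gcov_type this_run[4] = { 1, 0, 12, 0 };
    gcov_type on_disk[4]  = { 0, 4, 47, 4 };

    toy_merge_add (this_run, on_disk, 4);
    for (unsigned i = 0; i < 4; i++)
        printf ("arc %u: %lld\n", i, (long long) this_run[i]);
    return 0;
}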
All these cases are handled by the -lgcov library linked into instrumented binaries. Its API resides in libgcc/libgcov.h and has a few things to note:
- entry and exit points: __gcov_init() / __gcov_exit()
- profiler counters: __gcov_interval_profiler() / __gcov_time_profiler() / __gcov_indirect_call_profiler_v4()
- profile mergers: __gcov_merge_add() / __gcov_merge_time_profile() / __gcov_merge_topn()
- counter readers: gcov_get_counter() / gcov_get_counter_ignore_scaling() / gcov_get_counter_target()
- hooks to catch program re-execution: __gcov_fork() / __gcov_execl()
Let’s look at indirect call profiler details as an example. The other metric types follow the same pattern.
__gcov_indirect_call_profiler_v4() implements an increment (in-memory) for indirect_call:

void
__gcov_indirect_call_profiler_v4 (gcov_type value, void *cur_func)
{
  __gcov_indirect_call_profiler_body (value, cur_func, 0);
}

/* By default, the C++ compiler will use function addresses in the
   vtable entries. Setting TARGET_VTABLE_USES_DESCRIPTORS to nonzero
   tells the compiler to use function descriptors instead. The value
   of this macro says how many words wide the descriptor is (normally 2).

   It is assumed that the address of a function descriptor may be treated
   as a pointer to a function. */

/* Tries to determine the most common value among its inputs. */
static inline void
__gcov_indirect_call_profiler_body (gcov_type value, void *cur_func,
                                    int use_atomic)
{
  /* If the C++ virtual tables contain function descriptors then one
     function may have multiple descriptors and we need to dereference
     the descriptors to see if they point to the same function. */
  if (cur_func == __gcov_indirect_call.callee
      || (__LIBGCC_VTABLE_USES_DESCRIPTORS__
          && *(void **) cur_func == *(void **) __gcov_indirect_call.callee))
    __gcov_topn_values_profiler_body (__gcov_indirect_call.counters, value,
                                      use_atomic);

  __gcov_indirect_call.callee = NULL;
}

/* Tries to determine N most commons value among its inputs. */
static inline void
__gcov_topn_values_profiler_body (gcov_type *counters, gcov_type value,
                                  int use_atomic)
{
  gcov_topn_add_value (counters, value, 1, use_atomic, 1);
}

/* Add key value pair VALUE:COUNT to a top N COUNTERS. When INCREMENT_TOTAL
   is true, add COUNT to total of the TOP counter. If USE_ATOMIC is true,
   do it in atomic way. Return true when the counter is full, otherwise
   return false. */
static inline unsigned
gcov_topn_add_value (gcov_type *counters, gcov_type value, gcov_type count,
                     int use_atomic, int increment_total)
{
  // ...
Here we see that indirect_call is a topn style counter as it’s handled by gcov_topn_add_value().
static inline unsigned
gcov_topn_add_value (gcov_type *counters, gcov_type value, gcov_type count,
                     int use_atomic, int increment_total)
{
  if (increment_total)
    {
      /* In the multi-threaded mode, we can have an already merged profile
         with a negative total value. In that case, we should bail out. */
      if (counters[0] < 0)
        return 0;
      gcov_counter_add (&counters[0], 1, use_atomic);
    }

  struct gcov_kvp *prev_node = NULL;
  struct gcov_kvp *minimal_node = NULL;
  struct gcov_kvp *current_node = (struct gcov_kvp *)(intptr_t)counters[2];

  while (current_node)
    {
      if (current_node->value == value)
        {
          gcov_counter_add (&current_node->count, count, use_atomic);
          return 0;
        }
      // ...
  return 0;
}
A few things to note here:

- gcov_topn_add_value() increments counters[ix] where ix matches the value being counted
- gcov_topn_add_value() increments counters[0], which it treats as the “total”
- negative “total” values are special and are related to merged values (how?)
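My mental model of the in-memory side, as a toy sketch (heavily simplified; real libgcov uses struct gcov_kvp nodes, atomic updates and a small pool allocator): counters[0] holds the running total, counters[1] the number of tracked values and counters[2] a pointer to a value/count list.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef int64_t gcov_type;

/* Toy key/value node playing the role of libgcov's struct gcov_kvp. */
struct kvp { gcov_type value; gcov_type count; struct kvp *next; };

enum { TOY_LIMIT = 32 }; /* GCOV_TOPN_MAXIMUM_TRACKED_VALUES is 32 in gcc */

/* counters[0] = total, counters[1] = tracked values, counters[2] = list head */
static void toy_topn_add (gcov_type *counters, gcov_type value)
{
    counters[0]++; /* always count the event itself */

    for (struct kvp *n = (struct kvp *)(intptr_t) counters[2]; n; n = n->next)
        if (n->value == value) { n->count++; return; }

    if (counters[1] >= TOY_LIMIT)
        return; /* counter is "full": the new value is simply not tracked */

    struct kvp *n = malloc (sizeof *n);
    n->value = value;
    n->count = 1;
    n->next = (struct kvp *)(intptr_t) counters[2];
    counters[2] = (gcov_type)(intptr_t) n;
    counters[1]++;
}

int main (void)
{
    gcov_type counters[3] = { 0, 0, 0 };
    toy_topn_add (counters, 1025457522); /* e.g. f1()'s ident */
    toy_topn_add (counters, 1025457522);
    toy_topn_add (counters, 1567133468); /* e.g. f2()'s ident */
    printf ("total=%lld tracked=%lld\n",
            (long long) counters[0], (long long) counters[1]);
    return 0;
}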
Let’s look at the merge function for __gcov_merge_topn():
/* The profile merging function for choosing the most common value.
It is given an array COUNTERS of N_COUNTERS old counters and it
reads the same number of counters from the gcov file. The counters
are split into pairs where the members of the tuple have
meanings:
-- the stored candidate on the most common value of the measured entity
-- counter
We use -TOTAL for situation when merging dropped some values.
The information is used for -fprofile-reproducible flag.
*/
void
__gcov_merge_topn (gcov_type *counters, unsigned n_counters)
{
  gcc_assert (!(n_counters % GCOV_TOPN_MEM_COUNTERS));

  for (unsigned i = 0; i < (n_counters / GCOV_TOPN_MEM_COUNTERS); i++)
    {
      /* First value is number of total executions of the profiler. */
      gcov_type all = gcov_get_counter_ignore_scaling (-1);
      gcov_type n = gcov_get_counter_ignore_scaling (-1);

      unsigned full = all < 0;
      gcov_type *total = &counters[GCOV_TOPN_MEM_COUNTERS * i];
      *total += full ? -all : all;

      for (unsigned j = 0; j < n; j++)
        {
          gcov_type value = gcov_get_counter_target ();
          gcov_type count = gcov_get_counter_ignore_scaling (-1);

          // TODO: we should use atomic here
          full |= gcov_topn_add_value (counters + GCOV_TOPN_MEM_COUNTERS * i,
                                       value, count, 0, 0);
        }

      if (full)
        *total = -(*total);
    }
}
Once again we see that the first value of each key/value block is treated as the “total” count of calls. The merger is the place where overflow of gcov_topn_add_value() is detected, and the “total” is stored as a negative value to record this fact.
All the counter types are written to disk with write_one_data(). From there we see that write_topn_counters() dumps them as:
/* Store all TOP N counters where each has a dynamic length. */
static void
write_topn_counters (const struct gcov_ctr_info *ci_ptr,
                     unsigned t_ix,
                     gcov_unsigned_t n_counts,
                     void (*dump_fn) (const void *, unsigned, void *),
                     void *(*allocate_fn)(unsigned, void *),
                     void *arg)
{
  // ...
  unsigned disk_size = GCOV_TOPN_DISK_COUNTERS * counters + 2 * pair_total;
  dump_unsigned (GCOV_TAG_FOR_COUNTER (t_ix), dump_fn, arg),
  dump_unsigned (GCOV_TAG_COUNTER_LENGTH (disk_size), dump_fn, arg);

  for (unsigned i = 0; i < counters; i++)
    {
      dump_counter (ci_ptr->values[GCOV_TOPN_MEM_COUNTERS * i], dump_fn, arg);
      dump_counter (list_sizes[i], dump_fn, arg);
      gcov_type start = ci_ptr->values[GCOV_TOPN_MEM_COUNTERS * i + 2];

      unsigned j = 0;
      for (struct gcov_kvp *node = (struct gcov_kvp *)(__INTPTR_TYPE__)start;
           j < list_sizes[i]; node = node->next, j++)
        {
          dump_counter (node->value, dump_fn, arg);
          dump_counter (node->count, dump_fn, arg);
        }
    }
}
This just writes key/value pairs on disk. Nothing fancy. No special handling of negative values. This makes on-disk format roughly match in-memory format.
Now we can answer our quiz question on why time_profiler still has a value of 1 for main() even after 5 program runs. __gcov_merge_time_profile() tells us:
/* Time profiles are merged so that minimum from all valid (greater than zero)
is stored. There could be a fork that creates new counters. To have
the profile stable, we chosen to pick the smallest function visit time. */
void
__gcov_merge_time_profile (gcov_type *counters, unsigned n_counters)
{
  unsigned int i;
  gcov_type value;

  for (i = 0; i < n_counters; i++)
    {
      value = gcov_get_counter_target ();

      if (value && (!counters[i] || value < counters[i]))
        counters[i] = value;
    }
}
The merged counter stores the smallest recorded “visit time” for the function (the order in which it was first executed in a run), not a sum across runs (as I thought initially). main() is the first instrumented function executed in every run, so it gets visit time 1 each time, and taking the minimum keeps it at 1 no matter how many runs we merge.
The takeaways here are:

- gcc implements a few types of counters: histogram-style TOPN counters, simpler numeric counters and a few others.
- merge policies for counters are dependent on their exact meaning; they are not complicated but have a few gotchas like negative values.
- libgcov provides the runtime for both the program being profiled and the standalone gcov tool.
internals: gcc-emitted code

Let’s look at the way counters are embedded into our program. I suggest looking at the optimized GIMPLE tree. Here is our original program:
#include <stddef.h>
/* Keep the functions around to have real indirect calls. */
#define DECL_F(fn) static void fn(void) __attribute__((noipa))
#define DEF_F(fn) static void fn(void) {}
DECL_F(f0); DEF_F(f0);
DECL_F(f1); DEF_F(f1);
DECL_F(f2); DEF_F(f2);

int main (int argc, char *argv[]) {
    if (argc < 2) return 0;

    static const void(*ft[])(void) = { &f0, &f1, &f2, };
    size_t sz = sizeof (ft) / sizeof (ft[0]);

    for (const char * p = argv[1]; *p; p++)
        ft[*p % sz]();

    return 0;
}
We will dump the final GIMPLE with the -fdump-tree-optimized flag. I’ll paste the output in 3 pieces (as the output is quite big):

- main() function
- f2() function
- constructor/destructor

Let’s dump GIMPLE first:
$ gcc -c a.c -fprofile-generate -fprofile-arcs -ftest-coverage -O2 -fdump-tree-optimized
$ cat a.c.244t.optimized
Here is our main():
;; Function main (main, funcdef_no=3, decl_uid=1966, cgraph_uid=4, symbol_order=3) (executed once)
__attribute__((access ("^1[ ]", )))
int main (int argc, char * * argv)
{
const char * p;
static void (*<T34b>) (void) ft[3] = {f0, f1, f2};
long unsigned int _1;
long unsigned int _2;
void (*<T34b>) (void) _3;
char _4;
long int PROF_edge_counter_14;
long int PROF_edge_counter_15;
long int PROF_edge_counter_16;
long int PROF_edge_counter_17;
long int PROF_edge_counter_18;
long int PROF_edge_counter_19;
long int PROF_edge_counter_20;
long int PROF_edge_counter_21;
long int _23;
long int time_profile_24;
long int time_profile_25;
void * _26;
char _39;
char _40;
<bb 2> [local count: 160390549]:
_26 = __gcov_indirect_call.D.1993;
if (_26 != 0B)
goto <bb 3>; [20.00%]
else
goto <bb 4>; [80.00%]
<bb 3> [local count: 160390549]:
__gcov_indirect_call_profiler_v4 (108032747, main);
<bb 4> [local count: 160390549]:
_23 = __gcov7.main[0];
if (_23 == 0)
goto <bb 5>; [20.00%]
else
goto <bb 6>; [80.00%]
<bb 5> [local count: 160390549]:
time_profile_24 = __gcov_time_profiler_counter;
time_profile_25 = time_profile_24 + 1;
__gcov7.main[0] = time_profile_25;
__gcov_time_profiler_counter = time_profile_25;
<bb 6> [local count: 160390549]:
if (argc_8(D) <= 1)
goto <bb 7>; [26.36%]
else
goto <bb 8>; [73.64%]
<bb 7> [local count: 42278949]:
PROF_edge_counter_14 = __gcov0.main[0];
PROF_edge_counter_15 = PROF_edge_counter_14 + 1;
__gcov0.main[0] = PROF_edge_counter_15;
goto <bb 11>; [100.00%]
<bb 8> [local count: 118111600]:
PROF_edge_counter_16 = __gcov0.main[1];
PROF_edge_counter_17 = PROF_edge_counter_16 + 1;
__gcov0.main[1] = PROF_edge_counter_17;
p_11 = MEM[(char * *)argv_10(D) + 8B];
_39 = *p_11;
if (_39 != 0)
goto <bb 9>; [89.00%]
else
goto <bb 10>; [11.00%]
<bb 9> [local count: 955630226]:
# _40 = PHI <_4(9), _39(8)>
# p_41 = PHI <p_13(9), p_11(8)>
_1 = (long unsigned int) _40;
_2 = _1 % 3;
_3 = ft[_2];
__gcov_indirect_call.D.1994 = &__gcov4.main[0];
__gcov_indirect_call.D.1993 = _3;
_3 ();
PROF_edge_counter_18 = __gcov0.main[2];
PROF_edge_counter_19 = PROF_edge_counter_18 + 1;
__gcov0.main[2] = PROF_edge_counter_19;
p_13 = p_41 + 1;
_4 = MEM[(const char *)p_13];
if (_4 != 0)
goto <bb 9>; [89.00%]
else
goto <bb 10>; [11.00%]
<bb 10> [local count: 118111600]:
PROF_edge_counter_20 = __gcov0.main[3];
PROF_edge_counter_21 = PROF_edge_counter_20 + 1;
__gcov0.main[3] = PROF_edge_counter_21;
<bb 11> [local count: 160390549]:
return 0;
}
While it’s a lot of code, the handling of each counter is straightforward:

- read the counter out
- update the counter
- write it back
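In plain C the emitted updates boil down to something like the sketch below (symbol names are made-up stand-ins for the __gcov0.main / __gcov7.main / __gcov_time_profiler_counter variables seen in the dump):

#include <stdint.h>

typedef int64_t gcov_type;

/* Stand-ins for the per-function counter arrays gcc emits.
   The real symbols are __gcov0.main, __gcov7.main and
   __gcov_time_profiler_counter; these names are made up. */
static gcov_type gcov0_main[4];            /* arc counters */
static gcov_type gcov7_main[1];            /* time-profile slot */
static gcov_type gcov_time_profiler_clock; /* global visit clock */

/* The PROF_edge_counter_* pattern: read, add one, write back. */
static void bump_arc (unsigned ix)
{
    gcov_type c = gcov0_main[ix];
    c = c + 1;
    gcov0_main[ix] = c;
}

/* The time-profile pattern: only on the first visit, take the next
   tick of the global clock and remember it. */
static void note_first_visit (void)
{
    if (gcov7_main[0] == 0)
      {
        gcov_type t = gcov_time_profiler_clock + 1;
        gcov7_main[0] = t;
        gcov_time_profiler_clock = t;
      }
}

int main (void)
{
    note_first_visit (); /* <bb 5> in the dump above */
    bump_arc (1);        /* <bb 8>: the "argc >= 2" edge */
    return 0;
}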
Here we already see a few global variables being updated:

- __gcov_time_profiler_counter: the global timer counter gets incremented once
- __gcov7.main[0]: main’s timer counter gets incremented once
- __gcov0.main[0,1,2,3]: main’s arc counters (to f0(), f1(), f2())
Note that main() does maintain explicit arc counters, but it does not know up front where its indirect calls lead. Thus gcc stores the indirect call target in the __gcov_indirect_call.D.1993 global. To maintain the indirect call counters gcc relies on the target function to increment them:
;; Function f2 (f2, funcdef_no=2, decl_uid=1960, cgraph_uid=3, symbol_order=2)
__attribute__((noipa, noinline, noclone, no_icf))
void f2 ()
{
long int PROF_edge_counter_2;
long int PROF_edge_counter_3;
long int _4;
long int time_profile_5;
long int time_profile_6;
void * _7;
<bb 2> [local count: 1073741824]:
_7 = __gcov_indirect_call.D.1993;
if (_7 != 0B)
goto <bb 3>; [20.00%]
else
goto <bb 4>; [80.00%]
<bb 3> [local count: 1073741824]:
__gcov_indirect_call_profiler_v4 (1567133468, f2);
<bb 4> [local count: 1073741824]:
PROF_edge_counter_2 = __gcov0.f2[0];
PROF_edge_counter_3 = PROF_edge_counter_2 + 1;
__gcov0.f2[0] = PROF_edge_counter_3;
_4 = __gcov7.f2[0];
if (_4 == 0)
goto <bb 5>; [20.00%]
else
goto <bb 6>; [80.00%]
<bb 5> [local count: 1073741824]:
time_profile_5 = __gcov_time_profiler_counter;
time_profile_6 = time_profile_5 + 1;
__gcov7.f2[0] = time_profile_6;
__gcov_time_profiler_counter = time_profile_6;
<bb 6> [local count: 1073741824]:
return;
}
;; Function f1 (f1, funcdef_no=1, decl_uid=1955, cgraph_uid=2, symbol_order=1)
... same as f2
;; Function f0 (f0, funcdef_no=0, decl_uid=1950, cgraph_uid=1, symbol_order=0)
... same as f2
Here the same __gcov_indirect_call.D.1993 global is used to increment the counter in case f2() is the indirect call target. If f2() is called directly then this global would not be populated.
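My summary of that handshake, as a toy sketch (names are made up; the real global is __gcov_indirect_call and the real callee-side helper is __gcov_indirect_call_profiler_v4()): the caller records its TOPN counter slot and the planned target right before the indirect call, and the callee credits the call to itself only if it finds its own address there.

#include <stdint.h>
#include <stdio.h>

typedef int64_t gcov_type;
typedef void (*cb_t) (void);

/* Toy stand-in for the __gcov_indirect_call global. */
static struct { cb_t callee; gcov_type *counters; } indirect_call;

/* Toy stand-in for the caller's TOPN slot (&__gcov4.main[0] in the dump). */
static gcov_type main_topn_counters[3];

static void callee_prologue (cb_t self)
{
    /* What each instrumented fN() does on entry: if we arrived here via
       the recorded indirect call, credit ourselves in the caller's slot. */
    if (indirect_call.callee == self)
        indirect_call.counters[0]++;   /* toy: bump only the "total" */
    indirect_call.callee = NULL;
}

static void f1 (void) { callee_prologue (&f1); }

static void call_indirect (cb_t target)
{
    /* What main() does right before the `_3 ();` call in the dump. */
    indirect_call.counters = main_topn_counters;
    indirect_call.callee = target;
    target ();
}

int main (void)
{
    call_indirect (&f1);
    printf ("indirect calls recorded: %lld\n",
            (long long) main_topn_counters[0]);
    return 0;
}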
;; Function _sub_I_00100_0 (_sub_I_00100_0, funcdef_no=4, decl_uid=2028, cgraph_uid=6, symbol_order=17) (executed once)
void _sub_I_00100_0 ()
{
<bb 2> [local count: 1073741824]:
__gcov_init (&*.LPBX0); [tail call]
return;
}
;; Function _sub_D_00100_1 (_sub_D_00100_1, funcdef_no=5, decl_uid=2031, cgraph_uid=7, symbol_order=18) (executed once)
void _sub_D_00100_1 ()
{
<bb 2> [local count: 1073741824]:
__gcov_exit (); [tail call]
return;
}
And here we see the -lgcov startup and shutdown code in the constructor and destructor. Simple!
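The generated _sub_I_*/_sub_D_* functions play the same role as a plain constructor/destructor pair in user code. A minimal sketch of the same pattern (my own illustration, not what gcc literally emits):

#include <stdio.h>

/* Runs before main(), like _sub_I_00100_0 calling __gcov_init(). */
__attribute__((constructor))
static void register_profile_data (void)
{
    printf ("init: register this object's counters\n");
}

/* Runs after main() returns (or on exit()), like _sub_D_00100_1
   calling __gcov_exit() to merge and write the .gcda file. */
__attribute__((destructor))
static void flush_profile_data (void)
{
    printf ("exit: merge counters with the on-disk .gcda and write it out\n");
}

int main (void)
{
    printf ("main: program runs as usual\n");
    return 0;
}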
the actual bug

Now, if all of the above made some sense to you, then https://gcc.gnu.org/PR105282 should look less mysterious. There python managed to ICE gcc when the program was optimized with training data available.
Let’s look at the minimized example:
#include <stddef.h>
typedef void (*cb_t)(void);
#define F(__fn) static void __fn(void) {}
F(f00);F(f01);F(f02);F(f03);F(f04);F(f05);F(f06);F(f07);F(f08);F(f09);
F(f10);F(f11);F(f12);F(f13);F(f14);F(f15);F(f16);F(f17);F(f18);F(f19);
F(f20);F(f21);F(f22);F(f23);F(f24);F(f25);F(f26);F(f27);F(f28);F(f29);
F(f30);F(f31);F(f32);F(f33);F(f34);F(f35);F(f36);F(f37);F(f38);F(f39);
F(f40);F(f41);F(f42);F(f43);F(f44);F(f45);F(f46);F(f47);F(f48);F(f49);
static void f(int i) {
    /* Needs to be bigger than gcc's GCOV_TOPN_MAXIMUM_TRACKED_VALUES == 32
     * to overflow the GCOV_COUNTER_V_INDIR counter type.
     */
    static const cb_t fs[] = {
        &f00,&f01,&f02,&f03,&f04,&f05,&f06,&f07,&f08,&f09,
        &f10,&f11,&f12,&f13,&f14,&f15,&f16,&f17,&f18,&f19,
        &f20,&f21,&f22,&f23,&f24,&f25,&f26,&f27,&f28,&f29,
        &f30,&f31,&f32,&f33,&f34,&f35,&f36,&f37,&f38,&f39,
        &f40,&f41,&f42,&f43,&f44,&f45,&f46,&f47,&f48,&f49,
    };
    size_t sz = sizeof (fs) / sizeof (fs[0]);
    fs[i % sz]();
}
int l(int argc, char * argv[]);

int main(int argc, char *argv[]) {
    if (argc == 1)
        for (unsigned int i = 0; i < 25; i++)
            f(i);
    if (argc == 2)
        for (unsigned int i = 25; i < 50; i++)
            f(i);
}
ICE on gcc-12 or gcc-11:
$ gcc -flto -O0 a.c -fprofile-generate -o a
$ ./a # populate first 25 buckets
$ ./a 1 # populate 25 more buckets, cause overflow
$ gcc -flto -O0 a.c -fprofile-use -o a
during IPA pass: modref
a.c:36:1: internal compiler error: in stream_out_histogram_value, at value-prof.cc:340
36 | }
| ^
0x8351fb stream_out_histogram_value(output_block*, histogram_value_t*)
../../gcc-12-20220410/gcc/value-prof.cc:340
0x1c848c0 output_gimple_stmt
../../gcc-12-20220410/gcc/gimple-streamer-out.cc:192
0x1c848c0 output_bb(output_block*, basic_block_def*, function*)
../../gcc-12-20220410/gcc/gimple-streamer-out.cc:227
0xdc91ad output_function
../../gcc-12-20220410/gcc/lto-streamer-out.cc:2453
0xdc91ad lto_output()
../../gcc-12-20220410/gcc/lto-streamer-out.cc:2796
0xe57b11 write_lto
../../gcc-12-20220410/gcc/passes.cc:2762
0xe57b11 ipa_write_summaries_1
../../gcc-12-20220410/gcc/passes.cc:2826
0xe57b11 ipa_write_summaries()
../../gcc-12-20220410/gcc/passes.cc:2882
0xaac060 ipa_passes
../../gcc-12-20220410/gcc/cgraphunit.cc:2209
0xaac060 symbol_table::compile()
../../gcc-12-20220410/gcc/cgraphunit.cc:2282
0xaaea77 symbol_table::compile()
../../gcc-12-20220410/gcc/cgraphunit.cc:2262
0xaaea77 symbol_table::finalize_compilation_unit()
../../gcc-12-20220410/gcc/cgraphunit.cc:2530
Here the bug is in the stream_out_histogram_value() function, which tried to stream out to disk perfectly valid profile details around main():
$ gcov-dump -l a.gcda
...
a.gcda: 01a90000: 528:COUNTERS indirect_call 66 counts
a.gcda: 0: -50 32 1456173180 1 1792104613 1 918340114 1
a.gcda: 8: 1406444659 1 263798468 1 1664310260 1 1063174467 1
a.gcda: 16: 1596551981 1 54847898 1 533075953 1 1135316294 1
a.gcda: 24: 601636648 1 2142348703 1 450479102 1 1186224457 1
a.gcda: 32: 416313568 1 1153296983 1 617240633 1 2024260238 1
a.gcda: 40: 1680162021 1 944285266 1 1480528956 1 72519307 1
a.gcda: 48: 1631250666 1 1029141085 1 941945699 1 1682532820 1
a.gcda: 56: 71228346 1 1481851149 1 1154596710 1 414983633 1
a.gcda: 64: 2026608575 1
The -50 is our “total” count of indirect calls. It’s negative because 25 calls from the first run were merged with 25 calls from the second run. TOPN histograms have a limit of 32 unique values, so the other 18 targets were discarded. The negative total signals this fact.
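A tiny arithmetic sketch of how the merge arrives at that record (my own code; the constants mirror the reproducer and the 32-entry TOPN limit mentioned in its comment):

#include <stdio.h>

int main (void)
{
    const int limit = 32;           /* GCOV_TOPN_MAXIMUM_TRACKED_VALUES */
    const int run1 = 25, run2 = 25; /* distinct indirect call targets per run */

    int total   = run1 + run2;      /* 50 indirect calls in total */
    int tracked = run1 + run2;
    int dropped = 0;

    if (tracked > limit)
      {
        dropped = tracked - limit;  /* 18 targets get discarded */
        tracked = limit;
        total   = -total;           /* overflow is recorded as a negative total */
      }

    printf ("total=%d tracked=%d dropped=%d\n", total, tracked, dropped);
    return 0;
}

This prints total=-50 tracked=32 dropped=18, which is exactly what the gcov-dump output above shows.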
gcc/value-prof.cc:stream_out_histogram_value() was too strict in its assumptions about counter types:
void
stream_out_histogram_value (struct output_block *ob, histogram_value hist)
{
  // ...
  for (i = 0; i < hist->n_counters; i++)
    {
      /* When user uses an unsigned type with a big value, constant converted
         to gcov_type (a signed type) can be negative. */
      gcov_type value = hist->hvalue.counters[i];
      if (hist->type == HIST_TYPE_TOPN_VALUES
          || hist->type == HIST_TYPE_IOR)
        /* Note that the IOR counter tracks pointer values and these can have
           sign bit set. */
        ;
      else
        gcc_assert (value >= 0);

      streamer_write_gcov_count (ob, value);
    }
  if (hist->hvalue.next)
    stream_out_histogram_value (ob, hist->hvalue.next);
}
Here gcc asserts that every value in TOPN counters has to be non-negative. It was trivial to fix by skipping the first “total” counter, which can be negative on overflows like the above:
--- a/gcc/value-prof.c
+++ b/gcc/value-prof.c
@@ -336,6 +336,10 @@ stream_out_histogram_value (struct output_block *ob, histogram_value hist)
 	/* Note that the IOR counter tracks pointer values and these can have
 	   sign bit set.  */
 	;
+      else if (hist->type == HIST_TYPE_INDIR_CALL && i == 0)
+	/* 'all' counter overflow is stored as a negative value. Individual
+	   counters and values are expected to be non-negative.  */
+	;
       else
 	gcc_assert (value >= 0);
So much behind this three-liner.
final words
Profiler bugs are frequently hard. Mechanical reduction on a real program is usually very fragile: most meaningful code changes both alter the generated instrumented code and break the program on the training run. It’s “infeasible” to incrementally reduce both the program and the input data for large projects like firefox or python.

I say “infeasible” because I successfully reduced tauthon once in https://gcc.gnu.org/PR96394. It took me almost 2 months to extract a small reproducer manually. I don’t think I’ll ever do it again :)
The first time it was fun: I found out a lot more than I imagined about the python implementation: how it embeds its own bytecode into the executable, and how the bootstrap python loads the rest of the python modules using this bootstrap code.
When faced with another profiling failure in python in https://gcc.gnu.org/PR105282 I initially thought it would be impossible for me to do it again from scratch. On top of that this one added -flto to make things even less manageable. But I was lucky to reproduce the crash and derive the reproducer from the gdb backtrace.
PGO-style builds have another unusual property: they are very dependent on the input training data and on any internal non-determinism your program has. For example, if your program uses random-seeded hash tables you will get slightly different profile outputs from run to run.

Slightly different profile outputs lead to slightly different optimization decisions during compilation. Different decisions lead to different output binaries. Without special effort you will probably get a unique binary every time you build a real-world project with PGO support.
It might not be a big deal with regard to the performance of the final binary. But it can be quite a headache if one of these rare states causes the compiler to generate an invalid result: such failures can be very hard to reproduce (and fix).
But if you have to deal with PGO bugs then try to look straight into what gcc is doing.
Have fun!