profiling binutils linkers in nixpkgs
background
I’ve been using binutils-2.45 against a local nixpkgs checkout for a while to weed out minor problems in other packages. So far I encountered my old friend, the guile over-stripping issue.
The GNU gold linker was deprecated in binutils upstream as it does not have developer power behind it, while the bfd linker (the default) does get maintenance attention. binutils-2.45 intentionally does not distribute the gold source with it, to nudge users off gold.
To fix the rare nixpkgs package build failures in packages that rely on ld.gold, I trivially repointed all the ld.gold links at ld.bfd locally and built my system. No major problems found.
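The hack itself is nothing nixpkgs-specific. Ignoring that store paths are immutable (the real change was an equivalent local nixpkgs patch applied at build time), the idea boils down to this sketch, where $binutils stands in for the binutils output path:
# illustration only: every ld.gold entry point now runs ld.bfd
$ ln -sf ld.bfd $binutils/bin/ld.gold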
an ld.gold removal obstacle
In a recent discussion thread the question was raised of if/how nixpkgs could switch to a gold-less binutils. One of the interesting points in the thread is that ld.bfd is occasionally ~3x slower than ld.gold on files that already take multiple seconds to link with ld.gold.
For me it was quite a surprise, as ld.bfd does get speedup improvements from time to time. Thus, I suspected some kind of serious bug on the ld.bfd side. I wondered if I could reproduce such a big performance drop and find a low-hanging fix, or at least report a bug to binutils upstream.
I used the same pandoc nixpkgs package to do the linker testing. It’s a nice example as it builds only 4 small haskell source files and links in a huge amount of static haskell libraries. A perfect linker load test. Preparing the baseline:
# pulling in already built package into cache
$ nix build --no-link -f. pandoc
# pulling in all the build dependencies into cache
$ nix build --no-link -f. pandoc --rebuild
# timing the build:
$ time nix build --no-link -f. pandoc --rebuild
error: derivation '/nix/store/az6dbzm341jc7n4sw7w0ifspxgsm4093-pandoc-cli-3.7.0.2.drv' may not be deterministic: output '/nix/store/znmj21k8nrqc3hcax6yfy446g8bgk7z3-pandoc-cli-3.7.0.2' differs
real 0m12,850s
user 0m0,719s
sys 0m0,105s
Do not mind the determinism error. I got about 13 seconds of package build time, which seems to match the original timing. Then I tried ld.bfd by passing an extra --ghc-option=-optl-fuse-ld=bfd option to ./Setup configure:
$ time nix build --impure --expr 'with import <nixpkgs> {};
pandoc.overrideAttrs (oa: {
configureFlags = oa.configureFlags ++ ["--ghc-option=-optl-fuse-ld=bfd"];})'
...
real 0m37,391s
user 0m0,691s
sys 0m0,120s
37 seconds! Almost a 3x slowdown. I was glad to see such a simple reproducer.
linker performance profiles
I dropped into a development shell to explore individual ld commands:
$ nix develop --impure --expr 'with import <nixpkgs> {};
pandoc.overrideAttrs (oa: {
configureFlags = oa.configureFlags ++ ["--ghc-option=-optl-fuse-ld=bfd" ];})'
$$ genericBuild
...
[4 of 4] Linking dist/build/pandoc/pandoc
^C
$$ # ready for interactive exploration
I ran ./Setup build -v to extract the exact ghc --make ... invocation:
$$ ./Setup build -v
...
Linking...
Running: <<NIX>>-ghc-9.10.3/bin/ghc --make -fbuilding-cabal-package -O -split-sections -static -outputdir dist/build/pandoc/pandoc-tmp -odir dist/build/pandoc/pandoc-tmp ...
Then I ran ld.bfd under perf:
$ perf record -g <<NIX>>-ghc-9.10.3/bin/ghc --make ... '-optl-fuse-ld=bfd' -fforce-recomp
From that recording I built the flamegraph picture:
$ perf script > out.perf
$ perl ~/dev/git/FlameGraph/stackcollapse-perf.pl out.perf > out.folded
$ perl ~/dev/git/FlameGraph/flamegraph.pl out.folded > bfd.svg
You can click on the pictures and explore them interactively. Does it look fine to you? Anything odd?
We already see a tiny hint: _bfd_elf_gc_mark() takes a suspiciously large amount of space in the picture.
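The flamegraph is only a rendering; the same hotspot also shows up in plain perf report output (a quick cross-check against the just-recorded perf.data):
$ perf report --stdio | head -n 20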
Let’s build the same picture for gold using the -optl-fuse-ld=gold option:
$ perf record -g <<NIX>>-ghc-9.10.3/bin/ghc --make ... '-optl-fuse-ld=gold' -fforce-recomp
$ perf script > out.perf
$ perl ~/dev/git/FlameGraph/stackcollapse-perf.pl out.perf > out.folded
$ perl ~/dev/git/FlameGraph/flamegraph.pl out.folded > gold.svg
The profile looks more balanced. The staircase around sorting looks peculiar: most of the time is spent on section sorting in gold::Output_section::sort_attached_input_sections. It’s also a great hint: how many sections must there be for sorting alone to take 25% of the link time?
section count
Where do these numerous sections come from? nixpkgs enables -split-sections by default on linux, which maps to ghc’s -fsplit-sections. That is very close in spirit to gcc’s -ffunction-sections feature: both place each function in a separate ELF section, assuming that the user will pass the -Wl,--gc-sections linker option to garbage-collect unreferenced sections in the final executable and make binaries smaller.
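The mechanism is easy to see in miniature with plain C (a small sketch of mine, not part of the measurements below):
$ printf 'int used(void) { return 1; }\nint unused(void) { return 2; }\nint main(void) { return used(); }\n' > demo.c
$ gcc -ffunction-sections -c demo.c
# one .text.<function> section per function: .text.used, .text.unused, .text.main
$ readelf -SW demo.o | grep ' \.text\.'
# after linking with --gc-sections the unreferenced unused() is gone
$ gcc demo.o -o demo -Wl,--gc-sections
$ nm demo | grep -c ' unused'
0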
So how many sections do you normally expect per object file? In C (with -ffunction-sections) I would expect the section count to be close to the function count in the source file. Some optimization passes duplicate (clone) or inline functions, but the expansion should not be too large (famous last words): hopefully not 100x, maybe closer to 1.5x? C++ might be trickier to reason about.
In haskell it’s a lot more complicated: the lazy evaluation model creates numerous smaller functions out of a single source function, and aggressive cross-module inlining brings in many more expressions.
Here is an example of a one-liner compilation and its section count:
-- # cat Main.hs
main = print "hello"
Quiz question: how many sections do you expect to see for this source file after ghc -c Main.hs?
Building and checking the unsplit form first:
$ ghc -c Main.hs -fforce-recomp
$ size Main.o
text data bss dec hex filename
378 304 0 682 2aa Main.o
$ readelf -SW Main.o
There are 13 section headers, starting at offset 0xb50:
Section Headers:
[Nr] Name Type Address Off Size ES Flg Lk Inf Al
[ 0] NULL 0000000000000000 000000 000000 00 0 0 0
[ 1] .text PROGBITS 0000000000000000 000040 00013a 00 AX 0 0 8
[ 2] .rela.text RELA 0000000000000000 000708 0001c8 18 I 10 1 8
[ 3] .data PROGBITS 0000000000000000 000180 000130 00 WA 0 0 8
[ 4] .rela.data RELA 0000000000000000 0008d0 000210 18 I 10 3 8
[ 5] .bss NOBITS 0000000000000000 0002b0 000000 00 WA 0 0 1
[ 6] .rodata.str PROGBITS 0000000000000000 0002b0 000010 01 AMS 0 0 1
[ 7] .note.GNU-stack PROGBITS 0000000000000000 0002c0 000000 00 0 0 1
[ 8] .comment PROGBITS 0000000000000000 0002c0 00000c 01 MS 0 0 1
[ 9] .note.gnu.property NOTE 0000000000000000 0002d0 000030 00 A 0 0 8
[10] .symtab SYMTAB 0000000000000000 000300 000228 18 11 5 8
[11] .strtab STRTAB 0000000000000000 000528 0001dd 00 0 0 1
[12] .shstrtab STRTAB 0000000000000000 000ae0 00006e 00 0 0 1
Key to Flags:
W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
L (link order), O (extra OS processing required), G (group), T (TLS),
C (compressed), x (unknown), o (OS specific), E (exclude),
D (mbind), l (large), p (processor specific)
378 bytes of code, 12 sections. 6 of them are relevant: .text, .rela.text, .data, .rela.data, .bss, .rodata.str. All very close to a typical C program.
Now let’s throw in -fsplit-sections.
Quiz question: guess how many more sections there will be? 0? 1? 10? 100? 1000?
$ ghc -c Main.hs -fforce-recomp -fsplit-sections
$ size Main.o
text data bss dec hex filename
365 304 0 669 29d Main.o
$ readelf -SW Main.o
There are 39 section headers, starting at offset 0xd90:
Section Headers:
[Nr] Name Type Address Off Size ES Flg Lk Inf Al
[ 0] NULL 0000000000000000 000000 000000 00 0 0 0
[ 1] .text PROGBITS 0000000000000000 000040 000000 00 AX 0 0 1
[ 2] .data PROGBITS 0000000000000000 000040 000000 00 WA 0 0 1
[ 3] .bss NOBITS 0000000000000000 000040 000000 00 WA 0 0 1
[ 4] .rodata.str..LrKs_bytes PROGBITS 0000000000000000 000040 000005 01 AMS 0 0 1
[ 5] .rodata.str..LrKq_bytes PROGBITS 0000000000000000 000045 000005 01 AMS 0 0 1
[ 6] .rodata.str.cKA_str PROGBITS 0000000000000000 00004a 000006 01 AMS 0 0 1
[ 7] .data..LsKw_closure PROGBITS 0000000000000000 000050 000028 00 WA 0 0 8
[ 8] .rela.data..LsKw_closure RELA 0000000000000000 0007c8 000030 18 I 36 7 8
[ 9] .data..LuKL_srt PROGBITS 0000000000000000 000078 000020 00 WA 0 0 8
[10] .rela.data..LuKL_srt RELA 0000000000000000 0007f8 000048 18 I 36 9 8
[11] .text..LsKu_info PROGBITS 0000000000000000 000098 000062 00 AX 0 0 8
[12] .rela.text..LsKu_info RELA 0000000000000000 000840 000090 18 I 36 11 8
[13] .data..LsKu_closure PROGBITS 0000000000000000 000100 000020 00 WA 0 0 8
[14] .rela.data..LsKu_closure RELA 0000000000000000 0008d0 000018 18 I 36 13 8
[15] .data..LuL2_srt PROGBITS 0000000000000000 000120 000028 00 WA 0 0 8
[16] .rela.data..LuL2_srt RELA 0000000000000000 0008e8 000060 18 I 36 15 8
[17] .text.Main_main_info PROGBITS 0000000000000000 000148 000069 00 AX 0 0 8
[18] .rela.text.Main_main_info RELA 0000000000000000 000948 0000a8 18 I 36 17 8
[19] .data.Main_main_closure PROGBITS 0000000000000000 0001b8 000020 00 WA 0 0 8
[20] .rela.data.Main_main_closure RELA 0000000000000000 0009f0 000018 18 I 36 19 8
[21] .data..LuLj_srt PROGBITS 0000000000000000 0001d8 000020 00 WA 0 0 8
[22] .rela.data..LuLj_srt RELA 0000000000000000 000a08 000048 18 I 36 21 8
[23] .text.ZCMain_main_info PROGBITS 0000000000000000 0001f8 000062 00 AX 0 0 8
[24] .rela.text.ZCMain_main_info RELA 0000000000000000 000a50 000090 18 I 36 23 8
[25] .data.ZCMain_main_closure PROGBITS 0000000000000000 000260 000020 00 WA 0 0 8
[26] .rela.data.ZCMain_main_closure RELA 0000000000000000 000ae0 000018 18 I 36 25 8
[27] .data..LrKr_closure PROGBITS 0000000000000000 000280 000010 00 WA 0 0 8
[28] .rela.data..LrKr_closure RELA 0000000000000000 000af8 000030 18 I 36 27 8
[29] .data..LrKt_closure PROGBITS 0000000000000000 000290 000010 00 WA 0 0 8
[30] .rela.data..LrKt_closure RELA 0000000000000000 000b28 000030 18 I 36 29 8
[31] .data.Main_zdtrModule_closure PROGBITS 0000000000000000 0002a0 000020 00 WA 0 0 8
[32] .rela.data.Main_zdtrModule_closure RELA 0000000000000000 000b58 000048 18 I 36 31 8
[33] .note.GNU-stack PROGBITS 0000000000000000 0002c0 000000 00 0 0 1
[34] .comment PROGBITS 0000000000000000 0002c0 00000c 01 MS 0 0 1
[35] .note.gnu.property NOTE 0000000000000000 0002d0 000030 00 A 0 0 8
[36] .symtab SYMTAB 0000000000000000 000300 0002e8 18 37 13 8
[37] .strtab STRTAB 0000000000000000 0005e8 0001dd 00 0 0 1
[38] .shstrtab STRTAB 0000000000000000 000ba0 0001ea 00 0 0 1
Key to Flags:
W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
L (link order), O (extra OS processing required), G (group), T (TLS),
C (compressed), x (unknown), o (OS specific), E (exclude),
D (mbind), l (large), p (processor specific)
38 sections! If we ignore the 6 irrelevant sections, that’s 32 relevant sections compared to 6 relevant sections before.
Our actual top-level main function is hiding in the .*Main_main.* sections; you will notice a lot of them. On top of that there is whatever ghc managed to “float out” of the function.
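By the way, readelf -h reads the section count straight off the ELF header; it’s the same trick used on the whole archive below:
$ readelf -h Main.o | grep 'Number of section headers'
  Number of section headers:         39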
This is unoptimized code. If we throw in -O2 we will get this:
$ ghc -c Main.hs -fforce-recomp -fsplit-sections -O2
$ size Main.o
text data bss dec hex filename
306 304 0 610 262 Main.o
$ readelf -SW Main.o
There are 45 section headers, starting at offset 0x10d0:
Section Headers:
[Nr] Name Type Address Off Size ES Flg Lk Inf Al
[ 0] NULL 0000000000000000 000000 000000 00 0 0 0
[ 1] .text PROGBITS 0000000000000000 000040 000000 00 AX 0 0 1
[ 2] .data PROGBITS 0000000000000000 000040 000000 00 WA 0 0 1
[ 3] .bss NOBITS 0000000000000000 000040 000000 00 WA 0 0 1
[ 4] .rodata.str.Main_zdtrModule2_bytes PROGBITS 0000000000000000 000040 000005 01 AMS 0 0 1
[ 5] .rodata.str.Main_zdtrModule4_bytes PROGBITS 0000000000000000 000045 000005 01 AMS 0 0 1
[ 6] .rodata.str.Main_main5_bytes PROGBITS 0000000000000000 00004a 000006 01 AMS 0 0 1
[ 7] .data.Main_main4_closure PROGBITS 0000000000000000 000050 000028 00 WA 0 0 8
[ 8] .rela.data.Main_main4_closure RELA 0000000000000000 000a60 000030 18 I 42 7 8
[ 9] .data..Lu1AB_srt PROGBITS 0000000000000000 000078 000020 00 WA 0 0 8
[10] .rela.data..Lu1AB_srt RELA 0000000000000000 000a90 000048 18 I 42 9 8
[11] .text.Main_main3_info PROGBITS 0000000000000000 000098 000062 00 AX 0 0 8
[12] .rela.text.Main_main3_info RELA 0000000000000000 000ad8 000090 18 I 42 11 8
[13] .data.Main_main3_closure PROGBITS 0000000000000000 000100 000020 00 WA 0 0 8
[14] .rela.data.Main_main3_closure RELA 0000000000000000 000b68 000018 18 I 42 13 8
[15] .data.Main_main2_closure PROGBITS 0000000000000000 000120 000020 00 WA 0 0 8
[16] .rela.data.Main_main2_closure RELA 0000000000000000 000b80 000048 18 I 42 15 8
[17] .text.Main_main1_info PROGBITS 0000000000000000 000140 000032 00 AX 0 0 8
[18] .rela.text.Main_main1_info RELA 0000000000000000 000bc8 000060 18 I 42 17 8
[19] .data.Main_main1_closure PROGBITS 0000000000000000 000178 000028 00 WA 0 0 8
[20] .rela.data.Main_main1_closure RELA 0000000000000000 000c28 000060 18 I 42 19 8
[21] .text.Main_main_info PROGBITS 0000000000000000 0001a0 00001d 00 AX 0 0 8
[22] .rela.text.Main_main_info RELA 0000000000000000 000c88 000030 18 I 42 21 8
[23] .data.Main_main_closure PROGBITS 0000000000000000 0001c0 000010 00 WA 0 0 8
[24] .rela.data.Main_main_closure RELA 0000000000000000 000cb8 000018 18 I 42 23 8
[25] .text.Main_main6_info PROGBITS 0000000000000000 0001d0 000024 00 AX 0 0 8
[26] .rela.text.Main_main6_info RELA 0000000000000000 000cd0 000030 18 I 42 25 8
[27] .data.Main_main6_closure PROGBITS 0000000000000000 0001f8 000020 00 WA 0 0 8
[28] .rela.data.Main_main6_closure RELA 0000000000000000 000d00 000048 18 I 42 27 8
[29] .text.ZCMain_main_info PROGBITS 0000000000000000 000218 00001d 00 AX 0 0 8
[30] .rela.text.ZCMain_main_info RELA 0000000000000000 000d48 000030 18 I 42 29 8
[31] .data.ZCMain_main_closure PROGBITS 0000000000000000 000238 000010 00 WA 0 0 8
[32] .rela.data.ZCMain_main_closure RELA 0000000000000000 000d78 000018 18 I 42 31 8
[33] .data.Main_zdtrModule3_closure PROGBITS 0000000000000000 000248 000010 00 WA 0 0 8
[34] .rela.data.Main_zdtrModule3_closure RELA 0000000000000000 000d90 000030 18 I 42 33 8
[35] .data.Main_zdtrModule1_closure PROGBITS 0000000000000000 000258 000010 00 WA 0 0 8
[36] .rela.data.Main_zdtrModule1_closure RELA 0000000000000000 000dc0 000030 18 I 42 35 8
[37] .data.Main_zdtrModule_closure PROGBITS 0000000000000000 000268 000020 00 WA 0 0 8
[38] .rela.data.Main_zdtrModule_closure RELA 0000000000000000 000df0 000048 18 I 42 37 8
[39] .note.GNU-stack PROGBITS 0000000000000000 000288 000000 00 0 0 1
[40] .comment PROGBITS 0000000000000000 000288 00000c 01 MS 0 0 1
[41] .note.gnu.property NOTE 0000000000000000 000298 000030 00 A 0 0 8
[42] .symtab SYMTAB 0000000000000000 0002c8 000378 18 43 2 8
[43] .strtab STRTAB 0000000000000000 000640 00041a 00 0 0 1
[44] .shstrtab STRTAB 0000000000000000 000e38 000295 00 0 0 1
Key to Flags:
W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
L (link order), O (extra OS processing required), G (group), T (TLS),
C (compressed), x (unknown), o (OS specific), E (exclude),
D (mbind), l (large), p (processor specific)
44 sections! A lot.
Back to our pandoc binary. It just calls into the libHSpandoc.a library (nixpkgs uses static linking for haskell bits today). The linker has to wade through all of its sections to pick out only the used things.
Quiz question: guess how many sections the pandoc library has. 100? 1000? 10000? A million? What is your guess?
Let’s count! On my system the pandoc library happens to hide at
<<NIX>>-pandoc-3.7.0.2/lib/ghc-9.10.3/lib/x86_64-linux-ghc-9.10.3-2870/pandoc-3.7.0.2-Af80LA3Iq30D5LRTMZUszs/libHSpandoc-3.7.0.2-Af80LA3Iq30D5LRTMZUszs.a.
I’ll just use that ugly path.
# dumping section count per individual object file in the archive:
$ readelf -h <<NIX>>-pandoc-3.7.0.2/lib/ghc-9.10.3/lib/x86_64-linux-ghc-9.10.3-2870/pandoc-3.7.0.2-Af80LA3Iq30D5LRTMZUszs/libHSpandoc-3.7.0.2-Af80LA3Iq30D5LRTMZUszs.a |
grep 'Number of section'
Number of section headers: 18
...
Number of section headers: 26207
Number of section headers: 16755
...
Number of section headers: 8898
..
Number of section headers: 1047
Number of section headers: 1324
...
Number of section headers: 687
Number of section headers: 228
# summing up all section counts:
$ readelf -h <<NIX>>-pandoc-3.7.0.2/lib/ghc-9.10.3/lib/x86_64-linux-ghc-9.10.3-2870/pandoc-3.7.0.2-Af80LA3Iq30D5LRTMZUszs/libHSpandoc-3.7.0.2-Af80LA3Iq30D5LRTMZUszs.a |
grep 'Number of section' | awk '{ size += $5 } END { print size }'
494450
494 thousand sections! Almost half a million. And that’s just one (the largest) of many pandoc dependencies. That’s why ld.gold takes a considerable amount of time just sorting through all these sections.
performance hog clues
ld.bfd has an even harder time getting through such a big list of sections. But why exactly? The names of the _bfd_elf_gc_mark / _bfd_elf_gc_mark_reloc functions in the profiles hint that they track unreferenced sections. To trigger section garbage collection ghc uses the -Wl,--gc-sections linker option.
Note that ghc only enables garbage collection for GNU ld (I think both ld.gold and ld.bfd count as GNU). lld and mold support the -Wl,--gc-sections option as well.
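For experiments you can forward the flags yourself regardless of what ghc decides: -optl passes an option through to the link step (a sketch):
$ ghc --make Main.hs -optl-fuse-ld=mold -optl-Wl,--gc-sections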
If we look at the implementation of _bfd_elf_gc_mark (below is its helper _bfd_elf_gc_mark_reloc) we see a suspicious list traversal and some recursive descent:
bool
_bfd_elf_gc_mark_reloc (struct bfd_link_info *info,
                        asection *sec,
                        elf_gc_mark_hook_fn gc_mark_hook,
                        struct elf_reloc_cookie *cookie)
{
  asection *rsec;
  bool start_stop = false;

  rsec = _bfd_elf_gc_mark_rsec (info, sec, gc_mark_hook, cookie, &start_stop);
  while (rsec != NULL)
    {
      if (!rsec->gc_mark)
        {
          if (bfd_get_flavour (rsec->owner) != bfd_target_elf_flavour
              || (rsec->owner->flags & DYNAMIC) != 0)
            rsec->gc_mark = 1;
          else if (!_bfd_elf_gc_mark (info, rsec, gc_mark_hook))
            return false;
        }
      if (!start_stop)
        break;
      rsec = bfd_get_next_section_by_name (rsec->owner, rsec);
    }
  return true;
}
If we do such marking for each section it probably has quadratic complexity: the bfd_get_next_section_by_name() loop walks same-named sections one at a time, so marking each of n sections can degenerate into roughly n² list steps. But maybe not. To get something simpler to explore I tried to craft a trivial example with 1 million sections:
$ for (( i=0; i<1000000; i++ )); do printf "int var_$i __attribute__ ((section (\".data.$i\"))) = { $i };\n"; done > main.c; printf "int main() {}" >> main.c; gcc -c main.c -o main.o; echo "bfd:"; time gcc main.o -o main -fuse-ld=bfd; echo "gold:"; time gcc main.o -o main -fuse-ld=gold
bfd:
real 0m6,123s
user 0m5,384s
sys 0m0,701s
gold:
real 0m1,107s
user 0m0,844s
sys 0m0,242s
This test generates the following boilerplate code:
// $ head -n 5 main.c
int var_0 __attribute__ ((section (".data.0"))) = { 0 };
int var_1 __attribute__ ((section (".data.1"))) = { 1 };
int var_2 __attribute__ ((section (".data.2"))) = { 2 };
int var_3 __attribute__ ((section (".data.3"))) = { 3 };
int var_4 __attribute__ ((section (".data.4"))) = { 4 };
// ...
// $ tail -n 5 main.c
int var_999996 __attribute__ ((section (".data.999996"))) = { 999996 };
int var_999997 __attribute__ ((section (".data.999997"))) = { 999997 };
int var_999998 __attribute__ ((section (".data.999998"))) = { 999998 };
int var_999999 __attribute__ ((section (".data.999999"))) = { 999999 };
int main() {}
We are seeing a 6x time difference between the linkers. Could it be our case? Let’s check with perf whether we hit the same hot paths as in the pandoc case:
$ perf record -g gcc main.o -o main -fuse-ld=bfd
$ perf script > out.perf && perl ~/dev/git/FlameGraph/stackcollapse-perf.pl out.perf > out.folded && perl ~/dev/git/FlameGraph/flamegraph.pl out.folded > try1.svg
The profile does not look too bad: no large .*_gc_.* bits anywhere. Not exactly our problem, then. Let’s add more references across sections to see if we start hitting the symbol traversals:
$ printf "int var_0 __attribute__ ((section (\".data.0\"))) = { $i };\n" > main.c; for (( i=1; i<1000000; i++ )); do printf "void * var_$i __attribute__ ((section (\".data.$i\"))) = { &var_$((i-1)) };\n"; done >> main.c; printf "int main() { return (long)var_99999; }" >> main.c; gcc -c main.c -o main.o; echo "bfd:"; time gcc main.o -o main -fuse-ld=bfd -Wl,--gc-sections -Wl,--no-as-needed; echo "gold:"; time gcc main.o -o main -fuse-ld=gold -Wl,--gc-sections -Wl,--no-as-needed
gcc: internal compiler error: Segmentation fault signal terminated program cc1
Please submit a full bug report, with preprocessed source (by using -freport-bug).
See <https://gcc.gnu.org/bugs/> for instructions.
Whoops, I crashed gcc. Filed PR122198.
It’s a stack overflow. Throwing more stack at the problem with ulimit -s unlimited:
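# bash builtin: lift the soft stack limit for this shell (and its children) before re-running the test
$ ulimit -s unlimited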
$ printf "int var_0 __attribute__ ((section (\".data.0\"))) = { $i };\n" > main.c; for (( i=1; i<1000000; i++ )); do printf "void * var_$i __attribute__ ((section (\".data.$i\"))) = { &var_$((i-1)) };\n"; done >> main.c; printf "int main() { return (long)var_99999; }" >> main.c; gcc -c main.c -o main.o; echo "bfd:"; time gcc main.o -o main -fuse-ld=bfd -Wl,--gc-sections -Wl,--no-as-needed; echo "gold:"; time gcc main.o -o main -fuse-ld=gold -Wl,--gc-sections -Wl,--no-as-needed
bfd:
real 0m5.296s
user 0m4.572s
sys 0m0.714s
gold:
real 0m1.172s
user 0m0.874s
sys 0m0.295s
This test generates a slightly different form:
// $ head -n 5 main.c
int var_0 __attribute__ ((section (".data.0"))) = { 1000000 };
void * var_1 __attribute__ ((section (".data.1"))) = { &var_0 };
void * var_2 __attribute__ ((section (".data.2"))) = { &var_1 };
void * var_3 __attribute__ ((section (".data.3"))) = { &var_2 };
void * var_4 __attribute__ ((section (".data.4"))) = { &var_3 };
// ...
// $ tail -n 5 main.c
void * var_999996 __attribute__ ((section (".data.999996"))) = { &var_999995 };
void * var_999997 __attribute__ ((section (".data.999997"))) = { &var_999996 };
void * var_999998 __attribute__ ((section (".data.999998"))) = { &var_999997 };
void * var_999999 __attribute__ ((section (".data.999999"))) = { &var_999998 };
int main() { return (long)var_99999; }
The main difference from the previous attempt is that the sections are now used and can’t be removed by the section garbage collector. Is this profile closer to the pandoc case? Looking at the trace:
$ perf record -g gcc main.o -o main -fuse-ld=bfd -Wl,--gc-sections -Wl,--no-as-needed
$ perf script > out.perf && perl ~/dev/git/FlameGraph/stackcollapse-perf.pl out.perf > out.folded && perl ~/dev/git/FlameGraph/flamegraph.pl out.folded > try2.svg
It’s not too close to the pandoc case, but the huge vertical tower on the left is a good sign that we at least start hitting the gc traversal code. Now we need to increase its presence somehow.
Perhaps it takes more symbols per section? I tried to simulate code references instead of data references to see what happens:
$ printf "int f_0() __attribute__ ((section (\".text.0\"))); int f_0() { return 0; };\n" > main.c; for (( i=1; i<20000; i++ )); do printf "int f_$i() __attribute__ ((section (\".text.$i\"))); int f_$i() { return f_$((i-1))(); };\n"; done >> main.c; printf "int main() { return f_19999(); }" >> main.c; gcc -O0 -c main.c -o main.o; echo "bfd:"; time gcc main.o -o main -fuse-ld=bfd -Wl,--gc-sections -Wl,--no-as-needed; echo "gold:"; time gcc main.o -o main -fuse-ld=gold -Wl,--gc-sections -Wl,--no-as-needed
bfd:
real 0m5,627s
user 0m2,567s
sys 0m3,047s
gold:
real 0m0,053s
user 0m0,041s
sys 0m0,012s
This test generates the following file:
// $ head -n 5 main.c
int f_0() __attribute__ ((section (".text.0"))); int f_0() { return 0; };
int f_1() __attribute__ ((section (".text.1"))); int f_1() { return f_0(); };
int f_2() __attribute__ ((section (".text.2"))); int f_2() { return f_1(); };
int f_3() __attribute__ ((section (".text.3"))); int f_3() { return f_2(); };
int f_4() __attribute__ ((section (".text.4"))); int f_4() { return f_3(); };
// ..
// $ tail -n 5 main.c
int f_19996() __attribute__ ((section (".text.19996"))); int f_19996() { return f_19995(); };
int f_19997() __attribute__ ((section (".text.19997"))); int f_19997() { return f_19996(); };
int f_19998() __attribute__ ((section (".text.19998"))); int f_19998() { return f_19997(); };
int f_19999() __attribute__ ((section (".text.19999"))); int f_19999() { return f_19998(); };
int main() { return f_19999(); }
That time difference is more interesting! Note that ld.gold spends just 50ms on the input while ld.bfd does something for 5 seconds!
Getting the profile:
$ perf record -g gcc main.o -o main -fuse-ld=bfd -Wl,--gc-sections -Wl,--no-as-needed
$ perf script > out.perf && perl ~/dev/git/FlameGraph/stackcollapse-perf.pl out.perf > out.folded && perl ~/dev/git/FlameGraph/flamegraph.pl out.folded > try3.svg
At last! We managed to hit exactly the _bfd_elf_gc_mark() path.
The final input object file is not too large:
$ ls -lh main.o
-rw-r--r-- 1 slyfox users 5.7M Oct 8 20:58 main.o
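A quick sanity check of the section load it carries (roughly 40 thousand headers are expected: one .text.* per function plus one .rela.text.* for each function that calls another):
$ readelf -h main.o | grep 'Number of section headers'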
I filed the upstream bug report PR33530 hoping that the fix would not be too complicated, and started writing this blog post.
testing the patch
My plan was to figure out more details about the binutils gc in this post and to try to fix it. Alas, even before I got to it, H.J. had already prepared the fix!
The synthetic test showed great results:
$ printf "int f_0() __attribute__ ((section (\".text.0\"))); int f_0() { return 0; };\n" > main.c; for (( i=1; i<20000; i++ )); do printf "int f_$i() __attribute__ ((section (\".text.$i\"))); int f_$i() { return f_$((i-1))(); };\n"; done >> main.c; printf "int main() { return f_19999(); }" >> main.c; gcc -O0 -c main.c -o main.o; echo "bfd:"; time gcc main.o -o main -fuse-ld=bfd -Wl,--gc-sections
bfd:
real 0m0,119s
user 0m0,080s
sys 0m0,038s
120ms compared to the previous 5s is a 40x speedup. It’s still about twice as slow as the 50ms of ld.gold, but the absolute time is way harder to notice. I tried to find a new degradation point by adding more functions:
# 100K sections:
$ printf "int f_0() __attribute__ ((section (\".text.0\"))); int f_0() { return 0; };\n" > main.c; for (( i=1; i<100000; i++ )); do printf "int f_$i() __attribute__ ((section (\".text.$i\"))); int f_$i() { return f_$((i-1))(); };\n"; done >> main.c; printf "int main() { return f_99999(); }" >> main.c; gcc -O0 -c main.c -o main.o; echo "bfd:"; time gcc main.o -o main -fuse-ld=bfd -Wl,--gc-sections
bfd:
real 0m0.628s
user 0m0.472s
sys 0m0.154s
# 1M sections:
$ printf "int f_0() __attribute__ ((section (\".text.0\"))); int f_0() { return 0; };\n" > main.c; for (( i=1; i<1000000; i++ )); do printf "int f_$i() __attribute__ ((section (\".text.$i\"))); int f_$i() { return f_$((i-1))(); };\n"; done >> main.c; printf "int main() { return f_999999(); }" >> main.c; gcc -O0 -c main.c -o main.o; echo "bfd:"; time gcc main.o -o main -fuse-ld=bfd -Wl,--gc-sections
bfd:
real 0m8.697s
user 0m6.956s
sys 0m1.726s
$ size main.o
text data bss dec hex filename
43000115 0 0 43000115 2902133 main.o
8s on a file with 1M sections sounds quite fast! Let’s see where ld.bfd spends its time now:
$ perf record -g gcc main.o -o main -fuse-ld=bfd -Wl,--gc-sections
$ perf script > out.perf && perl ~/dev/git/FlameGraph/stackcollapse-perf.pl out.perf > out.folded && perl ~/dev/git/FlameGraph/flamegraph.pl out.folded > fixed.svg
Once again, the profile looks a lot more balanced now. Yay! How about the real pandoc?
$ time nix build --no-link -f. pandoc --rebuild
real 0m17,013s
user 0m0,672s
sys 0m0,123s
17s is a bit slower than the 13s of ld.gold, but not as bad as it used to be. And its profile:
$ perf record -g <<NIX>>-ghc-9.10.3/bin/ghc --make ... '-optl-fuse-ld=bfd' -fforce-recomp
$ perf script > out.perf && perl ~/dev/git/FlameGraph/stackcollapse-perf.pl out.perf > out.folded && perl ~/dev/git/FlameGraph/flamegraph.pl out.folded > fixed-pandoc.svg
bonus: other linkers
Both ld.bfd and ld.gold are quite fast at handling the synthetic tests now, but both still spend a few seconds on pandoc. How about other linkers? Let’s add lld and mold to the mix. I’ll measure ghc --make '-optl-fuse-ld=$linker' -fforce-recomp execution time:
# without the fix
$ time <<NIX>>-ghc-9.10.3/bin/ghc --make ... pandoc ... -optl-fuse-ld=bfd -fforce-recomp
real 0m30.589s user 0m24.576s sys 0m6.447s
# with the fix
$ time <<NIX>>-ghc-9.10.3/bin/ghc --make ... pandoc ... -optl-fuse-ld=bfd -fforce-recomp
real 0m8.676s user 0m6.484s sys 0m2.584s
$ time <<NIX>>-ghc-9.10.3/bin/ghc --make ... pandoc ... -optl-fuse-ld=gold -fforce-recomp
real 0m5.929s user 0m5.543s sys 0m0.829s
$ time <<NIX>>-ghc-9.10.3/bin/ghc --make ... pandoc ... -optl-fuse-ld=lld -fforce-recomp
real 0m1.413s user 0m1.754s sys 0m1.509s
$ time <<NIX>>-ghc-9.10.3/bin/ghc --make ... pandoc ... -optl-fuse-ld=mold -fforce-recomp
real 0m1.209s user 0m0.424s sys 0m0.215s
Note: we measure not just the linker time but also ghc code generation time; the actual link does not take as long. The same values in a table:
| linker         | real (sec) | user (sec) | sys (sec) |
|----------------|------------|------------|-----------|
| ld.bfd (orig)  | 30.6       | 24.6       | 6.4       |
| ld.bfd (fixed) | 8.7        | 6.5        | 2.6       |
| ld.gold        | 5.9        | 5.5        | 0.8       |
| lld            | 1.4        | 1.8        | 1.5       |
| mold           | 1.2        | 0.4        | 0.2       |
lld and mold link times are impressive! They are 6-7x faster than the fixed version of ld.bfd. Let’s look at their profiles to see what they spend their time on. lld goes first:
$ perf record -g <<NIX>>-ghc-9.10.3/bin/ghc --make ... pandoc ... -optl-fuse-ld=lld -fforce-recomp
$ perf script > out.perf && perl ~/dev/git/FlameGraph/stackcollapse-perf.pl out.perf > out.folded && perl ~/dev/git/FlameGraph/flamegraph.pl out.folded > lld-pandoc.svg
The only things that stick out in this profile are malloc() (about 20% of the time?) and memmove() calls (about 6% of the time). Otherwise we see roughly equal time spent on reading (the linkerDriver part on the right), relocation processing (RelocationScanner in the middle) and writing (HashTableSection::writeTo slightly to the left).
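With malloc() that prominent, one cheap experiment is to preload a faster allocator into the link (a sketch; the libmimalloc.so path is a stand-in for wherever your allocator of choice lives):
$ LD_PRELOAD=/path/to/libmimalloc.so <<NIX>>-ghc-9.10.3/bin/ghc --make ... pandoc ... -optl-fuse-ld=lld -fforce-recomp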
I wonder: if memory management were optimized for lld, would it be as fast as mold on this input? And now the mold profile:
$ perf record -g <<NIX>>-ghc-9.10.3/bin/ghc --make ... pandoc ... -optl-fuse-ld=mold -fforce-recomp
$ perf script > out.perf && perl ~/dev/git/FlameGraph/stackcollapse-perf.pl out.perf > out.folded && perl ~/dev/git/FlameGraph/flamegraph.pl out.folded > mold-pandoc.svg
The profile is not very readable as we mainly see the tbb threading machinery dispatching chunks of work. If I scroll around I see things like scan_abs_relocations, apply_reloc_alloc, resolve_symbols, split_contents, scan_relocations. It looks like half the time is spent on managing the parallelism. Playing a bit with the mold parameters I noticed that -Wl,--threads=4 is the maximum thread count at which I still get any speed improvement; anything above that clutters the CPU usage profile with sched_yield “busy wait” threads.
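That thread count can be pinned through ghc the same way as the other linker flags (the flag itself is mold’s):
$ <<NIX>>-ghc-9.10.3/bin/ghc --make ... pandoc ... -optl-fuse-ld=mold -optl-Wl,--threads=4 -fforce-recomp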
To get an idea of where the actual CPU time goes it might be more interesting to look at a single-threaded profile using -optl-Wl,--no-threads:
$ perf record -g <<NIX>>-ghc-9.10.3/bin/ghc --make ... pandoc ... -optl-fuse-ld=mold -optl-Wl,--no-threads -fforce-recomp
$ perf script > out.perf && perl ~/dev/git/FlameGraph/stackcollapse-perf.pl out.perf > out.folded && perl ~/dev/git/FlameGraph/flamegraph.pl out.folded > mold-no-threads.svg
I’ll leave the interpretation of the picture to the reader.
parting words
ld.gold is about to be removed from binutils.
In -fsplit-sections mode the ghc code generator produces a huge amount of sections per ELF file, which manages to strain both the ld.bfd and ld.gold linkers. ghc should probably be fixed to produce one section per strongly connected component of mutually referencing functions.
libHSpandoc consists of almost half a million ELF sections, with an average section size of about 250 bytes.
ld.bfd, while slower than ld.gold, still has simple performance bugs in obscure scenarios, just like today’s PR33530 example.
binutils upstream was very fast to come up with a possible fix to test.
Even the fixed ld.bfd is still quite a bit slower than ld.gold on the synthetic test with a huge section count. But at least it’s no longer quadratically worse.
lld and mold are still way faster than either ld.bfd or ld.gold: about 6-7x on the pandoc test. ld.bfd still has a lot of room for improvement :)
Have fun!