Mysterious patchelf linkage bug
Today’s mystery started from this build failure
noticed by Vladimir on staging-next
branch of nixpkgs
.
There aarch64-linux
build
failed
to compile patchelf
as part of stdenv
with this mysterious error:
Making all in src
make[1]: Entering directory '/build/patchelf-0.15.0/src'
g++ -DPACKAGE_NAME=\"patchelf\" -DPACKAGE_TARNAME=\"patchelf\" -DPACKAGE_VERSION=\"0.15.0\" -DPACKAGE_STRING=\"patchelf\ 0.15.0\" -DPACKAGE_BUGREPORT=\"\" -DPACKAGE_URL=\"\" -DPACKAGE=\"patchelf\" -DVERSION=\"0.15.0\" -I. -Wall -std=c++17 -D_FILE_OFFSET_BITS=64 -g -O2 -c -o patchelf.o patchelf.cc
g++ -Wall -std=c++17 -D_FILE_OFFSET_BITS=64 -g -O2 -o patchelf patchelf.o
ld: patchelf.o: in function `__gnu_cxx::__exchange_and_add(int volatile*, int)':
...-xgcc-12.3.0/include/c++/12.3.0/ext/atomicity.h:66:
undefined reference to `__aarch64_ldadd4_acq_rel'
ld: ...-xgcc-12.3.0/include/c++/12.3.0/ext/atomicity.h:66:
undefined reference to `__aarch64_ldadd4_acq_rel'
ld: ...-xgcc-12.3.0/include/c++/12.3.0/ext/atomicity.h:66:
undefined reference to `__aarch64_ldadd4_acq_rel'
ld: ...-xgcc-12.3.0/include/c++/12.3.0/ext/atomicity.h:66:
undefined reference to `__aarch64_ldadd4_acq_rel'
ld: patchelf.o: in function `__gnu_cxx::__atomic_add(int volatile*, int)':
...-xgcc-12.3.0/include/c++/12.3.0/ext/atomicity.h:71:
undefined reference to `__aarch64_ldadd4_acq_rel'
ld: patchelf.o:...-xgcc-12.3.0/include/c++/12.3.0/ext/atomicity.h:71:
more undefined references to `__aarch64_ldadd4_acq_rel' follow
collect2: error: ld returned 1 exit status
It is the build.log
in almost all of it’s entirety. I had a small
change in staging-next
branch which should absolutely not cause that
failure. But I was not sure (one can never be sure when it comes to the
toolchain bugs).
Quick quiz: why does it happen? A gcc
bug? A binutils
bug? Wrong
library lookup paths in cc-wrapper
or ld-wrapper
? Or a heisenbug?
First hypothesis
I had some past experience with errors like that in recent the past here and here.
In both cases it was a nixpkgs
-specific problem of mixing multiple
toolchain versions in a single stdenv
. I expect this kind of problem
to come back from time to time until bootstrap process is changed to
avoid any mixing of toolchain versions. Thus I was looking forward to
debug yet another one of those.
I ran the bisect and got the Merge pull request #245550 from trofi/gcc-restore-sys-include. as the culprit.
My own commit! Uh-on. One of the problems is that it’s a merge commit of
the change, not the change itself. Why did bisect skip the change
itself? Why did the PR cause this failure at all? It did not make sense.
Reverting the commit on top of staging-next
did fix the patchelf
linkage. Should it be the culprit then? I was about to submit the revert
and move on to less cryptic things.
But for some reason just before giving up I tried to run --rebuild
on
a random “good” commit and got the patchelf
linkage failure! That
meant the error was non-deterministic. It did not make sense.
On one hand sys-include
commit above is quite relevant to the way
gcc
bootstraps. On the other hand it is not supposed to lead to
non-deterministic failures.
I had to start from the first principles to see where and how linkage process breaks.
What is this error about?
Let’s look at the specifics of code using atomics in gcc
.
libstdc++
(gcc
s c++
template library) uses atomic operations in
various containers. For example <string>
uses atomics to implement
copy-on-write semantics. Naturally patchelf
uses a bit of
std::string
as well. Thus it’s expected to use a bit of atomics like
__atomic_fetch_add()
builtin.
Each architecture implements atomics in a slightly different way: some
get away with a single instruction, some require quite a bit of code.
Let’s have a look at x86_64
and aarch64
to see how close they are.
I’ll use the very /nix/store/c7qmp1dgqf3hh4fjw74y2k662nmaslcy-xgcc-12.3.0/include/c++/12.3.0/ext/atomicity.h:66
source code bit from the error we saw above.
// atomic.c
int f(int * p, int v) {
return __atomic_fetch_add(p, v, __ATOMIC_ACQ_REL);
}
The function f()
here atomically adds a v
value stored at p
and
returns some result.
x86_64
generates the following code:
$ gcc -O2 -S a.c -o -
f:
movl %esi, %eax
lock xaddl %eax, (%rdi)
ret
It’s literally one lock xaddl
instruction. How about aarch64
? Is it
as simple?
$ aarch64-unknown-linux-gnu-gcc -O2 -S a.c -o -
.arch armv8-a
f:
mov x2, x0
stp x29, x30, [sp, -16]!
mov w0, w1
mov x29, sp
mov x1, x2
bl __aarch64_ldadd4_acq_rel
ldp x29, x30, [sp], 16
ret
We see a big difference here: gcc
emits call into outside
__aarch64_ldadd4_acq_rel
function. It’s code hides in libgcc.a
:
__aarch64_ldadd4_acq_rel:
0: bti c
4: adrp x16, 0 <__aarch64_have_lse_atomics>
4: R_AARCH64_ADR_PREL_PG_HI21 __aarch64_have_lse_atomics
8: ldrb w16, [x16]
8: R_AARCH64_LDST8_ABS_LO12_NC __aarch64_have_lse_atomics
c: cbz w16, 18 <__aarch64_ldadd4_acq_rel+0x18>
10: ldaddal w0, w0, [x1]
14: ret
18: mov w16, w0
1c: ldaxr w0, [x1]
20: add w17, w0, w16
24: stlxr w15, w17, [x1]
28: cbnz w15, 1c <__aarch64_ldadd4_acq_rel+0x1c>
2c: ret
Quite a bit of code here as well: if CPU supports ldaddal
then
libgcc.a
uses that. Otherwise it falls back to
ldaxr; add; stlxr; cbnz
.
This amount of code is probably the reason why the code is not inlined
by gcc
. For comparison clang
does something a bit different:
$ clang -O2 -S a.c -o - -target aarch64-unknown-linux
f:
ldaxr w8, [x0]
add w9, w8, w1
stlxr w10, w9, [x0]
cbnz w10, f
mov w0, w8
ret
Here clang
inlined ldaxr; add; stlxr; cbnz
and did not consider
ldaddal
at all. If we nudge gcc
a bit we can force it to inline
ldaddal
with -march=
flag:
$ aarch64-unknown-linux-gnu-gcc -O2 -S a.c -o - -Wall -march=armv8.1-a
f:
ldaddal w1, w0, [x0]
ret
And the same for clang
:
$ clang -O2 -S a.c -o - -target aarch64-unknown-linux -march=armv8.1-a
f:
ldaddal w1, w0, [x0]
ret
All of the above tells us that sometimes gcc
generates calls into
“external” libraries like libgcc.a
(or libgcc_s.so
) to implement
larger primitives. And this decision might depend on the backend
architecture and compiler flags.
Looking at the linkage command
I filed the bug to start capturing important details of the bug to see how far into it I can get.
First I needed to extract exact attribute failing to build on aarch64
.
I added boot.binfmt.emulatedSystems = [ "aarch64-linux" ];
to
/etc/nixos/configuration.nix
to get binfmt-misc
wrapper and ran a
few build commands like:
$ nix build --no-link -f. --argstr system aarch64-linux patchelf -L
to point at the exact failing attribute. It ended up being
stdenv.__bootPackages.stdenv.__bootPackages.stdenv.__bootPackages.patchelf
.
The attribute points right in the middle of stdenv
bootstrap.
I dropped into interactive shell to poke at the exact details of build
failure against good
and bad
states with nix develop
:
$ nix develop -f. --argstr system aarch64-linux patchelf
error (ignored): error: '--arg' and '--argstr' are incompatible with flakes
$$ genericBuild
...
$$ cd src
$$ g++ -Wall -std=c++17 -D_FILE_OFFSET_BITS=64 -g -O2 -o patchelf patchelf.o
ld: patchelf.o: in function `__gnu_cxx::__exchange_and_add(int volatile*, int)':
...-xgcc-12.3.0/include/c++/12.3.0/ext/atomicity.h:66:
undefined reference to `__aarch64_ldadd4_acq_rel'
...
Now we can check where exactly linker looks the libraries up by adding
-Wl,--verbose
:
$$ g++ -Wall -std=c++17 -D_FILE_OFFSET_BITS=64 -g -O2 -o patchelf patchelf.o -Wl,--verbose
...
==================================================
ld: mode aarch64linux
attempt to open ...-glibc-2.37-8/lib/crt1.o succeeded
...-glibc-2.37-8/lib/crt1.o
attempt to open ...-glibc-2.37-8/lib/crti.o succeeded
...
Here is the diff against bad
and good
environments:
$ diff -u <(cat /tmp/bad | unnix) <(cat /tmp/good | unnix)--- /dev/fd/63 2023-07-30 08:26:51.561118824 +0100
+++ /dev/fd/62 2023-07-30 08:26:51.561118824 +0100
@@ -273,6 +273,8 @@
attempt to open /<<NIX>>/xgcc-12.3.0/lib/gcc/aarch64-unknown-linux-gnu/12.3.0/libgcc.so failed
attempt to open /<<NIX>>/xgcc-12.3.0/lib/gcc/aarch64-unknown-linux-gnu/12.3.0/libgcc.a succeeded
/<<NIX>>/xgcc-12.3.0/lib/gcc/aarch64-unknown-linux-gnu/12.3.0/libgcc.a+(/<<NIX>>/xgcc-12.3.0/lib/gcc/aarch64-unknown-linux-gnu/12.3.0/libgcc.a)ldadd_4_4.o
+(/<<NIX>>/xgcc-12.3.0/lib/gcc/aarch64-unknown-linux-gnu/12.3.0/libgcc.a)lse-init.o
/<<NIX>>/xgcc-12.3.0/lib/gcc/aarch64-unknown-linux-gnu/12.3.0/libgcc.a
attempt to open /<<NIX>>/bootstrap-stage0-glibc-bootstrapFiles/lib/libgcc.so failed attempt to open /<<NIX>>/bootstrap-stage0-glibc-bootstrapFiles/lib/libgcc.a failed
I expected some difference in library version outputs, some order
difference in path lookups. But we see nothing like that here. The only
change is the difference in access to individual object files in
libgcc.a
: ldadd_4_4.o
and lse-init.o
which we expect in both cases.
And on inspection of libgcc.a
I noticed that in bad
case it was
corrupted:
$ nm ...-xgcc-12.3.0/lib/gcc/aarch64-unknown-linux-gnu/12.3.0/libgcc.a
nm: ...-xgcc-12.3.0/lib/gcc/aarch64-unknown-linux-gnu/12.3.0/libgcc.a:
malformed archive
How did we manage to corrupt libgcc.a
? A bit of strip log spills the
clue:
stripping (with command strip and flags -S -p) in
...-xgcc-12.3.0/lib/gcc/aarch64-unknown-linux-gnu/12.3.0/libgcc.a
...-xgcc-12.3.0/
lib/gcc/aarch64-unknown-linux-gnu/12.3.0/libgcc_eh.a
...-xgcc-12.3.0/lib/gcc/aarch64-unknown-linux-gnu/12.3.0/libgcov.a
...-xgcc-12.3.0/lib/gcc/aarch64-unknown-linux-gnu/12.3.0/crtbegin.o
...-xgcc-12.3.0/lib/gcc/aarch64-unknown-linux-gnu/12.3.0/crtbeginS.o
...-xgcc-12.3.0/lib/gcc/aarch64-unknown-linux-gnu/12.3.0/crtbeginT.o
...-xgcc-12.3.0/lib/gcc/aarch64-unknown-linux-gnu/12.3.0/crtend.o
...-xgcc-12.3.0/lib/gcc/aarch64-unknown-linux-gnu/12.3.0/crtendS.o
...-xgcc-12.3.0/lib/gcc/aarch64-unknown-linux-gnu/12.3.0/crtfastmath.o
...-xgcc-12.3.0/lib64/gcc/aarch64-unknown-linux-gnu/12.3.0/libgcc.a
...-xgcc-12.3.0/lib64/gcc/aarch64-unknown-linux-gnu/12.3.0/libgcc_eh.a
...-xgcc-12.3.0/lib64/gcc/aarch64-unknown-linux-gnu/12.3.0/libgcov.a
...-xgcc-12.3.0/lib64/gcc/aarch64-unknown-linux-gnu/12.3.0/crtbegin.o
...-xgcc-12.3.0/lib64/gcc/aarch64-unknown-linux-gnu/12.3.0/crtbeginS.o
...-xgcc-12.3.0/lib64/gcc/aarch64-unknown-linux-gnu/12.3.0/crtbeginT.o
...-xgcc-12.3.0/lib64/gcc/aarch64-unknown-linux-gnu/12.3.0/crtend.o
...-xgcc-12.3.0/lib64/gcc/aarch64-unknown-linux-gnu/12.3.0/crtendS.o
...-xgcc-12.3.0/lib64/gcc/aarch64-unknown-linux-gnu/12.3.0/crtfastmath.o
It’s somewhat hard to read but we have libgcc.a
twice in the list:
...-xgcc-12.3.0/lib/gcc/aarch64-unknown-linux-gnu/12.3.0/libgcc.a
...-xgcc-12.3.0/lib64/gcc/aarch64-unknown-linux-gnu/12.3.0/libgcc.a
Both are the same file as lib64
is a symlink to lib
:
$ ls -ld ...-xgcc-12.3.0/lib64
lrwxrwxrwx 1419 root root 3 Jan 1 1970 ...-xgcc-12.3.0/lib64 -> lib
Race condition
So even if the file was stripped twice, why would it be a problem? Isn’t
strip
idempotent in this regard?
It would as long as strip
is ran sequentially. And it used to be the
case for a while. But a few weeks ago the strip
hook was made parallel
in PR #207101!
As a result two strip
commands had a chance to open libgcc.a
, strip
it and write the result back. Sometimes you are lucky and you get
something that works. But sometimes you are not so lucky and one of the
stip
commands reads incompletely written libgcc.a
by the previous
strip
.
The fix
The fix (or the workaround) is not to attemt to process the same file
twice. PR #246164
implements naive form of the symlink resolution via realpath
and
double-strip avoidance via sort -u
:
--- a/pkgs/build-support/setup-hooks/strip.sh
+++ b/pkgs/build-support/setup-hooks/strip.sh
@@ -68,6 +68,11 @@ stripDirs() {
striperr="$(mktemp 'striperr.XXXXXX')"
# Do not strip lib/debug. This is a directory used by setup-hooks/separate-debug-info.sh.
find $paths -type f -a '!' -path "$prefix/lib/debug/*" -print0 |+ # Make sure we process files under symlinks only once. Otherwise
+ # 'strip` can corrupt files when writes to them in parallel:
+ # https://github.com/NixOS/nixpkgs/issues/246147#issuecomment-1657072039
+ xargs -r -0 -n1 -- realpath -z | sort -u -z |
+
xargs -r -0 -n1 -P "$NIX_BUILD_CORES" -- $cmd $stripFlags 2>"$striperr" || exit_code=$?
# xargs exits with status code 123 if some but not all of the # processes fail. We don't care if some of the files couldn't
Now we should generally try to strip slightly less from now on by skipping identical files.
Parting words
My initial guess of wrong library lookup paths was completely off. It was another case of non-deterministic build causing major toolchain breakage. I’m glad it was discovered so quickly after introduction. PR #246164 should fix it for good.
Turns out clang
and gcc
generate a bit different code around atomics.
Have fun!