Delay Slots strike back
Jeroen found another problem on hppa (PA-RISC) after switch to gcc-10: Python-3.9.0b3 started crashing at build time:
$ ./configure ... --host=hppa2.0-unknown-linux-gnu ...
$ make
...
if test $? -ne 0 ; then \
echo "generate-posix-vars failed" ; \
rm -f ./pybuilddir.txt ; \
exit 1 ; \
fi
Python path configuration:
PYTHONHOME = (not set)
PYTHONPATH = (not set)
program name = './python'
isolated = 0
environment = 0
user site = 1
import site = 0
sys._base_executable = '/var/tmp/portage/dev-lang/python-3.9.0_beta3/work/Python-3.9.0b3/python'
sys.base_prefix = '/usr'
sys.base_exec_prefix = '/usr'
sys.platlibdir = 'lib'
sys.executable = '/var/tmp/portage/dev-lang/python-3.9.0_beta3/work/Python-3.9.0b3/python'
sys.prefix = '/usr'
sys.exec_prefix = '/usr'
sys.path = [
'/usr/lib/python39.zip',
'/var/tmp/portage/dev-lang/python-3.9.0_beta3/work/Python-3.9.0b3/Lib',
'/var/tmp/portage/dev-lang/python-3.9.0_beta3/work/Python-3.9.0b3/none\n',
]
Fatal Python error: init_fs_encoding: failed to get the Python codec of the filesystem encoding
Python runtime state: core initialized
Traceback (most recent call last):
File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
File "<frozen importlib._bootstrap>", line 161, in __exit__
File "<frozen importlib._bootstrap>", line 116, in release
RuntimeError: cannot release un-acquired lock
generate-posix-vars failed
make: *** [Makefile:611: pybuilddir.txt] Error 1
We have two strange errors here:
- Fatal Python error: init_fs_encoding: failed to get the Python codec of the filesystem encoding
- RuntimeError: cannot release un-acquired lock
Both errors seem unrelated. They signal internal inconsistency in the interpreter.
Debugging
Is it a gcc bug? I usually assume it’s not and try to find inconsisyency in the program. If it’s luck I’ll find the bug in the program. Otherwise I’ll get a small reproducer to fix the compiler.
First, I double-checked it’s a gcc-10 -O2:
# on hppa host
# CC=gcc-10.1.0 CFLAGS=-O2 emerge -v1 dev-lang/python:3.9
<fails>
# CC=gcc-10.1.0 CFLAGS=-O1 emerge -v1 dev-lang/python:3.9
<works>
# CC=gcc-9.3.0 CFLAGS=-O2 emerge -v1 dev-lang/python:3.9
<works>
The bug is reproducible. Assuming it’s an obscure python bug I tried AddressSanitizer and valigrind on amd64. Neither reported any relevant errors. valgrind found a seemingly legitimate use-after-free: https://bugs.gentoo.org/729570#c8.
–with-valgrind makes that valgrind report go away. It also makes original bug away. Which is not very useful. I expected some diagnostic that would report heap corruption.
Falling back to original setup. We need to find a function where input is the same but the output is different.
After much printf poking I found simple way to observe the problem:
$ git init .; git add .; git commit -m "initial state"
$ ./configure ... --host=hppa2.0-unknown-linux-gnu ...
$ make regen-importlib
$ git diff
--- a/Python/importlib_external.h
+++ b/Python/importlib_external.h
@@ -172,7 +172,7 @@ const unsigned char _Py_M__importlib_bootstrap_external[] = {
83,0,113,44,100,3,124,0,102,2,83,0,41,4,122,32,
82,101,112,108,97,99,101,109,101,110,116,32,102,111,114,32,
111,115,46,112,97,116,104,46,115,112,108,105,116,40,41,46,
- 233,1,0,0,0,41,1,90,8,109,97,120,115,112,108,105,
+ 114,15,0,0,0,41,1,90,8,109,97,120,115,112,108,105,
116,218,0,41,6,114,23,0,0,0,114,31,0,0,0,218,
10,114,112,97,114,116,105,116,105,111,110,114,35,0,0,0,
218,8,114,101,118,101,114,115,101,100,218,6,114,115,112,108,
...
This output means that freshly built interpreter generates bytecode slightly different from already pre-generated. The diff is not expected as building python with -O1 produces no diff on the same test.
This is a simple enough test to move debugging from debugging on hppa host to amd64 host. With binfmt_misc wrappers I was able to use exatly the same build commands on amd64 to get the same bytecode diff.
Shrinking the delta down
To narrow down on the trigger I wanted to find smallest piece of code to build with -O1 to see the bug disappear (while rest of code is built with -O2).
Python build system outputs exact commands used to generate everything. I dumped all commands and tweaked -O2 to -O1 in a binary search fashion:
$ make clean
$ make | tee build.sh
hppa2.0-unknown-linux-gnu-gcc -c -Wno-unused-result -Wsign-compare -DNDEBUG -O2 -fdelayed-branch -frecord-gcc-switches -fwrapv -std=c99 -Wextra -Wno-unused-result -Wno-unused-parameter -Wno-missing-field-initializers -Werror=implicit-function-declaration -fvisibility=hidden -I./Include/internal -I. -I./Include -I/usr/include/ncursesw -fPIC -DPy_BUILD_CORE -o Programs/python.o ./Programs/python.c
hppa2.0-unknown-linux-gnu-gcc -c -Wno-unused-result -Wsign-compare -DNDEBUG -O2 -fdelayed-branch -frecord-gcc-switches -fwrapv -std=c99 -Wextra -Wno-unused-result -Wno-unused-parameter -Wno-missing-field-initializers -Werror=implicit-function-declaration -fvisibility=hidden -I./Include/internal -I. -I./Include -I/usr/include/ncursesw -fPIC -DPy_BUILD_CORE -o Parser/acceler.o Parser/acceler.c
...
./Programs/_freeze_importlib zipimport \
./Lib/zipimport.py \
./Python/importlib_zipimport.h.new
python3.9 ./Tools/scripts/update_file.py ./Python/importlib_zipimport.h ./Python/importlib_zipimport.h.new
Now I can just edit build.sh slightly and rerun it. To avoid recompilation impact I used ccache shadows for hppa2.0-unknown-linux-gnu-gcc:
$ which hppa2.0-unknown-linux-gnu-gcc
/usr/lib/ccache/bin/hppa2.0-unknown-linux-gnu-gcc
That way our “full rebuild” is as cheap as incremental rebuild:
$ time bash build.sh
real 0m2,258s
user 0m1,673s
sys 0m0,595s
2.5 seconds on my 10 years old machine. That gives us very interactive debugging environment.
After a bit of poking I found that rebuilding Objects/longobject.c with -O1 is enough to make bug disappear:
--- a/build.sh
+++ b/build.sh
@@ -39 +39 @@ hppa2.0-unknown-linux-gnu-gcc -c -Wno-unused-result -Wsign-compare -DNDEBUG -O2
-hppa2.0-unknown-linux-gnu-gcc ... -O2 ... Objects/longobject.c
+hppa2.0-unknown-linux-gnu-gcc ... -O1 ... Objects/longobject.c
Now we can use advanced pragmas and attributes to re-enable -O2 only for subset of Objects/longobject.c. The tools are:
- #pragma GCC push_options / #pragma GCC optimize(2) / #pragma GCC pop_options: change optimization level only for a subset of functions in a file to narrow down the code which triggers problematic behaviour.
- __attribute__((noipa)): make a function opaque to inliner as if it was in a separate compilation unit. Useful when shrinking test example down to a single file.
Usage example looks like that:
--- a/Objects/longobject.c
+++ b/Objects/longobject.c
@@ -1,3 +1,5 @@
+#pragma GCC push_options
+#pragma GCC optimize(2)
/* Long (arbitrary precision) integer object implementation */
/* XXX The functional organization of this file is terrible */@@ -3009,6 +3011,7 @@ PyLong_AsDouble(PyObject *v)
if a == b, return 0
if a > b, return a positive number */
+
static Py_ssize_t
long_compare(PyLongObject *a, PyLongObject *b)
{@@ -3027,6 +3030,8 @@ long_compare(PyLongObject *a, PyLongObject *b)
return sign;
}
+static PyObject *
+long_richcompare(PyObject *self, PyObject *other, int op) __attribute__((noipa));
static PyObject *
long_richcompare(PyObject *self, PyObject *other, int op)
{@@ -5807,3 +5812,4 @@ _PyLong_Fini(PyThreadState *tstate)
}
#endif
}+#pragma GCC pop_options
With the above tricks I arrived at long_richcompare(). I could not see the immediate bug in the code. It looks very clean and simple.
Shrinking down test example
long_richcompare() implements a comparison operator for ‘int’ class (arbitrary precision integer) in python. It’s a simple mathematical operation.
I added a few printf() statements to extract exact exact inputs/outputs where long_richcompare() changes it’s behaviour. It was long_richcompare(0xFFFFffff, 1, EQ).
The full extracted example was:
/*
The test is extracted from Python-3.9.0 miscompilation
on hppa2.0: https://bugs.gentoo.org/729570
Original bug happens as an invalid bytecode generation
due to bad results from 'long_richcompare(0xFFFFffff, 1, EQ)' calls.
Failure example:
$ hppa2.0-unknown-linux-gnu-gcc -lm -Wsign-compare -Wall -O1 bug_test.c -o good-bug
$ hppa2.0-unknown-linux-gnu-gcc -lm -Wsign-compare -Wall -O2 bug_test.c -o bad-bug
$ ./good-bug
long_richcompare(2, 1, EQ) = FALSE (expect FALSE)
$ ./bad-bug
long_richcompare(2, 1, EQ) = TRUE (expect FALSE)
*/
// We use '__attribute__((noipa));' aggressively to simulate
// unavailable function definitions from outside translation units.
static int cmp(int *lhs, int *rhs)
{
int sign = *lhs - *rhs;
// semantically this should be 'return 0;' but this condition is not
// supposed to trigger on our input data.
if (sign == 0) return 1;
return sign;
}
static int yes(void) __attribute__((noipa));
static int yes(void) { return 1; }
static int long_richcompare(int *self, int *other, int op) __attribute__((noipa));
static int long_richcompare(int *self, int *other, int op)
{
int result;
if (!yes() || !yes())
return 0;
if (self == other)
= 0;
result else
= cmp(self, other);
result
// has to force jump table
switch (op) {
// only 0 case is used on actual data
case 0: return (result == 0);
case 1: return 0;
case 3: return 0;
case 5: if (result == 0) return 1; else return 0;
default:
();
__builtin_unreachable}
}
#include <stdio.h>
int main() {
int l = 2;
int r = 1;
int res = long_richcompare(&l, &r, 0);
("long_richcompare(2, 1, EQ) = %s (expect FALSE)\n", res ? "TRUE" : "FALSE");
printf}
Note: it’s not a real comparison anymore. cmp() is an elaborate no-op. This file is very easy to trace through and verify that it has no problems related to undefined behaviour.
I asked GCC developers for help in https://gcc.gnu.org/PR96015 and Eric immediately suggested trying -fno-delayed-branch to see if it makes a difference. It did!
Using -fno-delayed-branch allowed me to simplify the example even further with cvise. It produced the following example:
int b, c;
int a() __attribute__((noipa));
int a(int *d, int *f, int g) {
int e;
if (d == f)
= 0; // never gets here on our input data
e else
= 1; // always gets here on our input data
e switch (g) {
case 0:
return e; // 'return 1'; always gets here on our input data
case 1:
case 3:
case 5:
if (e)
return 10;
default:
();
__builtin_unreachable}
}
int main() { return a(&b, &c, 0); }
$ hppa2.0-unknown-linux-gnu-gcc -O2 bug_test.c -o bad; ./bad; echo $?
0
$ hppa2.0-unknown-linux-gnu-gcc -O2 bug_test.c -o good -fno-delayed-branch; ./good; echo $?
1
Generated code
Let’s peek at generated code in this example:
;; hppa2.0-unknown-linux-gnu-gcc -O2 -S ../bug_test.c -o bug.S
a:
%r0(%r2) ; ret = 0; return ret;
bv 0,%r28 ; delayed slot for 'ret = 0' ldi
This is very short and wrong ‘return 0’ code.
;; hppa2.0-unknown-linux-gnu-gcc -O2 -S ../bug_test.c -o bug.S -fno-delayed-branch
a:
,<> %r26,%r25,%r0 ; compare d == f
comclr,n .L11 ; if (d == f) goto .L11 else no-op;
bnop
.L4:
.L12:
'.L6,%r28
ldil L'.L6(%r28),%r28 ; load address of jump table at .L6
ldo R,s %r24(%r28),%r28 ; fetch .L6[g(arg2)] target address
ldwx,n %r0(%r28) ; goto at .L6[g(arg2)]
bv.L6:
.begin_brtabword .L8 ; 'case 0:' code (our case)
.word .L5 ; 'case 1:' code
.word .L4 ; 'case 2:' code
.word .L5 ; 'case 3:' code
.word .L4 ; 'case 4:' code
.word .L5 ; 'case 5:' code
.
.end_brtab.L5:
10,%r28 ; ret = 10
ldi ,n %r0(%r2) ; return ret;
bv.L11:
0,%r28 ; ret = 0
ldi ,n %r0(%r2) ; return ret;
bv.L8:
1,%r28 ; ret = 1
ldi ,n %r0(%r2) ; return ret; bv
This is somewhat long and correct code.
So what is so special about -fno-delayed-branch? Why does it turn things upside down?
Delay slots
Delay slot (https://en.wikipedia.org/wiki/Delay_slot) is a simple concept: on simple architectures some instructions take not unual one clock but two clock cycles. Instead of stalling the CPU pipeline for extra cycle CPU just executes next instruction (or a few of those) following such heavyweight instruction (wat?). Such place in code is called a a “delay slot”.
Simple example on hppa:
a:
bv %r0(%r2) ; 'return ret' ; takes 2 cycles
ldi 0,%r28 ; 'ret = 0' ; takes 1 cycle, executes in parallel to 'bv'
; total execution time: 2 cycles
Here bv (branch vectored) takes clock 2 cycles and CPU always executes one instruction after it. The actual execution sequence is equivalent to:
a:
0,%r28 ; takes 1 cycle
ldi %r0(%r2) ; takes 2 cycles
bv nop ; takes 1 cycle; executes in parallel to 'bv'
; total execution time: 3 cycles
It sounds like a simple transformation. But it’s full of fancy corner cases:
a:
; invalid
%r0(%r2) ; 'return ret' ; takes 2 cycles
bv %r0(%r2) ; 'return ret' ; takes 2 cycles, overlaps with previous
bv nop ; does it get executed?
Thankfuly hppa forbids putting branch instructions themselves into delay slot.
A few other architectures that use delay slot are: mips, superh, sparc. But not arm, not powerpc and not riscv.
Having slightly tweaked my example I managed to reproduce the same bug on sh4(superh): https://gcc.gnu.org/PR96015#c27.
The bug ended up being in gcc’s reporg pass: a late pass that handles instruction reordering to pick the better sequence. It made incorrect assumptions about instructions with delay slots.
Have fun!