Delay Slots strike back

Jeroen found another problem on hppa (PA-RISC) after the switch to gcc-10: Python-3.9.0b3 started crashing at build time:

$ ./configure ... --host=hppa2.0-unknown-linux-gnu ...
$ make

if test $? -ne 0 ; then \
echo "generate-posix-vars failed" ; \
rm -f ./pybuilddir.txt ; \
exit 1 ; \
Python path configuration:
  PYTHONHOME = (not set)
  PYTHONPATH = (not set)
  program name = './python'
  isolated = 0
  environment = 0
  user site = 1
  import site = 0
  sys._base_executable = '/var/tmp/portage/dev-lang/python-3.9.0_beta3/work/Python-3.9.0b3/python'
  sys.base_prefix = '/usr'
  sys.base_exec_prefix = '/usr'
  sys.platlibdir = 'lib'
  sys.executable = '/var/tmp/portage/dev-lang/python-3.9.0_beta3/work/Python-3.9.0b3/python'
  sys.prefix = '/usr'
  sys.exec_prefix = '/usr'
  sys.path = [
Fatal Python error: init_fs_encoding: failed to get the Python codec of the filesystem encoding
Python runtime state: core initialized
Traceback (most recent call last):
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 161, in __exit__
  File "<frozen importlib._bootstrap>", line 116, in release
RuntimeError: cannot release un-acquired lock
generate-posix-vars failed
make: *** [Makefile:611: pybuilddir.txt] Error 1

We have two strange errors here:

  1. Fatal Python error: init_fs_encoding: failed to get the Python codec of the filesystem encoding
  2. RuntimeError: cannot release un-acquired lock

The two errors seem unrelated to each other. Both signal internal inconsistency in the interpreter.


Is it a gcc bug? I usually assume it’s not and try to find the inconsistency in the program first. If I’m lucky I’ll find a bug in the program. Otherwise I’ll end up with a small reproducer to fix the compiler.

First, I double-checked that the problem is specific to gcc-10 with -O2:

# on hppa host
# CC=gcc-10.1.0 CFLAGS=-O2 emerge -v1 dev-lang/python:3.9 # fails
# CC=gcc-10.1.0 CFLAGS=-O1 emerge -v1 dev-lang/python:3.9 # works
# CC=gcc-9.3.0 CFLAGS=-O2 emerge -v1 dev-lang/python:3.9 # works

The bug is reproducible only with gcc-10 and -O2. Assuming it was an obscure python bug I tried AddressSanitizer and valgrind on amd64. AddressSanitizer reported no relevant errors, but valgrind found a seemingly legitimate use-after-free.

Building with --with-valgrind makes that valgrind report go away. It also makes the original bug go away, which is not very useful. I expected some diagnostic that would point at heap corruption.

Falling back to the original setup: we need to find a function whose input is the same but whose output differs between the -O1 and -O2 builds.

After much printf() poking I found a simple way to observe the problem:

$ git init .; git add .; git commit -m "initial state"
$ ./configure ... --host=hppa2.0-unknown-linux-gnu ...
$ make regen-importlib
$ git diff
--- a/Python/importlib_external.h
+++ b/Python/importlib_external.h
@@ -172,7 +172,7 @@ const unsigned char _Py_M__importlib_bootstrap_external[] = {
-    233,1,0,0,0,41,1,90,8,109,97,120,115,112,108,105,
+    114,15,0,0,0,41,1,90,8,109,97,120,115,112,108,105,

This output means the freshly built interpreter generates bytecode slightly different from the pre-generated bytecode checked into the tree. The diff is not expected: building python with -O1 produces no diff on the same test.

This test is simple enough to move debugging from the hppa host to an amd64 host. With binfmt_misc wrappers I was able to run exactly the same build commands on amd64 and get the same bytecode diff.

Shrinking the delta down

To narrow down on the trigger I wanted to find the smallest piece of code that, when built with -O1 (while the rest of the code is built with -O2), makes the bug disappear.

Python’s build system prints the exact commands used to generate everything. I dumped all the commands and tweaked -O2 to -O1 in a binary search fashion:

$ make clean
$ make | tee
hppa2.0-unknown-linux-gnu-gcc -c -Wno-unused-result -Wsign-compare -DNDEBUG  -O2 -fdelayed-branch -frecord-gcc-switches -fwrapv   -std=c99 -Wextra -Wno-unused-result -Wno-unused-parameter -Wno-missing-field-initializers -Werror=implicit-function-declaration -fvisibility=hidden  -I./Include/internal  -I. -I./Include -I/usr/include/ncursesw  -fPIC -DPy_BUILD_CORE -o Programs/python.o ./Programs/python.c
hppa2.0-unknown-linux-gnu-gcc -c -Wno-unused-result -Wsign-compare -DNDEBUG  -O2 -fdelayed-branch -frecord-gcc-switches -fwrapv   -std=c99 -Wextra -Wno-unused-result -Wno-unused-parameter -Wno-missing-field-initializers -Werror=implicit-function-declaration -fvisibility=hidden  -I./Include/internal  -I. -I./Include -I/usr/include/ncursesw  -fPIC -DPy_BUILD_CORE -o Parser/acceler.o Parser/acceler.c
./Programs/_freeze_importlib zipimport \
./Lib/ \
python3.9 ./Tools/scripts/ ./Python/importlib_zipimport.h ./Python/

Now I can slightly edit the command dump and rerun it. To avoid recompilation cost I used ccache shadow wrappers for hppa2.0-unknown-linux-gnu-gcc:

$ which hppa2.0-unknown-linux-gnu-gcc

That way our “full rebuild” is as cheap as an incremental rebuild:

$ time bash
real 0m2,258s
user 0m1,673s
sys  0m0,595s

2.5 seconds on my 10-year-old machine. That gives us a very interactive debugging environment.

After a bit of poking I found that rebuilding just Objects/longobject.c with -O1 is enough to make the bug disappear:

--- a/
+++ b/
@@ -39 +39 @@ hppa2.0-unknown-linux-gnu-gcc -c -Wno-unused-result -Wsign-compare -DNDEBUG  -O2
-hppa2.0-unknown-linux-gnu-gcc ... -O2 ... Objects/longobject.c
+hppa2.0-unknown-linux-gnu-gcc ... -O1 ... Objects/longobject.c

Now we can use pragmas and attributes to re-enable -O2 for only a subset of Objects/longobject.c. The tools are:

  1. #pragma GCC push_options / #pragma GCC optimize(2) / #pragma GCC pop_options: change the optimization level only for a subset of functions in a file, to narrow down the code that triggers the problematic behaviour.
  2. __attribute__((noipa)): make a function opaque to inlining and other interprocedural analyses, as if it lived in a separate compilation unit. Useful when shrinking a test example down to a single file.

A usage example looks like this:

--- a/Objects/longobject.c
+++ b/Objects/longobject.c
@@ -1,3 +1,5 @@
+#pragma GCC push_options
+#pragma GCC optimize(2)
 /* Long (arbitrary precision) integer object implementation */

 /* XXX The functional organization of this file is terrible */
@@ -3009,6 +3011,7 @@ PyLong_AsDouble(PyObject *v)
    if a == b, return 0
    if a > b, return a positive number */

 static Py_ssize_t
 long_compare(PyLongObject *a, PyLongObject *b)
@@ -3027,6 +3030,8 @@ long_compare(PyLongObject *a, PyLongObject *b)
     return sign;

+static PyObject *
+long_richcompare(PyObject *self, PyObject *other, int op) __attribute__((noipa));
 static PyObject *
 long_richcompare(PyObject *self, PyObject *other, int op)
@@ -5807,3 +5812,4 @@ _PyLong_Fini(PyThreadState *tstate)
+#pragma GCC pop_options

With the above tricks I arrived at long_richcompare(). I could not see an obvious bug in the code: it looks very clean and simple.

Shrinking down test example

long_richcompare() implements the comparison operators for python’s ‘int’ class (arbitrary-precision integer). It’s a simple mathematical operation.

I added a few printf() statements to extract the exact inputs/outputs where long_richcompare() changes its behaviour. It was long_richcompare(0xFFFFffff, 1, EQ).

The full extracted example was:

   The test is extracted from Python-3.9.0 miscompilation
   on hppa2.0:

   Original bug happens as an invalid bytecode generation
   due to bad results from 'long_richcompare(0xFFFFffff, 1, EQ)' calls.

Failure example:
  $ hppa2.0-unknown-linux-gnu-gcc -lm -Wsign-compare -Wall -O1 bug_test.c -o good-bug
  $ hppa2.0-unknown-linux-gnu-gcc -lm -Wsign-compare -Wall -O2 bug_test.c -o bad-bug
  $ ./good-bug
  long_richcompare(2, 1, EQ) = FALSE (expect FALSE)
  $ ./bad-bug
  long_richcompare(2, 1, EQ) = TRUE (expect FALSE)


// We use '__attribute__((noipa));' aggressively to simulate
// unavailable function definitions from outside translation units.

static int cmp(int *lhs, int *rhs) __attribute__((noipa));
static int cmp(int *lhs, int *rhs)
{
    int sign = *lhs - *rhs;

    // semantically this should be 'return 0;' but this condition is not
    // supposed to trigger on our input data.
    if (sign == 0) return 1;

    return sign;
}

static int yes(void) __attribute__((noipa));
static int yes(void) { return 1; }

static int long_richcompare(int *self, int *other, int op) __attribute__((noipa));
static int long_richcompare(int *self, int *other, int op)
{
    int result;

    if (!yes() || !yes())
        return 0;

    if (self == other)
        result = 0;
    else
        result = cmp(self, other);

    // has to force jump table
    switch (op) {
        // only 0 case is used on actual data
        case 0: return (result == 0);

        case 1: return 0;
        case 3: return 0;
        case 5: if (result == 0) return 1; else return 0;
    }
    return -1; // not reached on our input data
}

#include <stdio.h>

int main() {
    int l = 2;
    int r = 1;

    int res = long_richcompare(&l, &r, 0);
    printf("long_richcompare(2, 1, EQ) = %s (expect FALSE)\n", res ? "TRUE" : "FALSE");
    return 0;
}
Note: it’s not a real comparison anymore; cmp() is an elaborate no-op. This file is very easy to trace through and verify that it has no undefined-behaviour problems.

I asked GCC developers for help, and Eric immediately suggested trying -fno-delayed-branch to see if it makes a difference. It did!

Having -fno-delayed-branch as a reliable good/bad discriminator allowed me to shrink the example even further with cvise. It produced the following example:

int b, c;
int a() __attribute__((noipa));
int a(int *d, int *f, int g) {
  int e;
  if (d == f)
    e = 0; // never gets here on our input data
  e = 1;   // always gets here on our input data
  switch (g) {
  case 0:
    return e; // 'return 1'; always gets here on our input data
  case 1:
  case 3:
  case 5:
    if (e)
      return 10;
  }
  return 0;
}
int main() { return a(&b, &c, 0); }

$ hppa2.0-unknown-linux-gnu-gcc -O2 bug_test.c -o bad; ./bad; echo $?
$ hppa2.0-unknown-linux-gnu-gcc -O2 bug_test.c -o good -fno-delayed-branch; ./good; echo $?

Generated code

Let’s peek at the code generated for this example:

;; hppa2.0-unknown-linux-gnu-gcc -O2 -S ../bug_test.c -o bug.S
        bv %r0(%r2) ; return ret;
         ldi 0,%r28 ; delay slot: 'ret = 0'

This is very short code, and it is wrong: it unconditionally returns 0.

;; hppa2.0-unknown-linux-gnu-gcc -O2 -S ../bug_test.c -o bug.S -fno-delayed-branch
        comclr,<> %r26,%r25,%r0 ; compare d == f
        b,n .L11                ; if (d == f) goto .L11 else no-op;
        ldil L'.L6,%r28
        ldo R'.L6(%r28),%r28    ; load address of jump table at .L6
        ldwx,s %r24(%r28),%r28  ; fetch .L6[g(arg2)] target address
        bv,n %r0(%r28)          ; goto at .L6[g(arg2)]
        .word .L8 ; 'case 0:' code (our case)
        .word .L5 ; 'case 1:' code
        .word .L4 ; 'case 2:' code
        .word .L5 ; 'case 3:' code
        .word .L4 ; 'case 4:' code
        .word .L5 ; 'case 5:' code
        ldi 10,%r28   ; ret = 10
        bv,n %r0(%r2) ; return ret;
        ldi 0,%r28    ; ret = 0
        bv,n %r0(%r2) ; return ret;
        ldi 1,%r28    ; ret = 1
        bv,n %r0(%r2) ; return ret;

This is somewhat longer code, and it is correct.

So what is so special about -fno-delayed-branch? Why does it turn things upside down?

Delay slots

A delay slot is a simple concept: on simple architectures some instructions take not the usual one clock cycle but two. Instead of stalling the CPU pipeline for the extra cycle, the CPU just executes the next instruction (or a few of them) following such a heavyweight instruction (wat?). Such a place in the code is called a “delay slot”.

Simple example on hppa:

    bv %r0(%r2) ; 'return ret' ; takes 2 cycles
     ldi 0,%r28 ; 'ret = 0'    ; takes 1 cycle, executes in parallel to 'bv'
; total execution time: 2 cycles

Here bv (branch vectored) takes 2 clock cycles and the CPU always executes one instruction after it. Semantically it is equivalent to the following sequence, which does not use the delay slot:

    ldi 0,%r28  ; takes 1 cycle
    bv %r0(%r2) ; takes 2 cycles
     nop        ; takes 1 cycle; executes in parallel to 'bv'
; total execution time: 3 cycles

It sounds like a simple transformation, but it’s full of fancy corner cases:

    ; invalid
    bv %r0(%r2)  ; 'return ret' ; takes 2 cycles
     bv %r0(%r2) ; 'return ret' ; takes 2 cycles, overlaps with previous
      nop        ; does it get executed?

Thankfully hppa forbids putting branch instructions themselves into a delay slot.

A few other architectures that use delay slots are mips, superh and sparc. But not arm, not powerpc and not riscv.

Having slightly tweaked my example I managed to reproduce the same bug on sh4 (superh).

The bug ended up being in gcc’s reorg pass: a late pass that reorders instructions to fill delay slots with useful work. It made incorrect assumptions about instructions with delay slots.

Have fun!

Posted on February 27, 2021 by trofi. Email, pull requests or comments are welcome!