grub-0.97 and gcc-4.9

December 16, 2015

Once upon a time (actually the day before yesterday) i dropped an eye at freenode/#gentoo-dev IRC channel and noticed an unusual question:

21:18 <@_AxS_> gcc internals issue..  does anyone happen to know if gcc-4.9 -O2 treats trampolines
               differently than gcc-4.8 and previous? (as well as gcc-4.9 -O0) ?

I did not know much about trampolines besides they are gcc extension allowing you to write nested functions that can refer to variables outside the scope: a doc.

the problem

_AxS described an issue: grub-0.97 happens to SIGSEGV when you run a ‘grub’ helper tool and try to exit from it’s shell. He also had a workaround to make grub work: build it with -O0 (optimisations disabled).

_AxS’s mystretous valgrind backtrace on a grub userspace binary looked like that:

==13972== Process terminating with default action of signal 11 (SIGSEGV)
==13972==  Bad permissions for mapped region at address 0x1C
==13972==    at 0x1C: ???
==13972==    by 0x40D2782: (below main) (in /lib32/libc-2.21.so)

It looks like a very simple error jumping by NULL address + tiny offset. “Must be very easy to fix” was my thought and i’ve started digging :)

reproducing

The first thing is to try to reproduce crash locally with debugging symbols enabled:

grub-0.97/grub $ echo quit | ./grub
Probing devices to guess BIOS drives. This may take a long time.
Segmentation fault

Yay! I’m in the game! Next step is to try to get a symbolized version of valgrind output.

valgrind

I was not so lucky as my libc uses too fresh instructions valgrind does not know about yet:

Probing devices to guess BIOS drives. This may take a long time.
vex x86->IR: unhandled instruction bytes: 0xC5 0xF8 0x77 0xE8
==19149== valgrind: Unrecognised instruction at address 0x412268f.
==19149==    at 0x412268F: vfprintf (vfprintf.c:1630)
==19149==    by 0x41D5DAF: __vsprintf_chk (vsprintf_chk.c:85)
==19149==    by 0x41D5CFF: __sprintf_chk (sprintf_chk.c:31)
==19149==    by 0x8060068: sprintf (stdio2.h:33)
==19149==    by 0x8060068: get_floppy_disk_name (device.c:431)
==19149==    by 0x8060068: init_device_map (device.c:825)
==19149==    by 0x8049C2A: grub_stage2 (asmstub.c:237)
==19149==    by 0x8049860: main (main.c:268)

C5 F8 77 is an AVX2 vzeroupper valgrind needs an update.

[vzeroupper] Set the upper half of all YMM registers to zero.
             Used when switching between 128-bit use and 256-bit use.

That means i have to use other tools to see what happens at crash time.

ltrace

To get an insight into what happened i’ve looked at function call trace. ltrace intercepts dynamic symbols (like libc symbols) program tries to call and outputs function names and parameters.

You can add custom description of data structures into /etc/ltrace.conf (or ~/.ltrace) to make things look even better. I have the following snippet to track GMP arithmetic i used to debug unrelated bug in ghc:

; direct translation from gmp.h
typedef mp_limb_t   = hex(ulong);
typedef mp_bitcnt_t = ulong;
typedef mp_size_t   = long;
;
typedef __mpz_struct = struct(int, int, array(mp_limb_t, elt1)*);
;
typedef mpz_ptr    = __mpz_struct*;
typedef mpz_srcptr = __mpz_struct*;
;
void __gmpz_mul_2exp(+mpz_ptr, mpz_srcptr, mp_bitcnt_t);
void __gmpz_ior(+mpz_ptr, mpz_srcptr, mpz_srcptr);
void __gmpz_add_ui(+mpz_ptr, mpz_srcptr, ulong);
void* __gmpz_realloc(mpz_srcptr, mp_size_t);
;
typedef mp_ptr    = void*;
;
mp_limb_t __gmpn_lshift(mp_ptr, array(mp_limb_t, arg3)*, mp_size_t, uint);
void __gmpn_copyi(mp_ptr, array(mp_limb_t, arg3)*, mp_size_t);

But back to grub:

$ echo quit | ltrace -olog ./grub
...
__ctype_b_loc()                                                               = 0xf75656cc
waddch(0x988f3b8, '\r')                                                       = 0
wmove(0x988f3b8, 8, 0)                                                        = 0
endwin(0xf6bdb200, 128, 0xf6bcaf67, 0x805b3f7)                                = 0
__longjmp_chk(0x8071f60, 1, 0x8071b3c, 0x8049f26 <no return ...>
--- SIGSEGV (Segmentation fault) ---
+++ killed by SIGSEGV +++

longjmp() needs a setjmp() call to store stack context to return to. A few of setjmp() calls were in the log trace somewhere in the beginning of a trace:

scrollok(0x988f3b8, 1, 0, 0x8049d82)                                          = 0
keypad(0x988f3b8, 1, 0, 0x8049d82)                                            = 0
wtimeout(0x988f3b8, 100, 0, 0x8049d82)                                        = 0x988f3b8
signal(SIGWINCH, 0x1)                                                         = 0xf7737920
sync()                                                                        = 0
_setjmp(0x8071f60, 0, 0, 0)                                                   = 0
_setjmp(0x807d340, 0, 0, 0)                                                   = 0
_setjmp(0x807d440, 0xf6bcafb0, 0, 0x805d97d)                                  = 0

setjmp() accepts a single parameter): a pointer to buffer where to store current context.

locating crash site

Resolving 0x8071f60 to a symbol to get an idea where it gets called from (gdb breakpoint would also do the trick):

$ gdb ./grub
...
(gdb) disassemble 0x8071f60
Dump of assembler code for function env_for_exit:
   0x08071f60 <+0>:     add    %al,(%eax)
   0x08071f62 <+2>:     add    %al,(%eax)
   0x08071f64 <+4>:     add    %al,(%eax)

Looking up code that sets and restores env_for_exit:

/*
 * $ git grep -C40 env_for_exit
 * grub/asmstub.c:
 */
...
/* The jump buffer for exiting correctly.  */
static jmp_buf env_for_exit;

/* The main entry point into this mess. */
int
grub_stage2 (void)
{
  /* These need to be static, because they survive our stack transitions. */
  static int status = 0;
  static void *realstack;
  void *simstack_alloc_base, *simstack;
  size_t simstack_size, page_size;
  int i;

  auto void doit (void);

  /* We need a nested function so that we get a clean stack frame,
     regardless of how the code is optimized. */
  void doit (void)
    {
      /* Make sure our stack lives in the simulated memory area. */
      asm volatile ("movl %%esp, %0\n\tmovl %1, %%esp\n"
             : "=&r" (realstack) : "r" (simstack));

      /* Do a setjmp here for the stop command.  */
      if (! setjmp (env_for_exit))
        {
          /* Actually enter the generic stage2 code.  */
          status = 0;
          init_bios_info (); /* somewhere here quit()->stop() is called; */
        }
      else
        {
          /* If ERRNUM is non-zero, then set STATUS to non-zero.  */
          if (errnum)
            status = 1;
        }

      /* Replace our stack before we use any local variables. */
      asm volatile ("movl %0, %%esp\n" : : "r" (realstack));
    }
  assert (grub_scratch_mem == 0);
...
  doit ();
...
}
...
void
stop (void)
{
#ifdef HAVE_LIBCURSES
  if (use_curses)
    endwin ();
#endif

  /* Jump to doit.  */
  longjmp (env_for_exit, 1);
}
...
/* stage2/builtins.c: */
/* quit */
static int
quit_func (char *arg, int flags)
{
  stop ();
  /* Never reach here.  */
  return 0;
}
static struct builtin builtin_quit =
{
  "quit",
  quit_func,
  BUILTIN_CMDLINE | BUILTIN_HELP_LIST,
  "quit",
  "Exit from the GRUB shell."
};

A low of stuff happens here.

gcc passes

As we know from _AxS -O0 makes grub magically work.

Checking again by rebuilding as -O1 and -O0 this single file grub/asmstub.c. (The rest is built with -O1 to keep amount of changes at minimum).

gcc’s -O<N> levels are sets of indivitual (hopefully mostly orthogonal) optimisations. We can break -O1 down and find which of passes broke things.

The following command dumps all individual passes:

$ gcc -Q --help=optimizers
$ gcc -Q --help=optimizers | wc -l
221

Now it’s time to find out the exact optimization pass(es) that renders code unusable.

$ gcc -m32 -Q -O0 --help=optimizers > O0
$ gcc -m32 -Q -O1 --help=optimizers > O1
$ diff -U0 O0 O1 | grep enabled
+  -fbranch-count-reg                   [enabled]
+  -fcombine-stack-adjustments          [enabled]
+  -fcompare-elim                       [enabled]
+  -fcprop-registers                    [enabled]
+  -fdefer-pop                          [enabled]
+  -fforward-propagate                  [enabled]
+  -fguess-branch-probability           [enabled]
+  -fif-conversion                      [enabled]
+  -fif-conversion2                     [enabled]
+  -finline-functions-called-once       [enabled]
+  -fipa-profile                        [enabled]
+  -fipa-pure-const                     [enabled]
+  -fipa-reference                      [enabled]
+  -fmove-loop-invariants               [enabled]
+  -fshrink-wrap                        [enabled]
+  -fsplit-wide-types                   [enabled]
+  -fssa-phiopt                         [enabled]
+  -ftree-bit-ccp                       [enabled]
+  -ftree-ccp                           [enabled]
+  -ftree-ch                            [enabled]
+  -ftree-copy-prop                     [enabled]
+  -ftree-copyrename                    [enabled]
+  -ftree-dce                           [enabled]
+  -ftree-dominator-opts                [enabled]
+  -ftree-dse                           [enabled]
+  -ftree-fre                           [enabled]
+  -ftree-pta                           [enabled]
+  -ftree-sink                          [enabled]
+  -ftree-slsr                          [enabled]
+  -ftree-sra                           [enabled]
+  -ftree-ter                           [enabled]

Picking a bunch of flags at a time i’ve distilled it down to a single flag:

assembly changes

Now let’s check code genertion difference:

$ x86_64-pc-linux-gnu-gcc -m32 -DHAVE_CONFIG_H <...> -O1                                -S -o asmstub-O1.S asmstub.c
$ x86_64-pc-linux-gnu-gcc -m32 -DHAVE_CONFIG_H <...> -O1 -fno-combine-stack-adjustments -S -o asmstub-O1-fno.S asmstub.c
$ diff -U10 asmstub-O1.S asmstub-O1-fno.S
--- asmstub-O1.S        2015-12-16 22:28:30.656791615 +0000
+++ asmstub-O1-fno.S    2015-12-16 22:29:49.629693066 +0000
@@ -56,31 +56,33 @@
        movl    %edx, %eax
 .L8:
        rep ret
        .cfi_endproc
 .LFE113:
        .size   console_translate_key, .-console_translate_key
        .type   doit.7394, @function
 doit.7394:
 .LFB86:
        .cfi_startproc
-       subl    $24, %esp
-       .cfi_def_cfa_offset 28
+       subl    $12, %esp
+       .cfi_def_cfa_offset 16
        movl    (%ecx), %edx
 #APP
 # 176 "asmstub.c" 1
        movl %esp, %eax
        movl %edx, %esp

 # 0 "" 2
 #NO_APP
        movl    %eax, realstack.7387
+       subl    $12, %esp
+       .cfi_def_cfa_offset 28
        pushl   $env_for_exit
        .cfi_def_cfa_offset 32
        call    _setjmp
        addl    $16, %esp
        .cfi_def_cfa_offset 16
        testl   %eax, %eax
        jne     .L21
        movl    $0, status.7386
        call    init_bios_info
        jmp     .L20

See what happens here? You might need to revisit doit() code snippet above in this post. Optimisation combines stack allocation at the very doit() start (before asm statement) and stack allocation right before setjmp() call.

That %esp adjustment optimization happens as if there would be no #APP / #NOAPP guarded code. (#APP / #NOAPP wraps around code emitted by asm statements) But that asm statement overrides %esp value!

(Exercise for the reader: what exactly breaks by this transformation?)

gcc inline assembly

Going back to original source snippet:

void doit (void)
  {
    /* Make sure our stack lives in the simulated memory area. */
    asm volatile ("movl %%esp, %0\n\tmovl %1, %%esp\n"
           : "=&r" (realstack) : "r" (simstack));

Some facts about that asm statement:

gcc does very limited parsing of asm string statement as gcc usually does not know of all instructions binutils supports (gcc basically searches for “%<N>” templates and substitutes for allocated operands).

Thus programmer needs to specify explicit effect of written code. Effect gets derived only from input and output arguments but you can also specify unobvious effects like condition flags or arbitrary memory updates: more details.

the fix

Thus our fix is to update the assembly a bit:

diff --git a/grub/asmstub.c b/grub/asmstub.c
index 6354806..44b056f 100644
--- a/grub/asmstub.c
+++ b/grub/asmstub.c
@@ -175,5 +175,5 @@ grub_stage2 (void)
       /* Make sure our stack lives in the simulated memory area. */
       asm volatile ("movl %%esp, %0\n\tmovl %1, %%esp\n"
-                   : "=&r" (realstack) : "r" (simstack));
+                   : "=&r" (realstack) : "r" (simstack) : "%esp");

       /* Do a setjmp here for the stop command.  */
@@ -192,5 +192,5 @@ grub_stage2 (void)

       /* Replace our stack before we use any local variables. */
-      asm volatile ("movl %0, %%esp\n" : : "r" (realstack));
+      asm volatile ("movl %0, %%esp\n" : : "r" (realstack) : "%esp");
     }

fun facts

All these nested functions, setjmp(), longjmp() were only distractions hiding very old and simple code bug. Ancient grub-0.92 from 2002 (available here) also had that flaw.

The patch is already in gentoo.

The smarter compilers the funnier life :)