GHC on m68k
This all started from this GHC bug report.
The report pointed out that in the m68k ABI, return values of integer type are passed in register %d0 while return values of pointer type (void *) are passed in register %a0. GHC’s C codegen was not using return types consistently.
I had zero knowledge of m68k at that time. I only had a vague impression that a CPU of this type is typically attached to ~1MB of RAM or so. I expected it to be 32-bit-ish or less :)
Portability bugs are my favourite, so I started The Quest by building a C crosscompiler.
Gentoo allows you to do it with the crossdev script:
$ crossdev -t m68k-unknown-linux-gnu
# Done!
$ m68k-unknown-linux-gnu-gcc --version
m68k-unknown-linux-gnu-gcc (Gentoo 5.3.0 p1.0, pie-0.6.5) 5.3.0
the assembly
The m68k ISA is very simple. A smaller example from the bug report:
long f(long a) { return a; }
void * g(void * p) { return p; }
# m68k-unknown-linux-gnu-gcc -S -O2 -fomit-frame-pointer a.c
f:
    move.l 4(%sp),%d0
    rts
g:
    move.l 4(%sp),%a0
    move.l %a0,%d0
    rts
A real-world code example (just to get a feel for the patterns used by gcc):
# m68k-unknown-linux-gnu-objdump -d /usr/m68k-unknown-linux-gnu/usr/bin/locale
80001754 <.text>:
80001754: 4e56 fff8       linkw %fp,#-8
80001758: 48e7 3f3c       moveml %d2-%d7/%a2-%a5,%sp@-
8000175c: 42b9 8000 96b0  clrl 800096b0 <stdout@@GLIBC_2.0+0x1c>
80001762: 42b9 8000 96ac  clrl 800096ac <stdout@@GLIBC_2.0+0x18>
80001768: 4879 8000 4258  pea 80004258 <_IO_stdin_used@@Base+0x178>
8000176e: 42a7            clrl %sp@-
80001770: 45f9 8000 163c  lea 8000163c <setlocale@plt>,%a2
80001776: 4e92            jsr %a2@
80001778: 508f            addql #8,%sp
8000177a: 4a88            tstl %a0
8000177c: 6700 02a2       beqw 80001a20 <calloc@plt+0x2e0>
80001780: 4879 8000 4258  pea 80004258 <_IO_stdin_used@@Base+0x178>
80001786: 4878 0005       pea 5 <strstr@plt-0x800011eb>
8000178a: 4e92            jsr %a2@
8000178c: 508f            addql #8,%sp
8000178e: 4a88            tstl %a0
80001790: 6700 02cc       beqw 80001a5e <calloc@plt+0x31e>
80001794: 4879 8000 968c  pea 8000968c <_libc_intl_domainname@@GLIBC_2.0>
8000179a: 4eb9 8000 131c  jsr 8000131c <textdomain@plt>
800017a0: 42a7            clrl %sp@-
800017a2: 486e fff8       pea %fp@(-8)
800017a6: 42a7            clrl %sp@-
800017a8: 2f2e 000c       movel %fp@(12),%sp@-
800017ac: 2f2e 0008       movel %fp@(8),%sp@-
first attempt
“Should be easy to weed out all the bugs” was my thought :)
Having spent some time arguing with gcc devs about the handling of incomplete prototypes, both in gcc bugzilla and on the gcc mailing list, it was time to start fixing GHC.
I was familiar with the GHC code responsible for C code generation and gave it a first shot without actually building a GHC compiler for m68k: patch.
Adrian tried to test the patch and reported that GHC still SIGSEGVed as before. It was not very clear whether the patch changed anything. Did it crash later than it used to, or did I break something subtle?
cross-ghc nano-howto
Next step was to try to actually build GHC as a crosscompiler.
It appears to be as simple as building cross-gcc:
ghc-m68k $ ./configure --target=m68k-unknown-linux-gnu --enable-unregisterised \
\
--enable-bfd-debug
That’s it! The first line is enough, but subsequent rebuilds are faster when using system dependencies instead of bundled libs. Thus I’ve installed the dependencies into the SYSROOT:
$ ARCH=m68k emerge-m68k-unknown-linux-gnu -1 \
\
sys-libs/ncurses \
dev-libs/gmp \
virtual/libffi sys-libs/binutils-libs
And I’ve got a cross-ghc! Testing:
$ echo 'main = print 42' > /tmp/hi.hs
$ inplace/bin/ghc-stage1 --make /tmp/hi.hs
$ file /tmp/hi
/tmp/hi: ELF 32-bit MSB executable, Motorola m68k, 68020, version 1 (SYSV),
dynamically linked, interpreter /lib/ld.so.1, for GNU/Linux 2.6.32, not stripped
Simple!
qemu-user nano-howto
qemu allows you to run not only system images (qemu-system) but also standalone executables (qemu-user) for foreign architectures and even other kernels!
The simple usage example for powerpc64 (it’s slightly better supported by qemu) would be:
#include <stdio.h>
int main() {
    printf("Hello world!\n");
}
$ powerpc64-unknown-linux-gnu-gcc a.c -o a
$ /usr/bin/qemu-ppc64 -L /usr/powerpc64-unknown-linux-gnu ./a
Hello world!
With the help of the binfmt_misc kernel module you can run foreign binaries as seamlessly as native ones. This allows running things like testsuites for crosscompilers as if they produced binaries you can run on your machine.
Setup for powerpc64:
$ echo ':ppc64:M::\x7fELF\x02\x02\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x15:\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff:/qemu-ppc64:' > /proc/sys/fs/binfmt_misc/register
$ cat /qemu-ppc64
#!/bin/bash
exec /usr/bin/qemu-ppc64 -L /usr/powerpc64-unknown-linux-gnu/ "$@"
$ file a
a: ELF 64-bit MSB executable, 64-bit PowerPC or cisco 7500, version 1 (SYSV),
dynamically linked, interpreter /lib64/ld64.so.1, for GNU/Linux 2.6.32, not stripped
$ ./a
Hello world!
Gentoo has a detailed wiki page on how to get qemu-user on your system.
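The registration line above is essentially a magic/mask pair: the first bytes of a file are compared against the magic after both are AND-ed with the mask. Here is a minimal C sketch of that comparison, with the byte arrays transcribed from the ppc64 line above. The function name `binfmt_matches` is made up for illustration and is not a kernel API.

```c
#include <stddef.h>

/* Magic and mask bytes transcribed from the ppc64 registration line. */
static const unsigned char ppc64_magic[20] = {
    0x7f, 'E',  'L',  'F',  0x02, 0x02, 0x01, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x02, 0x00, 0x15
};
static const unsigned char ppc64_mask[20] = {
    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
    0xff, 0xfe, 0xff, 0xff
};

/* Hypothetical helper: returns 1 when every header byte equals the
 * corresponding magic byte in the bit positions selected by the mask. */
int binfmt_matches(const unsigned char *header, const unsigned char *magic,
                   const unsigned char *mask, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if ((header[i] ^ magic[i]) & mask[i])
            return 0;
    }
    return 1;
}
```

The 0xfe byte in the mask ignores the lowest bit of the ELF e_type field, so both ET_EXEC (2) and ET_DYN (3) binaries match, while a different e_machine value in the last two bytes does not.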
m68k hoops
I did the same for m68k but found out that qemu-m68k can’t even run the simplest programs:
$ /usr/bin/qemu-m68k -L /usr/m68k-unknown-linux-gnu ./a
qemu: fatal: Illegal instruction: ebc0 @ f67e36a8
D0 = 6ffffef5 A0 = f67fb6bc F0 = 0000000000000000 ( 0)
D1 = 0000010a A1 = f67de000 F1 = 0000000000000000 ( 0)
D2 = 0000000f A2 = f6ffec04 F2 = 0000000000000000 ( 0)
D3 = 00000000 A3 = 00000000 F3 = 0000000000000000 ( 0)
D4 = 00000000 A4 = 00000000 F4 = 0000000000000000 ( 0)
D5 = 00000000 A5 = f67fb774 F5 = 0000000000000000 ( 0)
D6 = 00000000 A6 = f6ffee54 F6 = 0000000000000000 ( 0)
D7 = 00000000 A7 = f6ffec04 F7 = 0000000000000000 ( 0)
PC = f67e36a8 SR = 0000 ----- FPRESULT = 0
That means qemu does not support some instructions generated by gcc.
objdump -d can help us find out if it’s a sane opcode or gcc/ld generated garbage:
$ m68k-unknown-linux-gnu-gcc -static a.c -o a
$ m68k-unknown-linux-gnu-objdump -d ./a | grep ebc0
800180ec: ebc0 105f bfexts %d0,1,31,%d1
8001ebc0: 226e 0014 moveal %fp@(20),%a1
80037ab8: ebc0 105f bfexts %d0,1,31,%d1
8003ebc0: 6612 bnes 8003ebd4 <_dl_close_worker+0x96a>
80058168: 61ff ffff ebc0 bsrl 80056d2a <get_cie_encoding>
ebc0 looks like a normal bfexts instruction (Bit Field Extract Signed). But the Debian folks who reported the bug can somehow run an m68k chroot!
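For reference, here is a rough C model of what `bfexts %d0,1,31,%d1` computes, assuming the usual m68k convention that bit-field offset 0 is the most significant bit. The function name `bfexts32` is invented for this sketch, and it ignores register wrap-around and condition codes.

```c
#include <stdint.h>

/* Sketch of bfexts on a register operand.
 * Assumes 1 <= width <= 32 and offset + width <= 32 (no wrap-around).
 * Offset 0 refers to the most significant bit of the value. */
int32_t bfexts32(uint32_t value, unsigned offset, unsigned width)
{
    uint32_t mask  = (width == 32) ? 0xffffffffu : ((1u << width) - 1u);
    uint32_t field = (value >> (32u - offset - width)) & mask;
    uint32_t sign  = 1u << (width - 1u);
    /* Sign-extend the field; 64-bit arithmetic keeps the subtraction
     * free of overflow. */
    return (int32_t)((int64_t)(field ^ sign) - (int64_t)sign);
}
```

With offset 1 and width 31, as in the disassembly above, this drops the top bit of the source and sign-extends from bit 30.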
hoop #1: qemu fork
Debian has a nice wiki page about it. Basically the trick to emulate m68k binaries is to use Laurent’s qemu git tree. It allowed me to run the C hello-world.
hoop #2: minimum alignment
Time to try haskell hello-world:
$ echo 'main = print 42' > /tmp/hi.hs
$ inplace/bin/ghc-stage1 --make -debug /tmp/hi.hs
$ /tmp/hi
SIGSEGV
We are on the right track. qemu understands all the instructions it sees.
qemu-user is capable of generating nice gdb-readable core files.
$ gdb /tmp/hi qemu_hi*.core
Core was generated by `/tmp/hi +RTS -Ds -Di -Dw -DG -Dg -Db -DS -Dt -Dp -Da -Dl -Dm -Dz -Dc -Dr'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x80463b0a in LOOKS_LIKE_INFO_PTR_NOT_NULL (p=32858) at includes/rts/storage/ClosureMacros.h:248
248 return (info->type != INVALID_OBJECT && info->type < N_CLOSURE_TYPES) ? rtsTrue : rtsFalse;
(gdb) bt
#0 0x80463b0a in LOOKS_LIKE_INFO_PTR_NOT_NULL (p=32858) at includes/rts/storage/ClosureMacros.h:248
#1 0x80463b46 in LOOKS_LIKE_INFO_PTR (p=32858) at includes/rts/storage/ClosureMacros.h:253
#2 0x80463b6c in LOOKS_LIKE_CLOSURE_PTR (p=0x805aac6e <stg_dummy_ret_closure>) at includes/rts/storage/ClosureMacros.h:258
#3 0x80463e4c in initStorage () at rts/sm/Storage.c:121
#4 0x8043ffb4 in hs_init_ghc (argc=0xf6ffebf0, argv=0xf6ffebf4, rts_config=...) at rts/RtsStartup.c:181
#5 0x80455982 in hs_main (argc=1, argv=0xf6ffed54, main_closure=0x804b8f70 <ZCMain_main_closure>, rts_config=...)
at rts/RtsMain.c:51
#6 0x80003c1c in main ()
These +RTS -Ds -Di -Dw -DG -Dg -Db -DS -Dt -Dp -Da -Dl -Dm -Dz -Dc -Dr are all flags for the debug runtime. Sanity checks of sorts.
This sanity check failure basically says that it got a tagged pointer where it should not be tagged. GHC uses the least significant bits of closure pointers to distinguish between evaluated and unevaluated closures. On 32-bit systems the last 2 bits are used for tags, on the assumption that they are always zero in a real closure address.
// includes/Cmm.h
#if SIZEOF_VOID_P == 4
#define W_ bits32
/* Maybe it's better to include MachDeps.h */
#define TAG_BITS 2
#elif SIZEOF_VOID_P == 8
#define W_ bits64
/* Maybe it's better to include MachDeps.h */
#define TAG_BITS 3
#else
#error Unknown word size
#endif
/*
* The RTS must sometimes UNTAG a pointer before dereferencing it.
* See the wiki page Commentary/Rts/HaskellExecution/PointerTagging
*/
#define TAG_MASK ((1 << TAG_BITS) - 1)
#define UNTAG(p) (p & ~TAG_MASK)
#define GETTAG(p) (p & TAG_MASK)
// includes/rts/storage/ClosureMacros.h
...
static inline StgClosure *
UNTAG_CLOSURE(StgClosure * p)
{
    return (StgClosure*)((StgWord)p & ~TAG_MASK);
}
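To make the tagging scheme concrete, here is a small standalone C sketch that mirrors the macros above. The `demo_` names are invented for this example and are not part of the GHC RTS; the only assumption is that real closure addresses have their low TAG_BITS bits clear.

```c
#include <stdint.h>

/* Same convention as Cmm.h: 2 tag bits on 32-bit targets, 3 on 64-bit. */
#define DEMO_TAG_BITS (sizeof(void *) == 8 ? 3 : 2)
#define DEMO_TAG_MASK (((uintptr_t)1 << DEMO_TAG_BITS) - 1)

/* Store a small tag in the low bits of an aligned pointer. */
uintptr_t demo_tag(const void *p, unsigned tag)
{
    return (uintptr_t)p | ((uintptr_t)tag & DEMO_TAG_MASK);
}

/* Clear the tag bits to recover the real address. */
void *demo_untag(uintptr_t tagged)
{
    return (void *)(tagged & ~DEMO_TAG_MASK);
}

/* Extract only the tag bits. */
unsigned demo_get_tag(uintptr_t tagged)
{
    return (unsigned)(tagged & DEMO_TAG_MASK);
}
```

If an object sits at an address that is only 2-byte aligned, `demo_get_tag` on its plain address returns a non-zero value, so an untagged pointer looks tagged. That is the kind of confusion the sanity check in the backtrace trips over.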
Back to our backtrace. stg_dummy_ret_closure is an int array (untagged) with the currently evaluated closure. But its address is clearly not aligned to a 4-byte boundary:
Prelude> 0x805aac6e `mod` 4
2
Most architectures have __alignof__(int) == sizeof(int) for various reasons (performance, lack of support for unaligned access in the memory/cache unit).
#include <stdio.h>
int main() {
    printf ( "sizeof (int) = %u, __alignof__ (int) = %u\n"
           , (unsigned) sizeof (int), (unsigned) __alignof__ (int));
    return 0;
}
Here is the result for a few targets:
target | sizeof (int) | __alignof__ (int) |
---|---|---|
x86_64-pc-linux-gnu | 4 | 4 |
i686-pc-linux-gnu | 4 | 4 |
powerpc-unknown-linux-gnu | 4 | 4 |
powerpc64-unknown-linux-gnu | 4 | 4 |
sparc-unknown-linux-gnu | 4 | 4 |
sparc64-unknown-linux-gnu | 4 | 4 |
ia64-unknown-linux-gnu | 4 | 4 |
<you get the pattern> | … | … |
m68k-unknown-linux-gnu | 4 | 2(!) |
The fix for this is simple: we need to add __attribute__((aligned(sizeof(int)))) to all closure declarations in the .data section. The fix does roughly that.
Python also had a suspiciously similar problem reported here.
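A minimal sketch of the alignment attribute in isolation, using the GCC/Clang attribute syntax. The names `demo_closure` and `has_two_free_tag_bits` are hypothetical; the char array stands in for a closure whose type alone would not guarantee 4-byte alignment.

```c
#include <stdint.h>

/* Without the attribute a char array only has to be 1-byte aligned,
 * much like int only has to be 2-byte aligned on m68k. The attribute
 * forces a 4-byte boundary regardless of the element type. */
static char demo_closure[8] __attribute__((aligned(4)));

/* Returns 1 when the two low address bits are zero, i.e. free for tags. */
int has_two_free_tag_bits(const void *p)
{
    return ((uintptr_t)p & 3u) == 0;
}

/* Accessor so that callers can inspect the address of the demo object. */
const void *demo_closure_address(void)
{
    return demo_closure;
}
```

Calling `has_two_free_tag_bits(demo_closure_address())` should hold on any target, while the address 0x805aac6e from the backtrace fails the same check.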
hoop #3: qemu bug
I did not expect GHC to work at this point, as I had not yet fixed the int / void* prototype mismatch. But the hello-world haskell program was already working!
The next step is to try to run GHC testsuite and see what will break:
$ make fasttest TEST_HC=$(pwd)/inplace/bin/ghc-stage1 THREADS=12
testsuite/mk/*inplace_bin_ghc-stage1.mk requires some manual tweaks in autodetected variables for our crosscompiler. Updated variables:
WORDSIZE=32 # was 64
TARGETPLATFORM=m68k-unknown-linux # was x86_64-unknown-linux
TargetARCH_CPP=m68k # was x86_64
GhcWithInterpreter=NO # was YES
What was unclear is why qemu itself threw assertion errors on many samples:
qemu-m68k/tcg/tcg.c:1774: tcg fatal error
The samples seemingly didn’t do anything fancy. It was time to dive into qemu!
I’ve picked one of the small testcases that failed: mul2
-- testsuite/tests/numeric/should_run/mul2.hs
{-# LANGUAGE MagicHash, UnboxedTuples #-}
import GHC.Prim
import GHC.Word
import Data.Bits
main :: IO ()
main = do f 5 6
          f 0xFD94E3B7FE36FB18 49
          f 0xFD94E3B7FE36FB18 0xFC1D8A3BFB29FC6A

f :: Word -> Word -> IO ()
f wx@(W# x) wy@(W# y)
    = do putStrLn "-----"
         putStrLn ("Doing " ++ show wx ++ " * " ++ show wy)
         case x `timesWord2#` y of
             (# h, l #) ->
                 do let wh = W# h
                        wl = W# l
                        r = shiftL (fromIntegral wh) (bitSize wh)
                            + fromIntegral wl
                    putStrLn ("High: " ++ show wh)
                    putStrLn ("Low: " ++ show wl)
                    putStrLn ("Result: " ++ show (r :: Integer))
Stepping aside for a few words on how qemu works:
qemu is a tiny JIT compiler which has the following passes:
- target native instructions->TCG IR: the JIT decodes the target’s instructions and translates them into an intermediate representation (TCG).
- TCG IR->TCG IR: a few basic optimisations like dead store elimination
- TCG IR->host-native instructions: the intermediate representation is then compiled to the host’s instruction set
- execute host-native instructions: the generated host code is executed; the return status explains why it finished
qemu does these steps not on an instruction-by-instruction basis but on a sequence-by-sequence basis. Simplistically, qemu disassembles up to the first branching instruction.
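The "up to the first branching instruction" rule can be illustrated with a toy C sketch. This is not qemu code: the `toy_insn` type and `toy_block_length` function are invented here purely to show how a sequence is cut at the first branch.

```c
#include <stddef.h>

/* Toy stand-in for a decoded target instruction. */
typedef struct {
    const char *mnemonic;
    int is_branch;   /* 1 for branches, calls and returns */
} toy_insn;

/* Returns how many instructions belong to the sequence that begins at
 * index `start`: everything up to and including the first branch, or up
 * to the end of the code if no branch is found. */
size_t toy_block_length(const toy_insn *code, size_t n, size_t start)
{
    size_t i = start;
    while (i < n) {
        if (code[i++].is_branch)
            break;
    }
    return i - start;
}
```

For a sequence like asll, movel, rts followed by more code, the first block is three instructions long and the next block starts right after the rts.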
Our qemu assertion happens in pass 3, somewhere here.
To find the bad instruction sequence it’s enough to run qemu with the -d in_asm flag. It dumps whatever is read by pass 1.
$ qemu-m68k-git -d in_asm -L /usr/m68k-unknown-linux-gnu/ /tmp/mul2
----------------
IN:
0xf550c812: moveq #32,%d5
0xf550c814: subl %d4,%d5
0xf550c816: movel %d6,%d0
0xf550c818: asll #2,%d0
0xf550c81a: addal %d0,%a0
0xf550c81c: addal %d0,%a1
0xf550c81e: movel %a0@-,%d2
0xf550c820: movel %d2,%d0
0xf550c822: lsrl %d5,%d0
0xf550c824: lsll %d4,%d2
0xf550c826: movel %d2,%d1
0xf550c828: subql #1,%d6
0xf550c82a: beqs 0xf550c856
#
qemu-m68k/tcg/tcg.c:1774: tcg fatal error
The offending instruction sequence was quite large but it was easy to trim it down to 2 instructions:
_start:
    asll #2,%d0
    movel %a0@,%d2
    rts
I’ve just copied the original disassembly dump, pasted it into a text file and built a tiny program from it (removing instructions one by one):
$ m68k-unknown-linux-gnu-gcc -nostdlib -nostartfiles m68k.S -o foo
$ qemu-m68k-git -d in_asm -L /usr/m68k-unknown-linux-gnu/ ./foo
# fails as:
# IN:
# 0x80000054: asll #2,%d0
# 0x80000056: movel %a0@,%d2
# 0x80000058: rts
# qemu-m68k/tcg/tcg.c:1774: tcg fatal error
We don’t need a working program. It’s enough to keep qemu crashing in the TCG pass.
Having explored the failure a bit more, I found out that a single bad instruction is enough to crash qemu:
# m68k.S
_start:
    asll #1,%d0
    br _start
Using qemu’s -d op flag allows us to look at intermediate representation of TCG:
$ m68k-unknown-linux-gnu-gcc -nostdlib -nostartfiles m68k.S -o foo
$ qemu-m68k-git -d in_asm,op -L /usr/m68k-unknown-linux-gnu/ ./foo
# asll #1,%d0 muops:
---- 80000054 ffffffff
movi_i32 tmp0,$0x0 # tmp0 = 0
movi_i32 tmp1,$0x1f # tmp1 = 31
shr_i32 CC_C,D0,tmp1 # CC_C = D0 >> tmp1
movi_i32 tmp1,$0x1 # tmp1 = 1
shl_i32 CC_N,D0,tmp1 # CC_N = D0 << tmp1
mov_i32 CC_V,tmp0 # CC_V = tmp0
movi_i32 tmp1,$0x1f # tmp1 = 31
movi_i32 tmp3,$0x1e # tmp3 = 30
shr_i32 CC_V,D0,tmp3 # CC_V = D0 >> tmp3
sar_i32 tmp2,D0,tmp2 # tmp2 = D0 >> tmp2, but tmp2 is used uninitialized!
...
Our tiny instruction got translated to a pile of intermediate muops. An interesting fact: tmp2 (a temporary register) is used in sar_i32 without initialization.
It’s clearly a bug in qemu’s decoding pass 1 (it could have been optimizing pass 2, but I disabled it): the decoder generates a read from a freshly introduced temp register.
// somewhere in target-m68k/translate.c
static inline void shift_im(DisasContext *s, uint16_t insn, int opsize) {
....
= tcg_temp_new();
TCGv t0 (QREG_CC_V, reg, bits - 1 - count);
tcg_gen_shri_i32(t0, reg, t0); // that's our broken shift!
tcg_gen_sar_i32(t0, t0);
tcg_gen_not_i32...
I’ve fixed that with a simple patch and reran the testsuite.
hoop #4: crosscompiling doubles is hard
Or not that hard :) A lot of floating-point related tests worked without crashes but resulted in garbage outputs.
The broken test was to print a Float (32bit ‘float’ type in C-speak):
v :: Float
v = 43
main = print v -- prints "0.0"
This code printed “0.0” when built with ghc-stage1. At this point I looked up whether m68k was a big-endian platform by chance. And it was!
GHC does weird things to encode double values in generated code. Basically it uses the platform’s pointer-sized integrals to encode everything. GHC calls the type StgWord:
// somewhere in includes/stg/Types.h
#if SIZEOF_INT == 4
typedef unsigned int StgWord32;
#elif SIZEOF_LONG == 4
typedef unsigned long StgWord32;
#else
#error GHC untested on this architecture: sizeof(int) != 4
#endif
...
#if SIZEOF_VOID_P == 8
typedef StgWord64 StgWord;
#else
#if SIZEOF_VOID_P == 4
typedef StgWord32 StgWord;
#else
#error GHC untested on this architecture: sizeof(void *) != 4 or 8
#endif
#endif
Thus at the end of the day GHC encodes ‘double’ and ‘float’ values as StgWord closure[] = { (StgWord)&useful_callback, 0x00000000, 0x40458000, … }.
It’s an interesting question what 43.0 should look like:
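#include <stdio.h>
int main() {
    double d = 43.0;
    float f = 43.0;
    /* explicit casts: we deliberately reinterpret the raw bytes */
    int * i2 = (int *)&d; printf ("i[2]: %08X %08X\n", i2[0], i2[1]);
    long long * l = (long long *)&d; printf (" l: %016llX\n", l[0]);
    int * i = (int *)&f; printf (" i: %08X\n", i[0]);
}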
# 64bit-LE
$ x86_64-pc-linux-gnu-gcc a.c -o a && ./a
i[2]: 00000000 40458000
l: 4045800000000000
i: 422C0000
# 32bit-LE
$ i686-pc-linux-gnu-clang a.c -o a && ./a
i[2]: 00000000 40458000
l: 4045800000000000
i: 422C0000
# 32bit-BE
$ m68k-unknown-linux-gnu-gcc a.c -o a && ./a
i[2]: 40458000 00000000
l: 4045800000000000
i: 422C0000
# 64bit-BE
$ powerpc64-unknown-linux-gnu-gcc a.c -o a && ./a
i[2]: 40458000 00000000
l: 4045800000000000
i: 422C0000
# 64bit-BE
$ sparc64-unknown-linux-gnu-gcc a.c -o a && ./a
i[2]: 40458000 00000000
l: 4045800000000000
i: 422C0000
The only thing that changes here is the order of elements of the i[2] array, just as you would see if you stored a 64-bit integral value.
The problem in the original code was the assumption that the target has the same endianness as the host. It took me a while to fix it cleanly in GHC: the result.
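The idea of emitting words in target order rather than host order can be sketched in C as follows. This is not the GHC fix itself, only an illustration; the helper name `double_to_target_words` is made up, and the code assumes the host uses 64-bit IEEE 754 doubles.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: split a double into two 32-bit words in the order
 * in which a target of the given endianness stores them in memory.
 * Works on the numeric value of the bit pattern, so the result does not
 * depend on the endianness of the host running this code. */
void double_to_target_words(double d, int target_is_big_endian,
                            uint32_t out[2])
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);            /* raw IEEE 754 bit pattern */
    uint32_t hi = (uint32_t)(bits >> 32);
    uint32_t lo = (uint32_t)(bits & 0xffffffffu);
    out[0] = target_is_big_endian ? hi : lo;
    out[1] = target_is_big_endian ? lo : hi;
}
```

For 43.0 this yields { 0x40458000, 0x00000000 } for a big-endian target and { 0x00000000, 0x40458000 } for a little-endian one, matching the i[2] rows in the table of outputs above.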
hoop #5: Cmm call annotations
Even more tests started working but some SIGSEGVs still remained. I looked at one example and noticed that the stack squeezing code consistently crashed the garbage collector right after a threadStackUnderflow() function call.
It was another case of int versus void* return type on m68k. One of the intermediate representations (or just pieces of low-level code written to deal with closures) is **Cmm** (c-minus-minus).
Suppose we need to call an external C function from haskell. Typical Cmm code would look like this:
foo(bits32 a) {
    bits32 r;
    (r) = foreign "C" cfun(a);
    return (r);
}
The generated C code for it looks like this:
// $ inplace/bin/ghc-stage1 -c -keep-tmp-files a.cmm
// $ cat /tmp/ghc22508_0/ghc_3.hc
/* GHC_PACKAGES
*/
#include "Stg.h"
EFF_(cfun);
FN_(foo) {
StgWord32 _c1;
StgWord32 _c2;
W_ _c3;
StgWord32 _c4;
_c5:
_c1 = R1.w;
_c3 = (W_)&cfun;
_c4 = _c1;
_c2 = ((StgWord32 (*)(StgWord32))_c3)(_c4);;
R1.w = _c2;
JMP_(*((P_)(*Sp)));
}
Let’s tweak original example and add a hint to result type:
@@ -1,6 +1,6 @@
 foo(bits32 a) {
     bits32 r;
-    (r) = foreign "C" cfun(a);
+    ("ptr" r) = foreign "C" cfun(a);
     return (r);
 }
@@ -1,18 +1,18 @@
 /* GHC_PACKAGES
 */
 #include "Stg.h"
 ....
 EFF_(cfun);
 FN_(foo) {
 StgWord32 _c1;
 StgWord32 _c2;
 W_ _c3;
 StgWord32 _c4;
 _c5:
 _c1 = R1.w;
 _c3 = (W_)&cfun;
 _c4 = _c1;
-_c2 = ((StgWord32 (*)(StgWord32))_c3)(_c4);;
+_c2 = (StgWord32)((void * (*)(StgWord32))_c3)(_c4);;
 R1.w = _c2;
 JMP_(*((P_)(*Sp)));
 }
On most 32-bit platforms this change does nothing. But on m68k it’s precisely what flips between %d0 and %a0 return registers.
The fix for this issue looks like this. I tried to eyeball most of the calls’ return values but could easily have missed something.
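The pattern from the generated code can be reproduced in plain C. The sketch below is only an illustration with invented `demo_` names: it calls a function through a word-sized address and spells out the callee’s real return type in the cast, which is what lets the compiler pick the right return register on an ABI like m68k’s.

```c
#include <stdint.h>

typedef uintptr_t demo_word;   /* stand-in for GHC's W_ */

static int demo_storage;

/* A C function that really returns a pointer. */
void *demo_cfun(demo_word a)
{
    (void)a;
    return &demo_storage;
}

/* Call through a word-sized address, the way the generated .hc code does.
 * The cast names the callee's true return type (void *), so the caller
 * reads the result from wherever the ABI places pointer results, and only
 * then converts it to a word. */
demo_word call_returning_ptr(demo_word fn_addr, demo_word arg)
{
    return (demo_word)((void *(*)(demo_word))fn_addr)(arg);
}
```

Casting the same address to a function type that returns an integer would compile just as well, but on m68k the caller would then read %d0 instead of %a0.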
At this point I’ve got GHCi working:
$ inplace/bin/ghc-stage2 --interactive
GHCi, version 8.1.20160305: http://www.haskell.org/ghc/ :? for help
Loaded GHCi configuration from /home/slyfox/.ghci
The result
Not all tests are passing yet. But the result is good enough for people to try out GHC and fix the bugs themselves. I’ve triaged some of these failures but did not bother fixing them yet.
Fun facts (as usual):
- building GHC as a crosscompiler is simple!
- m68k is a 32-bit big-endian architecture
- m68k aligns int to 2 bytes (less than natural alignment)
- m68k is a rare system which uses different registers for integral and pointer return values
- qemu (and its TCG) is a fun tiny project to play with :)
- binfmt_misc linux kernel feature can run arbitrary binaries as executables on your system
- doubles are not as scary as I thought :)
- as a result GHC is friendlier for BE platforms both as a host and target for crosscompilation :)
Thanks!