page fault handling on alpha
Bug
This was a quiet evening on #gentoo-alpha. Matt Turner shared an
unusual kernel crash report seen by Dmitry V. Levin.
Dmitry noticed that one of the AlphaServer ES40 machines could not
handle the strace test suite and generated kernel crashes:
Unable to handle kernel paging request at virtual address ffffffffffff9468
CPU 3
aio(26027): Oops 0
pc = [<fffffc00004eddf8>] ra = [<fffffc00004edd5c>] ps = 0000 Not tainted
pc is at sys_io_submit+0x108/0x200
ra is at sys_io_submit+0x6c/0x200
v0 = fffffc00c58e6300 t0 = fffffffffffffff2 t1 = 000002000025e000
t2 = fffffc01f159fef8 t3 = fffffc0001009640 t4 = fffffc0000e0f6e0
t5 = 0000020001002e9e t6 = 4c41564e49452031 t7 = fffffc01f159c000
s0 = 0000000000000002 s1 = 000002000025e000 s2 = 0000000000000000
s3 = 0000000000000000 s4 = 0000000000000000 s5 = fffffffffffffff2
s6 = fffffc00c58e6300
a0 = fffffc00c58e6300 a1 = 0000000000000000 a2 = 000002000025e000
a3 = 00000200001ac260 a4 = 00000200001ac1e8 a5 = 0000000000000001
t8 = 0000000000000008 t9 = 000000011f8bce30 t10= 00000200001ac440
t11= 0000000000000000 pv = fffffc00006fd320 at = 0000000000000000
gp = 0000000000000000 sp = 00000000265fd174
Disabling lock debugging due to kernel taint
Trace:
[<fffffc0000311404>] entSys+0xa4/0xc0
Oopses should never happen against userland workloads.
Here the crash happened right in the io_submit()
syscall. “Should be a
very simple arch-specific bug. Can’t take much time to fix.” was my
thought. Haha.
Reproducer
Dmitry provided a very nice reproducer of the problem (extracted from
the strace
test suite):
// $ cat aio.c
#include <err.h>
#include <unistd.h>
#include <sys/mman.h>
#include <asm/unistd.h>
int main(void)
{
    unsigned long ctx = 0;
    if (syscall(__NR_io_setup, 1, &ctx))
        err(1, "io_setup");

    const size_t page_size = sysconf(_SC_PAGESIZE);
    const size_t size = page_size * 2;
    void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (MAP_FAILED == ptr)
        err(1, "mmap(%zu)", size);

    if (munmap(ptr, size))
        err(1, "munmap");

    syscall(__NR_io_submit, ctx, 1, ptr + page_size);
    syscall(__NR_io_destroy, ctx);
    return 0;
}
The idea of this test is simple: create a valid context for asynchronous
IO and pass an invalid pointer ptr
to it. The mmap()
/munmap()
trick makes sure that ptr
points at an invalid non-NULL
user
memory location.
To reproduce and explore the bug locally I picked qemu-alpha
system
emulation. To avoid the complexity of searching for a proper IDE
driver for
the root filesystem I built a minimal linux
kernel with only initramfs
support, without filesystem or block device support.
Then I put the statically linked reproducer and busybox
into the initramfs
:
$ LANG=C tree root/
root/
|-- aio (statically linked aio.c)
|-- aio.c (source above)
|-- bin
| |-- busybox (statically linked busybox)
| `-- sh -> busybox
|-- dev (empty dir)
|-- init (script below)
|-- proc (empty dir)
`-- sys (empty dir)
4 directories, 5 files
$ cat root/init
#!/bin/sh
mount -t proc none /proc
mount -t sysfs none /sys
exec bin/sh
To run qemu
system emulation against the above I used the following
one-liner:
#!/bin/sh
# run-qemu.sh
alpha-unknown-linux-gnu-gcc root/aio.c -o root/aio -static
( cd root && find . -print0 | cpio --null -ov --format=newc; ) | gzip -9 > initramfs.cpio.gz
qemu-system-alpha -kernel vmlinux -initrd initramfs.cpio.gz -m 1G "$@"
run-qemu.sh
builds the initramfs
image and runs the kernel against it.
Cross-compiling vmlinux
for alpha
is also straightforward:
#!/bin/sh
# mk.sh
ARCH=alpha \
CROSS_COMPILE=alpha-unknown-linux-gnu- \
    make -C ../linux.git \
        O="$(pwd)" \
        "$@"
I built kernel and started a VM as:
# build kernel
$ ./mk.sh -j$(nproc)
# run kernel
$ ./run-qemu.sh -curses
...
[ 0.650390] input: AT Translated Set 2 keyboard as /devices/platform/i8042/serio0/input/input0
/ #
That was simple. I got the prompt! Then I ran the statically linked /aio
reproducer:
/ # /aio
Unable to handle kernel paging request at virtual address 0000000000000000
aio(26027): Oops -1
...
Woohoo! Crashed \o/ This allowed me to explore the failure in more detail.
I used -curses
(instead of the default -sdl
) to ease copying text
back from the VM.
The fault address pattern was slightly different from the original report. I
hoped it was a manifestation of the same bug. Worst case I would find
another bug to fix and get back to the original one again :)
Into the rabbit hole
The Oops was happening every time I ran /aio
on a 4.20
kernel.
The io_submit(2)
man page claims it’s an old system call from the 2.5
kernel era. Thus it should not be a recent addition.
How about older kernels? Did they also fail?
I was still not sure I had a correct qemu
/kernel setup. I decided to pick
the older 4.14
kernel version known to run without major problems on our
alpha
box. The 4.14
kernel did not crash in qemu
either.
This reassured me that my setup was not completely broken.
I got my first suspect: a kernel regression.
The reproducer was very stable. Kernel bisection got me to the first regressed
commit:
commit 95af8496ac48263badf5b8dde5e06ef35aaace2b
Author: Al Viro <viro@zeniv.linux.org.uk>
Date: Sat May 26 19:43:16 2018 -0400
aio: shift copyin of iocb into io_submit_one()
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
:040000 040000 20dd44ac4706540b1c1d4085e4269bd8590f4e80 05d477161223e5062f2f781b462e0222c733fe3d M fs
The commit clearly touched io_submit()
syscall handling. But there
is a problem: the change was not alpha-specific at all. If the commit had
any problems it should have caused problems on other systems as well.
To get a better understanding of the probable cause I decided to look at the
failure mechanics. Actual values of local variables in io_submit()
right before the crash might get me somewhere. I started adding printk()
statements around the SYSCALL_DEFINE3(io_submit, ...)
implementation.
At some point, after enough printk()
calls were added, the crashes disappeared.
This confirmed it was not just a logical bug but something more subtle.
I also was not able to analyze the generated code difference between the
printk()
/no-printk()
versions.
Then I attempted to isolate the faulty code into a separate function, but
without much success either. Any attempt to factor out a subset of
io_submit()
into a separate function made the bug go away.
It was time for the next hypothesis: mysterious incorrect compiler code
generation, or invalid __asm__
constraints in some kernel macro
exposed by minor code motion.
Single stepping through kernel
How to get an insight into the details without affecting the original code
too much?
Having failed at a minimal code snippet I attempted to catch the exact place
of the page fault by single-stepping through the kernel using gdb
.
For qemu
-loadable kernels the procedure is very straightforward:
- start a gdb
  server on the qemu
  side with the -s
  option
- start a gdb
  client on the host side with target remote localhost:1234
The same procedure in exact commands (I’m hooking into
sys_io_submit()
):
<at tty1>
$ ./run-qemu.sh -s
<at tty2>
$ gdb --quiet vmlinux
(gdb) target remote localhost:1234
Remote debugging using localhost:1234
0xfffffc0000000180 in ?? ()
(gdb) break sys_io_submit
Breakpoint 1 at 0xfffffc000117f890: file ../linux-2.6/fs/aio.c, line 1890.
(gdb) continue
Continuing.
<at qemu>
# /aio
<at tty2 again>
Breakpoint 1, 0xfffffc000117f89c in sys_io_submit ()
(gdb) bt
Breakpoint 1, __se_sys_io_submit (ctx_id=2199023255552, nr=1, iocbpp=2199023271936) at ../linux-2.6/fs/aio.c:1890
1890 SYSCALL_DEFINE3(io_submit, aio_context_t, ctx_id, long, nr,
(gdb) bt
#0 __se_sys_io_submit (ctx_id=2199023255552, nr=1, iocbpp=2199023271936) at ../linux-2.6/fs/aio.c:1890
#1 0xfffffc0001011254 in entSys () at ../linux-2.6/arch/alpha/kernel/entry.S:476
Now we can single-step through every instruction with nexti
and
check where things go wrong.
To poke around efficiently I kept looking at these cheat sheets:
- alpha
  register names and meaning (1 page)
- alpha
  instruction names and meaning (3 pages)
Register names are especially useful as each alpha
register has two
names: numeric and mnemonic. Source code might use one form and gdb
disassembly might use another. For example $16
/a0
for
gas
($r16
/$a0
for gdb
) is the register used to pass the first
integer argument to a function.
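For reference, here is the naming summary I kept nearby (the standard alpha
integer register convention; gdb just adds a $ prefix to the mnemonic names):
/*
 * alpha integer registers: numeric vs. mnemonic names
 *   $0        v0        function return value
 *   $1-$8     t0-t7     temporaries
 *   $9-$14    s0-s5     callee-saved
 *   $15       fp (s6)   frame pointer
 *   $16-$21   a0-a5     integer function arguments
 *   $22-$25   t8-t11    more temporaries
 *   $26       ra        return address
 *   $27       pv (t12)  procedure value (callee address)
 *   $28       at        assembler temporary
 *   $29       gp        global pointer (GOT)
 *   $30       sp        stack pointer
 *   $31       zero      always reads as zero
 */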
After many backs and forths I found suspicious behavior when
handling a single instruction:
(gdb) disassemble
=> 0xfffffc000117f968 <+216>: ldq a1,0(t1)
0xfffffc000117f96c <+220>: bne t0,0xfffffc000117f9c0 <__se_sys_io_submit+304>
(gdb) p $gp
$1 = (void *) 0xfffffc0001c70908 # GOT
(gdb) p $a1
$2 = 0
(gdb) p $t0
$3 = 0
(gdb) nexti
0xfffffc000117f968 <+216>: ldq a1,0(t1)
=> 0xfffffc000117f96c <+220>: bne t0,0xfffffc000117f9c0 <__se_sys_io_submit+304>
(gdb) p $gp
$4 = (void *) 0x0
(gdb) p $a1
$5 = 0
(gdb) p $t0
$6 = -14 # -EFAULT
The above gdb
session executes the single ldq a1,0(t1)
instruction
and observes its effect on the registers gp
, a1
and t0
.
Normally ldq a1, 0(t1)
would read the 64-bit value pointed to by t1
into the a1
register and leave t0
and gp
untouched.
The main effect seen here that causes the later Oops is the sudden gp
change. gp
is supposed to point to the GOT
(global offset table)
of the current “program” (the kernel in this case). Something managed to
corrupt it.
By construction of the /aio
test case the instruction ldq a1,0(t1)
is not
supposed to read any valid data: our test case passes an invalid memory
location there. All the register-changing effects are the result of page
fault handling.
The smoking gun
Grepping around the arch/alpha
directory I noticed the entMM
page fault
handling
entry.
It claims to handle page faults and keeps the gp
value on the stack. Let’s
trace the fate of that on-stack value as the page fault happens:
(gdb) disassemble
=> 0xfffffc000117f968 <+216>: ldq a1,0(t1)
0xfffffc000117f96c <+220>: bne t0,0xfffffc000117f9c0 <__se_sys_io_submit+304>
(gdb) p $gp
$1 = (void *) 0xfffffc0001c70908 # GOT
(gdb) break entMM
Breakpoint 2 at 0xfffffc0001010e10: file ../linux-2.6/arch/alpha/kernel/entry.S, line 200
(gdb) continue
Breakpoint 2, entMM () at ../linux-2.6/arch/alpha/kernel/entry.S:200
(gdb) x/8a $sp
0xfffffc003f51be78: 0x0 0xfffffc000117f968 <__se_sys_io_submit+216>
0xfffffc003f51be88: 0xfffffc0001c70908 <# GOT> 0xfffffc003f4f2040
0xfffffc003f51be98: 0x0 0x20000004000 <# userland address>
0xfffffc003f51bea8: 0xfffffc0001011254 <entSys+164> 0x120001090
(gdb) watch -l *0xfffffc003f51be88
Hardware watchpoint 3: -location *0xfffffc003f51be88
(gdb) continue
Old value = 29821192
New value = 0
0xfffffc00010319d0 in do_page_fault (address=2199023271936, mmcsr=<optimized out>, cause=0, regs=0xfffffc003f51bdc0)
at ../linux-2.6/arch/alpha/mm/fault.c:199
199 newpc = fixup_exception(dpf_reg, fixup, regs->pc);
The above gdb
session does the following:
- break entMM
  : break at the page fault handler
- x/8a $sp
  : print the 8 top stack values at entMM
  call time
- spot the gp
  value at the 0xfffffc003f51be88
  (sp+16
  ) address
- watch -l *0xfffffc003f51be88
  : set a hardware watchpoint at the memory location where gp
  is stored
The watchpoint triggers at a seemingly relevant place:
fixup_exception()
, where the exception handler adjusts registers before resuming the faulted
task.
Looking around I found an off-by-two bug in the page fault handling code.
The fix was simple:
--- a/arch/alpha/mm/fault.c
+++ b/arch/alpha/mm/fault.c
@@ -80,2 +80,2 @@ __load_new_mm_context(struct mm_struct *next_mm)
 	(((unsigned long *)regs)[(r) <= 8 ? (r) : (r) <= 15 ? (r)-16 : \
-				 (r) <= 18 ? (r)+8 : (r)-10])
+				 (r) <= 18 ? (r)+10 : (r)-10])
The patch was proposed upstream as https://lkml.org/lkml/2018/12/31/83.
The effect of the patch is to write 0
into the on-stack location of a1
(the $17
register) instead of the location of gp
.
That’s it!
Page fault handling magic
I always wondered how the kernel reads data from user space when it needs it. How does it do swap-in if the data is not available? How does it check for access permissions? That kind of stuff. The above investigation covers most of the involved components:
- the ldq
  instruction is used to force the read from user space (as one would read from kernel memory)
- entMM
  /do_page_fault()
  handles the user space fault as if the fault had not happened
The few minor missing details are:
- How does the kernel know which instructions are expected to generate user page faults?
- What piece of hardware holds a pointer to the page fault handler on
  alpha
  ?
Let’s expand the code involved in page fault handling. Call site:
SYSCALL_DEFINE3(io_submit, aio_context_t, ctx_id, long, nr,
                struct iocb __user * __user *, iocbpp)
{
        // ...
        struct iocb __user *user_iocb;

        if (unlikely(get_user(user_iocb, iocbpp + i))) {
        // ...
which is translated to the already familiar pair of instructions:
=> 0xfffffc000117f968 <+216>: ldq a1,0(t1)
0xfffffc000117f96c <+220>: bne t0,0xfffffc000117f9c0 <__se_sys_io_submit+304>
Fun fact: get_user()
has two outputs: the normal function return
value (stored in the t0
register) and the user_iocb
value (stored
in the a1
register).
Let’s expand the get_user()
implementation
on alpha
:
// somewhere at arch/alpha/include/asm/uaccess.h:
#define get_user(x, ptr) \
__get_user_check((x), (ptr), sizeof(*(ptr)))
#define __get_user_check(x, ptr, size) \
({ \
long __gu_err = -EFAULT; \
unsigned long __gu_val = 0; \
const __typeof__(*(ptr)) __user *__gu_addr = (ptr); \
if (__access_ok((unsigned long)__gu_addr, size)) { \
__gu_err = 0; \
switch (size) { \
case 1: __get_user_8(__gu_addr); break; \
case 2: __get_user_16(__gu_addr); break; \
case 4: __get_user_32(__gu_addr); break; \
case 8: __get_user_64(__gu_addr); break; \
default: __get_user_unknown(); break; \
} \
} \
(x) = (__force __typeof__(*(ptr))) __gu_val; \
__gu_err; \
})
A lot of simple code above does two things:
- use __access_ok()
  to check that the address is a user space address, to prevent data exfiltration from the kernel (a conceptual sketch of such a check follows the list)
- dispatch across the different supported sizes to do the rest of the work; our case is a simple 64-bit read
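The address check does not have to know what is actually mapped; it only has
to guarantee that the requested range cannot reach kernel space. A conceptual
sketch of such a check, assuming a TASK_SIZE-style user limit (this is not the
actual alpha implementation):
/* Conceptual sketch only: reject any range that could extend past the end
 * of the user address space (user_limit), written to avoid overflow in
 * addr + size. */
static inline int user_range_ok(unsigned long addr, unsigned long size,
                                unsigned long user_limit)
{
        return size <= user_limit && addr <= user_limit - size;
}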
Looking at __get_user_64()
in more detail:
struct __large_struct { unsigned long buf[100]; };
#define __m(x) (*(struct __large_struct __user *)(x))
#define __get_user_64(addr) \
__asm__("1: ldq %0,%2\n" \
"2:\n" \
EXC(1b,2b,%0,%1) \
: "=r"(__gu_val), "=r"(__gu_err) \
: "m"(__m(addr)), "1"(__gu_err))
#define EXC(label,cont,res,err) \
".section __ex_table,\"a\"\n" \
".long "#label"-.\n" \
"lda "#res","#cont"-"#label"("#err")\n" \
".previous\n"
A few observations:
- The actual check for address validity is done by the CPU: the load-8-bytes
  instruction (ldq %0,%2
  ) is executed and the MMU handles the page fault if one happens
- There is no explicit code to recover from the exception. All auxiliary
  information is put into the __ex_table
  section. The ldq %0,%2
  instruction uses only parameters "0"
  (__gu_val
  ) and "2"
  (addr
  ) but does not use the "1"
  (__gu_err
  ) parameter directly.
- __ex_table
  uses a cool lda
  instruction hack to encode auxiliary data:
  - the __gu_err
    error register
  - a pointer to the next instruction after the faulting instruction:
    cont-label
    (or 2b-1b
    )
  - the result register
The page fault handling mechanism knows how to get to the __ex_table
data
where "1"
(__gu_err
) is encoded and is able to reach that data and
use it later in the mysterious fixup_exception()
we saw before.
In case of alpha
(and many other targets) the __ex_table
collection is defined by the arch/alpha/kernel/vmlinux.lds.S
linker
script using the EXCEPTION_TABLE()
macro:
#define EXCEPTION_TABLE(align) \
. = ALIGN(align); \
__ex_table : AT(ADDR(__ex_table) - LOAD_OFFSET) { \
__start___ex_table = .; \
KEEP(*(__ex_table)) \
__stop___ex_table = .; \
}
//...
Here all __ex_table
sections are gathered between the
__start___ex_table
and __stop___ex_table
symbols.
Those are handled by the generic
kernel/extable.c
code:
const struct exception_table_entry *search_exception_tables(unsigned long addr);
search_exception_tables()
resolves a fault address to the relevant struct exception_table_entry
.
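The generic code keeps these entries sorted by the faulting instruction
address and finds the matching one with a binary search. A minimal sketch of
the idea (illustrative, not the exact kernel implementation; it uses the
struct exception_table_entry layout shown just below, where insn stores the
faulting address relative to the entry itself):
static const struct exception_table_entry *
search_extable_sketch(const struct exception_table_entry *first,
                      const struct exception_table_entry *last,
                      unsigned long addr)
{
        while (first <= last) {
                const struct exception_table_entry *mid = first + (last - first) / 2;
                /* insn is "faulting address - address of this field": undo that. */
                unsigned long mid_addr = (unsigned long)&mid->insn + mid->insn;

                if (mid_addr == addr)
                        return mid;     /* found a fixup for this instruction */
                else if (mid_addr < addr)
                        first = mid + 1;
                else
                        last = mid - 1;
        }
        return NULL;                    /* no fixup registered: genuine kernel fault */
}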
Let’s look at the definition of struct exception_table_entry
:
/* Once again part of __get_user_64() responsible for __ex_table:
#define EXC(label,cont,res,err) \
".section __ex_table,\"a\"\n" \
".long "#label"-.\n" \
"lda "#res","#cont"-"#label"("#err")\n" \
".previous\n"
*/
struct exception_table_entry
{
signed int insn; /* .long #label-. */
union exception_fixup {
unsigned unit; /* lda #res,#cont-#label(#err) */
struct {
signed int nextinsn : 16; /* #cont-#label part */
unsigned int errreg : 5; /* #err part */
unsigned int valreg : 5; /* #res part */
} bits;
} fixup;
};
/* Returns the new pc */
#define fixup_exception(map_reg, _fixup, pc) \
({ \
if ((_fixup)->fixup.bits.valreg != 31) \
map_reg((_fixup)->fixup.bits.valreg) = 0; \
if ((_fixup)->fixup.bits.errreg != 31) \
map_reg((_fixup)->fixup.bits.errreg) = -EFAULT; \
(pc) + (_fixup)->fixup.bits.nextinsn; \
})
Note how the lda
in-memory instruction format is used to encode all the
details needed by fixup_exception()
! In our
sys_io_submit()
case it would be lda a1, 4(t0)
(lda r17, 4(r1)
):
(gdb) bt
#0 0xfffffc00010319d0 in do_page_fault (address=2199023271936, mmcsr=<optimized out>, cause=0,
regs=0xfffffc003f51bdc0) at ../linux-2.6/arch/alpha/mm/fault.c:199
#1 0xfffffc0001010eac in entMM () at ../linux-2.6/arch/alpha/kernel/entry.S:222
(gdb) p *fixup
$4 = {insn = -2584576, fixup = {unit = 572588036, bits = {nextinsn = 4, errreg = 1, valreg = 17}}}
Note how the page fault handling also advances pc
(the program counter, or
instruction pointer) nextinsn=4
bytes forward to skip the failed ldq
instruction.
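As a sanity check, the fixup.unit value 572588036 (0x22210004) printed by gdb
above can be decoded with the same bit layout in a small standalone program
(this assumes GCC’s little-endian bitfield layout, which is what alpha uses):
#include <stdio.h>

/* Mirrors the kernel's union exception_fixup bit layout shown above. */
union exception_fixup {
        unsigned int unit;
        struct {
                signed int nextinsn : 16; /* lda displacement: #cont-#label */
                unsigned int errreg : 5;  /* lda Rb field: #err register    */
                unsigned int valreg : 5;  /* lda Ra field: #res register    */
        } bits;
};

int main(void)
{
        union exception_fixup f;
        f.unit = 572588036; /* 0x22210004, from the gdb session above */

        /* Prints: nextinsn=4 errreg=1 valreg=17, i.e. "lda a1, 4(t0)". */
        printf("nextinsn=%d errreg=%u valreg=%u\n",
               f.bits.nextinsn, f.bits.errreg, f.bits.valreg);
        return 0;
}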
arch/alpha/mm/fault.c
does all the heavy-lifting of handling page faults. Here is a small
snippet that handles our case of faults covered by exception handling:
asmlinkage void
do_page_fault(unsigned long address, unsigned long mmcsr,
              long cause, struct pt_regs *regs)
{
        // ...
 no_context:
        /* Are we prepared to handle this fault as an exception? */
        if ((fixup = search_exception_tables(regs->pc)) != 0) {
                unsigned long newpc;
                newpc = fixup_exception(dpf_reg, fixup, regs->pc);
                regs->pc = newpc;
                return;
        }
        // ...
}
/* ...
* Registers $9 through $15 are saved in a block just prior to `regs' and
* are saved and restored around the call to allow exception code to
* modify them.
*/
/* Macro for exception fixup code to access integer registers. */
#define dpf_reg(r) \
(((unsigned long *)regs)[(r) <= 8 ? (r) : (r) <= 15 ? (r)-16 : \
(r) <= 18 ? (r)+8 : (r)-10])
do_page_fault()
also does a few other page-fault related things I
carefully skipped here:
- page fault accounting
- handling of missing support for the "prefetch"
  instruction
- stack growth
- OOM handling
- SIGSEGV
  , SIGBUS
  propagation
Once do_page_fault()
gets control it updates the regs
struct in
memory for the faulted task using the dpf_reg()
macro. That macro looks unusual:
- it sometimes refers to negative offsets:
  (r) <= 15 ? (r)-16
  (outside struct pt_regs
  )
- it defines not one but a few ranges of registers:
  0-8
  , 9-15
  , 16-18
  , 19-...
struct pt_regs
as is:
struct pt_regs {
unsigned long r0; // 0
unsigned long r1;
unsigned long r2;
unsigned long r3;
unsigned long r4;
unsigned long r5; // 5
unsigned long r6;
unsigned long r7;
unsigned long r8;
unsigned long r19;
unsigned long r20; // 10
unsigned long r21;
unsigned long r22;
unsigned long r23;
unsigned long r24;
unsigned long r25; // 15
unsigned long r26;
unsigned long r27;
unsigned long r28;
unsigned long hae;
/* JRP - These are the values provided to a0-a2 by PALcode */
unsigned long trap_a0; // 20
unsigned long trap_a1;
unsigned long trap_a2;
/* These are saved by PAL-code: */
unsigned long ps;
unsigned long pc;
unsigned long gp; // 25
unsigned long r16;
unsigned long r17;
unsigned long r18;
};
Now the meaning of dpf_reg()
should be clearer. As pt_regs
keeps
only a subset of registers, it has to account for gaps and offsets.
Here I noticed the bug: the r16-r18
range is handled incorrectly by
dpf_reg()
: the r16
“address” is regs+10
(26-16
), not
regs+8
.
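To convince myself I evaluated the index formula for a few registers against
the slot numbers in the struct above (slot 24 is pc, 25 is gp, 26-28 are
r16-r18). A standalone sketch:
#include <stdio.h>

/* Index into ((unsigned long *)regs)[] as computed by dpf_reg();
 * 'fixed' selects the patched (r)+10 variant for the r16-r18 range. */
static long dpf_index(long r, int fixed)
{
        if (r <= 8)
                return r;
        if (r <= 15)
                return r - 16;  /* negative: r9-r15 live just below pt_regs */
        if (r <= 18)
                return r + (fixed ? 10 : 8);
        return r - 10;
}

int main(void)
{
        /* Buggy formula: r16 -> 24 (the pc slot), r17 -> 25 (the gp slot). */
        printf("old: r16 -> %ld, r17 -> %ld\n", dpf_index(16, 0), dpf_index(17, 0));
        /* Fixed formula: r16 -> 26, r17 -> 27 (the real r16/r17 slots). */
        printf("new: r16 -> %ld, r17 -> %ld\n", dpf_index(16, 1), dpf_index(17, 1));
        return 0;
}
With the old formula a fixup whose value register is a1 (r17) zeroes the
saved gp slot instead, which is exactly the gp corruption caught by the
watchpoint earlier.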
The implementation also means that dpf_reg()
can’t handle the
gp
(r29
) and sp
(r30
) registers as value registers. That
should not normally be a problem as gcc
never assigns those
registers for temporary computations and keeps them holding the GOT
pointer and stack pointer at all times. But one could write assembly
code that does it :)
If all the above makes no sense to you it’s ok. Check the kernel
documentation for x86
exception
handling
instead, which uses a very similar technique.
To be able to handle all registers we need to bring in r9-r15
. Those
are written right before struct pt_regs
right at
entMM
entry:
CFI_START_OSF_FRAME entMM
        SAVE_ALL
/* save $9 - $15 so the inline exception code can manipulate them. */
        subq    $sp, 56, $sp
        stq     $9, 0($sp)      // push r9
        stq     $10, 8($sp)
        stq     $11, 16($sp)
        stq     $12, 24($sp)
        stq     $13, 32($sp)
        stq     $14, 40($sp)
        stq     $15, 48($sp)    // push r15
        addq    $sp, 56, $19
/* handle the fault */
        lda     $8, 0x3fff
        bic     $sp, $8, $8
        jsr     $26, do_page_fault
/* reload the registers after the exception code played. */
        ldq     $9, 0($sp)      // pop r9
        ldq     $10, 8($sp)
        ldq     $11, 16($sp)
        ldq     $12, 24($sp)
        ldq     $13, 32($sp)
        ldq     $14, 40($sp)
        ldq     $15, 48($sp)    // pop r15
        addq    $sp, 56, $sp
/* finish up the syscall as normal. */
        br      ret_from_sys_call
CFI_END_OSF_FRAME entMM
A few subtle things are going on here:
1. at entry entMM
   already has a frame with the last 6 values: ps, pc, gp, r16-r18
2. then SAVE_ALL
   (not pasted above) stores r0-r8, r19-r28, hae, trap_a0-trap_a2
3. and only then are r9-r15
   stored (note the subq $sp, 56, $sp
   to place them before)
In C
land only 2.
and 3.
constitute struct pt_regs
.
1.
happens to be outside and needs the negative addressing we saw in
dpf_reg()
.
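My reading of the resulting stack layout at do_page_fault() time, sketched as
a comment (lowest addresses first; regs is the struct pt_regs pointer passed
to do_page_fault()):
/*
 * $sp  -> r9 .. r15                  ; 56 bytes pushed by entMM itself
 * regs -> r0 .. r8                   ; struct pt_regs starts here (SAVE_ALL)
 *         r19 .. r28, hae
 *         trap_a0 .. trap_a2
 *         ps, pc, gp, r16 .. r18     ; frame already present at entMM entry
 *
 * Hence dpf_reg(r) for r9-r15 indexes below regs ((r)-16 is negative),
 * while everything else indexes into struct pt_regs itself.
 */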
As I understand it, the original idea was to share the ret_from_sys_call
part across various kernel entry points:
- system calls: entSys
- arithmetic exceptions: entArith
- external interrupts: entInt
- internal faults (bad opcode, FPU failures, breakpoint traps): entIF
- page faults: entMM
- handling of unaligned access: entUna
- MILO
  debug break: entDbg
Of the above only page faults and unaligned faults need read/write
access to every register.
In practice entUna
uses a different layout and simpler code patching.
The last step to get entMM
executed as a fault handler is to
register it in the alpha
PALcode
subsystem (Privileged Architecture
Library code).
It’s done in
trap_init()
along with the other handlers. Simple!
Or not so simple. What is that PALcode
thing (wiki’s
link)? It looks like a tiny
hypervisor that provides service points for the CPU which you can access with
the call_pal <number>
instruction.
It puzzled me a lot what call_pal
was supposed to do. Should it
transfer control somewhere else or is it a normal call?
Actually, given that it’s a generic mechanism to do “privileged service
calls”, it can do both. I was not able to quickly find the details on
how different service calls affect registers and found it simplest to
navigate through qemu
’s PAL
source.
AFAIU the PALcode
of a real alpha machine is a proprietary
processor-specific blob that could have its own quirks.
Back to our qemu-palcode
, let’s look at a few examples.
First is the function-like call_pal PAL_swpipl
used in entMM
and
others:
CallPal_SwpIpl:
        mfpr    v0, qemu_ps
        and     a0, PS_M_IPL, a0
        and     v0, PS_M_IPL, v0
        mtpr    a0, qemu_ps
        hw_rei
ENDFN   CallPal_SwpIpl
I know almost nothing about PAL
but I suspect mfpr
means
move-from-physical-register. hw_rei/hw_ret
is a branch from a PAL
service routine back to the “unprivileged” user/kernel.
hw_rei
does a normal return from call_pal
to the instruction next
to call_pal
.
Here call_pal PAL_rti
is an example of task-switch-like routine:
CallPal_Rti:
, qemu_exc_addr // Save exc_addr for machine check
mfpr p6
, FRM_Q_PS($sp) // Get the PS
ldq p4, FRM_Q_PC($sp) // Get the return PC
ldq p5$gp, FRM_Q_GP($sp) // Get gp
ldq , FRM_Q_A0($sp) // Get a0
ldq a0, FRM_Q_A1($sp) // Get a1
ldq a1, FRM_Q_A2($sp) // Get a2
ldq a2$sp, FRM_K_SIZE($sp) // Pop the stack
lda
, 3, p5 // Clean return PC<1:0>
andnot p5
and p4, PS_M_CM, p3
, CallPal_Rti_ToUser
bne p3
and p4, PS_M_IPL, p4
, qemu_ps
mtpr p4(p5)
hw_ret ENDFN CallPal_Rti
Here the target (p5
, some service-only hardware register) was passed on the
stack in FRM_Q_PC($sp)
.
That PAL_rti managed to confuse me a lot as I was trying to
single-step through it as a normal function. I did not notice how I was
jumping from page fault handling code to timer interrupt handling code.
But all became clear once I found its definition.
Parting words
- qemu
  can emulate alpha
  well enough to debug obscure kernel bugs
- the gdb
  server is very powerful for debugging unmodified kernel code, including hardware watchpoints, dumping registers and watching interrupt handling routines
- My initial guesses were all incorrect: it was not a kernel regression,
  not a compiler deficiency and not an __asm__
  constraint annotation bug.
- PALcode
  , while a nice way to abstract low-level details of the CPU implementation, complicates debugging of the operating system.
- PALcode
  also happens to be OS
  -dependent!
- This was another one-liner fix :)
- The bug has always been present in the kernel (for about 20 years?).
Have fun!