ia64 ptrace: off-by-one yak
gentoo
has one alive ia64
machine: HP rx3600
nicknamed
guppy
. It’s an old box, and yet it’s specs are not too different
from my 10 years old desktop: 32 GB RAM, 2 cores x 2 hyperthreads each.
Thus it’s suitable for on-host debugging of all sorts up to building
and running gcc
test suites or remote X client apps like pre-rust
version of firefox
.
kernel upgrade
There were a few major upstream kernel changes that made me a bit
nervous to update kernel on that machine from 4.9
:
cciss
SCSI driver was removed in favor ofhpsa
one (which among other things changed/dev/
paths for device)- large scale memory management cleanups broke kernel build on
ia64
for some memory models and that slipped to release tarball prompting questions ofia64
support status.
Fast forward a few years and we were running 4.9
kernel in 2021.
Whoops. No fancy new statx()
syscalls or modern container whistles.
Also John Paul Adrian started asking question if gentoo
were seeing
NUMA
-related boot failures.
I decided to upgrade a kernel to one of
recent releases.
First I attempted to upgrade straight to then most recent 5.10
(it
took me a while to write about it). Built 5.10
and booted it without
any problems (grub
++ for ease of messing with root=
over BMC
). I
rebuilt a few heavyweight packages to check basic ability to roll back if
needed. No problems! Phew! End of the story!
Or not.
the quest of strace
A few days after kernel upgrade Dmitry (strace
upstream maintainer)
reported a bug uncovered by strace
test suite:
https://bugs.gentoo.org/769614. Many tests started failing for invalid
syscall parameters as seen by ptrace()
.
Dmitry noticed that sometimes syscall parameters were off-by-one:
- expected:
sys_foo(a1,a2,a3,...)
- observed:
sys_foo(__NR_foo,a1,a2,a3,...)
Looks like a trivial bug, right?
Quiz: pause here for a minute and try to guess the nature of this bug.
It’s ok (and encouraged!) not to have any prior knowledge of kernels,
ptrace()
facility or ia64
architecture.
What kind of kernel change could trigger failures and what was the real cause of the observed discrepancy? A few distractors to chose for you:
- mechanical coding bug in recent kernel (like using wrong register during unwind)
- subtle failed assumption like stack handling direction or syscall argument passing
- a dormant
strace
bug - CPU hardware failure
- something else
the clues
At first I thought it to be a simpler form of alpha’s off-by-one
bug where we encoded
incorrect offsets to struct pt_regs
.
The fun and scary thing about ia64
compared to many other arches is
that it’s use of pt_regs
is a lot more
involved:
register access requires kernel stack unwindind up to the user space
boundary. But complex unwinding also happens to make it more uniform
compared to other arches: most of the time you just use slow unwinder
instead of hardcoding anything fast or special cased.
I hoped the fix would be a simpler variant of the alpha
bug. With that
bit of experience I hoped it would help.
The presence of unexpected syscall number in syscall parameters for
ptrace()
bothered me a bit. Why does it appear in such a strange
place?
I skimmed through arch/ia64
kernel commits in 4.9..5.10
range
related to ptrace()
, found 5 of them, stared a lot at them and did
not find anything suspicious.
Then I skimmed through kernel’s side of intercepting syscall
parameters. I looked at a break
trapping instruction handling.
break
instruction is similar to x86
int
: it trigger
interrupt style switch to kernel context. With a caveat that everything
on ia64
is special.
Looking at the source code ptrace()
handling path looked
straightforward:
Tracee:
- entry point is
ENTRY(break_fault)
(atarch/ia64/kernel/ivt.S
)- save only part of
struct pt_regs
GLOBAL_ENTRY(ia64_trace_syscall)
(atarch/ia64/kernel/entry.S
):PT_REGS_SAVES
: create new frame worth ofstruct pt_regs
f6-f11
FPU
registers are savedsyscall_trace_enter
(asm->C
boundary is crossed here, registers are passed as arguments) (atarch/ia64/kernel/ptrace.c
)tracehook_report_syscall_entry
communicates state to tracer- tracer resumes and gets data
- save only part of
Tracer:
ptrace_get_syscall_info_entry
info->entry.nr = syscall_get_nr(child, regs);
// regs->r15
The complication here is to find out which of two paths is buggy: tracer
path or tracee path? Tracee path looks more involved and has a higher
chance to fail. I needed to poke at a real example to match expected and
observed states.
I attempted to reproduce the bug on strace
test suite and instantly
got the same result. Yay! Sharing the machine with the reporter is very
convenient :)
I started sprinkling printk()
statement all over the kernel. Most
interesting result was for syscall_get_set_args_cb()
in
arch/ia64/kernel/ptrace.c
.
syscall_get_set_args_cb: krbs: 0xe000000103800ec0
syscall_get_set_args_cb: ndirty: 0x0
syscall_get_set_args_cb: count: 6
syscall_get_set_args_cb: krbs[0]: 0x40a
syscall_get_set_args_cb: krbs[1]: 0x2000000800122590
syscall_get_set_args_cb: krbs[2]: 0xbad1fed1
syscall_get_set_args_cb: krbs[3]: 0xbad2fed2
syscall_get_set_args_cb: krbs[4]: 0xbad3fed3
syscall_get_set_args_cb: krbs[5]: 0xbad4fed4
syscall_get_set_args_cb:args[0]: 0x40a
syscall_get_set_args_cb:args[1]: 0x2000000800122590
syscall_get_set_args_cb:args[2]: 0xbad1fed1
syscall_get_set_args_cb:args[3]: 0xbad2fed2
syscall_get_set_args_cb:args[4]: 0xbad3fed3
syscall_get_set_args_cb:args[5]: 0xbad4fed4
Here krbs
is kernel’s register backing store in memory.
RBS
memory area (pointed at ar.bsp
and ar.bspstore
registers) is where rotating registers are spilled on user’s request or
automatically. In case of ptrace()
registers are spilled on
kernel’s request with flushrs
instruction to make sure we can read
them from memory.
Normally krbs
should contain syscall arguments and maybe local
function variables. args
array is actual syscall arguments. We still
see 0x402
(__NR_read
) as first argument. That is probably a
manifestation of the bug we trace.
Fun fact: for some reason ia64
linux
syscalls start from 1024
(0x400
). Perhaps lower numbers are reserved for binary compatibility
with HPUX
?
bisecting the kernel
I still did not get anything obvious. Why did syscall number kept
getting in the list? I decided to bisect the kernel.
I rebooted back to 4.9
kernel and made sure that bug disappeared.
Ok, that meant it’s not at least some parallel user space update.
I bisected the kernel down to the following commit:
commit 201766a20e30f982ccfe36bebfad9602c3ff574a
Author: Elvira Khabirova <lineprinter@altlinux.org>
Date: Tue Jul 16 16:29:42 2019 -0700
ptrace: add PTRACE_GET_SYSCALL_INFO request
PTRACE_GET_SYSCALL_INFO is a generic ptrace API that lets ptracer obtain
details of the syscall the tracee is blocked in.
include/linux/tracehook.h | 9 ++++++---
include/uapi/linux/ptrace.h | 35 +++++++++++++++++++++++++++++++++++
kernel/ptrace.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
tools/testing/selftests/seccomp/seccomp_bpf.c | 13 +++++++++----
4 files changed, 150 insertions(+), 8 deletions(-)
Tl;DR is: it’s a new ptrace()
API to fetch traced syscall
arguments.
This means it’s not strictly a regression of existing code:
- before the kernel change
strace
used to manually extract syscall arguments fromptrace(PTRACE_GETREGS)
knowing the syscall ABI for an architecture - after the change it calls
ptrace(PTRACE_GET_SYSCALL_INFO)
and gets all the arguments as a nice portable array. No need to deal with register numbers!
That looked even more confusing! Arch-specific ptrace()
code was not
changed in that commit and yet some part of ptrace()
is somehow
broken. It meant that something was probably broken forever.
What next? I had to fall back to debugging the issue from the first
principles: carefully trace syscall entry into the kernel to debug
ptrace()
hook being called and check how syscall arguments are
extracted.
Sounds tedious but straightforward.
syscalls on ia64
So what IS the syscall ABI on ia64
? I never looked into too much
details besides knowing that “it looks very close to calling a C
function”. Is it true though?
Actually, ia64
has not one but two close but different syscall ABIs:
break
instruction based kernel entry (similar to interrupt trap)epc
magic instruction based (also calledfsys
mode, “fast syscall”)
break
ABI is roughly the following:
- syscall number is placed into
r15
(non-rotating register) - syscall arguments go into rotating registers for output (not input or local)
break 0x10000
is executed to switch into kernel mode viabreak
interrupt vectorENTRY(break_fault)
(atarch/ia64/kernel/ivt.S
)
Here argument layout for syscall matches argument layout for a standard
c
function on i64
. Only syscall number is passed via unusual
r15
.
glibc
syscall wrapper fully illustrates typical call:
ENTRY(syscall)
:
/* We are called like so{out0,out1,...,out6} registers -> {NR, arg1, ..., arg6}
{out1...out6} are available
Shift the register window so that in {out0...out5} like the kernel syscall handler expects. */
=ar.pfs,1,0,8,0
alloc r2mov r15=r32 /* syscall number */
break __IA64_BREAK_SYSCALL
;;
cmp.ne p6,p0=-1,r10 /* r10 = -1 on error */
) ret
(p6.cond.spnt.few __syscall_error br
Not to go into too many details syscall
prototype is syscall(NR, arg1, arg2, arg3, ..., arg8)
and kernel handles it as
sys_NR(arg1,arg2,arg3,...,arg8)
.
At this point I thought, “aha! off-by-one!”. But tracing through all
the paths of ia64
assembly I was not able to find any problems. I
had to read on stack handling in branch calls and interrupts in
excellent software development intel’s manual to make sure I don’t
miss any special cases. Nothing stood out.
Adding more debugging I realized break
mechanism was not used at all
in failing cases!
After a bit of debugging I discovered that most of glibc
syscalls are
actually done via epc
(not break
) mechanism! Gah! I never had a
chance to have a closer look at it and always assumed it’s an
unimplemented feature. So much for being a local expert in
ia64-linux
:D
epc
ABI is unusual: epc
instruction (Enter Privileged Code)
itself does only one thing: it changes privileges of current execution
from user space to kernel level and executes next instruction right after
it. This means no context switch, no traps executed. It’s almost like a
nop
. Next few instructions after epc
need to manually perform
all the necessary context operations. Which might be none for simplest
syscalls like getpid()
!
If user space could call epc
from anywhere that would be a good way
to negate any kernel protection. Thus CPU has a few restrictions: code
page with epc
needs to be marked as privileged for MMU so kernel
could provide safe code that is not easy to turn into an arbitrary
privilege escalation.
In practice kernel provides such a page as part of vDSO
. linux
calls it GATE
page. glibc
finds the vDSO
out and uses it as
syscall implementation. Linux calls the whole ABI an fsys
mechanism:
https://www.kernel.org/doc/html/latest/ia64/fsys.html
The __kernel_syscall_via_epc()
kernel entry in
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/ia64/kernel/gate.S#n300
looks roughly like that:
(__kernel_syscall_via_epc)
GLOBAL_ENTRY
.prologue
.altrp b6
.body
/*: the kernel cannot assume that the first two instructions in this
* Note. The remaining code must be safe even if
* bundle get executednot get executed.
* they do
*/=-1024,r15 // A
adds r17mov r10=0 // A default to successful syscall execution
// B causes split-issue
epc ;;
(r20, r22) // M2 (5 cyc to srlz.d)
RSM_PSR_BE_I(r14) // X
LOAD_FSYSCALL_TABLE ...
the argument layout bug
Digging through break
vs. epc
mechanism I found that
ptrace()
gets the syscall register frame in slightly different
states. For break
the layout is:
outputs:
- arg8
- arg7
- arg6
- ...
- arg1
locals:
<none>
inputs:
- NR
For epc
:
outputs:
- arg8
- arg7
- arg6
- ...
- arg1
locals:
- NR
inputs:
<none>
Both are perfectly valid states prepared to pass arguments to callee
function. And both require a br.call
instruction to rotate
outputs
to the inputs
of callee target.
In normal syscall handling case (without ptrace()
attached)
br.call
is called for passing the control to syscall handler written
in c
. All arguments are present in their r32,r33,...
locations.
In tracing case ptrace()
is executed right before br.call
. There
we need to inspect syscall arguments (and possibly modify on user’s
request) in outputs
registers part of the active register set.
The bug was in the fact that ia64
-specific ptrace()
code assumed
that locals are never present (as in break
glibc
wrapper case).
Once we know the problem the fix is easy: skip locals and always use
output
registers when inspecting syscall arguments:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0ceb1ace4a2778e34a5414e5349712ae4dc41d85
Tl;DR of a patch: ia64
has a cr.ifs
register that tells us how
many inputs, local and output registers are used in current function
(register’s value is maintained by alloc
and br.call
/
br.ret
instructions). Before the change we collected both
locals
+ outputs
, after the change - only outputs
.
Are we done?
the error handling bug
After the above patch applied I ran the strace
test suite again. It
was a lot healthier with just a few failures. This time the error was in
syscall exit code (also new addition) of
ptrace(PTRACE_SYSCALL_INFO_EXIT)
. This time sign of errno
error
was wrong.
Here we have a chance to see how ia64
syscalls return status back to
user space: two fixed registers are used for that:
r10
for fail-or-not statusr8
for status-or-error.
Here is the original code to set and get the status:
static inline long syscall_get_error(struct task_struct *task,
struct pt_regs *regs)
{
return regs->r10 == -1 ? regs->r8:0;
}
static inline long syscall_get_return_value(struct task_struct *task,
struct pt_regs *regs)
{
return regs->r8;
}
static inline void syscall_set_return_value(struct task_struct *task,
struct pt_regs *regs,
int error, long val)
{
if (error) {
/* error < 0, but ia64 uses > 0 return value */
->r8 = -error;
regs->r10 = -1;
regs} else {
->r8 = val;
regs->r10 = 0;
regs}
}
6 lines of code. Simple, eh? Can you spot the error?
Note how syscall_get_error()
does not remove negation added in
syscall_set_return_value()
. The fix was obvious:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=61bf318eac2c13356f7bd1c6a05421ef504ccc8a
Surely done now?
signal handling bug
A few more test failures revealed another fun bug: signal handlers
return wrong signal masks.
In this case failed tests complained about status of blocked signals in
tracing rt_sigreturn()
syscall.
When signal is delivered to user space processes memory stack gets an
extra struct that describes delivered signal and a bit of context (on
any arch AFAIU):
struct sigframe {
unsigned long arg0; /* signum */
unsigned long arg1; /* siginfo pointer */
unsigned long arg2; /* sigcontext pointer */
void __user *handler; /* pointer to the plabel of the signal handler */
struct siginfo info;
struct sigcontext sc;
};
...
struct sigcontext {
...
; /* signal mask to restore after handler returns */
sigset_t sc_mask};
Somehow sc_mask
had unexpected value. Was it incorrectly populated
by kernel? ia64
has two user space stacks:
- usual
c
memory stack (tracked byr12
register), user space usually keeps there function-local buffers, variables that don’t fit in registers, etc. - register backing store stack (
ar.bsp
and friends point to it), this area is managed by CPU to maintain rotating registers contents onbr.call
/br.ret
. It’s actually quite hard to interpret it’s contents even in steady state due to fancy alignment restrictions, boundary framing withNaT
marking. It’s memory is also not synchronized with CPUs view of that memory to speed things up. You almost never want to mess with it for something like passing a struct around.
When it’s in rt_sigreturn()
syscall return that is the simplest way
to look at sigframe
on stack. Where stack is present as one of struct ptrace_syscall_info
fields:
struct ptrace_syscall_info {
; /* Type of system call stop */
__u8 op; /* AUDIT_ARCH_* value; see seccomp(2) */
__u32 arch; /* CPU instruction pointer */
__u64 instruction_pointer; /* CPU stack pointer */
__u64 stack_pointerunion {
struct { /* op == PTRACE_SYSCALL_INFO_ENTRY */
; /* System call number */
__u64 nr[6]; /* System call arguments */
__u64 args} entry;
struct { /* op == PTRACE_SYSCALL_INFO_EXIT */
; /* System call return value */
__s64 rval; /* System call error flag;
__u8 is_error Boolean: does rval contain
an error value (-ERRCODE) or
a nonerror return value? */
} exit;
struct { /* op == PTRACE_SYSCALL_INFO_SECCOMP */
; /* System call number */
__u64 nr[6]; /* System call arguments */
__u64 args; /* SECCOMP_RET_DATA portion
__u32 ret_data of SECCOMP_RET_TRACE
return value */
} seccomp;
};
};
So which of two stacks should be present in stack_pointer
field?
r12
or ar.bspstore
? As a result our frame is something like
that:
+-------
| ... <registers for outer functions, their cfm, predicates>
+-------
| ... <- ar.bspstore
+-------
| ... <- ar.bsp
| vvv grows down vvv
|
|
| ^^^ grows up ^^^
~~~~~~~~
| ... <- r12
| <signal frame>
| <16-bytes of scratch area>
| <rest of memory stack>
+-------
Kernel had to pick one and picked wrong: ar.bspstore
. It contains no
valid data at all. It consists only of leftover values for previous
register flushes and loads.
As a result ptrace()
looked at a part of register backing store to
look up rt_sigreturn(). Once this mismatch became clear the fix was obvious: <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7ad1e366167837daeb93d0bacb57dee820b0b898> By then
strace` test suite passed all tests!
Can we cautiously declare the endeavor done?
plot twist
Well, strace
test suite now started hanging whole box on almost every
test suite run. Looks like before it did not manage to get that far to
destroy kernel’s internal state. Example crash:
Unable to handle kernel NULL pointer dereference (address 0000000000000338)
sock_filter-v-X[6171]: Oops 11012296146944 [18]
Modules linked in: usb_storage e1000 acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler rtc_efi
CPU: 0 PID: 6171 Comm: sock_filter-v-X Tainted: G B D W 5.12.0-rc2-00003-g97669c51470e-dirty #85
Hardware name: hp server rx3600 , BIOS 04.03 04/08/2008
psr : 0000121008026010 ifs : 800000000000040b ip : [<a00000010008d1f1>] Tainted: G B D W (5.12.0-rc2-00003-g97669c51470e-dirty)
ip is at ptrace_stop+0x2b1/0x860
unat: 0000000000000000 pfs : 000000000000040b rsc : 0000000000000003
rnat: 0000000000000000 bsps: 0000000000000000 pr : 000000255aa66a15
ldrs: 0000000000000000 ccv : 00000000fffffa92 fpsr: 0009804c0270033f
csd : 0000000000000000 ssd : 0000000000000000
b0 : a00000010008d1b0 b6 : a0000001008b1b20 b7 : a00000010000d010
f6 : 000000000000000000000 f7 : 1003e8208208208208209
f8 : 1003effffffffffffffea f9 : 1003e0000000000000033
f10 : 1003e8208208208208209 f11 : 1003effffffffffffffe6
r1 : a000000101906440 r2 : 0000000000000010 r3 : 0000000000000000
r8 : 00000000b3a0d9d1 r9 : 00000000000059d0 r10 : 00000000b3a08001
r11 : 0000000000000001 r12 : e00000010f2d5880 r13 : e00000010f2d0000
r14 : a0000001015c8304 r15 : 00000000deaf1eed r16 : e00000010f2d0000
r17 : e00000010f2d100c r18 : a000000101706e70 r19 : e00000010f2d0018
r20 : 0000000000010289 r21 : e00000010f2d0450 r22 : 0000000000000000
r23 : 0000000000000338 r24 : 000000000000b3a2 r25 : 000000000000b3a2
r26 : e00000010f2d048c r27 : 0000000000010013 r28 : fffffffffff7ffff
r29 : 0000000000120000 r30 : 0000000000000000 r31 : e00000010f2d100c
Call Trace:
[<a000000100014d10>] show_stack+0x90/0xc0
sp=e00000010f2d54b0 bsp=e00000010f2d3738
[<a000000100015410>] show_regs+0x6d0/0xa40
sp=e00000010f2d5680 bsp=e00000010f2d36c8
[<a0000001000285e0>] die+0x1e0/0x3c0
sp=e00000010f2d56a0 bsp=e00000010f2d3688
[<a00000010005b160>] ia64_do_page_fault+0x820/0xb80
sp=e00000010f2d56a0 bsp=e00000010f2d35e8
[<a00000010000ca00>] ia64_leave_kernel+0x0/0x270
sp=e00000010f2d56b0 bsp=e00000010f2d35e8
[<a00000010008d1f0>] ptrace_stop+0x2b0/0x860
sp=e00000010f2d5880 bsp=e00000010f2d3590
[<a00000010008d8a0>] ptrace_do_notify+0x100/0x120
sp=e00000010f2d5880 bsp=e00000010f2d3560
[<a00000010008d950>] ptrace_notify+0x90/0x1a0
sp=e00000010f2d58c0 bsp=e00000010f2d3540
[<a000000100073700>] do_exit+0x1540/0x1700
sp=e00000010f2d58c0 bsp=e00000010f2d34c8
[<a0000001000287b0>] die+0x3b0/0x3c0
sp=e00000010f2d58d0 bsp=e00000010f2d3488
It’s a NULL
pointer dereference. How hard could it be to nail down
and fix (or at least workaround)?
To make the box less unstable I sprinkled a few if (p == NULL) { WARN_ON(1); return; }
around. That allowed surviving a few strace
test suite runs in a row. Woohoo!
guppy
was able to survive a few days and then crashed with even more
dire and inscrutable panic.
Before digging into more details I first synced to latest kernel git to
ease upstreaming things easier bit by bit and using Latest and Greatest
code. Surprisingly latest linux.git
did not even boot.
One of failures Jens quickly fixed right after successful bisection related to task-level flag handling: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f5f4fc4649ae542b1a25670b17aaf3cbb6187acc
Another failure was use of atomic
against unaligned bool
struct
field in a hpsa
disk driver. It was also easy to fix with help of Don
and others:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=02ec144292bc424a5800d45d4cb472c66e97c520
I was “lucky” to find those failures before another kernel release (or
another few years of negligence).
Latest linux.git
was still crashing the box under strace
test suite.
more kernel debugging
As a next step I enabled everything I could find related to memory
corruption debugging in linux
kernel:
CONFIG_VM_DEBUG
page_poison=on
init_on_alloc=1
init_on_free=1
page_owner=on
hardened_usercopy=1
memblock=debug
- various slab debugs
As a result I got kernel to an unbootable state again /o\.
I hoped boot failures were related to underlying problem I observed. One
of the annoyances was that kernel silently crashed and did not print
errors to BMC
serial output. I disabled most of debugging flags back
and left page_poison=on
init_on_alloc=1
init_on_free=1
.
This allowed catching and fixing some minor warnings like:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f2a419cf495f95cac49ea289318b833477e1a0e2.
After a while (about a day) I started seeing the reports of arbitrarily corrupted memory:
pagealloc: memory corruption
000000004a763954: 05 00 00 00 00 00 00 00 f8 b5 b0 ff ff 0f 00 60 ...............`
00000000b3626ed1: 60 b7 b0 ff ff 0f 00 60 50 68 1c 00 08 00 00 20 `......`Ph.....
00000000f59604da: 00 00 00 00 00 00 00 00 00 70 00 00 00 00 00 00 .........p......
00000000345d9313: e3 c2 9b 14 00 00 00 00 aa aa aa aa aa aa aa aa ................
00000000d092c8b5: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa ................
...
0000000088df4d5c: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa ................
00000000f6e761a6: aa aa aa aa aa aa aa aa 45 78 63 65 65 64 65 64 ........Exceeded
0000000000d45288: 20 4d 61 78 53 74 61 72 74 75 70 73 0d 0a 00 aa MaxStartups....
00000000c40693de: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa ................
00000000cf8ee6dc: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa ................
...
000000005fa7b069: aa aa aa aa aa aa aa aa 10 00 00 00 ............
CPU: 1 PID: 25234 Comm: sshd Not tainted 5.12.0-rc2-00010-gd6be88a244a9-dirty #125
Hardware name: hp server rx3600 , BIOS 04.03 04/08/2008
Call Trace:
[<a000000100015210>] show_stack+0x90/0xc0
[<a000000101161760>] dump_stack+0x150/0x1c0
[<a0000001003f17b0>] __kernel_unpoison_pages+0x3f0/0x400
[<a0000001003c1dc0>] get_page_from_freelist+0x1460/0x2ca0
[<a0000001003c6540>] __alloc_pages_nodemask+0x3c0/0x660
[<a0000001003ecfd0>] alloc_pages_vma+0xb0/0x500
[<a000000100375580>] wp_page_copy+0xe0/0x15e0
[<a0000001003799b0>] do_wp_page+0x170/0xa00
[<a00000010037e0e0>] __handle_mm_fault+0x1960/0x1fe0
[<a00000010037ea70>] handle_mm_fault+0x310/0x4e0
[<a00000010005da50>] ia64_do_page_fault+0x1f0/0xb80
[<a00000010000ca00>] ia64_leave_kernel+0x0/0x270
Yay! Maybe that’s it?
The backtrace tells us it’s a page fault handling code faulting in a
page that used to be in page freelist. But it’s already in a corrupted
state: page should be full of aa
values, but it clearly has some
unrelated data like Exceeded MaxStartups
.
I prepared for a deep dive into virtual memory management in linux
:
- got basic understanding of page fault handling on
ia64
,TLB
population (VHPT
,TR
, andTC
registers management) - got basic understanding of memory management layout on
ia64
: where linear mapping starts (aka “identity” + base offset), wherevmalloc()
starts it’s address, how and when it gets freed - got basic understanding of
linux
3-4-5 level page tables are maintained and synced back to architecture-specificTLB
.
And after much debugging I found that this corruption is a bug in
debugging mechanism:
init_on_alloc=1
, init_on_free=1
and page_poison=on
are in
direct conflict (and a bit of redundancy) with each other:
- redundancy: if you have freed a page on
init_on_free=1
system then memory page can be allocated withoutmemset(0)
even oninit_on_alloc=1
system becauseinit_on_free=1
already guarantees it! - conflict:
init_on_free=1
doesmemset(0)
whilepage_poison=1
doesmemset(0xaa)
. Which one has more priority?
The problem was that init_on_alloc=1
+ init_on_free=1
+
debug_pagealloc=1
led to page freeing with memset(aa)
and
allocation without any memset(0)
at all. This caused two problems:
- reports on corruption where it should not be
- return pages with garbage data to the system even if system requested `alloc_page(__GFP_ZERO)
In our case alloc_page(__GFP_ZERO)
was used for page table
(PTE
) allocations and instead of returning PTE
of no pages it
was full of bits that looked like pointers to other pages.
Once this conflict was understood it was easy to report the bug and fix
it with help of mm
folks:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9df65f522536719682bccd24245ff94db956256c
There are many nuances when exactly the bug could happen. For example
architecture must not support CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC
(x86_64
was not affected by the bug). Otherwise different mechanisms
kick in.
On a positive side a few month later I managed to fix a mirror image bug
on x86_64
:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=69e5d322a2fb86173fde8bad26e8eb38cad1b1e9.
I’ll spare you the details why there is a complication of handling
static keys and early parameters in a way that it could go out of sync
like that.
another yak: page_owner
Linux kernel has a very cool facility enabled by both
CONFIG_PAGE_OWNER=y
and page_owner=on
boot option. It’s idea is
to keep the history of most recent callers who freed and who allocated
the page including full backtrace and page flags. You can see full state
for each page in /sys/kernel/debug/page_owner
:
# cat /sys/kernel/debug/page_owner
...
Page allocated via order 0, mask 0x12cc0(GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY),
pid 1, ts 651931156 ns, free_ts 0 ns
PFN 1049310 type Unmovable Block 1024 type Unmovable Flags 0x8000000000000200(sl
ab|zone=2)
get_page_from_freelist+0xa31/0xcd0
__alloc_pages+0x161/0x2b0
allocate_slab+0x382/0x420
___slab_alloc.constprop.0+0x512/0x730
__slab_alloc.constprop.0+0x90/0xc0
kmem_cache_alloc+0x3f2/0x430
kmem_cache_create_usercopy+0x13e/0x2e0
kmem_cache_create+0x18/0x20
khugepaged_init+0x20/0x61
hugepage_init+0x84/0x131
do_one_initcall+0x41/0x200
kernel_init_freeable+0x18e/0x1d6
kernel_init+0x16/0x110
ret_from_fork+0x1f/0x30
...
Initially I hoped to use page_owner
to dump at corruption detection
time and at random times when I want to see past page history manually
(say, at NULL
-corruption detection time).
The problem was that setting page_owner=on
rendered ia64
unbootable. It happened because on ia64
stack unwinder requires
memory allocation (and gets into infinite recursion). On x86_64
for
example the stack unwinder does not need extra memory allocation.
Fun fact: actually storing stack trace for page owner itself does
require memory allocation as well (on any architecture). It could have a
potential getting into recursion. page_owner
code tried to prevent
it by scanning current backtrace for duplicate address entries.
I sidestepped page_owner=on
recursion by storing single bit in
currently running task:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8e9b16c47680f6e7d6e5864a37f313f905a91cf5
It should also be slightly faster now.
While I have such a nice corruption reporter I attempted to explore and
fix a few tools available to debug it efficiently:
- added
page_owner
info reporting on detected page corruption: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f58bd538e6a2deb2bcdfe527d9ed45643348a4e6 - fixed
page_owner=on
to work the same aspage_owner=1
(for consistency): https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=608b5d668c8ea6734594a401c9adab4093ad9847 - fixed
ia64
-specific symbolizer crash on function descriptors: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=99e729bd40fb3272fa4b0140839d5e957b58588a
back to strace
killer
strace
test suite was still able to kill the box. Example crash looked
like:
Unable to handle kernel paging request at virtual address aaaaaaaaaaaaaab2
swapper/0[0]: Oops 8813272891392 [1]
Modules linked in: acpi_ipmi e1000 usb_storage ipmi_si ipmi_devintf ipmi_msghandler rtc_efi
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.12.0-rc2-00010-gd6be88a244a9-dirty #129
Hardware name: hp server rx3600 , BIOS 04.03 04/08/2008
psr : 0000101008026010 ifs : 8000000000000307 ip : [<a000000100e8e670>] Not tainted (5.12.0-rc2-00010-gd6be88a244a9-dirty)
ip is at sk_filter_release_rcu+0x70/0x120
unat: 0000000000000000 pfs : 0000000000000895 rsc : 0000000000000003
rnat: 0000000000000468 bsps: 0000000000001000 pr : 01606a5694556a55
ldrs: 0000000000000000 ccv : 000000001f020f81 fpsr: 0009804c0270033f
csd : 0000000000000000 ssd : 0000000000000000
b0 : a00000010018fd00 b6 : a000000100e8e600 b7 : e00000003f81c740
f6 : 1003e000000000003ae7e f7 : 1003e0000000002813e10
f8 : 1003e0019aba91335cd31 f9 : 1003e0019b4e1e1313911
f10 : 1003e0000038521a57b7f f11 : 1003e0000000000000000
r1 : a0000001019465c0 r2 : a000000200034004 r3 : 00000000aaaaaaaa
r8 : aaaaaaaaaaaaaab2 r9 : a00000010174d0c0 r10 : ffffffffffa3cd70
r11 : a00000010148bec8 r12 : a000000101607ba0 r13 : a000000101600000
r14 : e000000116287718 r15 : a000000200034040 r16 : a000000100e8e600
r17 : e00000010e6dfe80 r18 : a000000101607bc0 r19 : e000000005808fc0
r20 : e00000010e6dfe80 r21 : 0000000000000000 r22 : e00000010e6dfe80
r23 : e00000010e6dfe80 r24 : e000000005808f78 r25 : 0000000000001f04
r26 : 00000000000cf92c r27 : 0000000000000007 r28 : a00000010174dba8
r29 : 0000000000000007 r30 : 0000000000000007 r31 : 000000000000000a
Call Trace:
[<a000000100015210>] show_stack+0x90/0xc0
[<a000000100015910>] show_regs+0x6d0/0xa40
[<a000000100029420>] die+0x1e0/0x3c0
[<a00000010005e370>] ia64_do_page_fault+0xb10/0xb80
[<a00000010000ca00>] ia64_leave_kernel+0x0/0x270
[<a000000100e8e670>] sk_filter_release_rcu+0x70/0x120
[<a00000010018fd00>] rcu_core+0x8c0/0x1440
[<a0000001001908a0>] rcu_core_si+0x20/0x40
[<a000000101182cb0>] __do_softirq+0x230/0x670
[<a000000100079d60>] irq_exit+0x180/0x220
[<a000000100013a70>] ia64_handle_irq+0x1b0/0x360
[<a00000010000ca00>] ia64_leave_kernel+0x0/0x270
[<a0000001000143f0>] ia64_pal_call_static+0x90/0xc0
[<a0000001000150c0>] ia64_pal_halt_light.isra.0+0x40/0x80
[<a000000100016200>] arch_cpu_idle+0x100/0x1c0
[<a0000001011818a0>] default_idle_call+0xe0/0x140
[<a0000001000eb530>] do_idle+0x330/0x4e0
[<a0000001000ebe30>] cpu_startup_entry+0x50/0x80
[<a00000010116ded0>] rest_init+0x230/0x250
[<a000000101490e70>] arch_call_rest_init+0x20/0x40
[<a000000101491ad0>] start_kernel+0xbf0/0xc20
[<a00000010116dc60>] start_ap+0x760/0x780
Disabling lock debugging due to kernel taint
Kernel panic - not syncing: Fatal exception
---[ end Kernel panic - not syncing: Fatal exception ]---
This time the victim is a swapper
thread that happens to execute
deferred sk_filter_release_rcu
execution: strace
test allocated
sk_filter
and someone else failed to free it. Virtual address
aaaaaaaaaaaaaab2
says that it’s probably a use-after-free case.
Should be simple to debug, right?
As I already spent A Lot of time spelunking through memory management in
ia64
I dropped a bit of dead code around DISCONTIGMEM
:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9187592b96385e5060dfb2b182aa9ec93d5c0332
From now on ia64
is just a SPARSEMEM
architecture. Or
FLATMEM
if you are lucky to get contiguous physical address layout (
I am not: rx3600
has 1TB gap in physical memory ranges for me).
I minimized strace
killer example down to:
#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <linux/filter.h>
int main(void)
{
struct sock_filter bpf_filter[] = {
(BPF_RET|BPF_K, 0)
BPF_STMT};
struct sock_fprog prog = {
.len = 1,
.filter = bpf_filter,
};
int fd = socket(AF_INET, SOCK_DGRAM, 0);
(fd, SOL_SOCKET, SO_ATTACH_FILTER, &prog, sizeof(prog));
setsockoptreturn 0;
}
This crashes guppy
in a second (usually takes 8 iterations of the
loop below):
$ gcc bug.c -o bug; while ./bug; do echo again; done
The sample program creates IPv4 socket and attaches BPF
program to it.
Kernel crashes at the socket memory cleanup time.
Socket filters are special in kernel because they use slightly different
virtual memory freeing policy compared to rest of vmalloc()
’ed regions:
it’s called VM_FLUSH_RESET_PERMS
. This flag should eagerly unmap
memory and eagerly flush TLB
. From my understanding it’s a security
feature that slightly decreases performance and does not affect
correctness (modulo bugs we probably observe here).
To get ia64
box into a fully stable state I ignore any
VM_FLUSH_RESET_PERMS
mappings with the following hack:
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2214,6 +2214,9 @@ static void vm_remove_mappings(struct vm_struct *area, int deallocate_pages)
remove_vm_area(area->addr);
+ /* workaround mysterious double-free on vmalloc() for bpf. */
+ return;
+
/* If this is not VM_FLUSH_RESET_PERMS memory, no need for the below. */
if (!flush_reset) return;
Unfortunately I don’t know yet why crash happens and can only speculate
at this point. I suspect that mm
code lacks a barrier somewhere that
allows page reuse before TLB
flush happens. To be continued.
Parting Words
strace
has a great test suite to detect all sorts of corner cases in
linux
kernel. Random factoids:
ia64
linux
syscalls start from1024
.ia64
got even betterptrace()
support.page_owner=on
is now usable onia64
!- I still did not get to the bottom of it. But it feels I’m very close :)
VM_FLUSH_RESET_PERMS
is a thing.- It took me about 2 months to get some progress on this problem.
- It took
guppy
about 350 reboots to recover from machine lockups. I suspect it’s more than this machine ever saw in it’s previous life. - Debugging tools can corrupt your data sometimes even if original setup is not supposed to.
Have fun!