Gentoo Websites Logo
Go to: Gentoo Home Documentation Forums Lists Bugs Planet Store Wiki Get Gentoo!
Bug 769614 - arch/ia64/kernel/ptrace.c ia64_syscall_get_set_arguments off-by-one
Summary: arch/ia64/kernel/ptrace.c ia64_syscall_get_set_arguments off-by-one
Status: RESOLVED FIXED
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: Current packages (show other bugs)
Hardware: IA64 Linux
: Normal normal (vote)
Assignee: Sergei Trofimovich (RETIRED)
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-02-08 22:03 UTC by Dmitry V. Levin
Modified: 2021-05-14 08:00 UTC (History)
2 users (show)

See Also:
Package list:
Runtime testing required: ---


Attachments
0001-ia64-fix-ia64_syscall_get_set_arguments-for-break-ba.patch (0001-ia64-fix-ia64_syscall_get_set_arguments-for-break-ba.patch,2.89 KB, patch)
2021-02-20 22:25 UTC, Sergei Trofimovich (RETIRED)
Details | Diff
0001-ia64-fix-ptrace-PTRACE_SYSCALL_INFO_EXIT-sign.patch (0001-ia64-fix-ptrace-PTRACE_SYSCALL_INFO_EXIT-sign.patch,1.96 KB, patch)
2021-02-21 00:01 UTC, Sergei Trofimovich (RETIRED)
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Dmitry V. Levin 2021-02-08 22:03:45 UTC
Hello, strace upstream is speaking. :)

Somebody has updated the kernel to 5.10.4-gentoo on guppy.ia64.dev.gentoo.org - the only place where I can regularly test strace on ia64.

That kernel has PTRACE_GET_SYSCALL_INFO support, but it doesn't work properly there.  The test (ptrace_syscall_info.gen.test) fails with the following diagnostics: ptrace_syscall_info.c:256: #1: entry stop mismatch

The internal runtime strace test fails with the same diagnostics:
$ ./src/strace -d -enone / 2>&1 |grep -i ptrace
./src/strace: ptrace_setoptions = 0x51
./src/strace: test_ptrace_get_syscall_info: #1: entry stop mismatch
./src/strace: PTRACE_GET_SYSCALL_INFO does not work

A bit of debugging shows the following picture:
$ ./src/strace -qqq -eptrace -o'|grep PTRACE_SYSCALL_INFO_ENTRY' ./src/strace -d -enone / 2>/dev/null 
ptrace(PTRACE_GET_SYSCALL_INFO, 13579, 88, {op=PTRACE_SYSCALL_INFO_ENTRY, arch=AUDIT_ARCH_IA64, instruction_pointer=0x200000000037e3f0, stack_pointer=0x600007fffffe0620, entry={nr=1034, args=[0x40a, 0x20000008001537f0, 0xbad1fed1, 0xbad2fed2, 0xbad3fed3, 0xbad4fed4]}})

As you can see, entry.args[0] == entry.nr, and the remaining entry.args[] elements are off by one element.

I'm pretty sure that kernel's tools/testing/selftests/ptrace/get_syscall_info.c would fail the same way.

The only plausible explanation is that ia64_syscall_get_set_arguments() has an off-by-one bug.
Comment 1 Dmitry V. Levin 2021-02-08 22:54:23 UTC
FWiW, I'm aware of the linux/arch/ia64 status change in https://git.kernel.org/torvalds/c/96ec72a3425d1515b69b7f9dc34a4a6ce5862a37 .
Comment 2 Sergei Trofimovich (RETIRED) gentoo-dev 2021-02-09 06:52:18 UTC
Yeah, I updated kernel yesterday from 4.9.95 on guppy. Let's see if we can find causing change.
Comment 3 Sergei Trofimovich (RETIRED) gentoo-dev 2021-02-09 07:16:19 UTC
Suspicious upstream changes are:
- 4ff8a356daafaafbf90141ee7a3b8fdc18e560a8
- e3fdfa37a3fabea38b01af05893050136a6cd136
- 6bc4f16c6c9bed6ce5c3ab77b95397c8c88bdb66
- a79ca8e7b9d56c762c3cd53465fde62f8ca41acf
- e2115cf3cc57fc4b3f6b0ea4f746e5ab6dc9c2c0
- 4c35bf3ae948d6232f3d2c6fbf2008c341490683
Comment 4 Sergei Trofimovich (RETIRED) gentoo-dev 2021-02-09 11:49:56 UTC
Let's see what is broken: 1) syscall trace handler or 2) ptrace() read call.

A few random factoids:
- on ia4 syscall number is supposed to be passed via r15 (`break` instruction triggers entry)

1) syscall trace handler seems to work as:

- entry point is ENTRY(break_fault) (at arch/ia64/kernel/ivt.S):
  -> save only part of pt_regs: 
  -> GLOBAL_ENTRY(ia64_trace_syscall) (at arch/ia64/kernel/entry.S):
    -> PT_REGS_SAVES: create new frame worth of pt_regs struct
    -> f6..f11 FPUs regs are saved
    -> syscall_trace_enter (asm->C, registers are passed as arguments) (at  arch/ia64/kernel/ptrace.c)
      -> tracehook_report_syscall_entry

2) ptrace() read call path is:
- ...
  -> ptrace_get_syscall_info_entry
     -> info->entry.nr = syscall_get_nr(child, regs); // regs->r15

At least 'nr' gets populated directly from 'pt_regs'.

I wonder if the problem is in passing ptrace_syscall_info struct itself back to userspace if we get consistent offset mismatch for all args and syscall number.

I see that on guppy we are using /usr/include/linux/ptrace.h from 5.10 that does not yet have a minor patch:

  commit 0032ce0f85a269a006e91277be5fdbc05fad8426
  Author: Peilin Ye <yepeilin.cs@gmail.com>
  Date:   Sat Aug 1 11:20:44 2020 -0400

    ptrace: Prevent kernel-infoleak in ptrace_get_syscall_info()

 struct ptrace_syscall_info {
        __u8 op;        /* PTRACE_SYSCALL_INFO_* */
-       __u32 arch __attribute__((__aligned__(sizeof(__u32))));
+       __u8 pad[3];
+       __u32 arch;

I think it should not matter (unless gcc has a bug and adds more than required alignment in strace client binary).
Comment 5 Sergei Trofimovich (RETIRED) gentoo-dev 2021-02-09 22:48:30 UTC
Here is the method that populates syscall args with a few printk()s added:

static void syscall_get_set_args_cb(struct unw_frame_info *info, void *data)
{
        struct syscall_get_set_args *args = data;
        struct pt_regs *pt = args->regs;
        unsigned long *krbs, cfm, ndirty;
        int i, count;

        if (unw_unwind_to_user(info) < 0)
                return;

        cfm = pt->cr_ifs;
        krbs = (unsigned long *)info->task + IA64_RBS_OFFSET/8;
        ndirty = ia64_rse_num_regs(krbs, krbs + (pt->loadrs >> 19));
        printk("syscall_get_set_args_cb: krbs: %#lx\n", (long)krbs);
        printk("syscall_get_set_args_cb: ndirty: %#lx\n", (long)ndirty);

        count = 0;
        if (in_syscall(pt)) {
                count = min_t(int, args->n, cfm & 0x7f);
                printk("syscall_get_set_args_cb: count: %d\n", count);
        }

        for (i = 0; i < ndirty + count; i++) {
                printk("syscall_get_set_args_cb: krbs[%d]: %#lx\n", i, (long)krbs[i]);
        }

        for (i = 0; i < count; i++) {
                if (args->rw)
                        *ia64_rse_skip_regs(krbs, ndirty + i + args->i) =
                                args->args[i];
                else
                        args->args[i] = *ia64_rse_skip_regs(krbs,
                                ndirty + i + args->i);
                printk("syscall_get_set_args_cb:args[%d]: %#lx\n", i, (long)args->args[i]);
        }

        if (!args->rw) {
                while (i < args->n) {
                        args->args[i] = 0;
                        i++;
                }
        }
}

It dumps the following output:

[   61.700354] syscall_get_set_args_cb: krbs: 0xe000000103800ec0
[   61.700354] syscall_get_set_args_cb: ndirty: 0x0
[   61.700354] syscall_get_set_args_cb: count: 6
[   61.700354] syscall_get_set_args_cb: krbs[0]: 0x40a
[   61.700354] syscall_get_set_args_cb: krbs[1]: 0x2000000800122590
[   61.700354] syscall_get_set_args_cb: krbs[2]: 0xbad1fed1
[   61.700354] syscall_get_set_args_cb: krbs[3]: 0xbad2fed2
[   61.700354] syscall_get_set_args_cb: krbs[4]: 0xbad3fed3
[   61.700354] syscall_get_set_args_cb: krbs[5]: 0xbad4fed4
[   61.700354] syscall_get_set_args_cb:args[0]: 0x40a
[   61.700354] syscall_get_set_args_cb:args[1]: 0x2000000800122590
[   61.701727] syscall_get_set_args_cb:args[2]: 0xbad1fed1
[   61.701727] syscall_get_set_args_cb:args[3]: 0xbad2fed2
[   61.701727] syscall_get_set_args_cb:args[4]: 0xbad3fed3
[   61.701727] syscall_get_set_args_cb:args[5]: 0xbad4fed4

krbs is supposed to be a register backing store for registers passed from userland to kernel. 'args[0]: 0x40a' indeed looks like a wrong base.

Trying to figure out where exactly register spill happens to backing store to get the idea where the off-by-one happens.
Comment 6 Sergei Trofimovich (RETIRED) gentoo-dev 2021-02-10 22:03:40 UTC
Oh, I see the test uses syscall() glibc directly to see what is passed around:

sysdeps/unix/sysv/linux/ia64/syscall.S

  ENTRY(syscall)
        /* We are called like so:
           {out0,out1,...,out6} registers -> {NR, arg1, ..., arg6}
           Shift the register window so that {out1...out6} are available
           in {out0...out5} like the kernel syscall handler expects.  */
        alloc r2=ar.pfs,1,0,8,0
        mov r15=r32             /* syscall number */
        break __IA64_BREAK_SYSCALL
        ;;
        cmp.ne p6,p0=-1,r10     /* r10 = -1 on error */
  (p6)  ret
        br.cond.spnt.few __syscall_error
  PSEUDO_END(syscall)

I don't quite understand why it's not a
    alloc r2=ar.pfs,1,0,6,0
but passing more data should be fine.

Kernel probably does not handle correctly input arg.
Comment 7 Sergei Trofimovich (RETIRED) gentoo-dev 2021-02-20 17:37:12 UTC
I decided to bisect kernel right on guppy and the result is the commit that adds PTRACE_GET_SYSCALL_INFO: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=201766a20e30f982ccfe36bebfad9602c3ff574a

commit 201766a20e30f982ccfe36bebfad9602c3ff574a
Author: Elvira Khabirova <lineprinter@altlinux.org>
Date:   Tue Jul 16 16:29:42 2019 -0700

    ptrace: add PTRACE_GET_SYSCALL_INFO request

    PTRACE_GET_SYSCALL_INFO is a generic ptrace API that lets ptracer obtain
    details of the syscall the tracee is blocked in.

I agree ia64_syscall_get_set_arguments()'s assumption of stack frame is slightly wrong about RSE when kernel is entered via syscall() glibc wrapper.

It feels that we should have used only 'cfm.sof - cfm.sol' (ignore locals) range or RSE entries. While we are using 'cfm.sof' (use both locals and output parameters).
Comment 8 Sergei Trofimovich (RETIRED) gentoo-dev 2021-02-20 22:25:22 UTC
Created attachment 687792 [details, diff]
0001-ia64-fix-ia64_syscall_get_set_arguments-for-break-ba.patch

I'll try 0001-ia64-fix-ia64_syscall_get_set_arguments-for-break-ba.patch on guppy against v5.10.
Comment 9 Sergei Trofimovich (RETIRED) gentoo-dev 2021-02-20 22:34:03 UTC
(In reply to Sergei Trofimovich from comment #8)
> Created attachment 687792 [details, diff] [details, diff]
> 0001-ia64-fix-ia64_syscall_get_set_arguments-for-break-ba.patch
> 
> I'll try 0001-ia64-fix-ia64_syscall_get_set_arguments-for-break-ba.patch on
> guppy against v5.10.

I think it got better but the test still fails:

guppy ~/strace/tests # ./ptrace_syscall_info.gen.test
ptrace_syscall_info.c:335: #2: exit stop mismatch
ptrace_syscall_info.gen.test: failed test: ../ptrace_syscall_info failed with code 1

But the fetched args look better (yes?):

guppy ~/strace # ./src/strace -qqq -eptrace -o'|grep PTRACE_SYSCALL_INFO_ENTRY' ./src/strace -d -enone / 2>/dev/null
ptrace(PTRACE_GET_SYSCALL_INFO, 2129, 88, {op=PTRACE_SYSCALL_INFO_ENTRY, arch=AUDIT_ARCH_IA64, instruction_pointer=0x200000000037a3f0, stack_pointer=0x600007ffffddc620, entry={nr=1034, args=[0x200000080017d7e0, 0xbad1fed1, 0xbad2fed2, 0xbad3fed3, 0xbad4fed4, 0xbad5fed5]}}) = 80
Comment 10 Dmitry V. Levin 2021-02-20 22:49:59 UTC
Yes, the entry data looks good, but there is a bug in the exit data.
Let's just compare with a mundane x86_64 output.

$ ./src/strace -qqq -eptrace -o'|grep PTRACE_SYSCALL_INFO_E' ./src/strace -d -enone / 2>/dev/null 
ptrace(PTRACE_GET_SYSCALL_INFO, 10290, 88, {op=PTRACE_SYSCALL_INFO_ENTRY, arch=AUDIT_ARCH_IA64, instruction_pointer=0x200000000037e3f0, stack_pointer=0x600007ffffcb8620, entry={nr=1034, args=[0x2000000800154068, 0xbad1fed1, 0xbad2fed2, 0xbad3fed3, 0xbad4fed4, 0xbad5fed5]}}) = 80
ptrace(PTRACE_GET_SYSCALL_INFO, 10290, 88, {op=PTRACE_SYSCALL_INFO_EXIT, arch=AUDIT_ARCH_IA64, instruction_pointer=0x200000000037e3f0, stack_pointer=0x600007ffffcb8620, exit={rval=2, is_error=1}}) = 33

$ ./src/strace -qqq -eptrace -o'|grep PTRACE_SYSCALL_INFO_E |head -2' ./src/strace -d -enone / 2>/dev/null
ptrace(PTRACE_GET_SYSCALL_INFO, 474857, 88, {op=PTRACE_SYSCALL_INFO_ENTRY, arch=AUDIT_ARCH_X86_64, instruction_pointer=0x7f1c562f8819, stack_pointer=0x7ffea5490398, entry={nr=80, args=[0x5600151f09b7, 0xbad1fed1, 0xbad2fed2, 0xbad3fed3, 0xbad4fed4, 0xbad5fed5]}}) = 80
ptrace(PTRACE_GET_SYSCALL_INFO, 474857, 88, {op=PTRACE_SYSCALL_INFO_EXIT, arch=AUDIT_ARCH_X86_64, instruction_pointer=0x7f1c562f8819, stack_pointer=0x7ffea5490398, exit={rval=-ENOENT, is_error=1}}) = 33

Note the exit.rval difference.
Apparently, syscall_get_return_value() doesn't match syscall_set_return_value().
Comment 11 Sergei Trofimovich (RETIRED) gentoo-dev 2021-02-20 23:23:00 UTC
Aha, nice. Looks like we have some forward progress.

Interestingly ENOENT is '2'.

Could it be that `strace` (or kernel's handling of PTRACE_SYSCALL_INFO_EXIT) misinterprets `r8` value as negative? I'll need to spelunk around a bit to understand how we override return values.

static inline long syscall_get_error(struct task_struct *task,
                                     struct pt_regs *regs)
{
        return regs->r10 == -1 ? regs->r8:0;
}

static inline long syscall_get_return_value(struct task_struct *task,
                                            struct pt_regs *regs)
{
        return regs->r8;
}

static inline void syscall_set_return_value(struct task_struct *task,
                                            struct pt_regs *regs,
                                            int error, long val)
{
        if (error) {
                /* error < 0, but ia64 uses > 0 return value */
                regs->r8 = -error;
                regs->r10 = -1;
        } else {
                regs->r8 = val;
                regs->r10 = 0;
        }
}
Comment 12 Sergei Trofimovich (RETIRED) gentoo-dev 2021-02-20 23:25:48 UTC
I'll try the following:

--- a/arch/ia64/include/asm/syscall.h
+++ b/arch/ia64/include/asm/syscall.h
@@ -32,7 +32,7 @@ static inline void syscall_rollback(struct task_struct *task,
 static inline long syscall_get_error(struct task_struct *task,
                                     struct pt_regs *regs)
 {
-       return regs->r10 == -1 ? regs->r8:0;
+       return regs->r10 == -1 ? -regs->r8:0;
 }

 static inline long syscall_get_return_value(struct task_struct *task,
Comment 13 Sergei Trofimovich (RETIRED) gentoo-dev 2021-02-20 23:32:24 UTC
(In reply to Sergei Trofimovich from comment #12)
> I'll try the following:
> 
> --- a/arch/ia64/include/asm/syscall.h
> +++ b/arch/ia64/include/asm/syscall.h
> @@ -32,7 +32,7 @@ static inline void syscall_rollback(struct task_struct
> *task,
>  static inline long syscall_get_error(struct task_struct *task,
>                                      struct pt_regs *regs)
>  {
> -       return regs->r10 == -1 ? regs->r8:0;
> +       return regs->r10 == -1 ? -regs->r8:0;
>  }
> 
>  static inline long syscall_get_return_value(struct task_struct *task,

I'll hold off it for a while check if full "syscall_get_return_value() / syscall_set_return_value()" fix is needed as you suggest.
Comment 14 Dmitry V. Levin 2021-02-20 23:39:04 UTC
(In reply to Sergei Trofimovich from comment #12)
> I'll try the following:
> 
> --- a/arch/ia64/include/asm/syscall.h
> +++ b/arch/ia64/include/asm/syscall.h
> @@ -32,7 +32,7 @@ static inline void syscall_rollback(struct task_struct
> *task,
>  static inline long syscall_get_error(struct task_struct *task,
>                                      struct pt_regs *regs)
>  {
> -       return regs->r10 == -1 ? regs->r8:0;
> +       return regs->r10 == -1 ? -regs->r8:0;
>  }
> 
>  static inline long syscall_get_return_value(struct task_struct *task,

Yes, this should fix exit.rval.
Comment 15 Sergei Trofimovich (RETIRED) gentoo-dev 2021-02-21 00:01:49 UTC
Created attachment 687798 [details, diff]
0001-ia64-fix-ptrace-PTRACE_SYSCALL_INFO_EXIT-sign.patch
Comment 16 Sergei Trofimovich (RETIRED) gentoo-dev 2021-02-21 00:08:38 UTC
(In reply to Sergei Trofimovich from comment #15)
> Created attachment 687798 [details, diff] [details, diff]
> 0001-ia64-fix-ptrace-PTRACE_SYSCALL_INFO_EXIT-sign.patch

Booted guppy with the patch. Now ptrace_syscall_info.gen.test passes \o/
Comment 17 Dmitry V. Levin 2021-02-21 00:27:26 UTC
Cool!

By the way, PTRACE_GET_SYSCALL_INFO is not the only existing consumer of functions defined in arch/ia64/include/asm/syscall.h, for example, syscall_get_arguments() is also used by seccomp, so there is a chance these patches also fix something else.
Comment 18 Sergei Trofimovich (RETIRED) gentoo-dev 2021-03-12 08:42:03 UTC
I'm trying to make new kernel a bit more stable, but not succeeding yet. Most strace tests pass. But one9?) of them manages to destroy kernel's view of address space.

It's perhaps something ia64-specific about trying to deliver a signal to userspace to almost dead tracer (or tracee?) and falls into fault handling recursion.

It's probably not the first OOps in line:

[269391.432140] Unable to handle kernel NULL pointer dereference (address 0000000000000338)
[269391.432140] sock_filter-v-X[6171]: Oops 11012296146944 [18]
[269391.432140] Modules linked in: usb_storage e1000 acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler rtc_efi
[269391.432140]
[269391.432140] CPU: 0 PID: 6171 Comm: sock_filter-v-X Tainted: G    B D W         5.12.0-rc2-00003-g97669c51470e-dirty #85
[269391.432140] Hardware name: hp server rx3600                   , BIOS 04.03                                                            04/08/2008
[269391.432140] psr : 0000121008026010 ifs : 800000000000040b ip  : [<a00000010008d1f1>]    Tainted: G    B D W         (5.12.0-rc2-00003-g97669c51470e-dirty)
[269391.432140] ip is at ptrace_stop+0x2b1/0x860
[269391.432140] unat: 0000000000000000 pfs : 000000000000040b rsc : 0000000000000003
[269391.432140] rnat: 0000000000000000 bsps: 0000000000000000 pr  : 000000255aa66a15
[269391.432140] ldrs: 0000000000000000 ccv : 00000000fffffa92 fpsr: 0009804c0270033f
[269391.432140] csd : 0000000000000000 ssd : 0000000000000000
[269391.432140] b0  : a00000010008d1b0 b6  : a0000001008b1b20 b7  : a00000010000d010
[269391.432140] f6  : 000000000000000000000 f7  : 1003e8208208208208209
[269391.432140] f8  : 1003effffffffffffffea f9  : 1003e0000000000000033
[269391.432140] f10 : 1003e8208208208208209 f11 : 1003effffffffffffffe6
[269391.432140] r1  : a000000101906440 r2  : 0000000000000010 r3  : 0000000000000000
[269391.432140] r8  : 00000000b3a0d9d1 r9  : 00000000000059d0 r10 : 00000000b3a08001
[269391.432140] r11 : 0000000000000001 r12 : e00000010f2d5880 r13 : e00000010f2d0000
[269391.432140] r14 : a0000001015c8304 r15 : 00000000deaf1eed r16 : e00000010f2d0000
[269391.432140] r17 : e00000010f2d100c r18 : a000000101706e70 r19 : e00000010f2d0018
[269391.432140] r20 : 0000000000010289 r21 : e00000010f2d0450 r22 : 0000000000000000
[269391.432140] r23 : 0000000000000338 r24 : 000000000000b3a2 r25 : 000000000000b3a2
[269391.432140] r26 : e00000010f2d048c r27 : 0000000000010013 r28 : fffffffffff7ffff
[269391.432140] r29 : 0000000000120000 r30 : 0000000000000000 r31 : e00000010f2d100c
[269391.432140]
[269391.432140] Call Trace:
[269391.432140]  [<a000000100014d10>] show_stack+0x90/0xc0
[269391.432140]                                 sp=e00000010f2d54b0 bsp=e00000010f2d3738
[269391.432140]  [<a000000100015410>] show_regs+0x6d0/0xa40
[269391.432140]                                 sp=e00000010f2d5680 bsp=e00000010f2d36c8
[269391.432140]  [<a0000001000285e0>] die+0x1e0/0x3c0
[269391.432140]                                 sp=e00000010f2d56a0 bsp=e00000010f2d3688
[269391.432140]  [<a00000010005b160>] ia64_do_page_fault+0x820/0xb80
[269391.432140]                                 sp=e00000010f2d56a0 bsp=e00000010f2d35e8
[269391.432140]  [<a00000010000ca00>] ia64_leave_kernel+0x0/0x270
[269391.432140]                                 sp=e00000010f2d56b0 bsp=e00000010f2d35e8
[269391.432140]  [<a00000010008d1f0>] ptrace_stop+0x2b0/0x860
[269391.432140]                                 sp=e00000010f2d5880 bsp=e00000010f2d3590
[269391.432140]  [<a00000010008d8a0>] ptrace_do_notify+0x100/0x120
[269391.432140]                                 sp=e00000010f2d5880 bsp=e00000010f2d3560
[269391.432140]  [<a00000010008d950>] ptrace_notify+0x90/0x1a0
[269391.432140]                                 sp=e00000010f2d58c0 bsp=e00000010f2d3540
[269391.432140]  [<a000000100073700>] do_exit+0x1540/0x1700
[269391.432140]                                 sp=e00000010f2d58c0 bsp=e00000010f2d34c8
[269391.432140]  [<a0000001000287b0>] die+0x3b0/0x3c0
[269391.432140]                                 sp=e00000010f2d58d0 bsp=e00000010f2d3488
...
Comment 19 Dmitry V. Levin 2021-03-18 00:04:20 UTC
(In reply to Sergei Trofimovich from comment #18)
> I'm trying to make new kernel a bit more stable, but not succeeding yet.
> Most strace tests pass. But one(?) of them manages to destroy kernel's view
> of address space.

FWiW, when the host is booted using a new kernel, rt_sigreturn.gen fails with the following diagnostics:

$ sed -n '/^+ diff/,/^+ fail/p' rt_sigreturn.gen.log
+ diff -u -- exp log
--- exp	2021-03-16 00:23:58.705520318 +0000
+++ log	2021-03-16 00:23:58.706520303 +0000
@@ -1,2 +1,2 @@
-rt_sigreturn({mask=[INT USR2 CHLD RT_3 RT_4 RT_5 RT_26 RT_27]}) = 0
+rt_sigreturn({mask=[]})                 = 0
 +++ exited with 0 +++
+ fail_ '../../src/strace -e trace=rt_sigreturn -esignal=!USR1 ../rt_sigreturn output mismatch'

I don't know whether it's related.
Comment 20 Sergei Trofimovich (RETIRED) gentoo-dev 2021-03-30 21:22:01 UTC
Box did not become much stabler, but at least we got more debugging enabled on it. I hope it's not too slow.

Meanwhile trying to dig into rt_sigreturn failure. Comparing x86 and amd64 they call initial sigaction slightly differently (no SA_RESTORER, but maybe it's an arch-specific way to clean signals up):

ia64:

rt_sigprocmask(SIG_SETMASK, [INT USR2 CHLD RT_3 RT_4 RT_5 RT_26 RT_27], NULL, 8) = 0
rt_sigaction(SIGUSR1, {sa_handler=0x2000000800011ec0, sa_mask=[], sa_flags=SA_SIGINFO}, NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [INT USR2 CHLD RT_3 RT_4 RT_5 RT_26 RT_27], 8) = 0
getpid()                                = 88260
gettid()                                = 88260
tgkill(88260, 88260, SIGUSR1)           = 0
rt_sigprocmask(SIG_SETMASK, [INT USR2 CHLD RT_3 RT_4 RT_5 RT_26 RT_27], NULL, 8) = 0
--- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_TKILL, si_pid=88260, si_uid=0} ---
rt_sigreturn({mask=[]})      

x86:

rt_sigprocmask(SIG_SETMASK, [INT USR2 CHLD RT_3 RT_4 RT_5 RT_26 RT_27], NULL, 8) = 0
rt_sigaction(SIGUSR1, {sa_handler=0x562eb487c340, sa_mask=[], sa_flags=SA_RESTORER|SA_SIGINFO, sa_restorer=0x7f595b115500}, NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [INT USR2 CHLD RT_3 RT_4 RT_5 RT_26 RT_27], 8) = 0
getpid()                                = 1492986
gettid()                                = 1492986
tgkill(1492986, 1492986, SIGUSR1)       = 0
rt_sigprocmask(SIG_SETMASK, [INT USR2 CHLD RT_3 RT_4 RT_5 RT_26 RT_27], NULL, 8) = 0
--- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_TKILL, si_pid=1492986, si_uid=1000} ---
rt_sigreturn({mask=[INT USR2 CHLD RT_3 RT_4 RT_5 RT_26 RT_27]}) = 0

At least signal delivery happens at the same time. Checking if we generate wrong frame at signal signal relivery or we handle return wrongly.
Comment 21 Sergei Trofimovich (RETIRED) gentoo-dev 2021-03-30 22:46:06 UTC
From kernel's standpoint ia64's signal frame sits on memory stack and is in form of:

- https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/ia64/kernel/sigframe.h#n8
- https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/ia64/include/uapi/asm/sigcontext.h#n35

struct sigframe {
	unsigned long arg0;		/* signum */
	unsigned long arg1;		/* siginfo pointer */
	unsigned long arg2;		/* sigcontext pointer */

	void __user *handler;		/* pointer to the plabel of the signal handler */
	struct siginfo info;
	struct sigcontext sc;
};
...
struct sigcontext {
...
	sigset_t		sc_mask;	/* signal mask to restore after handler returns */
};

Same in libc land: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/ia64/bits/sigcontext.h;h=7dca259c31f1afb99be0c620887a05d679f44153;hb=HEAD#l41

Looking at a value in action:

$ gdb --quiet ./tests/rt_sigreturn
Reading symbols from ./tests/rt_sigreturn...
(gdb) start
Temporary breakpoint 1 at 0xc40: file rt_sigreturn.c, line 27.
Starting program: /root/strace/tests/rt_sigreturn
Failed to read a valid object file image from memory.

Temporary breakpoint 1, main () at rt_sigreturn.c:27
27	{
(gdb) break handler
Breakpoint 2 at 0x2000000800001180: file rt_sigreturn.c, line 23.
(gdb) continue
Continuing.

Program received signal SIGUSR1, User defined signal 1.
0xa000000000040721 in ?? ()
(gdb) continue
Continuing.

Breakpoint 2, handler (no=10, si=0x60000fffffffa380, uc=0x60000fffffffa400) at rt_sigreturn.c:23
23	}
(gdb) printf "%#lx\n", ((struct sigcontext*)uc)->sc_mask
0x600001c00010802

These at least match SIGINT (2), SIGUSR2 (12), SIGCHILD(17):

$ printf "%x\n" $(( (1 << 2 | 1 << 12 | 1 << 17) >> 1 ))
10802

Assuming _RT signal numbers are also correct this means frame generation is not too broken.

(gdb) disassemble handler
Dump of assembler code for function handler:
=> 0x2000000800001180 <+0>:	[MIB]       nop.m 0x0
   0x2000000800001181 <+1>:	            nop.i 0x0
   0x2000000800001182 <+2>:	            br.ret.sptk.many b0;;
End of assembler dump.

(gdb) print (void*)$b0
$7 = (void *) 0xa000000000040800

(gdb) disassemble $b0, $b0+160
Dump of assembler code from 0xa000000000040800 to 0xa0000000000408a0:
   0xa000000000040800:	[MMI]       adds r2=248,r12;;
   0xa000000000040801:	            ld8 r15=[r2]
   0xa000000000040802:	            nop.i 0x0
   0xa000000000040810:	[MMI]       mov.m r14=ar.bsp;;
   0xa000000000040811:	            cmp.eq p0,p1=r14,r15
   0xa000000000040812:	            nop.i 0x0
   0xa000000000040820:	[MIB]       nop.m 0x0
   0xa000000000040821:	            nop.i 0x0
   0xa000000000040822:	      (p01) br.cond.spnt.few 0xa000000000040900;;
   0xa000000000040830:	[MII]       adds r2=736,r12
   0xa000000000040831:	            adds r3=752,r12;;
   0xa000000000040832:	            nop.i 0x0
   0xa000000000040840:	[MMI]       ldf.fill f6=[r2],32
   0xa000000000040841:	            ldf.fill f7=[r3],32
   0xa000000000040842:	            nop.i 0x0;;
   0xa000000000040850:	[MMI]       ldf.fill f8=[r2],32
   0xa000000000040851:	            ldf.fill f9=[r3],32
   0xa000000000040852:	            nop.i 0x0;;
   0xa000000000040860:	[MMI]       ldf.fill f10=[r2],32
   0xa000000000040861:	            ldf.fill f11=[r3],32
   0xa000000000040862:	            nop.i 0x0;;
   0xa000000000040870:	[MMI]       ldf.fill f12=[r2],32
   0xa000000000040871:	            ldf.fill f13=[r3],32
   0xa000000000040872:	            nop.i 0x0;;
   0xa000000000040880:	[MMI]       ldf.fill f14=[r2],32
   0xa000000000040881:	            ldf.fill f15=[r3],32
   0xa000000000040882:	            mov r15=1181
   0xa000000000040890:	[MII]       break.m 0x100000

This is a return directly back to kernel's gate (~vdso) address which calls rt_sigreturn (1181 syscall) after stack adjustment: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/ia64/kernel/gate.S#n165

sys_rt_sigreturn's ABI is very unusual for an ia64 syscall. It is based on userspase stack (as opposed to be passed in 8 input registers): it ignores (?) all it's 8 input registers and effectively ignores them. Uses memory stack instead:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/ia64/kernel/entry.S#n1252

ENTRY(sys_rt_sigreturn)
  alloc r2=ar.pfs,8,0,1,0
  ...
  adds out0=16,sp				// out0 = &sigscratch
  br.call.sptk.many rp=ia64_rt_sigreturn

Not sure if strace lacks bit of ia64-specific code and it happened to work by accident before (unlikely) or kernel miscalculates memory stack position.

How does strace extract sc_mask?
Comment 22 Dmitry V. Levin 2021-03-30 23:09:45 UTC
(In reply to Sergei Trofimovich from comment #21)
> Not sure if strace lacks bit of ia64-specific code and it happened to work
> by accident before (unlikely) or kernel miscalculates memory stack position.
> 
> How does strace extract sc_mask?

strace obtains the tracee stack pointer using get_stack_pointer() and uses it in src/linux/ia64/arch_rt_sigframe.c to calculate the address of sigframe, this address is used later in src/rt_sigreturn.c to print the signal mask.

Since get_stack_pointer() also uses the result of PTRACE_GET_SYSCALL_INFO when available, it could have been affected by the kernel update that enabled PTRACE_GET_SYSCALL_INFO.
Comment 23 Sergei Trofimovich (RETIRED) gentoo-dev 2021-03-31 07:03:42 UTC
Aha, that helps! Looks like both GET_SYSCALL_INFO and SP_REG both return the same stack-looking value, but it's bspstore value (not memory stack):

--- a/src/syscall.c
+++ b/src/syscall.c
@@ -1332,12 +1332,14 @@ get_stack_pointer(struct tcb *tcp, kernel_ulong_t *sp)
 		if (!strace_get_syscall_info(tcp))
 			return false;
 		*sp = (kernel_ulong_t) ptrace_sci.stack_pointer;
-		return true;
+		fprintf(stderr, "\nSP from syscall_info: %#lx\n", *sp);
+		//return true;
 	}

 #if defined ARCH_SP_REG
 	if (get_regs(tcp) < 0)
 		return false;
+	fprintf(stderr,           "SP from       SP_REG: %#lx\n", *sp);
 	*sp = (kernel_ulong_t) ARCH_SP_REG;
 	return true;
 #elif defined ARCH_SP_PEEK_ADDR

# setarch -R ./src/strace ./tests/rt_sigreturn
rt_sigreturn(
SP from syscall_info: 0x6000080000000088
SP from       SP_REG: 0x6000080000000088
{mask=[INT USR2 CHLD RT_3 RT_4 RT_5 RT_26 RT_27]}) = 0

If I don't uncomment '//return true;' i get the same address but with decoding failure:

# setarch -R ./src/strace ./tests/rt_sigreturn
rt_sigreturn(
SP from syscall_info: 0x6000080000000088
{mask=[]})                 = 0

# setarch -R cat /proc/self/maps
6000080000000000-6000080000004000 rw-p 00000000 00:00 0
60000ffffffd8000-60000fffffffc000 rw-p 00000000 00:00 0                  [stack]

# setarch -R gdb --quiet -ex 'break handler' -ex 'run' -ex 'continue' ./tests/rt_sigreturn

(gdb) print (void*)$sp
$1 = (void *) 0x60000fffffffa350
(gdb) print (void*)$r12
$2 = (void *) 0x60000fffffffa350
(gdb) print (void*)$bsp
$3 = (void *) 0x6000080000000110
(gdb) print (void*)$bspstore
$4 = (void *) 0x6000080000000088

arch/ia64/include/asm/ptrace.h defines both helpers:

  static inline unsigned long user_stack_pointer(struct pt_regs *regs)
  {
        /* FIXME: should this be bspstore + nr_dirty regs? */
        return regs->ar_bspstore;
  }

  #define current_user_stack_pointer() (current_pt_regs()->r12)

I'll boot to 4.19 slightly later to check which of two we used to return before.
Comment 24 Sergei Trofimovich (RETIRED) gentoo-dev 2021-03-31 08:34:00 UTC
> I'll boot to 4.19 slightly later to check which of two we used to return
> before.

4.19.86-gentoo does something even more unexpected:

rt_sigreturn(
SP from       SP_REG: 0xc00000000000058f
{mask=[INT USR2 CHLD RT_3 RT_4 RT_5 RT_26 RT_27]}) = 0

On ia64 0xc... (region=6) is an uncached kernel MMIO region. It's also an unaligned address, etc. Don't know what it means for userspace (should be just unavailable address, unless kernel fails to detect it and allows mmio read from 0x58f by accident).

Booted guppy with:

--- a/arch/ia64/include/asm/ptrace.h
+++ b/arch/ia64/include/asm/ptrace.h
@@ -54,8 +54,7 @@

 static inline unsigned long user_stack_pointer(struct pt_regs *regs)
 {
-	/* FIXME: should this be bspstore + nr_dirty regs? */
-	return regs->ar_bspstore;
+	return regs->r12;
 }

 static inline int is_syscall_success(struct pt_regs *regs)
@@ -79,11 +78,6 @@ static inline long regs_return_value(struct pt_regs *regs)
 	unsigned long __ip = instruction_pointer(regs);			\
 	(__ip & ~3UL) + ((__ip & 3UL) << 2);				\
 })
-/*
- * Why not default?  Because user_stack_pointer() on ia64 gives register
- * stack backing store instead...
- */
-#define current_user_stack_pointer() (current_pt_regs()->r12)

   /* given a pointer to a task_struct, return the user's pt_regs */
 # define task_pt_regs(t)		(((struct pt_regs *) ((char *) (t) + IA64_STK_OFFSET)) - 1)

Now it looks more like memory stack:

rt_sigreturn(
SP from syscall_info: 0x60000fffffffa390
SP from       SP_REG: 0x60000fffffffa390
{mask=[INT USR2 CHLD RT_3 RT_4 RT_5 RT_26 RT_27]}) = 0
Comment 25 Sergei Trofimovich (RETIRED) gentoo-dev 2021-04-07 17:51:33 UTC
After many backs and forths of trying to blame ptrace() for killing fresh kernels I ended up reducing strace's test suite down to the following program that is able to crash guppy:

#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <linux/filter.h>

int
main(void)
{
	struct sock_filter bpf_filter[] = {
		BPF_STMT(BPF_RET|BPF_K, 0)
	};
	struct sock_fprog prog = {
		.len = 1,
		.filter = bpf_filter,
	};

	int fd = socket(AF_INET, SOCK_DGRAM, 0);
	setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER, &prog, sizeof(prog));

	return 0;
}

This crashes guppy in a second:

$ gcc bug.c -o bug; while ./bug; do echo again; done


[35609.524427] Unable to handle kernel NULL pointer dereference (address 0000000000000088)
[35609.524427] bash[9024]: Oops on page_fault, isr: 11012296146944 [1]
[35609.524427] Modules linked in: acpi_ipmi e1000 usb_storage ipmi_si ipmi_devintf ipmi_msghandler rtc_efi
[35609.524427]
[35609.524427] CPU: 0 PID: 9024 Comm: bash Not tainted 5.12.0-rc5-00115-ga85f4ff49085-dirty #256
[35609.524427] current:
[35609.524427] state: 0
[35609.524427] stack: 0xe000000127e41080
[35609.524427] usage (refs): 1
[35609.524427] flags: 0x404040
[35609.524427] ptrace: 0x0
[35609.524427] exit_st: 0
[35609.524427] exit_c: 0
[35609.524427] exit_sig: 17
[35609.524427] pdeath_sig: 0
[35609.524427] aflags: 0
[35609.524427] Hardware name: hp server rx3600                   , BIOS 04.03                                                            04/08/2008
[35609.524427] psr : 0000121008026030 ifs : 8000000000000288 ip  : [<a00000010023b4e1>]    Not tainted (5.12.0-rc5-00115-ga85f4ff49085-dirty)
[35609.524427] ip is at bpf_prog_free+0x21/0xe0
[35609.524427] unat: 0000000000000000 pfs : 0000000000000307 rsc : 0000000000000003
[35609.524427] rnat: 0000000000000000 bsps: 0000000000000000 pr  : 696a6a56956a6955
[35609.524427] ldrs: 0000000000000000 ccv : 0000000000000200 fpsr: 0009804c0270033f
[35609.524427] csd : 0000000000000000 ssd : 0000000000000000
[35609.524427] b0  : a000000100e97780 b6  : a000000100e976e0 b7  : a00000010000cf70
[35609.524427] f6  : 1003e0000000000000001 f7  : 1003e0000000002e6cf0f
[35609.524427] f8  : 1003e0000000000000001 f9  : 1003e0000000000000098
[35609.524427] f10 : 1003e0000000000000098 f11 : 1003efffffffffffffff0
[35609.524427] r1  : a00000010194ebc0 r2  : a000000211048004 r3  : 0000000000000000
[35609.524427] r8  : 0000000000000008 r9  : e0000001274f0a80 r10 : a0000001017586c4
[35609.524427] r11 : a000000101492bf8 r12 : e000000127e4fb60 r13 : e000000127e40000
[35609.524427] r14 : 0000000000000088 r15 : a000000211048040 r16 : a000000100e976e0
[35609.524427] r17 : fffffffffffcce50 r18 : 0000001008022030 r19 : 0000000000000000
[35609.524427] r20 : e000000127e4fb50 r21 : 0000000000004000 r22 : e000000127e410a0
[35609.524427] r23 : 0000000000000101 r24 : 0000000000000100 r25 : 9e1f8354bb6453a3
[35609.524427] r26 : 9e1f8354bb6453a3 r27 : 100b4f27010000e0 r28 : 8e14cc73ba645343
[35609.524427] r29 : 0000000000000000 r30 : e0000001274f0a80 r31 : e0000001274f0a80
[35609.524427] b0: sk_filter_release_rcu+0xa0/0x120
[35609.524427] b6: sk_filter_release_rcu+0x0/0x120
[35609.524427] b7: unw_init_running+0x30/0xa0
[35609.524427]
[35609.524427] Call Trace:
[35609.524427]  [<a000000100015110>] show_stack+0x90/0xc0
[35609.524427]                                 sp=e000000127e4f790 bsp=e000000127e41b80
[35609.524427]  [<a000000100015850>] show_regs+0x710/0xa80
[35609.524427]                                 sp=e000000127e4f960 bsp=e000000127e41b10
[35609.524427]  [<a000000100028940>] die+0x1e0/0x3c0
[35609.524427]                                 sp=e000000127e4f980 bsp=e000000127e41ad0
[35609.524427]  [<a00000010005d800>] ia64_do_page_fault+0x820/0xb80
[35609.524427]                                 sp=e000000127e4f980 bsp=e000000127e41a38
[35609.524427]  [<a00000010000c960>] ia64_leave_kernel+0x0/0x270
[35609.524427]                                 sp=e000000127e4f990 bsp=e000000127e41a38
[35609.524427]  [<a00000010023b4e0>] bpf_prog_free+0x20/0xe0
[35609.524427]                                 sp=e000000127e4fb60 bsp=e000000127e419f0
[35609.524427]  [<a000000100e97780>] sk_filter_release_rcu+0xa0/0x120
[35609.524427]                                 sp=e000000127e4fb60 bsp=e000000127e419c0
[35609.524427]  [<a00000010018f480>] rcu_core+0x8c0/0x1440
[35609.524427]                                 sp=e000000127e4fb60 bsp=e000000127e41938
[35609.524427]  [<a000000100190020>] rcu_core_si+0x20/0x40
[35609.524427]                                 sp=e000000127e4fb80 bsp=e000000127e41920
[35609.524427]  [<a00000010118c3d0>] __do_softirq+0x230/0x650
[35609.524427]                                 sp=e000000127e4fb80 bsp=e000000127e41838
[35609.524427]  [<a000000100079670>] irq_exit+0x170/0x200
[35609.524427]                                 sp=e000000127e4fb80 bsp=e000000127e41820
[35609.524427]  [<a000000100013970>] ia64_handle_irq+0x1b0/0x360
[35609.524427]                                 sp=e000000127e4fb80 bsp=e000000127e41798
[35609.524427]  [<a00000010000c960>] ia64_leave_kernel+0x0/0x270
[35609.524427]                                 sp=e000000127e4fb90 bsp=e000000127e41798
[35609.524427]  [<a0000001008e2320>] memset+0x160/0x420
[35609.524427]                                 sp=e000000127e4fd60 bsp=e000000127e41780
[35609.524427]  [<a0000001003f1970>] __kernel_poison_pages+0xd0/0x140
[35609.524427]                                 sp=e000000127e4fd60 bsp=e000000127e41740
[35609.524427]  [<a0000001003baff0>] free_pcp_prepare+0x530/0x600
[35609.524427]                                 sp=e000000127e4fd60 bsp=e000000127e41710
[35609.524427]  [<a0000001003c0770>] free_unref_page_list+0x90/0x480
[35609.524427]                                 sp=e000000127e4fd60 bsp=e000000127e416a8
[35609.524427]  [<a0000001003097a0>] release_pages+0xd40/0x13e0
[35609.524427]                                 sp=e000000127e4fd60 bsp=e000000127e415e8
[35609.524427]  [<a0000001003d5190>] free_pages_and_swap_cache+0x110/0x2c0
[35609.524427]                                 sp=e000000127e4fd80 bsp=e000000127e41590
[35609.524427]  [<a000000100398660>] tlb_finish_mmu+0x100/0x2e0
[35609.524427]                                 sp=e000000127e4fd80 bsp=e000000127e41558
[35609.524427]  [<a000000100392460>] exit_mmap+0x180/0x320
[35609.524427]                                 sp=e000000127e4fd80 bsp=e000000127e41520
[35609.524427]  [<a000000100061f60>] mmput+0xc0/0x240
[35609.524427]                                 sp=e000000127e4fe00 bsp=e000000127e414f8
[35609.524427]  [<a000000100463340>] begin_new_exec+0xfa0/0x1d20
[35609.524427]                                 sp=e000000127e4fe00 bsp=e000000127e413c8
[35609.524427]  [<a000000100584120>] load_elf_binary+0x400/0x2a40
[35609.524427]                                 sp=e000000127e4fe00 bsp=e000000127e412a8
[35609.524427]  [<a000000100461240>] bprm_execve+0x560/0xde0
[35609.524427]                                 sp=e000000127e4fe20 bsp=e000000127e411e0
[35609.524427]  [<a000000100461d70>] do_execveat_common+0x2b0/0x320
[35609.524427]                                 sp=e000000127e4fe30 bsp=e000000127e41178
[35609.524427]  [<a000000100464100>] sys_execve+0x40/0x60
[35609.524427]                                 sp=e000000127e4fe30 bsp=e000000127e41110
[35609.524427]  [<a00000010000bee0>] ia64_execve+0x20/0x120
[35609.524427]                                 sp=e000000127e4fe30 bsp=e000000127e410c0
[35609.524427]  [<a00000010000c7e0>] ia64_ret_from_syscall+0x0/0x20
[35609.524427]                                 sp=e000000127e4fe30 bsp=e000000127e410c0
[35609.524427]  [<a000000000040720>] ia64_ivt+0xffffffff00040720/0x400
[35609.524427]                                 sp=e000000127e50000 bsp=e000000127e410c0
[35609.524427] Disabling lock debugging due to kernel taint
[35609.652269] Kernel panic - not syncing: Aiee, killing interrupt handler!
[35609.653270] ---[ end Kernel panic - not syncing: Aiee, killing interrupt handler! ]---

This is a lot simpler to work with. My next hypotheses are:
1. vmalloc() is broken on ia64 (VMA tracking is unusual where due to KERNEL/GATE regions)
2. lack of ordering when freeing bpf_prog (bpf_prog and sk_filter are freed ~asynchronously via RCU).

I'll flip vmalloc() to kmalloc() for bpf_prog refute [1.]. I hope 16K pages would be good enough for most filter allocations.
Comment 26 Sergei Trofimovich (RETIRED) gentoo-dev 2021-04-07 18:07:05 UTC
switch to kmalloc helped at least for a minimal test:

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 75244ecb2389..5921c28b2ce0 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -82,18 +82,21 @@ struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flag
 	struct bpf_prog *fp;

 	size = round_up(size, PAGE_SIZE);
-	fp = __vmalloc(size, gfp_flags);
+	//fp = __vmalloc(size, gfp_flags);
+	fp = kmalloc(size, gfp_flags);
 	if (fp == NULL)
 		return NULL;

 	aux = kzalloc(sizeof(*aux), GFP_KERNEL_ACCOUNT | gfp_extra_flags);
 	if (aux == NULL) {
-		vfree(fp);
+		//vfree(fp);
+		kfree(fp);
 		return NULL;
 	}
 	fp->active = alloc_percpu_gfp(int, GFP_KERNEL_ACCOUNT | gfp_extra_flags);
 	if (!fp->active) {
-		vfree(fp);
+		//vfree(fp);
+		kfree(fp);
 		kfree(aux);
 		return NULL;
 	}
@@ -124,7 +127,8 @@ struct bpf_prog *bpf_prog_alloc(unsigned int size, gfp_t gfp_extra_flags)
 	if (!prog->stats) {
 		free_percpu(prog->active);
 		kfree(prog->aux);
-		vfree(prog);
+		//vfree(prog);
+		kfree(prog);
 		return NULL;
 	}

@@ -220,7 +224,8 @@ void bpf_prog_fill_jited_linfo(struct bpf_prog *prog,
 void bpf_prog_free_linfo(struct bpf_prog *prog)
 {
 	bpf_prog_free_jited_linfo(prog);
-	kvfree(prog->aux->linfo);
+	//kvfree(prog->aux->linfo);
+	kfree(prog->aux->linfo);
 }

 struct bpf_prog *bpf_prog_realloc(struct bpf_prog *fp_old, unsigned int size,
@@ -235,7 +240,8 @@ struct bpf_prog *bpf_prog_realloc(struct bpf_prog *fp_old, unsigned int size,
 	if (pages <= fp_old->pages)
 		return fp_old;

-	fp = __vmalloc(size, gfp_flags);
+	//fp = __vmalloc(size, gfp_flags);
+	fp = kmalloc(size, gfp_flags);
 	if (fp) {
 		memcpy(fp, fp_old, fp_old->pages * PAGE_SIZE);
 		fp->pages = pages;
@@ -263,7 +269,8 @@ void __bpf_prog_free(struct bpf_prog *fp)
 	}
 	free_percpu(fp->stats);
 	free_percpu(fp->active);
-	vfree(fp);
+	//vfree(fp);
+	kfree(fp);
 }

 int bpf_prog_calc_tag(struct bpf_prog *fp)
@@ -279,7 +286,8 @@ int bpf_prog_calc_tag(struct bpf_prog *fp)
 	__be32 *result;
 	__be64 *bits;

-	raw = vmalloc(raw_size);
+	//raw = vmalloc(raw_size);
+	raw = kmalloc(raw_size, GFP_KERNEL);
 	if (!raw)
 		return -ENOMEM;

@@ -335,7 +343,8 @@ int bpf_prog_calc_tag(struct bpf_prog *fp)
 		result[i] = cpu_to_be32(digest[i]);
 	memcpy(fp->tag, result, sizeof(fp->tag));

-	vfree(raw);
+	//vfree(raw);
+	kfree(raw);
 	return 0;
 }

@@ -1096,7 +1105,8 @@ static struct bpf_prog *bpf_prog_clone_create(struct bpf_prog *fp_other,
 	gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | gfp_extra_flags;
 	struct bpf_prog *fp;

-	fp = __vmalloc(fp_other->pages * PAGE_SIZE, gfp_flags);
+	//fp = __vmalloc(fp_other->pages * PAGE_SIZE, gfp_flags);
+	fp = kmalloc(fp_other->pages * PAGE_SIZE, gfp_flags);
 	if (fp != NULL) {
 		/* aux->prog still points to the fp_other one, so
 		 * when promoting the clone to the real program,
Comment 27 Sergei Trofimovich (RETIRED) gentoo-dev 2021-04-09 20:55:36 UTC
Another (perhaps unsurprising) workaround that works is switch from RCU (~async) to non-RCU (~sync) free:

--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1195,7 +1195,8 @@ static void sk_filter_release_rcu(struct rcu_head *rcu)
 static void sk_filter_release(struct sk_filter *fp)
 {
 	if (refcount_dec_and_test(&fp->refcnt))
-		call_rcu(&fp->rcu, sk_filter_release_rcu);
+		//call_rcu(&fp->rcu, sk_filter_release_rcu);
+		__sk_filter_release(fp);
 }

 void sk_filter_uncharge(struct sock *sk, struct sk_filter *fp)
Comment 28 Sergei Trofimovich (RETIRED) gentoo-dev 2021-04-18 09:47:18 UTC
The crashes happen consistently even on maxcpus=1. That leans me towards correctness problem rather some complicated cross-CPU race.

Next theory: double vfree() of the same address. With a bit of printk i see double-free before corruption detection on a000000200008000 address:

[   13.398936] __vmalloc_node OUT: r=a000000200008000 caller=blk_add_partitions+0x140/0xbe0
[   13.414206] __vfree OUT: p=a000000200008000 c=blk_add_partitions+0x690/0xbe0
[   17.878155] __vmalloc_node OUT: r=a000000200008000 caller=kernel_read_file+0x4e0/0x580
[   17.882210] __vfree OUT: p=a000000200008000 c=load_module+0x6130/0x6e60
[   18.148456] __vfree OUT: p=a000000200008000 c=module_memfree+0xa0/0xc0

Checking now how load_module() / module_memfree() can do double-free of the same data.
Comment 29 Sergei Trofimovich (RETIRED) gentoo-dev 2021-04-24 23:31:16 UTC
I think I got guppy to state where it survives strace test suite \o/

So far the workaround is to avoid eager TLB flushes caused by VM_FLUSH_RESET_PERMS VMAs. I still don't understand why it helps but I suspect we somehow miss a flush for these VMAs and read from wrong pages a s result.

Current hack:

--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2214,6 +2214,9 @@ static void vm_remove_mappings(struct vm_struct *area, int deallocate_pages)

    remove_vm_area(area->addr);

+   /* workaround mysterious double-free on vmalloc() for bpf. */
+   return;
+
    /* If this is not VM_FLUSH_RESET_PERMS memory, no need for the below. */
    if (!flush_reset)
        return;
Comment 30 Sergei Trofimovich (RETIRED) gentoo-dev 2021-05-14 08:00:39 UTC
Let's declare it done. I'll sort out vmalloc() corruption separately. It looks very complicated and unrelated to ptrace().