Bug 923822 - sys-kernel/gentoo-sources-6.7.3 OCFS2
Summary: sys-kernel/gentoo-sources-6.7.3 OCFS2
Status: RESOLVED UPSTREAM
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: Current packages
Hardware: All Linux
Importance: Normal blocker
Assignee: Gentoo Kernel Bug Wranglers and Kernel Maintainers
URL: https://bugzilla.kernel.org/show_bug....
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-02-05 11:21 UTC by Urban Oettli
Modified: 2024-03-01 12:42 UTC
CC List: 2 users

See Also:
Package list:
Runtime testing required: ---


Description Urban Oettli 2024-02-05 11:21:50 UTC
After upgrading from sys-kernel/gentoo-sources-6.7.0 to sys-kernel/gentoo-sources-6.7.3, the machine freezes when mounting an OCFS2 drive.

I ran a git bisect, but the result looks suspicious to me.
Because I didn't trust it, I repeated the entire bisect,
and it produced the same result again.

These are the results:

dmesg from the machine that is trying to mount the OCFS2 drive:

[Feb 4 14:36] ocfs2: Registered cluster interface o2cb
[Feb 4 14:37] rcu: INFO: rcu_preempt self-detected stall on CPU
[  +0.000006] rcu:      13-....: (12 GPs behind) idle=d54c/1/0x4000000000000000 softirq=725/725 fqs=7490
[  +0.000008] rcu:      (t=15000 jiffies g=3701 q=1515 ncpus=24)
[  +0.000004] CPU: 13 PID: 2501 Comm: kworker/u48:2 Not tainted 6.7.1+ #1
[  +0.000003] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[  +0.000002] Workqueue: o2net o2net_accept_many [ocfs2_nodemanager]
[  +0.000021] RIP: 0010:queued_spin_lock_slowpath+0x12/0x192
[  +0.000007] Code: 02 00 00 e9 da 31 66 ff 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 48 89 fa 66 90 b9 01 00 00 00 8b 02 85 c0 74 04 f3 90 <eb> f6 f0 0f b1 0a 85 c0 75 ee c3 cc cc cc cc 81 fe 00 01 00 00 75
[  +0.000003] RSP: 0018:ffffc90000d33d68 EFLAGS: 00000202
[  +0.000002] RAX: 0000000000000001 RBX: ffff888102d79180 RCX: 0000000000000001
[  +0.000002] RDX: ffff888102d79218 RSI: 0000000000000001 RDI: ffff888102d79218
[  +0.000001] RBP: ffff888102d79218 R08: 0000000000000000 R09: 0000000000000000
[  +0.000001] R10: 0000000000000000 R11: 0000000000000000 R12: ffff888102d79180
[  +0.000002] R13: ffffc90000d33dec R14: ffffffffa0585f08 R15: ffff8881033d5900

dmesg from a machine that has already mounted the OCFS2 drive:

[Feb 4 14:36] o2net: Connection to node Buildhost (num 1) at 192.168.1.72:7777 shutdown, state 7
[  +3.071847] o2net: Connection to node Buildhost (num 1) at 192.168.1.72:7777 shutdown, state 7
[  +3.071904] o2net: Connection to node Buildhost (num 1) at 192.168.1.72:7777 shutdown, state 7
[  +3.071983] o2net: Connection to node Buildhost (num 1) at 192.168.1.72:7777 shutdown, state 7
[  +3.071801] o2net: Connection to node Buildhost (num 1) at 192.168.1.72:7777 shutdown, state 7
[  +3.071954] o2net: Connection to node Buildhost (num 1) at 192.168.1.72:7777 shutdown, state 7
[  +0.256005] o2net: No connection established with node 1 after 30.0 seconds, check network and cluster configuration.
[  +2.816065] o2net: Connection to node Buildhost (num 1) at 192.168.1.72:7777 shutdown, state 7
[  +3.071867] o2net: Connection to node Buildhost (num 1) at 192.168.1.72:7777 shutdown, state 7
[  +3.071909] o2net: Connection to node Buildhost (num 1) at 192.168.1.72:7777 shutdown, state 7
[  +3.071839] o2net: Connection to node Buildhost (num 1) at 192.168.1.72:7777 shutdown, state 7

Bisect result:

status: waiting for both good and bad commits
status: waiting for bad commit, 1 good commit known
Bisecting: 508 revisions left to test after this (roughly 9 steps)
[3437e35fe8a6a937b96da90716c0c538e15be0f7] serial: sc16is7xx: set safe default SPI clock frequency
Bisecting: 253 revisions left to test after this (roughly 8 steps)
[3e2680bd68fab7ba145393e4eb069d6c4d8a30fa] Bluetooth: btmtkuart: fix recv_buf() return value
Bisecting: 126 revisions left to test after this (roughly 7 steps)
[06a7919489f2e310f6ba1ccd8c62e5b19189b6e0] bpf, lpm: Fix check prefixlen before walking trie
Bisecting: 63 revisions left to test after this (roughly 6 steps)
[4d08e6c01cf2d853971d68700785646045a5afd6] ACPI: LPSS: Fix the fractional clock divider flags
Bisecting: 31 revisions left to test after this (roughly 5 steps)
[7fea58e8e647ffcc70dfa1ea53c325d7e52c893b] crypto: hisilicon/sec2 - save capability registers in probe process
Bisecting: 15 revisions left to test after this (roughly 4 steps)
[0055ff3ee060002c97cbbaf77ceead8c11d0b984] crypto: ccp - fix memleak in ccp_init_dm_workarea
Bisecting: 7 revisions left to test after this (roughly 3 steps)
[7f25cd51aa71711438a6c11fa638ba5aba61ed95] selinux: Fix error priority for bind with AF_UNSPEC on PF_INET6 socket
Bisecting: 3 revisions left to test after this (roughly 2 steps)
[568a410e89f0ae73f2663bd1a48f981d3377aac9] kunit: debugfs: Handle errors from alloc_string_stream()
Bisecting: 1 revision left to test after this (roughly 1 step)
[02871710b93058eb1249d5847c0b2d1c2c3c98ae] thermal: core: Fix NULL pointer dereference in zone registration error path
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[f3bd89340eab3ff2740a499384977f6562b9a53f] ACPI: extlog: Clear Extended Error Log status when RAS_CEC handled the error
02871710b93058eb1249d5847c0b2d1c2c3c98ae is the first bad commit
commit 02871710b93058eb1249d5847c0b2d1c2c3c98ae
Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Date:   Thu Dec 14 11:52:25 2023 +0100

    thermal: core: Fix NULL pointer dereference in zone registration error path
    
    [ Upstream commit 04e6ccfc93c5a1aa1d75a537cf27e418895e20ea ]
    
    If device_register() in thermal_zone_device_register_with_trips()
    returns an error, the tz variable is set to NULL and subsequently
    dereferenced in kfree(tz->tzp).
    
    Commit adc8749b150c ("thermal/drivers/core: Use put_device() if
    device_register() fails") added the tz = NULL assignment in question to
    avoid a possible double-free after dropping the reference to the zone
    device.  However, after commit 4649620d9404 ("thermal: core: Make
    thermal_zone_device_unregister() return after freeing the zone"), that
    assignment has become redundant, because dropping the reference to the
    zone device does not cause the zone object to be freed any more.
    
    Drop it to address the NULL pointer dereference.
    
    Fixes: 3d439b1a2ad3 ("thermal/core: Alloc-copy-free the thermal zone parameters structure")
    Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

 drivers/thermal/thermal_core.c | 1 -
 1 file changed, 1 deletion(-)

git bisect log

git bisect start
# status: waiting for both good and bad commits
# good: [0dd3ee31125508cd67f7e7172247f05b7fd1753a] Linux 6.7
git bisect good 0dd3ee31125508cd67f7e7172247f05b7fd1753a
# status: waiting for bad commit, 1 good commit known
# bad: [01e08e5d7656e660c8a4852191e1e133cbdb0a66] Linux 6.7.3
git bisect bad 01e08e5d7656e660c8a4852191e1e133cbdb0a66
# bad: [3437e35fe8a6a937b96da90716c0c538e15be0f7] serial: sc16is7xx: set safe default SPI clock frequency
git bisect bad 3437e35fe8a6a937b96da90716c0c538e15be0f7
# bad: [3e2680bd68fab7ba145393e4eb069d6c4d8a30fa] Bluetooth: btmtkuart: fix recv_buf() return value
git bisect bad 3e2680bd68fab7ba145393e4eb069d6c4d8a30fa
# bad: [06a7919489f2e310f6ba1ccd8c62e5b19189b6e0] bpf, lpm: Fix check prefixlen before walking trie
git bisect bad 06a7919489f2e310f6ba1ccd8c62e5b19189b6e0
# good: [4d08e6c01cf2d853971d68700785646045a5afd6] ACPI: LPSS: Fix the fractional clock divider flags
git bisect good 4d08e6c01cf2d853971d68700785646045a5afd6
# bad: [7fea58e8e647ffcc70dfa1ea53c325d7e52c893b] crypto: hisilicon/sec2 - save capability registers in probe process
git bisect bad 7fea58e8e647ffcc70dfa1ea53c325d7e52c893b
# bad: [0055ff3ee060002c97cbbaf77ceead8c11d0b984] crypto: ccp - fix memleak in ccp_init_dm_workarea
git bisect bad 0055ff3ee060002c97cbbaf77ceead8c11d0b984
# bad: [7f25cd51aa71711438a6c11fa638ba5aba61ed95] selinux: Fix error priority for bind with AF_UNSPEC on PF_INET6 socket
git bisect bad 7f25cd51aa71711438a6c11fa638ba5aba61ed95
# bad: [568a410e89f0ae73f2663bd1a48f981d3377aac9] kunit: debugfs: Handle errors from alloc_string_stream()
git bisect bad 568a410e89f0ae73f2663bd1a48f981d3377aac9
# bad: [02871710b93058eb1249d5847c0b2d1c2c3c98ae] thermal: core: Fix NULL pointer dereference in zone registration error path
git bisect bad 02871710b93058eb1249d5847c0b2d1c2c3c98ae
# good: [f3bd89340eab3ff2740a499384977f6562b9a53f] ACPI: extlog: Clear Extended Error Log status when RAS_CEC handled the error
git bisect good f3bd89340eab3ff2740a499384977f6562b9a53f
# first bad commit: [02871710b93058eb1249d5847c0b2d1c2c3c98ae] thermal: core: Fix NULL pointer dereference in zone registration error path
Comment 1 Urban Oettli 2024-02-29 10:08:19 UTC
Tested again with sys-kernel/gentoo-sources-6.7.6.

Mounting an OCFS2 file system still causes the entire machine to crash.

dmesg from the machine that is trying to mount the OCFS2 drive:

[  +0.046710] OCFS2 User DLM kernel interface loaded
[  +0.001107] FAT-fs (vda1): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.
[  +0.400618] elogind-daemon[1557]: New seat seat0.
[  +0.000473] elogind-daemon[1557]: Watching system buttons on /dev/input/event3 (Power Button)
[  +0.000062] elogind-daemon[1557]: Watching system buttons on /dev/input/event0 (AT Translated Set 2 keyboard)
[Feb29 16:48] rcu: INFO: rcu_preempt self-detected stall on CPU
[  +0.000004] rcu:      3-....: (6315 ticks this GP) idle=3c24/1/0x4000000000000000 softirq=2798/2799 fqs=2982
[  +0.000004] rcu:      (t=6000 jiffies g=309 q=57 ncpus=4)
[  +0.000002] Sending NMI from CPU 3 to CPUs 1:
[  +0.000009] NMI backtrace for cpu 1
[  +0.000002] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.7.6-gentoo #3
[  +0.000002] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 4.2023.08-4 02/15/2024
[  +0.000001] RIP: 0010:__local_bh_enable_ip+0x1b/0x66
[  +0.000013] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 65 8b 05 a0 ed f9 7e a9 00 00 0f 00 74 02 0f 0b ff ce f7 de 65 01 35 8c ed f9 7e <65> 8b 05 85 ed f9 7e a9 00 ff ff 00 75 1e 48 c7 c7 99 5d f9 81 e8
[  +0.000001] RSP: 0018:ffffc900000e8aa0 EFLAGS: 00000207
[  +0.000002] RAX: 0000000000000303 RBX: ffff888008ee1200 RCX: 000000004d5e522e
[  +0.000001] RDX: 0000000000000103 RSI: 00000000fffffe00 RDI: ffffffffa05000eb
[  +0.000001] RBP: ffffffffa0500090 R08: 0000000000000430 R09: ffff8880080c1c4e
[  +0.000000] R10: 0000000041042000 R11: 00000000000001d9 R12: ffff888008ee1418
[  +0.000001] R13: 000000004d5e522e R14: 0000000000000020 R15: 0000000000000020
[  +0.000001] FS:  0000000000000000(0000) GS:ffff88807de80000(0000) knlGS:0000000000000000
[  +0.000001] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000001] CR2: 0000557c93ce65f0 CR3: 000000000d684000 CR4: 00000000000006f0
[  +0.000003] Call Trace:
[  +0.000004]  <NMI>
[  +0.000001]  ? nmi_cpu_backtrace+0xa5/0xd7
[  +0.000004]  ? __local_bh_enable_ip+0x1b/0x66
[  +0.000002]  ? nmi_cpu_backtrace_handler+0x8/0x10
[  +0.000004]  ? nmi_handle+0x52/0x132
[  +0.000001]  ? __local_bh_enable_ip+0x1b/0x66
[  +0.000002]  ? default_do_nmi+0x66/0x279
[  +0.000003]  ? exc_nmi+0xc4/0x13d
[  +0.000001]  ? end_repeat_nmi+0xf/0x60
[  +0.000005]  ? __pfx_o2net_listen_data_ready+0x10/0x10 [ocfs2_nodemanager]
[  +0.000007]  ? o2net_listen_data_ready+0x5b/0x78 [ocfs2_nodemanager]
[  +0.000006]  ? __local_bh_enable_ip+0x1b/0x66
[  +0.000002]  ? __local_bh_enable_ip+0x1b/0x66
[  +0.000002]  ? __local_bh_enable_ip+0x1b/0x66
[  +0.000002]  </NMI>
[  +0.000000]  <IRQ>
[  +0.000001]  o2net_listen_data_ready+0x5b/0x78 [ocfs2_nodemanager]
[  +0.000006]  tcp_data_queue+0x3fc/0x8e2
[  +0.000003]  tcp_rcv_established+0x33e/0x43f
[  +0.000001]  tcp_v4_do_rcv+0xb8/0x197
[  +0.000003]  tcp_v4_rcv+0x757/0xa1a
[  +0.000002]  ? raw_local_deliver+0x1a0/0x1cb
[  +0.000003]  ip_protocol_deliver_rcu+0x97/0x15f
[  +0.000002]  ip_local_deliver_finish+0x81/0x8f
[  +0.000002]  ip_sublist_rcv_finish+0x28/0x38
[  +0.000002]  ip_sublist_rcv+0x162/0x18a
[  +0.000003]  ip_list_rcv+0xe7/0x10f
[  +0.000002]  __netif_receive_skb_list_core+0xf1/0x119
[  +0.000002]  netif_receive_skb_list_internal+0x1e7/0x21e
[  +0.000002]  gro_normal_list+0x1d/0x3f
[  +0.000001]  napi_complete_done+0x76/0x11b
[  +0.000003]  virtnet_poll+0x272/0x38f [virtio_net]
[  +0.000005]  __napi_poll.constprop.0+0x26/0x119
[  +0.000002]  net_rx_action+0x110/0x20c
[  +0.000001]  __do_softirq+0x122/0x290
[  +0.000004]  common_interrupt+0x9b/0xc1
[  +0.000002]  </IRQ>
[  +0.000000]  <TASK>
[  +0.000001]  asm_common_interrupt+0x22/0x40
[  +0.000002] RIP: 0010:pv_native_safe_halt+0x13/0x18
[  +0.000003] Code: 90 90 90 90 90 0f 0b 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 8b 05 55 52 06 01 85 c0 7e 07 0f 00 2d 34 9c 26 00 fb f4 <c3> cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 48
[  +0.000001] RSP: 0018:ffffc900000a3ee8 EFLAGS: 00000246
[  +0.000001] RAX: 0000000000000000 RBX: ffff888001a2e040 RCX: 0000000000000000
[  +0.000001] RDX: 0000000000000000 RSI: ffffffff81f87bee RDI: 000000000009ca64
[  +0.000001] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[  +0.000000] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
[  +0.000001] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
[  +0.000002]  default_idle+0x5/0x11
[  +0.000001]  default_idle_call+0x30/0x4f
[  +0.000002]  do_idle+0xb9/0x19d
[  +0.000003]  cpu_startup_entry+0x25/0x27
[  +0.000002]  start_secondary+0xf5/0xf5
[  +0.000002]  secondary_startup_64_no_verify+0x178/0x17b
[  +0.000003]  </TASK>
[  +0.000842] CPU: 3 PID: 319 Comm: kworker/u8:3 Not tainted 6.7.6-gentoo #3
[  +0.000008] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 4.2023.08-4 02/15/2024
[  +0.000002] Workqueue: o2net o2net_accept_many [ocfs2_nodemanager]
[  +0.000009] RIP: 0010:queued_spin_lock_slowpath+0x3d/0x27c
[  +0.000004] Code: 48 89 fb 85 c0 7f 1b 81 fe 00 01 00 00 75 2d b8 01 02 00 00 eb 1c f0 0f b1 13 85 c0 0f 84 3a 02 00 00 8b 03 85 c0 74 ee f3 90 <eb> f6 ff c8 74 5b f3 90 8b 33 81 fe 00 01 00 00 74 f0 81 fe ff 00
[  +0.000001] RSP: 0018:ffffc9000037fd40 EFLAGS: 00000202
[  +0.000002] RAX: 0000000000000001 RBX: ffff888008ee1298 RCX: 0000000000000000
[  +0.000001] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff888008ee1298
[  +0.000001] RBP: ffff888008ee1298 R08: 0000000000000000 R09: 0000000000000000
[  +0.000001] R10: 0000000000000000 R11: 0000000000000000 R12: ffff888004ef2270
[  +0.000001] R13: ffffc9000037fdec R14: ffffffffa0508f88 R15: ffff88800bf00dc0
[  +0.000001] FS:  0000000000000000(0000) GS:ffff88807df80000(0000) knlGS:0000000000000000
[  +0.000001] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000001] CR2: 00007fff5175af04 CR3: 00000000074a8000 CR4: 00000000000006f0
[  +0.000004] Call Trace:
[  +0.000001]  <IRQ>
[  +0.000001]  ? rcu_dump_cpu_stacks+0x8f/0xb3
[  +0.000003]  ? rcu_sched_clock_irq+0x345/0x91b
[  +0.000003]  ? update_load_avg+0x383/0x3ac
[  +0.000003]  ? kthread_data+0x5/0xe
[  +0.000003]  ? wq_worker_tick+0x9/0xa3
[  +0.000003]  ? __pfx_tick_nohz_highres_handler+0x10/0x10
[  +0.000003]  ? update_process_times+0x4d/0x6c
[  +0.000003]  ? tick_nohz_highres_handler+0x90/0xbe
[  +0.000002]  ? __hrtimer_run_queues+0xec/0x197
[  +0.000003]  ? hrtimer_interrupt+0x97/0x169
[  +0.000001]  ? __sysvec_apic_timer_interrupt+0xc6/0x136
[  +0.000002]  ? sysvec_apic_timer_interrupt+0x80/0xa6
[  +0.000002]  </IRQ>
[  +0.000001]  <TASK>
[  +0.000001]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[  +0.000003]  ? queued_spin_lock_slowpath+0x3d/0x27c
[  +0.000003]  lock_sock_nested+0x19/0x3e
[  +0.000003]  inet_csk_accept+0x1fb/0x2ad
[  +0.000002]  ? inode_init_always+0x174/0x1cc
[  +0.000003]  ? preempt_latency_start+0x2b/0x46
[  +0.000002]  inet_accept+0x41/0x91
[  +0.000003]  o2net_accept_many+0xac/0x398 [ocfs2_nodemanager]
[  +0.000007]  ? __schedule+0x6a8/0x6f3
[  +0.000002]  process_scheduled_works+0x199/0x29d
[  +0.000003]  worker_thread+0x1c1/0x21b
[  +0.000002]  ? __pfx_worker_thread+0x10/0x10
[  +0.000003]  kthread+0xf2/0xfa
[  +0.000002]  ? __pfx_kthread+0x10/0x10
[  +0.000001]  ret_from_fork+0x1f/0x31
[  +0.000002]  ? __pfx_kthread+0x10/0x10
[  +0.000002]  ret_from_fork_asm+0x1b/0x30
[  +0.000003]  </TASK>

dmesg from a machine that has already mounted the OCFS2 drive:
[Feb29 16:47] o2net: No connection established with node 1 after 30.0 seconds, check network and cluster configuration.
[  +3.072093] o2net: Connection to node Gentoo-OCFS (num 1) at 192.168.0.14:7777 shutdown, state 7
[  +3.071958] o2net: Connection to node Gentoo-OCFS (num 1) at 192.168.0.14:7777 shutdown, state 7
[Feb29 16:48] o2net: Connection to node Gentoo-OCFS (num 1) at 192.168.0.14:7777 shutdown, state 7
[  +3.072012] o2net: Connection to node Gentoo-OCFS (num 1) at 192.168.0.14:7777 shutdown, state 7
[  +3.072009] o2net: Connection to node Gentoo-OCFS (num 1) at 192.168.0.14:7777 shutdown, state 7
[  +3.072063] o2net: Connection to node Gentoo-OCFS (num 1) at 192.168.0.14:7777 shutdown, state 7
[  +3.072016] o2net: Connection to node Gentoo-OCFS (num 1) at 192.168.0.14:7777 shutdown, state 7
[  +3.072068] o2net: Connection to node Gentoo-OCFS (num 1) at 192.168.0.14:7777 shutdown, state 7
[  +3.072017] o2net: Connection to node Gentoo-OCFS (num 1) at 192.168.0.14:7777 shutdown, state 7
[  +3.071967] o2net: No connection established with node 1 after 30.0 seconds, check network and cluster configuration.
......

I would greatly appreciate it if someone could look into this issue.
Comment 2 Mike Pagano gentoo-dev 2024-02-29 23:13:44 UTC
Please report this upstream at https://bugzilla.kernel.org.
Comment 3 Mike Pagano gentoo-dev 2024-03-01 12:42:28 UTC
Thanks, we'll watch the upstream bug and backport any fixes identified.