Bug 671122 - sys-kernel/gentoo-sources-4.9.135 kernel bug: block io discard (TRIM) fails during mkfs.ext4 on lvm on bcache
Status: RESOLVED FIXED
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: Current packages
Hardware: AMD64 Linux
Importance: Normal normal
Assignee: Gentoo Kernel Bug Wranglers and Kernel Maintainers
URL: https://bugzilla.kernel.org/show_bug....
Whiteboard:
Keywords: Bug
Depends on:
Blocks:
 
Reported: 2018-11-14 11:07 UTC by progmachine
Modified: 2019-05-14 12:11 UTC

Description progmachine 2018-11-14 11:07:00 UTC
I have a complex block storage stack on two of my computers:
1) an LVM volume group (/dev/vgroup0) on top of
2) a set of bcache devices, which sit on top of
3) a set of 26 partitions, which sit on top of
4) an mdadm RAID1/RAID5 array (/dev/md0), which sits on
5) 2 or 4 hard drives

There is also an mdadm RAID1 array (/dev/md2) made of a pair of SSDs. md2 is used as the caching device for the bcache devices.
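
Since the failure involves discard requests passing through several stacked layers, it can help to see what each layer advertises. The following is a minimal sketch, not part of the original report, that reads the standard sysfs attribute queue/discard_max_bytes for a list of devices. md0 and md2 come from the description above; bcache0 and dm-0 are assumed names and will differ per system. A value of 0 generally means the layer does not advertise discard support.

```python
from pathlib import Path


def discard_limits(devices, sysfs_root="/sys/block"):
    """Return {device: discard_max_bytes or None} for each kernel device name.

    None means the attribute could not be read (device missing or no queue dir).
    """
    limits = {}
    for name in devices:
        attr = Path(sysfs_root) / name / "queue" / "discard_max_bytes"
        try:
            limits[name] = int(attr.read_text().strip())
        except (OSError, ValueError):
            limits[name] = None
    return limits


if __name__ == "__main__":
    # md0 and md2 are named in the report; bcache0 and dm-0 are assumptions.
    for dev, limit in discard_limits(["md0", "md2", "bcache0", "dm-0"]).items():
        print(f"{dev}: {limit}")
```

The sysfs_root parameter exists only so the function can be pointed at a test directory; on a real system the default is what you want.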

After updating my gentoo-sources kernel from 4.9.96 to 4.9.135, a new problem appeared: I cannot reliably create new LVM volumes and format them with mkfs.ext4. There is a very high probability that the kernel will hit a bug in the block I/O subsystem. Once the bug is hit, the system becomes unstable and cannot be rebooted cleanly; only a hard reboot or power-off is possible.
This bug also exists in a fresh 4.14 kernel.

Here is part of the dmesg output:

[  336.318610] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[  336.321063] IP: blk_queue_split+0x17a/0x510
[  336.323471] PGD 0 P4D 0 
[  336.325790] Oops: 0000 [#1] SMP NOPTI
[  336.328090] Modules linked in:
[  336.330404] CPU: 4 PID: 3495 Comm: mkfs.ext4 Not tainted 4.14.78-gentoo #1
[  336.332750] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-1.fc28 04/01/2014
[  336.333718] task: ffff88007c1be000 task.stack: ffffc9000057c000
[  336.334569] RIP: 0010:blk_queue_split+0x17a/0x510
[  336.335399] RSP: 0018:ffffc9000057fc38 EFLAGS: 00010246
[  336.336226] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000000
[  336.337043] RDX: 0000000000080000 RSI: 0000000000000000 RDI: ffff8800788d8000
[  336.337877] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
[  336.338700] R10: 0000000001400010 R11: 0000000000000000 R12: 0000000000000000
[  336.339509] R13: ffff8800788d8000 R14: 0000000000000000 R15: ffff880078378e20
[  336.340319] FS:  00007fbf5085b780(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
[  336.341140] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  336.341951] CR2: 0000000000000008 CR3: 000000007ca00000 CR4: 00000000000006e0
[  336.342778] Call Trace:
[  336.343626]  ? cached_dev_make_request+0x88a/0xbd0
[  336.344461]  md_make_request+0x1e/0x140
[  336.345292]  generic_make_request+0xec/0x2b0
[  336.346125]  submit_bio+0x5f/0x120
[  336.346970]  ? next_bio+0x1e/0x50
[  336.347797]  ? __blkdev_issue_discard+0x165/0x1c0
[  336.348632]  submit_bio_wait+0x42/0x60
[  336.349470]  blkdev_issue_discard+0x6d/0xa0
[  336.350308]  blk_ioctl_discard+0x69/0x90
[  336.351147]  blkdev_ioctl+0x424/0x920
[  336.351985]  block_ioctl+0x34/0x40
[  336.352817]  do_vfs_ioctl+0xa0/0x610
[  336.353462]  SyS_ioctl+0x91/0xa0
[  336.354009]  do_syscall_64+0x69/0x120
[  336.354590]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  336.355196] RIP: 0033:0x7fbf4f6b7667
[  336.355736] RSP: 002b:00007ffe5249eeb8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
[  336.356290] RAX: ffffffffffffffda RBX: 0000561d574cbdc0 RCX: 00007fbf4f6b7667
[  336.356872] RDX: 00007ffe5249eec0 RSI: 0000000000001277 RDI: 0000000000000003
[  336.357487] RBP: 000000000003f000 R08: 0000000000001000 R09: 0000000000000000
[  336.358047] R10: 0000000000000223 R11: 0000000000000206 R12: 00007ffe5249ef70
[  336.358605] R13: 0000000000040000 R14: 0000000000001000 R15: 0000000000000000
[  336.359161] Code: 00 00 00 00 31 db 41 83 e1 fb 45 31 f6 45 31 e4 44 89 c6 44 89 4c 24 0c 89 44 24 28 4c 89 7c 24 10 49 89 f1 49 c1 e1 04 4d 01 d9 <41> 8b 41 08 45 8b 51 0c 49 8b 29 29 c8 39 d0 0f 47 c2 41 01 ca 
[  336.360348] RIP: blk_queue_split+0x17a/0x510 RSP: ffffc9000057fc38
[  336.360945] CR2: 0000000000000008
[  336.361596] ---[ end trace df40edca705a2cdd ]---
[  336.362214] ------------[ cut here ]------------
[  336.362949] WARNING: CPU: 4 PID: 3495 at kernel/exit.c:771 do_exit+0x3a/0xaa0
[  336.362949] Modules linked in:
[  336.362951] CPU: 4 PID: 3495 Comm: mkfs.ext4 Tainted: G      D         4.14.78-gentoo #1
[  336.362951] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-1.fc28 04/01/2014
[  336.362952] task: ffff88007c1be000 task.stack: ffffc9000057c000
[  336.362953] RIP: 0010:do_exit+0x3a/0xaa0
[  336.362954] RSP: 0018:ffffc9000057fef0 EFLAGS: 00010283
[  336.362955] RAX: ffffc9000057fdf0 RBX: ffff88007c1be000 RCX: ffffc9000057fde0
[  336.362955] RDX: ffff88007c71a000 RSI: 0000000000000000 RDI: ffffffff82a462a0
[  336.362956] RBP: 0000000000000009 R08: 000000000000025b R09: 0000000000aaaaaa
[  336.362956] R10: 0000000000000001 R11: ffff88007c638046 R12: 0000000000000000
[  336.362957] R13: 0000000000000009 R14: ffff88007c1be000 R15: 0000000000000000
[  336.362959] FS:  00007fbf5085b780(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
[  336.362960] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  336.362960] CR2: 0000000000000008 CR3: 000000007ca00000 CR4: 00000000000006e0
[  336.362962] Call Trace:
[  336.362964]  ? SyS_ioctl+0x91/0xa0
[  336.362966]  rewind_stack_do_exit+0x17/0x20
[  336.362967] Code: 8b 1c 25 40 4d 01 00 48 83 ec 30 48 89 df e8 3e 0f 05 00 48 8b 83 a0 06 00 00 48 85 c0 74 0e 48 8b 10 48 39 d0 0f 84 22 06 00 00 <0f> 0b 65 44 8b 2d 3c 38 f6 7e 41 81 e5 00 ff 1f 00 44 89 6c 24 
[  336.362979] ---[ end trace df40edca705a2cde ]---

Reproducible: Sometimes

Steps to Reproduce:
1. Create a block storage stack of LVM on bcache on mdadm.
2. Create an LVM volume.
3. Format the new volume several times with mkfs.ext4; the bug is very likely to trigger within 2 to 5 formatting attempts.
Actual Results:
mkfs.ext4 stops working with the message "process killed".
The dmesg output shows an internal kernel bug with a stack trace.

Expected Results:
mkfs.ext4 runs flawlessly and the filesystem is ready to use.
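
The call trace shows the crash on the path blkdev_ioctl -> blk_ioctl_discard -> blkdev_issue_discard, so the trigger is a BLKDISCARD ioctl rather than anything specific to ext4. The following is a minimal sketch, under that assumption, of issuing the same ioctl directly from Python. The function names are hypothetical helpers, and any device path passed in is the caller's choice. WARNING: a discard destroys the data in the given range, so this should only ever be pointed at a scratch volume in a test VM.

```python
import fcntl
import os
import struct

BLKDISCARD = 0x1277  # _IO(0x12, 119) from linux/fs.h


def discard_arg(offset, length):
    """Pack the byte range as two unsigned 64-bit values: {start, length}."""
    return struct.pack("=QQ", offset, length)


def issue_discard(device, offset, length):
    """Issue BLKDISCARD on `device` for [offset, offset + length).

    DESTRUCTIVE: the kernel may drop all data in that range.
    """
    fd = os.open(device, os.O_WRONLY)
    try:
        fcntl.ioctl(fd, BLKDISCARD, discard_arg(offset, length))
    finally:
        os.close(fd)
```

Nothing runs on import; issue_discard has to be called explicitly with a device path, an offset, and a length in bytes.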
Comment 1 progmachine 2018-11-14 12:40:29 UTC
I just tested this in my small lvdbg VM, which I created specifically to reproduce this bug safely. I tested the storage setup described above but without the bcache layer, and everything worked fine.
So the bug is somehow connected to the bcache discard mechanism.
It also seems to be the same bug as https://bugzilla.kernel.org/show_bug.cgi?id=196103
Comment 2 Alice Ferrazzi Gentoo Infrastructure gentoo-dev 2018-11-19 05:19:12 UTC
You should try to bisect it.
https://wiki.gentoo.org/wiki/Kernel_git-bisect
Comment 3 progmachine 2018-11-20 14:51:54 UTC
I tried different vanilla kernel versions, from the latest 4.20-rc3 down to the oldest that I can build on my testing VM (4.9.18); the bug reproduces on all of them.
It seems that this bug was always there but was not triggered, because previous versions of mkfs.ext4 did not issue a discard command before formatting the volume.
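
If the discard issued at mkfs time is indeed the trigger, the mke2fs extended options discard and nodiscard (documented in mke2fs(8)) give a way to compare both behaviours with the same mkfs.ext4 version. The sketch below is an illustration only: the helper names are hypothetical, the volume path in the usage note is an assumed example, and whether nodiscard avoids the oops was not tested in this report.

```python
import subprocess


def mkfs_ext4_command(device, discard=True):
    """Build the mkfs.ext4 argument list with discard explicitly on or off."""
    option = "discard" if discard else "nodiscard"
    return ["mkfs.ext4", "-E", option, device]


def format_volume(device, discard=True):
    """Run mkfs.ext4 on `device` and return its exit status.

    DESTRUCTIVE: this overwrites whatever is on the device.
    """
    return subprocess.run(mkfs_ext4_command(device, discard), check=False).returncode
```

For example, format_volume("/dev/vgroup0/lvtest", discard=False) would format an assumed test volume named lvtest without the initial discard pass.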
Comment 4 progmachine 2019-05-14 12:11:20 UTC
The patch was accepted into the mainline kernel tree and backported to the stable kernels.
I have personally used this patch for many months now, and it works flawlessly.
I think this bug report can be closed.