578756 – sys-kernel/hardened-sources-4.4.2 raid10 size overflow detected in function sync_request drivers/md/raid10.c:3178


Status:	RESOLVED OBSOLETE

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	Hardened
Hardware:	AMD64 Linux

Importance:	Normal critical
Assignee:	The Gentoo Linux Hardened Team

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2016-04-01 14:03 UTC by Dmitry Safonov
Modified:	2018-10-11 23:27 UTC
CC List:	3 users

See Also:
Package list:
Runtime testing required:	---

Attachments
kernel log fragment (fail,43.43 KB, text/plain) 2016-04-01 14:04 UTC, Dmitry Safonov
cat /proc/mdstat (mdstat,491 bytes, text/plain) 2016-04-01 14:06 UTC, Dmitry Safonov
kernel config (kernfailconf,88.08 KB, text/plain) 2016-04-02 21:22 UTC, Dmitry Safonov

Description Dmitry Safonov 2016-04-01 14:03:59 UTC

While scrubbing a 4 TiB raid10 array, a PaX size overflow occurred, freezing all I/O on the array. There was some activity on the array (NFS) at the time of the hang.

Reproducible: Sometimes

Steps to Reproduce:
1. echo check > /sys/block/md1/md/sync_action
2. Wait for the PaX size overflow report in the kernel log.

Actual Results:
Hang.

Expected Results:  
Flawless completion of the scrub.

Comment 1 Dmitry Safonov 2016-04-01 14:04:55 UTC

Created attachment 429428
kernel log fragment

Comment 2 Dmitry Safonov 2016-04-01 14:06:18 UTC

Created attachment 429430
cat /proc/mdstat

Comment 3 Dmitry Safonov 2016-04-01 16:22:08 UTC

I was able to reproduce it reliably.

The mdstat output is in exactly this state when the size overflow strikes:
  Personalities : [raid10] 
  md1 : active raid10 sda2[0] sdd2[4] sdc2[2] sdb2[1]
        3789323264 blocks super 1.2 256K chunks 2 far-copies [4/4] [UUUU]
        [===========>.........]  check = 56.6% (2147483648/3789323264) finish=1762.4min speed=15524K/sec
        bitmap: 0/29 pages [0KB], 65536KB chunk

  md0 : active raid10 sda1[0] sdd1[4] sdc1[2] sdb1[1]
        117371904 blocks super 1.2 256K chunks 2 far-copies [4/4] [UUUU]
        bitmap: 1/1 pages [4KB], 65536KB chunk

  unused devices: <none>

Interestingly, the number 2147483648 in mdstat is exactly 2^31, i.e. INT_MAX + 1 for a 32-bit int.
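
The relationship between the mdstat progress value and INT_MAX can be checked with a few lines of C. This is only a sketch: the helper names are hypothetical, int is assumed to be 32 bits wide (as on AMD64), and the conversion to sectors assumes that mdstat reports progress in 1 KiB blocks. Under those assumptions, 2147483648 is one more than INT_MAX and corresponds to sector 2^32.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Progress value shown in the mdstat output when the check stalled. */
#define STALL_PROGRESS 2147483648ULL

/* True if v can be stored in an int without truncation. */
static bool fits_in_int(uint64_t v)
{
    return v <= (uint64_t)INT_MAX;
}

/* ASSUMPTION: mdstat counts 1 KiB blocks, i.e. two 512-byte sectors each. */
static uint64_t kib_to_sectors(uint64_t kib)
{
    return kib * 2;
}
```

With a 32-bit int, fits_in_int(STALL_PROGRESS) is false while fits_in_int(STALL_PROGRESS - 1) is true, so the stall sits right at the first value that no longer fits.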

Comment 4 Dmitry Safonov 2016-04-02 21:22:00 UTC

Created attachment 429520 [details]
kernel config

Comment 5 Hongjiu Zhang 2016-06-01 07:41:09 UTC

I just triggered this problem, too.

The code reported in the error message:

r10_bio->sectors = (sector_nr | chunk_mask) - sector_nr + 1;

involves the "sectors" field of struct r10bio, defined at md/raid10.h:103, which is an "int".

Since the header file says "'private' RAID10 bio", is it okay to simply change it to u64?
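
For illustration, the quoted expression and the narrowing store into an int can be modelled in user space. This is a sketch under assumptions, not the kernel code: sector_t is a stand-in typedef, the helper names are hypothetical, and the example chunk_mask of 511 assumes a 256K chunk expressed in 512-byte sectors. It does not by itself show whether the reported overflow is real.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t sector_t;  /* stand-in for the kernel's 64-bit sector type */

/* Sectors from sector_nr to the end of its chunk, computed in 64 bits.
 * Same form as the expression quoted from raid10.c. */
static sector_t sectors_to_chunk_end(sector_t sector_nr, sector_t chunk_mask)
{
    return (sector_nr | chunk_mask) - sector_nr + 1;
}

/* Store a 64-bit value into an int only if it fits; report failure otherwise.
 * Hypothetical helper that makes the implicit narrowing explicit. */
static bool store_as_int(sector_t value, int *out)
{
    if (value > (sector_t)INT_MAX)
        return false;
    *out = (int)value;
    return true;
}
```

Called with sector_nr = 4294967296 and chunk_mask = 511, sectors_to_chunk_end() yields a value that store_as_int() accepts, whereas a value such as 4294967296 itself would be rejected.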

Comment 6 Anthony Basile gentoo-dev 2016-07-21 16:48:15 UTC

(In reply to Hongjiu Zhang from comment #5)
> I just triggered this problem, too.
> 
> The code reported in the error message:
> 
> r10_bio->sectors = (sector_nr | chunk_mask) - sector_nr + 1;
> 
> involves the "sectors" field of struct r10bio, defined at md/raid10.h:103,
> which is an "int".
> 
> Since the header file says "'private' RAID10 bio", Is it okay to simply
> change it to u64?

Sorry, I missed this bug earlier. I think you're right about the type error.

Comment 7 PaX Team 2016-07-21 20:31:45 UTC

can you guys add a printk just before the offending statement to see what the values of sector_nr and chunk_mask are? while i agree that this looks like a real bug (integer truncation), gcc can sometimes do funny optimizations with such an expression that can also cause false positives, so knowing the actual runtime values would be helpful.
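
a user-space stand-in for such a debug line might look like the following. this is only a sketch: the function name and message text are hypothetical, sector_t is a stand-in typedef, and snprintf replaces printk so the formatting can be tried outside the kernel. the casts to unsigned long long match the %llu and %llx conversions.

```c
#include <stdint.h>
#include <stdio.h>

typedef uint64_t sector_t;  /* stand-in for the kernel's 64-bit sector type */

/* Format the operands and the 64-bit result of the offending expression.
 * Hypothetical helper; returns whatever snprintf returns. */
static int format_sync_debug(char *buf, size_t len,
                             sector_t sector_nr, sector_t chunk_mask)
{
    sector_t result = (sector_nr | chunk_mask) - sector_nr + 1;

    return snprintf(buf, len,
                    "raid10 sync_request: sector_nr=%llu chunk_mask=%#llx sectors=%llu",
                    (unsigned long long)sector_nr,
                    (unsigned long long)chunk_mask,
                    (unsigned long long)result);
}
```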

as for fixing it, a u64 would work, but if it's a real bug then please report it upstream and have them come up with the best fix, since this pattern may occur elsewhere and may need more thorough changes than just this one place. also they can best assess the consequences of this bug (e.g., can it cause data corruption?) and notify other affected trees.