While scrubbing a 4 TiB RAID10 array, a PaX size overflow occurred, freezing all I/O on the array. There was some activity on the array (NFS) at the time of the hang.

Reproducible: Sometimes

Steps to Reproduce:
1. echo check > /sys/block/md1/md/sync_action
2. PaX size overflow

Actual Results:
Hang.

Expected Results:
Flawless completion of the scrub.
Created attachment 429428: kernel log fragment
Created attachment 429430: cat /proc/mdstat
I was able to reproduce it reliably. mdstat is in exactly this state when the size overflow strikes:

Personalities : [raid10]
md1 : active raid10 sda2[0] sdd2[4] sdc2[2] sdb2[1]
      3789323264 blocks super 1.2 256K chunks 2 far-copies [4/4] [UUUU]
      [===========>.........]  check = 56.6% (2147483648/3789323264) finish=1762.4min speed=15524K/sec
      bitmap: 0/29 pages [0KB], 65536KB chunk

md0 : active raid10 sda1[0] sdd1[4] sdc1[2] sdb1[1]
      117371904 blocks super 1.2 256K chunks 2 far-copies [4/4] [UUUU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

unused devices: <none>

Interestingly, the number 2147483648 in mdstat is exactly 2^31, i.e. INT_MAX + 1.
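The boundary noted above can be checked with a few lines of user-space C. This is only a sketch: it assumes a 32-bit int (true on the usual Linux targets), and the helper names fits_in_int and excess_over_int_max are made up for illustration, not taken from the kernel.

```c
#include <limits.h>
#include <stdint.h>

/* Check position shown by mdstat at the moment of the hang. */
#define MDSTAT_POSITION UINT64_C(2147483648)

/* Returns 1 if v can be stored in a signed int without truncation. */
static int fits_in_int(uint64_t v)
{
    return v <= (uint64_t)INT_MAX;
}

/* Returns how far v lies above INT_MAX, or 0 if v fits in an int. */
static uint64_t excess_over_int_max(uint64_t v)
{
    return fits_in_int(v) ? 0 : v - (uint64_t)INT_MAX;
}
```

With a 32-bit int, fits_in_int(MDSTAT_POSITION) is 0 and excess_over_int_max(MDSTAT_POSITION) is 1, so the reported position is the first value that no longer fits.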
Created attachment 429520: kernel config
I just triggered this problem, too. The code reported in the error message:

r10_bio->sectors = (sector_nr | chunk_mask) - sector_nr + 1;

involves the "sectors" field of struct r10bio, defined at md/raid10.h:103, which is an "int". Since the header file says "'private' RAID10 bio", is it okay to simply change it to u64?
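To see what the flagged expression computes, here is a user-space sketch that evaluates it entirely in 64 bits and checks whether a value would fit in an int field. The typedef, the helper names, and the sample inputs in the note below are assumptions for illustration; none of them come from the kernel log in this report.

```c
#include <limits.h>
#include <stdint.h>

/* Assumption: sector_t is 64 bits wide, as on typical kernel builds. */
typedef uint64_t sector_t;

/*
 * The expression from the error message, evaluated in 64 bits.
 * If chunk_mask is (chunk size in sectors - 1), this yields the number
 * of sectors from sector_nr up to and including the end of its chunk.
 */
static sector_t sectors_to_chunk_end(sector_t sector_nr, sector_t chunk_mask)
{
    return (sector_nr | chunk_mask) - sector_nr + 1;
}

/* Returns 1 if value can be stored in an int field without truncation. */
static int fits_int_field(sector_t value)
{
    return value <= (sector_t)INT_MAX;
}
```

For example, with a hypothetical sector_nr of 4294967296 and chunk_mask of 511, sectors_to_chunk_end returns 512, which fits in an int, while the intermediate value (sector_nr | chunk_mask) does not. Whether those inputs match what the kernel actually saw is not known from this report.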
(In reply to Hongjiu Zhang from comment #5)
> I just triggered this problem, too.
>
> The code reported in the error message:
>
> r10_bio->sectors = (sector_nr | chunk_mask) - sector_nr + 1;
>
> involves the "sectors" field of struct r10bio, defined at md/raid10.h:103,
> which is an "int".
>
> Since the header file says "'private' RAID10 bio", Is it okay to simply
> change it to u64?

Sorry, I missed this bug earlier. I think you're right about the type error.
Can you guys add a printk just before the offending statement to see what the values of sector_nr and chunk_mask are? While I agree that this looks like a real bug (integer truncation), gcc can sometimes do funny optimizations with such an expression that can also cause false positives, so knowing the actual runtime values would be helpful.

As for fixing it, a u64 would work, but if it's a real bug then please report it upstream and have them come up with the best fix, since this pattern may occur elsewhere and may need more thorough changes than just this one place. They can also best assess the consequences of this bug (e.g., can it cause data corruption?) and notify other affected trees.
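As a rough idea of what the requested printk could report, here is a user-space stand-in that formats the operands, the intermediate value, and the result of the expression. It is a sketch only: format_debug_line is a hypothetical helper, and in the kernel the equivalent would be a printk with %llu and casts to unsigned long long placed just before the assignment.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Formats the values that feed the flagged expression into buf.
 * Returns the snprintf() result (number of characters that were or
 * would have been written, excluding the terminating NUL).
 */
static int format_debug_line(char *buf, size_t len,
                             uint64_t sector_nr, uint64_t chunk_mask)
{
    uint64_t masked = sector_nr | chunk_mask;   /* intermediate value */
    uint64_t result = masked - sector_nr + 1;   /* value being assigned */

    return snprintf(buf, len,
                    "sector_nr=%" PRIu64 " chunk_mask=%" PRIu64
                    " masked=%" PRIu64 " result=%" PRIu64,
                    sector_nr, chunk_mask, masked, result);
}
```

Logging the intermediate value alongside the result makes it easier to tell a genuine truncation of the stored value from an overflow that is only flagged on a subexpression.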