I run a script once a week to check my AMD Athlon 64-bit server's software RAID mirror array partitions (200GB split into 80GB partitions for LVM) for errors by echoing check to /sys/block/md*/md/sync_action. After upgrading to the 2.6.27-hardened-r3 kernel, the new PaX refcount overflow detection code triggered an overflow during the second week of the consistency check. I also have a 32-bit Pentium 4 test server on which I can consistently reproduce the error using a script that runs a check continuously, up to 1000 times. On a 20GB RAID device, the overflow starts occurring after about 50 iterations. Once it starts, it keeps occurring until the check finishes or until idle is echoed to sync_action, at which point it stops, but it starts again immediately if another check is started before the system is rebooted. The overflow also occurs on 2.6.27-hardened-r4 patched up to 2.6.27.11. Is this a real bug in the RAID system or a false alarm?

Reproducible: Always

Steps to Reproduce:
1. Run test_check_raid.sh md? (at least a 20GB RAID device)
2. Wait about 50 iterations
3. Look for refcount overflow messages in the logs
Created attachment 179426: test_check_raid.sh
This script checks a RAID device 1000 times to test whether a refcount overflow occurs.
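For readers without access to the attachment, the loop it describes can be sketched roughly as follows. This is not the attached test_check_raid.sh, only a minimal Python sketch of the same idea; the function names, the polling interval, and the sysfs_root parameter are assumptions made for illustration. It relies only on the sync_action file mentioned in the report, which accepts check and reads back idle when no check is running.

```python
import time
from pathlib import Path


def sync_action_path(md_name, sysfs_root="/sys/block"):
    """Build the path to an md device's sync_action file (e.g. md3)."""
    return Path(sysfs_root) / md_name / "md" / "sync_action"


def read_state(path):
    """Return the current sync action, e.g. 'idle' or 'check'."""
    return path.read_text().strip()


def request(path, action):
    """Write an action such as 'check' or 'idle', like echo would."""
    path.write_text(action + "\n")


def run_checks(md_name, iterations=1000, poll_seconds=10,
               sysfs_root="/sys/block"):
    """Start a check, wait for it to finish, and repeat.

    Hypothetical helper: the iteration count mirrors the report,
    the polling interval is an arbitrary choice.
    """
    path = sync_action_path(md_name, sysfs_root)
    for i in range(1, iterations + 1):
        request(path, "check")
        # The kernel reports 'idle' again once the check has completed.
        while read_state(path) != "idle":
            time.sleep(poll_seconds)
        print(f"check {i} of {iterations} finished")
```

Calling run_checks("md3") as root would then run the checks back to back; the kernel log still has to be watched separately for the overflow messages.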
Could you please try with hardened-sources-2.6.27-r7? If it still fails please post your emerge --info and kernel config. Thanks!
I am still getting the overflow with hardened-sources-2.6.27-r7 and also with vanilla-sources-2.6.27.13 patched with just the pax_refcount part of grsecurity. Here is the emerge --info for the 32-bit system:

Portage 2.1.6.4 (default/linux/x86/2008.0, gcc-4.1.2, glibc-2.6.1-r0, 2.6.27.13 i686)
=================================================================
System uname: Linux-2.6.27.13-i686-Intel-R-_Pentium-R-_4_CPU_3.00GHz-with-glibc2.0
Timestamp of tree: Sun, 25 Jan 2009 13:15:01 +0000
app-shells/bash: 3.2_p33
dev-lang/python: 2.4.4-r14, 2.5.2-r7
dev-python/pycrypto: 2.0.1-r6
dev-util/cmake: 2.4.6-r1
sys-apps/baselayout: 1.12.11.1
sys-apps/sandbox: 1.2.18.1-r2
sys-devel/autoconf: 2.13, 2.61-r2
sys-devel/automake: 1.4_p6, 1.5, 1.6.3, 1.7.9-r1, 1.8.5-r3, 1.9.6-r2, 1.10.1-r1
sys-devel/binutils: 2.18-r3
sys-devel/gcc-config: 1.4.0-r4
sys-devel/libtool: 1.5.26
virtual/os-headers: 2.6.27-r2
ACCEPT_KEYWORDS="x86"
CBUILD="i686-pc-linux-gnu"
CFLAGS="-O2 -march=prescott -fomit-frame-pointer -pipe"
CHOST="i686-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/share/X11/xkb /var/bind"
CONFIG_PROTECT_MASK="/etc/ca-certificates.conf /etc/env.d /etc/gconf /etc/php/apache2-php5/ext-active/ /etc/php/cgi-php5/ext-active/ /etc/php/cli-php5/ext-active/ /etc/revdep-rebuild /etc/terminfo /etc/udev/rules.d"
CXXFLAGS="-O2 -march=prescott -fomit-frame-pointer -pipe"
DISTDIR="/usr/portage/distfiles"
FEATURES="distlocks fixpackages parallel-fetch protect-owned sandbox sfperms strict unmerge-orphans userfetch"
GENTOO_MIRRORS="http://mirror.internode.on.net/pub/gentoo"
LANG="en_AU.utf8"
LC_ALL="en_AU.utf8"
LDFLAGS="-Wl,-O1"
MAKEOPTS="-j3"
PKGDIR="/usr/portage/packages"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --stats --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY="/usr/local/portage"
SYNC="rsync://mirror.internode.on.net/gentoo-portage"
USE="acl acpi alsa apache2 berkdb bzip2 caps cjk cli cracklib crypt cups dlloader dri fam fortran gdbm gpm iconv ipv6 isdnlog jpeg kerberos logrotate midi mng mudflap ncurses nls nptl nptlonly openmp pam pcre perl png pppd python qt readline reflection session spl ssl sysfs tcpd threads tiff unicode vhosts x86 xattr xinerama xorg zlib" ALSA_CARDS="ali5451 als4000 atiixp atiixp-modem bt87x ca0106 cmipci emu10k1 emu10k1x ens1370 ens1371 es1938 es1968 fm801 hda-intel intel8x0 intel8x0m maestro3 trident usb-audio via82xx via82xx-modem ymfpci" ALSA_PCM_PLUGINS="adpcm alaw asym copy dmix dshare dsnoop empty extplug file hooks iec958 ioplug ladspa lfloat linear meter mmap_emul mulaw multi null plug rate route share shm softvol" APACHE2_MODULES="actions alias auth_basic auth_digest authn_anon authn_dbd authn_dbm authn_default authn_file authz_dbm authz_default authz_groupfile authz_host authz_owner authz_user autoindex cache dav dav_fs dav_lock dbd deflate dir disk_cache env expires ext_filter file_cache filter headers ident imagemap include info log_config logio mem_cache mime mime_magic negotiation proxy proxy_ajp proxy_balancer proxy_connect proxy_http rewrite setenvif so speling status unique_id userdir usertrack vhost_alias" ELIBC="glibc" INPUT_DEVICES="evdev keyboard mouse" KERNEL="linux" LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text" USERLAND="GNU" VIDEO_CARDS="fbdev vesa"
Unset: CPPFLAGS, CTARGET, EMERGE_DEFAULT_OPTS, FFLAGS, INSTALL_MASK, LINGUAS, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS
Created attachment 179915: 2.6.27-hardened-r7_p4.config
The 2.6.27 kernel config I am using on the 32-bit Pentium 4 system
Created attachment 179917
A custom patch containing only the pax_refcount parts of grsecurity
On vanilla kernel 2.6.27.13 patched with pax_refcount I got proper symbol output of the stack dump:

Jan 28 02:52:07 testwww PAX: refcount overflow occured at: sync_request+0x582/0x5d1
Jan 28 02:52:07 testwww Modules linked in: lm85 hwmon_vid hwmon i2c_i801 iTCO_wdt e1000 ehci_hcd
Jan 28 02:52:07 testwww
Jan 28 02:52:07 testwww Pid: 17437, comm: md3_resync Not tainted (2.6.27.13 #1)
Jan 28 02:52:07 testwww EIP: 0060:[<c02c3833>] EFLAGS: 00000a02 CPU: 0
Jan 28 02:52:07 testwww EIP is at sync_request+0x582/0x5d1
Jan 28 02:52:07 testwww EAX: f7b5d200 EBX: 00000000 ECX: 00000080 EDX: f78e6080
Jan 28 02:52:07 testwww ESI: 00000002 EDI: f6c21240 EBP: f7b39880 ESP: f6347e84
Jan 28 02:52:07 testwww DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Jan 28 02:52:07 testwww Process md3_resync (pid: 17437, ti=f6347000 task=f7258000 task.ti=f6347000)
Jan 28 02:52:07 testwww Stack: 00000018 02629f00 00000000 019514f8 00000000 f7b9f800 02629f00 00000000
Jan 28 02:52:07 testwww 00000080 00000000 00000000 ffffffff 00000001 00000002 00000000 c16dbe20
Jan 28 02:52:07 testwww 00001000 00000380 c0491780 f7b9f800 02629f00 0000c7f3 c02d41c6 f6347fa0
Jan 28 02:52:07 testwww Call Trace:
Jan 28 02:52:07 testwww [<c02d41c6>] md_do_sync+0x681/0xb1f
Jan 28 02:52:07 testwww [<c037e471>] schedule_timeout+0x13/0x86
Jan 28 02:52:07 testwww [<c02d3b2f>] md_thread+0xb6/0xcc
Jan 28 02:52:07 testwww [<c01343a0>] autoremove_wake_function+0x0/0x2d
Jan 28 02:52:07 testwww [<c02d3a79>] md_thread+0x0/0xcc
Jan 28 02:52:07 testwww [<c01342de>] kthread+0x38/0x5e
Jan 28 02:52:07 testwww [<c01342a6>] kthread+0x0/0x5e
Jan 28 02:52:07 testwww [<c0104447>] kernel_thread_helper+0x7/0x10
Jan 28 02:52:07 testwww =======================

On the hardened kernels the stack dump only had the addresses but no symbol names. I have also now tested vanilla kernel: 2.6.25 patched with pax_refcount and got a similar result to 2.6.27.13.
Looks to me like it's probably a bug in the md layer and PAX_REFCOUNT is simply acting as a QA tool. ;) CCing PaX Team and kernel@.
(In reply to comment #6)
> On vanilla kernel 2.6.27.13 patched with pax_refcount I got proper symbol
> output of the stack dump:
> Jan 28 02:52:07 testwww PAX: refcount overflow occured at:
> sync_request+0x582/0x5d1

most interesting ;). i'll need your vmlinux and System.map files that correspond to this trace (send them directly to me, no need to bother bugzilla with such big attachments). also you could tell me which raid support you're using there as this sync_request function is implemented in 3 modules and from the trace i can't tell which one it was.
Both the Athlon 64-bit server and the Pentium 4 32-bit test server (both dual cores) are running RAID1 using two hard drives
this is what happens: towards the end of drivers/md/raid1.c:sync_request() there're two calls to md_sync_acct() which in turn do nothing but an atomic_add() on ....->bd_disk->sync_io. this ->sync_io field is used only in drivers/md/md.c:is_mddev_idle() along with a longish comment about how it's used. now i have no clue about the internals of md/raid/whatever, but it seems this field is simply used as a simple counter (i.e., not as a refcount) and the overflow detection may not need to apply to it, but given its use i'm not so sure. thing is, the arithmetic expression using this ->sync_io field contains two other fields, both of unsigned long type. on 64 bit archs this means that we're doing arithmetic on mixed 64/32 bit integers, that always looks like trouble, especially here when we now know that the 32 bit integer field can overflow and, no doubt, eventually wrap around (on 32 bit archs even the unsigned long fields can wrap around, i guess). so i'd say someone should ask the md/raid developers about this, as i can't make a judgement call. as a workaround, i'll add special handling for this particular atomic_t field, but that'll shut up the PaX warning only, it won't fix the underlying issue (if there's one at all, of course).
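A rough back-of-the-envelope check fits the "about 50 iterations" from the original report. The sketch below is only arithmetic, not kernel code, and it rests on assumptions that are not confirmed in this thread: that ->sync_io is a signed 32-bit atomic_t, that it accumulates 512-byte sectors, and that one full check adds roughly the whole device size to each member disk's counter. The helper names are made up for illustration.

```python
SECTOR_BYTES = 512          # assumed unit of the sync_io counter
INT_MAX = 2**31 - 1         # largest value of a signed 32-bit counter


def passes_until_overflow(sectors_per_pass):
    """Number of the first full pass whose running total exceeds INT_MAX."""
    return INT_MAX // sectors_per_pass + 1


def wrap_int32(value):
    """Value a signed 32-bit counter would hold after wrapping around."""
    return (value + 2**31) % 2**32 - 2**31


# A "20GB" device, read either as 20 * 10^9 bytes or as 20 GiB.
sectors_decimal = 20 * 10**9 // SECTOR_BYTES   # 39062500 sectors
sectors_binary = 20 * 2**30 // SECTOR_BYTES    # 41943040 sectors

print(passes_until_overflow(sectors_decimal))  # 55
print(passes_until_overflow(sectors_binary))   # 52
print(wrap_int32(INT_MAX + 1))                 # -2147483648
```

Under those assumptions the counter runs out of room a little past the 50th check of a 20GB device, and without overflow detection it would wrap to a large negative value, which is the kind of input that makes the mixed 64/32-bit arithmetic described above look suspicious.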
after having discussed it with Neil Brown, i fixed this problem, give the latest test patches a try please.
It looks like that workaround has fixed the problem. I ran test_check_raid.sh for 400 iterations on one of the 32-bit test server's 20GB RAID1 partitions and didn't get any errors or crashes.
Should be fixed in sys-kernel/hardened-sources-2.6.27-r8.

Waiting to close the bug until:
1. a release with the fix has been marked stable.
2. a fixed 2.6.28 release is in the tree.
> Should be fixed in sys-kernel/hardened-sources-2.6.27-r8.
>
> Waiting to close the bug until:
> 1. a release with the fix has been marked stable.
> 2. a fixed 2.6.28 release is in the tree.
>

Requirements met a while ago, closing as fixed.
It looks like this bug has returned in hardened-sources-2.6.28-r7. http://www.grsecurity.com/test/grsecurity-2.1.14-2.6.29.2-200905040840.patch appears to have the atomic_inc_unchecked workaround fix for kernel 2.6.29 though.
same here...2.6.28-hardened-r7...rather annoying bug...
(In reply to comment #15)
> It looks like this bug has returned in hardened-sources-2.6.28-r7.
> http://www.grsecurity.com/test/grsecurity-2.1.14-2.6.29.2-200905040840.patch
> appears to have the atomic_inc_unchecked workaround fix for kernel 2.6.29
> though.

it seems to be a merge problem in grsec, PaX has the chunk in all .28 patches. in general, you should now move to .29 though, we don't touch .28 anymore.
Fixed again in =sys-kernel/hardened-sources-2.6.28-r8. Thanks for re-opening the bug.