After upgrading from hardened-sources-2.6.25-r9 I get random oopses/soft lockups/panics. So far I have tested hardened-sources-2.6.2{6-r9,7-r7,8}, and all of them have produced oopses/soft lockups/panics after seeding torrents for a couple of hours. The torrents I have been seeding are those for the Debian Lenny release (ISO files for x86 and amd64), so it may be related to the large number of simultaneous connections, or maybe just to the intense I/O. With 2.6.25-r9 I have been seeding big torrents like this for months without any problem.

Reproducible: Didn't try

Steps to Reproduce:
1. emerge -av '>=sys-kernel/hardened-sources-2.6.26'
2. Start rtorrent (using version 0.8.4-r1) and seed some big files.
3. Wait until the oops comes (on my system, generally in less than 24h).

Actual Results:
Kernel oops in dmesg and sometimes a complete lockup.

Expected Results:
No lockups/oops/panics :)

Portage 2.1.6.4 (hardened/amd64/multilib, gcc-3.4.6, glibc-2.6.1-r0, 2.6.28-hardened x86_64)
=================================================================
System uname: Linux-2.6.28-hardened-x86_64-Intel-R-_Core-TM-2_Duo_CPU_E6850_@_3.00GHz-with-glibc2.3.2
Timestamp of tree: Tue, 17 Feb 2009 20:45:02 +0000
distcc 3.0 x86_64-pc-linux-gnu [disabled]
app-shells/bash: 3.2_p39
dev-lang/python: 2.4.4-r14, 2.5.2-r7
dev-python/pycrypto: 2.0.1-r6
dev-util/cmake: 2.4.8
sys-apps/baselayout: 1.12.11.1
sys-apps/sandbox: 1.2.18.1-r2
sys-devel/autoconf: 2.63
sys-devel/automake: 1.7.9-r1, 1.9.6-r2, 1.10.2
sys-devel/binutils: 2.18-r3
sys-devel/gcc-config: 1.4.0-r4
sys-devel/libtool: 1.5.26
virtual/os-headers: 2.6.27-r2
ABI="amd64" ACCEPT_KEYWORDS="amd64" ALSA_CARDS="" ALSA_PCM_PLUGINS="adpcm alaw asym copy dmix dshare dsnoop empty extplug file hooks iec958 ioplug ladspa lfloat linear meter mmap_emul mulaw multi null plug rate route share shm softvol" APACHE2_MODULES="actions alias auth_basic authn_alias authn_anon authn_dbm authn_default authn_file authz_dbm authz_default authz_groupfile authz_host 
authz_owner authz_user autoindex cache dav dav_fs dav_lock deflate dir disk_cache env expires ext_filter file_cache filter headers include info log_config logio mem_cache mime mime_magic negotiation rewrite setenvif speling status unique_id userdir usertrack vhost_alias" ARCH="amd64" AUTOCLEAN="yes" CBUILD="x86_64-pc-linux-gnu" CDEFINE_amd64="__x86_64__" CDEFINE_x86="__i386__" CFLAGS="-march=nocona -O2 -pipe -fforce-addr -ggdb" CFLAGS_amd64="" CFLAGS_x86="-m32 -L/emul/linux/x86/lib -L/emul/linux/x86/usr/lib" CHOST="x86_64-pc-linux-gnu" CHOST_amd64="x86_64-pc-linux-gnu" CHOST_x86="i686-pc-linux-gnu" CLEAN_DELAY="5" COLLISION_IGNORE="/lib/modules" CONFIG_PROTECT="/etc /var/bind" CONFIG_PROTECT_MASK="/etc/ca-certificates.conf /etc/env.d /etc/fonts/fonts.conf /etc/gconf /etc/php/apache2-php5/ext-active/ /etc/php/cgi-php5/ext-active/ /etc/php/cli-php5/ext-active/ /etc/revdep-rebuild /etc/terminfo /etc/udev/rules.d" CVS_RSH="ssh" CXXFLAGS="-march=nocona -O2 -pipe -fforce-addr -ggdb" DCCC_PATH="/usr/lib64/distcc/bin" DEFAULT_ABI="amd64" DISTCC_LOG="" DISTCC_VERBOSE="0" DISTDIR="/usr/portage/distfiles" EDITOR="/usr/bin/vim" ELIBC="glibc" EMERGE_DEFAULT_OPTS="--verbose" EMERGE_WARNING_DELAY="10" FEATURES="autoconfig distlocks fixpackages parallel-fetch protect-owned sandbox sfperms strict unmerge-orphans userfetch" FETCHCOMMAND="/usr/bin/wget -t 5 -T 60 --passive-ftp -O "${DISTDIR}/${FILE}" "${URI}"" GCC_SPECS="" GENTOO_MIRRORS="http://trumpetti.atm.tut.fi/gentoo/" HOME="/root" INFOPATH="/usr/share/info:/usr/share/binutils-data/x86_64-pc-linux-gnu/2.18/info:/usr/share/gcc-data/x86_64-pc-linux-gnu/3.4.6/info" INPUT_DEVICES="mouse keyboard evdev" KERNEL="linux" LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text" LC_ADDRESS="sv_SE.UTF-8" LC_COLLATE="en_US.UTF-8" LC_CTYPE="sv_SE.UTF-8" LC_IDENTIFICATION="en_US.UTF-8" LC_MEASUREMENT="sv_SE.UTF-8" LC_MESSAGES="en_US.UTF-8" LC_MONETARY="sv_SE.UTF-8" LC_NAME="sv_SE.UTF-8" LC_NUMERIC="en_US.UTF-8" 
LC_PAPER="sv_SE.UTF-8" LC_TELEPHONE="sv_SE.UTF-8" LC_TIME="en_GB.UTF-8" LDFLAGS="" LDFLAGS_amd64="-m elf_x86_64" LDFLAGS_x86="-m elf_i386 -L/emul/linux/x86/lib -L/emul/linux/x86/usr/lib" LESS="-R -M --shift 5" LESSOPEN="|lesspipe.sh %s" LIBDIR_amd64="lib64" LIBDIR_x86="lib32" LOGNAME="root" LS_COLORS="no=00:fi=00:di=01;34:ln=01;36:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:su=37;41:sg=30;43:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.svgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.lzma=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.pdf=00;32:*.ps=00;32:*.txt=00;32:*.patch=00;32:*.diff=00;32:*.log=00;32:*.tex=00;32:*.doc=00;32:*.aac=00;36:*.au=00;36:*.flac=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:" MAKEOPTS="-j4" MANPATH="/usr/local/share/man:/usr/share/man:/usr/share/binutils-data/x86_64-pc-linux-gnu/2.18/man:/usr/share/gcc-data/x86_64-pc-linux-gnu/3.4.6/man:/usr/lib64/php5/man/" MULTILIB_ABIS="x86 amd64" MULTILIB_STRICT_DENY="64-bit.*shared object" MULTILIB_STRICT_DIRS="/lib /usr/lib /usr/kde/*/lib /usr/qt/*/lib /usr/X11R6/lib" MULTILIB_STRICT_EXEMPT="(perl5|gcc|gcc-lib|eclipse-3|debug|portage)" NETBEANS="apisupport cnd groovy gsf harness ide identity j2ee java mobility nb php profiler soa visualweb webcommon 
websvccommon xml" NOCOLOR="true" OLDPWD="/home/azoff" PAGER="/usr/bin/less" PATH="/sbin:/bin:/usr/sbin:/usr/bin" PKGDIR="/usr/portage/packages" PORTAGE_ARCHLIST="ppc s390 amd64 x86 ppc64 x86-fbsd m68k arm sparc sh mips ia64 alpha hppa sparc-fbsd" PORTAGE_BINHOST_CHUNKSIZE="3000" PORTAGE_BIN_PATH="/usr/lib64/portage/bin" PORTAGE_COMPRESS_EXCLUDE_SUFFIXES="css gif htm[l]? jp[e]?g js pdf png" PORTAGE_CONFIGROOT="/" PORTAGE_COUNTER_HASH="922ef19b9843be88fd48994115d738d1" PORTAGE_DEBUG="0" PORTAGE_DEPCACHEDIR="/var/cache/edb/dep" PORTAGE_ELOG_CLASSES="warn error log" PORTAGE_ELOG_MAILFROM="portage@localhost" PORTAGE_ELOG_MAILSUBJECT="[portage] ebuild log for ${PACKAGE} on ${HOST}" PORTAGE_ELOG_MAILURI="root" PORTAGE_ELOG_SYSTEM="save_summary echo" PORTAGE_FETCH_CHECKSUM_TRY_MIRRORS="5" PORTAGE_FETCH_RESUME_MIN_SIZE="350K" PORTAGE_GID="250" PORTAGE_INST_GID="0" PORTAGE_INST_UID="0" PORTAGE_PYM_PATH="/usr/lib64/portage/pym" PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --stats --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages" PORTAGE_RSYNC_RETRIES="3" PORTAGE_TMPDIR="/var/tmp" PORTAGE_VERBOSE="1" PORTAGE_WORKDIR_MODE="0700" PORTDIR="/usr/portage" PORTDIR_OVERLAY="/usr/local/portage" PROFILE_ONLY_VARIABLES="ARCH ELIBC KERNEL USERLAND" PWD="/home/azoff/oops" RESUMECOMMAND="/usr/bin/wget -c -t 5 -T 60 --passive-ftp -O "${DISTDIR}/${FILE}" "${URI}"" ROOT="/" ROOTPATH="/opt/bin:/usr/x86_64-pc-linux-gnu/gcc-bin/3.4.6" RPMDIR="/usr/portage/rpm" SHELL="/bin/bash" SHLVL="2" SSH_CLIENT="192.168.20.5 35538 22" SSH_CONNECTION="192.168.20.5 35538 192.168.20.50 22" SSH_TTY="/dev/pts/0" STAGE1_USE="hardened pic" SYMLINK_LIB="yes" SYNC="rsync://rsync.europe.gentoo.org/gentoo-portage" TERM="rxvt-unicode" USE="amd64 berkdb bzip2 cracklib crypt cups curl hardened jpeg justify midi nls nptl nptlonly pam pic png readline sasl smp sse sse2 ssl tcpd tetex tiff truetype unicode urandom vim-syntax xorg zlib" 
ALSA_PCM_PLUGINS="adpcm alaw asym copy dmix dshare dsnoop empty extplug file hooks iec958 ioplug ladspa lfloat linear meter mmap_emul mulaw multi null plug rate route share shm softvol" APACHE2_MODULES="actions alias auth_basic authn_alias authn_anon authn_dbm authn_default authn_file authz_dbm authz_default authz_groupfile authz_host authz_owner authz_user autoindex cache dav dav_fs dav_lock deflate dir disk_cache env expires ext_filter file_cache filter headers include info log_config logio mem_cache mime mime_magic negotiation rewrite setenvif speling status unique_id userdir usertrack vhost_alias" ELIBC="glibc" INPUT_DEVICES="mouse keyboard evdev" KERNEL="linux" LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text" USERLAND="GNU" VIDEO_CARDS="apm ark chips cirrus cyrix dummy fbdev glint i128 i810 intel mach64 mga neomagic nv r128 radeon rendition s3 s3virge savage siliconmotion sis sisusb tdfx tga trident tseng v4l vesa vga via vmware voodoo" USER="root" USERLAND="GNU" USE_EXPAND="ALSA_CARDS ALSA_PCM_PLUGINS APACHE2_MODULES APACHE2_MPMS CAMERAS CROSSCOMPILE_OPTS DVB_CARDS ELIBC FCDSL_CARDS FOO2ZJS_DEVICES FRITZCAPI_CARDS INPUT_DEVICES KERNEL LCD_DEVICES LINGUAS LIRC_DEVICES MISDN_CARDS NETBEANS_MODULES USERLAND VIDEO_CARDS" USE_EXPAND_HIDDEN="CROSSCOMPILE_OPTS ELIBC KERNEL USERLAND" USE_ORDER="env:pkg:conf:defaults:pkginternal:env.d" VIDEO_CARDS="apm ark chips cirrus cyrix dummy fbdev glint i128 i810 intel mach64 mga neomagic nv r128 radeon rendition s3 s3virge savage siliconmotion sis sisusb tdfx tga trident tseng v4l vesa vga via vmware voodoo" _="/usr/bin/emerge"
Created attachment 182474 [details] Decoded kernel oops from sys-kernel/hardened-sources-2.6.28
Created attachment 182476 [details] Kernel config for sys-kernel/hardened-sources-2.6.28 Note: I have replaced the contents of CONFIG_GRKERNSEC_PROC_GID, CONFIG_GRKERNSEC_AUDIT_GID and CONFIG_GRKERNSEC_TPE_GID to protect those GIDs.
Created attachment 182478 [details] Output of 'lspci -vv'
Created attachment 182480 [details] Output of 'lshw'
can you post the original dmesg you decoded, please? it seems that some information is missing from the decoded version, e.g., the oops report should also contain a page table dump of the faulting address at least, which in this case would be quite important to see.

i'd also like to get your vmlinux (the uncompressed image) if possible (probably email it as it tends to be big) because your compiler generates quite different code from my 4.3.3.

the other question is if you could try a vanilla kernel as an experiment, just to see if it's not a bug in mainline? thing is, i don't really change anything that low-level in file mapping handling (which is where this bug occurred) on amd64, so i wouldn't even know where to start to debug this right now. on the other hand, if it happens in vanilla as well, you can bisect it, even if that's somewhat painful as it'd take a day for each run...
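For a rough sense of the bisection cost mentioned above: isolating one bad commit among N candidates takes roughly ceil(log2(N)) test runs, so at one day per run the total is about that many days. A minimal Python sketch follows; the commit count is a made-up placeholder, not the real number of commits between any two kernel releases, and the function name is hypothetical.

```python
def bisect_runs(n_commits):
    """Rough worst-case number of test runs to isolate one bad commit."""
    if n_commits < 1:
        raise ValueError("need at least one candidate commit")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without float rounding.
    return (n_commits - 1).bit_length()


# Hypothetical inputs: 10000 candidate commits, one day of seeding per run.
HYPOTHETICAL_COMMITS = 10000
DAYS_PER_RUN = 1
estimated_days = bisect_runs(HYPOTHETICAL_COMMITS) * DAYS_PER_RUN
```

With these placeholder numbers the estimate comes to about two weeks. For an intermittent failure, a single clean run may also be inconclusive, which would stretch the real figure further.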
Created attachment 183611 [details] The undecoded dmesg
Created attachment 183614 [details] System.map
(In reply to comment #5)
> can you post the original dmesg you decoded, please? it seems that some
> information is missing from the decoded version, e.g., the oops report should
> also contain a page table dump of the faulting address at least which in this
> case would be quite important to see.

Ok, I've attached the undecoded dmesg (with additional info, as the system has been up for a couple more days). I also attached the System.map, as you probably want to see that one too :)

> i'd also like to get your vmlinux (the
> uncompressed image) if possible (probably email it as it tends to be big)
> because your compiler generates quite different code from my 4.3.3.

I would prefer not to do this. However, if you insist, I could think it over once more.

> the other question is if you could try a vanilla kernel as an experiment just
> to see if it's not a bug in mainline?

Ok, as you didn't specify any particular version, I went for the latest amd64 version (sys-kernel/vanilla-sources-2.6.27.10). I will boot that kernel in the morning and let it run for one or two days.

> thing is, i don't really change anything
> that low-level in file mapping handling (which is where this bug occurred) on
> amd64 so i wouldn't even know where to start to debug this right now. on the
> other hand, if it happens in vanilla as well, you can bisect it, even if
> that's somewhat painful as it'd take a day for each run...

It's really painful, as it's my main server here ;)

If you need anything else, please let me know. I'll get back to you about the vanilla kernel in 72h or so.
I have been seeding both Debian Lenny and the Fedora 11 Alpha DVD for the past 60h, and there are still no oopses or problems recorded in dmesg or the logs, other than a lot of "TCP: Treason uncloaked!" messages. I would consider this kernel (sys-kernel/vanilla-sources-2.6.27.10) stable, which suggests that something is wrong in sys-kernel/hardened-sources-2.6.2{6-r9,7-r7,8}. I'll try running sys-kernel/hardened-sources-2.6.28 without PaX and grsec to see whether it's those or something else that causes this problem.
I have now also concluded that there are no problems with sys-kernel/hardened-sources-2.6.28 when compiled without PaX and without Grsecurity. I have now moved on to include PaX but still leave Grsecurity out. If I hit any of the problems, I will test the sys-kernel/hardened-sources-2.6.28-r1 package with the same config and see if the problem might be fixed there.
I was just about to cheer, but this morning I got another oops :( This time it was with grsec disabled and PaX enabled. From what I can tell, it looks more or less like the same oops I posted in #1. From what the raw dmesg tells me, it has something to do with dm-6, so I ran xfs_check(8) on it, without any errors or warnings. Could it have something to do with the fact that I have placed that file system inside lvm2, and the lvm2 inside a LUKS partition?
Created attachment 185334 [details] dmesg output 2009-03-17 dmesg output 2009-03-17 running sys-kernel/hardened-sources-2.6.28 built 2009-03-10
Created attachment 185335 [details] Decoded kernel oops from sys-kernel/hardened-sources-2.6.28 (2009-03-17) Decoded dmesg output 2009-03-17 running sys-kernel/hardened-sources-2.6.28 built 2009-03-10
Created attachment 185336 [details] Kernel config for sys-kernel/hardened-sources-2.6.28 (2009-03-10) Kernel config for sys-kernel/hardened-sources-2.6.28 built 2009-03-10. This one is without grsec but with PaX.
Created attachment 185337 [details] System.map for sys-kernel/hardened-sources-2.6.28 built 2009-03-10 System.map for sys-kernel/hardened-sources-2.6.28 built 2009-03-10. This one is without grsec but with PaX.
I forgot to mention in comment #11 that I have now (2009-03-17) moved on to testing sys-kernel/hardened-sources-2.6.28-r3. Let's hope for the best ;)
i've looked some more and i really think it's a vanilla bug exposed by PaX. to test my theory out, can you run a known buggy kernel (i.e., something that you won't have to wait forever to oops) but with SANITIZE turned off? the idea is that in all this lockless page cache lookup code there's a very subtle use-after-free race bug that gets exposed under SANITIZE as it zeroes memory pages out on free immediately and if you look at the oops addresses, one was an almost valid heap address (one extra bit set) or 0s with two extra bits set, all a sign of a reuse somewhere else.
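To make the "extra bits" remark concrete: XOR-ing a value seen in an oops with the value one would expect there leaves exactly the bits that differ. A minimal Python sketch follows; the function name is hypothetical, and both values below are invented for illustration and are not taken from the attached oopses.

```python
def differing_bits(observed, expected, width=64):
    """Return the bit positions (0 = least significant) where the values differ."""
    diff = observed ^ expected
    return [bit for bit in range(width) if (diff >> bit) & 1]


# Hypothetical example 1: a pointer that is valid except for one extra bit.
valid_ptr = 0xFFFF880012340000            # invented address
corrupted_ptr = valid_ptr | (1 << 30)     # same address with bit 30 set
one_bit = differing_bits(corrupted_ptr, valid_ptr)

# Hypothetical example 2: a word that should be all zeroes but has two bits set.
two_bits = differing_bits(0x100400, 0)
```

Here `one_bit` holds a single position and `two_bits` holds two, which is the kind of pattern the comment describes. Values that differ in many positions would more likely be unrelated data.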
then a countertest against vanilla would be to extract the SANITIZE chunks only from PaX and apply them to vanilla and see if you get the same problem or not. something like

grepdiff --output-matching=hunk SANITIZE pax-linux-2.6.28.8-test23.patch

will get you the chunks, just ignore the Kconfig part and remove the ifdef's from the rest.
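What the grepdiff call does can be approximated in a few lines of Python: walk the unified diff and keep only the hunks in which some line contains the keyword. This is a simplified sketch and not a replacement for grepdiff; the function name and the sample patch are made up for illustration.

```python
import re

# Matches a unified-diff hunk header such as "@@ -10,2 +11,3 @@".
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,(\d+))? \+\d+(?:,(\d+))? @@")


def matching_hunks(patch_text, keyword):
    """Return a patch made of only those hunks that mention `keyword`."""
    lines = patch_text.splitlines()
    out = []
    header = []             # file header lines preceding the current hunks
    header_emitted = False  # header already copied to `out` for this file?
    seen_hunk = False       # at least one hunk seen since the header began?
    i = 0
    while i < len(lines):
        match = HUNK_HEADER.match(lines[i])
        if match is None:
            # Any non-hunk line after a hunk starts the next file's header.
            if seen_hunk:
                header, header_emitted, seen_hunk = [], False, False
            header.append(lines[i])
            i += 1
            continue
        # A missing count in the hunk header means 1.
        old_left = int(match.group(1) or 1)
        new_left = int(match.group(2) or 1)
        hunk = [lines[i]]
        i += 1
        while i < len(lines) and (old_left > 0 or new_left > 0):
            tag = lines[i][:1]
            if tag == "-":
                old_left -= 1
            elif tag == "+":
                new_left -= 1
            elif tag != "\\":   # context line, counted on both sides
                old_left -= 1
                new_left -= 1
            hunk.append(lines[i])
            i += 1
        seen_hunk = True
        if any(keyword in line for line in hunk):
            if not header_emitted:
                out.extend(header)
                header_emitted = True
            out.extend(hunk)
    return "\n".join(out) + ("\n" if out else "")


# Made-up sample patch: two files, three hunks, one of which mentions SANITIZE.
SAMPLE = """\
--- a/mm/example.c
+++ b/mm/example.c
@@ -1,2 +1,3 @@
 int a;
+/* SANITIZE: hypothetical marker */
 int b;
@@ -10,2 +11,2 @@
-int c;
+int d;
 int e;
--- a/fs/other.c
+++ b/fs/other.c
@@ -5 +5 @@
-old
+new
"""

result = matching_hunks(SAMPLE, "SANITIZE")
```

A keyword filter like this only sees hunks that contain the keyword itself, so supporting code that never mentions it is left out and has to be added by hand.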
Created attachment 185574 [details, diff] sanitize only patch of pax

I couldn't follow your statement about the extra bits, but I did as you asked. The grepdiff of SANITIZE didn't include all the chunks needed for the sanitize feature. This patch includes all that I *think* is needed, and it's in fact the one I'm running now (2.6.28.8 + this patch). I think this is one of the first times I am really looking forward to getting an oops ;)

I have also found one vmlinux that did oops, is this still of interest?
(In reply to comment #19)
> I couldn't follow your statement about the extra bits, but I did as you asked.
> The grepdiff of SANITIZE didn't include all the chunks needed for the sanitize
> feature.

ah yeah, i forgot about those bits but that's what we have a compiler for to figure out and complain about ;).

> I have also found one vmlinux that did oops, is this still of interest?

if you still have the oops itself, you can post it (decoded) as it's one more data point.
(In reply to comment #20)
> if you still have the oops itself, you can post it (decoded) as it's one more
> data point.

I have already attached those oopses.

Since I rebooted into 2.6.28.8 with that patch applied, I have gotten two warnings. They look more or less the same but don't appear to be related to the oops I'm trying to debug. Unfortunately, no more information is added when running this through ksymoops :/

------------[ cut here ]------------
WARNING: at net/core/dev.c:1536 skb_gso_segment+0x1b1/0x230()
Pid: 0, comm: swapper Tainted: G W 2.6.28.8-pax23 #1
Call Trace:
<IRQ> [<ffffffff8025cd1a>] warn_on_slowpath+0x5a/0x80
[<ffffffff805f89f2>] ? __nf_conntrack_find+0x172/0x180
[<ffffffff806939c0>] ? _read_unlock_bh+0x10/0x20
[<ffffffff8063fbcc>] ? ipt_do_table+0x39c/0x420
[<ffffffff80231709>] ? read_tsc+0x9/0x20
[<ffffffff805e07b1>] skb_gso_segment+0x1b1/0x230
[<ffffffff805e0a64>] dev_hard_start_xmit+0x1a4/0x290
[<ffffffff805f198e>] __qdisc_run+0x1ce/0x280
[<ffffffff805e1016>] dev_queue_xmit+0x4c6/0x510
[<ffffffff8060c56b>] ip_finish_output+0x10b/0x2d0
[<ffffffff8060c7e8>] ip_output+0xb8/0xc0
[<ffffffff8060b1f0>] ip_local_out+0x20/0x30
[<ffffffff8060ba83>] ip_queue_xmit+0x3d3/0x440
[<ffffffff805f91cc>] ? __nf_ct_refresh_acct+0xdc/0x150
[<ffffffff802668a6>] ? lock_timer_base+0x36/0x70
[<ffffffff805d8eba>] ? __copy_skb_header+0x7a/0x190
[<ffffffff805d8ff9>] ? __skb_clone+0x29/0x110
[<ffffffff8061ea4d>] tcp_transmit_skb+0x3cd/0x6a0
[<ffffffff8061fc92>] __tcp_push_pending_frames+0x1a2/0x7b0
[<ffffffff8061f547>] ? tcp_current_mss+0xb7/0xe0
[<ffffffff8061d42c>] tcp_rcv_established+0x38c/0x610
[<ffffffff806245ca>] tcp_v4_do_rcv+0x18a/0x230
[<ffffffff80624d4e>] tcp_v4_rcv+0x6de/0x7e0
[<ffffffff80608302>] ip_local_deliver_finish+0xb2/0x230
[<ffffffff80608520>] ip_local_deliver+0xa0/0xb0
[<ffffffff80608634>] ip_rcv_finish+0x104/0x330
[<ffffffff80608a10>] ip_rcv+0x1b0/0x2e0
[<ffffffff805e163f>] netif_receive_skb+0x2af/0x330
[<ffffffff8051d085>] sky2_poll+0x5a5/0xd20
[<ffffffff805e18b0>] net_rx_action+0x90/0x160
[<ffffffff80261cd0>] __do_softirq+0x90/0x170
[<ffffffff8022b45c>] call_softirq+0x1c/0x30
[<ffffffff8022ce49>] do_softirq+0x49/0xa0
[<ffffffff80261e65>] irq_exit+0x45/0x50
[<ffffffff8022ccfc>] do_IRQ+0xcc/0x1d0
[<ffffffff8022a716>] ret_from_intr+0x0/0xa
<EOI> [<ffffffff80231fa0>] ? mwait_idle+0x40/0x50
[<ffffffff80228062>] ? enter_idle+0x22/0x30
[<ffffffff80228124>] ? cpu_idle+0x54/0x70
---[ end trace 4d750d5d67772fde ]---
Finally, after 11 days I got the OOPS again. The kernel running is the 2.6.28.8 + 2.6.28.8_sanitize_pax23.patch. I'll attach the OOPS, if you need anything else, let me know.
Created attachment 186839 [details] Complete dmesg dump.
Created attachment 186841 [details] Decoded kernel oops from 2.6.28.8 with partial pax patch.
Created attachment 186842 [details] System.map for 2.6.28.8 with partial pax patch.
(In reply to comment #22)
> Finally, after 11 days I got the OOPS again. The kernel running is the 2.6.28.8
> + 2.6.28.8_sanitize_pax23.patch. I'll attach the OOPS, if you need anything
> else, let me know.

thank you, i think it pretty much establishes that there's a very subtle race somewhere in vanilla linux, probably in the lockless pagecache code. i'll call in the bigger guns as it's pretty much beyond me ;).
(In reply to comment #26)
> thank you, i think it pretty much establishes that there's a very subtle race
> somewhere in vanilla linux, probably in the lockless pagecache code. i'll call
> in the bigger guns as it's pretty much beyond me ;).

So, do you have anything else you wanna catch from the OOPSed kernel? Or can I reboot into a sane kernel? I think I'll return to running sys-kernel/hardened-sources-2.6.28-r7 for now without sanitize unless you got anything else you want to get tested.
(In reply to comment #27)
> (In reply to comment #26)
> > thank you, i think it pretty much establishes that there's a very subtle race
> > somewhere in vanilla linux, probably in the lockless pagecache code. i'll call
> > in the bigger guns as it's pretty much beyond me ;).
>
> So, do you have anything else you wanna catch from the OOPSed kernel? Or can I
> reboot into a sane kernel? I think I'll return to running
> sys-kernel/hardened-sources-2.6.28-r7 for now without sanitize unless you got
> anything else you want to get tested.
>

Torbjörn, nice work and thank you for all the testing. I think that is reasonable at this time.

Reassigning to kernel@g.o - this one needs to go up the mainline chain.
(In reply to comment #27)
> So, do you have anything else you wanna catch from the OOPSed kernel?

not really, but Nick Piggin, who i emailed earlier today, probably does, let's hope he'll show up soon ;).
After 36 days of no OOPS I finally got another one today. I do not know if this is related to the other oops in this bug, but I think they can be related due to the system_call_fastpath function being in the trace.
Created attachment 190530 [details] dmesg from sys-kernel/hardened-sources-2.6.28-r7 (2009-05-06)
Created attachment 190532 [details] Decoded kernel oops from sys-kernel/hardened-sources-2.6.28-r7 (2009-05-06)
Created attachment 190534 [details] System.map for sys-kernel/hardened-sources-2.6.28-r7 built 2009-03-31
Created attachment 190535 [details] Kernel config for sys-kernel/hardened-sources-2.6.28-r7 (built 2009-03-31)
(In reply to comment #30)
> After 36 days of no OOPS I finally got another one today. I do not know if this
> is related to the other oops in this bug, but I think they can be related due
> to the system_call_fastpath function being in the trace.

it's a separate problem: from what i managed to decode, it's a NULL deref in net/ipv4/tcp.c:tcp_sendmsg() where

skb = tcp_write_queue_tail(sk);

returns a NULL skb which is not checked later (other parts of the kernel seem to check it against NULL, although not everywhere). i have no idea what the right solution to this is, you should contact upstream about it. also it's likely a security problem given that probably a non-root user managed to trigger it and that this piece of code does a lot of things with that skb, something is likely good enough for more than an oops.
Hi Torbjörn, We had been working to diagnose your RAM on IRC a bit. Is it completely, 100% confirmed that this turns out to be a bad RAM issue after all? Can this bug be closed out?
(In reply to comment #36)
> Hi Torbjörn,
>
> We had been working to diagnose your RAM on IRC a bit. Is it completely, 100%
> confirmed that this turns out to be a bad RAM issue after all? Can this bug be
> closed out?
>

I was just about to post a comment about the current state of this bug.

As you know, Gordon, I swapped the memory with that of another identical setup (my workstation), and the problem then moved to the workstation. Since then I have bought new memory, and I did the replacement late yesterday. After the replacement I ran memtest86+ v2.01 for one pass, just to see if there were any obvious problems with the new memory, and then restarted the 'emerge -B1 kdelibs' loop. So far it has passed 58 builds, but I want to wait a couple of days more so that I reach at least 100 merges. So far, it's looking good.

It's kind of strange that memtest86+ didn't find the faulty cell/cells, even after 62 passes of the standard test. As the new memory was delayed by 24h, I just let memtest run for as long as I had no memory to switch to.

Conclusion: If no errors appear in the next day or so, I will close this bug before Monday. Does this sound ok, Gordon?

Thanks for all the time and effort you have put into this bug. It's much appreciated!
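The rebuild loop used here as a memory stress test can be sketched as follows: run the same command repeatedly, stop at the first failure, and count the successful runs. This is a minimal Python sketch, not the script actually used; the function name is made up, and the commented-out call simply reuses the emerge command quoted above.

```python
import subprocess


def count_successful_runs(command, target):
    """Run `command` up to `target` times and stop at the first non-zero exit.

    Returns the number of runs that completed successfully.
    """
    passed = 0
    for _ in range(target):
        if subprocess.run(command).returncode != 0:
            break
        passed += 1
    return passed


# Assumed usage, mirroring the loop described in the comment:
# count_successful_runs(["emerge", "-B1", "kdelibs"], 100)
```

A result equal to the target means every build passed. Anything lower gives the number of clean builds before the first failure.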
Closing as invalid as the root cause for this bug appears to be bad hw (has worked fine since I replaced the memory). kdelibs has now been merged 347 times without failures. Thanks for all your help and sorry for wasting your time.