I noticed that "emerge tetex" would trigger a kernel panic. I've tried rebuilding gentoo-sources to remove anything that looked potentially bleeding-edge, and I've also (since) built vanilla-sources without so much as USB or DMA over my IDE. <puzzled>

However, the truly freaky aspect of this was when, in my frustration, I went back to my mandrake 9.1 system and tried emerging tetex inside a chroot of the gentoo partitions (btw, I mounted "--bind" both proc and dev first). This mandrake system has been there a long time and I have developed on it every day for a *long* time without ever seeing any indication of a problem on this scale (so I think this argues against the obvious answer of "bad ram"). What happens is that "emerge tetex" inside the gentoo chroot actually sent the mandrake kernel into a panic too! I then got a loop of kernel errors from the USB subsystem until I powered off, hence my attempt to go back into gentoo, build a non-USB kernel, reboot, and see if that made a difference. It didn't. <even more puzzled>

I'm not sure how best to capture kernel crash information, but the last time I did this, this is what I captured (by hand) from the console - it includes the last couple of relevant lines from the tetex compilation, just for context and/or trying to reproduce this error:

make[3]: Leaving [......]/work/tetex-src-2.0.2/texk/web2c
[...]
gcc -DHAVE_CONFIG_H [...] -O3 -march=athlon -pipe -fomit-frame-pointer -c weave.c

Unable to handle kernel NULL dereference at virtual address 00000000
*pde = 00000000
Oops: 0000
CPU:    0
EIP:    0010 [<c0150022>]    Not tainted
EFLAGS: 00010292
eax: fffffff8   ebx: fffffff8   ecx: c030a9e8   edx: f7e6ffd0
esi: f7e7f400   edi: 00000000   ebp: 00000000   esp: f7e6ff08
ds: 0018   es: 0018   ss: 0018
Process kupdated (pid: 7, stackpage: f7e6f000)
Stack: f7e6ff14 f7e6e000 f7e7f45c f7e7f400 f7e6e560
Call Trace: [<c014ef95>] [<c013eb78>] [<c013eeb4>] [<c0107382>] [...]
Code: 8b 43 08 86 57 04 89 50 04 89 02 c7 47 04 00 00 00 00 8b 93

<1> Unable to handle NULL [...]
[...]
<0> Kernel panic: Attempted to kill the idle task!
In idle task - not syncing

<1> Unable to [...]
[...]
<0> Kernel panic: Aiee, killing interrupt handler!
In interrupt handler - not syncing

As I say, I jotted that down by hand from the console, so I may have typo'd something, and I skipped bits. This problem has generated different-looking panics before; e.g. I recall that at one point it mentioned a line number from the kernel's vmm code. This leads me (naively perhaps) to believe that the panic itself may be a couple of dominos down the line from the cause of the problem. <and finally, an overpowering puzzlement descends...>

Any help would be gratefully received, even if it's just some terse (but explanatory) information on how to diagnose this myself.

Reproducible: Always

Steps to Reproduce:
1. emerge tetex

Actual Results:
kernel panic

Expected Results:
Compilation, or at least soft errors in user-space.

The gcc compilation triggered what appears to be a vmm bug in the kernel, or at least a hardware bug that the kernel does not work around. This bug has triggered every time I tried emerging tetex - and given the length of time gcc spends on the file in question, it's probably only "reliable" because the tetex source gives gcc an unusually heavy workout. But as this also triggered on my mandrake dev system, which has built untold numbers of large (and debugging versions of) projects, I suspect it could be helped along by gentoo generating an aggressively-optimised glibc?! Just a thought.
Anyway, I've had this happen all the way from a pretty stock build of gentoo-sources through various other customisations, up to (a) vanilla-sources without USB, hotplug or alsa (and no X running), and (b) a chroot shell onto the mounted gentoo partitions from inside an historically-stable mandrake 9.1 system. This is the first time I've seen the mandrake kernel panic since I've been using it - and that's a long time, and includes daily thrashing of the GNU toolchain (I'm a developer). I can honestly say that I have no idea where to look next.
Perhaps an under-cooled CPU?
This might be a little < or very > problematic.

Can you try writing down the whole OOPS message and then typing it in by hand, attempting to use the same spacing as you get on the screen, so that 'ksymoops' can read it cleanly? Alternatively, if you can, take a photo of your OOPS and I'll turn it into text.

You seem to have this with both a USB kernel and a non-USB kernel. If you can generate OOPS data for both, that might be better, just in case the USB subsystem is causing this. Occasionally, extra components can simply mess things up < especially framebuffers! > or they can help, so it's good to have backtraces for both.

Does this happen at a random point in time during compilation, or at a fixed source file / Makefile step?

Can you please:

*** IMPORTANT: Do this on the kernel you got each OOPS from! ***

* $> emerge ksymoops
* $> ksymoops < file_containing_your_OOPS > file_with_output
* Paste file_with_output to bugzilla, along with the kernel .config and the kernel version that OOPS is from.

>> This is the first time I've seen the mandrake kernel panic since I've been
>> using it - and that's a long time, and includes daily thrashing of gnu tool
>> chain (I'm a developer). I can honestly say that I have no idea
>> where to look next.

Yes, it's horrible when your trusted toolchains randomly die. On different kernels. With what-looks-like-working hardware. We can't do that much without backtraces.
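To be explicit, the workflow I have in mind is roughly this (the file names below are just placeholders; as far as I remember, ksymoops picks up the running kernel's symbols by default, which is why it has to be run on the same kernel the OOPS came from):

$> emerge ksymoops
$> nano my_oops.txt          # type in the hand-copied OOPS text (any editor will do)
$> ksymoops < my_oops.txt > my_oops.decoded
$> # attach my_oops.decoded here, along with the .config and the kernel version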
Created attachment 21514 [details] ksymoops output
Created attachment 21515 [details] the original oops dump (including the gcc line that triggered it)
Thanks for the feedback. I've followed your suggestion and have just attached the details to the bug. Now, there are a variety of naive leads I thought about chasing on my own, including bundling all manner of extra information into the bug report, but I figure I should just give you what you asked for ASAP so you can tell me what I need to do next. FWIW, here are a couple of associated notes.

(1) Always the same place

I've reproduced the crash a couple of times since filing the bug, and it's always the exact same place in the tetex build where the crash happens.

(2) Always the same oops

At least I think so. The initial oops/panic sets off a series of subsequent panics down to where the init process starts to crap out. There are sometimes pauses of a couple of seconds between the panics, and the series of panics is different each time. However, the *first* oops/panic is the same, and occurs in 'kupdated'.

After running ksymoops and seeing the trace for myself, I took a hunch and figured it was reiserfs-related - I had been mounting / and /usr as "noatime" and without "notail" (reiser's efficient storage of small files is the main reason I'm using it). Wondering if this was an inode-handling bug, I tried a "mount -o remount" for the filesystems, removed /var/tmp/portage/tetex-.../, and ran emerge tetex again. Not only did I get a matching oops at the same place, but the binary details of that first oops were an *exact* match with the one I'm including below. However, all the oops/panics that followed it were totally different (including the number of them, the processes they occurred in, and the timing/pauses).

The ksymoops input was typed by hand. The obvious caveat is that there could be a typo, but I certainly started off pretty carefully (subsequent oopses got increasingly more touch-typing treatment). Also, I checked this first oops later when I got a reproduction of it and there was an exact match, so that first one at least *should* be reliable. My (uneducated) guess is that the subsequent oops/panics are less informative, and are just getting punished randomly for the initial problem.

I have got this bug not only with and without USB, but in gentoo-sources, vanilla-sources, and inside a chroot shell on mandrake 9.1 (with the stock kernel). For this reason, I didn't try producing different oops output from those other environments, as I feel you may have more useful things for me to try once you have this oops output.

As for the kernel .config - yup, I'll attach that too in a sec. The kernel is vanilla-sources (2.4.23), compiled for athlon.
Created attachment 21516 [details] kernel config for the kernel that generated the oops
OK, here go some random ideas from a quick analysis of the OOPSes...

-------------------------------------------------------------------------------

1)
VM: killing process cc1
swap_free: Unused swap offset entry 00100000

Looks like something strange is happening with memory allocation, and it seems to be chewing up all your RAM *and* swap. Run a "top -d 0.25" process in a console while the tetex build is running and see if anything strange happens with your "Mem:" and "Swap:".

>> {standard input}: Assembler messages:
>> {standard input}:4181: Warning: end of file not at end of a line; newline inserted
>> {standard input}:4571: Error: bad register name `%e'

And that's GCC dying as a result. The "end of file not at end of a line" warning is either the assembler process dying and GCC complaining about it, or the I/O subsystem beginning to get a headache and starting to close file throughput.

>> EIP; c0150022 <update_atime+162/2a0>   <=====
>> ecx; c030a9e8 <files_lock+24/23c>
>> Trace; c014ef95 <__mark_inode_dirty+2d5/430>

...corresponds with GCC dying and cleaning up some *.o files.

2)
<0>journal-003: journal_end: j_start (-122459872) is too high (device unknown-block(104,8))

... looks like a sign of an FS problem ...

3)
memory.c:100: bad pmd 00400000.
< continued >

... RAM situation getting desperate. 'init' will start killing anything it can.

4)
>> eax; c022c940 <scsi_free+7340/e170>
Trace; c014ed56 <__mark_inode_dirty+96/430>
Trace; c014ff2b <update_atime+6b/2a0>
Trace; c029abe8 <ip_nat_helper_unregister+2898/6e50>
Trace; c0133df4 <__alloc_pages+64/270>
Trace; c013d8bb <generic_commit_write+4b/90>
Trace; c029b04e <ip_nat_helper_unregister+2cfe/6e50>
Trace; c024e5cd <sock_create+58d/1240>
Trace; c024d58f <sock_map_fd+14f/180>
Trace; c014c8ed <lease_get_mtime+67d/c90>
Trace; c024e19d <sock_create+15d/1240>
Trace; c024f0c7 <sock_create+1087/1240>
Trace; c01073cf <__up_wakeup+12bf/1690>

... init is dead; the kernel should now clear up all modules and try to shut down the I/O subsystems to preserve data. Result: Wham, system is dead.

-------------------------------------------------------------------------------

6) The persistent problem here is RAM/Virtual Memory running mega-low. As a result, init should start eating up CPU and systematically shutting down processes to free RAM. An alternative is a corrupted swap, causing the system to flush the swap caches and thus run out of memory as swap is disabled -- if you look, you'll see that you get memory allocation errors *after* getting swap errors. Thus, your RAM may be perfectly fine and you might have a corrupted swap.

Can I suggest that you do some more *very heavy* compiling, which is also a good RAM/VM test?
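As a rough sketch of what I mean by a heavy-compile stress test (the path is only an example - any large C/C++ tree will do, and repeatedly rebuilding the kernel itself is a decent RAM/VM workout):

$> cd /usr/src/linux
$> for i in 1 2 3 4 5; do make clean && make bzImage || break; done
$> # watch "top -d 0.25" in another console while this runs; if RAM or swap usage
$> # goes wild, or the loop dies with random compiler/assembler errors, that
$> # points at memory rather than at tetex specifically.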
I've just come back from running this again. I did as you suggested and ran "top -d 0.25" in a separate console to monitor things. Note, I normally mount tmpfs on /tmp, but I avoided that this time just in case it helped.

Wrt memory, I stopped compiling highmem support into the kernel ages ago (hoping that would fix the problem) - so as a result linux only sees 960Mb of my 1.3Gb of RAM. But that should still be enough for most compilations, I'd hope.

Anyway, same problem, no dice. Swap was never touched, and of the 960Mb of RAM, the highest I saw usage get was around 500Mb (during the compilation of bibtex.c, if I'm not mistaken). During weave.c, the same oops happened at the same place and time, and top didn't seem to see any extreme memory consumption at all (again, swap was never touched and memory usage came nowhere near the max).

BTW: w.r.t. trying out some heavy compilations - the rest of this gentoo system has been built using the same default compiler/glibc, mostly in mandrake/chroot but some of it from inside the gentoo system itself. As I've seen this tetex-triggered oops in both, I would have thought that any workload-related bug would have already shown up. It has already built its way through kdelibs and a few other heavy things like that, and if I wanted something heavy to try out, I wouldn't find a much better selection than the stuff I've built so far. Also, the "weave.c" file in tetex that triggers gcc to oops the kernel doesn't even seem to be the most demanding file in the tetex compilation. Mind you, I can't qualify that with data, but it does "seem" that way.

Any advice on how to proceed would be most welcome. In the meantime, I might try my hand at building a UML kernel and producing a root fs image to try and reproduce this problem in user space - maybe that will choke on the same thing and give me a chance to fire up gdb? ... <shrug> What else? Should I rebuild a kernel with debugging? (how?) Are there any obvious things to eliminate by disabling or tweaking? Are there any good mem/swap testers you could recommend?
Can you try manually building tetex in user-space without using portage [ just compile; no need to run "make install" ], in case this is a portage / sandbox bug?

Can you try doing "emerge tetex" in a TTY console?

Can you try *enabling* highmem support, just in case? Although I don't think this would yield a result.
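Something along these lines should do for the manual build; I'm guessing at the distfile name from your build log, so adjust the path if yours differs, and use the same CFLAGS you gave portage:

$> tar xzf /usr/portage/distfiles/tetex-src-2.0.2.tar.gz
$> cd tetex-src-2.0.2
$> CFLAGS="-O3 -march=athlon -pipe -fomit-frame-pointer" ./configure
$> make     # no "make install" - we only care whether the compile survives weave.c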
This doesn't appear to be a sandbox problem. I reproduced the kernel crash using a non-X console on mandrake, chroot'ed into the gentoo partitions, building tetex as a non-root user from source without the use of emerge. And even if it had been a sandbox problem, running it all within a virtual UML kernel should at worst have trapped out to the user-space kernel. This suggests to me that there is something at the instruction level that is simply trapped badly by the kernel (or is a bug in my hardware). emerge inside a UML gentoo kernel/system running as a non-root user on my mandrake host still fried the native host kernel. It doesn't get much worse than that, IMHO.

Note, if I tone down my CFLAGS, this problem isn't triggered - so the CFLAGS have some role in exposing the problem ("-O3 -march=athlon -fomit-frame-pointer -pipe" fails, but changing -O3 to -O2 is OK). However, the fact that this can trigger kernel explosions from non-root processes is frightening - more so because it happens on both the gentoo kernels and the mandrake ones. I'm worried that even if I get compilation to behave, this problem can show up again at any time - i.e. there exists a sequence of instructions that, if executed as any user, will completely kill my system.

What next? If someone wants to look at this, make contact with me and I can let you into my system, and I'll copy the UML+root-image for you to play with (it'll still crash my host system if you let it, so I'd need to sync up). If this isn't of interest, tomorrow I will simply /dev/null everything and rebuild my gentoo system from scratch using softer CFLAGS and forget about it. It still sucks, but if I get a system that is stable (like mandrake is "stable", even if vulnerable to this problem) then I'll make do.
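For the record, the "softer" settings I'd rebuild with would look something like this in /etc/make.conf - only the -O3 to -O2 change is what actually made the difference in my tests, the rest stays as before:

CFLAGS="-O2 -march=athlon -pipe -fomit-frame-pointer"
CXXFLAGS="${CFLAGS}"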
Can I please have the versions of anything and everything compiler-related? I would like to produce some similar object code here and see whether I can track down the problem, and whether I get similar behaviour.
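Something like the following, run inside the gentoo chroot (so it reports the toolchain that actually built weave.c), should cover it:

$> emerge info
$> gcc -v
$> as --version | head -n 1
$> ld --version | head -n 1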
This turned out to be a very perplexing and strange RAM hardware fault. Thanks to Geoff for all his patience and efforts :-)