My machine is used for MythTV with a 64-bit Gentoo install. I also have a 32-bit chroot environment I use to build 32-bit applications. Ever since I upgraded from gentoo-sources 2.6.12-gentoo-r9 to gentoo-sources 2.6.18-gentoo-r6, the machine locks up constantly and consistently when entering the 32-bit chroot Gentoo environment and using emerge. I recently tried gentoo-sources 2.6.19-gentoo-r4 and saw the same behavior.

Reproducible: Always

Steps to Reproduce:
1. Log in as root
2. linux32 chroot /mnt/gentoo32 /bin/bash
3. env-update && source /etc/profile
4. while [ 1 ]; do emerge --info; done

Actual Results: Within about two minutes the machine locks up; the keyboard, mouse, and network all stop responding. There is no crash in the logs and no crash output on the console.

Expected Results: The machine should not lock up. Using the 64-bit environment does not cause any lockups. I also tried running some 32-bit applications natively from the 64-bit environment, and this does not cause a lockup either.
Created attachment 109949: Kernel config, gentoo-sources-2.6.18-r6
Created attachment 109951: lspci -vvv
Created attachment 109953: emerge --info for 64-bit install
Created attachment 109954: emerge --info for 32-bit environment
Do you have access to a serial console to debug this? When it has crashed, have you tried using magic sysrq to reboot the system? Can you reproduce this on 2.6.20?
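For reference, a rough sketch of enabling magic SysRq at runtime, assuming CONFIG_MAGIC_SYSRQ is already compiled into your kernel:

# echo 1 > /proc/sys/kernel/sysrq

Then, when it hangs, try Alt+SysRq+s (sync filesystems), Alt+SysRq+u (remount read-only), and Alt+SysRq+b (reboot) on the console. If even Alt+SysRq+b gets no reaction, the machine is wedged at a very low level, which is useful information in itself.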
I recompiled the kernel with magic SysRq, but unfortunately the keyboard doesn't respond at all when it locks up. This machine doesn't have a serial port either. However, I think I can borrow a serial card, so I will get back to you.
You can use the following process to track this down. Be warned, it will take quite a bit of time.

First, test the latest development kernel (currently 2.6.21-rc4) to ensure the bug hasn't already been fixed.

Next, some basic verification to confirm that vanilla 2.6.12 is OK and vanilla 2.6.18 is not. Check out the kernel tree from git:

# emerge -n dev-util/git
# git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git linux-git
# cd linux-git
# git branch test
# git checkout test
# git reset --hard v2.6.12

At this point you have the kernel sources for vanilla 2.6.12 in front of you. Build the kernel as usual, boot into it, and confirm that you don't have the lockup issue. Assuming that works OK, go back into that directory and:

# git reset --hard v2.6.18

Now you have vanilla 2.6.18 in front of you. Build that as usual, boot into it, and confirm it locks up in the way you describe here.

Assuming the above verification produces the expected results, go back into linux-git and start a bisection:

# git checkout master
# git bisect start
# git bisect bad v2.6.18
# git bisect good v2.6.12

Now follow the instructions here:
http://www.reactivated.net/weblog/archives/2006/01/using-git-bisect-to-find-buggy-kernel-patches
and you should end up with the exact commit which introduced the bug.
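Each bisection step then looks roughly like this (a sketch; the revisions git checks out for you will vary):

# make oldconfig && make && make modules_install install
(reboot into the new kernel and run your reproduction test)
# git bisect good    (if the kernel survived the test)
# git bisect bad     (if it locked up)

git checks out the next candidate revision after each good/bad mark, and after a dozen or so iterations it prints the first bad commit.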
I tested my old 2.6.12 kernel again before running through your instructions, ran my benchmark test, and it locked up in the 32-bit chroot. It did not lock up as fast as the newer kernels do, though; it took about 30 minutes. I rebooted and tried again with the same result: locked up after a while. Next I tried running my benchmark in the 64-bit environment (not in the chroot). I ran the test for hours with no lockup. I think it's safe to say this issue only appears in the 32-bit chroot. Where should I go from here?
Hmm OK, don't bother with the bisection then. I was under the impression that 2.6.12 was working fine. This sounds like it may be a hardware problem.
Can you try running memtest off a Gentoo LiveCD for a few passes?
24 hours, 52 passes, 0 errors on memtest86. I've been working on this here and there trying to troubleshoot it but I'm running out of ideas.
Cory: emerge is the *only* command that causes this lockup in the 32-bit environment? I would be inclined to think that this is a hardware problem as well, except for the fact that nothing weird is happening inside the 64-bit env. I know this sounds like a stretch, but I would personally back up my 32-bit environment and reinstall it (or simply install another 32-bit environment on the system and leave the original intact) to see if the problem is indeed reproducible. I'm not sure you're willing to go that far to get everything fixed up, but IMHO that's a relatively logical next step in the troubleshooting process.
I actually already tried that recently. I downloaded the latest 32-bit stage3 install, and it locked up when I tested it.
Can you try this again using the latest Gentoo kernel source (2.6.21) and see if the problem persists? There's always the possibility that something that changed upstream has fixed the problem.
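Something along these lines should do it (a sketch; the exact package atom and revision may differ, and you may need to keyword it if it isn't marked stable yet):

# emerge -a =sys-kernel/gentoo-sources-2.6.21

Then build and boot it with your usual config and rerun your emerge loop in the chroot.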
Could you also attach the output of dmesg after startup?
I tried kernel 2.6.21-gentoo, but it still locks up within a few seconds. I tested on the 64-bit install again and ran the test for 4 days straight with no lockups at all. I also tried another 32-bit application inside the chroot instead of emerge to see if it would lock up: I ran mplayer in a loop playing an audio file, and it ran for over a day with no lockups. I don't think it is quite as resource-intensive as emerge is, though. Any other suggestions on what I can benchmark in the 32-bit chroot?
Created attachment 118001 [details] kernel config for 2.6.21-gentoo
Created attachment 118002 [details] dmesg for kernel 2.6.21-gentoo
Do you have another machine that you can use while testing? If so, you should be able to use netconsole to catch any last words from the kernel. From your config it appears you are already compiling it in. Its configuration and use are described in Documentation/networking/netconsole.txt under your kernel source tree.

It might also be interesting to see exactly what is happening when the crash occurs. Try emerging dev-util/strace and running "strace -f emerge --info > /dev/null" instead of just "emerge --info". What are the last few lines on the console when it dies, and are they consistent at all?

Is the chroot on an XFS filesystem? If so, and you have some spare space lying around, it might be worth creating an ext3 filesystem, copying the data across, and seeing whether the problem is reproducible on that.

One last thing (for now): you have MCE support compiled into your kernel. Could you emerge app-admin/mcelog and run /usr/sbin/mcelog (as root) after a crash? That might tell you if something is going wrong with your hardware.
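On the netconsole point, here is a sketch of the setup with made-up addresses (substitute your own IPs, interface name, and the MAC address of the receiving machine, or of your gateway if it sits on another subnet):

# modprobe netconsole netconsole=6665@192.168.0.10/eth0,6666@192.168.0.11/00:11:22:33:44:55

and on the receiving machine:

# nc -u -l -p 6666 | tee netconsole.log

Also make sure the console log level is high enough (e.g. "dmesg -n 8") so that all kernel messages actually go out over the wire.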
Sorry, Daniel just kindly pointed out to me that mcelog is non-persistent, so scratch that idea. It won't help us with this bug.
I turned on logging via netconsole and raised the log level to 9. Unfortunately, it didn't send any information when the machine locked up. I tried using strace, but the output wasn't consistent across a few tries. I have a 32-bit chroot on both XFS and ext3; same result on both.
With XFS ruled out, no messages from the kernel, and no particular syscall causing the crash, it sounds very much like a hardware issue. If you have the opportunity, you could try swapping in a different CPU. Other than that I'm out of ideas, sorry.
I agree with the above, and I do suspect hardware here. It may sound odd, but I have seen faulty processors in the past which work fine in 32-bit mode but crash horribly in 64-bit mode, and I guess the opposite is not impossible. If you're convinced this must be a kernel bug, I suggest you continue this by mailing the Linux kernel mailing list. See http://www.tux.org/lkml There's not a lot more we can do downstream -- sorry.