My machine is used for MythTV with a 64-bit Gentoo install. I also have a 32-bit chroot environment I use to build 32-bit applications. Ever since I upgraded from gentoo-sources 2.6.12-gentoo-r9 to gentoo-sources 2.6.18-gentoo-r6, the machine locks up constantly and consistently when entering the 32-bit chroot Gentoo environment and using emerge. I recently tried gentoo-sources 2.6.19-gentoo-r4 and saw the same behavior.

Reproducible: Always

Steps to Reproduce:
1. Log in as root
2. linux32 chroot /mnt/gentoo32 /bin/bash
3. env-update && source /etc/profile
4. while [ 1 ]; do emerge --info; done

Actual Results: Within about two minutes the machine locks up; the keyboard, mouse, and network all stop responding. There is no crash in the logs and no crash output on the console.

Expected Results: The machine should not lock up. Using the 64-bit environment does not cause any lockups. I also tried running some 32-bit applications natively from the 64-bit environment, and this does not cause a lockup either.
Created attachment 109949: Kernel config, gentoo-sources-2.6.18-r6
Created attachment 109951: lspci -vvv
Created attachment 109953: emerge --info for 64-bit install
Created attachment 109954: emerge --info for 32-bit environment
Do you have access to a serial console to debug this? When it has crashed, have you tried using magic sysrq to reboot the system? Can you reproduce this on 2.6.20?
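For reference, a rough sketch of enabling magic SysRq at runtime, assuming CONFIG_MAGIC_SYSRQ is already compiled into your kernel:

# echo 1 > /proc/sys/kernel/sysrq

Then, when it hangs, try Alt+SysRq+s (sync filesystems), Alt+SysRq+u (remount read-only), and Alt+SysRq+b (reboot) on the console. If even Alt+SysRq+b gets no reaction, the machine is wedged at a very low level, which is useful information in itself.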
I recompiled the kernel with magic SysRq, but unfortunately the keyboard doesn't respond at all when it locks up. This machine doesn't have a serial port either. However, I think I can borrow a serial card, so I will get back to you.
You can use the following process to track this down. Be warned, it will take quite a bit of time.

First, test the latest development kernel (currently 2.6.21-rc4) to ensure the bug hasn't already been fixed.

Next, some basic verification to confirm that vanilla 2.6.12 is OK and vanilla 2.6.18 is not. Check out the kernel tree from git:

# emerge -n dev-util/git
# git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git linux-git
# cd linux-git
# git branch test
# git checkout test
# git reset --hard v2.6.12

At this point you have the kernel sources for vanilla 2.6.12 in front of you. Build the kernel as usual, boot into it, and confirm that you don't have the lockup issue. Assuming that works OK, go back into that directory and:

# git reset --hard v2.6.18

Now you have vanilla 2.6.18 in front of you. Build that as usual, boot into it, and confirm it locks up in the way you describe here.

Assuming the above verification produces the expected results, go back into linux-git and start a bisection:

# git checkout master
# git bisect start
# git bisect bad v2.6.18
# git bisect good v2.6.12

Now follow the instructions here:
http://www.reactivated.net/weblog/archives/2006/01/using-git-bisect-to-find-buggy-kernel-patches
and you should end up with the exact commit which introduced the bug.
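Each bisection step then looks roughly like this (a sketch; the revisions git checks out for you will vary):

# make oldconfig && make && make modules_install install
(reboot into the new kernel and run your reproduction test)
# git bisect good    (if the kernel survived the test)
# git bisect bad     (if it locked up)

git checks out the next candidate revision after each good/bad mark, and after a dozen or so iterations it prints the first bad commit.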
I tested my old 2.6.12 kernel again before running through your instructions, ran my benchmark test, and it locked up in the 32-bit chroot. It did not lock up as fast as the newer kernels do, though; it took about 30 minutes. I rebooted and tried again with the same result: locked up after a while. Next I tried running my benchmark in the 64-bit environment (not in the chroot). I ran the test for hours with no lockup. I think it's safe to say this issue only appears in the 32-bit chroot. Where should I go from here?
Hmm OK, don't bother with the bisection then. I was under the impression that 2.6.12 was working fine. This sounds like it may be a hardware problem.
Can you try running memtest off a Gentoo LiveCD for a few passes?
24 hours, 52 passes, 0 errors on memtest86. I've been working on this here and there trying to troubleshoot it but I'm running out of ideas.
Cory: emerge is the *only* command that causes this lockup in the 32-bit environment? I would be inclined to think that this is a hardware problem as well, except for the fact that nothing weird is happening inside the 64-bit env. I know this sounds like a stretch, but I would personally back up my 32-bit environment and reinstall it (or simply install another 32-bit environment on the system and leave the original intact) to see if the problem is indeed reproducible. I'm not sure you're willing to go that far to get everything fixed up, but IMHO that's a relatively logical next step in the troubleshooting process.
I actually already tried that recently. I downloaded the latest 32-bit stage3 install, and it locked up when I tested it.
Can you try this again using the latest Gentoo kernel source (2.6.21) and see if the problem persists? There's always the possibility that something that changed upstream has fixed the problem.
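Something along these lines should do it (a sketch; the exact package atom and revision may differ, and you may need to keyword it if it isn't marked stable yet):

# emerge -a =sys-kernel/gentoo-sources-2.6.21

Then build and boot it with your usual config and rerun your emerge loop in the chroot.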
Could you also attach the output of dmesg after startup?
I tried kernel 2.6.21-gentoo, but it still locks up within a few seconds. I tested on the 64-bit install again and ran the test for 4 days straight with no lockups at all. I also tried another 32-bit application inside the chroot instead of emerge to see if it would lock up: I ran mplayer in a loop playing an audio file, and it ran for over a day with no lockups. I don't think it is quite as resource-intensive as emerge is, though. Any other suggestions on what I can benchmark in the 32-bit chroot?
Created attachment 118001 [details] kernel config for 2.6.21-gentoo
Created attachment 118002 [details] dmesg for kernel 2.6.21-gentoo
Do you have another machine that you can use while testing? If so, you should be able to use netconsole to catch any last words from the kernel. From your config it appears you are already compiling it in. Its configuration and use are described in Documentation/networking/netconsole.txt under your kernel source tree.

It might also be interesting to see exactly what is happening when the crash occurs. Try emerging dev-util/strace and running "strace -f emerge --info > /dev/null" instead of just "emerge --info". What are the last few lines on the console when it dies, and are they consistent at all?

Is the chroot on an XFS filesystem? If so, and you have some spare space lying around, it might be worth creating an ext3 filesystem, copying the data across, and seeing whether the problem is reproducible on that.

One last thing (for now): you have MCE support compiled into your kernel. Could you emerge app-admin/mcelog and run /usr/sbin/mcelog (as root) after a crash? That might tell you if something is going wrong with your hardware.
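On the netconsole point, here is a sketch of the setup with made-up addresses (substitute your own IPs, interface name, and the MAC address of the receiving machine, or of your gateway if it sits on another subnet):

# modprobe netconsole netconsole=6665@192.168.0.10/eth0,6666@192.168.0.11/00:11:22:33:44:55

and on the receiving machine:

# nc -u -l -p 6666 | tee netconsole.log

Also make sure the console log level is high enough (e.g. "dmesg -n 8") so that all kernel messages actually go out over the wire.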
Sorry, Daniel just kindly pointed out to me that mcelog is non-persistent, so scratch that idea. It won't help us with this bug.
I turned on logging via netconsole and raised the log level to 9. Unfortunately, it didn't send any information when the machine locked up. I tried using strace, but the output wasn't consistent across a few tries. I have a 32-bit chroot on both XFS and ext3; same result on both.
With XFS ruled out, no messages from the kernel, and no particular syscall causing the crash, it sounds very much like a hardware issue. If you have the opportunity, you could try swapping in a different CPU. Other than that I'm out of ideas, sorry.
I agree with the above, and I do suspect hardware here. It may sound odd, but I have seen faulty processors in the past which work fine in 32-bit mode but crash horribly in 64-bit mode, and I guess the opposite is not impossible. If you're convinced this must be a kernel bug, I suggest you continue this by mailing the Linux kernel mailing list. See http://www.tux.org/lkml There's not a lot more we can do downstream -- sorry.