Bug 165175 - 2006.1 64Bit Live CD kernel 2.6.17-gentoo-r7 bug
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: [OLD] Core system
Hardware: AMD64 Linux
Importance: High critical
Assignee: Gentoo Kernel Bug Wranglers and Kernel Maintainers
Reported: 2007-02-03 21:54 UTC by chad heuschober
Modified: 2007-02-06 19:05 UTC

System specs (sys.specs.txt,246 bytes, text/plain)
2007-02-03 21:55 UTC, chad heuschober
Dump from error1 (error1.dump.txt,1.63 KB, text/plain)
2007-02-03 21:55 UTC, chad heuschober

Description chad heuschober 2007-02-03 21:54:48 UTC
During seemingly random but CPU-intensive tasks the system faults with one of the following: a Segmentation Fault, CPU LOCK, General Protection Fault, or Kernel BUG.

Thus far, a more complete output cannot be ascertained as the system is not stable enough for a kernel to be built or for grub to be installed.

Kernel boot line: gentoo acpi=off noapic nolapic nosmp
BIOS options disabled: APIOC, AMD Cool'N'Quiet, ACPI, [All unused peripherals]
RAM: Underclocked from DDR2 800 1T to DDR2 533 2T
LiveCD verified against ISO, ISO checked against md5 sum.
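
The ISO check mentioned above can be sketched roughly as follows in Python. This is only an illustration: the function names, the chunk size, and any path or digest passed in are placeholders and not part of the report.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks
    so that a large ISO image does not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare the file's digest with a published MD5 sum (hypothetical input)."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

A matching sum only shows that the image was downloaded intact. It says nothing about whether the machine reads the burned disc or its own memory reliably.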

The quotes below under [actual results] are two examples of the problem. However, the error message has been known to change, and there are also fairly regular non-critical segmentation faults.

Reproducible: Always

Steps to Reproduce:
It /can/ happen during any of the following, and most often does during an emerge operation. The roughly tracked percentage of occurrence is given in parentheses.

1. Boot (no output given but system freezes) (8%)
2. Untar'ing stage (8%)
3. Untar'ing portage (4%)
4. emerge --sync (8%)
5. emerge [program] (68%)
6. compilation of kernel (4% -- usually I don't get this far)

Because this is on a LiveCD and I'm fairly new to the world of debugging, I'm just typing in the output I see on the screen and hoping I don't mistype. If there's a way to capture this output I'd love to hear it. I can reproduce these errors insofar as it's only a matter of time with use.
Actual Results:  
Two errors I had the time to capture are included below. Others occurred before I became convinced this was a software problem and were not tracked.

ERROR0 (2006-02-01)
general protection fault [1] SMP
Modules linked in: raid0 ipv6 pcspkr forcedeth dm_mirror dm_mod ata_piix sata_nv libata ohci_hcd usb_storage usbhid usbcore
Pid 31552, comm: sh Not tainted 2.6.17-gentoo-r7 #1
..... {dump} .... 

ERROR1* (2006-02-03)
Kernel BUG at mm/page_alloc.c:304
invalid opcode: 0000 [6] SMP
Modules linked in: usb_storage raid0 forcedeth dm_mirror dm_mod ata_piix sata_nv libata ohci_hcd usbhid usbcore
pid: 24371, comm: cc1plus Not tainted 2.6.17-gentoo-r7#1
.... (dump) ....

* There appear to be multiple errors, but because the system is frozen I can't scroll back to see the errors above this one, beyond the fact that the process in question is udevd.

This was initially believed to be a hardware issue, over five months ago, and hardware has been exchanged for different pieces or replaced multiple times.

Hardware has /never/ failed memtest86 or cpuburn; however, replacement has occurred regardless. CPU temps are also good, averaging around 39C and only touching 41C at most immediately after a failure. All filesystems pass fsck before mounting and use.

CPU: AMD X2 3800+ (65W AM2) - 1x
RAM: OCZ Gold Series DDR2 800 - 2x
PSU: Zalman 460W - 1x
HDD: WD Raptor 36GB - 1x

MOBO: Gigabyte GA-M55Plus --> ASRock ALiveXFire-eSATA2
MOBO: ASRock ALiveXFire-eSATA2 --> MSI K9N Platinum
PSU: Zalman 460W --> Antec Truepower 380W
PSU: Antec Truepower 380W --> Zalman 460W (rep)
DVD: NEC 3520 --> Pioneer DVR-110DBK
HDD: WD Raptor 36GB --> Samsung Spinpoint-P 160GB
HDD: Samsung Spinpoint-P 160GB --> Hitachi Deskstar 80GB
HDD: Hitachi Deskstar 80GB --> WD Raptor 36GB (rep)
Comment 1 chad heuschober 2007-02-03 21:55:28 UTC
Created attachment 109047
System specs
Comment 2 chad heuschober 2007-02-03 21:55:56 UTC
Created attachment 109048
Dump from error1
Comment 3 chad heuschober 2007-02-06 06:58:16 UTC
I wanted to add an update. Until recently 'nix has been my only possible testing platform.

Both memtest86 and cpuburn have run without error (cpuburn was run in Knoppix 5.1.1, kernel 2.6.19).

Today I found enough time to pull out an old PATA drive (circa 1997) and put XP on it, in the hope of comparing stability (and finding out whether or not this really /is/ a hardware problem).

Unfortunately this isn't ideal for comparative testing, because it is a different drive on a different bus (IDE) on a different arch (32-bit), but I am desperate for leads and will later try 32-bit Gentoo on the same drive.

The results?

Installation proceeded without a problem. The upgrade to Service Pack 2, driver and CD burning software also went without failure.

The BIOS on my motherboard was 3 versions behind, and the board lacks a floppy header, so a Windows installation is necessary for a BIOS upgrade. The upgrade failed with an error that it couldn't 'map main memory'.

I tried superpi. The first few small iterations were fine, but when I started a medium-sized iteration I received a failure. This failure then seemed to affect subsequent iterations of superpi regardless of calculation size: runs that had completed successfully before now failed.

That said, this /seems/ more like a hardware issue and something I shouldn't trouble you with, but the replaced hardware list says otherwise, as does the stability of cpuburn in Knoppix over a 48 hour period.

Any help is appreciated.
Comment 4 Daniel Drake (RETIRED) gentoo-dev 2007-02-06 19:05:57 UTC
Sorry, I don't have any ideas. As you have observed, this doesn't really make any sense but does smell like a hardware problem.

If you do decide this is a Linux kernel issue, then the next step would be to figure out how you can boot the latest development kernel (currently 2.6.20) on this system, then post a complete report of the first few oopses to the Linux kernel mailing list. Someone should be able to assess whether they are bogus (i.e. hardware is screwed) or not.
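
When picking out the first few oopses from a captured log, the bracketed number in each header line (the [1] and [6] in the quotes above) is the kernel's running oops count, so the lowest numbers mark the earliest faults. A minimal Python sketch for listing them in order follows. It assumes SMP-style header lines like the ones quoted in this report; the function name and regular expression are illustrative only.

```python
import re

# Matches header lines such as "general protection fault [1] SMP" or
# "invalid opcode: 0000 [6] SMP". The number in brackets is the oops counter.
OOPS_HEADER = re.compile(r"^(.+?)\s*\[(\d+)\]\s+(?:PREEMPT\s+)?SMP\b")


def find_oops_headers(log_text):
    """Return (counter, description) pairs for every oops header line,
    sorted so that the earliest oops comes first."""
    found = []
    for line in log_text.splitlines():
        match = OOPS_HEADER.match(line.strip())
        if match:
            found.append((int(match.group(2)), match.group(1)))
    return sorted(found)
```

Run over a log containing both quoted errors, this would list the general protection fault (counter 1) ahead of the invalid opcode (counter 6). A first captured counter of 6 suggests that five earlier oopses scrolled past unrecorded.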