Bug 158492 - HP ML-150, Was: Kernel Crashes, multiple kernels, possibly XFS or MD
Summary: HP ML-150, Was: Kernel Crashes, multiple kernels, possibly XFS or MD
Status: RESOLVED FIXED
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: [OLD] Core system
Hardware: x86 Linux
Importance: Highest critical
Assignee: Gentoo Kernel Bug Wranglers and Kernel Maintainers
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-12-18 13:36 UTC by John Huttley
Modified: 2007-02-19 01:05 UTC
0 users

See Also:
Package list:
Runtime testing required: ---


Attachments
Kernel 2.6.18-gentoo-r4.config (kernel-config-x86-2.6.18-gentoo-r4,42.00 KB, text/plain)
2006-12-18 13:43 UTC, John Huttley
2.6.18-gentoo-r4 Console Log (smp) (bootlog-18-gentoo-r4-smp.txt,30.57 KB, text/plain)
2006-12-18 13:45 UTC, John Huttley
System map for Gentoo (System.map-genkernel-x86-2.6.18-gentoo-r4,963.12 KB, text/plain)
2006-12-18 13:48 UTC, John Huttley
2.6.19.1 Config (UP) (kernel-config-x86-2.6.19.1,42.46 KB, text/plain)
2006-12-18 13:49 UTC, John Huttley
2.6.19.1 (UP) Console log (bootlog-19.1.txt,81.18 KB, text/plain)
2006-12-18 13:50 UTC, John Huttley
2.6.19.1 (UP) system.map (System.map-genkernel-x86-2.6.19.1,983.11 KB, text/plain)
2006-12-18 13:51 UTC, John Huttley
2.6.19.1 (SMP) Config (kernel-config-x86-2.6.19.1-smp,42.67 KB, text/plain)
2006-12-18 13:51 UTC, John Huttley
2.6.19.1 (SMP) Console (bootlog-19.1-smp.txt,45.63 KB, text/plain)
2006-12-18 13:52 UTC, John Huttley
2.6.19.1 (SMP) system.map (System.map-genkernel-x86-2.6.19.1-smp.zip,264.27 KB, application/octet-stream)
2006-12-18 13:56 UTC, John Huttley
Boot and Oops when starting samba (console.txt,24.49 KB, text/plain)
2007-01-30 20:47 UTC, John Huttley
crash log on starting the 2006.1 X64 Live DVD (x64-console.txt,8.97 KB, text/plain)
2007-01-31 01:31 UTC, John Huttley
Crash on 2.6.20-rc7 (20-rc7.txt,35.82 KB, text/plain)
2007-02-01 00:58 UTC, John Huttley
crash on 2.6.15 (15.txt,18.61 KB, text/plain)
2007-02-01 01:12 UTC, John Huttley

Description John Huttley 2006-12-18 13:36:29 UTC
This covers two systems and at least three kernels, so I'll start by giving some background.

A company is using Linux as a file server. They do video production, so they need lots of space: 500 GB in one share.

I installed an 805 system (SMP, x86) with 1 GB RAM and 4x 325GB WD SATA disks.
These were set up as a RAID1 MD partition to hold the system and a RAID5 MD for data,
managed by EVMS, with XFS as the filesystem.

I also tried compiling without SMP.

This proved very unreliable, so I replaced the PSU, added another fan, and put it on a UPS. It still kept crashing, so I replaced the motherboard and the RAM. It still crashed.

We had had enough of this, so they bought an HP ML-150 (G3) with SAS storage (LSI Logic).
This has a 6-bay hot-plug tray that can take SAS and SATA drives. The controller can create RAID 0 arrays in hardware.
It has 2x 36 GB 15K SAS drives as a RAID1 array, and 4 SATA drives. These are all HP OEM drives. The system is on the SAS drives.

The first cut had the 2 SAS drives as a RAID1 logical drive with the system on it (EVMS, XFS).

The 4 SATA drives were set up as RAID5 (EVMS, XFS). As soon as any data was on the array it started to crash. So I tried a vanilla 2.6.19.1 kernel in UP and SMP modes, and it still crashed. (I have the system maps, configs and console logs for these three kernels.)
It got so bad that it would crash as soon as the array was accessed.
I suspected something nasty in the RAID5 implementation.

So I used the SAS controller to make 2 of the SATA drives into a RAID1 logical drive. Then I made the two remaining SATA drives into a RAID0 array in EVMS.

It seems slightly better; however, only a short way into copying data onto the RAID0 share, the system locks up with no error.

I have marked the bug as priority 1 and Critical because it covers multiple kernels and multiple machines, and the company really, really needs that storage.

I will start uploading the supporting data shortly.
--John
Comment 1 John Huttley 2006-12-18 13:41:29 UTC
emerge --info

Gentoo Base System version 1.12.6
Portage 2.1.1-r2 (default-linux/x86/2006.1, gcc-4.1.1, glibc-2.4-r4, 2.6.18-gentoo-r4 i686)
=================================================================
System uname: 2.6.18-gentoo-r4 i686 Intel(R) Xeon(R) CPU            5110  @ 1.60GHz
Last Sync: Mon, 18 Dec 2006 20:00:02 +0000
ccache version 2.3 [enabled]
app-admin/eselect-compiler: [Not Present]
dev-java/java-config: [Not Present]
dev-lang/python:     2.4.3-r4
dev-python/pycrypto: 2.0.1-r5
dev-util/ccache:     2.3
dev-util/confcache:  [Not Present]
sys-apps/sandbox:    1.2.17
sys-devel/autoconf:  2.13, 2.60
sys-devel/automake:  1.4_p6, 1.5, 1.6.3, 1.7.9-r1, 1.8.5-r3, 1.9.6-r2
sys-devel/binutils:  2.16.1-r3
sys-devel/gcc-config: 1.3.14
sys-devel/libtool:   1.5.22
virtual/os-headers:  2.6.17-r2
ACCEPT_KEYWORDS="x86"
AUTOCLEAN="yes"
CBUILD="i686-pc-linux-gnu"
CFLAGS="-march=pentium4 -O2 -pipe"
CHOST="i686-pc-linux-gnu"
CONFIG_PROTECT="/etc /etc/postfix /etc/samba /etc/sasl2 /etc/squid"
CONFIG_PROTECT_MASK="/etc/env.d /etc/gconf /etc/revdep-rebuild /etc/terminfo"
CXXFLAGS="-march=pentium4 -O2 -pipe"
DISTDIR="/usr/portage/distfiles"
FEATURES="autoconfig ccache distlocks metadata-transfer parallel-fetch sandbox sfperms strict"
GENTOO_MIRRORS="http://distfiles.gentoo.org http://distro.ibiblio.org/pub/linux/distributions/gentoo"
MAKEOPTS="-j3"
PKGDIR="/usr/portage/packages"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --delete-after --stats --timeout=180 --exclude='/distfiles' --exclude='/local' --exclude='/packages'"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
SYNC="rsync://192.168.1.1/gentoo-portage"
USE="x86 acpi apache2 crypt dbus elibc_glibc gif gmp gpm hal input_devices_evdev input_devices_keyboard input_devices_mouse jpeg kernel_linux mmx nptl pam pcre png readline sasl sse ssl tiff truetype usb userland_GNU video_cards_apm video_cards_ark video_cards_ati video_cards_chips video_cards_cirrus video_cards_cyrix video_cards_dummy video_cards_fbdev video_cards_glint video_cards_i128 video_cards_i740 video_cards_i810 video_cards_imstt video_cards_mga video_cards_neomagic video_cards_nsc video_cards_nv video_cards_rendition video_cards_s3 video_cards_s3virge video_cards_savage video_cards_siliconmotion video_cards_sis video_cards_sisusb video_cards_tdfx video_cards_tga video_cards_trident video_cards_tseng video_cards_v4l video_cards_vesa video_cards_vga video_cards_via video_cards_vmware video_cards_voodoo xml xml2 zlib"
Unset:  CTARGET, EMERGE_DEFAULT_OPTS, INSTALL_MASK, LANG, LC_ALL, LDFLAGS, LINGUAS, PORTAGE_RSYNC_EXTRA_OPTS, PORTDIR_OVERLAY

Comment 2 John Huttley 2006-12-18 13:43:34 UTC
Created attachment 104311 [details]
Kernel 2.6.18-gentoo-r4.config
Comment 3 John Huttley 2006-12-18 13:45:13 UTC
Created attachment 104312 [details]
2.6.18-gentoo-r4 Console Log (smp)
Comment 4 John Huttley 2006-12-18 13:48:20 UTC
Created attachment 104313 [details]
System map for Gentoo

I got a little confused as to which was SMP and which was not. Anyway, I decided you would prefer a vanilla kernel if it has to go upstream.
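
For anyone trying to work out which System.map belongs to which console log, here is a minimal sketch of one way to check. It takes the "EIP is at kmem_cache_alloc+0x15/0x42" line and the EIP value c0160afd from the 2.6.18-gentoo-r4 oops quoted in comment 15, subtracts the offset to get the function's start address, and looks that symbol up in a System.map. The function names and the one-line sample map are hypothetical and only illustrate the usual "address type symbol" layout; they are not taken from the attached files.

```python
# Sketch: check whether a System.map belongs to the kernel that produced an oops.
# The sample map line below is hypothetical; only the EIP and the offset come
# from the oops quoted in this bug.

def symbol_start(eip, offset):
    """Start address of the function, given 'EIP is at sym+offset/size'."""
    return eip - offset

def map_matches(system_map_text, symbol, expected_addr):
    """True if the System.map text lists `symbol` at `expected_addr`."""
    for line in system_map_text.splitlines():
        parts = line.split()
        # System.map lines have the form: <hex address> <type> <symbol>
        if len(parts) == 3 and parts[2] == symbol:
            return int(parts[0], 16) == expected_addr
    return False

start = symbol_start(0xC0160AFD, 0x15)
print(hex(start))  # 0xc0160ae8

sample_map = "c0160ae8 T kmem_cache_alloc\n"
print(map_matches(sample_map, "kmem_cache_alloc", start))  # True
```

If the real System.map for a given build lists the symbol at a different address, that map most likely belongs to a different build (for example the UP one rather than the SMP one).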
Comment 5 John Huttley 2006-12-18 13:49:48 UTC
Created attachment 104314 [details]
2.6.19.1 Config (UP)
Comment 6 John Huttley 2006-12-18 13:50:27 UTC
Created attachment 104315 [details]
2.6.19.1 (UP) Console log
Comment 7 John Huttley 2006-12-18 13:51:20 UTC
Created attachment 104316 [details]
2.6.19.1 (UP) system.map
Comment 8 John Huttley 2006-12-18 13:51:54 UTC
Created attachment 104317 [details]
2.6.19.1 (SMP) Config
Comment 9 John Huttley 2006-12-18 13:52:35 UTC
Created attachment 104318 [details]
2.6.19.1 (SMP) Console
Comment 10 John Huttley 2006-12-18 13:56:14 UTC
Created attachment 104319 [details]
2.6.19.1 (SMP) system.map
Comment 11 Daniel Drake (RETIRED) gentoo-dev 2006-12-19 12:57:08 UTC
(In reply to comment #0)
> I installed an 805 system (SMP, x86) with 1 GB RAM and 4x 325GB WD SATA disks.
> These were set up as a RAID1 MD partition to hold the system and a RAID5 MD for
> data.
> Managed by evms, filesystem XFS. 
> 
> I also tried compiling without SMP.
> 
> This proved very unreliable

In what way?

Please post the crash logs.
Comment 12 John Huttley 2006-12-19 17:52:02 UTC
(In reply to comment #11)

> 
> In what way?
> 
> Please post the crash logs.
> 

I did. Look at all the attachments I uploaded.
--John
Comment 13 Daniel Drake (RETIRED) gentoo-dev 2006-12-19 18:25:20 UTC
The crashes look random. Can you run memtest86 for 24 hours?
Comment 15 Daniel Drake (RETIRED) gentoo-dev 2006-12-19 18:37:09 UTC
This is almost certainly a hardware problem. Just found this gem in the 2.6.18 logs:

BUG: unable to handle kernel NULL pointer dereference at virtual address 00000078
 printing eip:
c0160afd
*pde = 00000000
Oops: 0000 [#1]
SMP 
Modules linked in: pcspkr tg3 nfs lockd sunrpc ata_piix ahci sata_svw libata ohci_hcd
CPU:    9
EIP:    0060:[<c0160afd>]    Not tainted VLI
EFLAGS: 00210086   (2.6.18-gentoo-r4 #1) 
EIP is at kmem_cache_alloc+0x15/0x42
eax: 00000009   ebx: 00200206   ecx: dffd5d40   edx: 00000078
esi: dfdaab00   edi: 00011210   ebp: dfdaab1c   esp: f6814f60
ds: 007b   es: 007b   ss: 0068
Process ¡ö¡ö (pid: 1166114143, ti=f6814000 task=f680a000 task.ti=f6814000)

Check out the funky process name and PID...
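
The reasoning here can be written down as a small sanity check: a minimal sketch, not a diagnostic tool. It flags an oops whose CPU number, PID, or process name cannot be right. The function name and the `online_cpus=4` value are assumptions for illustration (a Xeon 5110 is dual-core, and this sketch allows for up to two sockets); the PID ceiling used is the largest `pid_max` Linux permits on 64-bit builds, which is already far above the 32768 default.

```python
import re

# Largest pid_max Linux allows (64-bit builds); the default is 32768.
PID_MAX_LIMIT = 4 * 1024 * 1024

def check_oops(text, online_cpus):
    """Return a list of reasons the oops header looks corrupted."""
    problems = []

    cpu = re.search(r"^CPU:\s+(\d+)", text, re.MULTILINE)
    if cpu and int(cpu.group(1)) >= online_cpus:
        problems.append("CPU %s does not exist" % cpu.group(1))

    pid = re.search(r"\(pid:\s*(\d+)", text)
    if pid and int(pid.group(1)) > PID_MAX_LIMIT:
        problems.append("pid %s exceeds any valid pid_max" % pid.group(1))

    proc = re.search(r"^Process (.*?) \(pid:", text, re.MULTILINE)
    if proc and not all(32 <= ord(c) < 127 for c in proc.group(1)):
        problems.append("process name is not printable ASCII")

    return problems

# Lines copied from the oops above.
OOPS = (
    "CPU:    9\n"
    "EIP is at kmem_cache_alloc+0x15/0x42\n"
    "Process \u00a1\u00f6\u00a1\u00f6 (pid: 1166114143, ti=f6814000 "
    "task=f680a000 task.ti=f6814000)\n"
)

for problem in check_oops(OOPS, online_cpus=4):
    print(problem)
```

All three checks fire on this log, which suggests the task structure itself had been overwritten by the time of the oops and fits the suspicion of a memory or hardware problem rather than a single driver bug.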
Comment 16 John Huttley 2006-12-19 19:00:05 UTC
(In reply to comment #13)
> The crashes look random. Can you run memtest86 for 24 hours?
> 

OK, I did run it for a short time as part of my pre-installation tests.
The machine does have ECC RAM.

I have phoned through and asked for a long run of memtest86.

Note that I had severe issues with the previous, 805-based machine. Unfortunately I don't have any supporting info on it at the moment.

--John
Comment 17 John Huttley 2006-12-20 10:00:51 UTC
Only a little while after asking for memtest to be run, I received a call saying "memtest has gone crazy".

HP service will have a look at it. However, the date being what it is, it may be a little while before it's fixed.

Therefore I'll close off the bug.

Thank you all for your help.

Regards,

John
Comment 18 John Huttley 2007-01-30 20:33:58 UTC
I have some more information. I think the issue lies, in part, with the design of the ML 150 (Xeon 5110 CPU, Intel chipset with PCI-X and PCI-E, LSI Logic SAS controller card, 1 GB RAM using the new FB-DIMMs).

Memtest. Memtest fails after a few minutes at location 04fe. This has occurred on 3 of these machines. A Google search brought up this little gem: if a machine does legacy USB mapping and the BIOS doesn't mark the region as "unavailable", memtest will hit it, show an error AND crash the machine. The BIOS is at the most current update. The BIOS has no options for changing USB settings, and memtest has no options for ignoring a region of RAM.

I have now bought a second ML 150, and it is showing exactly the same flaws as the first. In fact it is worse.

I booted from the 2006.1 x86 install DVD and managed to compile a whole system, with the ~x86 accept flag.

I compiled and installed the 2.6.19-gentoo-r4 kernel and it booted ok.
I started samba and it oops'd instantly.
I rebooted the machine and set up a serial console. I started samba and it oops'd instantly. But this time I have a log which I shall post in a moment.

I tried the memmap=0x10$0x4f0 kernel option, without noticeable effect.
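
As a check on the arithmetic only: the `memmap=SIZE$START` form asks the kernel to treat SIZE bytes starting at START as reserved, so `memmap=0x10$0x4f0` should cover the address memtest complains about. This is a minimal sketch with a hypothetical helper name; it handles plain numbers only (not the K/M/G suffixes the kernel also accepts) and says nothing about how the kernel then treats the rest of that page.

```python
def parse_memmap_reserve(option):
    """Parse 'memmap=SIZE$START' into the range of byte addresses it reserves."""
    key, _, value = option.partition("=")
    if key != "memmap" or "$" not in value:
        raise ValueError("not a memmap=SIZE$START option: %r" % option)
    size_text, start_text = value.split("$", 1)
    size = int(size_text, 0)    # base 0 accepts both 0x... and decimal
    start = int(start_text, 0)
    return range(start, start + size)

reserved = parse_memmap_reserve("memmap=0x10$0x4f0")
print(hex(reserved.start), hex(reserved[-1]))  # 0x4f0 0x4ff
print(0x4FE in reserved)                       # True
```

So the option does name the right bytes. Note also that many boot loaders need the `$` escaped in their configuration file, otherwise the option that reaches the kernel may not be the one typed.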

I rebooted from the DVD with kernel 2.6.17-gentoo-r7, chrooted and started samba without problems. So I started transferring data. After about 15 minutes it stopped. No oops, no disk I/O, no response to the keyboard or Alt-SysRq.

I shall attempt to install earlier vanilla kernels to locate this samba instant death.

In other news... neither the first ML 150 nor this one will boot the x64 install DVD. It crashes part way through. I shall try to get a log of that as well.

The ML 150 is a superb machine, and the components are all kosher, but there is something unusual about it. It comes preinstalled with SBS 2003. That works fine, which is annoying.
Comment 19 John Huttley 2007-01-30 20:47:30 UTC
Created attachment 108685 [details]
Boot and Oops when starting samba
Comment 20 John Huttley 2007-01-31 01:31:06 UTC
Created attachment 108702 [details]
crash log on starting the 2006.1 X64 Live DVD
Comment 21 John Huttley 2007-02-01 00:58:43 UTC
Created attachment 108808 [details]
Crash on 2.6.20-rc7

It booted and ran for an hour with a lot of data being copied in via samba.
I was doing several compiles when it died.
Comment 22 John Huttley 2007-02-01 01:12:03 UTC
Created attachment 108810 [details]
crash on 2.6.15

I thought I'd try an old kernel. This mostly booted, but the system hung when starting net.eth0. It oopsed when I pressed ^C.
Comment 23 John Huttley 2007-02-12 01:51:05 UTC
64-bit issue partly solved.
The system has a BIOS setting that controls 8042 (keyboard) emulation.
This must be OFF for a 64-bit kernel to boot.
I don't know if a 64-bit system would be any more reliable than 32-bit.

Note on memtest.
If the emulation is off, memtest86 won't start.
If it's on, it will error at 0x4fe due to (incorrectly) unreserved memory.
This is known technically as a "lose-lose" situation. However, I have a contact in HP who may be able to help get a BIOS update done.
Comment 24 John Huttley 2007-02-19 01:05:06 UTC
I now have a solution.
32-bit kernels will not run for long, very likely due to the 8042 emulation.
A 64-bit kernel (2.6.20-gentoo) works fine if the 8042 emulation is disabled in the BIOS.
I've been running this for a couple of days and copying a substantial amount of data onto it. There have been no problems.

Other things to note. The motherboard has 6 AHCI SATA ports which can run disks or a SATA DVD drive. However, only ports 1 to 4 are live. The other two might be live on other models that don't have the SAS-SATA hot-plug bay.