This covers two systems and at least 3 kernels, so I'll start by giving some background.

A company is using Linux as a file server. They do video production, so they need lots of space: 500GB in one share.

I installed an 805 system (SMP, x86) with 1GB RAM and 4x 325GB WD SATA disks. These were set up as a RAID1 MD partition to hold the system and a RAID5 MD for data, managed by evms, with XFS as the filesystem.

I also tried compiling without SMP.

This proved very unreliable, so I replaced the PSU, added another fan and put it on a UPS. It still kept crashing, so I replaced the MB and the RAM. It still crashed.

We had had enough of this, so they bought an HP ML-150 (G3) with SAS storage (LSI LOGIC). This has a 6 bay hot plug tray that can take SAS and SATA drives. The controller can create raid 0 arrays in hardware. It has 2x 36GB 15K SAS drives as a RAID1 array, and 4 SATA drives. These are all HP OEM drives. The system is on the SAS drives.

The first cut has the 2 SAS drives as a RAID1 logical drive with the system on it (evms, XFS), and the 4 SATA drives as RAID5 (evms, XFS).

As soon as any data was on the array it started to crash. So I tried a vanilla 2.6.19.1 kernel in UP and SMP modes and it still crashes. (I have the system maps, config and console logs for these three kernels.)

It got so bad it would crash as soon as the array was accessed. I suspected something nasty in the RAID5 implementation, so I used the SAS controller to make 2 of the SATA drives into a RAID1 logical drive, then made the two remaining SATA drives into a RAID0 array in evms.

It seems slightly better; however, only a short way into copying data into the RAID0 share, the system locks up with no error.

I have marked the bug as priority 1 and Critical because it covers multiple kernels and multiple machines, and the company really, really needs that storage.

I will start uploading the supporting data shortly.

--John
emerge --info
Gentoo Base System version 1.12.6
Portage 2.1.1-r2 (default-linux/x86/2006.1, gcc-4.1.1, glibc-2.4-r4, 2.6.18-gentoo-r4 i686)
=================================================================
System uname: 2.6.18-gentoo-r4 i686 Intel(R) Xeon(R) CPU 5110 @ 1.60GHz
Last Sync: Mon, 18 Dec 2006 20:00:02 +0000
ccache version 2.3 [enabled]
app-admin/eselect-compiler: [Not Present]
dev-java/java-config: [Not Present]
dev-lang/python: 2.4.3-r4
dev-python/pycrypto: 2.0.1-r5
dev-util/ccache: 2.3
dev-util/confcache: [Not Present]
sys-apps/sandbox: 1.2.17
sys-devel/autoconf: 2.13, 2.60
sys-devel/automake: 1.4_p6, 1.5, 1.6.3, 1.7.9-r1, 1.8.5-r3, 1.9.6-r2
sys-devel/binutils: 2.16.1-r3
sys-devel/gcc-config: 1.3.14
sys-devel/libtool: 1.5.22
virtual/os-headers: 2.6.17-r2
ACCEPT_KEYWORDS="x86"
AUTOCLEAN="yes"
CBUILD="i686-pc-linux-gnu"
CFLAGS="-march=pentium4 -O2 -pipe"
CHOST="i686-pc-linux-gnu"
CONFIG_PROTECT="/etc /etc/postfix /etc/samba /etc/sasl2 /etc/squid"
CONFIG_PROTECT_MASK="/etc/env.d /etc/gconf /etc/revdep-rebuild /etc/terminfo"
CXXFLAGS="-march=pentium4 -O2 -pipe"
DISTDIR="/usr/portage/distfiles"
FEATURES="autoconfig ccache distlocks metadata-transfer parallel-fetch sandbox sfperms strict"
GENTOO_MIRRORS="http://distfiles.gentoo.org http://distro.ibiblio.org/pub/linux/distributions/gentoo"
MAKEOPTS="-j3"
PKGDIR="/usr/portage/packages"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --delete-after --stats --timeout=180 --exclude='/distfiles' --exclude='/local' --exclude='/packages'"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
SYNC="rsync://192.168.1.1/gentoo-portage"
USE="x86 acpi apache2 crypt dbus elibc_glibc gif gmp gpm hal input_devices_evdev input_devices_keyboard input_devices_mouse jpeg kernel_linux mmx nptl pam pcre png readline sasl sse ssl tiff truetype usb userland_GNU video_cards_apm video_cards_ark video_cards_ati video_cards_chips video_cards_cirrus video_cards_cyrix video_cards_dummy video_cards_fbdev video_cards_glint video_cards_i128 video_cards_i740 video_cards_i810 video_cards_imstt video_cards_mga video_cards_neomagic video_cards_nsc video_cards_nv video_cards_rendition video_cards_s3 video_cards_s3virge video_cards_savage video_cards_siliconmotion video_cards_sis video_cards_sisusb video_cards_tdfx video_cards_tga video_cards_trident video_cards_tseng video_cards_v4l video_cards_vesa video_cards_vga video_cards_via video_cards_vmware video_cards_voodoo xml xml2 zlib"
Unset: CTARGET, EMERGE_DEFAULT_OPTS, INSTALL_MASK, LANG, LC_ALL, LDFLAGS, LINGUAS, PORTAGE_RSYNC_EXTRA_OPTS, PORTDIR_OVERLAY
Created attachment 104311 [details] Kernel 2.6.18-gentoo-r4.config
Created attachment 104312 [details] 2.6.18-gentoo-r4 Console Log (smp)
Created attachment 104313 [details] System map for Gentoo I got a little confused as to which was smp or not. Anyway I decided you would prefer a vanilla kernel if it has to go upstream.
Created attachment 104314 [details] 2.6.19.1 Config (UP)
Created attachment 104315 [details] 2.6.19.1 (UP) Console log
Created attachment 104316 [details] 2.6.19.1 (UP) system.map
Created attachment 104317 [details] 2.6.19.1 (SMP) Config
Created attachment 104318 [details] 2.6.19.1 (SMP) Console
Created attachment 104319 [details] 2.6.19.1 (SMP) system.map
(In reply to comment #0)
> I installed a 805 system (SMP, x86)with 1Gb ram and 4x 325GB WD SATA disks.
> These were setup as a RAID1 MD parition to hold the system and a RAID5 MD for
> data.
> Managed by evms, filesystem XFS.
>
> I also tried compiling without SMP.
>
> This proved very unreliable

In what way?

Please post the crash logs.
(In reply to comment #11)
> > In what way?
> >
> > Please post the crash logs.
>

I did. Look at all the attachments I uploaded.

--John
The crashes look random. Can you run memtest86 for 24 hours?
This is almost certainly a hardware problem. Just found this gem in the 2.6.18 logs:

BUG: unable to handle kernel NULL pointer dereference at virtual address 00000078
 printing eip:
c0160afd
*pde = 00000000
Oops: 0000 [#1]
SMP
Modules linked in: pcspkr tg3 nfs lockd sunrpc ata_piix ahci sata_svw libata ohci_hcd
CPU:    9
EIP:    0060:[<c0160afd>]    Not tainted VLI
EFLAGS: 00210086   (2.6.18-gentoo-r4 #1)
EIP is at kmem_cache_alloc+0x15/0x42
eax: 00000009   ebx: 00200206   ecx: dffd5d40   edx: 00000078
esi: dfdaab00   edi: 00011210   ebp: dfdaab1c   esp: f6814f60
ds: 007b   es: 007b   ss: 0068
Process ¡ö¡ö (pid: 1166114143, ti=f6814000 task=f680a000 task.ti=f6814000)

Check out the funky process name and PID...
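For anyone sifting through the other attached console logs, the same sanity check can be roughly automated. The following is only a minimal sketch: the function name `check_process_line` and the constant `PID_UPPER_BOUND` are made up for illustration, and the bound of 4194304 is an assumption (the largest PID limit on 64-bit Linux; 32-bit kernels allow far fewer), chosen to be generous. A non-ASCII task name is not strictly impossible, so treat a hit as a hint of corruption and not as proof.

```python
import re

# Assumption: 4194304 is used as a generous upper bound for a valid PID
# (the 64-bit Linux limit); anything above it cannot be a real PID.
PID_UPPER_BOUND = 4194304

# Matches the "Process NAME (pid: N, ..." line of an oops report.
PROCESS_LINE = re.compile(r"Process (?P<comm>.*?) \(pid: (?P<pid>\d+),")


def check_process_line(line):
    """Return a list of reasons the Process line looks corrupted."""
    match = PROCESS_LINE.search(line)
    if match is None:
        return ["no 'Process NAME (pid: N,' field found"]

    problems = []
    comm = match.group("comm")
    pid = int(match.group("pid"))

    if pid > PID_UPPER_BOUND:
        problems.append(f"pid {pid} exceeds {PID_UPPER_BOUND}")
    # Task names are normally plain printable ASCII.
    if not comm or not all(0x20 <= ord(ch) < 0x7F for ch in comm):
        problems.append(f"process name {comm!r} is not printable ASCII")
    return problems


if __name__ == "__main__":
    oops_line = ("Process ¡ö¡ö (pid: 1166114143, ti=f6814000 "
                 "task=f680a000 task.ti=f6814000)")
    for problem in check_process_line(oops_line):
        print("suspicious:", problem)
```

Run against the Process line quoted above, it flags both the PID and the process name; a line such as "Process smbd (pid: 1234, ...)" passes cleanly.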
(In reply to comment #13)
> The crashes look random. Can you run memtest86 for 24 hours?
>

OK, I did run it for a short time as part of my pre-installation tests. The machine does have ECC RAM. I have phoned through and asked for a long run of memtest86.

Note that I had severe issues with the previous, 805-based machine. Unfortunately I don't have any supporting info on it at the moment.

--John
Only a little while after asking for memtest to be run I received a call saying "memtest has gone crazy". HP service will have a look at it. However, the date being what it is, it may be a little while before it's fixed. Therefore I'll close off the bug.

Thank you all for your help.

Regards,
John
I have some more information. I think the issue lies, in part, with the design of the ML 150 (Xeon 5110 CPU, Intel chipset with PCI-X and PCI-E, LSI Logic SAS controller card, 1GB RAM using the new FB DIMMs).

Memtest. Memtest fails after a few minutes at location 04fe. This has occurred on 3 of these machines. A Google search brought up this little gem: if a machine does legacy USB mapping and the BIOS doesn't mark the region as "unavailable", memtest will hit it, show an error AND crash the machine. The BIOS is at the most current update. The BIOS has no options for changing USB settings and memtest has no options for ignoring a region of RAM.

I have now bought a second ML 150, and it's showing exactly the same flaws as the first. In fact it is worse.

I booted from the 2006.1 x86 install DVD and managed to compile a whole system with the ~x86 accept flag. I compiled and installed the 2.6.19-gentoo-r4 kernel and it booted OK. I started samba and it oopsed instantly. I rebooted the machine and set up a serial console. I started samba and it oopsed instantly again, but this time I have a log, which I shall post in a moment. I tried the memmap=0x10$0x4f0 kernel option, without noticeable effect.

I rebooted from the DVD with kernel 2.6.17-gentoo-r7, chrooted and started samba without problems. So I started transferring data. After about 15 mins it stopped. No oops, no disk IO, no response to keyboard or Alt-SysRq.

I shall attempt to install earlier vanilla kernels to locate this samba instant death.

In other news... neither the first ML 150 nor this one will boot the x64 install DVD. It crashes part way through. I shall try to get a log of that as well.

The ML 150 is a superb machine, and the components are all kosher, but there is something unusual about it. It comes preinstalled with SBS 2003. That works fine, which is annoying.
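As a cross-check on that kernel option: in the memmap=SIZE$START form, the kernel is asked to treat SIZE bytes starting at START as reserved. The sketch below only verifies the arithmetic, i.e. that memmap=0x10$0x4f0 spans the failing location 0x4fe. The helper names `parse_memmap_reserved` and `covers` are invented for illustration, K/M/G size suffixes are deliberately not handled, and nothing here says how the kernel actually rounds or applies such a small reservation.

```python
from typing import Tuple


def parse_memmap_reserved(option: str) -> Tuple[int, int]:
    """Parse 'memmap=SIZE$START' into (start, end_exclusive).

    Assumption: plain decimal or 0x-prefixed numbers only; the K/M/G
    suffixes the kernel also accepts are not handled in this sketch.
    """
    prefix = "memmap="
    if not option.startswith(prefix) or "$" not in option:
        raise ValueError(f"not a memmap=SIZE$START option: {option!r}")
    size_text, start_text = option[len(prefix):].split("$", 1)
    size = int(size_text, 0)    # base 0 accepts both "16" and "0x10"
    start = int(start_text, 0)
    return start, start + size


def covers(option: str, address: int) -> bool:
    """True if the reserved range of the option contains the address."""
    start, end = parse_memmap_reserved(option)
    return start <= address < end


if __name__ == "__main__":
    opt = "memmap=0x10$0x4f0"
    start, end = parse_memmap_reserved(opt)
    print(f"{opt} reserves {hex(start)}..{hex(end - 1)}")
    print("covers 0x4fe:", covers(opt, 0x4FE))
```

So the option as typed does name the right bytes (0x4f0 through 0x4ff); the lack of effect is not down to a typo in the range.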
Created attachment 108685 [details] Boot and Oops when starting samba
Created attachment 108702 [details] crash log on starting the 2006.1 X64 Live DVD
Created attachment 108808 [details] Crash on 2.6.20-rc7 It booted and ran for an hour with a lot of data being copied in via samba. I was doing several compiles when it died.
Created attachment 108810 [details] crash on 2.6.15 I thought I'd try an old kernel. This mostly booted but the system hung when starting net.eth0. It oopsed when I pressed ^C.
64-bit issue partly solved. The system has a BIOS setting that controls 8042 (keyboard) emulation. This must be OFF for a 64-bit kernel to boot. I don't know if a 64-bit system would be any more reliable than 32-bit.

Note on memtest: if the emulation is off, memtest86 won't start. If it's on, it will error at 0x4fe due to (incorrectly) unreserved memory. This is known technically as a "lose lose" situation.

However, I have a contact in HP who may be able to help get a BIOS update done.
I now have a solution. 32-bit kernels will not run for long, very likely due to the 8042 emulation. A 64-bit kernel (2.6.20-gentoo) works fine if the 8042 emulation is disabled in the BIOS. I've been running this for a couple of days and copying a substantial amount of data onto it. There have been no problems.

Other things to note: the motherboard has 6 AHCI SATA ports which can run disks or a SATA DVD drive. However, only ports 1 to 4 are live. They might be live on other models that don't have the SAS-SATA hot plug bay.