I have a dual Opteron storage server. I just upgraded it to the gentoo-2.6.11-r7 kernel; the system does not boot and always freezes with a machine check exception. The error displayed on the console is the following:

CPU 0: Machine Check Exception 7 Bank 4: b400000000a13
RIP 10:<ffffffff8010c444> {default_idle+0x24/0x30}
TSC fc233b962 ADDR 3d520070
Kernel panic - not syncing : Uncorrected machine check

I tried booting with the nomce option, but the system keeps rebooting. Currently it's running the

Reproducible: Always

Steps to Reproduce:
1. Reboot machine

Actual Results:
System freezes with machine check exception

Expected Results:
System should boot normally

Portage 2.0.51.19 (default-linux/amd64/2004.3, gcc-3.3.4, glibc-2.3.4.20041102-r1, 2.6.5-gentoo-r1 x86_64)
=================================================================
System uname: 2.6.5-gentoo-r1 x86_64 AMD Opteron(tm) Processor 244
Gentoo Base System version 1.4.16
Python: dev-lang/python-2.3.5 [2.3.5 (#1, May 12 2005, 20:35:16)]
distcc 2.18.3 x86_64-pc-linux-gnu (protocols 1 and 2) (default port 3632) [disabled]
ccache version 2.3 [enabled]
dev-lang/python: 2.3.5
sys-apps/sandbox: [Not Present]
sys-devel/autoconf: 2.13, 2.59-r6
sys-devel/automake: 1.9.5, 1.5, 1.8.5-r3, 1.6.3, 1.7.9-r1, 1.4_p6
sys-devel/binutils: 2.15.92.0.2-r8
sys-devel/libtool: 1.5.16
virtual/os-headers: 2.6.8.1-r4
ACCEPT_KEYWORDS="amd64"
AUTOCLEAN="yes"
CFLAGS="-pipe -O2"
CHOST="x86_64-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/kde/2/share/config /usr/kde/3/share/config /usr/lib/X11/xkb /usr/share/config /var/qmail/control"
CONFIG_PROTECT_MASK="/etc/gconf /etc/terminfo /etc/env.d"
CXXFLAGS="-pipe -O2"
DISTDIR="/usr/portage/distfiles"
FEATURES="autoaddcvs autoconfig ccache distlocks sandbox sfperms strict"
GENTOO_MIRRORS="http://gentoo.mirrors.pair.com/ http://mirror.datapipe.net/gentoo ftp://mirrors.tds.net/gentoo"
MAKEOPTS="-j4"
PKGDIR="/usr/portage/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
SYNC="rsync://rsync.gentoo.org/gentoo-portage"
USE="amd64 acpi alsa berkdb bitmap-fonts cdr crypt cups font-server fortran gdbm gif gpm imlib ipv6 jp2 jpeg ldap lzw lzw-tiff motif mp3 ncurses nls opengl oss pam perl png python readline slang ssl tcpd tiff truetype truetype-fonts type1-fonts usb userlocales xml2 xpm xrandr xv zlib userland_GNU kernel_linux elibc_glibc"
Unset: ASFLAGS, CBUILD, CTARGET, LANG, LC_ALL, LDFLAGS, LINGUAS, PORTDIR_OVERLAY
Please test development-sources-2.6.12_rc5
See comment #1
I installed 2.6.12-rc6 instead of rc5. I still see the machine check exception; the TSC and ADDR values differ, but the bank and RIP values are the same. I downloaded the machine check exception parser from http://www.codemonkey.org.uk/parsemce/ but I couldn't get it to parse the input, and I'm not sure how to use it. Any help appreciated.
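For anyone else stuck at the same step: the console line in the original report already carries the values parsemce asks for. Below is a minimal Python sketch, not part of parsemce itself, that pulls those fields out of the panic text and builds a command line. It assumes the option letters that appear in the parsemce invocation later in this report (-b bank, -s status, -e MCG status, -a address); the function name and regular expressions are made up for illustration, and the status value is copied exactly as posted, without padding.

```python
import re

# Pattern for the first line of the panic message as posted in this report.
# The optional colon after "Exception" is an assumption about other kernels.
MCE_LINE = re.compile(
    r"CPU (?P<cpu>\d+): Machine Check Exception:?\s+(?P<mcg>[0-9a-fA-F]+)\s+"
    r"Bank (?P<bank>\d+): (?P<status>[0-9a-fA-F]+)"
)
ADDR_FIELD = re.compile(r"ADDR (?P<addr>[0-9a-fA-F]+)")


def parsemce_args(console_text):
    """Build a parsemce argument list from copied console output (hypothetical helper)."""
    m = MCE_LINE.search(console_text)
    if m is None:
        raise ValueError("no machine check line found")
    a = ADDR_FIELD.search(console_text)
    addr = a.group("addr") if a else "0"  # fall back to 0 if no ADDR was printed
    return [
        "./parsemce",
        "-b", m.group("bank"),
        "-s", m.group("status"),
        "-e", m.group("mcg"),
        "-a", addr,
    ]


if __name__ == "__main__":
    sample = (
        "CPU 0: Machine Check Exception 7 Bank 4: b400000000a13 "
        "RIP 10:<ffffffff8010c444> {default_idle+0x24/0x30} "
        "TSC fc233b962 ADDR 3d520070"
    )
    print(" ".join(parsemce_args(sample)))
```

The values change between boots (TSC and ADDR in particular), so the text has to be copied from the panic that is actually on screen.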
So it is not possible to boot the machine to a console at all, even with the nomce option?
When I use the nomce option and boot either 2.6.11-r7 or 2.6.12-rc6, the system goes into a reboot loop. It doesn't reboot after a fixed time, which makes it difficult for me to debug. If I don't use the nomce option, it keeps freezing with the machine check exception, each time with a different TSC and ADDR. This freeze doesn't occur after a fixed time either. I don't see the problem with the 2.6.5-r1 kernel, which works fine and has been running for over a year. I am not sure how to use the parsemce tool found at the link I posted in comment #3.
This doesn't really answer my question, since you haven't said at which point the reboot/freeze actually occurs.
The reboot occurs at random places. Sometimes while loading modules, sometimes while mounting the filesystem. Sometimes anywhere in between when the services are being started. There is no fixed place where it freezes/reboots.
OK, if you boot with init=/bin/bash as a kernel parameter (without nomce), do you get some time on the console before it reboots? I think we should be able to parse the MCEs that way.
The machine still reboots, but I managed to get a better example to run with parsemce. The output of parsemce is as follows:

jeeves ~ # ./parsemce -b 4 -s b442200000000a13 -e 0000000000000007 -a 0
Status: (7) Machine Check in progress. Error IP valid Restart IP valid.
parsebank(4): b442200000000a13 @ 0
External tag parity error
Uncorrectable ECC error
Address in addr register valid
Error enabled in control register
Error not corrected.
Bus and interconnect error
Participation: Local processor responded to request
Timeout: Request did not timeout
Request: Generic error
Transaction type : Instruction
Memory/IO : Other
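Some of the parsemce output can be cross-checked by hand. The Python sketch below decodes only the generic flag bits at the top of the bank status value and the three low bits of the global status value passed with -e. The bit positions are assumed from the x86 machine check architecture (status bits 63 down to 57, global status bits 0 to 2); the vendor-specific fields and the low 16-bit error code are left undecoded, and the table and function names are made up for illustration.

```python
# Generic flag bits at the top of a machine check bank status value.
# Bit positions are an assumption taken from the x86 machine check architecture.
STATUS_FLAGS = [
    (63, "VAL", "status register holds a valid error"),
    (62, "OVER", "an earlier error was overwritten"),
    (61, "UC", "error was not corrected"),
    (60, "EN", "error reporting enabled in control register"),
    (59, "MISCV", "MISC register valid"),
    (58, "ADDRV", "address register valid"),
    (57, "PCC", "processor context corrupt"),
]

# Low bits of the global machine check status (the value given to -e).
MCG_FLAGS = [
    (0, "RIPV", "restart IP valid"),
    (1, "EIPV", "error IP valid"),
    (2, "MCIP", "machine check in progress"),
]


def decode(value, table):
    """Return the names of all flags in `table` whose bit is set in `value`."""
    return [name for bit, name, _ in table if (value >> bit) & 1]


def error_code(status):
    """Return the low 16 bits of the status value, left undecoded here."""
    return status & 0xFFFF


if __name__ == "__main__":
    status = 0xB442200000000A13  # bank 4 status from the parsemce run
    mcg = 0x7                    # value passed with -e
    print("bank flags:", decode(status, STATUS_FLAGS))
    print("global flags:", decode(mcg, MCG_FLAGS))
    print("error code: %#06x" % error_code(status))
```

For this status value the set flags are VAL, UC, EN and ADDRV, which lines up with the "Error not corrected", "Error enabled in control register" and "Address in addr register valid" lines that parsemce printed.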
Not sure what to make of this. Could you please report it at http://bugzilla.kernel.org, as it seems to be an upstream bug? Please post the new bug URL here.
Filed bug upstream. http://bugme.osdl.org/show_bug.cgi?id=4861
If you have time, please test with 2.6.13_rc6 (soon to be released) and update the upstream bug as appropriate. If it is still an issue, I will attempt to get the bug listed on Andrew Morton's to-be-fixed list :)
Fixed in upstream patch, will include in next gentoo-sources release.
Fixed in gentoo-sources-2.6.13