352372 – Kernel 2.6.37 with MCE enabled writes unwanted messages to all terminals

Status:	RESOLVED FIXED

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	[OLD] Core system
Hardware:	AMD64 Linux

Importance:	High normal
Assignee:	Gentoo Kernel Bug Wranglers and Kernel Maintainers

Reported:	2011-01-21 20:37 UTC by Philip Webb
Modified:	2011-07-27 11:58 UTC
CC List:	0 users

Attachments
output of 'emerge --info' (emerge.d1, 2.96 KB, text/plain) 2011-01-25 05:57 UTC, Philip Webb
kernel .config as requested (.config, 46.31 KB, text/plain) 2011-01-25 05:58 UTC, Philip Webb
diff of .config for 2.6.33 & 2.6.37 (diff.d1, 16.19 KB, text/plain) 2011-01-25 17:26 UTC, Philip Webb
screenshot of unwanted messages (screenshot1.png, 35.53 KB, image/png) 2011-04-01 01:27 UTC, Philip Webb
screenshot of unwanted messages (screenshot.png, 112.75 KB, image/png) 2011-04-01 01:28 UTC, Philip Webb

Description Philip Webb 2011-01-21 20:37:42 UTC

I recently upgraded my kernel from 2.6.33 to 2.6.37 & began to receive multiple messages in every terminal (Konsole & XFCE's Terminal) saying

kernel: [Hardware Error]: No human readable MCE decoding support on this CPU type.
kernel: [Hardware Error]: Run the message through 'mcelog --ascii' to decode.

These messages also appear in the syslog file, together with the additional line

kernel: [Hardware Error]: Machine check events logged

This third line also appeared in syslog under 2.6.33, but the other two did not, & none of the lines was written to any terminal I was running.

The messages can be eradicated by booting with 'append="nomce"', but that appears to stop the whole MCE process, not just the messages.

Looking through the kernel configuration with 'make menuconfig' suggests that CONFIG_EDAC_DECODE_MCE is the option needed, but its help text states "Decode MCEs in human-readable form (only on AMD for now)" & my machine has an Intel Core2 Duo processor.

Reproducible: Always

Steps to Reproduce:
1. Install kernel 2.6.37 configured with MCE enabled with an Intel processor
2. Wait a few minutes & check any virtual terminals which are running.

Actual Results:
The 2 messages above will appear several times in the terminal.

Expected Results:
No such messages should occur in any terminal.

I don't know whether this is a bug in the Linux kernel, whose code for MCE was updated in the 2.6.36 release, which I haven't tested, or whether the unwanted messages can be suppressed by some step available elsewhere in Gentoo. It should not be necessary for users to suppress the whole MCE process, which may be necessary on some hardware to prevent overheating or data corruption, simply in order to avoid the nuisance messages in terminals.

If it is a kernel bug, I'm not sure if an ordinary user can report it to the kernel developers & in any case it's more likely to be taken seriously by them if it's reported by the devs of a well-established distro like Gentoo.

Comment 1 Jeroen Roovers (RETIRED) gentoo-dev 2011-01-25 00:40:59 UTC

Please attach your kernel .config, and paste your `emerge --info' output in a comment.

Comment 2 Philip Webb 2011-01-25 05:57:46 UTC

Created attachment 260639 [details]
output of 'emerge --info'

Output of 'emerge --info' as requested

Comment 3 Philip Webb 2011-01-25 05:58:57 UTC

Created attachment 260640 [details]
kernel .config as requested

kernel .config for 2.6.37

Comment 4 Jeroen Roovers (RETIRED) gentoo-dev 2011-01-25 16:27:40 UTC

Did you previously have CONFIG_EARLY_PRINTK=y too? I am asking because I doubt that enabling MCE would /cause/ messages to flood your consoles. Maybe a diff between your old and new .config will tell what else got enabled by default.

Comment 5 Philip Webb 2011-01-25 17:26:23 UTC

Created attachment 260696 [details]
diff of .config for 2.6.33 & 2.6.37

This is the diff requested.  CONFIG_EARLY_PRINTK=y in both versions.
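
A minimal sketch of the comparison asked for in comment 4, assuming both configs are available as text. The function names `parse_kconfig` and `diff_kconfig` are hypothetical, and the sample values in the demo are invented for illustration; they are not taken from the attached files.

```python
def parse_kconfig(text):
    """Map option name -> value for a kernel .config.

    Options written as '# CONFIG_FOO is not set' are recorded as 'n'.
    """
    options = {}
    suffix = " is not set"
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            options[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def diff_kconfig(old_text, new_text):
    """Return {option: (old_value, new_value)} for every option that differs.

    A value of None means the option does not appear in that config at all.
    """
    old, new = parse_kconfig(old_text), parse_kconfig(new_text)
    changes = {}
    for name in sorted(set(old) | set(new)):
        before, after = old.get(name), new.get(name)
        if before != after:
            changes[name] = (before, after)
    return changes


if __name__ == "__main__":
    # Invented sample data, only to show the shape of the result.
    old_cfg = "CONFIG_X86_MCE=y\nCONFIG_EARLY_PRINTK=y\n# CONFIG_EDAC_DECODE_MCE is not set\n"
    new_cfg = "CONFIG_X86_MCE=y\nCONFIG_EARLY_PRINTK=y\nCONFIG_EDAC_DECODE_MCE=y\n"
    for option, (before, after) in diff_kconfig(old_cfg, new_cfg).items():
        print(option, before, "->", after)
```

Unlike a plain line diff, this ignores reordering and comments, so only options whose value actually changed between the two kernels are listed.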

Comment 6 Wormo (RETIRED) gentoo-dev 2011-01-26 06:31:51 UTC

Searching through your logs, do you see any machine check messages along these lines:

CPU 0: Machine Check Exception:   0 Bank 0: b200004000000800
TSC 0
PROCESSOR 0:6fb TIME 1288829692 SOCKET 0 APIC 0

If so, then go ahead and try feeding it to 'mcelog --ascii' for decoding. If there are no such messages, then this bug goes to the kernel team, who might ask for more info and can give advice on reporting it upstream.
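
The search described here can be scripted so that rotated, gzip-compressed logs are covered as well. This is a rough sketch rather than a tested tool: `find_in_logs` is a hypothetical helper, and the default file patterns are an assumption based on a common syslog layout.

```python
import gzip
from pathlib import Path


def find_in_logs(log_dir, needle, patterns=("kern.log*", "messages*", "syslog*")):
    """Return (file name, line number, line) for every log line containing needle.

    Files ending in .gz are read through gzip; everything else is read as text.
    """
    hits = []
    for pattern in patterns:
        for path in sorted(Path(log_dir).glob(pattern)):
            if not path.is_file():
                continue
            opener = gzip.open if path.suffix == ".gz" else open
            # errors="replace" keeps the scan going over stray non-UTF-8 bytes.
            with opener(path, "rt", errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    if needle in line:
                        hits.append((path.name, number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    for name, number, line in find_in_logs("/var/log", "Machine Check Exception"):
        print(f"{name}:{number}: {line}")
```

Reading /var/log usually requires root privileges, so an empty result from an unprivileged run may only mean the files could not be matched or opened.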

Comment 7 Philip Webb 2011-01-27 04:25:18 UTC

The contents of my  /var/log/  are

  auth.log    daemon.log  emerge-fetch.log  faillog    lastlog   mail.err
  mail.warn  ntpd.log  syslog       syslog.2.gz  syslog.5.gz  uucp.log 
  Xorg.0.log.old  ConsoleKit  debug       emerge.log        imapd.log  
  lpr.log         mail.info  messages   ppp.log   syslog.0     syslog.3.gz  
  syslog.6.gz  wtmp  cups        dmesg       emerge-logs       kern.log
  lvm2-setup.log  mail.log   news       sandbox
  syslog.1.gz  syslog.4.gz  user.log     Xorg.0.log

I looked at  kern.log  messages  syslog*  using '(z)less'
& none of them contains the string 'Machine Check Exception'.
Would you like me to do any other searches ?

Something which changed in 2.6.36 (others have reported the problem there) caused the messages to be broadcast to all running terminals.  I do not have the expertise to know whether that could be directly due to the kernel or whether it would involve several steps, some of which might be outside the kernel & therefore correctable by some change eg in a system file.

Thanks for your prompt responses.

Comment 8 Wormo (RETIRED) gentoo-dev 2011-01-29 06:23:11 UTC

It looks like nowadays the actual MCE messages go to a separate device, so they wouldn't necessarily end up in syslog:

"When you see the "Machine check errors logged" message in the system
log then mcelog should run to collect and decode machine check entries
from /dev/mcelog. Normally mcelog should be run regularly from a cronjob."

linux/Documentation/x86/x86_64

So I'm still concerned that your hardware is really generating machine check errors rather than being a false alarm. Please install app-admin/mcelog and see if you get MCE details in /var/log/mcelog. If the errors are legitimate, then the fact that they are getting prominently displayed to terminals seems like a feature rather than a bug...

Comment 9 Philip Webb 2011-01-31 12:23:13 UTC

(In reply to comment #8)
> It looks like nowadays the actual MCE messages go to a separate device,
> so they wouldn't necessarily end up in syslog:
> "When you see the "Machine check errors logged" message in the system log
> then mcelog should run to collect and decode machine check entries
> from /dev/mcelog. Normally mcelog should be run regularly from a cronjob."
> (linux/Documentation/x86/x86_64)
> I'm still concerned your hardware is really generating machine check errors
> rather than being a false alarm.

Yes, there's no doubt it's generating real error messages;
the issue is where they should be delivered (smile).

> Please install app-admin/mcelog
> and see if you get MCE details in /var/log/mcelog.

I installed it, rebooted without the 'nomce' option
& waited for the messages to appear in my user terminals, which they have,
but the file/dir 'mcelog' hasn't been created.

I ran 'mcelog', which gives the following output :

root:506 log> mcelog
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 0
CPU 0 BANK 3 
TIME 1296475963 Mon Jan 31 07:12:43 2011
MCG status:
MCi status:
Error enabled
Threshold based error status: green
MCA: corrected filtering (some unreported errors in same region)
Level-2 Generic memory hierarchy error
STATUS 902000420320100e MCGSTATUS 0
MCGCAP 806 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 15
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 1
CPU 1 BANK 3 
ADDR 1023480 
TIME 1296475963 Mon Jan 31 07:12:43 2011
MCG status:
MCi status:
Error enabled
MCi_ADDR register valid
Threshold based error status: green
MCA: Generic CACHE Level-2 Generic Error
STATUS 942000c20501010a MCGSTATUS 0
MCGCAP 806 APICID 1 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 15

I have not encountered any problems in everyday use of my machine.
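
The two STATUS values in the mcelog output above can be cross-checked by hand. The sketch below assumes the IA32_MCi_STATUS bit layout from Intel's architecture manual as I understand it (flag bits 63 to 57, threshold status in bits 54:53, MCA error code in bits 15:0). The function names are hypothetical, and only the two compound error-code forms seen in this report are classified.

```python
# Flag bit positions in IA32_MCi_STATUS (assumed from the Intel SDM).
FLAG_BITS = {"VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
             "MISCV": 59, "ADDRV": 58, "PCC": 57}
THRESHOLD = ("no tracking", "green", "yellow", "reserved")
LEVELS = ("L0", "L1", "L2", "generic")


def decode_mci_status(status, mcg_cap):
    """Split a 64-bit MCi_STATUS value into flags and code fields."""
    decoded = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    decoded["mca_code"] = status & 0xFFFF
    decoded["model_code"] = (status >> 16) & 0xFFFF
    # The threshold field only applies when MCG_CAP bit 11 is set and the
    # error was corrected (UC clear); otherwise it is reported as None.
    if (mcg_cap >> 11) & 1 and not decoded["UC"]:
        decoded["threshold"] = THRESHOLD[(status >> 53) & 0x3]
    else:
        decoded["threshold"] = None
    return decoded


def classify_mca_code(code):
    """Return (kind, cache level, filtered) for a compound MCA error code."""
    filtered = bool(code & 0x1000)   # bit 12: corrected-error filtering
    base = code & ~0x1000
    level = LEVELS[base & 0x3]
    if base & 0xFFFC == 0x000C:
        kind = "memory hierarchy"
    elif base & 0xFF00 == 0x0100:
        kind = "cache hierarchy"
    else:
        kind, level = "other", None  # forms not handled by this sketch
    return kind, level, filtered


if __name__ == "__main__":
    for raw in (0x902000420320100E, 0x942000C20501010A):
        info = decode_mci_status(raw, mcg_cap=0x806)
        print(hex(raw), info, classify_mca_code(info["mca_code"]))
```

For both values the UC bit comes out clear, meaning the hardware reports these as corrected errors, which fits the absence of visible problems in everyday use.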

> If the errors are legitimate,
> the fact they are getting prominently displayed to terminals
> seems like a feature rather than a bug...

No (another smile)! -- such messages should never be displayed
in a user's terminal, as they are appearing here.
Nor should they really be displayed in root's terminal
unless s/he has made some move to have them displayed there.
They should be logged, eg in  /var/log/mcelog .

Please remember that this phenomenon began only with kernel 2.6.37 for me,
though others have reported it starting with 2.6.36.
It was a revision of the MCE code in the kernel which caused it,
not some new defect in my hardware.

Comment 10 Mike Pagano gentoo-dev 2011-03-17 10:58:13 UTC

Anything different with 2.6.38 kernels?

Comment 11 Philip Webb 2011-04-01 01:25:47 UTC

Sorry for the delay: an infected tooth needed urgent repair.

There is no change with kernel 2.6.38 : the 'nomce' flag is needed
or messages are spread on all running virtual terminals: see screenshot.

Comment 12 Philip Webb 2011-04-01 01:27:45 UTC

Created attachment 268079 [details]
screenshot of unwanted messages

screenshot of effect on Mutt

Comment 13 Philip Webb 2011-04-01 01:28:43 UTC

Created attachment 268081 [details]
screenshot of unwanted messages

screenshot of effect on terminal

Comment 14 Mike Pagano gentoo-dev 2011-04-01 12:37:15 UTC

Please take this upstream at http://bugzilla.kernel.org and post the URL back here.

Comment 15 Philip Webb 2011-04-11 07:55:55 UTC

This has already been reported as Kernel bug 30662 .
I have added my own experience as a comment there.

Comment 16 Mike Pagano gentoo-dev 2011-04-13 00:12:52 UTC

We'll follow the upstream bug and backport fixes as identified.

Comment 17 Stratos Psomadakis (RETIRED) gentoo-dev 2011-07-27 11:58:27 UTC

Upstream closed the bug as fixed in 3.0. Closing.