Newer X86_64 2.6 kernels require mcelog, as provided by the app-admin/mcelog package, to be run regularly. While this works after emerging said package, it should be an unconditional part of the distribution, and therefore the core system should depend on it. Details can be found in the lkml post by Andi Kleen (X86_64 kernel maintainer): http://marc.theaimsgroup.com/?l=linux-kernel&m=113121225914384&w=2 Reproducible: Always
Well, I don't have it running on mine, so clearly it is not required. However, the official x86-64.org site does suggest running it from a cron job on a regular basis, and they aren't very ambiguous about it either. ftp://ftp.x86-64.org/pub/linux/tools/mcelog/README
At the moment mcelog is marked ~amd64, so we would have to stabilize this before putting it in the system profile. What would we do about the case where someone is running a cron-less system? This seems to compile and run error-free on all my amd64 systems, though it hasn't produced any output on any of them.
Besides the dependency problem (a package in system that depends on virtual/cron means cron has to be in system, which is a non-trivial change), I fail to see why this package has to be in system at all. The worst thing that can happen to you is that you never read the messages. I've been running my system for about 2 years without mcelog and didn't notice any problems, so I don't think it belongs in system. Do you have another source that clearly states that not running mcelog will break your system or cause any problems?
Well, the problem is not so much that the system won't work or will fail without this running, but that you will fail to notice that your system is possibly building up hardware issues. x86_64 kernels are special in this way, as they do not log such events to the usual syslog, but use the mcelog facility. Here's a more detailed package description from the author (SuSE package description): "Linux x86-64 kernels since 2.6.4 don't print recoverable machine check errors to the kernel log anymore. Instead they are saved into a special kernel buffer accessible using /dev/mcelog. mcelog reads /dev/mcelog and prints the stored machine check records to stdout. Then the stored machine check records in the kernel buffer are deleted." Of course this is a borderline case, since it only becomes important when the hardware becomes unstable, but for the reasonable admin there should at least be a hint in the documentation. If integrating this requires the core to depend on virtual/cron, it's probably not worth the fuss.
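To make the quoted package description concrete, here is a minimal Python sketch of the check-then-read sequence it describes. The function name `read_machine_checks` is hypothetical, and running `mcelog` with no arguments (and with enough privileges to read the device) is an assumption based on the description above, not something taken from the mcelog documentation.

```python
import os
import subprocess

# Device path named in the package description quoted above.
MCELOG_DEVICE = "/dev/mcelog"


def read_machine_checks(device=MCELOG_DEVICE, command=("mcelog",)):
    """Return the decoded machine check records as text.

    Returns None when the device node is missing, which suggests the
    running kernel was built without MCE support. Raises
    FileNotFoundError if the mcelog binary itself is not installed.
    """
    if not os.path.exists(device):
        return None
    # Assumption: invoking mcelog without arguments reads the device and
    # prints the stored records to stdout, as the description says.
    result = subprocess.run(
        list(command), capture_output=True, text=True, check=False
    )
    return result.stdout
```

An empty string back from the function would mean no records were pending, while `None` points at missing kernel support rather than healthy hardware.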
The problem still comes down to the fact that the USER has to add support in the kernel for this to even work. I do not see a reason to install a package that is not going to work unless kernel support is there to begin with.
Andreas: Good idea, we should mention it in our docs.
This seems like the best solution to me too - add a section to our docs as we cannot force kernel configurations on users as other distros do.
Just FYI: I stabilized 0.4-r1 a few days ago.
Okay, so we need to have something like:

...
The x86_64 kernel maintainer strongly recommends users enable MCE features so that they are able to be notified of any hardware problems. This requires the app-admin/mcelog package.

(Choose appropriate)
Processor type and features --->
  [*] Intel MCE features
  [*] AMD MCE features
...

In the 'Configuring the Kernel' section. Have I missed anything?
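For anyone who wants to verify an existing kernel configuration against the menu entries above, a rough Python sketch follows. The symbol names `CONFIG_X86_MCE`, `CONFIG_X86_MCE_INTEL` and `CONFIG_X86_MCE_AMD` are my assumption of how those menu labels map to config symbols, and `mce_options` is a hypothetical helper, so check both against your kernel tree.

```python
# Assumed config symbols behind the "MCE features" menu entries.
MCE_OPTIONS = ("CONFIG_X86_MCE", "CONFIG_X86_MCE_INTEL", "CONFIG_X86_MCE_AMD")


def mce_options(config_text):
    """Map each assumed MCE symbol to True if it is built in (=y)."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Skip comments such as "# CONFIG_FOO is not set" and blank lines.
        if line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if value == "y":
            enabled.add(name)
    return {option: option in enabled for option in MCE_OPTIONS}
```

Feeding it the contents of a kernel `.config` (for example from `/usr/src/linux/.config`, assuming the sources live there) gives a quick yes/no per option before bothering to install app-admin/mcelog.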
> The x86_64 kernel maintainer strongly recommends users enable MCE features so
> that they are able to be notified of any hardware problems. This requires the
> app-admin/mcelog package.

This sounds like 'kernel maintainers are watching you' to me :D You probably meant '... that *you* are able to ...', right? You should probably explain that AMD64 users need mcelog because those messages aren't printed to dmesg but to /dev/mcelog. I think this is important to know because if you think your hardware is buggy, the first place you look is probably the output of dmesg and /var/log/messages.
(In reply to comment #10)
> > The x86_64 kernel maintainer strongly recommends users enable MCE features so
> > that they are able to be notified of any hardware problems. This requires the
> > app-admin/mcelog package.
>
> This sounds like 'kernel maintainers are watching you' to me :D You probably
> meant '... that *you* are able to ...', right?

I sort of thought about this, I wasn't sure. I think you're probably right.

> You should probably explain that
> AMD64 users need mcelog because those messages aren't printed to dmesg but to
> /dev/mcelog. I think this is important to know because if you think your
> hardware is buggy, first place you look into is probably the output of dmesg
> and /var/log/messages

Yeah, very true.
Will include this in the latest handbook
Fixed for 2006.0. I'm guessing we need to backport this as well to the other handbooks?
not sure if it's worth the hassle as it's really not a critical issue, but it would certainly be nice to have :)
Done :)