Problems in 2.6.24/2.6.25

Oops messages:

First:
Apr 27 14:35:32 myth BUG: unable to handle kernel
Apr 27 14:35:32 myth NULL pointer dereference
Apr 27 14:35:32 myth at 0000000000000010
Apr 27 14:35:32 myth IP:
Apr 27 14:35:32 myth [<ffffffff803874eb>] rb_insert_color+0x2b/0xf0
Apr 27 14:35:32 myth PGD 61c68067
Apr 27 14:35:32 myth PUD 6e057067
Apr 27 14:35:32 myth PMD 0
Apr 27 14:35:32 myth
Apr 27 14:35:32 myth Oops: 0000 [1]

Second:
Apr 27 14:53:03 myth Eeek! page_mapcount(page) went negative! (-1)
Apr 27 14:53:03 myth page pfn = 52d70
Apr 27 14:53:03 myth page->flags = 5200000000001c
Apr 27 14:53:03 myth page->count = 0
Apr 27 14:53:03 myth page->mapping = 0000000000000000
Apr 27 14:53:03 myth vma->vm_ops = 0x0
Apr 27 14:53:03 myth ------------[ cut here ]------------
Apr 27 14:53:03 myth kernel BUG at mm/rmap.c:669!
Apr 27 14:53:03 myth invalid opcode: 0000 [1]

Due to some limitations I have only been able to test this on 2.6.24 and 2.6.25, but both seem to show the same issues.

The system boots over the network with AoE, and that has worked perfectly. The issue appeared after I switched to an AM2+ CPU. It seems to arise whenever I go to or beyond 2.4GHz, and here is the strange thing: it works perfectly at anything lower, without any hangs or crashes, but as soon as I go beyond 2.4GHz it panics within a second or two.

Tests (with max_freq set to 2.2GHz):
- 1 hour burn test with 2 burnK7 processes (no issues)
- Successful pass of memtest86+ (with freq at 2.5GHz)

Things that have been checked:
- Temperature: at max load it reaches around 45C before the system panics; idle temp is around 35-40C depending on the ambient temp. System temp is around 30-35C.
- Switched CPU cooler just to get it a bit lower. Before, it was around 50-55C at load and 40-45C at idle.

One thing I have found that sounds a bit strange is that the vid value reported by lm-sensors is always 1.550V, and that sounds a bit high for the 65W CPU, at least when running at the lower speeds.
But since that is also reported without cool'n'quiet enabled, it sounds more like an lm-sensors fluke. http://en.wikipedia.org/wiki/Athlon_64_X2#Brisbane_.2865_nm_SOI.29 specifies 1.25-1.35V for the CPU, so this might just be a fluke in lm-sensors, or it might be that the kernel does not correctly read which voltages it should set on the CPU. The values it sets do not look totally screwed up, though:

powernow-k8: Found 1 AMD Athlon(tm) 64 X2 Dual Core Processor 4800+ processors (2 cpu cores) (version 2.20.00)
powernow-k8: 0 : fid 0x11 (2500 MHz), vid 0x8
powernow-k8: 1 : fid 0x10 (2400 MHz), vid 0x9
powernow-k8: 2 : fid 0xe (2200 MHz), vid 0xb
powernow-k8: 3 : fid 0xc (2000 MHz), vid 0xd
powernow-k8: 4 : fid 0xa (1800 MHz), vid 0xf
powernow-k8: 5 : fid 0x2 (1000 MHz), vid 0x12

Also, when booting without cool'n'quiet enabled it reports the VCore at 1.35V, while manually setting the frequency to 2.5GHz with cool'n'quiet enabled reports only 1.28V. This also feels like the primary suspect, since a load test does not cause a panic when booting without cool'n'quiet.

Another suspect could be the atl1 driver for the network card: since I'm booting via that, I might get corrupt data that is causing all of this. But that does not sound too probable, since it crashes already-running processes and I'm not using any swap, so that theory also falls. And why would it only occur at 2.4GHz with cool'n'quiet enabled?

So to summarize:
- With cool'n'quiet, the system panics whenever it goes beyond 2.4GHz and the system is loaded.
- Without cool'n'quiet, the system works like a charm even under massive stress tests.
- Hardware works as expected for a full memtest86+.
- System temps are well below the thermal limits.
- Confused person as a result. =)

Reproducible: Always

Steps to Reproduce:
1. Boot with cool'n'quiet enabled.
2. Run 2 instances of burnK7; a panic will occur as soon as it reaches 2.4GHz.

System setup:
Asus M3N - AMD 770 / SB600 chipset
AMD64 X2 4800+ 65W AM2+ ( http://en.wikipedia.org/wiki/Athlon_64_X2#Brisbane_.2865_nm_SOI.29 )
2GB of RAM.
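The powernow-k8 table in the report lists raw fid/vid codes, which are hard to compare against the 1.35V and 1.28V VCore readings. The sketch below decodes them, assuming the usual K8 encodings: frequency = 800 MHz + 100 MHz per fid step, and voltage = 1.550V minus 25 mV per vid step for vid codes below 0x20. Neither formula is stated in this thread, so treat both as assumptions; the fid formula at least reproduces the MHz values the driver printed. The helper names are made up for illustration.

```python
# Sketch: decode the fid/vid pairs printed by powernow-k8 in the report.
# ASSUMED encodings (not stated in the thread):
#   frequency_mhz = 800 + 100 * fid
#   voltage_mv    = 1550 - 25 * vid   (only for vid codes below 0x20)

def fid_to_mhz(fid: int) -> int:
    """Convert a K8 frequency ID to MHz under the assumed encoding."""
    return 800 + 100 * fid


def vid_to_mv(vid: int) -> int:
    """Convert a K8 voltage ID to millivolts under the assumed encoding."""
    if not 0 <= vid < 0x20:
        raise ValueError(f"vid 0x{vid:x} is outside the range this sketch handles")
    return 1550 - 25 * vid


# (fid, vid) pairs copied from the powernow-k8 lines in the report.
PSTATES = [(0x11, 0x8), (0x10, 0x9), (0xE, 0xB), (0xC, 0xD), (0xA, 0xF), (0x2, 0x12)]


def decode_table(pstates):
    """Return a list of (MHz, mV) tuples for the given (fid, vid) pairs."""
    return [(fid_to_mhz(fid), vid_to_mv(vid)) for fid, vid in pstates]


if __name__ == "__main__":
    for mhz, mv in decode_table(PSTATES):
        print(f"{mhz} MHz -> {mv / 1000:.3f} V")
```

If the assumed vid formula holds, the top state (vid 0x8) would ask for 1.350V, which matches the VCore seen without cool'n'quiet, and a vid code of 0 would decode to exactly 1.550V, the value lm-sensors keeps showing. That would fit the "lm-sensors fluke" theory, but it is only an inference from the assumed encoding.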
Are the panics consistent at all? This sounds like broken hardware causing random memory corruption, in which case there isn't much we can do, sorry. You could try checking for BIOS upgrades; cool'n'quiet requires BIOS support and bugs there could be causing trouble.

If you want to continue investigation here, please post your kernel config and full dmesg output from shortly after boot. Which motherboard you have might be useful to know, too.
Created attachment 153127: dmesg with cool'n'quiet enabled
Created attachment 153129: dmesg with cool'n'quiet disabled
(In reply to comment #1)
> Are the panics consistent at all?

Yes. When cool'n'quiet is enabled it panics ONLY when it goes beyond 2.2GHz (or when manually set to 2.4 or 2.5GHz), and it does so within 1-2 seconds at 100% load. Running at 100% on both cores at or below 2.2GHz (limited via scaling_max_freq) works perfectly.

> This sounds like broken hardware causing random memory corruption, in which
> case there isn't much we can do, sorry.

I disagree to a degree. If it were broken hardware, it should also show the same issues while running at 2.5GHz without cool'n'quiet, and/or show some strange behaviour when running with cool'n'quiet enabled but below 2.2GHz. Sure, it could be some strange, obscure issue with the multiplier setting, but then it should probably behave the same way when those values are set in the BIOS with CNQ disabled (I have tried all the problematic multiplier settings via the BIOS).

> You could try checking for BIOS upgrades; cool'n'quiet requires BIOS support
> and bugs there could be causing trouble.

Yep, and that's what i suspect. Most it might be that it's reading out the wrong vid it should set and causing problem.

> If you want to continue investigation here please post your kernel config and
> full dmesg output from shortly after boot. Which motherboard you have might be
> useful to know, too.

System setup:
Asus M3N - AMD 770 / SB600 chipset
AMD64 X2 4800+ 65W AM2+ ( http://en.wikipedia.org/wiki/Athlon_64_X2#Brisbane_.2865_nm_SOI.29 )
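For anyone trying to reproduce the "works at or below 2.2GHz" result, the cap mentioned above is applied through the cpufreq sysfs file scaling_max_freq, which takes a value in kHz. A minimal sketch follows; the function name and the root parameter are illustrative, and writing under /sys needs root.

```python
from __future__ import annotations

from pathlib import Path

# Standard location of the per-CPU cpufreq directories.
CPUFREQ_ROOT = Path("/sys/devices/system/cpu")


def cap_max_freq(max_khz: int, root: Path = CPUFREQ_ROOT) -> list[Path]:
    """Write max_khz into every cpuN/cpufreq/scaling_max_freq below root.

    Returns the files that were written. The root argument exists only so
    the function can be tried against a scratch directory first.
    """
    written = []
    for target in sorted(root.glob("cpu[0-9]*/cpufreq/scaling_max_freq")):
        target.write_text(f"{max_khz}\n")
        written.append(target)
    return written


# Example (as root): limit all cores to 2.2 GHz, expressed in kHz.
#   cap_max_freq(2_200_000)
```

Writing the highest available frequency back (2500000 on this system, going by the powernow-k8 table) should remove the cap again.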
Created attachment 153131: kernelconf
Spelling correction:

Yep, and that's what I suspect. Might be that it's reading out the wrong vid it should set, and that might be causing problems.
Sorry, by consistent panics I meant panicking in a consistent place in the kernel, i.e. something that would point to a kernel issue instead of a BIOS/hardware issue. As it is, there probably isn't much that the kernel can do about it. I suppose it could blacklist affected chips/states, but I'm not sure whether that would be feasible.

BTW, if you turn on CPU_FREQ_DEBUG the kernel will print details of state transitions, so you should be able to tell if it happens consistently on entry to a certain state.

Anyway, googling around, there seem to be lots of similar reports of instability with AM2+ chips using CnQ, on a variety of motherboards, especially at high clock speeds. Some are overclockers, but there also seem to be plenty running at rated speeds.
(In reply to comment #7)
> Sorry, by consistent panics I meant panicking in a consistent place in the
> kernel. I.e. something that would point to a kernel issue instead of a
> BIOS/hardware issue. As it is there probably isn't much that the kernel can do
> about it. I suppose it could blacklist affected chips/states, but I'm not sure
> whether that would be feasible.

Mm. I agree that the kernel probably doesn't have much to do with the actual panic, but if it does set some strange mode for the CPU then it would at least be the source of the problem.

> BTW, if you turn on CPU_FREQ_DEBUG the kernel will print details of state
> transitions, so you should be able to tell if it happens consistently on entry
> to a certain state.

Ah, that was something I forgot. I'll try that out as soon as I can. I'm not sure it will be able to log anything via the netconsole, but it's a good thing to try. I'll post a log as soon as I can.

Also, do you know of any tool to collect the current power settings (vid/fid etc.) from the system while in CnQ and non-CnQ mode? It might be good to see what values the BIOS sets and what values we get while in CnQ mode.

> Anyway, googling around there seem to be lots of similar reports of instability
> with AM2+ chips using CnQ, on a variety of motherboards, especially at high
> clock speeds. Some are overclockers but there also seem to be plenty running at
> rated speeds.

Oh, I googled some myself but did not find anything specific about it, at least not with this behaviour. (I'm also running at stock speed, just to confirm.)
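On the question of collecting the current fid/vid: one possible approach, sketched here under assumptions, is to read the K8 FID/VID status MSR through the msr driver (/dev/cpu/N/msr, needs root and the msr module loaded) and decode it by hand. The register number 0xC0010042 and the bit layout below follow what the powernow-k8 driver appears to use; they are not confirmed anywhere in this thread, so check them against the AMD documentation for the exact CPU before trusting the output. The function names are illustrative.

```python
import os
import struct

# ASSUMPTION: K8 FID/VID status register number as used by powernow-k8.
MSR_FIDVID_STATUS = 0xC0010042


def read_msr(cpu: int, reg: int) -> int:
    """Read one 64-bit MSR via /dev/cpu/<cpu>/msr (root + msr module needed)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        # The msr device uses the file offset as the register number.
        return struct.unpack("<Q", os.pread(fd, 8, reg))[0]
    finally:
        os.close(fd)


def decode_fidvid_status(value: int) -> dict:
    """Split a raw status value into fid/vid fields (ASSUMED bit layout)."""
    lo = value & 0xFFFFFFFF
    hi = value >> 32
    return {
        "current_fid": lo & 0x3F,
        "start_fid": (lo >> 8) & 0x3F,
        "max_fid": (lo >> 16) & 0x3F,
        "change_pending": bool(lo & 0x80000000),
        "current_vid": hi & 0x3F,
        "start_vid": (hi >> 8) & 0x3F,
    }


# Example (as root, on the affected machine):
#   print(decode_fidvid_status(read_msr(0, MSR_FIDVID_STATUS)))
```

Running this once with cool'n'quiet disabled and once with it enabled at 2.5GHz would show whether the two modes really end up with different vid codes for the same fid.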
Any updates here?
Yep, but I dropped the issue. The latest Ubuntu kernel does not show the same behaviour, and the slow updates in certain areas on Gentoo got me to reconsider running an Ubuntu derivative (I know, it's also slow, but you don't have to wait half a day to do a big upgrade :). After the switch the system has been running quietly with cool'n'quiet enabled.

I did try to do some logging before the switch, but it seems like what happens is that the chipset or something goes out of sync, and that causes memory corruption, which then causes the machine to panic, which in turn causes the kernel to be corrupt. It might be that it misreads the voltages in the DSDT table, but I did not see any big changes in the ACPI code when this started to appear.

I did notice the same issues (but much more seldom, and always during full load) on a similar system (M2N-E mainboard with an AM2 X2 4200+ CPU), and that system is also happily running Ubuntu with CnQ enabled.

But for my part you could close this case, if nobody else has experienced the same issues.
ok, thanks for responding