219464 – Strange panics.

Status:	RESOLVED CANTFIX

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	[OLD] Core system
Hardware:	AMD64 Linux

Importance:	High major
Assignee:	Gentoo Kernel Bug Wranglers and Kernel Maintainers

Reported:	2008-04-27 14:28 UTC by pakar
Modified:	2008-08-19 18:55 UTC
CC List:	2 users

Attachments
dmesg with cool'n'quiet enabled (dmesg.with-coolandquiet.txt, 26.43 KB, text/plain) 2008-05-14 17:06 UTC, pakar
dmesg with cool'n'quiet disabled (dmesg.without-coolandquiet.txt, 26.15 KB, text/plain) 2008-05-14 17:07 UTC, pakar
kernelconf (kernelconf.txt, 56.68 KB, text/plain) 2008-05-14 17:18 UTC, pakar

Description pakar 2008-04-27 14:28:31 UTC

Problems in 2.6.24/2.6.25

oops-messages:
First:
Apr 27 14:35:32 myth BUG: unable to handle kernel
Apr 27 14:35:32 myth NULL pointer dereference
Apr 27 14:35:32 myth at 0000000000000010
Apr 27 14:35:32 myth IP:
Apr 27 14:35:32 myth [<ffffffff803874eb>] rb_insert_color+0x2b/0xf0
Apr 27 14:35:32 myth PGD 61c68067
Apr 27 14:35:32 myth PUD 6e057067
Apr 27 14:35:32 myth PMD 0
Apr 27 14:35:32 myth
Apr 27 14:35:32 myth Oops: 0000 [1]
Second:
Apr 27 14:53:03 myth Eeek! page_mapcount(page) went negative! (-1)
Apr 27 14:53:03 myth page pfn = 52d70
Apr 27 14:53:03 myth page->flags = 5200000000001c
Apr 27 14:53:03 myth page->count = 0
Apr 27 14:53:03 myth page->mapping = 0000000000000000
Apr 27 14:53:03 myth vma->vm_ops = 0x0
Apr 27 14:53:03 myth ------------[ cut here ]------------
Apr 27 14:53:03 myth kernel BUG at mm/rmap.c:669!
Apr 27 14:53:03 myth invalid opcode: 0000 [1]

Due to some limitations I have only been able to test this on 2.6.24 and 2.6.25, but both seem to show the same issues.

The system boots over the network with AoE, and that has worked perfectly.
The issue appeared after I switched to an AM2+ CPU.

The issue seems to arise whenever I go to or beyond 2.4 GHz, and here is the strange thing: it works perfectly at anything lower,
without any hangs or crashes, but as soon as I go beyond 2.4 GHz it panics within a second or two.

Tests (with max_freq set to 2.2 GHz):
- 1 hour burn test with 2 burnK7 processes (no issues)
- Successful pass of memtest86+ (with the frequency at 2.5 GHz)
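
The 2.2 GHz cap used for these tests can be applied through the cpufreq sysfs interface. The following is a minimal sketch, assuming the standard /sys/devices/system/cpu/cpuN/cpufreq/scaling_max_freq layout with values in kHz. The helper name `cap_max_freq` and its `sysfs_root` parameter are hypothetical and exist only for illustration; writing to the real sysfs tree needs root.

```python
from pathlib import Path


def cap_max_freq(khz, sysfs_root="/sys/devices/system/cpu"):
    """Write `khz` to scaling_max_freq for every CPU that exposes cpufreq.

    Returns the list of files written. `sysfs_root` is a parameter only so
    that the function can be pointed at a scratch directory when testing.
    """
    written = []
    # cpu0, cpu1, ... (the pattern skips entries such as cpuidle or cpufreq)
    for cpu_dir in sorted(Path(sysfs_root).glob("cpu[0-9]*")):
        target = cpu_dir / "cpufreq" / "scaling_max_freq"
        if target.exists():
            target.write_text(f"{khz}\n")
            written.append(target)
    return written
```

Calling `cap_max_freq(2_200_000)` as root would correspond to the 2.2 GHz limit described here; nothing is written unless the function is called explicitly.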

Things that have been checked:
- Temperature: at max load it reaches around 45C before the system panics; idle is around 35-40C depending on the ambient temperature. System temperature is around 30-35C.
- Switched the CPU cooler just to get it a bit lower. Before, it was around 50-55C at load and 40-45C at idle.

One thing I have found that sounds a bit strange is that the vid value reported by lm-sensors is always 1.550V, which sounds a
bit high for the 65W CPU, at least when running at the lower speeds. But since that is also reported without cool'n'quiet enabled, it sounds
more like an lm-sensors fluke.
http://en.wikipedia.org/wiki/Athlon_64_X2#Brisbane_.2865_nm_SOI.29
specifies 1.25-1.35V for the CPU. This might just be a fluke in lm-sensors, or the kernel might not be correctly reading which
voltages it should set on the CPU, but the values it sets do not look totally screwed up:
powernow-k8: Found 1 AMD Athlon(tm) 64 X2 Dual Core Processor 4800+ processors (2 cpu cores) (version 2.20.00)
powernow-k8: 0 : fid 0x11 (2500 MHz), vid 0x8
powernow-k8: 1 : fid 0x10 (2400 MHz), vid 0x9
powernow-k8: 2 : fid 0xe (2200 MHz), vid 0xb
powernow-k8: 3 : fid 0xc (2000 MHz), vid 0xd
powernow-k8: 4 : fid 0xa (1800 MHz), vid 0xf
powernow-k8: 5 : fid 0x2 (1000 MHz), vid 0x12
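
The fid/vid codes in this table can be turned back into frequencies and core voltages. The following is a minimal sketch, assuming the K8 encodings that the powernow-k8 driver is understood to use (frequency = 800 MHz + 100 MHz per fid step, voltage = 1.550 V minus 25 mV per vid step). These formulas are not stated anywhere in this report, and the function names are illustrative only.

```python
import re

# One P-state line as printed by powernow-k8 in the excerpt above.
PSTATE_RE = re.compile(r"fid (0x[0-9a-f]+) \((\d+) MHz\), vid (0x[0-9a-f]+)")


def fid_to_mhz(fid):
    # Assumed K8 encoding: 800 MHz base plus 100 MHz per fid step.
    return 800 + 100 * fid


def vid_to_millivolts(vid):
    # Assumed K8 encoding: 1550 mV at vid 0, minus 25 mV per vid step.
    return 1550 - 25 * vid


def decode_pstates(dmesg_text):
    """Return (reported MHz, MHz computed from fid, mV computed from vid)."""
    rows = []
    for fid_hex, mhz, vid_hex in PSTATE_RE.findall(dmesg_text):
        fid, vid = int(fid_hex, 16), int(vid_hex, 16)
        rows.append((int(mhz), fid_to_mhz(fid), vid_to_millivolts(vid)))
    return rows


SAMPLE = """\
powernow-k8: 0 : fid 0x11 (2500 MHz), vid 0x8
powernow-k8: 1 : fid 0x10 (2400 MHz), vid 0x9
powernow-k8: 2 : fid 0xe (2200 MHz), vid 0xb
powernow-k8: 3 : fid 0xc (2000 MHz), vid 0xd
powernow-k8: 4 : fid 0xa (1800 MHz), vid 0xf
powernow-k8: 5 : fid 0x2 (1000 MHz), vid 0x12
"""

for reported, computed, millivolts in decode_pstates(SAMPLE):
    print(f"{reported} MHz (fid gives {computed} MHz): {millivolts / 1000:.3f} V")
```

Under these assumed encodings the six states decode to 1.350 V down to 1.100 V, and a vid code of 0 would decode to 1.550 V, the value lm-sensors keeps reporting.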

Also, when booting without cool'n'quiet enabled it reports the VCore at 1.35V, while manually setting the frequency to 2.5 GHz with cool'n'quiet
enabled only reported 1.28V.
This also feels like the primary suspect, since a load test does not cause a panic when booting without cool'n'quiet.

Another suspect could be the atl1 driver for the network card: since I'm booting via that, I might get corrupt data that is causing
all of this. But that does not sound too probable, since it crashes already-running processes and I'm not using any swap, so that theory also falls. And
why would it only occur at 2.4 GHz with cool'n'quiet enabled?

So to summarize,
- With cool'n'quiet enabled the system panics whenever it goes beyond 2.4 GHz and the system is loaded.
- Without cool'n'quiet the system works like a charm even under massive stress-tests.
- Hardware works as expected for a full memtest86+.
- System temps are well below the thermal limits.
- Confused person as a result. =)

Reproducible: Always

Steps to Reproduce:
1. Boot with cool'n'quiet enabled.
2. Run 2 instances of burnK7; a panic will occur as soon as it reaches 2.4 GHz.

System-setup.
Asus M3N - AMD 770 / SB600 chipset
AMD64 X2 4800+ 65W AM2+ ( http://en.wikipedia.org/wiki/Athlon_64_X2#Brisbane_.2865_nm_SOI.29 )
2 GB of RAM.

Comment 1 Duane Griffin 2008-05-14 15:00:58 UTC

Are the panics consistent at all? This sounds like broken hardware causing random memory corruption, in which case there isn't much we can do, sorry. You could try checking for BIOS upgrades, cool'n'quiet requires BIOS support and bugs there could be causing trouble.

If you want to continue investigation here please post your kernel config and full dmesg output from shortly after boot. Which motherboard you have might be useful to know, too.

Comment 2 pakar 2008-05-14 17:06:41 UTC

Created attachment 153127 [details]
dmesg with cool'n'quiet enabled

Comment 3 pakar 2008-05-14 17:07:13 UTC

Created attachment 153129 [details]
dmesg with cool'n'quiet disabled

Comment 4 pakar 2008-05-14 17:17:49 UTC

(In reply to comment #1)
> Are the panics consistent at all?
Yes, when cool'n'quiet is enabled it panics ONLY when it goes beyond 2.2 GHz, or when manually set to 2.4 or 2.5 GHz, and it does so within 1-2 seconds at 100% load. When running at 100% on both cores at or below 2.2 GHz (limited via scaling_max_freq) it works perfectly.

> This sounds like broken hardware causing random memory corruption, in which case there isn't much we can do, sorry.

I disagree there to a degree. If it were broken hardware, it should also show the same issues while running at 2.5 GHz without cool'n'quiet, and/or show some strange behaviour when running with cool'n'quiet enabled but below 2.2 GHz. Sure, it could be some strange, obscure issue with the multiplier setting, but then it should probably behave this way when setting those values in the BIOS with CNQ disabled (I have tried all problematic multiplier settings via the BIOS).

> You could try checking for BIOS upgrades, cool'n'quiet requires BIOS support and bugs there could be causing trouble.

Yep, and that's what i suspect. Most it might be that it's reading out the wrong vid it should set and causing problem.

> If you want to continue investigation here please post your kernel config and
> full dmesg output from shortly after boot. Which motherboard you have might be
> useful to know, too.


System-setup.
Asus M3N - AMD 770 / SB600 chipset
AMD64 X2 4800+ 65W AM2+ ( http://en.wikipedia.org/wiki/Athlon_64_X2#Brisbane_.2865_nm_SOI.29 )

Comment 5 pakar 2008-05-14 17:18:06 UTC

Created attachment 153131 [details]
kernelconf

Comment 6 pakar 2008-05-14 17:20:50 UTC

Spelling-correction

Yep, and that's what i suspect. Might be that it's reading out the
wrong vid it should set and that might be causing problems.

Comment 7 Duane Griffin 2008-05-14 20:04:36 UTC

Sorry, by consistent panics I meant panicking in a consistent place in the kernel, i.e. something that would point to a kernel issue instead of a BIOS/hardware issue. As it is, there probably isn't much that the kernel can do about it. I suppose it could blacklist affected chips/states, but I'm not sure whether that would be feasible.
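
Whether crashes land in a consistent place can be checked by tallying the reported crash site from each oops in the logs. The following is a minimal sketch, assuming only the two message shapes quoted in this report (an "IP:" line followed by "[<address>] symbol+0xoffset/0xsize", and "kernel BUG at file:line!"). The function name `crash_sites` is illustrative.

```python
import re
from collections import Counter

# "IP:" marker; the word boundary keeps "RIP:" register lines from matching.
IP_MARK_RE = re.compile(r"\bIP:")
# "[<address>] symbol+0xoff/0xsize" as printed for the faulting instruction.
SYMBOL_RE = re.compile(r"\[<[0-9a-f]+>\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")
# "kernel BUG at file:line!" as printed when a BUG() check fires.
BUG_RE = re.compile(r"kernel BUG at (\S+?):(\d+)!")


def crash_sites(log_text):
    """Count crash sites: the faulting symbol, or the file:line of a BUG."""
    sites = Counter()
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        bug = BUG_RE.search(line)
        if bug:
            sites[f"{bug.group(1)}:{bug.group(2)}"] += 1
            continue
        mark = IP_MARK_RE.search(line)
        if not mark:
            continue
        # The symbol is either on the same line or, as in the syslog capture
        # in this report, on the line that follows "IP:".
        for text in [line[mark.end():]] + lines[i + 1:i + 2]:
            symbol = SYMBOL_RE.search(text)
            if symbol:
                sites[symbol.group(1)] += 1
                break
    return sites


SAMPLE = """\
Apr 27 14:35:32 myth IP:
Apr 27 14:35:32 myth [<ffffffff803874eb>] rb_insert_color+0x2b/0xf0
Apr 27 14:53:03 myth kernel BUG at mm/rmap.c:669!
"""

print(crash_sites(SAMPLE).most_common())
```

A tally dominated by one symbol would point toward a kernel issue, while sites scattered across unrelated subsystems would fit random memory corruption better.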

BTW, if you turn on CPU_FREQ_DEBUG the kernel will print details of state transitions, so you should be able to tell if it happens consistently on entry to a certain state.

Anyway, googling around there seem to be lots of similar reports of instability with AM2+ chips using CnQ, on a variety of motherboards, especially at high clock speeds. Some are overclockers but there also seem to be plenty running at rated speeds.

Comment 8 pakar 2008-05-15 17:45:01 UTC

(In reply to comment #7)
> Sorry, by consistent panics I meant panicking in a consistent place in the
> kernel. I.e. something that would point to a kernel issue instead of a
> BIOS/hardware issue. As it is there probably isn't much that the kernel can do
> about it. I suppose it could blacklist affected chips/states, but I'm not sure
> whether that would be feasible.

Mm, I agree that the kernel probably doesn't have much to do with the actual panic, but if it does set some strange mode for the CPU then it would at least be the source of the problem.

> 
> BTW, if you turn on CPU_FREQ_DEBUG the kernel will print details of state
> transitions, so you should be able to tell if it happens consistently on entry
> to a certain state.

Ah, that was something I forgot... I'll try that out as soon as I can. I'm not sure whether it will be able to log anything via netconsole, but it's a good thing to try, and I'll post a log as soon as I can. Also, do you know of any tool to collect the current power settings (vid/fid etc.) from the system in CnQ and non-CnQ mode? It might be good to see what values the BIOS sets and what values we get in CnQ mode.
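
One possible way to collect the current fid/vid is to read the K8 FIDVID status register through the kernel's msr driver. The following is a minimal sketch, assuming a family 0Fh CPU, the msr module loaded, root privileges, and a register layout believed to match the powernow-k8 header definitions (MSR 0xC0010042, current fid in bits 5:0, current vid in bits 37:32). None of this comes from the report itself, so the layout should be checked against AMD's BIOS and Kernel Developer's Guide before relying on it; the function names are illustrative.

```python
import os
import struct

# Assumed address of the K8 (family 0Fh) FIDVID status MSR.
MSR_FIDVID_STATUS = 0xC0010042


def decode_fidvid_status(value):
    """Split a 64-bit FIDVID status value into current fid and vid codes."""
    low, high = value & 0xFFFFFFFF, value >> 32
    # Assumed layout: current fid in low[5:0], current vid in high[5:0].
    return {"fid": low & 0x3F, "vid": high & 0x3F}


def read_fidvid_status(cpu=0):
    """Read the MSR via /dev/cpu/N/msr (needs root and the msr module)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        # The msr device uses the file offset as the register number.
        raw = os.pread(fd, 8, MSR_FIDVID_STATUS)
    finally:
        os.close(fd)
    return decode_fidvid_status(struct.unpack("<Q", raw)[0])
```

Calling `read_fidvid_status(0)` at the same frequency with CnQ on and off would show whether the two modes end up with different vid codes.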

> 
> Anyway, googling around there seem to be lots of similar reports of instability
> with AM2+ chips using CnQ, on a variety of motherboards, especially at high
> clock speeds. Some are overclockers but there also seem to be plenty running at
> rated speeds.
> 

Oh, I googled some myself but did not find anything specific about it, at least not with this behaviour. (I'm also running at stock speed, just to confirm.)

Comment 9 Mike Pagano gentoo-dev 2008-08-19 00:46:46 UTC

Any updates here?

Comment 10 pakar 2008-08-19 18:36:29 UTC

Yep, but I dropped the issue. The latest Ubuntu kernel does not show the same behaviour, and the slow updates in certain areas on Gentoo got me to reconsider running an Ubuntu derivative (I know, it's also slow, but you don't have to wait half a day to do a big upgrade :). After the switch the system has been running quietly with cool'n'quiet enabled.

I did try to do some logging before the switch, but it seems like what happens is that the chipset or something goes out of sync, and that causes memory corruption, which causes the machine to panic, which in turn corrupts the kernel. It might be that it misreads the voltages in the DSDT table, but I did not see any big changes in ACPI when this started to appear. I did notice the same issues (much more seldom, and always during full load) on a similar system (M2N-E mainboard with an AM2 X2 4200+ CPU), and that system is also happily running Ubuntu with CnQ enabled.

But for my part you could close this case, if nobody else has experienced the same issues.

Comment 11 Mike Pagano gentoo-dev 2008-08-19 18:55:53 UTC

OK, thanks for responding.