184852 – [2.6.21.2 regression] S3 ACPI wakeup from suspend broken


Status:	VERIFIED UPSTREAM

Product:	Gentoo Linux
Classification:	Unclassified
Component:	New packages
Hardware:	All Linux

Importance:	High major
Assignee:	Gentoo Kernel Bug Wranglers and Kernel Maintainers

Whiteboard:	linux-2.6.21.2-regression
Reported:	2007-07-10 15:44 UTC by devsk
Modified:	2008-12-04 13:55 UTC
CC List:	2 users

Attachments
dmesg immediately after booting into 2.6.21-r2 (the failed kernel) (dmesg.r2, 30.54 KB, text/plain) 2007-07-11 04:28 UTC, devsk
2117_sata-via-suspend.patch (2117_sata-via-suspend.patch, 583 bytes, patch) 2007-07-20 19:56 UTC, Maarten Bressers (RETIRED)
1001_linux-2.6.21.2.patch (1001_linux-2.6.21.2.patch, 81.73 KB, patch) 2007-07-20 19:57 UTC, Maarten Bressers (RETIRED)

Description devsk 2007-07-10 15:44:13 UTC

I can't resume from suspend-to-RAM with gentoo-sources 2.6.21-r2 or any later kernel, including 2.6.22. 2.6.21-r1 and all earlier gentoo-sources releases work flawlessly on my hardware. The most likely regression candidate is the patch 1001_linux-2.6.21.2.patch.

AMD3800+ X2
Nvidia Nforce4
sata_nv and amd IDE driver compiled in statically
No software or hardware change between successful resume and failed resume.

The suspend script I use unloads all modules. It should not matter what the script does, though, because every release before r2 has worked with the same software and hardware.

Comment 1 Daniel Drake (RETIRED) gentoo-dev

2007-07-10 21:13:26 UTC

Please post dmesg output after boot.
Please list all of the kernels you have tried.
Please elaborate on "I can't resume" -- what happens when you try?

Comment 2 devsk 2007-07-10 21:49:15 UTC

Kernels I have tried (and some I have in /lib/modules and /boot currently), all gentoo-sources:

2.6.17-r[47], 2.6.19-r[24], 2.6.21-r[123], 2.6.22-rc4 (vanilla with squashfs, vesafb-tng patches from gentoo-sources-2.6.21-r1) and 2.6.22.

Working kernels: all <= 2.6.21-r1

What happens when a kernel >= 2.6.21-r2 is powered on after suspend-to-RAM: the reset LED on the PC turns on and stays on, the monitor comes to life (it goes from orange to green), and that's about it. No messages on screen. Keyboard and mouse don't work, not even SysRq or the three-finger salute (no keyboard, duh!). I have to hit the reset button.

Output of dmesg after boot: I will have to be home and boot into r2 to post that.

Also, please understand that each failed suspend-resume cycle typically means that '/' and /home are shut down uncleanly and have to be fsck'ed during the next boot, and I have already lost a couple of files (nothing important). Although I do have backups, I would like to keep the experiments to a minimum. So please let me know what else you want me to do before I boot into the failing kernel.

Comment 3 devsk 2007-07-11 04:28:51 UTC

Created attachment 124507 [details]
dmesg immediately after booting into 2.6.21-r2 (the failed kernel)

It doesn't differ from the dmesg for 2.6.21-r1 (the good kernel) in any significant way. Most diffs are in the formatting of the SATA drive descriptions and in a few numbers such as clock speed, migration cost, etc.

Comment 4 devsk 2007-07-12 01:42:58 UTC

dsd, can you please tell me if you found something useful? Do I need to take this issue upstream?

Comment 5 Maarten Bressers (RETIRED) gentoo-dev

2007-07-20 19:55:11 UTC

I'm attaching two patches, 2117_sata-via-suspend.patch and 1001_linux-2.6.21.2.patch. Can you please revert these against gentoo-sources-2.6.21-r2 like this:

# patch -p1 -R < 2117_sata-via-suspend.patch
# patch -p1 -R < 1001_linux-2.6.21.2.patch

And see if that fixes the problem for you?
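
A hedged sketch of the revert step above: it only prints the commands, newest-numbered patch first, each preceded by a `--dry-run` check (a GNU patch option) so a revert that would not apply cleanly is noticed before the tree is modified. The helper name `revert_plan` is made up for illustration, and the assumption that the patches were applied in ascending numeric order (so they are reverted in descending order, as in the two commands above) is mine, not stated in this report.

```shell
# Hypothetical helper: print a revert plan for the given patch files.
# Assumption: patches were applied in ascending numeric order, so they
# are reverted in descending order. File names must not contain spaces.
revert_plan() {
    for p in $(printf '%s\n' "$@" | sort -r); do
        # Dry run first; only revert for real if the dry run succeeds.
        printf 'patch -p1 -R --dry-run < %s && patch -p1 -R < %s\n' "$p" "$p"
    done
}

# Prints the commands; run them by hand from the top of the kernel source tree.
revert_plan 1001_linux-2.6.21.2.patch 2117_sata-via-suspend.patch
```

Printing the plan instead of executing it keeps the sketch harmless; the printed lines are meant to be run from the top of the gentoo-sources-2.6.21-r2 tree.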

Comment 6 Maarten Bressers (RETIRED) gentoo-dev

2007-07-20 19:56:34 UTC

Created attachment 125486 [details, diff]
2117_sata-via-suspend.patch

Comment 7 Maarten Bressers (RETIRED) gentoo-dev

2007-07-20 19:57:09 UTC

Created attachment 125488 [details, diff]
1001_linux-2.6.21.2.patch

Comment 8 devsk 2007-07-22 21:27:39 UTC

Please give me some time to try this out. But, if I remember correctly, my earlier attempt had pointed out 1001_linux-2.6.21.2.patch as the culprit.

Comment 9 Daniel Drake (RETIRED) gentoo-dev

2007-08-05 22:23:40 UTC

Please reopen when you have identified which patch introduces the bug.

Comment 10 Micael Beronius 2007-08-29 18:28:11 UTC

OK, so I tested reverting the 1001_linux-2.6.21.2.patch on a clean 2.6.21-r2, and it did fix my issue with ACPI suspend-to-RAM.
I had the following issue: one suspend cycle (power on, suspend, wakeup) did work, but any following suspend would make the disks spin down and then the machine would hang somewhere, where a reset was the only resort.

I have tested 2.6.22 and 2.6.23-rc kernels and they have always given me the same problems.

AMDx2, nvidia chipset.

Can't re-open, I'm a newbie at bugzilla... ;-)

Comment 11 devsk 2007-08-29 18:59:20 UTC

Basically, that confirms what I thought I arrived at as well. Reopening.

Comment 12 Maarten Bressers (RETIRED) gentoo-dev

2007-09-18 22:47:10 UTC

Micael, can you please test with the latest development kernel, 2.6.23-rc6 as of this writing? Can you post your kernel .config and dmesg output (from after the successful suspend cycle)? Please provide a little more detail about your hardware.

devsk, do you mean to say that you're experiencing the exact same symptoms as Micael? Can you also test with latest development kernel and post your .config and dmesg output?

Comment 13 Daniel Drake (RETIRED) gentoo-dev

2007-09-19 11:09:43 UTC

Assuming that 2.6.23-rc is still broken, the next step is to do a bisection. 2.6.21.2 is a large patch over 2.6.21.1 so it's not obvious which change caused the problem, which we do need to find out.

It's a little time consuming (maybe 5 reboots?) but will almost certainly find the exact patch which caused the problem.
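
As a rough check on the "maybe 5 reboots" estimate: bisection halves the candidate range at every step, so the number of test boots grows with the base-2 logarithm of the number of commits between the good and bad versions. The sketch below is only that arithmetic; the function name `bisect_steps` is made up for illustration, the commit count is an input rather than a figure taken from this report, and git's own estimate can differ by a step.

```shell
# Hypothetical helper: smallest k such that 2^k >= n, i.e. roughly how
# many build-and-test rounds a bisection over n candidate commits needs.
bisect_steps() {
    n=$1
    steps=0
    span=1
    while [ "$span" -lt "$n" ]; do
        span=$((span * 2))
        steps=$((steps + 1))
    done
    echo "$steps"
}

bisect_steps 32
```

By this arithmetic, five test boots are enough for up to 32 candidate commits.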

http://www.reactivated.net/weblog/archives/2006/01/using-git-bisect-to-find-buggy-kernel-patches/

In this case, the git URL you need to use is:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-2.6.21.y.git

Use v2.6.21.1 as good and v2.6.21.2 as bad.

Thanks for your help figuring this out!

Comment 14 Micael Beronius 2007-09-19 17:58:12 UTC

Maarten,

Unfortunately, as this is a headless server that is not very easy to get to, I cannot do any tests right now. If there are no other volunteers, I'll try to move this box into an environment where I can connect a monitor and keyboard.

This is the data of the box, however:
Motherboard "Asus M2N32WS Professional", 1 GB RAM, AMD64 X2.
http://www.asus.com.tw/products.aspx?l1=3&l2=101&l3=300&l4=0&model=1207&modelmenu=2

lspci gives:
00:00.0 RAM memory: nVidia Corporation C51 Host Bridge (rev a2)
00:00.1 RAM memory: nVidia Corporation C51 Memory Controller 0 (rev a2)
00:00.2 RAM memory: nVidia Corporation C51 Memory Controller 1 (rev a2)
00:00.3 RAM memory: nVidia Corporation C51 Memory Controller 5 (rev a2)
00:00.4 RAM memory: nVidia Corporation C51 Memory Controller 4 (rev a2)
00:00.5 RAM memory: nVidia Corporation C51 Host Bridge (rev a2)
00:00.6 RAM memory: nVidia Corporation C51 Memory Controller 3 (rev a2)
00:00.7 RAM memory: nVidia Corporation C51 Memory Controller 2 (rev a2)
00:04.0 PCI bridge: nVidia Corporation C51 PCI Express Bridge (rev a1)
00:08.0 RAM memory: nVidia Corporation MCP55 Memory Controller (rev a1)
00:09.0 ISA bridge: nVidia Corporation MCP55 LPC Bridge (rev a2)
00:09.1 SMBus: nVidia Corporation MCP55 SMBus (rev a2)
00:09.2 RAM memory: nVidia Corporation MCP55 Memory Controller (rev a2)
00:0a.0 USB Controller: nVidia Corporation MCP55 USB Controller (rev a1)
00:0a.1 USB Controller: nVidia Corporation MCP55 USB Controller (rev a2)
00:0c.0 IDE interface: nVidia Corporation MCP55 IDE (rev a1)
00:0d.0 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a2)
00:0d.1 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a2)
00:0d.2 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a2)
00:0e.0 PCI bridge: nVidia Corporation Unknown device 0370 (rev a2)
00:10.0 Bridge: nVidia Corporation MCP55 Ethernet (rev a2)
00:11.0 Bridge: nVidia Corporation MCP55 Ethernet (rev a2)
00:12.0 PCI bridge: nVidia Corporation Unknown device 0376 (rev a2)
00:14.0 PCI bridge: nVidia Corporation Unknown device 0374 (rev a2)
00:15.0 PCI bridge: nVidia Corporation Unknown device 0378 (rev a2)
00:16.0 PCI bridge: nVidia Corporation Unknown device 0375 (rev a2)
00:17.0 PCI bridge: nVidia Corporation Unknown device 0377 (rev a2)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
02:06.0 VGA compatible controller: S3 Inc. ViRGE/DX or /GX (rev 01)
03:00.0 PCI bridge: NEC Corporation uPD720400 PCI Express - PCI/PCI-X Bridge (rev 06)
03:00.1 PCI bridge: NEC Corporation uPD720400 PCI Express - PCI/PCI-X Bridge (rev 06)
06:00.0 RAID bus controller: Silicon Image, Inc. SiI 3132 Serial ATA Raid II Controller (rev 01)
07:00.0 RAID bus controller: Silicon Image, Inc. SiI 3132 Serial ATA Raid II Controller (rev 01)
08:00.0 SATA controller: Marvell Technology Group Ltd. Unknown device 6141 (rev 01)

The VGA board and the Sil3132 can be excluded from the quest, since I tested without these boards installed.

Comment 15 Maarten Bressers (RETIRED) gentoo-dev

2007-10-06 22:52:09 UTC

Micael / devsk, please see Comment #13. If/when (one of) you can provide this information (ie. the patch that causes this), please reopen.

Comment 16 devsk 2007-12-29 03:14:25 UTC

Since I am not able to move to any kernel later than 2.6.21-r1 because of this bug, I took the bisect plunge (with the instructions from dsd's page and the URL from this bug report), and this is where it ended. Can someone please tell me how to find what change (diff) this commit introduced? If "Makefile" is any indication, I believe I messed up during the bisect.

# git bisect good
7682ffa25c68221cff1122b3ce26a05640a54898 is first bad commit
commit 7682ffa25c68221cff1122b3ce26a05640a54898
Author: Chris Wright <chrisw@sous-sol.org>
Date:   Wed May 23 14:33:55 2007 -0700

    Linux 2.6.21.2

:100644 100644 58c08d1b738fd6bb2dd9e679a78bd45fa43ec4a3 22839cb65557b6f572fa8b660a274cb9a5102e76 M      Makefile
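
On the question of how to see what a commit changed: `git show <commit>` prints the commit message followed by its diff, and `git show --stat <commit>` lists only the files touched. The sketch below demonstrates this on a throwaway repository whose Makefile and commit messages are toy stand-ins; in the bisected tree the argument would be the hash reported above instead of `HEAD`.

```shell
# Toy repository with a two-commit history; not the kernel tree.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git config user.name "Show Demo"
git config user.email "demo@example.invalid"

printf 'EXTRAVERSION = .1\n' > Makefile
git add Makefile
git commit -q -m "toy stand-in for 2.6.21.1"
printf 'EXTRAVERSION = .2\n' > Makefile
git commit -q -a -m "toy stand-in for 2.6.21.2"

# Summary of which files the commit touched.
git --no-pager show --stat --format='%h %s' HEAD
# Just the file names, convenient for scripts.
changed_files=$(git diff-tree --no-commit-id --name-only -r HEAD)
echo "changed: $changed_files"
# Full diff introduced by the commit.
git --no-pager show HEAD
```

A commit that touches only the version string in the Makefile is a release-tagging commit, so a bisection that ends there suggests the regression comes from something outside the bisected tree.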

Comment 17 devsk 2007-12-29 03:39:14 UTC

That commit just changes EXTRAVERSION from 1 to 2. This really means there is a bad interaction between the base fixes in 2.6.21.2 and the Gentoo-specific libata patches, because none of the libata patches were applied on top of 2.6.21.1.

Testing the taking out of "scsi handles suspend/resume" patches (211*) now.

Comment 18 devsk 2007-12-29 03:57:01 UTC

OK, I took out all the SCSI-related patches, and suspend-to-RAM works in 2.6.21-r4 just as it did in 2.6.21-r1.

2118_scsi-constants.patch
2117_sata-via-suspend.patch
2116_libata-remove-spindown-compat.patch
2115_libata-spindown-status.patch
2114_libata-shutdown-warning.patch
2113_libata-spindown-compat.patch
2112_libata-suspend.patch
2111_sd-start-stop.patch
2110_scsi-sd-printing.patch

It looks like these have been merged upstream, because they are not applied by the 2.6.22 or 2.6.23 gentoo-sources ebuilds and the problem happens in those kernels too.

Does anybody know when these patches were merged into 2.6.22?

Comment 19 devsk 2007-12-29 05:41:55 UTC

OK. After several reboots and resets, the patch that killed suspend-to-RAM on my machine is 2112_libata-suspend.patch. This patch went into 2.6.22-rc1, hence all newer kernels don't work for me either.

What I notice is that when suspend-to-RAM is done and the resume hangs, the BIOS does not find one of my disks on "RESET". I have to RESET again for it to be found. It's a Maxtor 300GB SATA drive with a higher than average spin-up time. Is there a limit on how fast the drive is supposed to spin up after resume? It looks like my 3 other drives have a spin-up time of < 5ms and they are found by the BIOS upon RESET from a bad resume, whereas my Maxtor drive has a spin-up time of > 25ms. So this comment in the commit is relevant: "Resume now has to wait for disk to spin up before proceeding."

I have a feeling that the resume hangs while waiting to detect this drive. Does anybody know how this patch could break drive detection and resume?

Comment 20 devsk 2007-12-29 06:47:13 UTC

Lastly, I removed the libata-suspend patch from 2.6.22-rc1 and suspend-to-RAM works again. So it seems to be an upstream issue now.

Comment 21 devsk 2007-12-29 15:49:56 UTC

Filed bug http://bugzilla.kernel.org/show_bug.cgi?id=9659