Bug 444137 - sys-kernel/gentoo-sources-3.5.7: kernel panic on Sony Vaio VPCS13C5E
Summary: sys-kernel/gentoo-sources-3.5.7: kernel panic on Sony Vaio VPCS13C5E
Status: RESOLVED UPSTREAM
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: [OLD] Core system
Hardware: AMD64 Linux
Importance: Normal normal
Assignee: Gentoo Kernel Bug Wranglers and Kernel Maintainers
URL:
Whiteboard: linux-bugzilla-pending linux-3.5-regr...
Keywords:
Depends on:
Blocks:
 
Reported: 2012-11-21 10:18 UTC by Oliver Deppert
Modified: 2012-12-02 13:25 UTC
CC List: 2 users

See Also:
Package list:
Runtime testing required: ---


Attachments
Kernel Config from "running" 3.4.9 (kernel_config_3_4_9.txt,72.85 KB, text/plain)
2012-11-21 10:18 UTC, Oliver Deppert
Kernel Config from "panicked" 3.5.7 (kernel_config_3_5_7.txt,74.20 KB, text/plain)
2012-11-21 10:18 UTC, Oliver Deppert
"lspci -k" from "running" 3.4.9 (lspci_3_4_9.txt,4.18 KB, text/plain)
2012-11-21 10:19 UTC, Oliver Deppert
dmesg from "running" 3.4.9 (dmesg_3_4_9.txt,15.19 KB, text/plain)
2012-11-21 10:19 UTC, Oliver Deppert

Description Oliver Deppert 2012-11-21 10:18:12 UTC
Created attachment 330121 [details]
Kernel Config from "running" 3.4.9

Versions used: gentoo-sources-3.4.9, 3.5.7 and 3.6.4 on x86_64
Attached: - "lspci -k" from the "running" 3.4.9
          - dmesg from the "running" 3.4.9 (a dmesg from the panicked 3.5.7 could not be captured)

Hi all,

Recently I decided to upgrade my kernel from 3.4.9 to 3.5.7 (gentoo-sources) by copying the ".config" file to the new kernel tree and running "make oldconfig".

I have done every kernel upgrade since the 2.x branch this way, but this time the new kernel panicked quite early in the boot phase. I was unable to read the reason, because the output scrolled by too fast.

I rebooted with the old kernel 3.4.9 and everything works fine, so I decided to boot the new kernel 3.5.7 with the "acpi=off" option. With this option the new kernel also boots without any kernel panic so far.

I also tried the unstable kernel 3.6.4 from gentoo-sources, which again panics unless "acpi=off" is given.

I've attached the config of 3.4.9 and the config of 3.5.7, which is in principle exactly the same config after "make oldconfig" and merging in the new kernel options.

The "acpi=off" workaround isn't really an option, because I'd like to have cpufreq and the like on my laptop.

Does anybody know the reason, why 3.4.9 works fine on my Sony Vaio VPCS13C5E and 3.5.7 crashes with a kernel panic triggered by an ACPI error?

Any help would be greatly appreciated!

with kind regards,
Oliver
Comment 1 Oliver Deppert 2012-11-21 10:18:44 UTC
Created attachment 330123 [details]
Kernel Config from "panicked" 3.5.7
Comment 2 Oliver Deppert 2012-11-21 10:19:14 UTC
Created attachment 330125 [details]
"lspci -k" from "running" 3.4.9
Comment 3 Oliver Deppert 2012-11-21 10:19:48 UTC
Created attachment 330127 [details]
dmesg from "running" 3.4.9
Comment 4 Mikle Kolyada (RETIRED) archtester Gentoo Infrastructure gentoo-dev Security 2012-11-21 11:34:31 UTC
Please do not do it.
Comment 5 Sergey Popov (RETIRED) gentoo-dev 2012-11-21 11:43:29 UTC
Oliver, did you send this report to bugzilla.kernel.org?
Comment 6 Tom Wijsman (TomWij) (RETIRED) gentoo-dev 2012-11-21 15:50:11 UTC
> Oliver, did you send this report to bugzilla.kernel.org?

This would not yield any solution since we don't know the problem yet.

> unable to "read" the reason, cause it was too fast...

Compile your kernel with CONFIG_BOOT_PRINTK_DELAY=y

Then, boot your system with the extra kernel parameter boot_delay=N where you set N to a value that is convenient for being able to read / capture the error.

If you can, take a picture or capture a video...

> resulting in a kernel-panic when missing "acpi=off"

You could see whether it lies in the PCI part or not using pci=noacpi as an option.

> Does anybody know the reason, why 3.4.9 works fine on my Sony Vaio VPCS13C5E and 3.5.7 crashes with a kernel panic triggered by an ACPI error?

Being able to read the error might help. Your explanation makes me wonder if you can still read the end of the kernel panic, or does that go away too for some reason?

If you can't get hold of the error, you can figure out the last version for which this worked and the first for which it broke. We can then look at a git diff between the two, and proceed with a git bisect if the diff doesn't reveal any relevant cause for your problem.

So, could you see what the last version is for which it still works and the first version for which it breaks?
Comment 7 Tom Wijsman (TomWij) (RETIRED) gentoo-dev 2012-11-21 15:55:31 UTC
Forgot to note, the N in boot_delay=N is in milliseconds, so you probably want to use 1000 (one second) and adapt from there when that's too slow / fast. Interactively editing the kernel line in grub can be handy so you don't have to boot into your working kernel every time to adapt the value to what you like.
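
As a rough sketch of the two steps above (the config option, then the kernel parameter), the following helpers check a kernel .config for CONFIG_BOOT_PRINTK_DELAY and build a command line with boot_delay appended. The function names and the example root= value are made up for illustration; only the option and parameter names come from this thread.

```shell
# Return success if the given kernel .config enables boot_delay support.
has_boot_printk_delay() {
    grep -q '^CONFIG_BOOT_PRINTK_DELAY=y' "$1"
}

# Append boot_delay=<milliseconds> to a kernel command line string.
# $1 = existing command line (hypothetical example: "root=/dev/sda3")
# $2 = delay per printed message, in milliseconds
with_boot_delay() {
    printf '%s boot_delay=%s\n' "$1" "$2"
}
```

For example, `has_boot_printk_delay /usr/src/linux/.config` tells you whether the kernel needs a rebuild first, and `with_boot_delay "root=/dev/sda3" 1000` prints the line you would type at the grub prompt.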
Comment 8 Oliver Deppert 2012-11-21 16:25:30 UTC
> Compile your kernel with CONFIG_BOOT_PRINTK_DELAY=y
OK, I will try this with the "boot_delay" grub option. Hopefully I can capture a video with my phone; I tried this before, but the output was too fast even for my phone camera.

> Oliver, did you send this report to bugzilla.kernel.org?
No, I haven't sent it to bugzilla.kernel.org so far, but I will do so for sure once I have an idea where the bug lives.

> You could see whether it lies in the PCI part or not using pci=noacpi as an option.
I've tried a lot of grub options so far, like noapic and many others, successively removing one option at a time. I also tried pci=noacpi, but the only option that let the kernel boot without a panic was acpi=off.

> Being able to read the error might help. Your explanation makes me wonder if you can still read the end of the kernel panic, or does that go away too for some reason?
No, I can read the end of the panic, but I was looking for the process that triggers it. The only thing I'm able to read is the trace output: just numbers, none of which I can relate to anything ACPI-specific. But as I said, I can recompile, give "boot_delay" a try and capture a video of the full trace.

>So, could you see what the last version is for which it still works and the first version for which it breaks?
OK, from the gentoo-sources point of view, the last version that works fine is 3.4.9 and the first version where the panic occurs is 3.5.7. I can also compile versions directly from kernel.org to try the version numbers in between, if necessary.

Thanks for your help!

with kind regards,
Oliver
Comment 9 Tom Wijsman (TomWij) (RETIRED) gentoo-dev 2012-11-21 17:03:43 UTC
Hopefully you'll see a clear error that way, that'll surely help trim down what to look for; don't forget to compile the config option in the kernel in order to support the kernel parameter. I suppose /var/log/messages isn't updated with more information on the error? Is it perhaps possible to attach a serial cable and capture the errors through a serial console?

> but I can also directly compile versions from kernel.org, to give a try on version numbers in between, if necessary....

There are two unmasked packages in between you could try.

Going directly from kernel.org might result in some unnecessary manual / double work, if you want to go down that road then you might as well start with a git bisect instead.

Basically, it takes all the changes between the two versions known to be "good" and "bad" and splits them in the middle, so that you test only the first half and determine whether that is "good" or "bad". This yields a new range between two versions that is half the size of the former range.

Example for 8 changes:

    Try the first 4 changes made
    --GOOD--> Add 2 changes to it, so we try the first 6 changes
    ---BAD--> Remove 1 change from it, so we try the first 5 changes
    --GOOD--> The first 5 changes are good and the first 6 are bad, thus change 6 is bad.

So, for 8 changes you only need to try 3 times. (8 / 2 / 2 / 2 = 1)
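
The halving in this example can be replayed with a small simulation. This is only an illustrative sketch: the function name is made up, and the "first bad change" argument stands in for the real test (build, boot, and see whether it panics).

```shell
# Simulate the bisection from the example.
# $1 = number of changes in the range
# $2 = first bad change (hypothetical oracle replacing the real boot test)
# Prints "<first bad change> <number of tries>".
simulate_bisect() {
    good=0          # zero changes applied: known good
    bad=$1          # all changes applied: known bad
    tries=0
    while [ $((bad - good)) -gt 1 ]; do
        mid=$(( (good + bad) / 2 ))
        tries=$((tries + 1))
        if [ "$mid" -ge "$2" ]; then
            bad=$mid    # the first $mid changes already contain the bad one
        else
            good=$mid   # the first $mid changes are still fine
        fi
    done
    echo "$bad $tries"
}

simulate_bisect 8 6   # prints "6 3": tries 4 (good), 6 (bad), 5 (good)
```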

Since your range is still quite big, it will take a few runs before it gets small; I don't know how many it will take until you hit a single commit. If you get bored and want to continue some other day, you can also share how far you got (by attaching the partial log) so that we can look through that diff.

See the URL below for more information and how to do it:

http://wiki.gentoo.org/wiki/Kernel_git-bisect
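
A minimal sketch of what such a session could look like, assuming it is run inside a git clone of the kernel. The helper names are invented for illustration; `git rev-list --count`, `git bisect start` and `git bisect log` are the real git commands. The step estimate is roughly log2 of the number of commits in the range, rounded up.

```shell
# Estimate how many build-and-boot rounds a bisect over N commits needs
# (roughly log2(N), rounded up).
estimate_bisect_steps() {
    n=$1
    steps=0
    while [ "$n" -gt 1 ]; do
        n=$(( (n + 1) / 2 ))
        steps=$((steps + 1))
    done
    echo "$steps"
}

# Start a bisect between a known-good and a known-bad ref and print the estimate.
# Must be run inside a kernel git clone; both refs are supplied by the caller.
start_kernel_bisect() {
    good_ref=$1
    bad_ref=$2
    count=$(git rev-list --count "${good_ref}..${bad_ref}") || return 1
    echo "commits in range: $count, about $(estimate_bisect_steps "$count") rounds"
    git bisect start "$bad_ref" "$good_ref"
}

# Save the progress so far, e.g. to attach a partial log to a bug report.
save_bisect_log() {
    git bisect log > "${1:-bisect.log}"
}
```

One possible invocation would be `start_kernel_bisect v3.4.11 v3.5.6`, assuming the upstream tags behave like the corresponding gentoo-sources versions (which carry extra patches). After each test boot, mark the result with `git bisect good` or `git bisect bad`.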
Comment 10 Oliver Deppert 2012-11-24 06:48:53 UTC
Hi,

about the current status...

so far, I've tried the two other unmasked versions from gentoo-sources:
3.4.11 -> worked
3.5.6 -> kernel panic

Attached the link to the screen capture of the kernel-panic from 3.5.6 with boot_delay=150

https://docs.google.com/folder/d/0B2OfgWxpfOBSRDVJUUtmVTNXQ2c/edit

looks like a problem with the kernel option "Sony Vaio Laptop Extras"...I will give a try by removing this kernel option...we will see..

If it isn't related to Sony Vaio Extras, I will give a try on the kernel bisect...I think, it will take a little longer...

>I suppose /var/log/messages isn't updated with more information on the error? Is it perhaps possible to attach a serial cable and capture the errors through a serial console?
No, /var/log/messages isn't filled, because it's on an encrypted root drive and the panic occurs before I get a chance to decrypt it. A serial cable isn't an option either, since no second machine is available to connect to.

regards,
Oliver
Comment 11 Oliver Deppert 2012-11-24 07:46:24 UTC
Hi,

ok...I've found the reason:
It is related to the kernel option "SONY_LAPTOP"...without the Sony Laptop Extras option the Kernel 3.5.6 boots well without any panic...

the strange thing:
The kernel changelogs between 3.4.11 and 3.5.6 don't contain any commits related to this particular option. The only entries related to either Sony or Vaio are in the changelogs of 3.4.12 and 3.5.5, and they don't look related to Sony Laptop Extras.

I guess something changed in the core ACPI kernel driver between 3.4.11 and 3.5.6 and the "Sony Laptop Extras" option was perhaps not adapted to it. This option has at least something to do with ACPI, because it hooks up the Vaio special keys and the backlight dimmer through the BIOS ACPI table.

so far, the priority to "solve" this bug isn't as high as thought before, but it would be nice to solve it for future kernel versions...to be able to use the special Sony keys and stuff...

Should I report this to the kernel bugzilla?

regards,
Oliver
Comment 12 Tom Wijsman (TomWij) (RETIRED) gentoo-dev 2012-11-24 12:17:30 UTC
The buffer_call procedure, visible in your video as the function where it fails, was introduced in this commit:

commit ebcef1b0e41f2ff972e5c5487a30e8f4ee2b6f13
Author: Mattia Dongili <malattia@linux.it>
Date:   Sat May 19 22:35:46 2012 +0900

    sony-laptop: generalise ACPI calls into SNC functions
    
    All calls into the SNC device methods have zero or one arguments and
    return an integer or a buffer (some functions go as far as returning an
    integer _or_ a buffer depending on the input parameter...).
    This allows simplifying a couple of code paths and prepares the field
    for other users of functions returning buffers.
    
    Signed-off-by: Mattia Dongili <malattia@linux.it>
    Signed-off-by: Matthew Garrett <mjg@redhat.com>

http://www.spinics.net/lists/platform-driver-x86/msg03319.html

This function contains some code that deals with memory at the end, so there is probably a bug there that triggers the swapper kernel panic.

Since newer commits depend on this commit it isn't easy to revert the changes from this commit through a patch, but what you could do to check whether this is the offending commit is (ATTENTION: Make sure you first enable the offending config variable again):

    cd /usr/src
    git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-stable
    ln -s linux-stable linux
    mount /boot
    cd /usr/src/linux

    git checkout ebcef1b0e41f2ff972e5c5487a30e8f4ee2b6f13^

    cp ../linux-your-actual-kernel-version-here/.config .
    make oldconfig
    make -j4 && make modules_install && make install
    reboot

    git checkout ebcef1b0e41f2ff972e5c5487a30e8f4ee2b6f13

    cp ../linux-your-actual-kernel-version-here/.config .
    make oldconfig
    make -j4 && make modules_install && make install
    reboot

The first checkout line will get the commit before our possible offending commit, whereas the second will get the possible offending commit; if my guess is right then the first checkout would result in a bootable system whereas the second will result in a non-bootable system.

If this is the case, good, we have found the offending commit. If not, we at least now know it's one of the commits touching that file more recent than the commit we have just deemed as good.
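
For the second case, the remaining candidates could be listed roughly like this. It is a sketch that assumes it is run inside the linux-stable clone from above; the helper name is made up, and the path drivers/platform/x86/sony-laptop.c in the usage line is the sony-laptop driver source in the kernel tree.

```shell
# List the commits that touched one file between a known-good and a known-bad ref.
# Must be run inside a kernel git clone.
# $1 = ref deemed good, $2 = ref known to be bad, $3 = path of the file
list_candidate_commits() {
    good_ref=$1
    bad_ref=$2
    path=$3
    git log --oneline "${good_ref}..${bad_ref}" -- "$path"
}
```

For instance, `list_candidate_commits ebcef1b0e41f2ff972e5c5487a30e8f4ee2b6f13 v3.5.6 drivers/platform/x86/sony-laptop.c` would show the later commits to that file, taking the v3.5.6 tag as the bad end of the range.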

If it's not in this untouched buffer function call (the code is still the same in 3.7 kernels since that commit), then the error must be within

> static int sony_nc_add(struct acpi_device *device)

The more specific we know where the error lies, the better the upstream bug report would be; hence if possible, please try the above.
Comment 13 Oliver Deppert 2012-12-02 10:14:16 UTC
Hi,

>if my guess is right then the first checkout would result in a bootable system whereas the second will result in a non-bootable system.

I've tried the method you mentioned...and your guess was right...with the first commit, the system boots fine....with the second one the kernel panic occurs...

So I think, we've found the reason for the kernel panic...what do you suggest to do next?

regards,
Oliver
Comment 14 Tom Wijsman (TomWij) (RETIRED) gentoo-dev 2012-12-02 13:25:25 UTC
In order for this to be patched, you can report this bug upstream at http://bugzilla.kernel.org and let them know you have found a regression. Carefully explain your problem and include the commit I mentioned in the previous comment; to prove you have found it, you could also attach the bisect log.

When you do this, please link this bug from the upstream bug and the upstream bug from this one, using the URL fields; please also put kernel@gentoo.org in the CC field so we get updates on this and can help along if needed.