456198 – sys-kernel/gentoo-sources-3.7.4 - /proc/vmcore missing?

Summary: sys-kernel/gentoo-sources-3.7.4 - /proc/vmcore missing?

Status:	RESOLVED UPSTREAM

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	[OLD] Core system
Hardware:	All Linux

Importance:	Normal normal
Assignee:	Gentoo Kernel Bug Wranglers and Kernel Maintainers

URL:
Whiteboard:	linux-bugzilla-pending
Keywords:	UPSTREAM

Depends on:
Blocks:

Reported:	2013-02-08 17:21 UTC by Jason Mours
Modified:	2013-10-14 15:52 UTC
CC List:	0 users

See Also:
Package list:
Runtime testing required:	---

Description Jason Mours 2013-02-08 17:21:35 UTC

Getting some (gentoo-sources-3.7.4) kernelcrash info you guys may want to take a look at. Astemic IRQ handles with emerge, rsync, wget or just portage... maybe more. BUT /proc/vmcore doesn't exist. I have the kernel options setup as per Gentoowiki. I am, however, running genkernel with an initramfs. Any insight? Am I looking in the right place? /proc/kcore has an ELF dump (I think). Fedora was missing their /proc/vmcore and has a patch; not sure how old it is, FC6 I believe.

Comment 1 Tom Wijsman (TomWij) (RETIRED) gentoo-dev

2013-02-08 23:32:12 UTC

> Getting some (gentoo-sources-3.7.4) kernelcrash info you guys may want to take a look at. Astemic IRQ handles with ...

Can you give us more details from these messages?

> BUT /proc/vmcore doesn't exist.

Are you sure CONFIG_PROC_VMCORE is enabled in your kernel configuration?
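As a rough aid for this check, here is a minimal Python sketch that scans a kernel config for the options /proc/vmcore needs. It assumes, from our reading of fs/proc/Kconfig, that CONFIG_PROC_VMCORE depends on CONFIG_PROC_FS and CONFIG_CRASH_DUMP. The helper names are hypothetical, and reading /proc/config.gz only works if the running kernel was built with CONFIG_IKCONFIG_PROC; otherwise pass in the text of the matching .config.

```python
import gzip
from pathlib import Path
from typing import Dict, List

# Options that, as far as we know, must all be "y" for /proc/vmcore support
# (PROC_VMCORE depends on PROC_FS and CRASH_DUMP in fs/proc/Kconfig).
REQUIRED = ("CONFIG_PROC_FS", "CONFIG_CRASH_DUMP", "CONFIG_PROC_VMCORE")


def parse_kconfig(text: str) -> Dict[str, str]:
    """Map CONFIG_* names to values; '# CONFIG_X is not set' maps to 'n'."""
    options: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# CONFIG_") and line.endswith(" is not set"):
            options[line[2:-len(" is not set")]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def missing_vmcore_options(text: str) -> List[str]:
    """Return the required options that are not built in (=y)."""
    options = parse_kconfig(text)
    return [name for name in REQUIRED if options.get(name) != "y"]


def read_running_config(path: str = "/proc/config.gz") -> str:
    """Return the running kernel's config (needs CONFIG_IKCONFIG_PROC)."""
    return gzip.decompress(Path(path).read_bytes()).decode()
```

An empty list only says the options are built in; /proc/vmcore is still expected only in a capture kernel.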

Please note that this only works in a capture kernel.

See https://www.kernel.org/doc/Documentation/kdump/kdump.txt for instructions.

> I have the kernel options setup as per Gentoowiki.

Which kernel options setup are you referring to? Can you link us to this article?

> Fedora was missing their /proc/vmcore and has a patch.

I can't find that; the only thing I see is that they enabled CONFIG_PROC_VMCORE in FC6, which is mentioned in https://www.redhat.com/archives/fedora-package-announce/2007-June/msg00544.html, and that's from 2007.

Comment 2 Jason Mours 2013-02-09 04:41:17 UTC

(In reply to comment #1)

Confirmed, I have CONFIG_PROC_VMCORE enabled.
Here is the wiki article I'm working from:

http://wiki.gentoo.org/wiki/Kernel_Crash_Dumps

I thought I understood the link you sent me. And yes, the Red Hat archive was what I had in mind. I also found something on LKML similar to my crash:

Something more recent:
https://lkml.org/lkml/2011/2/8/62
unannotated irqs-off

But it doesn't have anything about what exactly triggered it or /proc/vmcore.
I am working from an initramfs, and maybe my kdump.start isn't working. I'll keep tinkering.

But on another note, the crash occurs when emerge initiates a network connection: during --sync or when downloading new ebuilds for updates. Normal rebuilding doesn't trigger it. It seems to be IRQ-handling related in nature, and "unannotated irqs-off" pops up.

Thanks

Comment 3 Jason Mours 2013-02-09 05:59:37 UTC

I have verified that sysfs is enabled. I just tried enabling CONFIG_SYSFS_DEPRECATED & CONFIG_SYSFS_DEPRECATED_V2, no luck. /proc/vmcore is still missing.

Does it only show up when the kexec kernel loads, or should it be present during normal kernel operation? I'm new at this...
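One way to answer this from userspace is sketched below, under stated assumptions. Comment 1 notes that /proc/vmcore only works in a capture kernel, and as far as we know kexec-tools adds an elfcorehdr= argument to the command line when it loads a panic kernel with kexec -p. Checking /proc/cmdline for that argument therefore hints whether the running kernel is a capture kernel. The function names are hypothetical.

```python
from pathlib import Path


def is_capture_kernel(cmdline: str) -> bool:
    """Return True if a kernel command line carries an elfcorehdr= argument.

    Assumption: kexec-tools appends elfcorehdr=<address> when loading a panic
    kernel with 'kexec -p', and /proc/vmcore is only set up when the running
    kernel was booted with it.
    """
    return any(arg.startswith("elfcorehdr=") for arg in cmdline.split())


def running_in_capture_kernel(cmdline_path: str = "/proc/cmdline") -> bool:
    """Apply is_capture_kernel() to the running kernel's command line."""
    return is_capture_kernel(Path(cmdline_path).read_text())
```

In the normal (first) kernel this should return False, in which case a missing /proc/vmcore is expected rather than a bug.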

Comment 4 Jason Mours 2013-02-09 07:09:19 UTC

Working from your link, I have verified a working kexec script:

kexec -p /boot/kernel-genkernel-x86_64-3.7.4-gentoo \
--initrd=/boot/initramfs-genkernel-x86_64-3.7.4-gentoo \
--append="root=/dev/ram0 real_root=/dev/sda2 single irqpoll maxcpus=1 reset_devices"
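After running a command like the one above, a minimal sketch like the following can confirm that the panic kernel was actually loaded. It assumes the sysfs file /sys/kernel/kexec_crash_loaded, which to our knowledge reads 1 once kexec -p has succeeded and 0 otherwise. The helper names are hypothetical.

```python
from pathlib import Path

SYSFS_FLAG = "/sys/kernel/kexec_crash_loaded"


def flag_is_set(text: str) -> bool:
    """Interpret the contents of kexec_crash_loaded ("1" means loaded)."""
    return text.strip() == "1"


def crash_kernel_loaded(path: str = SYSFS_FLAG) -> bool:
    """Return True if a panic kernel is loaded; False if the file is absent.

    A missing file most likely means the running kernel lacks kexec support.
    """
    flag = Path(path)
    if not flag.exists():
        return False
    return flag_is_set(flag.read_text())
```

If this returns False right after the kexec -p call, no capture kernel would be booted on a panic, so /proc/vmcore could never appear.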

The parameter crashkernel=64M@16M doesn't work: I can only allocate the 64M, not reserve the address space at 16M. So I boot with crashkernel=64M and no fixed offset, and that appears to work.
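Whether the crashkernel= reservation actually took effect can be checked in /proc/iomem, where a successful reservation shows up as a "Crash kernel" range. The Python sketch below parses that text. It assumes the usual "start-end : name" line format with an inclusive end address; note that newer kernels may show zeroed addresses to non-root users. The function name is hypothetical.

```python
import re
from typing import Optional, Tuple

# Matches lines such as "  03000000-06ffffff : Crash kernel" in /proc/iomem.
CRASH_LINE = re.compile(r"^\s*([0-9a-fA-F]+)-([0-9a-fA-F]+)\s*:\s*Crash kernel\s*$")


def crash_reservation(iomem_text: str) -> Optional[Tuple[int, int]]:
    """Return (start, size) in bytes of the 'Crash kernel' range, or None."""
    for line in iomem_text.splitlines():
        match = CRASH_LINE.match(line)
        if match:
            start = int(match.group(1), 16)
            end = int(match.group(2), 16)  # /proc/iomem end addresses are inclusive
            return start, end - start + 1
    return None
```

A None result means no region was reserved, which would match the failing crashkernel=64M@16M case; with crashkernel=64M the size should come out as 64 MiB at whatever start address the kernel chose.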

I also have not disabled SMP as recommended, but I specified maxcpus=1 in the kexec script.

That's it, I believe. I do NOT mount /dev/sda1 at boot; I mount it manually at /mnt/boot, but genkernel places copies in /boot, so I use those as my panic kernel images. I don't think that's a problem. CONFIG_RELOCATABLE=y is set.

Hope this covers things on my end. /proc/vmcore is still missing. Let me know if there's anything else. Thanks for the help!

Comment 5 Tom Wijsman (TomWij) (RETIRED) gentoo-dev

2013-02-09 16:24:21 UTC

From that Kernel Crash Dump article we see that the reserved area must work in order for the crash mode kernel to be placed there:

> kexec runs the kernel in crash mode, relocated to a reserved area of memory.

I think this is a necessity for /proc/vmcore to appear.

I think you misunderstand the crashkernel parameter; here is its description:

> crashkernel=size[KMG][@offset[KMG]]
>     [KNL] Using kexec, Linux can switch to a 'crash kernel'
>     upon panic. This parameter reserves the physical
>     memory region [offset, offset + size] for that kernel
>     image. If '@offset' is omitted, then a suitable offset
>     is selected automatically. Check
>     Documentation/kdump/kdump.txt for further details.

From https://www.kernel.org/doc/Documentation/kernel-parameters.txt

The part before the @ determines the size, the part after where to place it.
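To make the size/offset split concrete, here is a minimal Python sketch that parses the simple form of the parameter quoted above. It only handles size[KMG][@offset[KMG]] with decimal or 0x-prefixed hex numbers (an assumption loosely modelled on the kernel's memparse(), not a port of it) and ignores any other crashkernel= syntaxes. The function name is hypothetical.

```python
import re
from typing import Optional, Tuple

SUFFIX = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}
NUM = r"(0[xX][0-9a-fA-F]+|\d+)([KMGkmg]?)"
CRASHKERNEL = re.compile(rf"{NUM}(?:@{NUM})?")


def to_bytes(number: str, suffix: str) -> int:
    """Convert a number with an optional K/M/G suffix into bytes."""
    base = 16 if number.lower().startswith("0x") else 10
    return int(number, base) * SUFFIX[suffix.upper()]


def parse_crashkernel(value: str) -> Tuple[int, Optional[int]]:
    """Parse 'size[KMG][@offset[KMG]]' into (size, offset) in bytes.

    offset is None when '@offset' is omitted (the kernel then picks one).
    """
    match = CRASHKERNEL.fullmatch(value)
    if not match:
        raise ValueError(f"unsupported crashkernel value: {value!r}")
    size = to_bytes(match.group(1), match.group(2))
    offset = to_bytes(match.group(3), match.group(4)) if match.group(3) else None
    return size, offset
```

With this, crashkernel=64M@16M describes 64 MiB starting at physical address 0x1000000 (16 MiB) and ending just below 0x5000000, while crashkernel=64M leaves the start address to the kernel.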

Since it doesn't work with @16M appended, I think you should let it automatically determine an offset by just using crashkernel=64M instead.

If it's not that, the only thing left is that your /proc/vmcore could be invalidated and removed for one reason or another. See http://lkml.indiana.edu/hypermail/linux/kernel/1111.0/02444.html for instance; another such occasion is http://www.mail-archive.com/kexec@lists.infradead.org/msg05604.html. It looks like these might have something to do with firmware.

Comment 6 Jason Mours 2013-02-09 19:16:36 UTC

(In reply to comment #5)

OK. So crashkernel=64M@16M allocates 64M of memory at offset 0x1000000 (16M). The offset 0x1000000 is specified in .config, but this is the stock value and is identical to what is specified in the link from kernel.org. I'm letting it allocate automatically anyway, but does the specified offset need to differ between the booting kernel and the capture kernel? Would that explain it? Either way, I thought an empty /proc/vmcore would be present in the booting kernel.
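For reference, 0x1000000 is 16 MiB (16 × 2^20 = 16,777,216 bytes), so the stock CONFIG_PHYSICAL_START value and the @16M offset name the same address. Whether the capture kernel's offset matters can be sketched as a rough check, assuming (per our reading of kdump.txt) that a relocatable kernel (CONFIG_RELOCATABLE=y, as in comment 4) may be loaded anywhere inside the reservation, whereas a non-relocatable one runs at its compiled-in CONFIG_PHYSICAL_START, which must then lie inside the reserved region. Alignment requirements are ignored, and the function name is hypothetical.

```python
def capture_kernel_fits(res_start: int, res_size: int,
                        physical_start: int, relocatable: bool) -> bool:
    """Rough check whether a capture kernel can run from the reservation.

    Assumption: a relocatable kernel can be placed anywhere inside the
    reserved region; a non-relocatable one runs at its compiled-in
    CONFIG_PHYSICAL_START, which must fall inside that region.
    """
    if relocatable:
        return True
    return res_start <= physical_start < res_start + res_size
```

Under these assumptions, with CONFIG_RELOCATABLE=y the offset does not need to match between the booting and the capture kernel, so a mismatch there would not by itself explain the missing /proc/vmcore.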

I'm not sure what firmware could be invalidating it. It's not an embedded system. The crash is network related, and I know the embedded Realtek 8168 has errata: the r8168 module from its ebuild panicked and generated crash info on shutdown even before I configured tracing. Perhaps... As a process of elimination, I have started using the in-kernel r8169 module successfully, without a panic on shutdown. The motherboard is flashed with the latest BIOS, and that's as close to the hardware as I can get. Guess I can start disabling the embedded components.

Comment 7 Tom Wijsman (TomWij) (RETIRED) gentoo-dev

2013-04-14 07:43:22 UTC

> Guess I can start disabling the embedded components.

Did you ever try this? Have you found the culprit?

You may also want to try git-sources-3.9_rc6; there have been some commits in this area. I'm not sure whether all of them are in the tree yet, but you may be able to apply some of them for further testing. See the following link for search results:

http://search.gmane.org/?query=vmcore+PATCH+-watchdog+-makedumpfile&author=&group=gmane.linux.kernel&sort=date&DEFAULTOP=and&xP=Zvmcore%09patch%09Zwatchdog&xFILTERS=Glinux.kernel---A

Comment 8 Jason Mours 2013-04-14 15:30:03 UTC

(In reply to comment #7)

Yeah, I had disabled audio, FireWire, USB, and the NIC, and even switched to an Intel PCIe e1000e. No luck: no /proc/vmcore, and portage and layman still throw panics and oopses. The oops is IRQ related, so if Gigabyte is doing something in the silicon that they don't want out, I'm not sure. I was able to pull a spinlock oops out by cooking python (culprit?), so maybe with enough tracing and I/O, GCC and GDB could figure out my issue... Having /proc/vmcore would be nice, but information on my kind of softirq oops looks vague on the internet... it's even described as dreaded.

I'll have some time later on to compile a git-sources kernel and will let you guys know. Thanks for the link!... ;) now about that dev-util/mutrace-0.2 bug ;)

Comment 9 Jason Mours 2013-04-15 00:05:58 UTC

No luck, vanilla git-sources:3.9.0-rc6 with kexec-tools-2.0.4-r1 still does not provide /proc/vmcore.

Diffing vmalloc.c and vmcore.c between 3.8.6 and 3.9.0-rc6 shows NUMA and some get-free-pages (GFP) commits, and printk headers are now included as well; I do use printk extensively throughout my kernel .config. But I'm not sure what's preventing vmcore from showing up in /proc... my machine is better at interpreting C than I am. But I stare! ;) Thanks again for Gmane. I'll post if any future commits conjure up vmcore.

Comment 10 Jason Mours 2013-06-01 16:54:29 UTC

Being bug day and all, I thought I'd send a *ping*: following git-sources, there is still no /proc/vmcore as of 3.10-rc3. Following Gmane, they were rewriting it for mmap() and ELF headers, and I think those patch sets made it into the tree. But I don't think the problem is in vmcore.c or vmalloc.c, but I'm sure, or could even explain. There was also a general post about /proc/vmcore missing, without much context.

After fixing the tracing errata (kintsukuroi!), my output on oops looks pretty interesting, but I can't get a proper dump for submission.

Comment 11 Jason Mours 2013-06-01 16:56:14 UTC

My head just isn't together this afternoon. I meant to say that I could NOT explain why I believe the problem lies outside of vmalloc.c and vmcore.c; it's just a hunch.

Comment 12 Jason Mours 2013-06-01 17:42:57 UTC

And my hunch is udev, possibly baselayout. I'm sitting at sys-fs/udev-204 and sys-apps/baselayout-2.2.

Comment 13 Tom Wijsman (TomWij) (RETIRED) gentoo-dev

2013-07-19 10:14:27 UTC

There are some new gentoo-sources and git-sources kernels, could you try again?

If not, I think it is best that you file this upstream at https://bugzilla.kernel.org and leave us a link behind to the upstream bug; alternatively, you could try the mailing list and CC kernel@g.o (expand domain).

Comment 14 Jason Mours 2013-07-19 16:38:15 UTC

(In reply to Tom Wijsman (TomWij) from comment #13)
> There are some new gentoo-sources and git-sources kernels, could you try
> again?
> 
> If not, I think it is best that you file this upstream at
> https://bugzilla.kernel.org and leave us a link behind to the upstream bug;
> alternatively, you could try the mailing list and CC kernel@g.o (expand
> domain).

No joy yet... I'm really in an inconsistent state, with my portage not resuming and ppl(lpsol) not working with glpk-4.50 yet. So when I get somewhere stable I'll file something upstream and post a bugzilla link for those interested to follow.

But to confirm my system is sitting at:

gentoo-sources:3.9.10 - no /proc/vmcore
&
git-sources:3.11.0-rc1 - no /proc/vmcore *cute windows tux*

... but the more I trace, the more my oops appears to be TCP related rather than hardware related, as SSHing into the box triggers an oops on the console... but I digress.

Comment 15 Tom Wijsman (TomWij) (RETIRED) gentoo-dev

2013-08-12 22:03:52 UTC

(In reply to Jason Mours from comment #14)
> No joy yet... I'm really at an inconsistent state with my portage not
> resuming and ppl(lpsol) not working with glpk-4.50 yet. So when I get
> somewhere stable I'll file something upsteam and post a bugzilla link for
> those interested to follow. 

Has there been progress on this state?

Are bugs filed for the problems you are experiencing with those packages?

Comment 16 Jason Mours 2013-08-13 03:29:52 UTC

No, I haven't moved forward with filing this upstream yet. I'll get to it... I was having some problems brewing my own icedtea and adding gcj to gcc... other issues with ppl, and I'm sitting tight with automake 1.13 and aclocal bugs... also my portage won't run a resume because my glpk-4.50 and ppl[lpsol] are out of line. So I'm stuck with certain packages inconsistent and failing. Also, it looks like some non-continuous 'c' is hitting testing with xz and ncurses... yes, I actually watch the compiles.

If anyone is curious why, why oh why... What I'm trying to accomplish is an attempt at machine learning: a *tickless*, netlink-tracing kernel and Graphite GCC working with a function-rich, I/O-tailored glibc, plus a global Python-centric -ggdb to boot. The Haskell hounds were helping me out with some GHC work, and even though I have yet to add MLton (it will be epic), my machine was looking rather suave. That is, until I saw the latest ncurses, xorg-server, and xz... but I digress.

SO... enough excuses as to why I'm procrastinating. Hey it's summer isn't it? & I'm not getting laid. Besides the fact that I don't code (well), since I'm coming across like a ricer, I must assure people my Gentoo is stock Portage and science. I'm simply working with build orders and slowly cooking in USE flags as I reference this upstream.

Comment 17 Tom Wijsman (TomWij) (RETIRED) gentoo-dev

2013-10-14 15:52:47 UTC

(In reply to Jason Mours from comment #16)
> Hey it's summer isn't it? & I'm not getting laid.

Sorry, summer is over; can you file this upstream and let us know the URL? :)

Also make sure you try the latest kernel again, to check that this has not been fixed in the meantime.