328411 – sys-kernel/gentoo-sources-2.6.34-r2 - Soft kernel lockups

Summary: sys-kernel/gentoo-sources-2.6.34-r2 - Soft kernel lockups

Status:	RESOLVED UPSTREAM

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	[OLD] Core system
Hardware:	AMD64 Linux

Importance:	High critical
Assignee:	Gentoo Kernel Bug Wranglers and Kernel Maintainers

Reported:	2010-07-15 15:31 UTC by Phillip Merensky
Modified:	2010-08-03 21:37 UTC
CC List:	0 users

Attachments
Stacktrace from dmesg about the kernel crash (dmesg_kernel_crash.txt, 10.45 KB, text/plain) 2010-07-15 15:33 UTC, Phillip Merensky
Kernel boot information (kernel_boot_information.txt, 57.63 KB, text/plain) 2010-07-15 15:36 UTC, Phillip Merensky
Emerge info (server now runs with kernel 2.6.31 again) (emerge_info.txt, 3.89 KB, text/plain) 2010-07-15 15:38 UTC, Phillip Merensky
Kernel Stack Trace of Kernel 2.6.34-r2 (2.6.34-r2-kernel_stack_trace.txt, 12.37 KB, text/plain) 2010-07-19 20:36 UTC, Phillip Merensky

Description Phillip Merensky 2010-07-15 15:31:43 UTC

After about 24 hours the new kernel caused a hard lockup with the attached stacktrace. The server works normally with gentoo-sources-2.6.31-r10.

I also had lockup issues (though NFS-related) with kernel 2.6.32.

Please let me know if you need more data to debug this.

Reproducible: Always

Actual Results:  
It seems that applications that run entirely from memory still work, but as soon as the hard disk has to be accessed the command simply stalls without output (e.g. dmesg works, restarting varnish does not).

The same goes for server applications such as varnish, apache, qmail, etc. They simply "hang".
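
Commands that stall only when they touch the disk are often sitting in uninterruptible sleep (process state `D`). The report does not confirm that this is the case here, so the following is only a minimal sketch under that assumption. It reads the state field from `/proc/<pid>/stat` and lists matching processes; the function names `parse_stat` and `blocked_processes` are illustrative, not part of any existing tool.

```python
#!/usr/bin/env python3
"""List processes in uninterruptible sleep (state D) by reading /proc."""
import os


def parse_stat(stat_line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the LAST ')' rather than on spaces.
    """
    head, _, tail = stat_line.rpartition(")")
    pid_str, _, comm = head.partition("(")
    state = tail.split()[0]
    return int(pid_str), comm, state


def blocked_processes(proc_root="/proc"):
    """Return [(pid, comm)] for every process currently in state D."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        stat_path = os.path.join(proc_root, entry, "stat")
        try:
            with open(stat_path, errors="replace") as fh:
                pid, comm, state = parse_stat(fh.read())
        except OSError:
            # The process exited between listdir() and open().
            continue
        if state == "D":
            blocked.append((pid, comm))
    return blocked


if __name__ == "__main__":
    for pid, comm in blocked_processes():
        print(f"{pid}\t{comm}")
```

Run from a shell that is still responsive, this would show which services are stuck waiting on I/O at the moment of the hang. It only reads from `/proc`, so it should not itself block on the affected file system.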

Comment 1 Phillip Merensky 2010-07-15 15:33:02 UTC

Created attachment 238905 [details]
Stacktrace from dmesg about the kernel crash

Comment 2 Phillip Merensky 2010-07-15 15:36:31 UTC

Created attachment 238907 [details]
Kernel boot information

Comment 3 Phillip Merensky 2010-07-15 15:38:33 UTC

Created attachment 238909 [details]
Emerge info (server now runs with kernel 2.6.31 again)

Comment 4 Phillip Merensky 2010-07-19 15:43:33 UTC

It actually seems to be a SOFT lockup, because I can still log in and do memory-related things. Sorry about that. I would be glad if somebody could change the summary. Thanks.

Comment 5 George Kadianakis (RETIRED) gentoo-dev 2010-07-19 17:06:48 UTC

Can you try upgrading to the latest gentoo-sources and see if this behavior continues?
Also, I'm not familiar with reiserfs (which could be connected to your problem judging from your stacktraces) but are the messages in your boot log normal?

Comment 6 Phillip Merensky 2010-07-19 17:24:50 UTC

The problem is that the computer with the issues is a frequently used server.
So testing is a bit dangerous. I will see what I can do.

Comment 7 George Kadianakis (RETIRED) gentoo-dev 2010-07-19 17:29:58 UTC

(In reply to comment #6)
> The problem is that the computer with the issues is a frequently used server.
> So testing is a bit dangerous. I will see what I can do.
> 

Unfortunately, I can't pinpoint the cause with any certainty (beyond it being triggered by I/O writes), and some searching on the net didn't turn up anything useful (just various Ubuntu bug reports blaming dm-crypt and the like).

If you want to avoid downtime, it would be better to post this on the official kernel bugzilla, in case someone more experienced can provide you with a solution.
(If you do this, don't forget to give us the link to the upstream bug report.)

Comment 8 Phillip Merensky 2010-07-19 20:36:24 UTC

Created attachment 239441 [details]
Kernel Stack Trace of Kernel 2.6.34-r2

After adding advanced monitoring I can confirm that Kernel 2.6.34-r2 also crashes with this stack trace.

Comment 9 Phillip Merensky 2010-07-19 20:48:28 UTC

(In reply to comment #5)
> Also, I'm not familiar with reiserfs (which could be connected to your problem
> judging from your stacktraces) but are the messages in your boot log normal?
> 
The reiserfs messages in the boot log stem from the last crash with the kernel in question. They are journal operations, logged because the file system sync could not be executed, i.e. this is completely normal.

Comment 10 Phillip Merensky 2010-07-20 13:22:46 UTC

If I should report this upstream, just let me know.

Comment 11 George Kadianakis (RETIRED) gentoo-dev 2010-07-26 16:12:25 UTC

(In reply to comment #10)
> If I should report this upstream, just let me know.
> 

Sorry for the slow reply.
I do indeed think that the best course of action would be to report this upstream.

Comment 12 Mike Pagano gentoo-dev 2010-07-29 23:10:51 UTC

Please submit this upstream and post the URL back here. We'll track the upstream bug and backport any patches as needed.

Comment 13 Phillip Merensky 2010-08-02 22:00:31 UTC

Sorry for the delay. I will report the bug upstream as soon as I have some spare time.

Comment 14 Phillip Merensky 2010-08-03 21:37:24 UTC

Strangely, kernel 2.6.31-r10 is now producing soft lockups as well, which normally should not happen. However, I have no stack trace of such a lockup yet.
It does not seem to be a hardware issue (hard disks etc. seem to be fine), and it only occurs while our backup is running or after it has run. The backup saves more than 10 GB to a NAS that is connected via NFS.
If the backup is disabled, no lockups occur.
I will look into this over the next days/weeks and report back on this bug.
With the new lockups occurring under kernel 2.6.31 too, I personally do not think the information is sufficient to report it upstream yet.