Bug 239463 - 2.6.25 kernel regression: endless loop lockup in clockevents
Summary: 2.6.25 kernel regression: endless loop lockup in clockevents
Status: RESOLVED FIXED
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: [OLD] Core system (show other bugs)
Hardware: x86 Linux
Importance: High critical
Assignee: Gentoo Kernel Bug Wranglers and Kernel Maintainers
URL:
Whiteboard: [linux-2.6.25-regression] [linux >= 2...
Keywords: Bug
Depends on:
Blocks:
 
Reported: 2008-10-03 18:28 UTC by Alex Efros
Modified: 2008-11-25 13:06 UTC (History)
4 users (show)

See Also:
Package list:
Runtime testing required: ---


Attachments
.config for 2.6.25-hardened-r7 (.config,49.37 KB, text/plain)
2008-10-03 18:31 UTC, Alex Efros
netconsole output to windows netcat (netcat.log,14.30 KB, text/plain)
2008-10-05 13:10 UTC, Alex Efros
last ssh commands I executed before final hang (putty.log,11.02 KB, text/plain)
2008-10-05 13:10 UTC, Alex Efros

Description Alex Efros 2008-10-03 18:28:24 UTC
After a kernel upgrade from hardened-sources-2.6.24-r3 to hardened-sources-2.6.25-r4, my workstation hung after about 3 days of uptime. I rebooted it, and it hung again after about a day.

Then I switched back to hardened-sources-2.6.24-r3, just to find out whether this was a hardware issue (in which case 2.6.24 would hang too). I ran that kernel for about 2 weeks with no hang.

Finally I tried upgrading to hardened-sources-2.6.25-r7. This kernel also hung after about a day.

In all cases the hang was total: the system was not accessible from the network, and there was no information in the logs or on the console (the usual X desktop was on the monitor).
During the last hang I noticed that the kernel replied to ping on the internal (ETH) interface of the local network, but not on the external (PPPoE) interface. SSH and HTTP did not respond even on the local network.

There were no records in klog/syslog related to these hangs.

Reproducible: Always

Steps to Reproduce:
1. boot hardened-sources-2.6.25-r[47]
2. wait several days
3.

Actual Results:  
kernel hangs

Expected Results:  
kernel shouldn't hang

Portage 2.1.4.4 (hardened/x86/2.6, gcc-3.4.6, glibc-2.6.1-r0, 2.6.24-hardened-r3 i686)
=================================================================
System uname: 2.6.24-hardened-r3 i686 Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
Timestamp of tree: Thu, 02 Oct 2008 15:05:01 +0000
app-shells/bash:     3.2_p33
dev-java/java-config: 1.3.7, 2.1.6
dev-lang/python:     2.5.2-r7
sys-apps/baselayout: 1.12.11.1
sys-apps/sandbox:    1.2.18.1-r2
sys-devel/autoconf:  2.13, 2.61-r2
sys-devel/automake:  1.5, 1.7.9-r1, 1.8.5-r3, 1.9.6-r2, 1.10.1-r1
sys-devel/binutils:  2.18-r3
sys-devel/gcc-config: 1.4.0-r4
sys-devel/libtool:   1.5.26
virtual/os-headers:  2.6.23-r3
ACCEPT_KEYWORDS="x86"
CBUILD="i686-pc-linux-gnu"
CFLAGS="-march=prescott -O2 -pipe"
CHOST="i686-pc-linux-gnu"
CONFIG_PROTECT="/etc /service /usr/kde/3.5/env /usr/kde/3.5/share/config /usr/kde/3.5/shutdown /usr/share/config /var/log /var/qmail/alias /var/qmail/control"
CONFIG_PROTECT_MASK="/etc/ca-certificates.conf /etc/env.d /etc/env.d/java/ /etc/fonts/fonts.conf /etc/gconf /etc/revdep-rebuild /etc/terminfo /etc/udev/rules.d"
CXXFLAGS="-march=prescott -O2 -pipe"
DISTDIR="/usr/portage-distfiles"
EMERGE_DEFAULT_OPTS="--with-bdeps=y"
FEATURES="distlocks metadata-transfer parallel-fetch sandbox sfperms strict unmerge-orphans userfetch userpriv usersandbox"
GENTOO_MIRRORS="http://ftp.lug.ro/gentoo/ http://mirror.qubenet.net/mirror/gentoo/"
LANG="ru_RU.KOI8-R"
LINGUAS="en ru"
MAKEOPTS="-j3"
PKGDIR="/usr/portage-packages"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --stats --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY="/usr/portage/local/layman/powerman /usr/local/portage"
SYNC="rsync://rsync.gentoo.org/gentoo-portage"
USE="X Xaw3d a52 aac acpi aim alsa apache2 arts asf avi bash-completion berkdb bitmap-fonts bzip2 cdr cracklib crypt cscope curl dbus dga divx4linux dlloader dri dts dvd dvdr dvdread encode ffmpeg flac flash gd gdbm gif gnutls gpgme gtk gtk2 hardened hddtemp icq idn imagemagick imap imlib irc jabber javascript jpeg kdeenablefinal lm_sensors lzo mad mailbox mbox midi mmx mng motif mp3 mpeg msn mysql ncurses nls nptl nptlonly ogg opengl oss pam pcre perl pic png pwdb qt quicktime rcc readline real rss rtc samba sdl slang spell sse sse2 ssl svg sysfs tcltk tcpd tiff truetype truetype-fonts type1-fonts urandom vim-pager vim-syntax vim-with-x vorbis win32codecs x86 xinetd xorg xv xvid yahoo zlib" ALSA_CARDS="ali5451 als4000 atiixp atiixp-modem bt87x ca0106 cmipci emu10k1 emu10k1x ens1370 ens1371 es1938 es1968 fm801 hda-intel intel8x0 intel8x0m maestro3 trident usb-audio via82xx via82xx-modem ymfpci" ALSA_PCM_PLUGINS="adpcm alaw asym copy dmix dshare dsnoop empty extplug file hooks iec958 ioplug ladspa lfloat linear meter mmap_emul mulaw multi null plug rate route share shm softvol" APACHE2_MODULES="log_config vhost_alias autoindex alias rewrite dir deflate filter mime negotiation auth_basic authn_file authz_host authz_user authz_groupfile cgi actions headers env setenvif" ELIBC="glibc" INPUT_DEVICES="keyboard mouse" KERNEL="linux" LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text" LINGUAS="en ru" LIRC_DEVICES="serial" USERLAND="GNU" VIDEO_CARDS="vesa fbdev nv"
Unset:  CPPFLAGS, CTARGET, FFLAGS, INSTALL_MASK, LC_ALL, LDFLAGS, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS
Comment 1 Alex Efros 2008-10-03 18:31:43 UTC
Created attachment 167104 [details]
.config for 2.6.25-hardened-r7

I used this config to compile the kernel which hangs.
Comment 2 Alex Efros 2008-10-03 18:35:47 UTC
Here is my `lspci` output:

00:00.0 Host bridge: Intel Corporation 82P965/G965 Memory Controller Hub (rev 02)
00:01.0 PCI bridge: Intel Corporation 82P965/G965 PCI Express Root Port (rev 02)
00:1a.0 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Contoller #4 (rev 02)
00:1a.1 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #5 (rev 02)
00:1a.7 USB Controller: Intel Corporation 82801H (ICH8 Family) USB2 EHCI Controller #2 (rev 02)
00:1c.0 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 1 (rev 02)
00:1c.4 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 5 (rev 02)
00:1c.5 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 6 (rev 02)
00:1d.0 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #1 (rev 02)
00:1d.1 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #2 (rev 02)
00:1d.2 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #3 (rev 02)
00:1d.7 USB Controller: Intel Corporation 82801H (ICH8 Family) USB2 EHCI Controller #1 (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev f2)
00:1f.0 ISA bridge: Intel Corporation 82801HB/HR (ICH8/R) LPC Interface Controller (rev 02)
00:1f.2 SATA controller: Intel Corporation 82801HR/HO/HH (ICH8R/DO/DH) 6 port SATA AHCI Controller (rev 02)
00:1f.3 SMBus: Intel Corporation 82801H (ICH8 Family) SMBus Controller (rev 02)
01:00.0 VGA compatible controller: nVidia Corporation G71 [GeForce 7950 GT] (rev a1)
02:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8056 PCI-E Gigabit Ethernet Controller (rev 12)
03:00.0 SATA controller: JMicron Technologies, Inc. JMicron 20360/20363 AHCI Controller (rev 02)
03:00.1 IDE interface: JMicron Technologies, Inc. JMicron 20360/20363 AHCI Controller (rev 02)
05:01.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+ (rev 10)
05:02.0 Multimedia audio controller: Creative Labs SB Audigy (rev 04)
05:02.1 Input device controller: Creative Labs SB Audigy Game Port (rev 04)
05:02.2 FireWire (IEEE 1394): Creative Labs SB Audigy FireWire Port (rev 04)
05:03.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A IEEE-1394a-2000 Controller (PHY/Link)
05:04.0 Ethernet controller: Marvell Technology Group Ltd. 88E8001 Gigabit Ethernet Controller (rev 14)
Comment 3 kfm 2008-10-04 02:36:39 UTC
Please enable CONFIG_DEBUG_KERNEL and CONFIG_DETECT_SOFTLOCKUP. Also, please monitor the affected host via a serial console and be sure to enable the CONFIG_EARLY_PRINTK to ensure that any relevant messages stand the best chance of being logged (due to a printk routine that is not interrupt dependent).

I think it may also be a good idea to ensure that the following options are enabled: CONFIG_FRAME_POINTER, CONFIG_PRINTK_TIME, CONFIG_MAGIC_SYSRQ, CONFIG_DEBUG_RT_MUTEXES, CONFIG_DEBUG_SPINLOCK, CONFIG_DEBUG_MUTEXES, CONFIG_DEBUG_LOCK_ALLOC, CONFIG_PROVE_LOCKING, CONFIG_DEBUG_INFO. However, in order to avoid unforeseen side effects, I don't recommend enabling all debugging options indiscriminately (I see that some others are already enabled in your .config - possibly upstream defaults).

Apparently, as you have an i82801 based chipset, it's also possible to use the Intel Watchdog Timer to aid in debugging in the event that this proves to be a hard lockup. I'm not sure how at the time of writing though.
Comment 4 Alex Efros 2008-10-04 05:12:33 UTC
(In reply to comment #3)
> Please enable CONFIG_DEBUG_KERNEL and CONFIG_DETECT_SOFTLOCKUP. Also, please
> monitor the affected host via a serial console and be sure to enable the
> CONFIG_EARLY_PRINTK to ensure that any relevant messages stand the best chance
> of being logged (due to a printk routine that is not interrupt dependent).

Ok. But this is my home workstation, and I don't have a serial console. I'll try to use CONFIG_NETCONSOLE instead.

> I think it may also be a good idea to ensure that the following options are
> enabled: CONFIG_FRAME_POINTER, CONFIG_PRINTK_TIME, CONFIG_MAGIC_SYSRQ,
> CONFIG_DEBUG_RT_MUTEXES, CONFIG_DEBUG_SPINLOCK, CONFIG_DEBUG_MUTEXES,
> CONFIG_DEBUG_LOCK_ALLOC, CONFIG_PROVE_LOCKING, CONFIG_DEBUG_INFO. However, in
> order to avoid unforeseen side effects, I don't recommend enabling all
> debugging options indiscriminately (I see that some others are already enabled
> in your .config - possibly upstream defaults).

I'm not a kernel hacker, so please decide yourself and I'll do what you recommend. Right now I have only CONFIG_FRAME_POINTER and CONFIG_MAGIC_SYSRQ enabled. The first can't be disabled (probably some other options depend on it) and the second I always use. I'll enable all the other options you mentioned above and nothing else.

> Apparently, as you have a i82801 based chipset, it's also possible to use the
> Intel Watchdog Timer to aid in debugging in the event that this proves to be
> hard lockup. I'm not sure how at the time of writing though.

Can't a watchdog only detect a lockup and reboot the system? How can this help in debugging?
Comment 5 Alex Efros 2008-10-04 06:18:25 UTC
On the first boot this kernel hung somewhere in the init scripts, near these commands:
    [ -f /var/run/random-seed ] && cat /var/run/random-seed >>/dev/urandom
    rm -f /var/run/random-seed
    (umask 077; dd if=/dev/urandom of=/var/run/random-seed count=1 2>/dev/null)
    if [ ! -f /etc/adjtime ]; then echo '0.0 0 0.0' > /etc/adjtime ; fi
    hwclock --adjust --localtime
    hwclock --hctosys --localtime
There was nothing in the logs (because the log services weren't started yet) and nothing on the console (probably because my init scripts had already executed `dmesg -n 1`, which I use to keep some junk off the console during boot).

The next time it booted OK, and it hasn't hung yet.

Sadly, I failed to configure netconsole. I use ADSL for my internet connection, and the ADSL modem works in bridge mode, so I have to run PPPoE on my system. I tried to use the "Dynamic reconfiguration" feature of netconsole and configure it through configfs (/sys/kernel/config/netconsole/) after pppd had started. But it refuses to work on the ppp interface, with the message "netconsole: ppp0 doesn't support polling, aborting". I then tried to use the eth interface of my local network for netconsole, with iptables DNAT to redirect netconsole packets from a fake IP in the local network to the real IP of a remote server on the internet, but for some reason that doesn't work (ping/telnet work OK, but not netconsole).

So chances are that the next hang will give us no additional information: the logs will be empty as usual. Any other ideas?
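
For reference, the configfs-based "Dynamic reconfiguration" mentioned above can be sketched roughly as follows. This is only an illustration under assumptions: configfs is taken to be mounted at /sys/kernel/config and the kernel to be built with CONFIG_NETCONSOLE_DYNAMIC; the helper name, target name, interface and address are hypothetical. As the error message above shows, the chosen interface must support polling, so ppp0 will not do.

```shell
#!/bin/sh
# Hypothetical helper: create and enable one netconsole target via configfs.
# $1 = netconsole configfs directory (normally /sys/kernel/config/netconsole)
# $2 = target name, $3 = local interface, $4 = remote IP, $5 = remote UDP port
netconsole_target() {
    dir="$1/$2"
    # Creating the directory is what creates the (still disabled) target.
    mkdir -p "$dir" || return 1
    echo "$3" > "$dir/dev_name"    || return 1
    echo "$4" > "$dir/remote_ip"   || return 1
    echo "$5" > "$dir/remote_port" || return 1
    # The target only starts sending once "enabled" is set to 1.
    echo 1 > "$dir/enabled"
}
```

A call might look like `netconsole_target /sys/kernel/config/netconsole target1 eth0 192.168.0.2 6666` (run as root), with a UDP listener on port 6666 on the receiving machine.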
Comment 6 Alex Efros 2008-10-05 13:08:41 UTC
It hung again. This time I used netcat on a Windows machine in the local network to capture the netconsole output.

But netcat showed nothing related to the hang: the latest kernel messages were the usual kern.debug records from the firewall.
SysRq did not work. It neither produced new kernel messages nor rebooted the system.
The only information I have this time is the "time" which the kernel now shows in the log, like this: "kern.debug: [  814.567163] IN=ppp1 ...". That time was equal to ~25 hours (after boot). 'conky' on my X screen showed the same information, "uptime 1d 1h", but it showed the current time as "10:05" (the real current time was "14:40"). It looks like the system hung at 10:05, while I was away from the computer, and from then on netcat received nothing from netconsole.

My girlfriend told me: "internet doesn't work" (of course, my system is the router and it hung!) "but samba works - I was able to upload an .avi to your system" AFTER THE HANG (???) "and putty works too" (???????). Wow!

So this time I was able to log in to the Linux system from that Windows machine over ssh and work for some time. But a few minutes later it finally hung and stopped responding to anything.

After logging in, I noticed strange behaviour of the current date/time: it was wrong and showed something around 10:06. But when I executed `date` several times, the time incremented by a few seconds, as expected.

The next strange thing I noticed was the ping behaviour: when pinging localhost or the ppp gateway I got a reply only to the first packet. While experimenting with this I also somehow got an error message like "send buffer overflow" or something similar, but I don't remember how I got it.

Next I used /proc/sysrq-trigger, and it worked. This way I sent a lot of debug information to the Windows netcat. I also noticed that this debug information was being stored in the logs on the HDD! I gathered a lot of additional information and saved it to files on the HDD.

But then it finally hung and I had to reboot it.

Sadly, after the reboot all changes made on the HDD after 10:05 were lost: all the debug information generated by /proc/sysrq-trigger and all the files I created. The .avi file uploaded by my girlfriend also disappeared. It was uploaded to a different HDD partition with a different filesystem (my root fs uses ext3, while this partition uses ext2). But while the files on the root partition disappeared completely, the .avi file is shown by ls -l in this way:
drwxrwxr-x 3 root     users       4096 Oct  5 05:41 .
drwxrwx--- 8 root     users      32768 Oct  5 08:39 ..
drwx------ 2 root     root       16384 Oct  2 03:50 lost+found
-????????? ? ?        ?              ?            ? film.avi
df shows the disk space used by this film, so I should probably run e2fsck to fix this partition. BTW, after uploading the file my girlfriend was able to open and watch this film from my system!

Because all the files with debug information were lost, the only information I have is screen dumps from the Windows netcat and putty, but they are incomplete because the screen buffer size on Windows was very limited. I'll attach these logs as separate files.

The interesting thing about netcat.log is the time. See, the line:
[96505.515383] SysRq : Show State
has the same time as the previous lines. But there was surely a delay of several seconds between my running `echo q > /proc/sysrq-trigger` and `echo t > /proc/sysrq-trigger`.

Also interesting are a few records from the firewall which were sent to the log after I had worked a little at the command prompt. Maybe there is some issue with interrupts, and because of this the time froze at 10:05 (but incremented by a few seconds when I ran `date`)?

The putty log contains just a few ping and strace ping runs, including the last strace ping, which resulted in the final system hang.
Comment 7 Alex Efros 2008-10-05 13:10:05 UTC
Created attachment 167299 [details]
netconsole output to windows netcat
Comment 8 Alex Efros 2008-10-05 13:10:36 UTC
Created attachment 167301 [details]
last ssh commands I executed before final hang
Comment 9 Alex Efros 2008-10-06 12:02:29 UTC
Today it hung again, after ~18 hours of uptime. I was again away from the computer at the time, and when I noticed (~4 hours later) neither SysRq nor network/ssh was available, so I was unable to get more debug information and had to press reset. As usual, the Windows netcat had no log records related to this issue.

I'm going to switch back to 2.6.24 for now.
Comment 10 kfm 2008-10-06 12:13:17 UTC
Alex, I think this will need to be represented upstream. Before we consider using that channel, please attempt to reproduce the problem in vanilla-sources-2.6.25.17.
Comment 11 Alex Efros 2008-10-06 15:53:43 UTC
OK, I've just booted vanilla 2.6.25.17. If it doesn't hang in the next 24 hours I'll wait 4-5 days more. If it still doesn't hang, then this bug is probably only in hardened-sources.
Comment 12 Gordon Malm (RETIRED) gentoo-dev 2008-10-07 23:26:02 UTC
A more appropriate test would be gentoo-sources-2.6.25-r9.
Comment 13 Alex Efros 2008-10-08 02:00:23 UTC
I will continue running vanilla for now (it hasn't hung yet), and will test gentoo-sources-2.6.25-r9 if vanilla doesn't hang in the next few days.
Comment 14 Alex Efros 2008-10-08 11:58:11 UTC
OK, after 1 day 18 hours sys-kernel/vanilla-sources-2.6.25.17 also hung in the same way. So this isn't a hardened-related bug, and I'll update the subject.
Comment 15 Duane Griffin 2008-10-08 13:16:30 UTC
It would be very helpful if you could enable the CONFIG_FRAME_POINTER and CONFIG_DEBUG_INFO options, then get the full output from "echo t > /proc/sysrq-trigger" once it has hung. If you don't mind it would also be a good idea to try the very latest vanilla kernel, 2.6.27-rc9 as I write.
Comment 16 Alex Efros 2008-10-08 13:44:57 UTC
CONFIG_FRAME_POINTER and CONFIG_DEBUG_INFO were enabled, but only once did I get a chance to do something like "echo t > /proc/sysrq-trigger" after a hang; all other times the system didn't respond to ping/ssh once it had hung. :(

Sadly, I'm unable to continue testing kernels for now. My system is offline for hours every day because of these hangs, some files were lost, running a non-hardened kernel isn't secure, etc. I've booted the latest stable 2.6.24-hardened-r3 and will stick with it until I receive a request to test a patch which is expected to fix this issue, or until a stable 2.6.26+ hardened kernel is released.
Comment 17 Duane Griffin 2008-10-08 14:55:19 UTC
Fair enough. If and when you manage to get a stack dump from a recent kernel showing where it is hanging, please add it and reopen the ticket.
Comment 18 Alex Efros 2008-10-08 15:10:29 UTC
I wonder, is it safe to do something like this on a running (not hung!) kernel:

 while sleep 5; do echo t > /proc/sysrq-trigger; done

If yes, then I may try it later. This should increase the chance of getting a stack dump and other information through netconsole after a hang.
Comment 19 Gordon Malm (RETIRED) gentoo-dev 2008-10-08 15:21:08 UTC
(In reply to comment #16)
> I've boot latests stable
> 2.6.24-hardened-r3 and will stick with it until I receive request to test some
> patch which expected to fix this issue or stable 2.6.26+ hardened kernel will
> be released.
> 

Unless you have a driver/app that is not yet compat with 2.6.26, I would recommend giving hardened-sources-2.6.26-r2 a try.  It's working well here.  The 2.6.24-series was removed due to numerous mainline vulns (and we don't have the team/time to backport everything).
Comment 20 Duane Griffin 2008-10-08 15:26:40 UTC
(In reply to comment #18)
> I wonder, is it safe to do something like this on running (not hang!) kernel:
> 
>  while sleep 5; do echo t > /proc/sysrq-trigger; done
> 
> If yes, then I may later try it. This way chance to get stack dump and other
> informaion after hang using netconsole should increase.

That should be perfectly safe, although of course it will produce a large volume of log messages, so you may want to check you won't run out of disk space.

Another thing you may want to try is leaving the machine at a console (not in X) and triggering the dump via the keyboard when it hangs.
Comment 21 Alex Efros 2008-10-08 20:22:32 UTC
OK, I've just booted hardened-sources-2.6.26-r2.

The only driver that failed was vmware: vmware-modules failed to compile (I haven't tried the ~x86 versions of vmware-workstation and vmware-modules yet).

I'll run that while/sysrq-trigger loop now, but probably with a 30-60 second interval instead of 5. Any recommendations on which sysrq commands I should execute in this loop, in addition to "t"?
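
The while/sysrq-trigger loop discussed above could be written with a configurable interval and command list, roughly like this. It is only a sketch: the function name and its parameters are hypothetical, and the "t" and "q" commands are simply the two already used earlier in this report, not a recommendation of which ones are most useful.

```shell
#!/bin/sh
# Hypothetical helper: write each given SysRq command to the trigger file,
# then sleep, repeating COUNT times.
# $1 = trigger file (normally /proc/sysrq-trigger)
# $2 = seconds to sleep between rounds, $3 = number of rounds
# remaining arguments = SysRq command letters, e.g. t q
periodic_sysrq() {
    trigger=$1
    interval=$2
    count=$3
    shift 3
    i=0
    while [ "$i" -lt "$count" ]; do
        for cmd in "$@"; do
            echo "$cmd" > "$trigger" || return 1
        done
        i=$((i + 1))
        sleep "$interval"
    done
}
```

For example, `periodic_sysrq /proc/sysrq-trigger 60 1440 t q` (as root) would dump state once a minute for about a day. Each round can produce a large volume of log output, so disk space is worth watching.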
Comment 22 Alex Efros 2008-10-08 21:17:18 UTC
I'm afraid that sysrq loop is pointless when used with netconsole. I've just tested it: the amount of data produced by a single sysrq 't' command is huge (more than 100KB), and it looks like MANY UDP packets are LOST, either not sent by the Linux kernel or dropped by the Windows kernel because of input buffer overflow. Fact: the Windows netcat receives only part of the data!

Example:
- first sysrq 't' resulted in 125KB in the Linux log and 44KB in the Windows netcat
- second sysrq 't' resulted in 240KB in the Linux log and 52KB in the Windows netcat

P.S. Moreover, the data sent to Windows and the data sent to klog are 80% different. But this may be normal, if this operation isn't atomic.
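
To put the two measurements above in perspective, the fraction of the logged data that actually reached netcat can be computed with a small helper. The function name is hypothetical; the figures are the ones quoted in this comment.

```shell
#!/bin/sh
# Hypothetical helper: percentage of the logged volume that was received.
# $1 = size received by netcat, $2 = size written to the local log (same unit)
received_pct() {
    awk -v r="$1" -v s="$2" 'BEGIN { printf "%.0f%%\n", 100 * r / s }'
}

received_pct 44 125    # first sysrq 't':  prints 35%
received_pct 52 240    # second sysrq 't': prints 22%
```

So roughly two thirds to three quarters of the output never arrived in these two runs.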
Comment 23 Alex Efros 2008-10-11 16:33:36 UTC
hardened-sources-2.6.26-r2 hangs. I was working on the system when it hung, so I was able to immediately check SysRq (it didn't work; the keyboard was dead as usual, even NumLock didn't work) and ping/ssh from another computer in the local network (no response). The netconsole logs were empty, as usual.

Actually, the only interesting thing was the next hang, which happened while rebooting the system. A hang during boot had already happened once before, and it happened at exactly the same place, while executing these commands:
    if [ ! -f /etc/adjtime ]; then echo '0.0 0 0.0' > /etc/adjtime ; fi
    hwclock --adjust --localtime
    hwclock --hctosys --localtime
On the next try the system booted OK.

BTW, I have a cron script which is executed from /etc/cron.hourly/ and does this:
    ntpdate -b 80.67.179.2 129.240.64.3 134.214.100.6 130.88.203.12 128.118.25.3 130.126.24.24 128.59.64.60 193.2.4.6
    hwclock --systohc

According to the logs, the last records were added at 2008-10-11_15:59:06.20984, so it's possible that the hang happened at 2008-10-11_16:00, when the script in /etc/cron.hourly/ which runs hwclock was executed.

Are there any kernel settings which may affect hwclock and which I can switch on/off to see whether that helps?
Comment 24 Alex Efros 2008-10-11 22:38:57 UTC
OK, I managed to make it hang under my control! I was right: it's hwclock.

I booted 2.6.26-hardened-r2 with the parameter init=/bin/bash and then ran this script:

#!/bin/bash
mount -n -t ramfs none /dev
mknod -m 660 /dev/console c 5 1
mknod -m 660 /dev/null c 1 3
mknod -m 644 /dev/rtc c 254 0
for i in $(seq 1 100); do
    echo $i start
    hwclock --hctosys --localtime
    echo $i end
done

... and it always hangs, usually after 5-15 hwclock executions.
If I don't create /dev/rtc, it does NOT hang.

I checked /dev/rtc under 2.6.24-hardened-r3 (which never hangs), and found that it has different major/minor numbers:

crw-r--r-- 1 root root 10, 135 Oct 12  2008 /dev/rtc
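
The same major/minor pair can be read without parsing ls output. A minimal sketch, assuming a `stat` that supports the %t and %T format specifiers (GNU coreutils does; they print the device major and minor in hex); the function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print "major, minor" of a device node in decimal.
dev_numbers() {
    major=$(stat -c '%t' "$1") || return 1   # hex major number
    minor=$(stat -c '%T' "$1") || return 1   # hex minor number
    printf '%d, %d\n' "0x$major" "0x$minor"
}
```

Running `dev_numbers /dev/rtc` on each kernel would show whether the node matches the 254, 0 pair hard-coded in the script's mknod call or the 10, 135 pair listed above.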

I remember that I changed something in the RTC configuration when upgrading from 2.6.24 to 2.6.25... Here is the RTC-related configuration from both .config files:

home /usr/src # grep -i RTC linux-2.6.24-hardened-r3/.config
CONFIG_HPET_EMULATE_RTC=y
CONFIG_RTC=y
CONFIG_SND_RTCTIMER=y
CONFIG_SND_SEQ_RTCTIMER_DEFAULT=y
# CONFIG_RTC_CLASS is not set

home /usr/src # grep -i RTC linux-2.6.26-hardened-r2/.config
CONFIG_HPET_EMULATE_RTC=y
CONFIG_RTC_LIB=y
CONFIG_RTC_CLASS=y
CONFIG_RTC_HCTOSYS=y
CONFIG_RTC_HCTOSYS_DEVICE="rtc0"
# CONFIG_RTC_DEBUG is not set
# RTC interfaces
CONFIG_RTC_INTF_SYSFS=y
CONFIG_RTC_INTF_PROC=y
CONFIG_RTC_INTF_DEV=y
# CONFIG_RTC_INTF_DEV_UIE_EMUL is not set
# CONFIG_RTC_DRV_TEST is not set
# I2C RTC drivers
# CONFIG_RTC_DRV_DS1307 is not set
# CONFIG_RTC_DRV_DS1374 is not set
# CONFIG_RTC_DRV_DS1672 is not set
# CONFIG_RTC_DRV_MAX6900 is not set
# CONFIG_RTC_DRV_RS5C372 is not set
# CONFIG_RTC_DRV_ISL1208 is not set
# CONFIG_RTC_DRV_X1205 is not set
# CONFIG_RTC_DRV_PCF8563 is not set
# CONFIG_RTC_DRV_PCF8583 is not set
# CONFIG_RTC_DRV_M41T80 is not set
# CONFIG_RTC_DRV_S35390A is not set
# CONFIG_RTC_DRV_FM3130 is not set
# SPI RTC drivers
# Platform RTC drivers
CONFIG_RTC_DRV_CMOS=y
# CONFIG_RTC_DRV_DS1511 is not set
# CONFIG_RTC_DRV_DS1553 is not set
# CONFIG_RTC_DRV_DS1742 is not set
# CONFIG_RTC_DRV_STK17TA8 is not set
# CONFIG_RTC_DRV_M48T86 is not set
# CONFIG_RTC_DRV_M48T59 is not set
# CONFIG_RTC_DRV_V3020 is not set
# on-CPU RTC drivers
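
The two grep listings above are easier to compare when only the options that are actually set are kept and the results are diffed. A rough sketch, with hypothetical function names and assuming `mktemp` and `diff` are available:

```shell
#!/bin/sh
# Hypothetical helpers: compare the enabled RTC-related options of two .config files.
rtc_options() {
    # Keep only options that are set (lines starting with CONFIG_), sorted.
    grep '^CONFIG_.*RTC' "$1" | sort
}

rtc_config_diff() {
    old_list=$(mktemp) || return 1
    new_list=$(mktemp) || return 1
    rtc_options "$1" > "$old_list"
    rtc_options "$2" > "$new_list"
    # Lines marked "<" are set only in the first file, ">" only in the second.
    diff "$old_list" "$new_list"
    status=$?
    rm -f "$old_list" "$new_list"
    return "$status"
}
```

With the paths used above this would be `rtc_config_diff linux-2.6.24-hardened-r3/.config linux-2.6.26-hardened-r2/.config`, run from /usr/src. The exit status is that of diff, so it is 1 when the two sets differ.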
Comment 25 Alex Efros 2008-10-12 01:37:38 UTC
OK, I now have stable 2.6.25 and 2.6.26 hardened kernels. Here is the full story.

In 2.6.24-hardened-r3 I used this (STABLE) configuration:

    Character devices  --->
      <*> Enhanced Real Time Clock Support
    < > Real Time Clock  --->

The same configuration in 2.6.25 and 2.6.26 is NOT STABLE (running hwclock in a loop is an easy way to hang these kernels in ~20 seconds)!

I tried another configuration in 2.6.25 and 2.6.26, but it's NOT STABLE either:

    Character devices  --->
      <*> Enhanced Real Time Clock Support
    <*> Real Time Clock  --->

Finally, I got STABLE 2.6.25 and 2.6.26 using this configuration:

    Character devices  --->
      < > Enhanced Real Time Clock Support
      <*> Generic /dev/rtc emulation
    < > Real Time Clock  --->
Comment 26 kfm 2008-10-12 04:04:20 UTC
Please try 2.6.25.18 and/or 2.6.26.6 with CONFIG_RTC enabled once more. They both contain many fixes from Thomas Gleixner, all of which concern the clock:

Thomas Gleixner (9):
      clockevents: prevent endless loop in periodic broadcast handler
      clockevents: enforce reprogram in oneshot setup
      clockevents: prevent multiple init/shutdown
      clockevents: prevent endless loop lockup
      HPET: make minimum reprogramming delta useful
      clockevents: broadcast fixup possible waiters
      x86: HPET fix moronic 32/64bit thinko
      x86: HPET: read back compare register before reading counter
      clockevents: remove WARN_ON which was used to gather information
Comment 27 Alex Efros 2008-10-12 14:55:58 UTC
I've tried these two configurations:

    Character devices  --->
      <*> Enhanced Real Time Clock Support
    < > Real Time Clock  --->

    Character devices  --->
      < > Enhanced Real Time Clock Support
    <*> Real Time Clock  --->

Both HANG on vanilla-sources-2.6.25.17 and both are STABLE on vanilla-sources-2.6.25.18, vanilla-sources-2.6.26.6 and vanilla-sources-2.6.27.

Please let me know when this issue is fixed in hardened-sources.
Comment 28 Gordon Malm (RETIRED) gentoo-dev 2008-10-12 17:59:52 UTC
Glad to see upstream fixed the problem with the latest -stable patch.  Thanks for sticking it out and working to figure out the source of the problem.

Thanks go to Kerin also for his fine handling of this bug.  Nice work as per usual for him.

sys-kernel/hardened-sources-2.6.25-r8 was added with 2.6.25.18 the same day and is scheduled to be marked stable later today or tomorrow. :)

kernel@ owns the bug, so I'll leave it to them to resolve as desired.
Comment 29 Daniel Drake (RETIRED) gentoo-dev 2008-11-25 13:06:44 UTC
This is fixed in current stable gentoo-sources, thanks!