Bug 117045 - emerge hangs systems when building some apps
Summary: emerge hangs systems when building some apps
Status: RESOLVED NEEDINFO
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: [OLD] Core system
Hardware: AMD64 Linux
Importance: High normal
Assignee: Gentoo Kernel Bug Wranglers and Kernel Maintainers
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-12-28 19:47 UTC by Hal Engel
Modified: 2006-05-08 08:50 UTC
CC List: 1 user

See Also:
Package list:
Runtime testing required: ---


Attachments
piped console output from emerge k3b (k3b-emerge.txt,416.03 KB, text/plain)
2005-12-29 18:57 UTC, Hal Engel
Description Hal Engel 2005-12-28 19:47:12 UTC
When emerging some apps my machine will hang (a hard freeze where I have to hit the reset button).  It always happens when emerging the same apps and in the same place during the emerge.   Failing apps include:

sys-devel/gcc-3.4.4-r1
dev-util/kdevelop-3.2.1-r1
app-cdr/k3b-0.12.8

In every case the emerge completes the compile and link steps and then fails while stripping the files or copying them to /usr.

I have tried running my system with the memory slowed way down (100 MHz) to make sure that memory was not an issue.  It failed in exactly the same way.  memtest86 also shows no errors with the memory at 200 MHz.

I have also run emerge -e world and it emerged everything else without a problem (600+ apps).  But I had to watch the emerge process and ^c every app I knew would fail and then emerge --resume --skipfirst.  I have no idea why only a small handful of ebuilds fail but they are all failing in exactly the same way.

The first app to do this was gcc-3.4.4-r1, at which point gcc was borked.  I opened a bug report on that problem and received assistance getting gcc 3.4.3-r1 working.  But since then I have had the same problem with other apps like those listed above.  Always the same small list of apps.  Always failing in exactly the same way.

Could there be something wrong with my tool chain?
 
$ emerge --info
Portage 2.0.53 (default-linux/amd64/2005.0, gcc-3.4.3, glibc-2.3.5-r2, 2.6.14-gentoo-r5 x86_64)
=================================================================
System uname: 2.6.14-gentoo-r5 x86_64 AMD Athlon(tm) 64 X2 Dual Core Processor 4800+
Gentoo Base System version 1.6.13
distcc 2.18.3 x86_64-pc-linux-gnu (protocols 1 and 2) (default port 3632) [disabled]
ccache version 2.3 [disabled]
dev-lang/python:     2.3.5-r2, 2.4.2
sys-apps/sandbox:    1.2.12
sys-devel/autoconf:  2.13, 2.59-r6
sys-devel/automake:  1.4_p6, 1.5, 1.6.3, 1.7.9-r1, 1.8.5-r3, 1.9.6-r1
sys-devel/binutils:  2.16.1
sys-devel/libtool:   1.5.20
virtual/os-headers:  2.6.11-r2
ACCEPT_KEYWORDS="amd64"
AUTOCLEAN="yes"
CBUILD="x86_64-pc-linux-gnu"
CFLAGS="-march=k8 -O2 -msse3 -pipe"
CHOST="x86_64-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/kde/2/share/config /usr/kde/3.4/env /usr/kde/3.4/share/config /usr/kde/3.4/shutdown /usr/kde/3/share/config /usr/lib/X11/xkb /usr/share/config /usr/share/texmf/dvipdfm/config/ /usr/share/texmf/dvips/config/ /usr/share/texmf/tex/generic/config/ /usr/share/texmf/tex/platex/config/ /usr/share/texmf/xdvi/ /var/qmail/control"
CONFIG_PROTECT_MASK="/etc/gconf /etc/terminfo /etc/env.d"
CXXFLAGS="-march=k8 -O2 -pipe"
DISTDIR="/usr/portage/distfiles"
FEATURES="autoconfig distlocks sandbox sfperms strict"
GENTOO_MIRRORS="http://gentoo.llarian.net/ http://gentoo.osuosl.org/ http://gentoo.ccccom.com http://gentoo.mirrors.tds.net/gentoo http://mirror.datapipe.net/gentoo"
MAKEOPTS="-j2"
PKGDIR="/usr/portage/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY="/usr/local/portage"
SYNC="rsync://rsync.gentoo.org/gentoo-portage"
USE="amd64 X alsa apache2 arts audiofile avi berkdb bitmap-fonts bzip2 cdr crypt cups curl dbus doc dvr eds emboss encode esd exif expat fam ffmpeg flac foomaticdb fortran gd gdbm gif gimpprint glut gnome gpm gstreamer gtk gtk2 guile hal idn imagemagick imlib ipv6 jack java jpeg junit kde lcms libwww lzw lzw-tiff mad mng motif mp3 mpeg nas ncurses nls nptl nvidia ogg openal opengl pam pcre pdflib perl pic png ppds python qt quicktime readline scanner sdl spell ssl tcltk tcpd tetex threads tiff truetype truetype-fonts type1-fonts udev usb userlocales vorbis wmf wxwindows xine xinerama xml xml2 xpm xv xvid zlib userland_GNU kernel_linux elibc_glibc"
Unset:  ASFLAGS, CTARGET, LANG, LC_ALL, LDFLAGS, LINGUAS
Comment 1 Jakub Moc (RETIRED) gentoo-dev 2005-12-28 20:18:30 UTC

*** This bug has been marked as a duplicate of 20600 ***
Comment 2 Hal Engel 2005-12-28 22:19:40 UTC
With all due respect I don't think this is a dup.  I can see that there are some things that might be common with the metabug but there are also significant differences.  And the number of these differences is far greater than the things that are similar. 

1. I am not seeing ANY segmentation faults while running my emerges or anything else on this machine.  I just ran an emerge -e world (twice) yesterday without a single segmentation fault (599 packages each).  

2. Hardware problems will always exhibit some level of randomness, and in my case there is absolutely no randomness to when and where the failure occurs. It always occurs when emerging the same applications, at the exact same location in the emerge process, and on only a small handful of the applications on the system (3 out of 602).  For example I can emerge gcc 3.4.3-r1 but 3.4.4-r1 will fail at the point when it is being installed (compilation and linking are complete).

3. I have run my hardware at lower speeds as a test and it had no effect on the behavior.   I also had a 3500 Winchester and half as much memory (also different sims) in this same machine and it also had the same exact problem (much lower power consumption than the current X2 4800+).  I should add that the 3500 Winchester machine also had a different motherboard with a different motherboard chipset (Via vs. nForce4).   I also set MAKEOPTS="-j1" to reduce the load on the processor which results in a cooler processor and lower power consumption and it still had exactly the same behavior.

4. In addition the emerges are ALWAYS completing the compilation and link steps and are failing during the installation step, at a point when CPU utilization is low.  Because of where in the process this is occurring, there is a remote possibility that it is a disk IO problem.  But why would this IO problem only occur when emerging specific applications and not on any of the other 598 apps (I have recently emerged everything on this machine more than once), and why would it only occur at exactly the same location in the process?

The metabug talks only about compilation failures and does not say anything about failures during the installation part of the emerge process.  At this point I am confused about why my system not failing during compilation makes this a compilation bug and therefore a duplicate of the metabug. 
Comment 3 Hal Engel 2005-12-29 18:57:35 UTC
Created attachment 75763
piped console output from emerge k3b

This is the output from emerge k3b.  The emerge fails but only after completing the compile and link steps without any errors.  It fails after the ebuild install step.  I pulled up a console after rebooting and ran emerge /usr/portage/k3b/k3b-0.12.8.ebuild qmerge and the machine locked up before completing the qmerge.  The output to the console is exactly what I have been seeing with the 3 packages that this happens with.  To clarify something from the original bug report, this is only happening on the three packages listed in the bug report.  The other 599 packages on the system will emerge without problems.

I am not getting segmentation faults or internal errors from the compiler and never have on this system.  It is clear that this is not a duplicate of the compiler metabug.  I looked through the other duplicates for the compiler metabug and almost all of them were segmentation faults or internal errors in gcc.  The others did not have enough information to rule out the possibility that it could still be the compiler metabug.   This bug report does now in fact have enough information to rule out that possibility.
Comment 4 Hal Engel 2005-12-29 18:59:47 UTC
Edit the above:

I should have written

ebuild /usr/portage/k3b/k3b-0.12.8.ebuild qmerge

not 

emerge /usr/portage/k3b/k3b-0.12.8.ebuild qmerge
Comment 5 Jason Stubbs (RETIRED) gentoo-dev 2005-12-30 06:47:58 UTC
There's really not anything that portage can do. System lock-ups are either kernel or hardware. The only possibility for user space to lock up the system is by triggering an issue with one of those two.

Try compiling the kernel with sysrq support and run emerge from a console to see if you can get a trace and if so where the kernel is dead-locking. If that doesn't work, try running emerge through strace a couple of times and see if the final output is the same. You'll need to send it to the screen and write it down unless you have a serial console.
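Before relying on the SysRq keys, it may be worth confirming that SysRq is actually enabled at runtime, since a kernel built with support can still have it switched off. A minimal sketch, assuming the usual /proc/sys/kernel/sysrq control file; the function name sysrq_state is made up for illustration, and treating values above 1 as a bitmask is an assumption that holds on newer kernels:

```shell
# Report whether magic SysRq is enabled.
# The optional argument is the control file to read; it defaults to the
# standard procfs location (the override exists only to make testing easy).
sysrq_state() {
    file=${1:-/proc/sys/kernel/sysrq}
    if [ ! -r "$file" ]; then
        # No control file: the kernel was probably built without SysRq support.
        echo "unavailable"
        return 1
    fi
    value=$(cat "$file")
    case $value in
        0) echo "disabled" ;;
        1) echo "enabled" ;;
        *) echo "partially enabled (bitmask $value)" ;;
    esac
}

# Usage: sysrq_state
# To enable everything (as root): echo 1 > /proc/sys/kernel/sysrq
```

If this reports "enabled" and the keys still do nothing during the hang, that points toward the kernel no longer servicing keyboard interrupts at all.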
Comment 6 solar (RETIRED) gentoo-dev 2005-12-30 07:22:05 UTC
> try running emerge through strace a couple of times and see if
> the final output is the same. You'll need to send it to the screen and write it
> down unless you have a serial console.

Something like this should do the trick.

localhost:
strace -v -f -o /dev/stdout emerge foo | /bin/busybox nc remotehost port 

remotehost:
/bin/busybox nc -l -p port > outfile
Comment 7 Hal Engel 2005-12-30 11:52:33 UTC
I built the kernel with sysrq enabled, restarted my system and ran from a console to keep things as simple as possible (no X).  When ebuild /usr/portage/app-cdr/k3b/k3b-0.12.8.ebuild qmerge locks up and I try to use Alt/sysrq/<whatever key> nothing happens.  I used showkey -s to confirm that my keyboard is sending the correct scan code for Alt/sysrq and it is (0x54).  I also did Alt/sysrq/s and Alt/sysrq/u to see if this would sync and unmount my disk drives.  On reboot my file systems were not clean, so it appears that the kernel is locking up very hard when this happens and not accepting any (keyboard) input at all.

Unfortunately I don't have a machine that I can use as a remote console.  I did try:
 
strace -v -f -o /var/log/strace.txt ebuild /usr/portage/app-cdr/k3b/k3b-0.12.8.ebuild qmerge 

hoping that strace would create a file that I could look at after a reboot.  Of course since Alt/sysrq/s was not working I was fairly sure that this wouldn't work and it didn't.
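One likely reason the file was empty is that the last writes were still sitting in the page cache when the machine froze. A rough workaround, sketched here under the assumption that the disk and kernel stay alive until just before the hang, is to pipe the trace through a small filter that calls sync after every line. The helper name log_synced and the log path are hypothetical, and this slows the run down considerably:

```shell
# Append stdin to a log file one line at a time, forcing a sync after each
# line so that as much of the tail as possible is on disk when the box dies.
log_synced() {
    logfile=$1
    while IFS= read -r line; do
        printf '%s\n' "$line" >> "$logfile"
        sync
    done
}

# Usage (strace writes its trace to stdout, the filter persists it):
#   strace -v -f -o /dev/stdout \
#       ebuild /usr/portage/app-cdr/k3b/k3b-0.12.8.ebuild qmerge \
#       | log_synced /var/log/strace-synced.txt
```

There is no guarantee the final line makes it to disk, but running it a couple of times and comparing the last few lines should show whether the hang follows the same system call.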

You are probably right that this is a kernel problem of some sort.  I wish there were some way to get additional information about what was happening so that I could pass this on to someone who works on the kernel so that it might eventually get fixed.  

At the moment I have the three offending apps masked and two of them are not really important.  But the third app is gcc and I am currently one version behind the stable release.  This is my biggest concern since gcc is a very important part of my tool chain.  With those three apps masked I can run emerge -e world and it will emerge everything on my system (about 600 packages) without any problems.  Other than this one problem my system at least appears to be totally stable so this is very perplexing.
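For anyone wanting to reproduce the masking, something along these lines should work. It is only a sketch: mask_atoms is a made-up helper, and it assumes /etc/portage/package.mask is a plain file (on some setups it is a directory instead). The atoms are the three versions from the original report.

```shell
# Append each package atom to a mask file unless that exact line is
# already present, so the helper can be re-run safely.
mask_atoms() {
    maskfile=$1
    shift
    for atom in "$@"; do
        grep -qxF -- "$atom" "$maskfile" 2>/dev/null \
            || printf '%s\n' "$atom" >> "$maskfile"
    done
}

# Usage (as root):
#   mask_atoms /etc/portage/package.mask \
#       '=sys-devel/gcc-3.4.4-r1' \
#       '=dev-util/kdevelop-3.2.1-r1' \
#       '=app-cdr/k3b-0.12.8'
```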

Is there anything else that I can try to get more information that might help locate the root cause of this problem?  I will keep digging until I either run out of things to try or I find something.
Comment 8 Alec Warner archtester Gentoo Infrastructure gentoo-dev Security 2006-01-03 21:02:42 UTC
Throwing to wranglers.  If you can pinpoint the problem to something in portage please re-assign to dev-portage.
Comment 9 Daniel Drake (RETIRED) gentoo-dev 2006-01-10 08:34:14 UTC
Please reproduce this on the latest kernel, currently gentoo-sources-2.6.15. Please turn on Soft lockup detection (found under kernel hacking) and see if that reveals anything. You could also investigate the NMI watchdog (see /usr/src/linux/Documentation/nmi_watchdog.txt)
Comment 10 Hal Engel 2006-01-10 13:53:20 UTC
With my setup the nmi_watchdog is on by default.  I checked /usr/src/linux/Documentation/nmi_watchdog.txt and this document says to look for a count for nmi interrupts in /proc/interrupts and I am seeing non-zero values.   So my lockups are very hard as I am not seeing any oops messages after the lockups.  
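In case it helps anyone checking the same thing, the NMI counts can be pulled out of /proc/interrupts with a short helper. This is a sketch only: nmi_total is a hypothetical name, and it assumes the row is labelled "NMI:" with one numeric column per CPU, which is how it looks on this machine.

```shell
# Sum the per-CPU NMI counts from an interrupts table.
# The optional argument is the file to read; it defaults to /proc/interrupts.
nmi_total() {
    awk '$1 == "NMI:" {
        # Add up only the purely numeric columns; any trailing
        # description text on the row is skipped.
        for (i = 2; i <= NF; i++)
            if ($i ~ /^[0-9]+$/) s += $i
        print s + 0
    }' "${1:-/proc/interrupts}"
}

# Usage: take two samples a few seconds apart and compare them.
#   nmi_total; sleep 10; nmi_total
```

A total that keeps growing between the two samples suggests the watchdog is still ticking, which is a stronger check than just seeing a non-zero number.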

I am in the process of building kernel 2.6.15 with soft lockup detection.  I will report back when I have results from testing with the new kernel with soft lock up detection turned on.
Comment 11 Hal Engel 2006-01-12 17:51:55 UTC
I emerged k3b with kernel-2.6.15 with soft lockup detection enabled.  I ran from a terminal session with no X server loaded as I wanted to minimize the number of variables.  When it got to the point where it would lock up if running emerge from KDE, it just rebooted.  That is, in KDE it locks up and in a terminal session it reboots.  I am not sure that the soft lockup detection did anything.  Does this create some kind of dump file?  If so, what is the name of the file?  I looked in /var/log/messages to see if I could find anything and there was nothing related to the crash.

I did some web searches to see if I could learn more about how exactly this was supposed to work but I can't find any documentation on where to find the output from soft lockup detection.
Comment 12 Hal Engel 2006-01-16 11:55:41 UTC
OK I finally got something.  Today I emerged gcc 3.4.4-r1 from a console using kernel 2.6.15 with soft lockup detection enabled.  It did a dump to the console but some of it scrolled off the screen, and if it dumped this to disk I do not know where it is.  If someone can tell me where the dump file is located I will attach it to this bug.  In any case I did copy down everything on the screen and here it is:

R10: 0000000000000008 R11: 0000000000000202 R12: ffff810002c116600
R13: 00000000fffffffd R14: 0000000000000005 R15: 0000000000000000
FS: 00002aaaab11cb00(0000) GS: ffffffff80499800(0000) knlGS: 00000000556686c0
CS: 0000000000000010 DS: 0000000000000000 ES: 0000000000000000 CR0: 000000000005003b
CR2: 00002aaaaaf25d70 CR3: 000000007cb71000 CR4: 00000000000006e0

Process syslog_ng (pid: 8203, threadinfo ffff81007ebb0000, task ffff81007ee1d0c0)

stack: ffffffff8012e5f8 0000000000000001 ffff81007ee1d0c0 0000000000000000 00000000fffffffd 0000000000000005 0000000000000002 0000000000000000 ffffffff8013cbb4 0000000000002006 

Call Trace: <IRQ> <ffffffff8012e5f8> {scheduler_tick+152}
 <ffffffff8010e80e> {apic_timer_interrupt+98} <EOI>
 <ffffffff80228e11> {__up_read+33}
 <ffffffff80226509> {__read_lock_failed+53}
 <ffffffff8036ad6e> {.text.lock.spinlock+39}
 <ffffffff8014a29> {do_tkill+121}
 <ffffffff8014101> {sigprocmask+225}
 <ffffffff80140b14> {sys_tgkill+36}
 <ffffffff8010dc26> {system_call+126}

Code: 83 3f 00 7e f9 e9 1d fd ff ff f3 90 83 3f 00 7e f9 e9 3a fd

Console shuts up...
<0>Kernel panic - not syncing: Aiee, killing interrupt handler!
Comment 13 Hal Engel 2006-01-18 13:59:33 UTC
I learned some more about what is failing.  This whole thing appears to have started when emerging gcc-3.4.4 a while back.  It appears that it left the system in a strange state.  I can get it to hang the system when I run fix_libtool_files.sh 3.4.3.  I see the following before the system hangs:

Scanning libtool for hardcoded gcc library paths ....
    [1/21] scanning /lib...
    [2/21] scanning /lib/user....
        Fixing: /lib/usr/libkwf.la [v]
        Fixing: /lib/usr/libflibmanagement_extra_def-2.4.la [v]
        Fixing: /lib/usr/libkwf.la [v]
        Fixing: /lib/usr/libflibmanagement_extra_def-2.4.la [v]

Then the system will hang if I am in KDE, or reboot if I am using a terminal session only (X not running).

So something is badly misconfigured on my system.  I have done a bunch of searching on the forums and have not found any reports of fix_libtool_files.sh causing system crashes.  Anyone have any ideas about how to fix this?
Comment 14 Daniel Drake (RETIRED) gentoo-dev 2006-04-25 13:52:31 UTC
This is a really strange bug, and about the only thing I can think of is a bad hard disk, which causes a system hang when a certain sector is accessed. Sounds very unlikely, but I can't think of any other explanation.

Do you have any other disks you can test upon? Can this disk be temporarily put in another PC?
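If swapping disks is not practical, one way to probe the bad-sector theory on the existing disk might be to read every .la file the way fix_libtool_files.sh would visit them, recording each path to a synced log before touching it. The last path in the log after a hang is then the prime suspect. This is only a sketch: scan_la_files and the log location are hypothetical, and it reads files without modifying them, so it will only trigger a hang that is caused by reads.

```shell
# Read every *.la file under a directory, logging each path (and syncing)
# BEFORE the read, so the log survives a hard freeze mid-read.
scan_la_files() {
    dir=$1
    progress=$2
    find "$dir" -type f -name '*.la' | sort | while IFS= read -r f; do
        printf '%s\n' "$f" >> "$progress"
        sync
        # A plain read is enough to force the file's blocks off the disk
        # (assuming they are not already cached).
        cat "$f" > /dev/null || printf 'READ ERROR: %s\n' "$f" >> "$progress"
    done
}

# Usage (from a console, ideally right after a fresh boot):
#   scan_la_files /usr/lib /root/la-scan.log
```

If the scan completes cleanly, that would argue against a bad sector under those files and back toward something the script does when it rewrites them.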
Comment 15 Daniel Drake (RETIRED) gentoo-dev 2006-05-08 08:50:30 UTC
see comment #14