Bug 289404 - sys-cluster/torque-2.3.7: pbs_server dies probably due to glibc issues
Summary: sys-cluster/torque-2.3.7: pbs_server dies probably due to glibc issues
Status: RESOLVED INVALID
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: Current packages
Hardware: All Linux
Importance: High normal
Assignee: Justin Bronder (RETIRED)
Reported: 2009-10-16 22:28 UTC by Martin Mokrejš
Modified: 2009-11-03 14:20 UTC

Description Martin Mokrejš 2009-10-16 22:28:08 UTC
Hi,
I cannot say yet what the exact cause is, but pbs_server seems to crash when too many jobs are submitted in a row locally on the ~amd64 server. If the jobs are submitted from nodes running stable amd64, things go well.

The weird thing is that it affected the whole machine. For example, I could not access some NFS-hosted files. My portage tree is on an NFS drive, and seemingly at random some files could not be updated on 'emerge --sync'. Some applications were crashing, and all of this started, on several occasions, when one of the users submitted 1000 jobs to the queue from the server (i.e. locally). I cannot imagine any cause other than the kernel or glibc. A reboot fixed the situation until the next attempt to submit that many jobs.

Oct  4 23:30:45 nfssrv kernel: gmond[2201]: segfault at 18 ip 00007f70396037ba sp 00007f70380b54e0 error 4 in libpython2.6.so.1.0[7f703957d000+140000]
Oct  5 23:06:52 nfssrv kernel: sshd[14236]: segfault at 7f27b7d94ce0 ip 00007f27b7d94ce0 sp 00007fff3e0165c0 error 15
Oct  6 01:48:44 nfssrv kernel: rrdtool[32181]: segfault at 0 ip 00007f6feb4389d0 sp 00007fff566c2388 error 4 in libc-2.10.1.so[7f6feb3bd000+14f000]
Oct 13 19:38:16 nfssrv kernel: pbs_sched[28974]: segfault at 7fc599e2aa50 ip 00007fc5992f5886 sp 00007fffd78bf0c0 error 4 in libc-2.10.1.so[7fc599282000+14f000]
Oct 14 12:57:54 nfssrv kernel: python2.6[7702]: segfault at ffffffffffffffff ip 00007fd7ffeed1bd sp 00007fff01c3b730 error 4 in libpython2.6.so.1.0[7fd7ffe79000+140000]
Oct 14 14:01:38 nfssrv kernel: python2.6[9843]: segfault at a9 ip 00007f68e21a882b sp 00007fffefadb3a0 error 4 in libpython2.6.so.1.0[7f68e20bd000+140000]
Oct 14 22:03:22 nfssrv kernel: pbs_sched[2381]: segfault at 130e538 ip 00007f883e1c7886 sp 00007fff5a8a9000 error 4 in libc-2.10.1.so[7f883e154000+14f000]
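
To see at a glance which binaries and libraries are involved, the kernel lines above can be summarized with a small script. This is only a minimal Python sketch: the names `parse_segfaults` and `decode_error` are made up for illustration, and it assumes the x86 "segfault at" message format shown above, where the kernel prints the page-fault error code in hex.

```python
import re
from collections import Counter

# Matches the x86 "segfault at" kernel message as it appears in the log above.
# The trailing " in <object>[base+size]" part is optional (the sshd line lacks it).
SEGFAULT_RE = re.compile(
    r"kernel: (?P<proc>[^\[\s]+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
    r"(?: in (?P<obj>[^\[\s]+)\[)?"
)


def decode_error(code):
    """Decode the x86 page-fault error code bits."""
    return {
        "protection": bool(code & 1),        # 0 = page not present, 1 = protection violation
        "write": bool(code & 2),             # 0 = read access, 1 = write access
        "user": bool(code & 4),              # fault happened in user mode
        "reserved_bit": bool(code & 8),      # reserved bit set in a page-table entry
        "instruction_fetch": bool(code & 16),
    }


def parse_segfaults(lines):
    """Return one dict per segfault line; lines that do not match are skipped."""
    records = []
    for line in lines:
        m = SEGFAULT_RE.search(line)
        if m is None:
            continue
        rec = m.groupdict()
        rec["flags"] = decode_error(int(rec["err"], 16))  # error code is hex
        records.append(rec)
    return records


# The seven log lines from this report, copied verbatim.
LOG = """\
Oct  4 23:30:45 nfssrv kernel: gmond[2201]: segfault at 18 ip 00007f70396037ba sp 00007f70380b54e0 error 4 in libpython2.6.so.1.0[7f703957d000+140000]
Oct  5 23:06:52 nfssrv kernel: sshd[14236]: segfault at 7f27b7d94ce0 ip 00007f27b7d94ce0 sp 00007fff3e0165c0 error 15
Oct  6 01:48:44 nfssrv kernel: rrdtool[32181]: segfault at 0 ip 00007f6feb4389d0 sp 00007fff566c2388 error 4 in libc-2.10.1.so[7f6feb3bd000+14f000]
Oct 13 19:38:16 nfssrv kernel: pbs_sched[28974]: segfault at 7fc599e2aa50 ip 00007fc5992f5886 sp 00007fffd78bf0c0 error 4 in libc-2.10.1.so[7fc599282000+14f000]
Oct 14 12:57:54 nfssrv kernel: python2.6[7702]: segfault at ffffffffffffffff ip 00007fd7ffeed1bd sp 00007fff01c3b730 error 4 in libpython2.6.so.1.0[7fd7ffe79000+140000]
Oct 14 14:01:38 nfssrv kernel: python2.6[9843]: segfault at a9 ip 00007f68e21a882b sp 00007fffefadb3a0 error 4 in libpython2.6.so.1.0[7f68e20bd000+140000]
Oct 14 22:03:22 nfssrv kernel: pbs_sched[2381]: segfault at 130e538 ip 00007f883e1c7886 sp 00007fff5a8a9000 error 4 in libc-2.10.1.so[7f883e154000+14f000]
""".splitlines()

records = parse_segfaults(LOG)
by_proc = Counter(r["proc"] for r in records)
by_obj = Counter(r["obj"] or "(no mapped object)" for r in records)

for name, count in by_obj.most_common():
    print(f"{count}  {name}")
for name, count in by_proc.most_common():
    print(f"{count}  {name}")
```

On these seven lines the faults split between libc-2.10.1.so (rrdtool and pbs_sched twice) and libpython2.6.so.1.0 (gmond and python2.6 twice), plus the sshd fault, which has no mapped object and whose error code 0x15 decodes as an instruction fetch in user mode. The crashing processes are gmond, sshd, rrdtool, pbs_sched and python2.6.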


Since then I have recompiled the same glibc, gcc and some other system parts, with no luck. I also tried upgrading the kernel from 2.6.30.1 to 2.6.30.8 and 2.6.30.9.

# emerge --info
Portage 2.1.7.1 (default/linux/amd64/2008.0, gcc-4.4.1, glibc-2.10.1-r0, 2.6.30.9-default x86_64)
=================================================================
System uname: Linux-2.6.30.9-default-x86_64-Intel-R-_Core-TM-2_Quad_CPU_Q6600_@_2.40GHz-with-gentoo-2.0.1
Timestamp of tree: Fri, 16 Oct 2009 14:00:01 +0000
app-shells/bash:     4.0_p33
dev-java/java-config: 1.3.7-r1, 2.1.9-r1
dev-lang/python:     2.4.6, 2.5.4-r3, 2.6.3, 3.1.1-r1
dev-python/pycrypto: 2.0.1-r8
sys-apps/baselayout: 2.0.1
sys-apps/openrc:     0.4.3-r3
sys-apps/sandbox:    2.1
sys-devel/autoconf:  2.13, 2.63-r1
sys-devel/automake:  1.5, 1.7.9-r1, 1.8.5-r3, 1.9.6-r2, 1.10.2, 1.11
sys-devel/binutils:  2.19.1-r1
sys-devel/gcc-config: 1.4.1
sys-devel/libtool:   2.2.6a
virtual/os-headers:  2.6.30-r1
ACCEPT_KEYWORDS="amd64 ~amd64"
CBUILD="x86_64-pc-linux-gnu"
CFLAGS="-O2 -pipe -march=nocona"
CHOST="x86_64-pc-linux-gnu"
CONFIG_PROTECT="/etc /var/qmail/alias /var/qmail/control"
CONFIG_PROTECT_MASK="/etc/ca-certificates.conf /etc/env.d /etc/env.d/java/ /etc/fonts/fonts.conf /etc/gconf /etc/gentoo-release /etc/php/apache2-php5/ext-active/ /etc/php/cgi-php5/ext-active/ /etc/php/cli-php5/ext-active/ /etc/revdep-rebuild /etc/sandbox.d /etc/terminfo /etc/udev/rules.d"
CXXFLAGS="-O2 -pipe -march=nocona"
DISTDIR="/nfslarge/usr/portage/distfiles"
FEATURES="assume-digests distlocks fixpackages news nostrip parallel-fetch protect-owned sandbox sfperms splitdebug strict unmerge-logs unmerge-orphans userfetch"
GENTOO_MIRRORS="http://distfiles.gentoo.org http://distro.ibiblio.org/pub/linux/distributions/gentoo"
LDFLAGS="-Wl,-O1"
PKGDIR="/usr/portage/packages"
PORTAGE_CONFIGROOT="/"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --stats --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/nfslarge/usr/portage"
PORTDIR_OVERLAY="/nfslarge/usr/portage/local/layman/science /nfslarge/usr/portage/local/layman/sunrise /nfslarge/usr/portage/local"
SYNC="rsync://rsync.gentoo.org/gentoo-portage"
USE="X amd64 apache2 berkdb bzip2 cgi crypt dri gdbm hal java jce kerberos laptop mmx modules mpi mpich2 multilib ncurses nptl nptlonly nsplugin pam pcre python readline sse sse2 ssl svg sysfs syslog tcpd unicode xorg zlib" ALSA_CARDS="ali5451 als4000 atiixp atiixp-modem bt87x ca0106 cmipci emu10k1x ens1370 ens1371 es1938 es1968 fm801 hda-intel intel8x0 intel8x0m maestro3 trident usb-audio via82xx via82xx-modem ymfpci" ALSA_PCM_PLUGINS="adpcm alaw asym copy dmix dshare dsnoop empty extplug file hooks iec958 ioplug ladspa lfloat linear meter mmap_emul mulaw multi null plug rate route share shm softvol" APACHE2_MODULES="actions alias auth_basic authn_alias authn_anon authn_dbm authn_default authn_file authz_dbm authz_default authz_groupfile authz_host authz_owner authz_user autoindex cache dav dav_fs dav_lock deflate dir disk_cache env expires ext_filter file_cache filter headers include info log_config logio mem_cache mime mime_magic negotiation rewrite setenvif speling status unique_id userdir usertrack vhost_alias" ELIBC="glibc" INPUT_DEVICES="keyboard mouse evdev" KERNEL="linux" LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text" USERLAND="GNU" VIDEO_CARDS="vesa nvidia" 
Unset:  CPPFLAGS, CTARGET, EMERGE_DEFAULT_OPTS, FFLAGS, INSTALL_MASK, LANG, LC_ALL, LINGUAS, MAKEOPTS, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS
Comment 1 Justin Bronder (RETIRED) gentoo-dev 2009-11-02 14:54:47 UTC
There's nothing here that actually points at torque as a problem.  I'd check your hardware.
Comment 2 Martin Mokrejš 2009-11-03 14:20:44 UTC
I worked around the problem by upgrading to torque-2.4.1b1. The hardware is fine.