Yesterday I upgraded from 2.5.16 to 2.6.4 because the core had recently been crashing all the time. Unfortunately, the situation did not get any better. I spent the whole afternoon investigating, but made no progress. There are no hints at all in the system log (/var/log/messages) or in the mldonkey log (/var/log/mldonkey.log).

mlnet (running on a server with gentoo-sources-2.6.12-r10 / gcc 3.4.4-r1 / glibc 2.3.5-r1 / nptl / mldonkey emerged with USE flags "gd" and "threads") generally works, but always crashes suddenly after a seemingly random period (sometimes after half an hour, sometimes after several hours). When that happens, sancho-gui (0.9.4-47, running on a WinXP workstation) disconnects, and the mlnet process on the server simply disappears without further notice. Strangely, it can only be restarted after I delete the mlnet.pid file in the mldonkey home directory. I have no idea why: I did not have to do this with mldonkey 2.5.16, and I cannot even remember the mlnet.pid file being stored in the home directory. Isn't that what /var/run is for?

I emerged mldonkey 2.6.4 normally, since it is in portage now, and before that I upgraded ocaml to 3.08.3 the same way (emerged from portage, without the "batch" USE flag). As already mentioned, there is no hint as to why mlnet crashes or what exactly happens when it does. The only abnormal messages in mldonkey.log are repeated lines of "[BT] Unknown BT client found please report the next line to the dev team: BTUC:.....", but I do not expect these to be causing the problem. I already searched bugs.gentoo.org and found bug #103411, but that one is about a memory problem, which does not occur here (mlnet stays at a memory usage of about 6%, and the machine has 1 GB RAM).

The only change I made recently was playing around with the NICE setting in /etc/conf.d/mldonkey, which was set to "19" by default. At first I lowered that setting to "3" and then to "0", because I thought it might have something to do with CPU usage. The machine has a P4 2.4, but I let the ondemand CPU governor scale it down to 300 MHz under low load. It could be a coincidence, but I think lowering the NICE value really helped, in that the crashes became less frequent (I have the feeling that the periods between crashes have become longer).

I use mldonkey only for BitTorrent at the moment; all other protocols are deactivated. Could it be that mldonkey can be killed by "fake" data packets, "hostile" uploaders or "hostile" client software? These crashes did not occur in the past. When I started with 2.5.16, the core ran stably for days without any problem. It only got worse within the past few months, which is why I thought it might be an outside influence (changes in the BT protocol, or problems with other uploaders' client software). Upgrading ocaml and mldonkey itself did not help at all.

On the mldonkey forums it was suggested that this could be a Gentoo problem, because no such issue is known on other Linux or *BSD distributions. Is there any way to analyse the problem further, to find out why the core crashes without any hint in the logs and after a seemingly random period? I would expect traces to remain somewhere in the system when a process disappears. Hopefully someone has an idea, or is fighting the same problem, so that this issue can be solved collectively. The current situation is very depressing: I use Gentoo on all my machines, and the server in question also handles some other services, so switching to another distribution (or even FreeBSD) is not possible. I can't believe it ends with "mldonkey simply does not work on Gentoo Linux".

Reproducible: Always

Steps to Reproduce:
Just start mldonkey as a service (using /etc/init.d/mldonkey start).
Actual Results:
It crashed after a seemingly random time without any feedback.

Expected Results:
It should be running stable and uninterrupted for days or even weeks.

Portage 2.0.51.22-r2 (default-linux/x86/2005.0, gcc-3.4.4, glibc-2.3.5-r1, 2.6.12-gentoo-r10 i686)
=================================================================
System uname: 2.6.12-gentoo-r10 i686 Intel(R) Pentium(R) 4 CPU 2.40GHz
Gentoo Base System version 1.6.13
distcc 2.18.3 i686-pc-linux-gnu (protocols 1 and 2) (default port 3632) [enabled]
ccache version 2.3 [enabled]
dev-lang/python: 2.3.5-r2
sys-apps/sandbox: 1.2.12
sys-devel/autoconf: 2.13, 2.59-r6
sys-devel/automake: 1.4_p6, 1.5, 1.6.3, 1.7.9-r1, 1.8.5-r3, 1.9.6
sys-devel/binutils: 2.15.92.0.2-r10
sys-devel/libtool: 1.5.18-r1
virtual/os-headers: 2.6.11-r2
ACCEPT_KEYWORDS="x86"
AUTOCLEAN="yes"
CBUILD="i686-pc-linux-gnu"
CFLAGS="-O2 -march=pentium4 -pipe -fomit-frame-pointer"
CHOST="i686-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/kde/2/share/config /usr/kde/3/share/config /usr/share/config /var/qmail/control"
CONFIG_PROTECT_MASK="/etc/gconf /etc/terminfo /etc/env.d"
CXXFLAGS="-O2 -march=pentium4 -pipe -fomit-frame-pointer"
DISTDIR="/usr/portage/distfiles"
FEATURES="autoconfig ccache distcc distlocks sandbox sfperms strict"
GENTOO_MIRRORS="http://gentoo.inode.at http://gentoo.osuosl.org http://www.ibiblio.org/pub/Linux/distributions/gentoo"
LANG="de_DE"
LC_ALL="de_DE@euro"
LINGUAS="de"
MAKEOPTS="-j5"
PKGDIR="/usr/portage/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY="/usr/local/portage"
SYNC="rsync://lanmaster/gentoo-portage"
USE="x86 acpi apache2 bash-completion berkdb crypt eds fortran gd gpm gstreamer logrotate ncurses nls nptl ogg pam perl pic python readline samba ssl tcpd threads vorbis xml2 zlib linguas_de userland_GNU kernel_linux elibc_glibc"
Unset: ASFLAGS, CTARGET, LDFLAGS
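On the question of whether a vanishing process leaves any trace: one option is to start the core in the foreground from a small wrapper that records how it ended. The following is only a minimal sketch in Python, not something taken from the mldonkey init script. The helper names `describe_exit` and `run_and_report` are made up for illustration, and it assumes the command stays in the foreground instead of daemonizing.

```python
import signal
import subprocess


def describe_exit(returncode):
    """Turn a subprocess return code into a readable description."""
    if returncode < 0:
        # On POSIX, subprocess reports "killed by signal N" as -N.
        signum = -returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number without a symbolic name on this platform.
            name = str(signum)
        return "killed by signal %s" % name
    return "exited with status %d" % returncode


def run_and_report(argv):
    """Run argv in the foreground, wait for it, and describe how it ended."""
    completed = subprocess.run(argv)
    return describe_exit(completed.returncode)
```

A call such as `print(run_and_report(["mlnet"]))` (the bare command name is an assumption here) would then show whether the core exited on its own or was killed by a signal such as SIGSEGV or SIGKILL.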
You wrote that you lowered the nice level from 19 to 0 and the crashes became rarer. Well, in fact you have increased mlnet's priority (-20 is the highest and 19 is the lowest). That may mean that mlnet dies when it does not get enough CPU time (which of course shouldn't happen, but...). Could you try turning off the CPU governor, so that the machine always runs at its default 2.4 GHz, and let us know if that changes anything?
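Since the direction of the nice scale is easy to get backwards, here is a small illustrative sketch in Python. The function names are invented for this example, and the range -20..19 is the usual Linux one.

```python
import os

# Usual Linux range: -20 is the most favourable, 19 the least favourable.
NICE_MIN, NICE_MAX = -20, 19


def compare_nice(old, new):
    """Say what changing the nice value from old to new does to priority."""
    for value in (old, new):
        if not NICE_MIN <= value <= NICE_MAX:
            raise ValueError("nice value out of range: %d" % value)
    if new < old:
        return "higher priority"
    if new > old:
        return "lower priority"
    return "unchanged"


def current_nice(pid=0):
    """Return the nice value of a process (pid 0 means this process)."""
    return os.getpriority(os.PRIO_PROCESS, pid)
```

With these definitions, `compare_nice(19, 0)` returns "higher priority", which matches the change from NICE="19" to NICE="0" described in the report.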
@Marcin Kryczek It may be worth a try, but I don't really think that has anything to do with it. The reason is that the machine stays at 300 MHz most of the time, and mlnet then consumes only about 10% CPU and 6% MEM. The ondemand trigger is set to 80%, and as soon as that value is reached, the CPU immediately goes up to 2.4 GHz, so nothing is maxing out the CPU at any given time. On the other hand, the crashes seem to appear totally at random. ATM the core shows an uptime of 3.5 hours, with the CPU frequency staying at 300 MHz and 15 BT downloads. After the next crash occurs, I will set the CPU governor to "performance" to see what happens then.
I _suppose_ I'm getting mldonkey crashes, too. I say 'suppose' because the error I'm experiencing is a system hang on shutdown during 'Stopping service mldonkey', plus a leftover mlnet.pid. Could this be related to bug #103433? I'll try to witness such a crash; right now it's running and I can stop it without problems with /etc/init.d/mldonkey stop.
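A leftover mlnet.pid can be checked before deleting it by hand. The sketch below is a hypothetical Python helper, not part of the init script. It treats a PID file as stale when it names a process that no longer exists, using signal 0 as a pure existence check.

```python
import os


def pid_is_running(pid):
    """Return True if a process with this pid currently exists."""
    try:
        os.kill(pid, 0)  # signal 0 delivers nothing, it only checks
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def pidfile_is_stale(path):
    """Return True if the PID file exists but does not name a live process."""
    try:
        with open(path) as fh:
            text = fh.read().strip()
    except FileNotFoundError:
        return False  # no file, so nothing stale to clean up
    try:
        pid = int(text)
    except ValueError:
        return True  # unparsable content cannot belong to a live process
    if pid <= 0:
        # 0 and negative values would address process groups in os.kill.
        return True
    return not pid_is_running(pid)
```

Note that a recycled PID can make a stale file look live, so this is a heuristic and not a guarantee.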
(In reply to comment #0)
> distcc 2.18.3 i686-pc-linux-gnu (protocols 1 and 2) (default port 3632)
> MAKEOPTS="-j5"

I had some problems on Solaris with distcc, make -j5 and Ocaml applications. Try compiling Ocaml and MLDonkey without distcc and with make -j1. Maybe it helps.
Both the versions in portage and the precompiled cores from http://download.berlios.de/pub/mldonkey/spiralvoice/ crash. The 2.6.4 precomp core logs

2005/09/13 22:53:39 [cF] Checksum computation failed: Exception: os_read failed: Input/output error

before dying.
@Daniel Vianna That has to be a different problem, because my issue does not produce any error message.

In the meantime, I tried a few different things:

- Recompiled ocaml 3.08.3 and mldonkey 2.6.4 with the following settings:
  CFLAGS="-O1 -march=pentium4 -pipe -fomit-frame-pointer"
  MAKEOPTS="-j1"
  FEATURES="-ccache -distcc"

- Added the following system settings:

  /etc/security/limits.conf
  * soft nproc 4096
  * hard nproc 16384
  * soft nofile 4096
  * hard nofile 65536

  /etc/sysctl.conf
  kernel.shmall = 2097152
  kernel.shmmax = 2147483648
  kernel.shmmni = 4096
  kernel.sem = 250 32000 100 128
  fs.file-max = 65536

I don't know whether any of these measures helped, but it seems to be more stable again. The current uptime of the core is one day; before that it was about 9 hours (then it crashed again after I added some new torrents).

BTW, since the upgrade to 2.6.4 I (again) have the problem with those phantom commits. When a file download is finished, committed and moved from the incoming folder to its final destination, files with the same name and a size of 0 KB keep showing up in the incoming folder. No idea what that's all about...
BTW, since last month there has been a new ocaml version, 3.08.4, which seems to be a bugfix release. Any idea why it is still not in portage? It might be worth re-emerging mldonkey with ocaml 3.08.4 installed.
I forgot to mention that I have had the cpufreq governor set to "performance" since the last crash, so maybe none of the other settings have any influence at all, and it was all about the P4 frequency throttling. I will do some more tests with the ondemand governor as soon as I find the time. I really would like to have that working: the ondemand governor works really well for everything else, and why let the machine run at 2.4 GHz 24/7 if it can also operate at only 300 MHz when load is low?
I think the problem is solved: it was indeed the "ondemand" CPU governor! I have reverted the system changes mentioned above, updated to ocaml 3.08.4 and mldonkey 2.6.4-r1 (both compiled with my system-wide standard settings), and switched to the "performance" CPU governor. Since then, mlnet has been running for days without interruption or crash. Because I had used the "ondemand" CPU governor for quite some time and it did not cause any problems at the beginning, I think something changed with one of the last kernel upgrades. The only remaining problem is that I still get phantom files with a size of 0 KB in the incoming folder after a commit. That's not really tragic, but annoying nevertheless.
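For anyone who wants to check or repeat the governor switch, here is a rough Python sketch. It assumes the standard cpufreq sysfs layout (a scaling_governor file per CPU under /sys/devices/system/cpu), and the function names are invented for illustration. Writing to the real sysfs files needs root.

```python
import glob

# Assumed standard cpufreq sysfs location; adjust if your kernel differs.
GOVERNOR_GLOB = "/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor"


def read_governors(pattern=GOVERNOR_GLOB):
    """Return a dict mapping each governor file to its current setting."""
    governors = {}
    for path in sorted(glob.glob(pattern)):
        with open(path) as fh:
            governors[path] = fh.read().strip()
    return governors


def set_governor(name, pattern=GOVERNOR_GLOB):
    """Write the governor name to every matching file; return the count."""
    paths = glob.glob(pattern)
    for path in paths:
        with open(path, "w") as fh:
            fh.write(name)
    return len(paths)
```

Calling `set_governor("performance")` as root would correspond to the switch described above, and `read_governors()` shows what is active afterwards.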
*** Bug 111326 has been marked as a duplicate of this bug. ***
Reopen wrt Bug 111326.
It crashes as hell. It's impossible to use any mlnet >=2.6.5. Maybe they should be masked. 2.6.4-r2 works... well, fine. I don't use any cpufreq program.

Portage 2.0.53_rc7 (default-linux/x86/2005.0, gcc-3.4.4, glibc-2.3.5-r3, 2.6.14-gentoo i686)
=================================================================
System uname: 2.6.14-gentoo i686 AMD Athlon(TM) XP 1800+
Gentoo Base System version 1.12.0_pre9
ccache version 2.4 [enabled]
dev-lang/python: 2.4.2
sys-apps/sandbox: 1.2.13
sys-devel/autoconf: 2.13, 2.59-r7
sys-devel/automake: 1.4_p6, 1.5, 1.6.3, 1.7.9-r1, 1.8.5-r3, 1.9.6-r1
sys-devel/binutils: 2.16.1
sys-devel/libtool: 1.5.20-r1
virtual/os-headers: 2.6.11-r2
ACCEPT_KEYWORDS="x86 ~x86"
AUTOCLEAN="yes"
CBUILD="i686-pc-linux-gnu"
CFLAGS="-march=athlon-xp -mmmx -m3dnow -msse -mfpmath=sse,387 -ffast-math -O2 -fomit-frame-pointer -frename-registers -funroll-loops -pipe"
CHOST="i686-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/kde/2/share/config /usr/kde/3.5/env /usr/kde/3.5/share/config /usr/kde/3.5/shutdown /usr/kde/3/share/config /usr/lib/X11/xkb /usr/share/config /var/qmail/control"
CONFIG_PROTECT_MASK="/etc/gconf /etc/terminfo /etc/env.d"
CXXFLAGS="-march=athlon-xp -mmmx -m3dnow -msse -mfpmath=sse,387 -ffast-math -O2 -fomit-frame-pointer -frename-registers -funroll-loops -pipe"
DISTDIR="/usr/portage/distfiles"
FEATURES="autoconfig ccache distlocks sandbox sfperms strict"
GENTOO_MIRRORS="http://linuv.uv.es/mirror/gentoo/ http://www.caliu.info/pub/gentoo/"
LANG="es_ES.UTF-8"
LC_ALL="es_ES.UTF-8"
LDFLAGS="-Wl,-O1"
LINGUAS="es"
PKGDIR="/usr/portage/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY="/usr/local/portage"
SYNC="rsync://rsync.gentoo.org/gentoo-portage"
USE="x86 16bit 3dnow 3dnowext 7zip S3TC X a52 aac aalib acpi alsa apache2 audiofile bash-completion berkdb bidi bzip2 cairo cddb cdparanoia cdr chroot cjk clock-screen crypt cscope css cups curl dba dbus dlloader dts dvd dvdr dvdread dynagraph ecc edl eds emboss erandom exif faac faad fam fbcon ffmpeg flac font-server fontconfig foomaticdb foreign-sysvinit ftp gd gdbm gif gimpprint glibc-omitfp glitz gpm graphviz gs gtk2 hal hardened hpn icecast iconv idn imagemagick imlib imlib2 immqt-bc ipv6 irmc ithreads jabber java javascript jbig jce jikes jpeg jpeg2k justify kde kdeenablefinal lcms libcaca libg++ libwww linguas_es live lm_sensors logitech-mouse logrotate lzo lzw-tiff mad matroska md5sum mikmod mmap mmx mmxext mng monkey moznocompose moznoirc moznomail mozsvg mp3 mpeg mpeg4 mpi mplayer msn musepack musicbrainz mysql mysqli ncurses network nls no-old-linux no_wxgtk1 nomac nomalloccheck nomotif nptl nptlonly ogg oggvorbis openexr opengl pam pdflib perl pic png ppds python qt quicktime rdesktop readline rtc ruby sftplogging slp speex spell sse ssl stencil-buffer svg symlink tcpd tga theora threads tiff toolbar truetype truetype-fonts udev unicode urandom usb userlocales utf8 vcd vhosts vim-with-x visualization vorbis win32codecs wmf xine xml2 xpm xprint xrandr xscreensaver xv xvid yv12 zeroconf zip zlib userland_GNU kernel_linux elibc_glibc"
Unset: ASFLAGS, CTARGET, MAKEOPTS
CFLAGS="... -fomit-frame-pointer ..." see bug #111626 for more details.
ok, it seems my old problem was caused by parallel shutdown in /etc/conf.d/rc, will search or submit this as another bug
marking as closed