I'm having a problem with NFS on kernels 2.6.19-r2, -r3, and -r4. My system runs without a problem on 2.6.18-r4. I used "make oldconfig" and then "make menuconfig" to make the changes for the new SATA system.

The system boots fine, all the drives seem to work fine, and I can log in at the console as my user or as root. X and KDM start up fine, but when I enter my password in KDM it sits there for about two minutes before the "Loading KDE" dialog appears; that sits there for another two minutes and then disappears. After another two minutes the desktop changes to my background, and then a dialog box pops up with "Media protocol died unexpectedly". After a few more minutes the rest of the KDE desktop shows up and all seems to be fine. However, starting almost any KDE app takes a long time. GTK apps like GAIM and Firefox start right up, though they occasionally hang for a while and then go back to work. Firefox also crashes much more often than usual. Everything is so slow that the system is unusable.

Reverting to 2.6.18 fixes all issues. I set up a test user account with a local home directory and it works fine with 2.6.19, so it is definitely an NFS problem.

Reproducible: Always

Steps to Reproduce:
1. Boot with any Gentoo 2.6.19 kernel (didn't try vanilla).
2. Try to run KDE with an NFS-mounted home.
3. See delays and errors.

Actual Results:
Slow and unusable system.

Expected Results:
Normally working system, the same as with previous kernels.
Why do you think this is an NFS problem? Your bug description does not indicate where NFS is involved at all. I suggest you provide more information about your setup.
(In reply to comment #1)
> Why do you think this is a NFS problem? Your bug description does not indicate
> where NFS is involved at all. I suggest you provide more information about your
> setup.

Sorry, I see I failed to mention the most important point. My home directory is on an NFS mounted partition. When I use an account created with a local home directory it works fine. If you need any more information just let me know.

isisdvp1 ~ # emerge --info
Portage 2.1.1-r2 (default-linux/x86/2006.1/desktop, gcc-4.1.1, glibc-2.4-r4, 2.6.18-gentoo-r4 i686)
=================================================================
System uname: 2.6.18-gentoo-r4 i686 Intel(R) Pentium(R) 4 CPU 3.06GHz
Gentoo Base System version 1.12.6
Last Sync: Mon, 15 Jan 2007 13:00:02 +0000
ccache version 2.4 [enabled]
app-admin/eselect-compiler: [Not Present]
dev-java/java-config: 1.3.7, 2.0.31-r2
dev-lang/python: 2.4.3-r4
dev-python/pycrypto: 2.0.1-r5
dev-util/ccache: 2.4-r6
dev-util/confcache: [Not Present]
sys-apps/sandbox: 1.2.17
sys-devel/autoconf: 2.13, 2.61
sys-devel/automake: 1.4_p6, 1.5, 1.6.3, 1.7.9-r1, 1.8.5-r3, 1.9.6-r2, 1.10
sys-devel/binutils: 2.16.1-r3
sys-devel/gcc-config: 1.3.14
sys-devel/libtool: 1.5.22
virtual/os-headers: 2.6.17-r2
ACCEPT_KEYWORDS="x86"
AUTOCLEAN="yes"
CBUILD="i686-pc-linux-gnu"
CFLAGS="-O2 -march=pentium4 -pipe"
CHOST="i686-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/kde/3.5/env /usr/kde/3.5/share/config /usr/kde/3.5/shutdown /usr/share/X11/xkb /usr/share/config"
CONFIG_PROTECT_MASK="/etc/env.d /etc/env.d/java/ /etc/gconf /etc/java-config/vms/ /etc/revdep-rebuild /etc/terminfo"
CXXFLAGS="-O2 -march=pentium4 -pipe"
DISTDIR="/usr/portage/distfiles"
FEATURES="autoconfig ccache distlocks metadata-transfer parallel-fetch sandbox sfperms strict"
GENTOO_MIRRORS="http://distfiles.gentoo.org http://distro.ibiblio.org/pub/linux/distributions/gentoo"
MAKEOPTS="-j2"
PKGDIR="/usr/portage/packages"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --delete-after --stats --timeout=180 --exclude='/distfiles' --exclude='/local' --exclude='/packages'"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY="/usr/local/portage /usr/local/layman/xeffects"
SYNC="rsync://rsync.gentoo.org/gentoo-portage"
USE="x86 X aim alsa alsa_cards_ali5451 alsa_cards_als4000 alsa_cards_atiixp alsa_cards_atiixp-modem alsa_cards_bt87x alsa_cards_ca0106 alsa_cards_cmipci alsa_cards_emu10k1x alsa_cards_ens1370 alsa_cards_ens1371 alsa_cards_es1938 alsa_cards_es1968 alsa_cards_fm801 alsa_cards_hda-intel alsa_cards_intel8x0 alsa_cards_intel8x0m alsa_cards_maestro3 alsa_cards_trident alsa_cards_usb-audio alsa_cards_via82xx alsa_cards_via82xx-modem alsa_cards_ymfpci alsa_pcm_plugins_adpcm alsa_pcm_plugins_alaw alsa_pcm_plugins_asym alsa_pcm_plugins_copy alsa_pcm_plugins_dmix alsa_pcm_plugins_dshare alsa_pcm_plugins_dsnoop alsa_pcm_plugins_empty alsa_pcm_plugins_extplug alsa_pcm_plugins_file alsa_pcm_plugins_hooks alsa_pcm_plugins_iec958 alsa_pcm_plugins_ioplug alsa_pcm_plugins_ladspa alsa_pcm_plugins_lfloat alsa_pcm_plugins_linear alsa_pcm_plugins_meter alsa_pcm_plugins_mulaw alsa_pcm_plugins_multi alsa_pcm_plugins_null alsa_pcm_plugins_plug alsa_pcm_plugins_rate alsa_pcm_plugins_route alsa_pcm_plugins_share alsa_pcm_plugins_shm alsa_pcm_plugins_softvol arts berkdb bitmap-fonts cairo caps cdr cli cracklib crypt cups dbus divx dlloader dri dvd dvdr elibc_glibc emboss encode esd exif fam ffmpeg firefox font-server fortran gdbm gif gimpprint glitz gmedia gnome gpm gstreamer gtk hal iconv input_devices_keyboard input_devices_mouse isdnlog java javascript jpeg kde kernel_linux kqemu ldap libg++ mad mikmod mozbranding mp3 mpeg ncurses nls nptl nptlonly nsplugin nvidia offensive ofx ogg openal opengl oracle oss pam pcre pdf perl png ppds pppd python qt3 quicktime readline real realmedia reflection sdl session socks5 spell spl ssl svg tcpd truetype truetype-fonts type1-fonts udev userland_GNU video_cards_fglrx video_cards_radeon videos vorbis win32codecs wma wmp xml xorg xv zlib"
Unset: CTARGET, EMERGE_DEFAULT_OPTS, INSTALL_MASK, LANG, LC_ALL, LDFLAGS, LINGUAS, PORTAGE_RSYNC_EXTRA_OPTS
So NFS appears to be working, just very slowly? Is this NFSv3 or v4? Can you reproduce the problem with the latest development kernel (currently 2.6.20-rc5)?
(In reply to comment #3)
I'm getting some errors, such as "Media protocol died unexpectedly", and application crashes. I don't know whether a slowdown alone would cause them. If I log in through the console the NFS mount seems to work as expected, with no slowdown at all during an ls -l. Also, if you look at the thread http://forums.gentoo.org/viewtopic-t-524467.html you'll see at least one other person with NFS errors on 2.6.19. Are you saying you want me to try the vanilla 2.6.20-rc5? If so I'll go get it and try it.
(In reply to comment #3) NFS3.
Yes, please test that kernel. Also look for NFS-related errors in dmesg on 2.6.19.
(In reply to comment #6)
I tried 2.6.20-rc5 and it has the same problem as 2.6.19. The only NFS-related errors I see in dmesg are:

lockd: cannot monitor 10.184.2.23
lockd: failed to monitor 10.184.2.23

but these appear on all the kernels, even the 2.6.18 that works fine. I have the latest nfs-utils installed. The IP address above is the NFS server, which is running RedHat 7.3 and which I can't change. I also tried compiling the 2.6.19 kernel with and without DirectIO. It didn't make any difference.
I'm having problems with 2.6.19-r* as well. The NFS server is a "production" system running 2.6.18-r6. The complete /usr/portage tree is exported via NFS. This server's kernel and the other kernels are built with these options:

# zgrep NFS /proc/config.gz
CONFIG_NFS_FS=y
CONFIG_NFS_V3=y
CONFIG_NFS_V3_ACL=y
CONFIG_NFS_V4=y
CONFIG_NFS_DIRECTIO=y
CONFIG_NFSD=y
CONFIG_NFSD_V2_ACL=y
CONFIG_NFSD_V3=y
CONFIG_NFSD_V3_ACL=y
CONFIG_NFSD_V4=y
CONFIG_NFSD_TCP=y
CONFIG_NFS_ACL_SUPPORT=y
CONFIG_NFS_COMMON=y

net-fs/nfs-utils-1.0.10 is installed.

Two other systems (a desktop and a server) import the /usr/portage tree. Until 2.6.19 I never had a problem with "emerge -puvDN world" on these two systems. With any of the recent 2.6.19 kernels, 1 out of 3 or 4 invocations of emerge fails with a Python stack trace. Repeating the command usually succeeds. After switching back to 2.6.18 these problems disappeared again.

My client systems mount the NFS file systems via autofs. /usr/portage is a symlink to /net/utensil/usr_portage. /etc/autofs/auto.utensil contains the line

usr_portage -fstype=nfs,tcp,soft,vers=3,intr,retry=2,rsize=8192,wsize=8192 172.25.xxx.yy:/usr/portage
Can you confirm that networking in general performs as well on the newer kernels as on the old ones? You could use something like ttcp with 32kb packets. Which network hardware and drivers are being used?
Ok, I rebooted with 2.6.19-r4 and changed the mount parameters to rsize=32768 and wsize=32768. The problems persist, but not as frequently. During the day, while updating xorg-x11 to 7.2 (about 16 packages to install for my configuration), I saw the problem twice. Here is a typical log:

bx621 ~ # emerge -puvDN world
These are the packages that would be merged, in order:
Calculating world dependencies \Traceback (most recent call last):
  File "/usr/bin/emerge", line 4049, in ?
    emerge_main()
  File "/usr/bin/emerge", line 4044, in emerge_main
    myopts, myaction, myfiles, spinner)
  File "/usr/bin/emerge", line 3457, in action_build
    if not mydepgraph.xcreate(myaction):
  File "/usr/bin/emerge", line 1260, in xcreate
    if not self.select_dep(
  File "/usr/bin/emerge", line 1189, in select_dep
    myuse=selected_pkg[-1]):
  File "/usr/bin/emerge", line 824, in create
    if not self.select_dep("/",mydep["/"],myparent=mp,myuse=myuse):
  File "/usr/bin/emerge", line 1182, in select_dep
    myuse=selected_pkg[-1]):
  File "/usr/bin/emerge", line 824, in create
    if not self.select_dep("/",mydep["/"],myparent=mp,myuse=myuse):
  File "/usr/bin/emerge", line 1182, in select_dep
    myuse=selected_pkg[-1]):
  File "/usr/bin/emerge", line 764, in create
    iuses = set(mydbapi.aux_get(mykey, ["IUSE"])[0].split())
  File "/usr/lib/portage/pym/portage.py", line 4739, in aux_get
    raise KeyError(mycpv)
KeyError: 'net-nds/openldap-2.3.30-r2'

bx621 ~ # emerge -puvDN world
These are the packages that would be merged, in order:
Calculating world dependencies \
!!! Ebuilds for the following packages are either all
!!! masked or don't exist:
media-plugins/alsa-jack
... done!
Total size of downloads: 0 kB

Here is some additional info about the system.
bx621 ~ # ethtool eth0
Settings for eth0:
        Supported ports: [ FIBRE ]
        Supported link modes: 1000baseT/Half 1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes: 1000baseT/Half 1000baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: FIBRE
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: g
        Wake-on: d
        Current message level: 0x000000ff (255)
        Link detected: yes

I don't know about the FIBRE output; the system is connected with normal Ethernet cables.

bx621 ~ # lspci
00:00.0 Host bridge: Broadcom CMIC-LE Host Bridge (GC-LE chipset) (rev 32)
00:00.1 Host bridge: Broadcom CMIC-LE Host Bridge (GC-LE chipset)
00:00.2 Host bridge: Broadcom CMIC-LE Host Bridge (GC-LE chipset)
00:0b.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
00:0f.0 ISA bridge: Broadcom CSB6 South Bridge (rev a0)
00:0f.2 USB Controller: Broadcom CSB6 OHCI USB Controller (rev 05)
00:0f.3 Host bridge: Broadcom GCLE-2 Host Bridge
00:11.0 Host bridge: Broadcom CIOB-E I/O Bridge with Gigabit Ethernet (rev 12)
00:11.2 Host bridge: Broadcom CIOB-E I/O Bridge with Gigabit Ethernet (rev 12)
04:04.0 SCSI storage controller: Adaptec AIC-7902B U320 (rev 10)
04:04.1 SCSI storage controller: Adaptec AIC-7902B U320 (rev 10)
06:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704S Gigabit Ethernet (rev 02)
06:00.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704S Gigabit Ethernet (rev 02)

The driver is tg3. However, I saw the same problem on my desktop machine, which uses an Intel 1000 EtherExpress Pro controller.

As for performance, the server doesn't "feel" slower. A "time emerge -puvDN world" reports:

real 1m17.710s
user 0m30.686s
sys 0m2.472s

On my desktop with 2.6.18, running on a 100Mbit network, the same invocation takes:

real 2m16.541s
user 0m44.431s
sys 0m4.712s

The set of installed packages is roughly the same. This is not really scientific, but anyway.
Please test the performance another way -- some way that does not involve NFS.
(In reply to comment #11)
> Please test the performance another way -- some way that does not involve NFS.

Can you please elaborate a bit on this? What numbers exactly are you after? I could, for instance, install the iozone benchmark and compare 2.6.18-r6 against 2.6.19-r4, both with and without NFS. I could also mount the portage tree statically to see whether the automounter influences the outcome.
Install a webserver on one end, and measure the download rate and time for a big file. Even better, use ttcp or netcat to perform a more low-level test. The purpose is to establish that, apart from NFS, your network works comparably to how it did on the known-working kernels.
Ok, here is what I did.

Create a 1G test file:
dd if=/dev/urandom bs=32k count=32768 of=file && md5sum file

On server rx300s2 as sender with 2.6.18:
dd if=file bs=32k | nc -w 30 -q 0 -p 3333 -l

On server bx621 as receiver with 2.6.19:
nc -q 0 172.25.110.84 3333 | dd of=file bs=32k

3 runs (output of dd, MD5 always identical):
515+627545 records in
515+627545 records out
1073741824 Bytes (1,1 GB) copied, 95,6341 s, 11,2 MB/s
356+647237 records in
356+647237 records out
1073741824 Bytes (1,1 GB) copied, 93,2711 s, 11,5 MB/s
384+624731 records in
384+624731 records out
1073741824 Bytes (1,1 GB) copied, 95,3686 s, 11,3 MB/s

On server bx621 as receiver with 2.6.18:
3 runs (output of dd, MD5 always identical):
5536+551298 records in
5536+551298 records out
1073741824 Bytes (1,1 GB) copied, 93,1862 s, 11,5 MB/s
4262+567977 records in
4262+567977 records out
1073741824 Bytes (1,1 GB) copied, 92,586 s, 11,6 MB/s
5118+540528 records in
5118+540528 records out
1073741824 Bytes (1,1 GB) copied, 94,1598 s, 11,4 MB/s

Direction reversed

On server rx300s2 as receiver with 2.6.18:
nc -w 30 -q 0 -p 3333 -l | dd of=file bs=32k

On server bx621 as sender with 2.6.19:
dd if=file bs=32k | nc -q 0 172.25.110.84 3333

3 runs (output of dd, MD5 always identical):
1073741824 Bytes (1,1 GB) copied, 96,6306 s, 11,1 MB/s
1073741824 Bytes (1,1 GB) copied, 92,9207 s, 11,6 MB/s
1073741824 Bytes (1,1 GB) copied, 93,5508 s, 11,5 MB/s

On server bx621 as sender with 2.6.18:
3 runs (output of dd, MD5 always identical):
1073741824 Bytes (1,1 GB) copied, 98,1188 s, 10,9 MB/s
1073741824 Bytes (1,1 GB) copied, 98,1238 s, 10,9 MB/s
1073741824 Bytes (1,1 GB) copied, 98,3585 s, 10,9 MB/s

Since the test file was always transferred correctly and at the expected speed (for a 100Mbit network), I guess it's fair to say that low-level TCP appears to be working correctly. What is noteworthy, however, is the difference in the output of the dd command: there are about ten times as many complete 32k reads on 2.6.18 as on 2.6.19 (if what I remember about the dd output is correct).
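For reference, dd reports record counts as full+partial: the number before the plus sign counts reads that filled a whole block (32k here), and the number after it counts short reads. The following is a minimal sketch for checking the estimate against the first receiver run on each kernel; the helper name full_read_share is made up for illustration.

```shell
# Print the percentage of full-block reads from a dd "records in" line.
# dd prints "FULL+PARTIAL records in"; full_read_share is a hypothetical helper name.
full_read_share() {
    printf '%s\n' "$1" | LC_ALL=C awk '{
        split($1, n, "+")                      # n[1] = full reads, n[2] = partial reads
        printf "%.2f%%\n", 100 * n[1] / (n[1] + n[2])
    }'
}

full_read_share "515+627545 records in"     # first 2.6.19 run
full_read_share "5536+551298 records in"    # first 2.6.18 run
```

For these two runs the sketch prints 0.08% and 0.99%, so the share of full reads differs by roughly a factor of twelve, while nearly all reads are short on both kernels.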
I tried the new gentoo-sources-2.6.19-r5 since it was made stable, and I still have the same problem. I did notice one thing in the logs with the new kernel:

Feb 2 13:00:20 isisdvp1 statd: server localhost not responding, timed out

This is not in the logs with the 2.6.18 kernels.
It is now working. The error message about statd led me to the nfs-utils package, so I updated to the latest version. The package includes the /etc/init.d/nfsmount script which was not in my runlevel. Once I added that to the default runlevel NFS started working correctly. Why this is needed now for 2.6.19, but not for 2.6.18 I don't know, but my system is now working correctly. If this works for Frank we should consider this resolved.
Well, this might be an autofs problem after all. Anyway, I updated to 2.6.19-r5 as well and also put nfsmount into the default runlevel. I was already using the latest nfs-utils, since they are mandatory for NFS4. With this configuration I still saw the problem that an emerge sometimes failed for no apparent reason. Repeating the command usually works. I saw two failures during the last two days while installing the daily portage changes.

Then I deactivated automounting for the /usr/portage tree and mounted it statically. I also updated my desktop system to 2.6.19-r5 and mounted the portage tree statically there as well. This system was a little more out of date, so the number of packages that needed updating was larger.

The result is that I __didn't__ see any emerge problems since the switch to static mounts. I have repeatedly executed "emerge -puvDN world", each time with minor changes to the /etc/portage/package.keywords file, plus an "emerge -puevDN world" for good measure. I have not yet been able to reproduce my previous emerge problems. I'm fairly confident that, with the number of emerges I did, I should have seen the problem by now. I'll continue to test until the end of this week and then report back.
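Switching from the automounter to a static mount amounts to moving the same options into /etc/fstab. The comment above does not show the fstab line that was actually used, so the following is only a sketch: it assumes the autofs map entry quoted earlier in this report and /usr/portage as the mount point, and autofs_to_fstab is a hypothetical helper.

```shell
# Convert one autofs map entry ("key -options host:/path") read from stdin
# into an /etc/fstab line for the mount point given as $1.
# autofs_to_fstab is a hypothetical helper, not part of autofs or nfs-utils.
autofs_to_fstab() {
    LC_ALL=C awk -v mp="$1" '{
        opts = $2
        sub(/^-/, "", opts)                    # autofs prefixes the option list with "-"
        fstype = "nfs"; rest = ""
        n = split(opts, part, ",")
        for (i = 1; i <= n; i++) {
            if (part[i] ~ /^fstype=/) {
                fstype = substr(part[i], 8)    # text after "fstype="
            } else {
                rest = (rest == "" ? part[i] : rest "," part[i])
            }
        }
        print $3, mp, fstype, rest, 0, 0       # device, mount point, type, options, dump, pass
    }'
}

echo 'usr_portage -fstype=nfs,tcp,soft,vers=3,intr,retry=2,rsize=8192,wsize=8192 172.25.xxx.yy:/usr/portage' |
    autofs_to_fstab /usr/portage
```

The options are carried over unchanged from the map entry; whether all of them are wanted for a static mount is a separate decision.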
For the last 3 days I've been running my two systems with 2.6.19-r5 and statically mounted /usr/portage trees. I'm happy to report that during this time I never saw my emerge problems. I'm fairly confident that I would have seen the problem by now if it really persisted. Once the automounter/autofs is out of the equation, everything is dandy. I'm going to experiment a little more with the automounter (different timeout values) over the coming days, but I guess that would be material for another bug report. The original "NFS Problem" indeed appears to be solved.
I think this problem was caused by baselayout including the /etc/init.d/netmount script. This script will mount any NFS drives, but it does not start statd, as nfsmount does. Perhaps netmount should be removed from baselayout so that people who need to mount NFS drives would install nfs-utils and get nfsmount. How should this be handled? Should I open a new ticket against baselayout?
Mike, you do the nfs stuff right? Any thoughts on this?
It seems that my problem is not 100% related, but this is the closest bug I've found. Since 2.6.19, my NFS servers have stopped working properly. nfsd doesn't start, and here is what I find in syslog:

NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
NFSD: starting 90-second grace period
nfsd: last server has exited
nfsd: unexporting all filesystems
RPC: failed to contact portmap (errno -5).

The output is the same with or without LDFLAGS="-Wl,--as-needed"; even narrowing my CFLAGS to "-O2 -march=pentium2" doesn't help (applied to portmap and nfs-utils). I eventually tested both of these programs without tcpd support, but that doesn't help either.
According to Mike: The netmount not starting statd issue is known, and that is being worked on. For now, please use nfsmount from nfs-utils-1.0.12.
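Since the missing piece was statd, one quick check is whether the status service is registered with the local portmapper. The following is a minimal sketch, assuming rpcinfo is installed; 100024 is the RPC program number of the status service provided by rpc.statd, and statd_registered is a hypothetical helper name.

```shell
# Succeeds if the "rpcinfo -p" listing on stdin contains RPC program 100024,
# the status service provided by rpc.statd. statd_registered is a hypothetical name.
statd_registered() {
    awk '$1 == 100024 { found = 1 } END { exit !found }'
}

# Only query the portmapper if rpcinfo is available on this machine.
if command -v rpcinfo >/dev/null 2>&1; then
    if rpcinfo -p localhost 2>/dev/null | statd_registered; then
        echo "statd is registered with the portmapper"
    else
        echo "statd is not registered; try: rc-update add nfsmount default"
    fi
fi
```

The hint in the second message mirrors the fix reported above, where adding nfsmount to the default runlevel made NFS work correctly.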