Bug 162199 - NFS problems w/ kernel-2.6.19
Summary: NFS problems w/ kernel-2.6.19
Status: RESOLVED FIXED
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: [OLD] Core system
Hardware: x86 Linux
Importance: High critical
Assignee: Gentoo Kernel Bug Wranglers and Kernel Maintainers
Reported: 2007-01-15 13:28 UTC by Tim Ryan
Modified: 2007-04-07 15:29 UTC


Description Tim Ryan 2007-01-15 13:28:16 UTC
I'm having a problem with NFS on kernels 2.6.19-r2, r3, and r4. My system runs without a problem when using 2.6.18-r4. I used "make oldconfig" and then "make menuconfig" to make the changes for the new SATA system. The system boots fine, all the drives seem to work fine, and I can log in at the console as my user or as root.

X and KDM start up fine, but when I enter my password in KDM it sits there for about two minutes before the "Loading KDE" dialog appears. That sits there for another two minutes, then disappears. After another two minutes the desktop changes to my background, then a dialog box pops up with "Media protocol died unexpectedly". After a few more minutes the rest of the KDE desktop shows up and all seems to be fine. However, starting almost any KDE app takes a long time. GTK apps like GAIM and Firefox start right up, though they occasionally hang for a while and then go back to work. Firefox also crashes much more than usual. Everything is so slow that the system is unusable.

Reverting to 2.6.18 fixes all issues. I set up a test user account with a local home directory and it works fine with 2.6.19, so it is definitely an NFS problem.

Reproducible: Always

Steps to Reproduce:
1. Boot with any Gentoo 2.6.19 kernel (didn't try vanilla).
2. Try to run KDE with an NFS-mounted home.
3. See delays and errors.

Actual Results:  
Slow and unusable system

Expected Results:  
Normally working system, the same as previous kernels.
Comment 1 Daniel Drake (RETIRED) gentoo-dev 2007-01-15 14:26:08 UTC
Why do you think this is an NFS problem? Your bug description does not indicate where NFS is involved at all. I suggest you provide more information about your setup.
Comment 2 Tim Ryan 2007-01-15 15:01:12 UTC
(In reply to comment #1)
> Why do you think this is a NFS problem? Your bug description does not indicate
> where NFS is involved at all. I suggest you provide more information about your
> setup.
> 
Sorry, I see I failed to mention the most important point. My home directory is on an NFS mounted partition. When I use an account created with a local home directory it works fine. If you need any more information just let me know.

isisdvp1 ~ # emerge --info
Portage 2.1.1-r2 (default-linux/x86/2006.1/desktop, gcc-4.1.1, glibc-2.4-r4, 2.6.18-gentoo-r4 i686)
=================================================================
System uname: 2.6.18-gentoo-r4 i686 Intel(R) Pentium(R) 4 CPU 3.06GHz
Gentoo Base System version 1.12.6
Last Sync: Mon, 15 Jan 2007 13:00:02 +0000
ccache version 2.4 [enabled]
app-admin/eselect-compiler: [Not Present]
dev-java/java-config: 1.3.7, 2.0.31-r2
dev-lang/python:     2.4.3-r4
dev-python/pycrypto: 2.0.1-r5
dev-util/ccache:     2.4-r6
dev-util/confcache:  [Not Present]
sys-apps/sandbox:    1.2.17
sys-devel/autoconf:  2.13, 2.61
sys-devel/automake:  1.4_p6, 1.5, 1.6.3, 1.7.9-r1, 1.8.5-r3, 1.9.6-r2, 1.10
sys-devel/binutils:  2.16.1-r3
sys-devel/gcc-config: 1.3.14
sys-devel/libtool:   1.5.22
virtual/os-headers:  2.6.17-r2
ACCEPT_KEYWORDS="x86"
AUTOCLEAN="yes"
CBUILD="i686-pc-linux-gnu"
CFLAGS="-O2 -march=pentium4 -pipe"
CHOST="i686-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/kde/3.5/env /usr/kde/3.5/share/config /usr/kde/3.5/shutdown /usr/share/X11/xkb /usr/share/config"
CONFIG_PROTECT_MASK="/etc/env.d /etc/env.d/java/ /etc/gconf /etc/java-config/vms/ /etc/revdep-rebuild /etc/terminfo"
CXXFLAGS="-O2 -march=pentium4 -pipe"
DISTDIR="/usr/portage/distfiles"
FEATURES="autoconfig ccache distlocks metadata-transfer parallel-fetch sandbox sfperms strict"
GENTOO_MIRRORS="http://distfiles.gentoo.org http://distro.ibiblio.org/pub/linux/distributions/gentoo"
MAKEOPTS="-j2"
PKGDIR="/usr/portage/packages"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --delete-after --stats --timeout=180 --exclude='/distfiles' --exclude='/local' --exclude='/packages'"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY="/usr/local/portage /usr/local/layman/xeffects"
SYNC="rsync://rsync.gentoo.org/gentoo-portage"
USE="x86 X aim alsa alsa_cards_ali5451 alsa_cards_als4000 alsa_cards_atiixp alsa_cards_atiixp-modem alsa_cards_bt87x alsa_cards_ca0106 alsa_cards_cmipci alsa_cards_emu10k1x alsa_cards_ens1370 alsa_cards_ens1371 alsa_cards_es1938 alsa_cards_es1968 alsa_cards_fm801 alsa_cards_hda-intel alsa_cards_intel8x0 alsa_cards_intel8x0m alsa_cards_maestro3 alsa_cards_trident alsa_cards_usb-audio alsa_cards_via82xx alsa_cards_via82xx-modem alsa_cards_ymfpci alsa_pcm_plugins_adpcm alsa_pcm_plugins_alaw alsa_pcm_plugins_asym alsa_pcm_plugins_copy alsa_pcm_plugins_dmix alsa_pcm_plugins_dshare alsa_pcm_plugins_dsnoop alsa_pcm_plugins_empty alsa_pcm_plugins_extplug alsa_pcm_plugins_file alsa_pcm_plugins_hooks alsa_pcm_plugins_iec958 alsa_pcm_plugins_ioplug alsa_pcm_plugins_ladspa alsa_pcm_plugins_lfloat alsa_pcm_plugins_linear alsa_pcm_plugins_meter alsa_pcm_plugins_mulaw alsa_pcm_plugins_multi alsa_pcm_plugins_null alsa_pcm_plugins_plug alsa_pcm_plugins_rate alsa_pcm_plugins_route alsa_pcm_plugins_share alsa_pcm_plugins_shm alsa_pcm_plugins_softvol arts berkdb bitmap-fonts cairo caps cdr cli cracklib crypt cups dbus divx dlloader dri dvd dvdr elibc_glibc emboss encode esd exif fam ffmpeg firefox font-server fortran gdbm gif gimpprint glitz gmedia gnome gpm gstreamer gtk hal iconv input_devices_keyboard input_devices_mouse isdnlog java javascript jpeg kde kernel_linux kqemu ldap libg++ mad mikmod mozbranding mp3 mpeg ncurses nls nptl nptlonly nsplugin nvidia offensive ofx ogg openal opengl oracle oss pam pcre pdf perl png ppds pppd python qt3 quicktime readline real realmedia reflection sdl session socks5 spell spl ssl svg tcpd truetype truetype-fonts type1-fonts udev userland_GNU video_cards_fglrx video_cards_radeon videos vorbis win32codecs wma wmp xml xorg xv zlib"
Unset:  CTARGET, EMERGE_DEFAULT_OPTS, INSTALL_MASK, LANG, LC_ALL, LDFLAGS, LINGUAS, PORTAGE_RSYNC_EXTRA_OPTS

Comment 3 Daniel Drake (RETIRED) gentoo-dev 2007-01-15 15:52:05 UTC
So NFS appears to be working, just very slowly?
Is this NFSv3 or v4?
Can you reproduce the problem with the latest development kernel (currently 2.6.20-rc5)?
Comment 4 Tim Ryan 2007-01-15 16:05:05 UTC
(In reply to comment #3)
I'm getting some errors, such as the "Media protocol died unexpectedly" dialog and application crashes. I don't know whether they would be caused by just a slowdown. If I log in through the console, the NFS mount seems to work as expected, with no slowdown at all during an ls -l. Also, if you look at the thread http://forums.gentoo.org/viewtopic-t-524467.html you'll see at least one other person with NFS errors on 2.6.19.

Are you saying you want me to try the vanilla 2.6.20-rc5? If so I'll go get it and try it.

Comment 5 Tim Ryan 2007-01-15 16:10:21 UTC
(In reply to comment #3)
NFSv3.
Comment 6 Daniel Drake (RETIRED) gentoo-dev 2007-01-15 18:45:06 UTC
Yes, please test that kernel. Also look for NFS-related errors in dmesg on 2.6.19.
Comment 7 Tim Ryan 2007-01-15 19:11:56 UTC
(In reply to comment #6)
I tried 2.6.20-rc5 and it has the same problem as 2.6.19. The only NFS-related errors I see in dmesg are:

lockd: cannot monitor 10.184.2.23
lockd: failed to monitor 10.184.2.23

but these appear on all the kernels, even the 2.6.18 that works fine. I have the latest nfs-utils installed. The IP address above is the NFS server, which is running Red Hat 7.3, which I can't change. I also tried compiling the 2.6.19 kernel with and without DirectIO. It didn't make any difference.
Comment 8 Frank Ridderbusch 2007-01-17 12:57:17 UTC
I'm having problems with 2.6.19-r* as well. The NFS server is a "production"
system running 2.6.18-r6. The complete /usr/portage tree is exported via NFS.
This server's kernel and the other kernels are built with these options.

# zgrep NFS /proc/config.gz
CONFIG_NFS_FS=y
CONFIG_NFS_V3=y
CONFIG_NFS_V3_ACL=y
CONFIG_NFS_V4=y
CONFIG_NFS_DIRECTIO=y
CONFIG_NFSD=y
CONFIG_NFSD_V2_ACL=y
CONFIG_NFSD_V3=y
CONFIG_NFSD_V3_ACL=y
CONFIG_NFSD_V4=y
CONFIG_NFSD_TCP=y
CONFIG_NFS_ACL_SUPPORT=y
CONFIG_NFS_COMMON=y

net-fs/nfs-utils-1.0.10 is installed.

Two other systems (a desktop and a server) import the /usr/portage tree. Until
2.6.19 I never had a problem with "emerge -puvDN world" on these two
systems. With any of the recent 2.6.19 kernels, 1 out of 3 or 4 invocations of
emerge fails with a Python stack trace. Repeating the command usually
succeeds. After switching back to 2.6.18 these problems
disappeared again.

My client systems are mounting the NFS file systems via autofs. /usr/portage
is a symlink to /net/utensil/usr_portage. /etc/autofs/auto.utensil contains
the line

usr_portage -fstype=nfs,tcp,soft,vers=3,intr,retry=2,rsize=8192,wsize=8192  172.25.xxx.yy:/usr/portage
Comment 9 Daniel Drake (RETIRED) gentoo-dev 2007-01-25 21:46:59 UTC
Can you confirm that networking in general performs as well on the newer kernels as on the old ones? You could use something like ttcp for 32kb packets.

Which network hardware and drivers are being used?
Comment 10 Frank Ridderbusch 2007-01-26 16:21:10 UTC
OK, I rebooted with 2.6.19-r4 and changed the mount parameters to rsize=32768
and wsize=32768. The problems persist, but not as frequently. During the day,
while updating xorg-x11 to 7.2 (about 16 packages to install for my
configuration), I saw the problem twice. Here is a typical log:

bx621 ~ # emerge -puvDN world

These are the packages that would be merged, in order:

Calculating world dependencies \Traceback (most recent call last):
  File "/usr/bin/emerge", line 4049, in ?
    emerge_main()
  File "/usr/bin/emerge", line 4044, in emerge_main
    myopts, myaction, myfiles, spinner)
  File "/usr/bin/emerge", line 3457, in action_build
    if not mydepgraph.xcreate(myaction):
  File "/usr/bin/emerge", line 1260, in xcreate
    if not self.select_dep(
  File "/usr/bin/emerge", line 1189, in select_dep
    myuse=selected_pkg[-1]):
  File "/usr/bin/emerge", line 824, in create
    if not self.select_dep("/",mydep["/"],myparent=mp,myuse=myuse):
  File "/usr/bin/emerge", line 1182, in select_dep
    myuse=selected_pkg[-1]):
  File "/usr/bin/emerge", line 824, in create
    if not self.select_dep("/",mydep["/"],myparent=mp,myuse=myuse):
  File "/usr/bin/emerge", line 1182, in select_dep
    myuse=selected_pkg[-1]):
  File "/usr/bin/emerge", line 764, in create
    iuses = set(mydbapi.aux_get(mykey, ["IUSE"])[0].split())
  File "/usr/lib/portage/pym/portage.py", line 4739, in aux_get
    raise KeyError(mycpv)
KeyError: 'net-nds/openldap-2.3.30-r2'
bx621 ~ # emerge -puvDN world

These are the packages that would be merged, in order:

Calculating world dependencies \
!!! Ebuilds for the following packages are either all
!!! masked or don't exist:
media-plugins/alsa-jack

... done!

Total size of downloads: 0 kB

Here is some additional info about the system. 

bx621 ~ # ethtool eth0
Settings for eth0:
        Supported ports: [ FIBRE ]
        Supported link modes:   1000baseT/Half 1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  1000baseT/Half 1000baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: FIBRE
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: g
        Wake-on: d
        Current message level: 0x000000ff (255)
        Link detected: yes

I don't know about the FIBRE output, but the system is connected with
normal Ethernet cables.

bx621 ~ # lspci
00:00.0 Host bridge: Broadcom CMIC-LE Host Bridge (GC-LE chipset) (rev 32)
00:00.1 Host bridge: Broadcom CMIC-LE Host Bridge (GC-LE chipset)
00:00.2 Host bridge: Broadcom CMIC-LE Host Bridge (GC-LE chipset)
00:0b.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
00:0f.0 ISA bridge: Broadcom CSB6 South Bridge (rev a0)
00:0f.2 USB Controller: Broadcom CSB6 OHCI USB Controller (rev 05)
00:0f.3 Host bridge: Broadcom GCLE-2 Host Bridge
00:11.0 Host bridge: Broadcom CIOB-E I/O Bridge with Gigabit Ethernet (rev 12)
00:11.2 Host bridge: Broadcom CIOB-E I/O Bridge with Gigabit Ethernet (rev 12)
04:04.0 SCSI storage controller: Adaptec AIC-7902B U320 (rev 10)
04:04.1 SCSI storage controller: Adaptec AIC-7902B U320 (rev 10)
06:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704S Gigabit Ethernet (rev 02)
06:00.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704S Gigabit Ethernet (rev 02)

The driver is tg3.

However, I saw the same problem on my desktop machine, which is
using an Intel 1000 EtherExpress Pro controller.

As for performance, the server doesn't "feel" slower. A
"time emerge -puvDN world" reports:

real    1m17.710s
user    0m30.686s
sys     0m2.472s

While on my desktop with 2.6.18 the same invocation takes:

real    2m16.541s
user    0m44.431s
sys     0m4.712s

running on a 100Mbit network. The set of installed packages is roughly the same.
This is not really scientific, but anyway.
Comment 11 Daniel Drake (RETIRED) gentoo-dev 2007-01-26 19:14:47 UTC
Please test the performance another way -- some way that does not involve NFS.
Comment 12 Frank Ridderbusch 2007-01-28 16:28:55 UTC
(In reply to comment #11)
> Please test the performance another way -- some way that does not involve NFS.
> 

Can you please elaborate a bit on this? What numbers exactly are you after?
I could, for instance, install the iozone benchmark and compare
2.6.18-r6 vs. 2.6.19-r4, both with and without NFS.

And I could mount the portage tree statically to see if the automounter might
influence the outcome.
Comment 13 Daniel Drake (RETIRED) gentoo-dev 2007-01-28 18:51:04 UTC
Install a webserver on one end, and measure the download rate and time of a big file. Even better, use ttcp or netcat to perform a more low-level test.

The purpose is to establish that, apart from NFS, your network is working comparably to how it worked on the known-good kernels.
Comment 14 Frank Ridderbusch 2007-01-29 15:03:56 UTC
Ok, here is what I did.

Create 1G test file:
dd if=/dev/urandom bs=32k count=32768 of=file && md5sum file

On server rx300s2 as sender with 2.6.18:
dd if=file bs=32k | nc -w 30 -q 0 -p 3333 -l

On server bx621 as receiver with 2.6.19:
nc -q 0 172.25.110.84 3333 | dd of=file bs=32k

3 runs (output of dd, MD5 always identical):
515+627545 records in
515+627545 records out
1073741824 Bytes (1,1 GB) copied, 95,6341 s, 11,2 MB/s
356+647237 records in
356+647237 records out
1073741824 Bytes (1,1 GB) copied, 93,2711 s, 11,5 MB/s
384+624731 records in
384+624731 records out
1073741824 Bytes (1,1 GB) copied, 95,3686 s, 11,3 MB/s

On server bx621 as receiver with 2.6.18:
3 runs (output of dd, MD5 always identical):
5536+551298 records in 
5536+551298 records out
1073741824 Bytes (1,1 GB) copied, 93,1862 s, 11,5 MB/s
4262+567977 records in
4262+567977 records out
1073741824 Bytes (1,1 GB) copied, 92,586 s, 11,6 MB/s
5118+540528 records in
5118+540528 records out
1073741824 Bytes (1,1 GB) copied, 94,1598 s, 11,4 MB/s

Direction reversed

On server rx300s2 as receiver with 2.6.18:
nc -w 30 -q 0 -p 3333 -l | dd of=file bs=32k

On server bx621 as sender with 2.6.19:
dd if=file bs=32k | nc -q 0 172.25.110.84 3333

3 runs (output of dd, MD5 always identical):
1073741824 Bytes (1,1 GB) copied, 96,6306 s, 11,1 MB/s
1073741824 Bytes (1,1 GB) copied, 92,9207 s, 11,6 MB/s
1073741824 Bytes (1,1 GB) copied, 93,5508 s, 11,5 MB/s

On server bx621 as sender with 2.6.18:
3 runs (output of dd, MD5 always identical):
1073741824 Bytes (1,1 GB) copied, 98,1188 s, 10,9 MB/s
1073741824 Bytes (1,1 GB) copied, 98,1238 s, 10,9 MB/s
1073741824 Bytes (1,1 GB) copied, 98,3585 s, 10,9 MB/s

Since the test file was always transferred correctly and at the
expected speed (for a 100Mbit network), I guess it's fair to say
that low-level TCP appears to be working correctly.

What is noteworthy, however, is the difference in the output of the dd
command. There are about ten times as many complete 32k reads on
2.6.18 as on 2.6.19 (if I remember the meaning of the dd output correctly).
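
The comparison can be checked with a little arithmetic. In dd's summary, "F+P records in" counts F full blocks and P partial blocks, so the first number is the count of complete 32k reads. The following Python sketch uses only the receiver-side figures quoted above; the function names are made up for illustration.

```python
import re

def parse_dd_records(line):
    """Return (full, partial) block counts from a dd 'F+P records in' line."""
    match = re.match(r"\s*(\d+)\+(\d+) records (?:in|out)", line)
    if match is None:
        raise ValueError("not a dd records line: %r" % line)
    return int(match.group(1)), int(match.group(2))

def mean_full_blocks(lines):
    """Average number of full-block reads over several dd runs."""
    counts = [parse_dd_records(line)[0] for line in lines]
    return sum(counts) / len(counts)

def throughput_mb_per_s(total_bytes, seconds):
    """Throughput in decimal megabytes per second, the unit dd prints."""
    return total_bytes / seconds / 1e6

# Figures copied from the runs above with bx621 as receiver.
runs_2_6_19 = ["515+627545 records in", "356+647237 records in",
               "384+624731 records in"]
runs_2_6_18 = ["5536+551298 records in", "4262+567977 records in",
               "5118+540528 records in"]

ratio = mean_full_blocks(runs_2_6_18) / mean_full_blocks(runs_2_6_19)
rate = throughput_mb_per_s(1073741824, 95.6341)

print("full 32k reads, 2.6.18 vs 2.6.19: %.1fx" % ratio)
print("first 2.6.19 run: %.1f MB/s" % rate)
```

In both cases nearly all reads are partial, which is normal when dd reads from a pipe fed by nc; only the small share of full reads differs between the kernels.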
Comment 15 Tim Ryan 2007-02-02 18:10:52 UTC
I tried the new gentoo-sources-2.6.19-r5 since it was made stable, and I still have the same problem. I did notice one thing in the logs with the new kernel:

Feb  2 13:00:20 isisdvp1 statd: server localhost not responding, timed out

This is not in the logs with the 2.6.18 kernels.
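
For comparing kernels, it can help to pull just the lockd and statd lines out of a saved log. A minimal Python sketch, assuming the log has been saved as plain text; the function name is hypothetical, and the sample lines are the messages quoted in this report plus one made-up unrelated line.

```python
import re

# Daemon names as they appear in the messages quoted in this report.
LOCKING_DAEMONS = re.compile(r"\b(lockd|statd): (.*)")

def locking_messages(lines):
    """Yield (daemon, message) for every lockd/statd line in the input."""
    for line in lines:
        match = LOCKING_DAEMONS.search(line)
        if match:
            yield match.group(1), match.group(2).rstrip()

sample = [
    "lockd: cannot monitor 10.184.2.23",
    "lockd: failed to monitor 10.184.2.23",
    "Feb  2 13:00:20 isisdvp1 statd: server localhost not responding, timed out",
    "eth0: link up",  # unrelated line, made up for the example
]

found = list(locking_messages(sample))
for daemon, message in found:
    print(daemon, "->", message)
```

Running the same filter over logs from 2.6.18 and 2.6.19 would show which messages are new with the newer kernel.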
Comment 16 Tim Ryan 2007-02-02 18:52:19 UTC
It is now working. The error message about statd led me to the nfs-utils package, so I updated to the latest version. The package includes the /etc/init.d/nfsmount script which was not in my runlevel. Once I added that to the default runlevel NFS started working correctly. Why this is needed now for 2.6.19, but not for 2.6.18 I don't know, but my system is now working correctly.

If this works for Frank we should consider this resolved.
Comment 17 Frank Ridderbusch 2007-02-06 14:05:15 UTC
Well, this might be an autofs problem after all.

Anyway, I updated to 2.6.19-r5 as well and also put nfsmount into the 
default runlevel. I was already using the latest nfs-utils, since they
are mandatory for NFS4. 

With this configuration I still saw the problem that an emerge
sometimes failed for no apparent reason. Repeating the command usually
works. I saw two failures during these last two days in the process
of installing the daily portage changes.

Then I deactivated the automounting for the /usr/portage tree and mounted
it statically. I also updated my desktop system to 2.6.19-r5 and mounted
the portage tree here statically as well. This system was a little more out
of date. So the number of packages, that needed updating was larger. 

The result is that I __didn't__ see any emerge problems since the switch
to static mounts. I have repeatedly executed "emerge -puvDN world", always
with minor changes to the /etc/portage/package.keywords file, and an
"emerge -puevDN world" for good measure. I haven't yet been able to reproduce my
previous emerge problems. I'm fairly confident that, with the number of
emerges I did, I should have seen the problem by now.

I'll continue to test until the end of this week and then report back.
Comment 18 Frank Ridderbusch 2007-02-09 14:09:56 UTC
For the last 3 days I've been running my two systems with 2.6.19-r5 and
statically mounted /usr/portage trees. I'm happy to report that during this
time I never saw my emerge problems. I'm fairly confident that I would've
seen the problem by now if it really persisted. Once the
automounter/autofs is out of the equation, everything is dandy.

I'm going to experiment a little more with the automounter (different
timeout values) in the near future, but I guess that would be material
for another problem report. The original "NFS Problem" indeed appears
to be solved.
Comment 19 Tim Ryan 2007-02-09 14:33:37 UTC
I think this problem was caused by baselayout including the /etc/init.d/netmount script. This script will mount any NFS drives, but it does not start statd as nfsmount does. Perhaps netmount should be removed from baselayout, so that people who need to mount NFS drives would install nfs-utils and get nfsmount.

How should this be handled? Should I open a new ticket against baselayout?
Comment 20 Daniel Drake (RETIRED) gentoo-dev 2007-03-06 18:20:01 UTC
Mike, you do the NFS stuff, right? Any thoughts on this?
Comment 21 Gilles Dartiguelongue (RETIRED) gentoo-dev 2007-03-10 19:10:20 UTC
It seems that my problem is not 100% related, but this is the closest bug I've found.

Since 2.6.19, my NFS servers have stopped working properly. nfsd doesn't start, and here is what I find in syslog:

NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
NFSD: starting 90-second grace period
nfsd: last server has exited
nfsd: unexporting all filesystems
RPC: failed to contact portmap (errno -5).

The output is the same with or without LDFLAGS="-Wl,--as-needed", and even narrowing my CFLAGS to "-O2 -march=pentium2" doesn't help (applied to portmap and nfs-utils). I also tested both of these programs without tcpd support, but that doesn't help either.



Comment 22 Daniel Drake (RETIRED) gentoo-dev 2007-04-07 15:29:48 UTC
According to Mike:

The netmount not starting statd issue is known, and that is being worked on. For now, please use nfsmount from nfs-utils-1.0.12.
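
Until netmount handles this, one way to confirm that statd is actually up after boot is to look for its process. A small Python sketch; it assumes the daemon shows up under the process name rpc.statd (the name used by nfs-utils) and reads the "Name:" line of /proc/<pid>/status. The function names are hypothetical.

```python
import os

def process_names(proc_root="/proc"):
    """Collect process names from the 'Name:' line of /proc/<pid>/status."""
    names = set()
    if not os.path.isdir(proc_root):
        return names  # no procfs available at this path
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "status")) as status:
                first = status.readline()
        except OSError:
            continue  # process exited while we were looking
        if first.startswith("Name:"):
            names.add(first.split(":", 1)[1].strip())
    return names

def statd_running(proc_root="/proc"):
    """True if a process named rpc.statd is present (name is an assumption)."""
    return "rpc.statd" in process_names(proc_root)

if __name__ == "__main__":
    print("rpc.statd running:", statd_running())
```

If this reports False on a client with NFS mounts, the lockd "cannot monitor" messages seen earlier in this bug would be the expected symptom to look for next.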