There are dozens of reports of rsync timeouts in io.c floating around, and no consistent pattern. Different people find different "solutions"; others just try again and the rsync timeout goes away for no known reason. Some people reported that only some machines in their possession exhibit the timeout error repeatedly while other machines never have the problem. Today I believe I found a detail that is the common ingredient in all of the failure reports of this type:

emerge sync
[ ... ]
749 100%    6.42kB/s    0:00:00  (51945, 100.0% of 111850)
xfce-extra/xfwm4-themes/xfwm4-themes-4.2.0.ebuild
399 100%    1.62kB/s    0:00:00  (51946, 100.0% of 111850)
io timeout after 60 seconds - exiting
rsync error: timeout in data send/receive (code 30) at io.c(153)
rsync: connection unexpectedly closed (2593213 bytes received so far) [generator]
rsync error: error in rsync protocol data stream (code 12) at io.c(359)
>>> retry ...

The problem is CPU load. If your load is too high, rsync has bogus timeouts. If the load can decrease for a few minutes, emerge sync will complete OK. See more details on my own discovery in the Gentoo forum link in this bug report.

Reproducible: Sometimes

Steps to Reproduce:
1. Induce a significant CPU load (say around a load factor of 2 or higher)
2. emerge sync
3. rsync timeout error in io.c at some random point (this is somewhat a theory, based on the experience I had)

Actual Results:
If there is an easy way to artificially induce the CPU load I had with 2 frozen browsers and a busy X session, I can try to reproduce this.

Expected Results:
emerge sync should have completed at least once in the 20 attempts I made today while the system was under moderate load. emerge-webrsync worked OK and I had an ssh session running without dropped packets or any other noticeable performance loss, so I don't see why rsync should bail out so easily.

I'm not alone with this problem. I've seen it before sporadically and never understood why it would work OK on a second or third try.
Some users report that some of their servers consistently fail to emerge sync. This could be due to the load factor on those particular servers. My system is a dual PII-450 with 512 MB RAM.

emerge info
Portage 2.0.51-r15 (default-linux/x86/2004.0, gcc-3.3.5, glibc-2.3.4.20040808-r1, 2.4.28-gentoo-r5 i686)
=================================================================
System uname: 2.4.28-gentoo-r5 i686 Pentium II (Deschutes)
Gentoo Base System version 1.4.16
Python:              dev-lang/python-2.3.4-r1,dev-lang/python-2.1.3-r1 [2.3.4 (#1, Feb 13 2005, 23:24:29)]
ccache version 2.3 [enabled]
dev-lang/python:     2.3.4-r1, 2.1.3-r1
sys-devel/autoconf:  2.13, 2.59-r6
sys-devel/automake:  1.5, 1.7.9-r1, 1.8.5-r3, 1.6.3, 1.4_p6, 1.9.4
sys-devel/binutils:  2.15.92.0.2-r1
sys-devel/libtool:   1.5.10-r4
virtual/os-headers:  2.4.21-r1
ACCEPT_KEYWORDS="x86"
AUTOCLEAN="yes"
CFLAGS="-march=i686 -O2 -pipe"
CHOST="i686-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/kde/2/share/config /usr/kde/3/share/config /usr/lib/X11/xkb /usr/lib/mozilla/defaults/pref /usr/share/config /usr/share/texmf/dvipdfm/config/ /usr/share/texmf/dvips/config/ /usr/share/texmf/tex/generic/config/ /usr/share/texmf/tex/platex/config/ /usr/share/texmf/xdvi/ /var/bind /var/qmail/control"
CONFIG_PROTECT_MASK="/etc/gconf /etc/terminfo /etc/env.d"
CXXFLAGS="-march=i686 -O2 -pipe"
DISTDIR="/usr/portage/distfiles"
FEATURES="autoaddcvs autoconfig ccache distlocks sandbox sfperms"
GENTOO_MIRRORS="http://mirror.clarkson.edu/pub/distributions/gentoo/ http://mirror.datapipe.net/gentoo http://open-systems.ufl.edu/mirrors/gentoo"
MAKEOPTS="-j3"
PKGDIR="/usr/portage/packages"
PORTAGE_TMPDIR="/usr/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY="/usr/portage-overlay"
SYNC="rsync://rsync.namerica.gentoo.org/gentoo-portage"
USE="x86 X apm avi berkdb bitmap-fonts cdr crypt cups curl emboss encode f77 fam flac font-server foomaticdb fortran gdbm gif gnome gstreamer gtk gtk2 gtkhtml imagemagick imap imlib ipv6 java jpeg junit libg++ libwww mad mikmod mmx motif mozilla mpeg mysql ncurses nls noaudio oggvorbis opengl pam pdflib perl png postgres python quicktime readline samba sdl slang spell ssl svga tcltk tcpd tetex tiff truetype truetype-fonts type1-fonts usb xml xml2 xv zlib video_cards_matrox"
Unset: ASFLAGS, CBUILD, CTARGET, LANG, LC_ALL, LDFLAGS
Are you sure it isn't an I/O and context-switch problem? If you can't read fast enough, then you obviously can't check your local files, verify the downloaded statistics, and request the differences. This is not really something we can easily fix. I would imagine the timeout code in rsync is a little off, as it should only be concerned with a real timeout if it has actually attempted to send any data recently. This is an upstream bug, not a portage bug.
It has something to do with server load, I believe, given the number of other people reporting the same error. I don't think they all have KVM switches throwing the digit 1 into mozilla; I suspect some of them are just busy servers that fail for the same reason I've found. I don't really know of a way to test for I/O or context-switch pressure, so this is mainly conjecture. Based on my experience, I believe my evidence and theory are at least more useful than other reports on the forums, such as "rm -rf /usr/portage/* and then it worked" or "I rebooted and it was fixed".

Whatever is going on, and whoever owns this, it is critical to the success of Gentoo. rsync is probably more important to Gentoo than to any other distro. Personally, I fought this all day, and I was starting to wonder whether I was committed to sticking with Gentoo, since no one seemed to have a handle on this very old and sporadic problem.
A good fix is to increase the timeout in make.conf. If you want to test using I/O contention, copy and paste:

for X in $(seq 1 30); do dd if=/dev/hda of=/dev/null bs=1 count=100000000 skip=$(( 100000000 * $X )) & done ; emerge sync ; killall dd

And then try to sync. It should fail pretty quickly, I'd imagine. This creates 30 "readers" simultaneously reading different 100 MB portions of your hard disk in 1-byte intervals. Your box _will_ come to a crawl, or you have some amazing disks and RAM.
I've seen people reporting the same error with timeout=300. I made mine 60 and it didn't help. This isn't a real timeout error, but a bug in rsync that makes it think it has experienced a timeout. I think I could have set my timeout value to 24 hours and it wouldn't have made a difference; the problem was in whatever type of system load the 2 frozen browsers and X session put on the OS.

I've tried the dd load test on two systems: the one having the problem all of yesterday, and my Sun Sparc Ultra 60 (300 MHz). Rsync was not impacted by the load; it completed OK on the first try. So perhaps it is related to the CPU workload specifically? Both of these systems have experienced the io.c timeout error from rsync intermittently in the past.
Some additional links to Gentoo users having the same bug (note how many "me too" responses each thread gets):

http://forums.gentoo.org/viewtopic-t-288434-highlight-rsync+timeout.html
http://forums.gentoo.org/viewtopic-t-246397-highlight-rsync+timeout.html
http://forums.gentoo.org/viewtopic-t-291760-highlight-rsync+timeout.html
(note user hjnenc says here: "I usually have rsync problems only, when the box is heavily loaded at the time the rsync runs.")
http://forums.gentoo.org/viewtopic-t-281918-highlight-rsync+timeout.html
http://forums.gentoo.org/viewtopic-t-281397-highlight-rsync+timeout.html
(again, note the problem was initially "fixed by a reboot" - which can be a crude way to eliminate system resource hogs)
http://forums.gentoo.org/viewtopic-t-273811-highlight-rsync+timeout.html

I hope this will help convince someone that increasing the timeout has nothing to do with the real problem. It has been around for a while and seems to be a little more common lately.

My errors are from rsync 2.6.3. Looking at the corresponding line numbers from the error messages: line 153 in my error output is merely the comparison test of the timeout - that isn't the source of the timeout. Line 359 is the end of this procedure in io.c:

/**
 * It's almost always an error to get an EOF when we're trying to read
 * from the network, because the protocol is self-terminating.
 *
 * However, there is one unfortunate case where it is not, which is
 * rsync <2.4.6 sending a list of modules on a server, since the list
 * is terminated by closing the socket. So, for the section of the
 * program where that is a problem (start_socket_client),
 * kludge_around_eof is True and we just exit.
 */
static void whine_about_eof(int fd)
{
	if (kludge_around_eof && fd == sock_f_in)
		exit_cleanup(0);

	rprintf(FERROR, RSYNC_NAME ": connection unexpectedly closed "
		"(%.0f bytes received so far) [%s]\n",
		(double)stats.total_read, who_am_i());

	exit_cleanup(RERR_STREAMIO);
}

Here is one place in io.c where whine_about_eof() is called (around line 502 in rsync 2.6.3):

n = read(fd, buf, len);

if (n <= 0) {
	if (n == 0)
		whine_about_eof(fd); /* Doesn't return. */
	if (errno == EINTR || errno == EWOULDBLOCK || errno == EAGAIN)
		continue;

	/* Don't write errors on a dead socket. */
	if (fd == sock_f_in)
		close_multiplexing_out();
	rsyserr(FERROR, errno, "read error");
	exit_cleanup(RERR_STREAMIO);
}

Can anyone think of a reason that CPU load could cause this to be erroneously triggered?
If the socket is non-blocking, then errno could be EAGAIN. The

if (n == 0)
	whine_about_eof(fd);

check seems like it should come after the checks for errno.
read() would return -1, not 0, in the case of EAGAIN. Information about the servers you have the issue with would be good... emerge.log should contain the servers you connected to. Can you list the ones that fail, and if possible, try to sync with them explicitly and repeat it?
Here is a list of rsync servers (uniqued by eye) that failed for me on Feb 24th:

Failed:
rsync://204.89.193.10/gentoo-portage
rsync://206.167.194.132/gentoo-portage
rsync://206.169.166.57/gentoo-portage
rsync://206.251.252.32/gentoo-portage
rsync://141.219.155.230/gentoo-portage
rsync://146.186.218.75/gentoo-portage
rsync://216.170.153.145/gentoo-portage
rsync://216.194.67.33/gentoo-portage
rsync://63.150.226.21/gentoo-portage
rsync://66.11.11.187/gentoo-portage
rsync://128.213.5.35/gentoo-portage
rsync://129.89.70.108/gentoo-portage

(after this point I upgraded to rsync 2.6.3)

Failed:
rsync://206.251.252.32/gentoo-portage
rsync://209.47.158.135/gentoo-portage
rsync://216.170.153.145/gentoo-portage
rsync://206.167.194.132/gentoo-portage
rsync://206.169.166.57/gentoo-portage
rsync://204.89.193.10/gentoo-portage
rsync://129.89.70.108/gentoo-portage
rsync://132.207.4.160/gentoo-portage
rsync://141.219.155.230/gentoo-portage
rsync://216.194.67.33/gentoo-portage
rsync://63.150.226.21/gentoo-portage

Then, after the mozilla browsers and X were killed, there were successful emerge syncs on Feb 25th against:

rsync://132.207.4.160/gentoo-portage
rsync://128.213.5.35/gentoo-portage
rsync://63.150.226.21/gentoo-portage
rsync://129.89.70.108/gentoo-portage
(as I ran the dd load tests suggested)

It isn't hard to run against one that had previously failed without doing so deliberately; as you can see, there are 3 rsync servers in that list which also appear in the lists of those that previously failed. I don't see where this testing is going. If you read the linked items I included, you can see that this is not specific to me. What has happened in the earlier cases in the forums is that people likely just kept trying, or rebooted, and that cleared up whatever triggers the bug. Most of the time it is an intermittent bug that people just put up with. You don't even hear about it, since trying again often works.
So far I'm not having any io.c timeouts in emerge sync against some of the previously failing rsync servers (I set SYNC in make.conf to pick them explicitly). The only errors I see are other ones, such as connection refused or no response, resulting in an error in clientserver.c. Of course, I can only test one per 30 minutes...
The line of questioning is just looking for hints as to the cause. As you have said yourself, it's intermittent and you're having trouble reproducing it even under the circumstances you believe cause it. As for myself, I've also experienced this problem. From what I've read, rsync is a multi-threaded (4 threads?) protocol. There are at least two visible threads; one scans for modified files and one downloads new files. From what I've seen, this problem occurs when *one* of those threads times out.
Actually, I've been using other methods to simulate the previous trigger of the failure, since it is a weird thing and I'd rather see it happen with a generic load. But if that isn't going to work, I can also try the original situation. I now have a browser and X running with 90% load on CPU 1 and 98% load on CPU 2. Once I have waited for the emerge timestamp to age, I can try this again and likely reproduce the bug.

The question is, once I'm there and can reproduce the timeout from io.c, what do we want to know? Is there any stat from lsof, for example, that would help?

If you want to know what I've done to reproduce the scenario: start X, launch 2 mozilla browsers (maybe one would work), and enter a continuous stream of some character into the address bar of mozilla. I do this by double-clicking scroll lock and hitting some digit after. It is the same thing as putting an object on the keyboard (or just one key) so that it autorepeats some character forever. That is what makes the load on X and mozilla in my case (one browser per CPU, as well, or else I'd still have a CPU free to process cycles quickly).

Again, I'm only using this weird scenario to reproduce what happened to me. I believe it will also simulate other types of system saturation with different causes that lead to the io.c timeout error.
Try setting RSYNC_TIMEOUT=0 in make.conf. That will prevent rsync from timing out altogether. If you can then get it to hang, rsync was timing out correctly and the error lies elsewhere.

Checking rsync's home page, I found this: http://samba.anu.edu.au/rsync/issues.html

It suggests that the problem is either a server or a network issue. If it's the server, there should apparently be something in the logs that will help resolve it. Otherwise, there's not much that can be done. Infra, can you guys shed any light on this?
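For reference, the suggestion above is a one-line change in make.conf (RSYNC_TIMEOUT is the variable Portage 2.0.x reads for this; the comment text is mine):

```shell
# /etc/make.conf
# 0 disables the timeout Portage passes through to rsync, so if the
# sync then hangs, the stall is genuinely on the server or network
# side rather than the client-side timer firing early.
RSYNC_TIMEOUT="0"
```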
I have been playing around with the coda filesystem lately, which included moving the portage tree onto a coda volume. If I call emerge sync on the system that runs both the coda server and the coda client, the machine gets under heavy load and the timeout error occurs nearly every time. Setting "RSYNC_TIMEOUT=0" did not fix the problem.
I believe this is a bug better suited for the rsync maintainer. Reassigning to vapier since he's the one who's touched it most recently.
2.6.4 is in portage now ... you could give that a twirl ...
I just opened a bug upstream to see if we can shed some light on this since I'm starting to see some more issues on our servers with this. https://bugzilla.samba.org/show_bug.cgi?id=2783
Cross-posting from the other bug: https://bugzilla.samba.org/show_bug.cgi?id=2783

---------------------------------
(Wayne Davison) Rsync uses a large amount of CPU on the sending side due to the fact that the rsync algorithm trades CPU and disk I/O to reduce network I/O (and using an encryption algorithm on top of that only makes the CPU usage that much higher, so using a daemon connection is less CPU intensive). The only way to reduce this is to use the option --whole-file, which makes rsync retransfer each changed file in its entirety rather than use CPU to figure out the differences.

The gentoo bug you cite seems to be primarily concerned with timeouts, and this is one of the things that 2.6.5 (and to a lesser extent, 2.6.4) tries to fix. You do need to set things up correctly, though: (1) both sides need to be running at least 2.6.4 for any timeout-avoidance to occur (and the server needs to be running 2.6.5 for maximal timeout avoidance); (2) both sides need to know about the timeout, so if the server is a daemon that has a timeout specified in its config file, the client needs to have the same (or a lower) timeout set via the --timeout command-line option, or the client will not know to send the keep-alive packets to the sender.
---------------------------------

I already have 2.6.4 running on one of our boxes. I could try bumping it to 2.6.5 and have people having problems with it try to sync to raptor.gentoo.org with 2.6.4. If that does indeed solve the problem, I suggest we consider bumping 2.6.4 to stable in the next few weeks and see how that goes. Before you do that, please let me update our servers and email our admins so they can make that upgrade. I have a feeling that if they don't upgrade, we'll have more connection problems. Any other thoughts?
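A sketch of the matching-timeout setup Wayne describes, assuming a daemon-side rsyncd.conf; the value 180 and the target paths are illustrative only, not what our mirrors actually use:

```shell
# Server side (rsyncd.conf), illustrative:
#   [gentoo-portage]
#       timeout = 180
#
# Client side: pass the same (or a lower) value so a 2.6.4+ client
# knows to emit keep-alive traffic before the daemon's timer expires.
rsync --timeout=180 -a rsync://rsync.namerica.gentoo.org/gentoo-portage /usr/portage
```

With mismatched values (client timeout higher than the daemon's, or unset), the daemon can still drop the connection before the client ever sends a keep-alive.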
I encountered this problem on my local machines. The rsync server (local mirror) is an Athlon XP 2500+, 512MB DDR, 7200RPM drives, with minimal load (the same install as the client (router) for package building). The rsync client is a Cyrix 133MHz machine with 48MB EDO RAM, plus an 8GB hard drive. If I sync with this server, the timeout happens every time. The load on the client is very big (very old machine, swapping). But if I change the client to sync with an official Gentoo mirror, it runs fine. It takes a lot of time, but there are no timeouts. It looks like the local server (100Mbit LAN) overloads the client. The rsync version is the stable x86 one in portage.
makes sense ... your LAN link is fast enough to overload the slow client. try increasing your timeout on the client ... i usually set it to something like 300 for my slower/embedded machines
I set the timeout in make.conf to 300, and it worked on the first try. I usually update this machine twice a month; if I find a problem I will report it. An emerge sync takes 1 hour and 20 minutes!

emerge.log snippet:
----------
1125641538:  Started emerge on: Sep 02, 2005 08:12:18
1125641538:  *** emerge --buildpkg sync
1125641538:  === sync
1125641539:  >>> starting rsync with rsync://192.168.1.15/gentoo-portage
1125642481:  === Sync completed with rsync://192.168.1.15/gentoo-portage
1125646349:  *** terminating.
----------
Finished at 9:32.
I have another assumption. On my working machine (3000 MHz), sync runs fine and never takes longer than 3 minutes. On my server (500 MHz), sync naturally takes longer (more than 3 minutes), and the error occurred on every single sync, around 3 minutes after I started it. Later I set the timeout to 1800 seconds and the error disappeared.

My assumption is that the timeout counter in emerge doesn't count the time since the server last responded, but the total time the sync process has been running - or it keeps counting even while the sync process is doing work that doesn't need the server, and the timeout is then triggered. This would explain why this error only appears under heavy CPU load: the sync process then takes longer, perhaps longer than the timeout.
your logic is flawed: the rsync timeout setting is something portage passes through to rsync, and it is rsync itself that errors out ... emerge does not have any timeout checks of its own
re-open if 2.6.8 still breaks