83254 – rsync in 'emerge sync' excessively sensitive to CPU load

Bug 83254 - rsync in 'emerge sync' excessively sensitive to CPU load

Summary: rsync in 'emerge sync' excessively sensitive to CPU load

Status:	RESOLVED WORKSFORME

Alias:	None

Product:	Portage Development
Classification:	Unclassified
Component:	Core - Interface (emerge) (show other bugs)
Hardware:	x86 Linux

Importance:	High normal (vote)
Assignee:	Gentoo's Team for Core System packages

URL:	http://forums.gentoo.org/viewtopic-t-...
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2005-02-24 19:35 UTC by Labrador
Modified:	2006-05-11 20:42 UTC (History)
CC List:	9 users (show)

See Also:
Package list:
Runtime testing required:	---

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Labrador 2005-02-24 19:35:19 UTC

There are dozens of reports of rsync timesouts in io.c floating
around, and no consistant pattern.  Different people find different
"solutions" - others just try again and then the rsync timeout
goes away for no known reason.  Some people reported that only
some machines in their possession exhibit the timeout error repeatedly
while other machines never have the problem.

Today I believe I found a detail that is related to the common ingredient
in all of the failure reports of the type:

emerge sync 
[ ... ]
         749 100%    6.42kB/s    0:00:00  (51945, 100.0% of 111850)
xfce-extra/xfwm4-themes/xfwm4-themes-4.2.0.ebuild
         399 100%    1.62kB/s    0:00:00  (51946, 100.0% of 111850)
io timeout after 60 seconds - exiting
rsync error: timeout in data send/receive (code 30) at io.c(153)
rsync: connection unexpectedly closed (2593213 bytes received so far) [generator]
rsync error: error in rsync protocol data stream (code 12) at io.c(359)
>>> retry ... 

The problem is CPU load.  If your load is too high, rsync has bogus
timeouts.  If the load can decrease for a few minutes, emerge sync will
complete OK.

See more details on my own discovery in the Gentoo forum link in this
bug report.


Reproducible: Sometimes
Steps to Reproduce:
1.Induce a significant CPU load (say around 2 load factor or better)
2.emerge sync
3.rsync timeout error in io.c at some random point
(this is somewhat theory based on the experience I had)
Actual Results:  
If there is an easy way to artifically induce the CPU load I had
with 2 frozen browsers and a busy X session, I can try to reproduce this.


Expected Results:  
Emerge sync should have completed once in the 20 attempts I had
made today while the system was under moderate load.

Emerge-websync worked OK and I had ssh session running without dropping
packets or any other noticable performance loss - so I don't
see why rsync should bail out so easily.

I'm not alone with this problem.

I've seen it before sporatically and never understood why it would
work OK on a second or third try.  Some users report that some of
their servers consistantly fail to emerge sync.  This could be due to
the load factor on those particular servers.


My system is dual PII-450 with 512 MB RAM.

emerge info
Portage 2.0.51-r15 (default-linux/x86/2004.0, gcc-3.3.5,
glibc-2.3.4.20040808-r1, 2.4.28-gentoo-r5 i686)
=================================================================
System uname: 2.4.28-gentoo-r5 i686 Pentium II (Deschutes)
Gentoo Base System version 1.4.16
Python:              dev-lang/python-2.3.4-r1,dev-lang/python-2.1.3-r1 [2.3.4
(#1, Feb 13 2005, 23:24:29)]
ccache version 2.3 [enabled]
dev-lang/python:     2.3.4-r1, 2.1.3-r1
sys-devel/autoconf:  2.13, 2.59-r6
sys-devel/automake:  1.5, 1.7.9-r1, 1.8.5-r3, 1.6.3, 1.4_p6, 1.9.4
sys-devel/binutils:  2.15.92.0.2-r1
sys-devel/libtool:   1.5.10-r4
virtual/os-headers:  2.4.21-r1
ACCEPT_KEYWORDS="x86"
AUTOCLEAN="yes"
CFLAGS="-march=i686 -O2 -pipe"
CHOST="i686-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/kde/2/share/config /usr/kde/3/share/config
/usr/lib/X11/xkb /usr/lib/mozilla/defaults/pref /usr/share/config
/usr/share/texmf/dvipdfm/config/ /usr/share/texmf/dvips/config/
/usr/share/texmf/tex/generic/config/ /usr/share/texmf/tex/platex/config/
/usr/share/texmf/xdvi/ /var/bind /var/qmail/control"
CONFIG_PROTECT_MASK="/etc/gconf /etc/terminfo /etc/env.d"
CXXFLAGS="-march=i686 -O2 -pipe"
DISTDIR="/usr/portage/distfiles"
FEATURES="autoaddcvs autoconfig ccache distlocks sandbox sfperms"
GENTOO_MIRRORS="http://mirror.clarkson.edu/pub/distributions/gentoo/
http://mirror.datapipe.net/gentoo http://open-systems.ufl.edu/mirrors/gentoo"
MAKEOPTS="-j3"
PKGDIR="/usr/portage/packages"
PORTAGE_TMPDIR="/usr/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY="/usr/portage-overlay"
SYNC="rsync://rsync.namerica.gentoo.org/gentoo-portage"
USE="x86 X apm avi berkdb bitmap-fonts cdr crypt cups curl emboss encode f77 fam
flac font-server foomaticdb fortran gdbm gif gnome gstreamer gtk gtk2 gtkhtml
imagemagick imap imlib ipv6 java jpeg junit libg++ libwww mad mikmod mmx motif
mozilla mpeg mysql ncurses nls noaudio oggvorbis opengl pam pdflib perl png
postgres python quicktime readline samba sdl slang spell ssl svga tcltk tcpd
tetex tiff truetype truetype-fonts type1-fonts usb xml xml2 xv zlib
video_cards_matrox"
Unset:  ASFLAGS, CBUILD, CTARGET, LANG, LC_ALL, LDFLAGS

Comment 1 Nicholas Jones (RETIRED) gentoo-dev

2005-02-24 19:39:21 UTC

Are you sure it isn't an IO and context problem?

If you can't read fast enough, then you obviously can't check your local files,
verify the downloaded statistics, and request the differences.

This is not really something can easily fix. I would imagine the timeout
code in rsync is a little off as it should only be concerned with a real
timeout if it has actually attempted to send any data recently.

This is an upstream bug, not a portage bug.

Comment 2 Labrador 2005-02-24 20:10:31 UTC

It has something to do with server load, I believe, due to the
number of other people reporting the same error.  I don't think
they all have KVM switches throwing the digit 1 into mozilla.
I suspect some of them are just busy servers and they fail
for the same reason as I've found.  I don't really know of
a way to test for IO/context, so this is mainly conjecture.

Based on my experience, I believe my evidence and 
theory is at least more useful than other reports on
the forums, such as "rm -rf /usr/portage/* and then it worked"
or "I rebooted and it was fixed".

Whatever is going on, and whoever owns this, it is critical to
the success of Gentoo.  rsync is probably more important to
Gentoo than to any other distro.

Personally I fought this all day, and I was starting to wonder if
I was committed to sticking with Gentoo since no one seemed to have
a handle on this very old and sporadic problem.

Comment 3 Nicholas Jones (RETIRED) gentoo-dev

2005-02-24 21:27:49 UTC

A good fix is to increase the timeout in make.conf.

If you want to test using IO contention, copy and paste:

for X in $(seq 1 30);
  dd if=/dev/hda of=/dev/null bs=1 count=100000000 skip=$(( 100000000 * $X )) &
done ; emerge sync ; killall dd

And then try to sync. It should fail pretty quickly, I'd imagine.
It creates 30 "readers" of your hard disk simultaneously reading different
100M portions of the disk in 1 byte intervals. You box _will_ come to a crawl
or you have some amazing disks and ram.

Comment 4 Labrador 2005-02-25 08:15:16 UTC

I've seen people reporting the same error with timeout=300.
I made mine 60 and it didn't help.  This isn't a real timeout
error, but a bug in rsync that makes it think it has experienced
a timeout.

I think I could have set my timeout value to 24 hours and it wouldn't have
made a difference - the problem was in whatever type of system load
the 2 frozen browsers and X session had on the OS.

I've tried the dd load test on two systems - the one having the problem
all of yesterday, and my Sun Sparc Ultra 60 (300 Mhz).  Rsync was
not impacted by the load.  It completed OK in a first try.  So perhaps
it has a relation to the CPU workload?

Both of these systems have experienced the io.c timeout error from rsync
intermittantly in the past.

Comment 5 Labrador 2005-02-25 11:06:23 UTC

Some additional links to Gentoo users having the same bug (note how
many find "me too" responses to the thread):

http://forums.gentoo.org/viewtopic-t-288434-highlight-rsync+timeout.html

http://forums.gentoo.org/viewtopic-t-246397-highlight-rsync+timeout.html

http://forums.gentoo.org/viewtopic-t-291760-highlight-rsync+timeout.html
(note user hjnenc says here: I usually have rsync problems only, when the box is heavily loaded at the time the rsync runs.)

http://forums.gentoo.org/viewtopic-t-281918-highlight-rsync+timeout.html

http://forums.gentoo.org/viewtopic-t-281397-highlight-rsync+timeout.html
(again, note the problem was initially "fixed by a reboot" - which can
be a crude way to eliminate system resource hogs)

http://forums.gentoo.org/viewtopic-t-273811-highlight-rsync+timeout.html

I hope this will help convince someone that increasing the timeout has
nothing to do with the real problem.  It has been around for awhile
and seems to be a little more common lately.

My errors are from rsync 2.6.3.  Looking at the cooresponding line numbers
from the error messages...

The line 153 in my error result is merely the comparison test
of the timeout.  That isn't the source of the timeout.

The line 359 is the end of this procedure in io.c

/**
 * It's almost always an error to get an EOF when we're trying to read
 * from the network, because the protocol is self-terminating.
 *
 * However, there is one unfortunate cases where it is not, which is
 * rsync <2.4.6 sending a list of modules on a server, since the list
 * is terminated by closing the socket. So, for the section of the
 * program where that is a problem (start_socket_client),
 * kludge_around_eof is True and we just exit.
 */
static void whine_about_eof(int fd)
{
        if (kludge_around_eof && fd == sock_f_in)
                exit_cleanup(0);

        rprintf(FERROR, RSYNC_NAME ": connection unexpectedly closed "
                "(%.0f bytes received so far) [%s]\n",
                (double)stats.total_read, who_am_i());

        exit_cleanup(RERR_STREAMIO);
}

Here is one place in io.c that the kludge procedure is called
(around line 502 in rsync 2.6.3):

                n = read(fd, buf, len);

                if (n <= 0) {
                        if (n == 0)
                                whine_about_eof(fd); /* Doesn't return. */
                        if (errno == EINTR || errno == EWOULDBLOCK
                            || errno == EAGAIN)
                                continue;

                        /* Don't write errors on a dead socket. */
                        if (fd == sock_f_in)
                                close_multiplexing_out();
                        rsyserr(FERROR, errno, "read error");
                        exit_cleanup(RERR_STREAMIO);
                }

Can anyone think of a reason that CPU load could cause this to
be erroneously triggered?

Comment 6 solar (RETIRED) gentoo-dev

2005-02-25 12:52:10 UTC

if the socket is non blocking then errno could be EAGAIN

if (n == 0) whine_about_eof(fd); seems like it should be after the checks for errno.

Comment 7 Nicholas Jones (RETIRED) gentoo-dev

2005-02-26 00:27:24 UTC

Read would return -1, not 0, in the case of EAGAIN.

Information about the servers you have the issue with would be
good... emerge.log should contain the servers you connected to.
Can you list the ones that fail, and if possible, try to sync
with them explicitly and repeat it.

Comment 8 Labrador 2005-02-26 05:35:40 UTC

Here are a list of rsync servers (uniqued by eye)
that failed for me on Feb 24th:

Failed:
rsync://204.89.193.10/gentoo-portage
rsync://206.167.194.132/gentoo-portage
rsync://206.169.166.57/gentoo-portage
rsync://206.251.252.32/gentoo-portage
rsync://141.219.155.230/gentoo-portage
rsync://146.186.218.75/gentoo-portage
rsync://216.170.153.145/gentoo-portage
rsync://216.194.67.33/gentoo-portage
rsync://63.150.226.21/gentoo-portage
rsync://66.11.11.187/gentoo-portage
rsync://128.213.5.35/gentoo-portage
rsync://129.89.70.108/gentoo-portage

(after this point I upgraded to rsync 2.6.3)
Failed:
rsync://206.251.252.32/gentoo-portage
rsync://209.47.158.135/gentoo-portage
rsync://216.170.153.145/gentoo-portage
rsync://206.167.194.132/gentoo-portage
rsync://206.169.166.57/gentoo-portage
rsync://204.89.193.10/gentoo-portage
rsync://129.89.70.108/gentoo-portage
rsync://132.207.4.160/gentoo-portage
rsync://141.219.155.230/gentoo-portage
rsync://216.194.67.33/gentoo-portage
rsync://63.150.226.21/gentoo-portage

Then after the mozilla browsers and X were killed,
there were sucessful emerge sync on Feb
25th against:
rsync://132.207.4.160/gentoo-portage
rsync://128.213.5.35/gentoo-portage
rsync://63.150.226.21/gentoo-portage
rsync://129.89.70.108/gentoo-portage
(as I ran the dd load tests suggested)

It isn't hard to run against one that had
previously failed, without doing so deliberately.
As you can see, there are 3 rsync servers in that list
which appear in the list of those that previously failed.

I don't see where this testing is going.

If you read the linked items I included, you can see that this
is not specific to me.  What has happened to the earlier
cases in the forums is that they likely just kept trying,
or rebooted and it cleared up any issue that triggers
the bug.

At most times, it is an intermittant bug that
people just put up with.  You don't even hear about that
since you can try again and it might work.

So far I'm not having any io.c timeouts in emerge sync
against some of the previously failing rsync servers
(I set RSYNC in make.conf to do this).  The only errors
I see are other ones such as connection refused or
no response resulting in an error in clientserver.c

Of course I can only test one per 30 minutes...

Comment 9 Jason Stubbs (RETIRED) gentoo-dev

2005-02-26 08:19:42 UTC

The line of questioning is just looking for hints as to the cause. As you have said yourself, it's intermittent and you're having trouble reproducing it even under the circumstances you believe cause it.

As for myself, I've also experienced this problem. From what I've read, rsync is a multi-threaded (4 threads?) protocol. There are at least two visible threads; one scans for modified files and one downloads new files. From what I've seen, this problem occurs when *one* of those threads times out.

Comment 10 Labrador 2005-02-26 12:56:56 UTC

Actually, I've been using other methods to simulate the previous trigger
of the failure, since it is a weird thing and I'd rather see
it happen with a generic load.  But if that isn't going to work,
I can also try the original situation.

I now have a browser and X that are running with 90% load on
CPU 1 and 98% load on CPU 2.  Once I have waited for the emerge
timestamp to age, I can try this again and likely reproduce the bug.

The question is, once I'm there and can reproduce the timeout from
io.c, what do we want to know?  Is there any stat from lsof, 
for example, that would help?

If you want to know what I've done to reproduce the scenario:
start X, launch 2 mozilla browsers (maybe one would work),
enter a continuous stream of some character into the address
bar of mozilla.  I can do this with double clicking scroll lock
and hitting some digit after.  It is the same thing as putting
an object on the keyboard (or just one key) so that it autorepeats
some character forever.  That is what makes the load on X and
mozilla in my case (and one per CPU for my case, as well, or else
I'd still have a CPU free to process cycles quickly).

Again, I'm only using this weird scenario to reproduce what
happened just to me.  I believe it will also simulate other
types of system saturation with different causes that also
lead to the io.c timeout error.

Comment 11 Jason Stubbs (RETIRED) gentoo-dev

2005-02-26 18:42:19 UTC

Try settings RSYNC_TIMEOUT=0 in make.conf. That will prevent rsync from timing out altogether. If you can then get it to hang, rsync was timing out correctly and the error lies elsewhere.

Checking rsync's home page, I found this:
http://samba.anu.edu.au/rsync/issues.html

It suggests that the problem is either a server or network issue. If it's server, there should apparently be something in the logs that'll help resolve it. Otherwise, there's not much that can be done. Infra, can you guys shed any light on this?

Comment 12 Gunnar Wrobel (RETIRED) gentoo-dev

2005-03-08 05:04:08 UTC

I have been playing around with the coda filesystem lately. This included moving the portage tree on a coda volume. If I call emerge sync on the system that runs the coda server as well as the coda client the machine get under heavy load and the timeout error occurs nearly every time. Setting "RSYNC_TIMEOUT=0" did not fix the problem.

Comment 13 Kurt Lieber (RETIRED) gentoo-dev

2005-03-29 05:52:47 UTC

Believe this is a bug better suited for the rsync maintainer.  Reassigning to vapier since he's the one who's touched it most recently.

Comment 14 SpanKY gentoo-dev

2005-04-05 17:14:01 UTC

2.6.4 is in portage now ... you could give that a twirl ...

Comment 15 Lance Albertson (RETIRED) gentoo-dev

2005-06-09 07:20:40 UTC

I just opened a bug upstream to see if we can shed some light on this since I'm
starting to see some more issues on our servers with this.

https://bugzilla.samba.org/show_bug.cgi?id=2783

Comment 16 Lance Albertson (RETIRED) gentoo-dev

2005-06-10 14:20:54 UTC

Cross posting from the other bug:
https://bugzilla.samba.org/show_bug.cgi?id=2783
---------------------------------
(Wayne Davison)

Rsync uses a large amount of CPU on the sending side due to the fact that the
rsync algorithm is trading CPU and disk I/O to reduce network I/O (and using an
encryption algorithm on top of that only makes the CPU that much higher, so
using a daemon connection is less CPU intensive).  The only way to reduce this
is to use the option --whole-file, which makes rsync retransfer each changed
file in its entirety rather than try to use CPU to figure out the differences.

The gentoo bug you cite seems to be primarily concerned with timeouts, and this
is one of the things that 2.6.5 (and to a lesser extent, 2.6.4) tries to fix.
You do need to set things up correctly, though: (1) both sides need to be
running at least 2.6.4 for any timeout-avoidance to occur (and the server needs
to be running 2.6.5 for maximal timeout avoidance); (2) both sides need to know
about the timeout, so if the server is a daemon that has a timeout specified in
its config file, the client needs to have the same (or lower) timeout set via
the --timeout command-line option or the client will not know to send the
keep-live packets to the sender.
---------------------------------

I've been having 2.6.4 running on one of our boxes already. I could try bumping
it to 2.6.5 and have people having problems with it try to sync to
raptor.gentoo.org with 2.6.4. If that does indeed solve the problem, I suggest
we consider bumping 2.6.4 to stable in the next few weeks and see how that goes.
Before you do that, please let me update our servers and email our admins so
they can make that upgrade. I have a feeling that if they don't upgrade, we'll
have more connection problems. 

Any other thoughts?

Comment 17 ProTech 2005-09-01 06:54:30 UTC

I encountered this problem on my local machines.

The rsync server (local mirror) is an Athlon XP 2500+, 512MB DDR, 7200RPM
drives, with minimal load (the same isntall as the client (router) for package
building).
The rsync client is a Cyrix 133MHz machine with 48MB EDO RAM, plus a 8GB hard drive.

If I sync with this server the timeout happens every time. The load on the
client is wery big (wery old machine, swapping). But if I change the client to
sync with an official Gentoo mirror, it runs fine. It's a lot of time, but there
are no timeouts.

It's looks like the local server overloads (100Mbit LAN) the client. The rsync
version is the stable x86 in portage.

Comment 18 SpanKY gentoo-dev

2005-09-01 07:58:37 UTC

makes sense ... your internet link is slow enough for the client

try increasing your timeout on the client ... i usually set it to like 300 for
my slower/embedded machines

Comment 19 ProTech 2005-09-02 00:35:55 UTC

I set the timeout in make.conf to 300, and it's working for the first try. I
usually update this machine twice a month. If I find a problem I will report it.

An emerge sync time is 1 hour and 20 minutes!

emerge.log snippet:
----------
1125641538: Started emerge on: Sep 02, 2005 08:12:18
1125641538:  *** emerge --buildpkg sync
1125641538:  === sync
1125641539: >>> starting rsync with rsync://192.168.1.15/gentoo-portage
1125642481: === Sync completed with rsync://192.168.1.15/gentoo-portage
1125646349:  *** terminating.
----------
Finnished on 9:32

(In reply to comment #18)

Comment 20 patrick 2005-09-09 06:45:07 UTC

I have another assumption. On my working machine (3000mhz) sync runs fine and
never takes longer than 3 mins, on my server (500mhz) sync naturaly takes longer
(more than 3 mins) and the error occured every single sync after around 3 mins
after I started the sync. Later on I set the timeout to 1800seconds and the
error disappeared. My assumption is that the timeout counter in emerge doesn't
count the time since the server responded but the time the sync process is
running, or it continues counting the time even though the sync process is
already doing the things it doesn't need the server for and of course the time
out is then triggered . This would explain why this error only appears on heavy
cpu load, because then the sync process takes longer and perhaps longer than the
timeout.

Comment 21 SpanKY gentoo-dev

2005-09-09 08:56:37 UTC

your logic is flawed because the rsync timeout setting is something portage
passes to rsync and rsync itself errors ... emerge does not have any timeout
checks of its own

Comment 22 SpanKY gentoo-dev

2006-05-11 20:42:10 UTC

re-open if 2.6.8 still breaks