Gentoo Websites Logo
Go to: Gentoo Home Documentation Forums Lists Bugs Planet Store Wiki Get Gentoo!
Bug 524284 - =www-servers/varnish-4.0.1 - varnishncsa: Can't open VSM file (Abandoned VSM file (Varnish not running?) /var/lib/varnish/gentoo-t1.sygic.local/_.vsm
Summary: =www-servers/varnish-4.0.1 - varnishncsa: Can't open VSM file (Abandoned VSM ...
Status: RESOLVED FIXED
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: [OLD] Server (show other bugs)
Hardware: All Linux
: Normal normal (vote)
Assignee: Anthony Basile
URL:
Whiteboard:
Keywords: PATCH
Depends on:
Blocks:
 
Reported: 2014-10-02 13:20 UTC by Tomáš Mózes
Modified: 2016-03-17 19:06 UTC (History)
1 user (show)

See Also:
Package list:
Runtime testing required: ---


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Tomáš Mózes 2014-10-02 13:20:31 UTC
After adding varnishd and varnishncsa to the default runlevel and booting the machine, varnishncsa is not started

 * Starting varnishd ...
 [ ok ]
 * Starting varnishncsa ...
Can't open VSM file (Abandoned VSM file (Varnish not running?) /var/lib/varnish/gentoo-t1.sygic.local/_.vsm
)
 * start-stop-daemon: failed to start `/usr/bin/varnishncsa'
 * Failed to start varnishncsa
 [ !! ]
 * ERROR: varnishncsa failed to start

Seems it's sufficient to wait a while and then it starts just fine. Tested on amd64 on a few reboots.

# diff -u /etc/init.d/varnishncsa.orig /etc/init.d/varnishncsa
--- /etc/init.d/varnishncsa.orig        2014-10-02 12:26:25.000000000 +0200
+++ /etc/init.d/varnishncsa     2014-10-02 15:10:28.704935808 +0200
@@ -14,6 +14,11 @@
 description_rotate="Rotate the log file"
 description_flush="Flush any outstanding transactions"
 
+start_pre() {
+       # wait for varnishd
+       sleep 2
+}
+
 rotate() {
        ebegin "Rotating log file"
        start-stop-daemon -p ${VARNISHNCSA_PID} -s SIGHUP
Comment 1 Tomáš Mózes 2014-10-02 13:21:42 UTC
Portage 2.2.8-r1 (hardened/linux/amd64, gcc-4.7.3, glibc-2.19-r1, 3.14.17-hardened x86_64)
=================================================================
System uname: Linux-3.14.17-hardened-x86_64-Intel-R-_Xeon-R-_CPU_E5620_@_2.40GHz-with-gentoo-2.2
KiB Mem:     2997372 total,   1726456 free
KiB Swap:          0 total,         0 free
Timestamp of tree: Thu, 02 Oct 2014 08:45:01 +0000
ld GNU ld (Gentoo 2.23.2 p1.0) 2.23.2
app-shells/bash:          4.2_p51
dev-lang/python:          2.7.7, 3.3.5-r1
dev-util/pkgconfig:       0.28-r1
sys-apps/baselayout:      2.2
sys-apps/openrc:          0.12.4
sys-apps/sandbox:         2.6-r1
sys-devel/autoconf:       2.69
sys-devel/automake:       1.13.4
sys-devel/binutils:       2.23.2
sys-devel/gcc:            4.7.3-r1
sys-devel/gcc-config:     1.7.3
sys-devel/libtool:        2.4.2-r1
sys-devel/make:           3.82-r4
sys-kernel/linux-headers: 3.13 (virtual/os-headers)
sys-libs/glibc:           2.19-r1
Repositories: gentoo
ACCEPT_KEYWORDS="amd64"
ACCEPT_LICENSE="*"
CBUILD="x86_64-pc-linux-gnu"
CFLAGS="-mtune=native -O2 -pipe"
CHOST="x86_64-pc-linux-gnu"
CONFIG_PROTECT="/etc /var/bind"
CONFIG_PROTECT_MASK="/etc/ca-certificates.conf /etc/env.d /etc/gconf /etc/gentoo-release /etc/revdep-rebuild /etc/sandbox.d /etc/terminfo"
CXXFLAGS="-mtune=native -O2 -pipe"
DISTDIR="/usr/portage/distfiles"
FCFLAGS="-O2 -pipe"
FEATURES="assume-digests binpkg-logs config-protect-if-modified distlocks ebuild-locks fixlafiles merge-sync news parallel-fetch preserve-libs protect-owned sandbox sfperms strict unknown-features-warn unmerge-logs unmerge-orphans userfetch userpriv usersandbox usersync xattr"
FFLAGS="-O2 -pipe"
LDFLAGS="-Wl,-O1 -Wl,--as-needed"
PKGDIR="/usr/portage/packages"
PORTAGE_CONFIGROOT="/"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --omit-dir-times --compress --force --whole-file --delete --stats --human-readable --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
USE="acl amd64 berkdb bindist bzip2 cli cracklib crypt cxx dri gdbm hardened iconv justify mmx modules multilib ncurses nls nptl openmp pam pax_kernel pcre readline session sse sse2 ssl tcpd unicode urandom xattr xtpax zlib" ABI_X86="64" ALSA_CARDS="ali5451 als4000 atiixp atiixp-modem bt87x ca0106 cmipci emu10k1x ens1370 ens1371 es1938 es1968 fm801 hda-intel intel8x0 intel8x0m maestro3 trident usb-audio via82xx via82xx-modem ymfpci" APACHE2_MODULES="authn_core authz_core socache_shmcb unixd actions alias auth_basic authn_alias authn_anon authn_dbm authn_default authn_file authz_dbm authz_default authz_groupfile authz_host authz_owner authz_user autoindex cache cgi cgid dav dav_fs dav_lock deflate dir disk_cache env expires ext_filter file_cache filter headers include info log_config logio mem_cache mime mime_magic negotiation rewrite setenvif speling status unique_id userdir usertrack vhost_alias" CALLIGRA_FEATURES="kexi words flow plan sheets stage tables krita karbon braindump author" CAMERAS="ptp2" COLLECTD_PLUGINS="df interface irq load memory rrdtool swap syslog" ELIBC="glibc" GPSD_PROTOCOLS="ashtech aivdm earthmate evermore fv18 garmin garmintxt gpsclock itrax mtk3301 nmea ntrip navcom oceanserver oldstyle oncore rtcm104v2 rtcm104v3 sirf superstar2 timing tsip tripmate tnt ublox ubx" INPUT_DEVICES="keyboard mouse evdev" KERNEL="linux" LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text" LIBREOFFICE_EXTENSIONS="presenter-console presenter-minimizer" NGINX_MODULES_HTTP="access charset gzip limit_conn limit_req log proxy rewrite" OFFICE_IMPLEMENTATION="libreoffice" PHP_TARGETS="php5-5" PYTHON_SINGLE_TARGET="python2_7" PYTHON_TARGETS="python2_7 python3_3" RUBY_TARGETS="ruby19 ruby20" USERLAND="GNU" VIDEO_CARDS="fbdev glint intel mach64 mga nouveau nv r128 radeon savage sis tdfx trident vesa via vmware dummy v4l" XTABLES_ADDONS="quota2 psd pknock lscan length2 ipv4options ipset ipp2p iface geoip fuzzy condition tee tarpit sysrq steal rawnat logmark ipmark dhcpmac delude chaos account"
Unset:  CPPFLAGS, CTARGET, EMERGE_DEFAULT_OPTS, INSTALL_MASK, LANG, LC_ALL, PORTAGE_BUNZIP2_COMMAND, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS, USE_PYTHON
Comment 2 Jeroen Roovers (RETIRED) gentoo-dev 2014-10-02 20:42:19 UTC
sleep? ew!
Comment 3 Anthony Basile gentoo-dev 2014-10-02 21:04:44 UTC
(In reply to Jeroen Roovers from comment #2)
> sleep? ew!

(In reply to Tomas Mozes from comment #0)
> --- /etc/init.d/varnishncsa.orig        2014-10-02 12:26:25.000000000 +0200
> +++ /etc/init.d/varnishncsa     2014-10-02 15:10:28.704935808 +0200
> @@ -14,6 +14,11 @@
>  description_rotate="Rotate the log file"
>  description_flush="Flush any outstanding transactions"
>  
> +start_pre() {
> +       # wait for varnishd
> +       sleep 2
> +}
> +
>  rotate() {
>         ebegin "Rotating log file"
>         start-stop-daemon -p ${VARNISHNCSA_PID} -s SIGHUP

Yeah, I have to say eww here too.  Is there any other way to check if varnishd has done its initialization?  What resource are we waiting to become available?  If you can tell me that, I'll see if I can find some way of signaling rather than blindly sleeping.

Thanks for the report :)
Comment 4 Tomáš Mózes 2014-10-03 07:34:54 UTC
I tried several other methods:
- checking for varnishd pid
- checking for file /var/lib/varnish/`uname -n`/_.vsm
- blind read from /var/lib/varnish/`uname -n`/_.vsm
- tried to find some hints in the manual pages

with no luck. The error situation looks like this:

uname({sys="Linux", node="gentoo-t1.sygic.local", ...}) = 0
open("/var/lib/varnish/gentoo-t1.sygic.local/_.vsm", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=84934656, ...}) = 0
read(3, "VSMHEAD00\0\0\0\0\0\0\0\0\0\20\5\0\0\0\0000\0\0\0\0\0\0\0"..., 48) = 48
close(3)                                = 0
write(2, "Can't open VSM file (Abandoned V"..., 109Can't open VSM file (Abandoned VSM file (Varnish not running?) /var/lib/varnish/gentoo-t1.sygic.local/_.vsm)) = 109
write(2, "\n", 1)                       = 1

The good one looks like this:
uname({sys="Linux", node="gentoo-t1.sygic.local", ...}) = 0
open("/var/lib/varnish/gentoo-t1.sygic.local/_.vsm", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=84934656, ...}) = 0
read(3, "VSMHEAD00\0\0\0\0\0\0\0\0\0\20\5\0\0\0\0000\0\0\0\0\0\0\0"..., 48) = 48
mmap(NULL, 84934656, PROT_READ, MAP_SHARED, 3, 0) = 0x7f2ee607f000
clock_gettime(CLOCK_MONOTONIC, {609, 64473513}) = 0
rt_sigaction(SIGHUP, {0x403b70, [HUP], SA_RESTORER|SA_RESTART, 0x7f2eeb41e350}, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGINT, {0x403b80, [INT], SA_RESTORER|SA_RESTART, 0x7f2eeb41e350}, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGTERM, {0x403b80, [TERM], SA_RESTORER|SA_RESTART, 0x7f2eeb41e350}, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGUSR1, {0x403b90, [USR1], SA_RESTORER|SA_RESTART, 0x7f2eeb41e350}, {SIG_DFL, [], 0}, 8) = 0
clock_gettime(CLOCK_MONOTONIC, {609, 64910775}) = 0
clock_gettime(CLOCK_MONOTONIC, {609, 65017857}) = 0
clock_gettime(CLOCK_MONOTONIC, {609, 65123411}) = 0
clock_gettime(CLOCK_MONOTONIC, {609, 65204597}) = 0
clock_gettime(CLOCK_MONOTONIC, {609, 65307478}) = 0
clock_gettime(CLOCK_MONOTONIC, {609, 65366954}) = 0
clock_gettime(CLOCK_MONOTONIC, {609, 65469630}) = 0
munmap(0x7f2ee607f000, 84934656)        = 0
close(3)                                = 0

Since all varnish programs use the same error handling code, we can use some non-destructive commands like varnishtop -1 and wait until it finishes with status 0. Or an alternative:

--- varnishncsa.orig    2014-10-02 12:26:25.000000000 +0200
+++ varnishncsa 2014-10-03 09:23:48.207908492 +0200
@@ -14,6 +14,18 @@
 description_rotate="Rotate the log file"
 description_flush="Flush any outstanding transactions"
 
+start_pre() {
+       TIMEOUT=${TIMEOUT:-5}
+
+       local i=0
+       while [ $i -lt ${TIMEOUT} ]; do
+               echo status | varnishadm 2>/dev/null | grep -q "Child in state running"
+               test $? -eq 0  && break
+                
+               sleep 1 && i=$(expr $i + 1)
+       done
+}
+
 rotate() {
        ebegin "Rotating log file"
        start-stop-daemon -p ${VARNISHNCSA_PID} -s SIGHUP
Comment 5 Anthony Basile gentoo-dev 2014-10-03 11:25:34 UTC
(In reply to Tomas Mozes from comment #4)
> 
> uname({sys="Linux", node="gentoo-t1.sygic.local", ...}) = 0
> open("/var/lib/varnish/gentoo-t1.sygic.local/_.vsm", O_RDONLY) = 3
> fstat(3, {st_mode=S_IFREG|0644, st_size=84934656, ...}) = 0
> read(3, "VSMHEAD00\0\0\0\0\0\0\0\0\0\20\5\0\0\0\0000\0\0\0\0\0\0\0"..., 48)
> = 48
> close(3)                                = 0
> write(2, "Can't open VSM file (Abandoned V"..., 109Can't open VSM file
> (Abandoned VSM file (Varnish not running?)
> /var/lib/varnish/gentoo-t1.sygic.local/_.vsm)) = 109
> write(2, "\n", 1)                       = 1

This strace doesn't say much.  The file is opened and then close by an internal decision.  Reading the code we see the failure is happening in lib/libvarnishapi/vsm.c around line 260 which reads

        if (!vd->N_opt && slh.alloc_seq == 0) {
                AZ(close(vd->vsm_fd));
                vd->vsm_fd = -1;
                return (vsm_diag(vd,
                        "Abandoned VSM file (Varnish not running?) %s\n",
                        vd->fname));
        }


slh is successfully read from vd->vsm_fd at line 246 otherwise a different error would be thrown.  From the strace, you actually see part of the struct read (see include/vapi/vsm_int.h), but it isn't all of it so I can't make out alloc_seq's value in there.  It might be nice to fprintf(stderr ...) the values of vd->N_opt and slh.alloc_seq both on failure and success.  This is starting to feel like a race condition that upstream should know about.

Having said that, your start_pre() is much better.  Let's see what upstream has to say first.
Comment 6 Anthony Basile gentoo-dev 2014-10-03 11:35:41 UTC
(In reply to Anthony Basile from comment #5)
> (In reply to Tomas Mozes from comment #4)
> > 
> > uname({sys="Linux", node="gentoo-t1.sygic.local", ...}) = 0
> > open("/var/lib/varnish/gentoo-t1.sygic.local/_.vsm", O_RDONLY) = 3
> > fstat(3, {st_mode=S_IFREG|0644, st_size=84934656, ...}) = 0
> > read(3, "VSMHEAD00\0\0\0\0\0\0\0\0\0\20\5\0\0\0\0000\0\0\0\0\0\0\0"..., 48)
> > = 48
> > close(3)                                = 0
> > write(2, "Can't open VSM file (Abandoned V"..., 109Can't open VSM file
> > (Abandoned VSM file (Varnish not running?)
> > /var/lib/varnish/gentoo-t1.sygic.local/_.vsm)) = 109
> > write(2, "\n", 1)                       = 1
> 
> This strace doesn't say much.  The file is opened and then close by an
> internal decision.  Reading the code we see the failure is happening in
> lib/libvarnishapi/vsm.c around line 260 which reads
> 
>         if (!vd->N_opt && slh.alloc_seq == 0) {
>                 AZ(close(vd->vsm_fd));
>                 vd->vsm_fd = -1;
>                 return (vsm_diag(vd,
>                         "Abandoned VSM file (Varnish not running?) %s\n",
>                         vd->fname));
>         }
> 
> 
> slh is successfully read from vd->vsm_fd at line 246 otherwise a different
> error would be thrown.  From the strace, you actually see part of the struct
> read (see include/vapi/vsm_int.h), but it isn't all of it so I can't make
> out alloc_seq's value in there.  It might be nice to fprintf(stderr ...) the
> values of vd->N_opt and slh.alloc_seq both on failure and success.  This is
> starting to feel like a race condition that upstream should know about.
> 
> Having said that, your start_pre() is much better.  Let's see what upstream
> has to say first.

Oh wait, I'm stupid.  This is varnishncsa failing, not varnishd.  So yeah, we just have to wait til varnishd is ready.  I thought it was a thread race since varnishd is *extremely* thread heavy.

@Tomas.  Did you test your latest start_pre?
Comment 7 Tomáš Mózes 2014-10-03 13:58:42 UTC
Yes, I've tried restarting varnishd and varnishnca in a loop, seems to work fine. I also tried shutting down the machine and then starting it to see if the services start properly (multiple times on xen-domU).

I've attached the strace output just to point out that for removing "sleep" we would probably need to reuse some code from varnish (and to keep it in sync with the specific varnish version).
Comment 8 Anthony Basile gentoo-dev 2014-10-03 18:18:07 UTC
Okay I put the fix in varnish-4.0.1-r1.  Please reopen this bug if there's still a problem.
Comment 9 Christian Ruppert (idl0r) gentoo-dev 2015-11-16 14:59:39 UTC
Hi,

I don't think a default sleep/timeout is necessary in 4.1. anymore. We should verify that, since there were some related fixes/enhancements.
Comment 10 Tomáš Mózes 2015-12-08 06:39:53 UTC
(In reply to Christian Ruppert (idl0r) from comment #9)
> Hi,
> 
> I don't think a default sleep/timeout is necessary in 4.1. anymore. We
> should verify that, since there were some related fixes/enhancements.

Thanks for the suggestion Christian. Yes, I can confirm it now works on a testing machine without the sleep/timeout part with version 4.1.0.
Comment 11 Christian Ruppert (idl0r) gentoo-dev 2016-03-17 19:06:49 UTC
start_pre() has been removed in 4.1.2 since it's no longer required.