So I finally filed a bug about this bugger. Both Jorge and I have looked into it; a few notes I wanted to take down.

1) This has nothing to do with the webservers rebooting; that was a different bug regarding apache. This bug has to do with the database backends (grebe / grouse).

2) The loadbalancer, icinga, and users all report the outage. The LBs hit '/status.php', icinga hits /, and users hit whatever. status.php is the 'easiest' page to understand: it just sources the forums config, connects to the DB, makes one query, frees it, and returns OK. It normally completes in < 1s. During outages, it exceeds the LB's configured timeout of 10s. Why does a simple connection to the mysql DB time out? Let's see! (A rough way to reproduce the LB check by hand is below.)
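For reference, this is roughly how to poke the same check by hand. Hedged: the hostname is a placeholder I made up, and I'm only mirroring the 10s timeout with curl, not the LB's actual probe mechanics.

# Rough manual version of the LB health check. WEBNODE is a placeholder for
# whichever forums webserver the LB is probing; --max-time 10 mirrors the
# LB's 10s timeout (curl exits 28 if that's exceeded).
WEBNODE=forums-web1.gentoo.org   # hypothetical hostname
curl --silent --show-error --max-time 10 \
     --output /dev/null \
     --write-out 'HTTP %{http_code} in %{time_total}s\n' \
     "http://${WEBNODE}/status.php"
echo "curl exit: $?"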
For those not in the know, Grebe and Grouse run loadbalancing for the forums, and also run mysql.

Grebe's LB logs (9/01/2013): outage began 3:33, ended 4:09.

zegrep '(Disabling|Enabling)' /var/log/messages | less

Grebe's sar numbers:

00:00:01     pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
03:30:02      3668.74   1446.59    322.05      0.01    457.56      0.00      0.00      0.00      0.00
03:40:02      7406.57   2126.09    381.13      0.10   1963.16   1215.83      0.00   1215.82    100.00
03:50:02      5787.58   2159.74   1409.06      0.14   3219.06   1464.49      0.00   1464.49    100.00
04:10:01      3359.44   1402.28    873.41      0.11   3142.33   1018.54      0.00   1018.54    100.00
04:20:01         0.18    434.65    348.73      0.00    226.70      0.00      0.00      0.00      0.00

This basically says 'wow, we are moving tons of pages in and out.'

00:00:01    kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact
03:20:02      5394980  11067292     67.23    680724   7061900   3639672     14.88   7793564   2606520
03:30:02      2624224  13838048     84.06    683056   9764440   3685528     15.07   7835368   5262008
03:40:02       271112  16191160     98.35    678512  12132944   3690996     15.09   7843888   7623772
03:50:02       310220  16152052     98.12    668032  12153024   3653032     14.93   6973288   8471164
04:10:01      6586740   9875532     59.99    658396   5989484   3643032     14.89   6956520   2352204

So we see here that during the outage interval we are quite low on memory on a percentage basis. Of course we still have ~250 MB free, so it's not the end of the world.

Sep 1 03:24:05 grebe kernel: [12150592.407818] grsec: mount of /dev/mapper/vg-var_lib_mysql_snapshot to /var/tmp/mylvmbackup/mnt/grebe.gentoo.org-mysql-master-backup by /bin/mount[mount:25522] uid/euid:0/0 gid/egid:0/0, parent /usr/bin/mylvmbackup[mylvmbackup:25483] uid/euid:0/0 gid/egid:0/0
Sep 1 04:01:44 grebe kernel: [12152849.445516] grsec: unmount of /dev/mapper/vg-var_lib_mysql_snapshot by /bin/umount[umount:28191] uid/euid:0/0 gid/egid:0/0, parent /usr/bin/mylvmbackup[mylvmbackup:25483] uid/euid:0/0 gid/egid:0/0

Implies the outage window for grebe is 3:24 -> 4:01.
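If anyone wants to re-pull these counters, something like the below should do it. The /var/log/sa path is an assumption about where sysstat writes its daily files on these boxes.

# Paging (-B) and memory (-r) stats for the Sep 1 outage window;
# adjust -f if sysstat keeps its data somewhere other than /var/log/sa.
sar -B -s 03:20:00 -e 04:20:00 -f /var/log/sa/sa01
sar -r -s 03:20:00 -e 04:20:00 -f /var/log/sa/sa01

# And the grsec lines that bracket the snapshot mount/unmount:
zgrep 'vg-var_lib_mysql_snapshot' /var/log/messages* | less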
I added debugging for mylvmbackup, and I disabled mylvmbackup on grouse (by chmodding the script to 000).
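For the record, the 'disable' is literally just flipping the mode on the script; the restore mode below is an assumption, I didn't note the original permissions.

# Path taken from the grsec log lines above.
chmod 000 /usr/bin/mylvmbackup   # cron can no longer execute it
chmod 755 /usr/bin/mylvmbackup   # to turn backups back on later (assumed original mode)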
The mylvmbackup run window seems to jibe with the outage times. I've disabled backups on both grouse and grebe to see if we can go 24hrs without an outage. -A
No outage today, which confirms mylvmbackup as the trigger; now we just need to figure out how to limit its memory use... -A
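One idea to try, purely a sketch and not something I've tested here: run the backup under a memory cgroup so the snapshot copy can't blow out the page cache and evict mysqld's working set. This assumes the cgroup memory controller is mounted at /sys/fs/cgroup/memory on grebe/grouse (unverified), and the 2 GB cap is a number pulled out of thin air.

# Put the whole mylvmbackup run in its own memory cgroup; page cache it
# allocates while copying the snapshot then gets reclaimed from this group
# first instead of pushing out mysqld's pages.
mkdir -p /sys/fs/cgroup/memory/mylvmbackup
echo $((2 * 1024 * 1024 * 1024)) > /sys/fs/cgroup/memory/mylvmbackup/memory.limit_in_bytes
echo $$ > /sys/fs/cgroup/memory/mylvmbackup/tasks   # this shell + its children join the group
/usr/bin/mylvmbackup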