
Bug 483310

Summary: Forums flap every day between 3:30am and 4:30am UTC
Product: Gentoo Infrastructure
Component: Forums
Reporter: Alec Warner <antarus>
Assignee: Gentoo Infrastructure <infra-bugs>
Status: RESOLVED OBSOLETE
CC: forum-mods
Severity: normal
Priority: Normal
Version: unspecified
Hardware: All
OS: Linux
Whiteboard:
Package list:
Runtime testing required: ---

Description Alec Warner (RETIRED) archtester gentoo-dev Security 2013-09-01 22:44:44 UTC
So I finally filed a bug about this bugger. Both Jorge and I have looked into it, and I wanted to take down a few notes.

1) This has nothing to do with the webservers rebooting; that was a different bug regarding apache. This bug has to do with the database backends (grebe / grouse).

2) The load balancers, icinga, and users all report the outage. The LBs hit '/status.php', icinga hits '/', and users hit whatever. status.php is the 'easiest' page to understand: it just sources the forums config, connects to the DB, makes 1 query, frees the result, and returns OK. It normally completes in < 1s. During outages, it exceeds the LB's configured timeout of 10s.
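
For readers who have not seen status.php, here is a minimal Python sketch of the same shape of check. It is not the actual script (which is PHP and is not shown in this bug). sqlite3 stands in for the MySQL connection, and the names status_check and timed_status are made up for illustration; only the 10s timeout comes from the report. The sketch only labels a slow run after the fact; it does not abort the check the way a real prober with a hard timeout would.

```python
import sqlite3
import time
from typing import Tuple

LB_TIMEOUT_S = 10.0  # load balancer timeout quoted in the report


def status_check(db_path: str = ":memory:") -> str:
    """Connect, run one trivial query, release everything, return OK."""
    conn = sqlite3.connect(db_path)  # assumption: stand-in for the MySQL connection
    try:
        cur = conn.execute("SELECT 1")  # the single query
        cur.fetchone()
        cur.close()  # free the result
    finally:
        conn.close()
    return "OK"


def timed_status(timeout_s: float = LB_TIMEOUT_S) -> Tuple[str, float]:
    """Run the check and label it TIMEOUT if it took longer than timeout_s."""
    start = time.monotonic()
    result = status_check()
    elapsed = time.monotonic() - start
    return (result if elapsed <= timeout_s else "TIMEOUT"), elapsed
```

Calling timed_status() on a healthy box should return "OK" with a small elapsed time. The point of the bug is that the equivalent check against MySQL takes more than 10s during the outage window.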

Why does a simple connection to the mysql db time out?

Let's see!
Comment 1 Alec Warner (RETIRED) archtester gentoo-dev Security 2013-09-01 23:01:17 UTC
For those not in the know, Grebe and Grouse run loadbalancing for the forums, and also run mysql.

Grebe's LB logs: 9/01/2013

Outage began 3:33, ended 4:09. 

zegrep '(Disabling|Enabling)' /var/log/messages | less

Grebe's sar numbers:

00:00:01     pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
03:30:02      3668.74   1446.59    322.05      0.01    457.56      0.00      0.00      0.00      0.00
03:40:02      7406.57   2126.09    381.13      0.10   1963.16   1215.83      0.00   1215.82    100.00
03:50:02      5787.58   2159.74   1409.06      0.14   3219.06   1464.49      0.00   1464.49    100.00
04:10:01      3359.44   1402.28    873.41      0.11   3142.33   1018.54      0.00   1018.54    100.00
04:20:01         0.18    434.65    348.73      0.00    226.70      0.00      0.00      0.00      0.00

This basically says 'wow we are moving tons of pages in and out.'

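
The paging table above can be checked mechanically. A small Python sketch, using the rows exactly as pasted, that lists the intervals where pgscank/s (pages scanned by kswapd per second) is non-zero, i.e. where the kernel was actively reclaiming memory. The function names are made up for illustration.

```python
from typing import Dict, List, Tuple

COLUMNS = ["pgpgin/s", "pgpgout/s", "fault/s", "majflt/s", "pgfree/s",
           "pgscank/s", "pgscand/s", "pgsteal/s", "%vmeff"]

# Rows copied from the sar paging table in this comment.
SAR_PAGING = """\
03:30:02      3668.74   1446.59    322.05      0.01    457.56      0.00      0.00      0.00      0.00
03:40:02      7406.57   2126.09    381.13      0.10   1963.16   1215.83      0.00   1215.82    100.00
03:50:02      5787.58   2159.74   1409.06      0.14   3219.06   1464.49      0.00   1464.49    100.00
04:10:01      3359.44   1402.28    873.41      0.11   3142.33   1018.54      0.00   1018.54    100.00
04:20:01         0.18    434.65    348.73      0.00    226.70      0.00      0.00      0.00      0.00
"""


def parse_rows(text: str) -> List[Tuple[str, Dict[str, float]]]:
    """Turn whitespace-separated sar rows into (timestamp, {column: value})."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != len(COLUMNS) + 1:
            continue  # skip blank or malformed lines
        values = [float(p) for p in parts[1:]]
        rows.append((parts[0], dict(zip(COLUMNS, values))))
    return rows


def reclaim_intervals(rows: List[Tuple[str, Dict[str, float]]]) -> List[str]:
    """Timestamps of rows where kswapd was scanning pages."""
    return [stamp for stamp, values in rows if values["pgscank/s"] > 0.0]
```

reclaim_intervals(parse_rows(SAR_PAGING)) returns the 03:40:02, 03:50:02 and 04:10:01 rows, which brackets the 3:33 to 4:09 outage from the LB logs.
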
00:00:01    kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact
03:20:02      5394980  11067292     67.23    680724   7061900   3639672     14.88   7793564   2606520
03:30:02      2624224  13838048     84.06    683056   9764440   3685528     15.07   7835368   5262008
03:40:02       271112  16191160     98.35    678512  12132944   3690996     15.09   7843888   7623772
03:50:02       310220  16152052     98.12    668032  12153024   3653032     14.93   6973288   8471164
04:10:01      6586740   9875532     59.99    658396   5989484   3643032     14.89   6956520   2352204

So we see here that during the outage interval, we are quite low on memory on a percentage basis. Now of course we still have ~250 MB free, so it's not the end of the world.
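
A quick arithmetic check on the memory table, as a Python sketch. It recomputes %memused from kbmemfree and kbmemused, converts the free figure to MiB, and asks how much of the growth in used memory between 03:20 and 03:40 is accounted for by kbcached. The numbers are copied from the table; the function names are illustrative only.

```python
from typing import Dict

# Values (in KB) copied from the 03:20:02 and 03:40:02 rows of the sar memory table.
BEFORE: Dict[str, int] = {"kbmemfree": 5394980, "kbmemused": 11067292, "kbcached": 7061900}
DURING: Dict[str, int] = {"kbmemfree": 271112, "kbmemused": 16191160, "kbcached": 12132944}


def pct_used(sample: Dict[str, int]) -> float:
    """Recompute %memused as used / (free + used)."""
    total = sample["kbmemfree"] + sample["kbmemused"]
    return 100.0 * sample["kbmemused"] / total


def free_mib(sample: Dict[str, int]) -> float:
    """Free memory in MiB."""
    return sample["kbmemfree"] / 1024.0


def cache_share_of_growth(before: Dict[str, int], after: Dict[str, int]) -> float:
    """Fraction of the increase in used memory that is page cache."""
    used_growth = after["kbmemused"] - before["kbmemused"]
    cache_growth = after["kbcached"] - before["kbcached"]
    return cache_growth / used_growth
```

pct_used(DURING) reproduces sar's 98.35, free_mib(DURING) comes out just under 265 MiB, and cache_share_of_growth(BEFORE, DURING) is just under 0.99. In other words, nearly all of the extra "used" memory in this window is page cache.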

Sep  1 03:24:05 grebe kernel: [12150592.407818] grsec: mount of /dev/mapper/vg-var_lib_mysql_snapshot to /var/tmp/mylvmbackup/mnt/grebe.gentoo.org-mysql-master-backup by /bin/mount[mount:25522] uid/euid:0/0 gid/egid:0/0, parent /usr/bin/mylvmbackup[mylvmbackup:25483] uid/euid:0/0 gid/egid:0/0
Sep  1 04:01:44 grebe kernel: [12152849.445516] grsec: unmount of /dev/mapper/vg-var_lib_mysql_snapshot by /bin/umount[umount:28191] uid/euid:0/0 gid/egid:0/0, parent /usr/bin/mylvmbackup[mylvmbackup:25483] uid/euid:0/0 gid/egid:0/0

This implies the outage for grebe is 3:24 -> 4:01, i.e. the window during which the mylvmbackup snapshot is mounted.
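
To put numbers on how well the two windows line up, here is a small Python sketch that computes the length of the snapshot mount window and its overlap with the LB-reported outage. Assumptions: both logs are in the same timezone, and the LB times (which are only given to the minute) are taken as :00 seconds. The helper names are illustrative.

```python
from datetime import datetime, timedelta
from typing import Tuple

Window = Tuple[datetime, datetime]
DAY = "2013-09-01"


def at(hms: str) -> datetime:
    """Build a datetime on the day of the outage from HH:MM:SS."""
    return datetime.strptime(f"{DAY} {hms}", "%Y-%m-%d %H:%M:%S")


# Snapshot mount/unmount times from the grsec kernel log lines.
BACKUP: Window = (at("03:24:05"), at("04:01:44"))
# LB outage from the log summary; seconds are an assumption.
LB_OUTAGE: Window = (at("03:33:00"), at("04:09:00"))


def overlap(a: Window, b: Window) -> timedelta:
    """Length of the intersection of two time windows (zero if disjoint)."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max(end - start, timedelta(0))


backup_length = BACKUP[1] - BACKUP[0]        # 37m39s
shared = overlap(BACKUP, LB_OUTAGE)          # 28m44s
```

So the snapshot is mounted for a bit under 38 minutes, and roughly 29 of the 36 minutes of LB-reported outage fall inside that window. The LB outage starts about 9 minutes after the mount and ends about 7 minutes after the unmount.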
Comment 2 Alec Warner (RETIRED) archtester gentoo-dev Security 2013-09-01 23:07:10 UTC
I added debugging for mylvmbackup, and I disabled mylvmbackup on grouse (by chmod-ing the script to 000).
Comment 3 Alec Warner (RETIRED) archtester gentoo-dev Security 2013-09-09 07:15:04 UTC
mylvmbackup seems to jibe with the outage times. I've disabled backups on both grouse and grebe to see if we can go 24hrs without an outage.

-A
Comment 4 Alec Warner (RETIRED) archtester gentoo-dev Security 2013-09-10 05:48:05 UTC
No outage today, which confirms mylvmbackup as the cause. Now we just need to figure out how to limit memory use...

-A