So I finally filed a bug about this bugger. Both Jorge and I have looked into it; a few notes I wanted to take down.

1) This has nothing to do with the webservers rebooting; that was a different bug regarding apache. This bug has to do with the database backends (grebe / grouse).

2) The loadbalancer, icinga, and users all report the outage. The LBs hit '/status.php', icinga hits /, and users hit whatever. status.php is the 'easiest' page to understand: it just sources the forums config, connects to the DB, makes one query, frees it, and returns OK. It normally completes in < 1s. During outages, it exceeds the LB's configured timeout of 10s. Why does a simple connection to the mysql DB time out? Let's see! (A rough way to reproduce the LB check by hand is below.)
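For reference, this is roughly how to poke the same check by hand. Hedged: the hostname is a placeholder I made up, and I'm only mirroring the 10s timeout with curl, not the LB's actual probe mechanics.

# Rough manual version of the LB health check. WEBNODE is a placeholder for
# whichever forums webserver the LB is probing; --max-time 10 mirrors the
# LB's 10s timeout (curl exits 28 if that's exceeded).
WEBNODE=forums-web1.gentoo.org   # hypothetical hostname
curl --silent --show-error --max-time 10 \
     --output /dev/null \
     --write-out 'HTTP %{http_code} in %{time_total}s\n' \
     "http://${WEBNODE}/status.php"
echo "curl exit: $?"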
For those not in the know, Grebe and Grouse run loadbalancing for the forums, and also run mysql.

Grebe's LB logs (9/01/2013): outage began 3:33, ended 4:09.

zegrep '(Disabling|Enabling)' /var/log/messages | less

Grebe's sar numbers:

00:00:01     pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
03:30:02      3668.74   1446.59    322.05      0.01    457.56      0.00      0.00      0.00      0.00
03:40:02      7406.57   2126.09    381.13      0.10   1963.16   1215.83      0.00   1215.82    100.00
03:50:02      5787.58   2159.74   1409.06      0.14   3219.06   1464.49      0.00   1464.49    100.00
04:10:01      3359.44   1402.28    873.41      0.11   3142.33   1018.54      0.00   1018.54    100.00
04:20:01         0.18    434.65    348.73      0.00    226.70      0.00      0.00      0.00      0.00

This basically says 'wow, we are moving tons of pages in and out.'

00:00:01    kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact
03:20:02      5394980  11067292     67.23    680724   7061900   3639672     14.88   7793564   2606520
03:30:02      2624224  13838048     84.06    683056   9764440   3685528     15.07   7835368   5262008
03:40:02       271112  16191160     98.35    678512  12132944   3690996     15.09   7843888   7623772
03:50:02       310220  16152052     98.12    668032  12153024   3653032     14.93   6973288   8471164
04:10:01      6586740   9875532     59.99    658396   5989484   3643032     14.89   6956520   2352204

So we see here that during the outage interval we are quite low on memory on a percentage basis. Of course we still have ~250 MB free, so it's not the end of the world.

Sep 1 03:24:05 grebe kernel: [12150592.407818] grsec: mount of /dev/mapper/vg-var_lib_mysql_snapshot to /var/tmp/mylvmbackup/mnt/grebe.gentoo.org-mysql-master-backup by /bin/mount[mount:25522] uid/euid:0/0 gid/egid:0/0, parent /usr/bin/mylvmbackup[mylvmbackup:25483] uid/euid:0/0 gid/egid:0/0
Sep 1 04:01:44 grebe kernel: [12152849.445516] grsec: unmount of /dev/mapper/vg-var_lib_mysql_snapshot by /bin/umount[umount:28191] uid/euid:0/0 gid/egid:0/0, parent /usr/bin/mylvmbackup[mylvmbackup:25483] uid/euid:0/0 gid/egid:0/0

Implies the outage window for grebe is 3:24 -> 4:01.
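If anyone wants to re-pull these counters, something like the below should do it. The /var/log/sa path is an assumption about where sysstat writes its daily files on these boxes.

# Paging (-B) and memory (-r) stats for the Sep 1 outage window;
# adjust -f if sysstat keeps its data somewhere other than /var/log/sa.
sar -B -s 03:20:00 -e 04:20:00 -f /var/log/sa/sa01
sar -r -s 03:20:00 -e 04:20:00 -f /var/log/sa/sa01

# And the grsec lines that bracket the snapshot mount/unmount:
zgrep 'vg-var_lib_mysql_snapshot' /var/log/messages* | less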
I added debugging for mylvmbackup, and I disabled mylvmbackup on grouse (by chmodding the script to 000).
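For the record, the 'disable' is literally just flipping the mode on the script; the restore mode below is an assumption, I didn't note the original permissions.

# Path taken from the grsec log lines above.
chmod 000 /usr/bin/mylvmbackup   # cron can no longer execute it
chmod 755 /usr/bin/mylvmbackup   # to turn backups back on later (assumed original mode)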
The mylvmbackup run window seems to jibe with the outage times. I've disabled backups on both grouse and grebe to see if we can go 24hrs without an outage. -A
No outage today, which confirms mylvmbackup as the trigger; now we just need to figure out how to limit its memory use... -A
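One idea to try, purely a sketch and not something I've tested here: run the backup under a memory cgroup so the snapshot copy can't blow out the page cache and evict mysqld's working set. This assumes the cgroup memory controller is mounted at /sys/fs/cgroup/memory on grebe/grouse (unverified), and the 2 GB cap is a number pulled out of thin air.

# Put the whole mylvmbackup run in its own memory cgroup; page cache it
# allocates while copying the snapshot then gets reclaimed from this group
# first instead of pushing out mysqld's pages.
mkdir -p /sys/fs/cgroup/memory/mylvmbackup
echo $((2 * 1024 * 1024 * 1024)) > /sys/fs/cgroup/memory/mylvmbackup/memory.limit_in_bytes
echo $$ > /sys/fs/cgroup/memory/mylvmbackup/tasks   # this shell + its children join the group
/usr/bin/mylvmbackup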