As per the description, both sites are unavailable via HTTP. Ping indicates that the hosts are up, but they are not serving pages.
barbet (the host) is not functioning sufficiently to serve queries.

13:10 -willikins:#gentoo-infra- (nagios) **PROBLEM** Service: Content Check HTTP planet.gentoo.org | Host: barbet | State: CRITICAL | Info: CRITICAL - Socket timeout after 21 seconds | Date: Wed Oct 17 13:10:15 UTC 2012
13:41 -willikins:#gentoo-infra- (nagios) **PROBLEM** Service: Content Check HTTP get.gentoo.org | Host: barbet | State: CRITICAL | Info: CRITICAL - Socket timeout after 21 seconds | Date: Wed Oct 17 13:41:15 UTC 2012
13:42 -willikins:#gentoo-infra- (nagios) **PROBLEM** Service: Content Check HTTP devmanual.gentoo.org | Host: barbet | State: CRITICAL | Info: CRITICAL - Socket timeout after 21 seconds | Date: Wed Oct 17 13:42:55 UTC 2012
13:45 -willikins:#gentoo-infra- (nagios) **PROBLEM** Service: Content Check HTTP packages.gentoo.org | Host: barbet | State: CRITICAL | Info: CRITICAL - Socket timeout after 21 seconds | Date: Wed Oct 17 13:45:15 UTC 2012

I'll try to spare an hour to move the services somewhere else. These machines are notorious for bad memory :/
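For anyone who wants to reproduce the "host answers ping but does not serve pages" symptom from their own machine, here is a rough Python sketch of an HTTP content check with a socket timeout. It is not the actual Nagios plugin used by infra; the function name `content_check`, the `expect` parameter, and the injectable `opener` are assumptions made for illustration. Only the 21-second timeout is taken from the alerts above.

```python
import socket
import urllib.error
import urllib.request


def content_check(url, expect=b"", timeout=21.0, opener=urllib.request.urlopen):
    """Fetch url and return a (state, info) pair in a Nagios-like style.

    Hypothetical helper: 'expect' is an optional byte string that must
    appear in the first 64 KiB of the response body.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            status = resp.status
            body = resp.read(65536)
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (e.g. 404 or 503).
        return "CRITICAL", f"HTTP {exc.code}"
    except socket.timeout:
        # Timeout while reading the response.
        return "CRITICAL", f"Socket timeout after {timeout:g} seconds"
    except urllib.error.URLError as exc:
        # Timeouts during connect are wrapped in URLError.
        if isinstance(exc.reason, socket.timeout):
            return "CRITICAL", f"Socket timeout after {timeout:g} seconds"
        return "CRITICAL", f"Connection failed: {exc.reason}"
    except OSError as exc:
        return "CRITICAL", f"Connection failed: {exc}"

    if expect and expect not in body:
        return "CRITICAL", f"HTTP {status} but expected content not found"
    return "OK", f"HTTP {status}"


# Example call (performs a real request, so it is left commented out):
# print(content_check("http://planet.gentoo.org/", expect=b"Gentoo"))
```

A CRITICAL result with a socket timeout while ping still succeeds points at the web stack (apache/varnish) rather than at the network path.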
Barbet cannot merge packages, nor can it start varnish (to serve) due to what I can only imagine are memory and disk errors (lots of SMART errors in the logs.)
All services except for bouncer and packages have been restored.
planet.gentoo.org has fallen over again ..
apache/varnishd kicked. apache had some stuck processes that had to be beaten with kill -9.
Hi all, the packages.gentoo.org website seems dead again now (although I saw it working about 2 or 3 hours ago).
Don't know if related, but devmanual.gentoo.org is also unresponsive. Checked from multiple locations.
*** Bug 439520 has been marked as a duplicate of this bug. ***
*** Bug 439252 has been marked as a duplicate of this bug. ***
So an update. 80% of infra was at a conference (and then conf cleanup / travel home.)

A bunch of low-risk http services run on the same box. Due to what I suspect is a bug in our cfengine deployment, packages.gentoo.org is not currently serving content properly. This eventually causes all the apache workers to hang waiting on mod_python to serve content. Ideally we would not be using mod_python (fcgi is probably better.)

There also appears to be an issue where varnish is hitting its open-file-descriptor limit. I need to poke it when it is in this state and figure out what all the descriptors are for. I'm guessing that too will lead me to hung apache workers and generally 'lots of search engines like to hammer packages.gentoo.org.'

Currently p.g.o is not enabled, so we are serving a number of 404s or the default service page for apache. This was done primarily to spare the other vhosts on the machine (get, planet, devmanual.)

-A
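On the "figure out what all the descriptors are for" part: a rough sketch of one way to summarize a process's open descriptors on Linux by reading the symlinks under /proc/<pid>/fd. This is only an illustration, not what was actually run on barbet. The helper names `classify_fd_target` and `summarize_fds` are made up here, and the pid in the example is a placeholder for whatever the varnishd child pid happens to be.

```python
import os
from collections import Counter


def classify_fd_target(target):
    """Map a /proc/<pid>/fd symlink target to a coarse category."""
    if target.startswith("socket:"):
        return "socket"
    if target.startswith("pipe:"):
        return "pipe"
    if target.startswith("anon_inode:"):
        return "anon_inode"
    if target.startswith("/"):
        return "file"
    return "other"


def summarize_fds(pid, proc_root="/proc"):
    """Count open descriptors of a process by category (Linux only).

    proc_root is a parameter only so the function can be exercised
    against a fake directory tree; normally it stays at /proc.
    """
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    counts = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        counts[classify_fd_target(target)] += 1
    return counts


# Example (needs permission to read the target process's fd directory):
# print(summarize_fds(12345))  # 12345 is a placeholder pid
```

If the count is dominated by sockets while the apache workers are hung, that would be consistent with the guess that backend connections are piling up behind stuck mod_python requests.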
I wasted another two evenings on this and I've given up on the existing codebase. We are in the process of provisioning some VMs for a 2012 GSOC project to replace the existing codebase, so we might as well deploy it. Expect the site to stay down a few more days. -A
Should work again.