Bug 650906 - Dipper instability
Summary: Dipper instability
Status: RESOLVED FIXED
Alias: None
Product: Mirrors
Classification: Unclassified
Component: Server Problem
Hardware: All Linux
Importance: Normal normal
Assignee: Mirror Admins
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: 650960
 
Reported: 2018-03-19 15:42 UTC by Alec Warner
Modified: 2018-07-22 15:35 UTC
CC List: 0 users

See Also:
Package list:
Runtime testing required: ---


Description Alec Warner (RETIRED) archtester gentoo-dev Security 2018-03-19 15:42:37 UTC
Dipper has crashed twice recently. It doesn't look great based on the IPMI data:

<@robbat2> IPMI SEL has:
<@robbat2>   b1 | 03/17/2018 | 23:36:07 | Processor #0x0d | Transition to Non-recoverable | Asserted

The internet seems to imply this is a hardware or heat issue; will do more research later.
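For later triage, a minimal Python sketch of how SEL lines in this shape could be parsed and filtered. It assumes the pipe-delimited, six-field layout of the entry quoted above (with the IRC nick prefix already stripped); the `SelEntry` and `is_critical` names are made up for illustration and are not part of any IPMI tool.

```python
from dataclasses import dataclass


@dataclass
class SelEntry:
    record_id: str
    date: str
    time: str
    sensor: str
    event: str
    direction: str


def parse_sel_line(line: str) -> SelEntry:
    """Split one pipe-delimited SEL line into its six fields."""
    fields = [field.strip() for field in line.split("|")]
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    return SelEntry(*fields)


def is_critical(entry: SelEntry) -> bool:
    """Flag asserted transitions to a non-recoverable state (an assumed policy)."""
    return (
        "non-recoverable" in entry.event.lower()
        and entry.direction.lower() == "asserted"
    )


sample = (
    "b1 | 03/17/2018 | 23:36:07 | Processor #0x0d | "
    "Transition to Non-recoverable | Asserted"
)
entry = parse_sel_line(sample)
print(entry.sensor, is_critical(entry))
```

Feeding the whole SEL dump through `parse_sel_line` and keeping only the `is_critical` entries would show whether the processor sensor is the only one complaining.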

P1: Move the blogs DB somewhere else. Not sure why it's hosted on dipper, but I suspect we can find replacement HW in the OSL (even a tiny VM would be better than this).

P2: It's plausible one of the processors on dipper is going bad. More research needed. Recommend a few approaches:
 - Configure a watchdog. It's possible the IPMI has a watchdog that can auto-kick the machine. We probably don't care a ton about 1 reboot / month or similar problems; the mirrors can take that no problem. This may risk data corruption, though; it's not a great thing to happen on the mastermirror.
 - Try to debug the bad CPU and replace it.
 - Replace dipper entirely.
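For the watchdog option, a rough Python sketch of the userspace side, assuming the kernel exposes a standard Linux watchdog device (for example via the ipmi_watchdog driver; whether dipper's BMC supports that is not confirmed here). The `pet_watchdog` function name, the interval, and the round count are placeholders.

```python
import time


def pet_watchdog(device: str = "/dev/watchdog",
                 interval_s: float = 10.0,
                 rounds: int = 6) -> None:
    """Keep the watchdog from firing for `rounds` intervals, then disarm it.

    If this process (or the whole machine) hangs between writes, the
    hardware timer expires and the box is reset.
    """
    # buffering=0 so every write reaches the driver immediately.
    with open(device, "wb", buffering=0) as wd:
        for _ in range(rounds):
            wd.write(b"\0")   # any write resets the countdown
            time.sleep(interval_s)
        # "Magic close": writing 'V' right before close asks the driver to
        # stop the timer, unless the driver was configured with nowayout.
        wd.write(b"V")
```

A real daemon would loop indefinitely rather than for a fixed number of rounds, and the interval has to stay below the configured watchdog timeout.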
Comment 1 Alec Warner (RETIRED) archtester gentoo-dev Security 2018-03-19 20:12:06 UTC
Dipper has crashed a third time in 2018.
Comment 2 Matthew Thode ( prometheanfire ) archtester Gentoo Infrastructure gentoo-dev Security 2018-03-19 20:15:35 UTC
Yep, this is getting annoying as well. We should have some debug data in netdata if we can get the machine back up, but we'll have to see.
Comment 3 Alec Warner (RETIRED) archtester gentoo-dev Security 2018-03-19 23:17:30 UTC
Thinking of moving mastermirror / mastersync services to blackcap (now that we have a ganeti cluster; blackcap is a bit idle.)

I think we need to get the IPMI on dipper working again (I can work with the OSL on this tomorrow.)

We can swap mastermirror roles tomorrow to blackcap once dipper is under control (to avoid IP ownership issues.)

In addition to mirrors, we have to move the mysql database off of dipper.

This will leave dipper idle and we can investigate further without more service disruptions.
Comment 4 Alec Warner (RETIRED) archtester gentoo-dev Security 2018-03-20 01:43:24 UTC
Also proposed: a disk swap between dipper and blackcap.

Seems reasonable. Will propose to OSL tomorrow.

-A
Comment 5 Alec Warner (RETIRED) archtester gentoo-dev Security 2018-03-20 18:40:22 UTC
OSL ticket 30009.

Lance is offsite and it's finals week, so expect O(days) for the disk swap. Other infra members have IPMI details for mastermirror and will endeavor to keep it functioning until the swap can be scheduled.

-A
Comment 6 Alec Warner (RETIRED) archtester gentoo-dev Security 2018-03-20 22:38:15 UTC
Disk swap has happened.

AI antarus: Change the machine names on the LCD so OSL doesn't confuse machines on future maintenance.
AI antarus: Update networking in /etc/udev/rules.d/80-net-name-slot.rules
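Since the disks now boot in a chassis with different NICs, any interface names pinned to hardware need regenerating. A hedged Python sketch of generating such a rule, assuming the rules file pins names by MAC address (the actual contents of 80-net-name-slot.rules on dipper are not shown in this bug); `net_name_rule` is a made-up helper and the MAC in the usage note is a placeholder.

```python
import re

MAC_RE = re.compile(r"^[0-9a-f]{2}(:[0-9a-f]{2}){5}$")


def net_name_rule(mac: str, name: str) -> str:
    """Return one udev rule line that names the NIC with this MAC `name`."""
    mac = mac.lower()  # sysfs reports the address in lowercase
    if not MAC_RE.match(mac):
        raise ValueError(f"not a MAC address: {mac!r}")
    return (
        f'SUBSYSTEM=="net", ACTION=="add", '
        f'ATTR{{address}}=="{mac}", NAME="{name}"'
    )
```

For example, `net_name_rule("AA:BB:CC:00:11:22", "eth1")` yields a line that can be pasted into the rules file once the new chassis's real addresses are known.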
Comment 7 Alec Warner (RETIRED) archtester gentoo-dev Security 2018-03-21 01:07:49 UTC
Disk swap was a success in that we have 1 working machine with dipper's disks in it.

I have hackily fixed dipper's network. I need to spend another 30-60 minutes auditing it; it is really confusing.

-A
Comment 8 Alec Warner (RETIRED) archtester gentoo-dev Security 2018-03-21 01:13:51 UTC
(For clarification, I am going to bed and we will do the audit later in the week; assuming the disk swap continues to yield positive results.)

Rebooting dipper may not be safe, as the net.eth1 service is a 'hack'. But if it dies, feel free to log into the ipmi and start it.

-A
Comment 9 Alec Warner (RETIRED) archtester gentoo-dev Security 2018-03-23 01:44:53 UTC
The kind fellow from the OSL notes that:

"according to the LCD CPU 2 is causing the error"

I think it's possible to configure the BIOS to not boot CPU 2, and we could run on 1 CPU to try to confirm. The machine crashes frequently enough that running on CPU 1 for a while (e.g. a week) without failure should lead to some confidence.

We can then proceed with either staying on a single CPU or ordering a replacement (but note this HW is old; circa 2010, IIRC.)

-A
Comment 10 Alec Warner (RETIRED) archtester gentoo-dev Security 2018-03-26 22:23:25 UTC
(In reply to Alec Warner from comment #9)
> The kind fellow from the OSL notes that:
> 
> "according to the LCD CPU 2 is causing the error"
> 
> I think its possible to configure the BIOS to not boot CPU 2, and we could
> run on 1 CPU to try to confirm. The machine crashes frequently enough that
> running on CPU 1 for a while (e.g. a week) without failure should lead to
> some confidence.
> 
> We can then proceed with either staying on a single CPU or ordering a
> replacement (but note this HW is old; circa 2010, IIRC.)
> 
> -A

I could not disable the CPUs, but I could reduce the number of active cores. Cores are now reduced to 2 per CPU (so the 12-core machine is now running on 4 cores). Seeing if that improves stability.
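To confirm from the OS that the BIOS change took effect, a small Python sketch that counts sockets and distinct physical cores from /proc/cpuinfo text. It assumes the usual x86 layout where each processor block lists "physical id" before "core id"; `count_cores` is an illustrative name.

```python
from typing import Set, Tuple


def count_cores(cpuinfo_text: str) -> Tuple[int, int]:
    """Return (sockets, physical cores) seen in /proc/cpuinfo-style text."""
    cores: Set[Tuple[str, str]] = set()
    physical_id = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "physical id":
            physical_id = value
        elif key == "core id" and physical_id is not None:
            # Hyperthread siblings share (physical id, core id), so the
            # set counts each physical core once.
            cores.add((physical_id, value))
    sockets = {socket for socket, _ in cores}
    return len(sockets), len(cores)


# Usage on the live machine (not run here):
#   with open("/proc/cpuinfo") as f:
#       print(count_cores(f.read()))
```

With the setting described above, the expectation would be 2 sockets and 4 cores in total.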

-A
Comment 11 Alec Warner (RETIRED) archtester gentoo-dev Security 2018-04-27 00:11:18 UTC
I think at this point blackcap (the chassis with bad CPUs) is marked as dead, and Dipper is in the working chassis.

We should consider moving blackcap to the spare bin.

-A
Comment 12 Alec Warner (RETIRED) archtester gentoo-dev Security 2018-07-22 15:35:06 UTC
(In reply to Alec Warner from comment #11)
> I think at this point blackcap (the chassis with bad CPUs) is marked as
> dead, and Dipper is in the working chassis.
> 
> We should consider moving blackcap to the spare bin.
> 
> -A

I'm working on getting two replacement R410s.

We can consider replacing blackcap with one of these and using blackcap for parts.

-A