Dipper has crashed twice recently. It doesn't look great based on the IPMI data:

<@robbat2> IPMI SEL has:
<@robbat2> b1 | 03/17/2018 | 23:36:07 | Processor #0x0d | Transition to Non-recoverable | Asserted

The internet seems to imply this is a hardware or heat issue; will do more research later.

P1: Move the blogs db somewhere else. Not sure why it's hosted on dipper, but I suspect we can find replacement HW in the OSL (even a tiny VM is better than this.)

P2: It's plausible one of the processors on dipper is going bad. More research needed. Recommend a couple of approaches:
- Configure a watchdog. It's possible the IPMI has a watchdog that can auto-kick the machine. We probably don't care a ton about 1 reboot / month or similar problems; the mirrors can take that no problem. This may risk data corruption; it's not a super great thing to happen on the mastermirror.
- Try to debug the bad CPU and replace it.
- Replace dipper entirely.
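For anyone who wants to watch for repeats of the SEL entry quoted above, here is a minimal sketch that parses a pipe-separated line of that shape and flags asserted non-recoverable processor events. The field names and the `is_fatal_cpu_event` filter are my own assumptions for illustration; only the sample line comes from this bug.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class SelEntry:
    record_id: str       # hex record ID, e.g. "b1"
    timestamp: datetime  # date and time fields combined
    sensor: str          # e.g. "Processor #0x0d"
    event: str           # e.g. "Transition to Non-recoverable"
    direction: str       # e.g. "Asserted"


def parse_sel_line(line: str) -> SelEntry:
    """Parse one pipe-separated SEL line of the shape quoted in this bug."""
    fields = [f.strip() for f in line.split("|")]
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    record_id, date, time, sensor, event, direction = fields
    timestamp = datetime.strptime(f"{date} {time}", "%m/%d/%Y %H:%M:%S")
    return SelEntry(record_id, timestamp, sensor, event, direction)


def is_fatal_cpu_event(entry: SelEntry) -> bool:
    """Hypothetical filter: asserted, non-recoverable, on a processor sensor."""
    return (
        entry.sensor.startswith("Processor")
        and "Non-recoverable" in entry.event
        and entry.direction == "Asserted"
    )


entry = parse_sel_line(
    "b1 | 03/17/2018 | 23:36:07 | Processor #0x0d | "
    "Transition to Non-recoverable | Asserted"
)
print(entry.timestamp.isoformat(), is_fatal_cpu_event(entry))
```

Feeding it the full SEL dump line by line would give a quick count of how many crashes line up with this processor event.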
Dipper has crashed a third time in 2018.
Yep, this is getting annoying. We should have some debug data in netdata if we can get the machine back up, but we'll have to see.
Thinking of moving mastermirror / mastersync services to blackcap (now that we have a ganeti cluster; blackcap is a bit idle.) I think we need to get the IPMI on dipper working again (I can work with the OSL on this tomorrow.) We can swap mastermirror roles tomorrow to blackcap once dipper is under control (to avoid IP ownership issues.) In addition to mirrors, we have to move the mysql database off of dipper. This will leave dipper idle and we can investigate further without more service disruptions.
Also proposed: a disk swap between dipper and blackcap. Seems reasonable. Will propose to OSL tomorrow. -A
OSL ticket 30009. Lance is offsite and it's finals week, so expect O(days) for the disk swap. Other infra members have IPMI details for mastermirror and will endeavor to keep it functioning until the swap can be scheduled. -A
Disk swap has happened.

AI antarus: Change the machine names on the LCD so OSL doesn't confuse machines on future maintenance.
AI antarus: Update networking in /etc/udev/rules.d/80-net-name-slot.rules
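On the udev action item: a rough sketch of generating MAC-keyed naming rules, assuming the rules file pins interface names by MAC address (this bug does not say how the file is actually written, and a disk swap would invalidate any MAC-based match). The helper name is hypothetical and the MAC addresses are documentation placeholders, not dipper's real ones.

```python
HEX_DIGITS = set("0123456789abcdef")


def udev_net_rule(mac: str, name: str) -> str:
    """Build one udev rule line that names the NIC with this MAC address."""
    mac = mac.strip().lower()  # sysfs reports MAC addresses in lowercase
    octets = mac.split(":")
    if len(octets) != 6 or any(
        len(o) != 2 or not set(o) <= HEX_DIGITS for o in octets
    ):
        raise ValueError(f"not a MAC address: {mac!r}")
    return (
        f'SUBSYSTEM=="net", ACTION=="add", '
        f'ATTR{{address}}=="{mac}", NAME="{name}"'
    )


# Placeholder MACs; read the real ones from /sys/class/net/*/address
# on the new chassis before writing anything.
nics = {
    "eth0": "00:00:5E:00:53:01",
    "eth1": "00:00:5E:00:53:02",
}
for name, mac in nics.items():
    print(udev_net_rule(mac, name))
```

The output is only meant to be reviewed by hand and pasted into the rules file, not written automatically.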
Disk swap was a success in that we have 1 working machine with dipper's disks in it. I have hackily fixed dipper's network, but I need to spend another 30-60 minutes auditing it; it is really confusing. -A
(For clarification, I am going to bed and we will do the audit later in the week, assuming the disk swap continues to yield positive results.) Rebooting dipper may not be safe, as the net.eth1 service is a 'hack'. But if it dies, feel free to log into the IPMI and start it. -A
The kind fellow from the OSL notes that:

"according to the LCD CPU 2 is causing the error"

I think it's possible to configure the BIOS to not boot CPU 2, and we could run on 1 CPU to try to confirm. The machine crashes frequently enough that running on CPU 1 for a while (e.g. a week) without failure should lead to some confidence.

We can then proceed with either staying on a single CPU or ordering a replacement (but note this HW is old; circa 2010, IIRC.)

-A
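To put a rough number on "some confidence": a sketch that treats crashes as a Poisson process, which is a modelling assumption of mine and not something measured here. The crash rate below is purely hypothetical; substitute whatever the SEL history actually shows.

```python
import math


def p_no_crash(crashes_per_day: float, days: float) -> float:
    """Probability of zero crashes in `days`, assuming Poisson arrivals."""
    if crashes_per_day < 0 or days < 0:
        raise ValueError("rate and duration must be non-negative")
    return math.exp(-crashes_per_day * days)


# Hypothetical rate: one crash every 2 days while the fault is present.
p = p_no_crash(crashes_per_day=0.5, days=7)
print(f"chance a still-faulty box survives a clean week: {p:.3f}")
```

At that hypothetical rate a clean week would be unlikely (about 3%) if the fault were still present, so surviving it is decent evidence. If the real rate is closer to one crash a month, a single week says much less and the test should run longer.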
(In reply to Alec Warner from comment #9)
> The kind fellow from the OSL notes that:
>
> "according to the LCD CPU 2 is causing the error"
>
> I think it's possible to configure the BIOS to not boot CPU 2, and we could
> run on 1 CPU to try to confirm. The machine crashes frequently enough that
> running on CPU 1 for a while (e.g. a week) without failure should lead to
> some confidence.
>
> We can then proceed with either staying on a single CPU or ordering a
> replacement (but note this HW is old; circa 2010, IIRC.)
>
> -A

I could not disable the CPUs, but I could reduce the cores. So cores are reduced to 2 per CPU (so the 12 core machine is now 4 cores.) Seeing if that improves stability.

-A
I think at this point blackcap (the chassis with bad CPUs) is marked as dead, and Dipper is in the working chassis. We should consider moving blackcap to the spare bin. -A
(In reply to Alec Warner from comment #11)
> I think at this point blackcap (the chassis with bad CPUs) is marked as
> dead, and Dipper is in the working chassis.
>
> We should consider moving blackcap to the spare bin.
>
> -A

I'm working on getting two replacement R410s. We can consider replacing blackcap with one of these and using blackcap for parts.

-A