guppy (HP rx3600) has had a few pieces of redundant hardware fail, and they need replacement.

TL;DR: We need replacement items for dead hardware:
1. PSU (into slot 0): "HP AD0957-2198" for rx3600/rx600. About $300 on eBay.
2. SAS HDD 2.5" 72GB (into bay 6). About $100? "HP DG072A9BB7" or "HP DG072A8B54".

More details:

1. PSU-0: "HP AD0957-2198" for rx3600/rx600.

PSU status can be checked over MP as:

'CM' > 'PS':
    Power supplies      State
    -----------------------------------
    Power Supply 0      Fault
    Power Supply 1      Normal

Item type reported by MP FRU lister:

'CM' > 'DF -specific 3'
    FRU Entry # 3 :
    FRU NAME: Power Supply 0 ID:0003
    CHASSIS INFO:
    BOARD INFO:
    Mfg Date/Time : 5574753
    Manufacturer : C&D
    Product Name : BULK POWER SUPPLY
    S/N : R627040079
    Part Number : 0957-2198
    Fru File ID : 10
    Custom Info : 00000000
    Custom Info : 0627
    Custom Info : 04
    Custom Info : 0

Its name is 0957-2198. Looking at available eBay items, it costs around $300.

2. SAS HDD 2.5" 72GB.

Dead HDD:

    # cciss_vol_status -s /dev/cciss/c0d0
    /dev/cciss/c0d0: (Smart Array P600) RAID 5 Volume 0 status: Using interim recovery mode.
      Failed drives:
        connector 1I box 1 bay 6 HP DH072ABAA6 3PD0YA8B00009816N8B5 HPD4

All HDDs:

    # cciss_vol_status -V /dev/cciss/c0d0
    Controller: Smart Array P600
      Board ID: 0x3225103c
      Logical drives: 0
      Running firmware: 1.52
      ROM firmware: 1.52
    /dev/cciss/c0d0: (Smart Array P600) RAID 5 Volume 0 status: Using interim recovery mode.
      Failed drives:
        connector 1I box 1 bay 6 HP DH072ABAA6 3PD0YA8B00009816N8B5 HPD4
      Total of 1 failed physical drives detected on this logical drive.
      Physical drives: 7
        connector 1I box 1 bay 8 HP DG072A8B54 3LB0RFWF00007703FJ9Y HPD7 OK
        connector 1I box 1 bay 7 HP DG072A9BB7 B365P6A072YP0641 HPD0 OK
        connector 1I box 1 bay 5 HP DG072A9BB7 B365P6A074CF0641 HPD0 OK
        connector 2I box 1 bay 4 HP DG072A9BB7 B365P6A073U40641 HPD0 OK
        connector 2I box 1 bay 3 HP DG072A9BB7 B365P6A073KC0641 HPD0 OK
        connector 2I box 1 bay 2 HP DG072A9BB7 B365P6904NHC0635 HPD0 OK
        connector 2I box 1 bay 1 HP DG072A9BB7 B365P6A072RM0641 HPD0 OK
    /dev/cciss/c0d0(Smart Array P600:0): Non-Volatile Cache status:
      Cache configured: Yes
      Total cache memory: 224 MiB
      Cache Ratio: 50% Read / 50% Write
      Read cache memory: 112 MiB
      Write cache memory: 112 MiB
      Write cache enabled: No
      Write cache temporarily disabled
      Temporary disable condition. Posted write operations have been disabled due to the fact that less than 75% of the battery packs are at the sufficient voltage level.

Note: Most HDDs are of the DG072A9BB7 type. I suggest picking the same.
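For anyone who wants to track this over time, the drive lines above can be tallied mechanically. The following is only a rough sketch: it assumes each drive is printed on its own line in the "connector X box N bay N vendor model serial firmware [OK]" layout quoted in this comment, and the names `Drive` and `parse_drives` are made up for illustration.

```python
from collections import Counter
from typing import List, NamedTuple, Optional


class Drive(NamedTuple):
    connector: str
    box: int
    bay: int
    model: str
    serial: str
    firmware: str
    status: Optional[str]  # "OK" for healthy drives, None for lines without a status


def parse_drives(text: str) -> List[Drive]:
    """Pick out 'connector ... box ... bay ...' lines from cciss_vol_status output."""
    drives = []
    for line in text.splitlines():
        tok = line.split()
        # Assumed layout: connector C box B bay N VENDOR MODEL SERIAL FW [STATUS]
        if len(tok) < 10 or tok[0] != "connector" or tok[2] != "box" or tok[4] != "bay":
            continue
        status = tok[10] if len(tok) > 10 else None
        drives.append(Drive(tok[1], int(tok[3]), int(tok[5]), tok[7], tok[8], tok[9], status))
    return drives


# A few lines copied from the output above.
SAMPLE = """\
  Failed drives:
    connector 1I box 1 bay 6 HP DH072ABAA6 3PD0YA8B00009816N8B5 HPD4
  Physical drives: 7
    connector 1I box 1 bay 8 HP DG072A8B54 3LB0RFWF00007703FJ9Y HPD7 OK
    connector 1I box 1 bay 7 HP DG072A9BB7 B365P6A072YP0641 HPD0 OK
"""

drives = parse_drives(SAMPLE)
failed_bays = [d.bay for d in drives if d.status is None]
healthy_models = Counter(d.model for d in drives if d.status == "OK")
print("failed bays:", failed_bays)
print("healthy models:", dict(healthy_models))
```

Run against the full `-V` output, the model tally is what backs the suggestion to buy DG072A9BB7 drives.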
trustees: I see the parts for significantly cheaper: $200USD for the PSU and $30 for the drives, but stock shifts. Any concerns, or can I go ahead with this from the standing Infra hardware repair budget?
Based on the age of the drives and the size of the array (RAID5 over 8x disks), I'm also adding a cold spare for now (I'd personally prefer to migrate to RAID1 SSD, but I gather that's not easy in this chassis).
slyfox: can you look into why smartmontools segfaults? I wanted to know the health of the remaining disks.

    # strace -ff smartctl -d cciss,0 /dev/cciss/c0d0
    ...
    openat(AT_FDCWD, "/dev/cciss/c0d0", O_RDWR|O_NONBLOCK) = 3
    fcntl(3, F_SETFD, FD_CLOEXEC) = 0
    ioctl(3, CCISS_PASSTHRU, 0x60000fffff95b090) = 0
    ioctl(3, CCISS_PASSTHRU, 0x60000fffff95b090) = 0
    --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x4000001000286c2c} ---
(In reply to Robin Johnson from comment #3)
> slyfox: can you look into why smartmontools segfaults? I wanted to know the
> health of the remaining disks
>
> # strace -ff smartctl -d cciss,0 /dev/cciss/c0d0
> ...
> openat(AT_FDCWD, "/dev/cciss/c0d0", O_RDWR|O_NONBLOCK) = 3
> fcntl(3, F_SETFD, FD_CLOEXEC) = 0
> ioctl(3, CCISS_PASSTHRU, 0x60000fffff95b090) = 0
> ioctl(3, CCISS_PASSTHRU, 0x60000fffff95b090) = 0
> --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR,
> si_addr=0x4000001000286c2c} ---

Rebuilt the binary with the current toolchain. That seems to have worked around the failure; smartctl now reports data:

    # smartctl -a -d cciss,0 /dev/cciss/c0d0
    smartctl 6.6 2017-11-05 r4594 [ia64-linux-4.9.95-gentoo] (local build)
    Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

    === START OF INFORMATION SECTION ===
    Vendor: HP
    Product: DG072A8B54
    ...
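Now that smartctl runs, the same query can be repeated for every disk behind the controller by varying N in `-d cciss,N`. Below is a minimal sketch of that loop. The disk count of 8 comes from the array size mentioned earlier in this bug; which index maps to which bay is not something this sketch knows, and `smartctl_argv` and `check_all` are hypothetical helper names.

```python
import subprocess
from typing import List


def smartctl_argv(device: str, disk_index: int, flags: str = "-H") -> List[str]:
    """Build one smartctl command line for a disk behind a cciss controller."""
    if disk_index < 0:
        raise ValueError("cciss disk index must be non-negative")
    return ["smartctl", flags, "-d", f"cciss,{disk_index}", device]


def check_all(device: str = "/dev/cciss/c0d0", disks: int = 8) -> None:
    """Run smartctl once per disk index and print its exit status (needs root)."""
    for n in range(disks):
        argv = smartctl_argv(device, n)
        result = subprocess.run(argv, capture_output=True, text=True)
        print(" ".join(argv), "-> exit", result.returncode)


# Only show the commands here; call check_all() on the host itself.
for n in range(8):
    print(" ".join(smartctl_argv("/dev/cciss/c0d0", n)))
```

`-H` asks only for the overall health assessment; passing `flags="-a"` reproduces the full report shown above.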
(In reply to Robin Johnson from comment #1)
> trustees:
> I see the parts for significantly cheaper: $200USD for the PSU and $30 for
> the drives, but stock shifts.
>
> Any concerns, or can I go ahead with this from the standing Infra hardware
> repair budget?

This was approved in the December Foundation meeting.

-A
Two PSUs have been ordered, tracking number to follow. Cost was back around $300USD for units already in the US (the $200 units were in China).
Also ordered a 4-pack of brand new drives for $40USD, without sleds.
slyfox: the new PSU is installed, but the host status still seems to be offline. Can you please reach out to me via IRC in #gentoo-infra; and/or check the console3 iLO access yourself to the host?
(In reply to Robin Johnson from comment #8)
> slyfox:
> the new PSU is installed, but the host status still seems to be offline.
>
> Can you please reach out to me via IRC in #gentoo-infra; and/or check the
> console3 iLO access yourself to the host?

guppy hung a few days ago. MP/iLO can't access its BMC at all, so I can't check PSU status (all iLO commands time out because the BMC is out of reach). Last time, an on-site reboot by OSU staff helped.

Standard credentials over telnet should work if you want to poke at it as well.

Typical session that should work:

    [guppy] MP> CM
    [guppy] MP:CM> SS
    SS
    The query of the System Processors' State failed.

Unfortunately none of the resets has any effect: PC (cycle), RB, RS, TC. All seem to go through the BMC.
Looks like guppy was power-cycled successfully \o/

PSUs are visible now:

    [guppy] MP:CM> PS
    PS
    For System Processor Status see the SS command.
    System Power state: On
    System Power usage: 567 Watts
    Temperature : Normal

    Power supplies      State
    -----------------------------------------------------------
    Power Supply 0      Normal
    Power Supply 1      Normal

    Fans                State
    -----------------------------------------------------------
    System Fan 1        Normal
    System Fan 2        Normal
    System Fan 3        Normal

Can't get to the HDD status yet, as ext4's journal is broken and the kernel fails to boot. Trying to repair.
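If the `PS` output is ever captured from a logged MP session, a check like the one below could flag a faulty supply automatically. It is only a sketch: it assumes the "Power Supply N <state>" line layout shown in this bug, and `psu_states` and `faulty_psus` are illustrative names, not part of any MP tooling.

```python
import re
from typing import Dict, List

# Matches lines such as "Power Supply 0      Fault"; the table header
# ("Power supplies State") deliberately does not match.
PSU_RE = re.compile(r"^\s*Power Supply (\d+)\s+(\S+)\s*$")


def psu_states(ps_output: str) -> Dict[int, str]:
    """Map PSU slot number to the state string reported by the MP 'PS' command."""
    states = {}
    for line in ps_output.splitlines():
        m = PSU_RE.match(line)
        if m:
            states[int(m.group(1))] = m.group(2)
    return states


def faulty_psus(ps_output: str) -> List[int]:
    """Return the slots whose state is anything other than 'Normal'."""
    return sorted(slot for slot, state in psu_states(ps_output).items() if state != "Normal")


# States as first reported in this bug, before the replacement.
BEFORE = """\
Power supplies State
-----------------------------------
Power Supply 0 Fault
Power Supply 1 Normal
"""

print(psu_states(BEFORE))
print("faulty slots:", faulty_psus(BEFORE))
```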
(In reply to Sergei Trofimovich from comment #10)
> Can't get to hdd status yet as ext4's journal is broken and kernel fails to
> boot. Trying to repair.

Managed to boot guppy into a normal state. We'll still need to replace the broken disk, as ext4 increasingly often locks up read-only due to I/O failures.

Current HDD status:

    # /root/cciss_vol_status-1.12/cciss_vol_status -s /dev/cciss/c0d0
    /dev/cciss/c0d0: (Smart Array P600) RAID 5 Volume 0 status: Using interim recovery mode.
      Failed drives:
        connector 1I box 1 bay 6 HP DH072ABAA6 3PD0YA8B00009816N8B5 HPD4
      Total of 1 failed physical drives detected on this logical drive.
    /dev/cciss/c0d0(Smart Array P600:0): Non-Volatile Cache status:
      Cache configured: Yes
      Total cache memory: 224 MiB
      Cache Ratio: 50% Read / 50% Write
      Read cache memory: 112 MiB
      Write cache memory: 112 MiB
      Write cache enabled: No
      Write cache temporarily disabled
      Temporary disable condition. Posted write operations have been disabled due to the fact that less than 75% of the battery packs are at the sufficient voltage level.

    # /root/cciss_vol_status-1.12/cciss_vol_status -V /dev/cciss/c0d0
    Controller: Smart Array P600
      Board ID: 0x3225103c
      Logical drives: 0
      Running firmware: 1.52
      ROM firmware: 1.52
    /dev/cciss/c0d0: (Smart Array P600) RAID 5 Volume 0 status: Using interim recovery mode.
      Failed drives:
        connector 1I box 1 bay 6 HP DH072ABAA6 3PD0YA8B00009816N8B5 HPD4
      Total of 1 failed physical drives detected on this logical drive.
      Physical drives: 7
        connector 1I box 1 bay 8 HP DG072A8B54 3LB0RFWF00007703FJ9Y HPD7 OK
        connector 1I box 1 bay 7 HP DG072A9BB7 B365P6A072YP0641 HPD0 OK
        connector 1I box 1 bay 5 HP DG072A9BB7 B365P6A074CF0641 HPD0 OK
        connector 2I box 1 bay 4 HP DG072A9BB7 B365P6A073U40641 HPD0 OK
        connector 2I box 1 bay 3 HP DG072A9BB7 B365P6A073KC0641 HPD0 OK
        connector 2I box 1 bay 2 HP DG072A9BB7 B365P6904NHC0635 HPD0 OK
        connector 2I box 1 bay 1 HP DG072A9BB7 B365P6A072RM0641 HPD0 OK
    ...
Looks like we are fine now. Thank you!

    # /root/cciss_vol_status-1.12/cciss_vol_status -V /dev/cciss/c0d0
    Controller: Smart Array P600
      Board ID: 0x3225103c
      Logical drives: 0
      Running firmware: 1.52
      ROM firmware: 1.52
    /dev/cciss/c0d0: (Smart Array P600) RAID 5 Volume 0 status: OK.
      Physical drives: 8
        connector 1I box 1 bay 8 HP DG072A8B54 3LB0RFWF00007703FJ9Y HPD7 OK
        connector 1I box 1 bay 7 HP DG072A9BB7 B365P6A072YP0641 HPD0 OK
        connector 1I box 1 bay 6 HP DG072A9BB7 B365P68036YJ0632 HPD0 OK
        connector 1I box 1 bay 5 HP DG072A9BB7 B365P6A074CF0641 HPD0 OK
        connector 2I box 1 bay 4 HP DG072A9BB7 B365P6A073U40641 HPD0 OK
        connector 2I box 1 bay 3 HP DG072A9BB7 B365P6A073KC0641 HPD0 OK
        connector 2I box 1 bay 2 HP DG072A9BB7 B365P6904NHC0635 HPD0 OK
        connector 2I box 1 bay 1 HP DG072A9BB7 B365P6A072RM0641 HPD0 OK
    /dev/cciss/c0d0(Smart Array P600:0): Non-Volatile Cache status:
      Cache configured: Yes
      Total cache memory: 224 MiB
      Cache Ratio: 50% Read / 50% Write
      Read cache memory: 112 MiB
      Write cache memory: 112 MiB
      Write cache enabled: No
      Write cache temporarily disabled
      Temporary disable condition. Posted write operations have been disabled due to the fact that less than 75% of the battery packs are at the sufficient voltage level.
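Note that the volume is back to OK while the write cache is still reported as disabled. A small check along the following lines could keep an eye on both. This is a sketch under assumptions: it expects the "Volume N status: ..." and "Write cache enabled: Yes/No" lines exactly as they appear in this bug, and `ArrayHealth` and `array_health` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class ArrayHealth(NamedTuple):
    volume_status: Optional[str]         # text after "status:", e.g. "OK."
    write_cache_enabled: Optional[bool]  # None if the line is absent


def array_health(text: str) -> ArrayHealth:
    """Extract the logical volume status and the write cache flag."""
    status = re.search(r"Volume \d+ status:\s*(.+?)\s*$", text, re.MULTILINE)
    cache = re.search(r"Write cache enabled:\s*(Yes|No)\b", text)
    return ArrayHealth(
        status.group(1) if status else None,
        (cache.group(1) == "Yes") if cache else None,
    )


# Two lines copied from the output above.
SAMPLE = """\
/dev/cciss/c0d0: (Smart Array P600) RAID 5 Volume 0 status: OK.
  Write cache enabled: No
"""

health = array_health(SAMPLE)
print(health)
```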