Bug 671420 - guppy (ia64 machine) needs hardware swap: broken PSU and broken HDD
Summary: guppy (ia64 machine) needs hardware swap: broken PSU and broken HDD
Status: RESOLVED FIXED
Alias: None
Product: Gentoo Infrastructure
Classification: Unclassified
Component: Dev box issues
Hardware: All Linux
Importance: Normal normal
Assignee: Gentoo Infrastructure
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-11-18 12:08 UTC by Sergei Trofimovich (RETIRED)
Modified: 2019-12-01 12:15 UTC

See Also:
Package list:
Runtime testing required: ---


Description Sergei Trofimovich (RETIRED) gentoo-dev 2018-11-18 12:08:06 UTC
guppy (HP rx3600) has had a few redundant hardware components fail, and they need replacement.

TL;DR:
  We need replacement items for dead hardware:
    1. PSU (into slot 0): "HP AD0957-2198" for rx3600/rx600. About $300 on eBay.
    2. 2.5" 72GB SAS HDD (into bay 6): "HP DG072A9BB7" or "HP DG072A8B54". About $100?

More details:

1. PSU-0: "HP AD0957-2198" for rx3600/rx600.

  PSU status can be checked over MP as: 'CM' > 'PS':
    Power supplies                State                         
    -----------------------------------
    Power Supply 0                Fault                           
    Power Supply 1                Normal 

  Item type reported by MP FRU lister: 'CM' > 'DF -specific 3'

  FRU Entry #   3 :
  FRU NAME: Power Supply 0 ID:0003

  CHASSIS INFO:

  BOARD INFO:
   Mfg Date/Time      : 5574753
   Manufacturer       : C&D
   Product Name       : BULK POWER SUPPLY
   S/N                : R627040079
   Part Number        : 0957-2198
   Fru File ID        : 10
   Custom Info        : 00000000
   Custom Info        : 0627
   Custom Info        : 04
   Custom Info        : 0

  Its part number is 0957-2198. Judging by the items available on eBay, it costs around $300.
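
  For anyone who wants to check this without reading the table by eye, here is a minimal Python sketch that picks out non-"Normal" supplies from the 'PS' table shown above. The function name `faulty_power_supplies` and the idea of feeding it captured MP output are assumptions for illustration; nothing like this runs on the MP itself.

```python
import re

# Sample text copied from the 'CM' > 'PS' output quoted in this bug.
SAMPLE = """\
Power supplies                State
-----------------------------------
Power Supply 0                Fault
Power Supply 1                Normal
"""


def faulty_power_supplies(ps_output):
    """Return the names of power supplies whose state is not 'Normal'.

    Hypothetical helper: it only understands the two-column table above.
    """
    faulty = []
    for line in ps_output.splitlines():
        # Matches rows such as "Power Supply 0    Fault"; the header row
        # ("Power supplies ... State") does not match this pattern.
        m = re.match(r"\s*(Power Supply \d+)\s+(\S+)\s*$", line)
        if m and m.group(2) != "Normal":
            faulty.append(m.group(1))
    return faulty


print(faulty_power_supplies(SAMPLE))
```

  With the sample above this reports only Power Supply 0, which matches the slot named in the TL;DR.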

2. SAS HDD 2.5" 72GB.

Dead HDD:

# cciss_vol_status -s /dev/cciss/c0d0
/dev/cciss/c0d0: (Smart Array P600) RAID 5 Volume 0 status: Using interim recovery mode. 
  Failed drives:
         connector 1I box 1 bay 6                 HP      DH072ABAA6                           3PD0YA8B00009816N8B5     HPD4

All HDDs:

# cciss_vol_status -V /dev/cciss/c0d0
Controller: Smart Array P600
  Board ID: 0x3225103c
  Logical drives: 0
  Running firmware: 1.52
  ROM firmware: 1.52
/dev/cciss/c0d0: (Smart Array P600) RAID 5 Volume 0 status: Using interim recovery mode. 
  Failed drives:
         connector 1I box 1 bay 6                 HP      DH072ABAA6                           3PD0YA8B00009816N8B5     HPD4

    Total of 1 failed physical drives detected on this logical drive.
  Physical drives: 7
         connector 1I box 1 bay 8                 HP      DG072A8B54                           3LB0RFWF00007703FJ9Y     HPD7 OK
         connector 1I box 1 bay 7                 HP      DG072A9BB7                               B365P6A072YP0641     HPD0 OK
         connector 1I box 1 bay 5                 HP      DG072A9BB7                               B365P6A074CF0641     HPD0 OK
         connector 2I box 1 bay 4                 HP      DG072A9BB7                               B365P6A073U40641     HPD0 OK
         connector 2I box 1 bay 3                 HP      DG072A9BB7                               B365P6A073KC0641     HPD0 OK
         connector 2I box 1 bay 2                 HP      DG072A9BB7                               B365P6904NHC0635     HPD0 OK
         connector 2I box 1 bay 1                 HP      DG072A9BB7                               B365P6A072RM0641     HPD0 OK
/dev/cciss/c0d0(Smart Array P600:0): Non-Volatile Cache status:
                   Cache configured: Yes
                 Total cache memory: 224 MiB
                        Cache Ratio: 50% Read / 50% Write
                  Read cache memory: 112 MiB
                 Write cache memory: 112 MiB
                Write cache enabled: No
   Write cache temporarily disabled
           Temporary disable condition. Posted write operations have
been disabled due to the fact that less than 75% of the
battery packs are at the sufficient voltage level.

Note: Most HDDs are of DG072A9BB7 type. I suggest picking the same.
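
To back up the model suggestion, here is a rough Python sketch that tallies drive models from the `cciss_vol_status -V` listing above. The sample lines are copied from that output with the column whitespace condensed. The regular expression and the names `DRIVE_RE` and `parse_drives` are assumptions made for this example, as is the rule that a drive line without a trailing "OK" counts as failed; that rule is inferred only from the output in this bug, not from the tool's documentation.

```python
import re
from collections import Counter

# Drive lines copied from the -V output above (whitespace condensed).
SAMPLE = """\
  Failed drives:
    connector 1I box 1 bay 6  HP  DH072ABAA6  3PD0YA8B00009816N8B5  HPD4
  Physical drives: 7
    connector 1I box 1 bay 8  HP  DG072A8B54  3LB0RFWF00007703FJ9Y  HPD7 OK
    connector 1I box 1 bay 7  HP  DG072A9BB7  B365P6A072YP0641  HPD0 OK
    connector 1I box 1 bay 5  HP  DG072A9BB7  B365P6A074CF0641  HPD0 OK
    connector 2I box 1 bay 4  HP  DG072A9BB7  B365P6A073U40641  HPD0 OK
    connector 2I box 1 bay 3  HP  DG072A9BB7  B365P6A073KC0641  HPD0 OK
    connector 2I box 1 bay 2  HP  DG072A9BB7  B365P6904NHC0635  HPD0 OK
    connector 2I box 1 bay 1  HP  DG072A9BB7  B365P6A072RM0641  HPD0 OK
"""

# Assumed line layout: location, vendor, model, serial, firmware, and an
# optional status word at the end.
DRIVE_RE = re.compile(
    r"connector\s+(?P<connector>\S+)\s+box\s+(?P<box>\d+)\s+bay\s+(?P<bay>\d+)"
    r"\s+(?P<vendor>\S+)\s+(?P<model>\S+)\s+(?P<serial>\S+)\s+(?P<firmware>\S+)"
    r"(?:\s+(?P<status>\S+))?\s*$"
)


def parse_drives(text):
    """Return one dict per drive line; 'status' is None when absent."""
    drives = []
    for line in text.splitlines():
        m = DRIVE_RE.search(line)
        if m:
            drives.append(m.groupdict())
    return drives


drives = parse_drives(SAMPLE)
# Count models among drives reported OK, and list bays that are not OK.
healthy_models = Counter(d["model"] for d in drives if d["status"] == "OK")
failed_bays = [int(d["bay"]) for d in drives if d["status"] != "OK"]

print(healthy_models.most_common(1))
print(failed_bays)
```

On this sample the most common healthy model is DG072A9BB7 (6 of the 7 working drives) and the only bay flagged is bay 6.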
Comment 1 Robin Johnson archtester Gentoo Infrastructure gentoo-dev Security 2018-12-03 00:00:25 UTC
trustees:
I see the parts for significantly cheaper: $200USD for the PSU and $30 for the drives, but stock shifts.

Any concerns, or can I go ahead with this from the standing Infra hardware repair budget?
Comment 2 Robin Johnson archtester Gentoo Infrastructure gentoo-dev Security 2018-12-03 05:06:30 UTC
Based on the age of the drives and the size of the array (RAID5 over 8 disks), I'm also adding a cold spare for now. (I'd personally prefer to migrate to RAID1 on SSDs, but I gather that's not easy in this chassis.)
Comment 3 Robin Johnson archtester Gentoo Infrastructure gentoo-dev Security 2018-12-03 05:10:50 UTC
slyfox: can you look into why smartmontools segfaults? I wanted to know the health of the remaining disks

# strace -ff smartctl -d cciss,0 /dev/cciss/c0d0
...
openat(AT_FDCWD, "/dev/cciss/c0d0", O_RDWR|O_NONBLOCK) = 3
fcntl(3, F_SETFD, FD_CLOEXEC)           = 0
ioctl(3, CCISS_PASSTHRU, 0x60000fffff95b090) = 0
ioctl(3, CCISS_PASSTHRU, 0x60000fffff95b090) = 0
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x4000001000286c2c} ---
Comment 4 Sergei Trofimovich (RETIRED) gentoo-dev 2018-12-03 08:03:06 UTC
(In reply to Robin Johnson from comment #3)
> slyfox: can you look into why smartmontools segfaults? I wanted to know the
> health of the remaining disks
> 
> # strace -ff smartctl -d cciss,0 /dev/cciss/c0d0
> ...
> openat(AT_FDCWD, "/dev/cciss/c0d0", O_RDWR|O_NONBLOCK) = 3
> fcntl(3, F_SETFD, FD_CLOEXEC)           = 0
> ioctl(3, CCISS_PASSTHRU, 0x60000fffff95b090) = 0
> ioctl(3, CCISS_PASSTHRU, 0x60000fffff95b090) = 0
> --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR,
> si_addr=0x4000001000286c2c} ---

I rebuilt the binary with the current toolchain, which seems to have worked around the failure. It now reports data:

  # smartctl -a -d cciss,0 /dev/cciss/c0d0
  smartctl 6.6 2017-11-05 r4594 [ia64-linux-4.9.95-gentoo] (local build)
  Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

  === START OF INFORMATION SECTION ===
  Vendor:               HP
  Product:              DG072A8B54
  ...
Comment 5 Alec Warner (RETIRED) archtester gentoo-dev Security 2018-12-08 22:39:10 UTC
(In reply to Robin Johnson from comment #1)
> trustees:
> I see the parts for significantly cheaper: $200USD for the PSU and $30 for
> the drives, but stock shifts.
> 
> Any concerns, or can I go ahead with this from the standing Infra hardware
> repair budget?

This was approved in the December Foundation meeting.

-A
Comment 6 Robin Johnson archtester Gentoo Infrastructure gentoo-dev Security 2019-11-01 17:05:34 UTC
Two PSUs have been ordered; tracking number to follow. The cost was back to around $300 USD for units already in the US (the $200 units were in China).
Comment 7 Robin Johnson archtester Gentoo Infrastructure gentoo-dev Security 2019-11-01 20:03:10 UTC
Also ordered a 4-pack of brand new drives for $40USD, without sleds.
Comment 8 Robin Johnson archtester Gentoo Infrastructure gentoo-dev Security 2019-11-08 21:42:21 UTC
slyfox:
the new PSU is installed, but the host status still seems to be offline.

Can you please reach out to me via IRC in #gentoo-infra; and/or check the console3 iLO access yourself to the host?
Comment 9 Sergei Trofimovich (RETIRED) gentoo-dev 2019-11-08 23:45:39 UTC
(In reply to Robin Johnson from comment #8)
> slyfox:
> the new PSU is installed, but the host status still seems to be offline.
> 
> Can you please reach out to me via IRC in #gentoo-infra; and/or check the
> console3 iLO access yourself to the host?

guppy hung a few days ago. MP/iLO can't reach its BMC at all, so I can't check PSU status (all iLO commands time out because the BMC is unreachable).

Last time, an on-site reboot by OSU staff helped. Standard credentials over telnet should work if you want to poke at it as well.

Typical session that should work:

  [guppy] MP> CM
  [guppy] MP:CM> SS
  SS
  The query of the System Processors' State failed.

Unfortunately, none of the resets has any effect: PC (cycle), RB, RS, TC. All of them seem to go through the BMC.
Comment 10 Sergei Trofimovich (RETIRED) gentoo-dev 2019-11-12 19:49:18 UTC
Looks like guppy was power-cycled successfully \o/

PSUs are visible now:

[guppy] MP:CM> PS


PS
For System Processor Status see the SS command.
System Power state: On
System Power usage: 567 Watts
Temperature       : Normal


Power supplies                State
-----------------------------------------------------------
Power Supply 0                Normal
Power Supply 1                Normal

Fans                          State
-----------------------------------------------------------
System Fan 1                  Normal
System Fan 2                  Normal
System Fan 3                  Normal

Can't get to hdd status yet as ext4's journal is broken and kernel fails to boot. Trying to repair.
Comment 11 Sergei Trofimovich (RETIRED) gentoo-dev 2019-11-12 22:19:24 UTC
(In reply to Sergei Trofimovich from comment #10)
> Can't get to hdd status yet as ext4's journal is broken and kernel fails to
> boot. Trying to repair.

Managed to boot guppy into normal state.

We'll still need to replace the broken disks, as ext4 now locks up in read-only mode more frequently due to I/O failures.

Current HDD status:

# /root/cciss_vol_status-1.12/cciss_vol_status -s /dev/cciss/c0d0
/dev/cciss/c0d0: (Smart Array P600) RAID 5 Volume 0 status: Using interim recovery mode.
  Failed drives:
         connector 1I box 1 bay 6                 HP      DH072ABAA6                           3PD0YA8B00009816N8B5     HPD4

    Total of 1 failed physical drives detected on this logical drive.
/dev/cciss/c0d0(Smart Array P600:0): Non-Volatile Cache status:
                   Cache configured: Yes
                 Total cache memory: 224 MiB
                        Cache Ratio: 50% Read / 50% Write
                  Read cache memory: 112 MiB
                 Write cache memory: 112 MiB
                Write cache enabled: No
   Write cache temporarily disabled
           Temporary disable condition. Posted write operations have
been disabled due to the fact that less than 75% of the
battery packs are at the sufficient voltage level.

# /root/cciss_vol_status-1.12/cciss_vol_status -V /dev/cciss/c0d0
Controller: Smart Array P600
  Board ID: 0x3225103c
  Logical drives: 0
  Running firmware: 1.52
  ROM firmware: 1.52
/dev/cciss/c0d0: (Smart Array P600) RAID 5 Volume 0 status: Using interim recovery mode.
  Failed drives:
         connector 1I box 1 bay 6                 HP      DH072ABAA6                           3PD0YA8B00009816N8B5     HPD4

    Total of 1 failed physical drives detected on this logical drive.
  Physical drives: 7
         connector 1I box 1 bay 8                 HP      DG072A8B54                           3LB0RFWF00007703FJ9Y     HPD7 OK
         connector 1I box 1 bay 7                 HP      DG072A9BB7                               B365P6A072YP0641     HPD0 OK
         connector 1I box 1 bay 5                 HP      DG072A9BB7                               B365P6A074CF0641     HPD0 OK
         connector 2I box 1 bay 4                 HP      DG072A9BB7                               B365P6A073U40641     HPD0 OK
         connector 2I box 1 bay 3                 HP      DG072A9BB7                               B365P6A073KC0641     HPD0 OK
         connector 2I box 1 bay 2                 HP      DG072A9BB7                               B365P6904NHC0635     HPD0 OK
         connector 2I box 1 bay 1                 HP      DG072A9BB7                               B365P6A072RM0641     HPD0 OK
...
Comment 12 Sergei Trofimovich (RETIRED) gentoo-dev 2019-12-01 12:15:47 UTC
Looks like we are fine now. Thank you!

# /root/cciss_vol_status-1.12/cciss_vol_status -V /dev/cciss/c0d0
Controller: Smart Array P600
  Board ID: 0x3225103c
  Logical drives: 0
  Running firmware: 1.52
  ROM firmware: 1.52
/dev/cciss/c0d0: (Smart Array P600) RAID 5 Volume 0 status: OK.
  Physical drives: 8
         connector 1I box 1 bay 8                 HP      DG072A8B54                           3LB0RFWF00007703FJ9Y     HPD7 OK
         connector 1I box 1 bay 7                 HP      DG072A9BB7                               B365P6A072YP0641     HPD0 OK
         connector 1I box 1 bay 6                 HP      DG072A9BB7                               B365P68036YJ0632     HPD0 OK
         connector 1I box 1 bay 5                 HP      DG072A9BB7                               B365P6A074CF0641     HPD0 OK
         connector 2I box 1 bay 4                 HP      DG072A9BB7                               B365P6A073U40641     HPD0 OK
         connector 2I box 1 bay 3                 HP      DG072A9BB7                               B365P6A073KC0641     HPD0 OK
         connector 2I box 1 bay 2                 HP      DG072A9BB7                               B365P6904NHC0635     HPD0 OK
         connector 2I box 1 bay 1                 HP      DG072A9BB7                               B365P6A072RM0641     HPD0 OK
/dev/cciss/c0d0(Smart Array P600:0): Non-Volatile Cache status:
                   Cache configured: Yes
                 Total cache memory: 224 MiB
                        Cache Ratio: 50% Read / 50% Write
                  Read cache memory: 112 MiB
                 Write cache memory: 112 MiB
                Write cache enabled: No
   Write cache temporarily disabled
           Temporary disable condition. Posted write operations have
been disabled due to the fact that less than 75% of the
battery packs are at the sufficient voltage level.
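
For future checks, the difference between the earlier and the current output comes down to the logical volume status line. Below is a small Python sketch that extracts that status. The names `STATUS_RE`, `volume_status` and `is_healthy` are hypothetical, and the line format is assumed only from the `cciss_vol_status` output quoted in this bug.

```python
import re

# Status lines copied from the outputs in this bug.
BEFORE = ("/dev/cciss/c0d0: (Smart Array P600) RAID 5 Volume 0 status: "
          "Using interim recovery mode.")
AFTER = "/dev/cciss/c0d0: (Smart Array P600) RAID 5 Volume 0 status: OK."

# Assumed format: "<device>: (<controller>) RAID <n> Volume <n> status: <text>."
STATUS_RE = re.compile(
    r"^(?P<device>\S+): \((?P<controller>[^)]+)\) (?P<level>RAID \d+) "
    r"Volume (?P<volume>\d+) status: (?P<status>.+?)\.?\s*$"
)


def volume_status(line):
    """Return the status text without the trailing period, or None."""
    m = STATUS_RE.match(line)
    return m.group("status") if m else None


def is_healthy(line):
    """True only when the logical volume reports plain 'OK'."""
    return volume_status(line) == "OK"


print(volume_status(BEFORE), is_healthy(BEFORE))
print(volume_status(AFTER), is_healthy(AFTER))
```

Note that this looks at the volume status only. The write cache is still reported as temporarily disabled in the output above, and this sketch does not check for that.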