This in fact did not start after the smartmontools update but after a reboot into gentoo-sources-6.1.27-r1 (from gentoo-sources-6.1.19). I have three identical NVMe disks in the machine and all of them behave the same way:

# smartctl -a /dev/nvme0
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.27-gentoo-r1] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       XPG GAMMIX S70
Serial Number:                      XXXXXXXXXXXX
Firmware Version:                   3.2.F.P7
PCI Vendor ID:                      0x1dbe
PCI Vendor Subsystem ID:            0x5236
IEEE OUI Identifier:                0x00abcd
Controller ID:                      0
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          2,048,408,248,320 [2.04 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            494e4e 4f47524954
Local Time is:                      Mon May 15 07:31:25 2023 CEST
Firmware Updates (0x0e):            7 Slots
Optional Admin Commands (0x0007):   Security Format Frmw_DL
Optional NVM Commands (0x0014):     DS_Mngmt Sav/Sel_Feat
Log Page Attributes (0x0e):         Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     110 Celsius
Critical Comp. Temp. Threshold:     120 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     3.50W       -        -    0  0  0  0        5       5
 1 +     3.30W       -        -    1  1  1  1       50     100
 2 +     2.80W       -        -    2  2  2  2       50     200
 3 -   0.1000W       -        -    3  3  3  3      500    5000
 4 -   0.0080W       -        -    4  4  4  4     2000   60000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 -     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        45 Celsius
Available Spare:                    100%
Available Spare Threshold:          25%
Percentage Used:                    7%
Data Units Read:                    1,032,640,168 [528 TB]
Data Units Written:                 109,588,566 [56.1 TB]
Host Read Commands:                 3,229,572,930
Host Write Commands:                1,736,867,904
Controller Busy Time:               0
Power Cycles:                       0
Power On Hours:                     0
Unsafe Shutdowns:                   0
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Read 16 entries from Error Information Log failed: NVMe Status 0x13

There is nothing in dmesg or in the system log.
I just tried rebooting into 6.1.28 and the issue persists. I also tried rebooting back into 6.1.19 and the issue is still there, so it might not be kernel related. I also tried smartmontools-9999, but the issue is the same. I even created an ebuild for smartmontools-7.2, and the issue persists there too. It still seems weird to me that all three disks would stop providing the data at exactly the same time. Also, reading the error log with `nvme error-log /dev/nvme0n1` works just fine.

smartctl links against glibc and gcc. gcc was last updated in January, before this issue occurred, so it is probably unrelated. glibc was last updated on May 12. The issue appeared on May 14, after the reboot.
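For checking several disks or saved outputs at once, here is a minimal Python sketch that looks for the failure line in `smartctl -a` output. The function name and the regular expression are my own illustration and not part of smartmontools; it assumes the message has exactly the wording quoted above.

```python
import re

# Pattern for the failure line as it appears in the smartctl output quoted above.
# Assumption: the wording is the same across the smartctl versions tried here.
_FAILURE_RE = re.compile(
    r"Read (\d+) entries from Error Information Log failed: "
    r"NVMe Status (0x[0-9a-fA-F]+)"
)


def find_error_log_failure(smartctl_output):
    """Return (entries_requested, status_code) if the failure line is present, else None."""
    match = _FAILURE_RE.search(smartctl_output)
    if match is None:
        return None
    # The status is printed in hex, so parse it with base 16.
    return int(match.group(1)), int(match.group(2), 16)


if __name__ == "__main__":
    sample = "Read 16 entries from Error Information Log failed: NVMe Status 0x13"
    print(find_error_log_failure(sample))
```

Feeding it the captured output of each of the three devices would show whether they all report the same status code.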
Would you mind reporting this upstream? I don't have any guesses yet :(
Tbh I don't know what the source of the issue is. It started right after the reboot and did not appear before, but rebooting back into the original kernel did not make it go away. Upgrading/downgrading smartmontools had no effect either. IMO the only thing I haven't tried yet is downgrading glibc back to 2.36-r7. I just wonder why Zabbix did not start reporting this issue after the glibc upgrade, but only after the reboot. I'll try the glibc downgrade now.
I just downgraded glibc to 2.36-r7 and rebooted the server, and the issue persists...
Reported to smartmontools.