906370 – sys-apps/smartmontools-7.3 fails to read nvme error log: Read 16 entries from Error Information Log failed: NVMe Status 0x13

Status:	CONFIRMED

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	Current packages
Hardware:	All Linux

Importance:	Normal normal
Assignee:	Gentoo's Team for Core System packages

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2023-05-15 05:37 UTC by Miroslav Šulc
Modified:	2023-05-21 06:27 UTC
CC List:	1 user

See Also:	https://github.com/smartmontools/smartmontools/issues/193
Package list:
Runtime testing required:	---

Description Miroslav Šulc gentoo-dev

2023-05-15 05:37:45 UTC

this in fact did not start after an update of smartmontools but after a reboot into gentoo-sources-6.1.27-r1 (from gentoo-sources-6.1.19). i have three identical nvme disks in the machine and all of them behave the same way:

# smartctl -a /dev/nvme0
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.27-gentoo-r1] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       XPG GAMMIX S70
Serial Number:                      XXXXXXXXXXXX
Firmware Version:                   3.2.F.P7
PCI Vendor ID:                      0x1dbe
PCI Vendor Subsystem ID:            0x5236
IEEE OUI Identifier:                0x00abcd
Controller ID:                      0
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          2,048,408,248,320 [2.04 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            494e4e 4f47524954
Local Time is:                      Mon May 15 07:31:25 2023 CEST
Firmware Updates (0x0e):            7 Slots
Optional Admin Commands (0x0007):   Security Format Frmw_DL
Optional NVM Commands (0x0014):     DS_Mngmt Sav/Sel_Feat
Log Page Attributes (0x0e):         Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     110 Celsius
Critical Comp. Temp. Threshold:     120 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     3.50W       -        -    0  0  0  0        5       5
 1 +     3.30W       -        -    1  1  1  1       50     100
 2 +     2.80W       -        -    2  2  2  2       50     200
 3 -   0.1000W       -        -    3  3  3  3      500    5000
 4 -   0.0080W       -        -    4  4  4  4     2000   60000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 -     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        45 Celsius
Available Spare:                    100%
Available Spare Threshold:          25%
Percentage Used:                    7%
Data Units Read:                    1,032,640,168 [528 TB]
Data Units Written:                 109,588,566 [56.1 TB]
Host Read Commands:                 3,229,572,930
Host Write Commands:                1,736,867,904
Controller Busy Time:               0
Power Cycles:                       0
Power On Hours:                     0
Unsafe Shutdowns:                   0
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Read 16 entries from Error Information Log failed: NVMe Status 0x13


there is nothing in dmesg or in the system log.
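
For anyone who wants to watch for this failure from a script or a monitoring check, here is a minimal sketch that finds the failing line in `smartctl -a` output and pulls out the entry count and the status value. The names `find_error_log_failure` and `split_status` are hypothetical, not part of smartmontools. The split into status code type and status code assumes the printed value follows the NVMe completion status layout without the phase bit (low 8 bits are the status code, the next 3 bits are the status code type); that is an assumption about how smartctl reports it, not something confirmed in this report.

```python
import re

# Matches the failure line exactly as smartctl printed it in the output above.
FAILURE_RE = re.compile(
    r"Read (\d+) entries from Error Information Log failed: "
    r"NVMe Status (0x[0-9a-fA-F]+)"
)


def find_error_log_failure(smartctl_output):
    """Return (entries_requested, status) if the failure line is present, else None."""
    match = FAILURE_RE.search(smartctl_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2), 16)


def split_status(status):
    """Split a status value into (status_code_type, status_code).

    Assumption: bits 0-7 hold the status code and bits 8-10 hold the
    status code type, as in the NVMe completion status field.
    """
    return (status >> 8) & 0x7, status & 0xFF


sample = "Read 16 entries from Error Information Log failed: NVMe Status 0x13"
print(find_error_log_failure(sample))  # (16, 19)
print(split_status(0x13))              # (0, 19)
```

Under that assumption, 0x13 has status code type 0 (the generic command status group) and status code 0x13. Its exact meaning should be looked up in the NVMe specification rather than taken from this sketch.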

Comment 1 Miroslav Šulc gentoo-dev

2023-05-21 05:27:01 UTC

i just tried rebooting into 6.1.28 and the issue persists. i also tried rebooting back into 6.1.19 and the issue is still there, so it might not be kernel related. i also tried smartmontools-9999 but the issue is the same. i even created an ebuild for smartmontools-7.2, but the issue persists there as well. it still seems weird to me that all three disks would start failing to provide the data at exactly the same time. also, reading the error log using `nvme error-log /dev/nvme0n1` works just fine.
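
The comparison between smartctl and `nvme error-log` can be repeated across all disks with a small script. This is only a sketch: it uses the two commands quoted in this report, while the helper names `run_tool` and `compare_tools` and the device names `/dev/nvme1` and `/dev/nvme2` are assumptions (only `/dev/nvme0` is named here). It also assumes the namespace device is the controller name plus `n1`, and it needs to run as root to get real results.

```python
import subprocess


def run_tool(argv):
    """Run a command; return (exit_code, combined stdout+stderr).

    Returns exit code 127 and empty output if the binary is not installed.
    """
    try:
        proc = subprocess.run(argv, capture_output=True, text=True)
    except FileNotFoundError:
        return 127, ""
    return proc.returncode, proc.stdout + proc.stderr


def compare_tools(controllers, runner=run_tool):
    """For each controller, record whether smartctl reports the error-log
    read failure and whether `nvme error-log` exits successfully."""
    results = {}
    for ctrl in controllers:
        _, smart_out = runner(["smartctl", "-a", ctrl])
        # Assumption: the namespace device is the controller name plus "n1".
        nvme_rc, _ = runner(["nvme", "error-log", ctrl + "n1"])
        results[ctrl] = {
            "smartctl_error_log_failed": "Error Information Log failed" in smart_out,
            "nvme_cli_ok": nvme_rc == 0,
        }
    return results


if __name__ == "__main__":
    # Hypothetical device list for a machine with three NVMe disks.
    for dev, res in compare_tools(["/dev/nvme0", "/dev/nvme1", "/dev/nvme2"]).items():
        print(dev, res)
```

If smartctl fails on every disk while nvme-cli succeeds on the same disks, that points at how the log page is requested rather than at the drives being unable to return it.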

smartctl links against glibc and gcc. gcc was last updated in january, before this issue occurred, so it is probably unrelated. glibc was last updated on may 12, and the issue appeared on may 14 after the reboot.

Comment 2 Sam James archtester

2023-05-21 05:28:13 UTC

Would you mind reporting this upstream? I don't have any guesses yet :(

Comment 3 Miroslav Šulc gentoo-dev

2023-05-21 06:06:56 UTC

tbh i don't know what the source of the issue is. it started right after a reboot and had not appeared before, but rebooting back into the original kernel did not make it go away. upgrading/downgrading smartmontools had no effect either. imo the only thing i haven't tried yet is downgrading glibc back to 2.36-r7. i just wonder why zabbix started reporting this issue only after the reboot and not right after the glibc upgrade. i'll try the glibc downgrade now.

Comment 4 Miroslav Šulc gentoo-dev

2023-05-21 06:17:27 UTC

i just downgraded glibc to 2.36-r7, rebooted the server, and the issue persists...

Comment 5 Miroslav Šulc gentoo-dev

2023-05-21 06:27:30 UTC

reported to smartmontools