Bug 565104 - sys-kernel/(vanilla-sources|gentoo-sources)-3.18+ breaks PCI-E devices on Lenovo nx360 m5 platform
Status: RESOLVED UPSTREAM
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: [OLD] Core system
Hardware: AMD64 Linux
Importance: Normal major
Assignee: Gentoo Kernel Bug Wranglers and Kernel Maintainers
Reported: 2015-11-08 06:01 UTC by Adam Tygart
Modified: 2017-03-02 04:11 UTC
Description Adam Tygart 2015-11-08 06:01:35 UTC
For the past couple of weeks, I have been trying to get Gentoo running with a recent kernel on the Lenovo nx360 m5 platform with 40GbE Mellanox ConnectX-3 cards in the PCI-E slot. Earlier kernel versions drive the cards correctly, so I ran a git bisect to find the commit that broke PCI-E support for this hardware, and tracked it to the commit linked below. Reverting that commit's single-line change in drivers/pci/probe.c (at least on the current head of Linus' tree, 4.3+, 3e069adabc9487b5e28065a17e6a228da3412dfd) restores functionality to the NICs. However, I am sure this is not the correct solution. I believe this platform either needs a quirk defined, or Lenovo needs to fix their BIOS/ACPI tables. If there is anything else I should test, please let me know.

Breaking commit can be viewed here:
https://github.com/torvalds/linux/commit/7a1562d4f2d01721ad07c3a326db7512077ceea9


Reproducible: Always

Steps to Reproduce:
1. Put Mellanox ConnectX-3 Pro card in exLOM->pci-e riser of Lenovo nx360 m5 blade.
2. Boot Linux with a kernel newer than 3.18-rc5.
Actual Results:  
The kernel fails to initialize the NIC firmware; an NMI (reason 2d or 3d) is raised at the point where initialization should take place. About 5 minutes later, the system resets.

Expected Results:  
The NIC firmware loads and the card functions as a NIC should. The kernel doesn't spontaneously reboot.

Working:
mlx4_core: Mellanox ConnectX core driver v2.2-1 (Feb, 2014)
mlx4_core: Initializing 0000:06:00.0
mlx4_core 0000:06:00.0: PCIe link speed is 8.0GT/s, device supports 8.0GT/s
mlx4_core 0000:06:00.0: PCIe link width is x8, device supports x8

Not Working:
mlx4_core: Mellanox ConnectX core driver v2.2-1 (Feb, 2014)
mlx4_core: Initializing 0000:06:00.0
Uhhuh. NMI received for unknown reason 3d on CPU 0.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
mlx4_core 0000:06:00.0: command 0xfff failed: fw status = 0x1
mlx4_core 0000:06:00.0: MAP_FA command failed, aborting
mlx4_core 0000:06:00.0: Failed to start FW, aborting
mlx4_core 0000:06:00.0: Failed to init fw, aborting.
<time passes (approx 5 minutes)>
*reboot*
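
The two logs above differ in a few recognizable lines, so a captured boot log can be sorted into "good" or "bad" mechanically, which is handy when stepping through a bisect. The following is only a sketch: the function names classify_boot_log and bisect_exit_code are hypothetical, the patterns are taken from the log excerpts in this report and may not match other kernel versions, and capturing the log from the test machine is left out entirely. The exit codes follow the git bisect run convention (0 for good, 1 for bad, 125 for "cannot tell, skip").

```python
import re

# Failure signatures copied from the "Not Working" log in this report.
FAILURE_PATTERNS = [
    re.compile(r"NMI received for unknown reason [0-9a-f]{2}"),
    re.compile(r"mlx4_core .*MAP_FA command failed"),
    re.compile(r"mlx4_core .*Failed to (start FW|init fw)"),
]

# Success signature copied from the "Working" log.
SUCCESS_PATTERN = re.compile(r"mlx4_core .*PCIe link width is x\d+")


def classify_boot_log(text):
    """Return 'bad', 'good', or 'unknown' for the text of a kernel log."""
    lines = text.splitlines()
    # Any failure signature wins, even if a success line is also present.
    if any(p.search(line) for line in lines for p in FAILURE_PATTERNS):
        return "bad"
    if any(SUCCESS_PATTERN.search(line) for line in lines):
        return "good"
    return "unknown"


def bisect_exit_code(verdict):
    """Map a verdict to the exit status that `git bisect run` expects."""
    # 0 = good, 1 = bad, 125 = skip this commit (verdict could not be determined).
    return {"good": 0, "bad": 1}.get(verdict, 125)
```

A wrapper script could read the captured log, call classify_boot_log on its text, and exit with bisect_exit_code of the result. Returning "unknown" instead of guessing keeps a truncated or empty log from mislabeling a commit.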

I've tested with the ConnectX-3 Pro card in the exLOM->PCI-E riser throughout; the failure happens with or without a ConnectX-3 40GbE card in a proper PCI-E slot.

As I don't have a proper install at the moment, I am leaving out the emerge --info. Come Monday, I'll finish the install with my patched kernel and can provide it then, if desired.
Comment 1 Mike Pagano gentoo-dev 2017-03-02 00:40:46 UTC
Is this still an issue with later kernels?
Comment 2 Adam Tygart 2017-03-02 04:11:35 UTC
I worked with the product vendor to get a new UEFI/BIOS version that correctly enumerates the PCI-E devices as expected by the kernel. I had completely forgotten this bug was still open.
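
For anyone checking a similar machine after a firmware update, one quick test is whether the cards show up on the PCI bus and whether a driver actually bound to them. The sketch below reads the standard sysfs layout under /sys/bus/pci/devices, where each device directory has a `vendor` file and, once a driver is bound, a `driver` symlink. The function name bound_drivers is hypothetical, 0x15b3 is the Mellanox PCI vendor ID, and a bound driver only suggests that probing succeeded; it does not confirm that the NIC firmware initialized.

```python
import os
from pathlib import Path

MELLANOX_VENDOR_ID = "0x15b3"  # PCI vendor ID assigned to Mellanox


def bound_drivers(vendor_id=MELLANOX_VENDOR_ID,
                  sysfs_root="/sys/bus/pci/devices"):
    """Map each PCI address of the given vendor to its driver name, or None."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return {}
    result = {}
    for dev in sorted(root.iterdir()):
        try:
            vendor = (dev / "vendor").read_text().strip().lower()
        except OSError:
            continue  # not a device directory, or unreadable
        if vendor != vendor_id.lower():
            continue
        driver_link = dev / "driver"
        if driver_link.is_symlink():
            # The symlink target's last path component is the driver name.
            result[dev.name] = os.path.basename(os.readlink(driver_link))
        else:
            result[dev.name] = None  # enumerated, but no driver bound
    return result
```

On a healthy system one would expect an entry such as 0000:06:00.0 mapped to mlx4_core; an address mapped to None means the device was enumerated but no driver is attached.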