For the past couple of weeks, I have been attempting to get Gentoo with a recent kernel running on the Lenovo nx360 m5 platform with 40GbE Mellanox ConnectX-3 cards in the PCI-E slot. I found that earlier versions of the kernel had the cards working correctly, and have done a git bisect to track down the commit that broke kernel support of PCI-E for this hardware. I have tracked it to the commit linked below.

Reverting the single-line change in drivers/pci/probe.c (at least in the current head of Linus' tree {4.3+ -- 3e069adabc9487b5e28065a17e6a228da3412dfd}) has restored functionality to the NICs. However, I am sure this is not the correct solution. I believe this platform either needs a quirk defined to fix it, or perhaps Lenovo needs to fix their BIOS/ACPI tables. If you've got any other things I should test, please let me know.

Breaking commit can be viewed here:
https://github.com/torvalds/linux/commit/7a1562d4f2d01721ad07c3a326db7512077ceea9

Reproducible: Always

Steps to Reproduce:
1. Put Mellanox ConnectX-3 Pro card in exLOM->pci-e riser of Lenovo nx360 m5 blade.
2. Start Linux with a kernel newer than 3.18-rc5.

Actual Results:
Kernel fails to initialize the firmware of the NIC; an NMI (reason 2d or 3d) occurs at the point where initialization is supposed to happen. ~5 minutes later, kernel registers a reset.

Expected Results:
Firmware of the NIC loads, and the card functions as a NIC should. Kernel doesn't spontaneously reboot.

Working:
mlx4_core: Mellanox ConnectX core driver v2.2-1 (Feb, 2014)
mlx4_core: Initializing 0000:06:00.0
mlx4_core 0000:06:00.0: PCIe link speed is 8.0GT/s, device supports 8.0GT/s
mlx4_core 0000:06:00.0: PCIe link width is x8, device supports x8

Not Working:
mlx4_core: Mellanox ConnectX core driver v2.2-1 (Feb, 2014)
mlx4_core: Initializing 0000:06:00.0
Uhhuh. NMI received for unknown reason 3d on CPU 0.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
mlx4_core 0000:06:00.0: command 0xfff failed: fw status = 0x1
mlx4_core 0000:06:00.0: MAP_FA command failed, aborting
mlx4_core 0000:06:00.0: Failed to start FW, aborting
mlx4_core 0000:06:00.0: Failed to init fw, aborting.
<time passes (approx 5 minutes)>
*reboot*

I've tested with the ConnectX-3 Pro card in the exLOM->pci-e riser throughout; the failure happens with or without a ConnectX-3 40GbE card in a proper PCI-E slot.

As I don't have a proper install at the moment, I am leaving out the emerge --info. Come Monday, I'll finish the install with my patched kernel and can provide it then, if desired.
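For anyone re-testing this on another kernel or firmware build, the "Working" and "Not Working" logs above can be told apart mechanically. What follows is only a rough sketch, not part of the original report: the function name classify_mlx4_log is hypothetical, and the marker strings are simply copied from the log excerpts in this report, so they may not match other mlx4_core driver versions.

```python
import re

# Failure markers copied verbatim from the "Not Working" log in this report.
# Assumption: other driver versions may word these messages differently.
FAILURE_MARKERS = (
    "MAP_FA command failed",
    "Failed to start FW",
    "Failed to init fw",
)

# Captures the two-hex-digit reason code from the unknown-NMI message.
NMI_PATTERN = re.compile(r"NMI received for unknown reason ([0-9a-f]{2})")

# In the "Working" log the driver gets as far as reporting the PCIe link width.
LINK_OK_PATTERN = re.compile(r"mlx4_core \S+: PCIe link width is")


def classify_mlx4_log(log_text):
    """Return (status, nmi_codes) for a kernel log excerpt.

    status is 'failed', 'working', or 'unknown'; nmi_codes lists any
    unknown-NMI reason codes found in the text.
    """
    nmi_codes = NMI_PATTERN.findall(log_text)
    if any(marker in log_text for marker in FAILURE_MARKERS):
        return "failed", nmi_codes
    if LINK_OK_PATTERN.search(log_text):
        return "working", nmi_codes
    return "unknown", nmi_codes
```

The function takes the log as a string, so it can be fed saved dmesg output from each kernel tried during a bisect. A result of 'unknown' only means none of the markers appeared, for example because the driver was never loaded.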
Is this still an issue with later kernels?
I worked with the product vendor to get a new UEFI/BIOS version that correctly enumerates the pci-e devices as expected by the kernel. I had completely forgotten this bug was still open.