HL15 Boot Drive PCI-E Issue

Good day!

I’ve had my HL15 for a few months and things have been running smoothly. I’ve made no new changes/updates lately, nor has anything auto-updated.

I’ve got one of the HL15s with the X11SPH-nCTF motherboard, and while trying to access an SMB share tonight, I wasn’t able to connect to my data pool. Stepping into the IPMI virtual console, I could see an IO failure notice for the boot pool. The system quickly rebooted and has not booted since.

In the boot event logs, I got nothing but error EFI 03051002 (DXE BS Driver Unrecognized), and in the system logs I have "PCIE removed on CPU PCIE Slot 4 - Assertion", with no other recent entries. The BMC heartbeat is alive, and so is the SAS LED, but the m.2 LED at DLE1 is NOT blinking.

My naive instinct took me to a simple drive failure, particularly since I’ve been out of town and haven’t been doing much to/with this server, but I want to know if there’s anything worth trying before I try replacing that m.2.

That boot drive has a TrueNAS install on it, and I’ve got 6 data drives in a RAID-Z2 pool. No other PCIe slots are filled. I can’t think of any other relevant configuration notes or hardware, but I'm happy to provide more if necessary.

EDIT: Worth noting, I did reseat that drive to no effect. No (visible) physical issues either.

The trick is knowing what the system thinks CPU PCIE Slot 4 is. If this message follows the numbering on the board, there is no Slot 4. The m.2 is connected through the PCH, so if Slot 4 is related to the m.2, there would seem to be a communication problem between the CPU and the PCH. The other options, based on the block diagram in the manual, are the CPU connection to the on-board HBA and the connection to the Oculink ports. My guess is that it is the connection to the on-board HBA. Someone else here may know. Perhaps it can be determined from the UEFI.

Those two errors seem to point to either a) a problem loading the driver for the HBA from the m.2 SSD, or b) the driver for the HBA having a problem communicating with it when it tries to load. Since you’re not trying to boot from the HBA-connected drives, I’m not quite sure why it halts the boot process.

FWIW, rough order I’d try things:

  • Try a CMOS reset.
  • Does the UEFI seem to be showing settings for the on-board HBA? Does it list the drives connected to the HBA (which slots are your six spinning drives in, 1 to 8 or 9 to 15?)? If not, and it’s not an electrical issue, maybe re-flashing the motherboard BIOS and/or the firmware related to the HBA might help.
  • Do you have another computer that will accept the m.2? Try either to boot from it in another PC or to mount it as a non-bootable drive. An external USB/m.2 enclosure would also work (e.g., Amazon.com: Xiaobi M.2 NVMe SSD Enclosure, USB 3.2 Gen 2 (10 Gbps) to NVMe M-Key/(B+M) Key PCIe NVMe Adapter).
  • If you can mount it in another PC, examine the SMART data.
  • Can you use the USB ports on the HL15? Those are also connected through the PCH.
  • If you can use the USB ports, then it is likely just the m.2. Try booting an OS off USB (a Linux live image or similar) so you can run things like lspci, mount the m.2, etc. to investigate. If the drive turns out to be dead, install TrueNAS to a new m.2 drive and restore your config backup.
  • If you can’t use the USB ports, then it may be something more than the m.2. You might have to try re-seating the CPU, or booting off something like an NVMe carrier card in one of the PCIe slots connected directly to the CPU (SLOTS 3, 5, or 6), to try to ascertain the nature of the hardware failure and next steps for service.
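
If you do get a Linux live environment booted from USB, the lspci check above could be sketched roughly like this in Python. This is only a minimal sketch: it assumes the m.2 is an NVMe drive (a SATA m.2 would not show up this way), that lspci from pciutils is on the live image, and the function name find_nvme_controllers is just something I made up for illustration.

```python
import shutil
import subprocess

# Class name lspci prints for NVMe controllers (PCI class 0108).
NVME_CLASS = "Non-Volatile memory controller"


def find_nvme_controllers(lspci_text):
    """Return the lspci output lines that describe NVMe controllers."""
    return [
        line.strip()
        for line in lspci_text.splitlines()
        if NVME_CLASS.lower() in line.lower()
    ]


if __name__ == "__main__":
    if shutil.which("lspci") is None:
        print("lspci not found; install pciutils on the live system")
    else:
        # Plain `lspci` lists one device per line: address, class, description.
        result = subprocess.run(
            ["lspci"], capture_output=True, text=True, check=False
        )
        found = find_nvme_controllers(result.stdout)
        if found:
            print("NVMe controller(s) visible on the PCIe bus:")
            for line in found:
                print("  " + line)
        else:
            print("No NVMe controller listed; the m.2 is not enumerating")
```

If nothing is listed while the USB ports and everything else behind the PCH work, that would lean toward the drive (or its slot) rather than the CPU-to-PCH link.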

Others may have better ideas.


So not your exact problem, but I do experience slot assertion errors on my HL15 v1.0 from time to time. For me, it’s always on Slot 6, where my Intel Arc GPU resides. That’s obviously less of a pain for me than your scenario. I just lose transcoding until I can fix it. I have found that reboots don’t always bring it back. I sometimes need to power off and unplug the system for 30 seconds or so. I’m assuming you did that when you reseated the M.2, but posting here just in case.

I second this suggestion.

This is another good suggestion from David. I too suspect that maybe the PCH is no longer available. You should also try to reseat the CPU. Check for bent pins on the LGA. It’s also possible the CPU is the problem as it’s responsible for most of the PCIe.

You may have found this already, but SuperMicro has an FAQ on this error. Not the most helpful but does call out PCIe devices having issues as a culprit.

FAQ Entry | Online Support | Support - Super Micro Computer, Inc.
