I have a fully configured prebuilt HL15 from 45HomeLab. Yesterday the data on my HL15 zfs pool was no longer accessible. From the Houston UI, the ZFS tab showed the pool was Suspended. I rebooted the HL15 and the OS would never boot. The startup would hang at “A start job is running for Mount ZFS filesystems (time / no limit)”.
The only fix I found to allow the OS to boot was to use rescue mode and delete /etc/zfs/zpool.cache. When I logged back into the Houston UI, the ZFS pool would only import in read-only mode. Slots 1-5, 1-6, 1-7, and 1-8 showed thousands of errors. I wasn’t too worried about losing my data, as it was backed up on the old NAS I had before purchasing this HL15 and the new hard drives, so I deleted the ZFS pool, reformatted all 15 hard drives, and rebooted the server. The HL15 booted just fine. In the Houston UI I went to create a new storage pool, but 4 of the hard drives were not showing. On the 45Drives Disks tab, 1-5, 1-6, 1-7, and 1-8 showed no hard drives. I powered down the server, moved the drives in 5-8 to 12-15, and put the drives from 12-15 in 5-8. I started it up and got the same thing. I have cycled through all 15 8TB hard drives in a separate computer just to ensure they work, and they all do (they are all less than a month old). Also, even with slots 5-8 not showing, I’ve tried to create a ZFS pool and am unable to. I’ve tried with just a couple of the hard drives, with all 11, and with every combination in between.
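For anyone who lands here with the same suspended pool: the per-device error counts that Houston shows can also be read from `zpool status` on the command line. Below is a minimal Python sketch that pulls the devices with nonzero counters out of that text, assuming the usual `NAME STATE READ WRITE CKSUM` table layout. The pool name `tank` and the sample output are made up for illustration; they were not captured from the system in this thread.

```python
# Illustrative sample only: NOT output captured from the system in this thread.
SAMPLE_STATUS = """\
  pool: tank
 state: SUSPENDED
config:

        NAME        STATE     READ WRITE CKSUM
        tank        UNAVAIL      0     0     0
          raidz2-0  UNAVAIL      0     0     0
            1-1     ONLINE       0     0     0
            1-5     FAULTED   1.2K  3.4K     0
            1-6     FAULTED    812  2.1K     0

errors: 5 data errors, use '-v' for a list
"""

# zpool status abbreviates large counters, e.g. 1.2K for roughly 1200.
SUFFIXES = {"K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def parse_count(text):
    """Convert a counter such as '0', '812' or '1.2K' to an integer."""
    if text[-1] in SUFFIXES:
        return round(float(text[:-1]) * SUFFIXES[text[-1]])
    return int(text)


def devices_with_errors(status_text):
    """Return {device: (read, write, cksum)} for rows with any nonzero counter."""
    result = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_table = True          # device rows start after this header
            continue
        if not in_table:
            continue
        if not fields:
            in_table = False         # a blank line ends the device table
            continue
        if len(fields) < 5:
            continue                 # rows without counters, e.g. spares
        counts = tuple(parse_count(value) for value in fields[2:5])
        if any(counts):
            result[fields[0]] = counts
    return result


print(devices_with_errors(SAMPLE_STATUS))
```

On a live system the text could come from `subprocess.run(["zpool", "status"], capture_output=True, text=True).stdout` instead of the sample string.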
I read posts on here from over a year ago where people were having issues with the cable to the backplane. I would assume the corrected cables are being sent now, but I tried the suggested fix anyway: changing the link speed to 6Gb/s instead of 12Gb/s. That didn’t work either; no hard drive in slots 5-8 shows up, and I still can’t create a ZFS pool with the hard drives that are being read.
The marginal cables were mostly a problem with SAS drives. I think you are just using SATA drives.
Did you check the cabling? That the connector for slots 5-8 is secure in the motherboard and (although it’s a bit of disassembly) if possible that the connectors on the underside of the backplane are all securely in place?
What happens if you switch the cables around? Flip the cables for 1-4 and 5-8 with each other. Does the data corruption move to slots 1-4?
Given the odd issues people have had with the Supermicro motherboard QA, I’d suspect something wrong with the SFF-8087 connector on the motherboard before the backplane, if it’s not a loose cable.
Thank you for your quick response. My response is delayed as it took some time to get everything apart. I checked the cables, they were secure, rebooted and same issue. I swapped the cables, rebooted, same issue. I will see if I can figure out a way to check the motherboard tomorrow and see if it could be something with it.
If you swapped the cables and still have an issue with slots 1-5 to 1-8, then it would seem it might actually be a backplane problem. I’d send a note to info@45homelab.com if you have not already. How long did you have the unit before the issue appeared?
I should have been more clear in my response. When I swapped the cables, I still had the issue of 4 drive bays not being available, but it was 1-4 that became unavailable; hence, why I said I will check the motherboard. My apologies for not being clear. I’m home now and about to start going through the motherboard to see if I can find what is wrong with it.
So, further troubleshooting. (I’m using “motherboard spot 1” and “motherboard spot 2” as generic names for the two specific connectors on the motherboard, so you know where each cable is being plugged in.)
Test 1:
Motherboard spot 1: cable A
Backplane 1-4: cable A
HDs 1-4: show
Motherboard spot 2: cable B
Backplane 5-8: cable B
HDs 5-8: no show

Test 2:
Motherboard spot 1: cable A
Backplane 5-8: cable A
HDs 5-8: show
Motherboard spot 2: cable B
Backplane 1-4: cable B
HDs 1-4: no show

Test 3:
Motherboard spot 1: cable B
Backplane 1-4: cable B
HDs 1-4: no show
Motherboard spot 2: cable A
Backplane 5-8: cable A
HDs 5-8: show

Test 4:
Motherboard spot 1: cable B
Backplane 5-8: cable B
HDs 5-8: no show
Motherboard spot 2: cable A
Backplane 1-4: cable A
HDs 1-4: show
So I believe the cable is bad. Or do you think it could still be something else?
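The reasoning behind that conclusion can be written down as a small check. This is a rough Python sketch, not anything from 45HomeLab: it takes the eight observations posted above and looks for a component whose use lines up perfectly with the failures. The function name `consistent_suspects` is made up, and the check assumes a single, non-intermittent fault.

```python
from collections import defaultdict

# The eight observations posted above:
# (motherboard spot, cable, backplane bays, drives visible?)
OBSERVATIONS = [
    (1, "A", "1-4", True),
    (2, "B", "5-8", False),
    (1, "A", "5-8", True),
    (2, "B", "1-4", False),
    (1, "B", "1-4", False),
    (2, "A", "5-8", True),
    (1, "B", "5-8", False),
    (2, "A", "1-4", True),
]

FIELDS = ("motherboard spot", "cable", "backplane bays")


def consistent_suspects(observations):
    """Return (field, value) pairs that fail every time they are used,
    while every other value of the same field always works."""
    suspects = []
    for index, field in enumerate(FIELDS):
        outcomes = defaultdict(set)
        for obs in observations:
            outcomes[obs[index]].add(obs[3])
        # Only trust a field if each of its values always gave the same result.
        if all(len(seen) == 1 for seen in outcomes.values()):
            suspects += [(field, value)
                         for value, seen in outcomes.items()
                         if seen == {False}]
    return suspects


print(consistent_suspects(OBSERVATIONS))  # [('cable', 'B')]
```

Motherboard spot 1 and bays 1-4 each worked in some tests and failed in others, so neither explains the pattern on its own. This only shows that the failures track cable B in these eight observations; it cannot rule out an intermittent problem elsewhere.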
I got the HL15 in December but didn’t get it up and running until January as I had to have the motherboard replaced due to a bad NIC upon receiving. The HL15 and 15 brand new hard drives have been up and running since 1st or 2nd week of January (so about 4-5 weeks).
Oh, was it you that had the bad NIC? Sorry about these issues; it’s not typical and not my experience. I thought your original post ruled out a bad cable, but going through all the permutations does seem to show (luckily) that it’s just cable B that is bad, and bad in some way other than the SAS drive speed problem from that other issue. Support (info@45homelab) should be able to get you a replacement, or you could get one from Amazon or similar if you wanted it a few days quicker. The 10gtek ones are usually what’s recommended.
I’ve reached out to info@45homelab and referenced this posting to show all the troubleshooting that has been done. I’ll be sure to post back with what the fix ends up being.
Please let us know the solution to this, as I encountered the same error. I gave up on troubleshooting though. I ended up just connecting my HL15 directly to QNAP JBODs to have stability.
Support sent me a new backplane and it wasn’t the solution. I have also purchased an HBA (9400-16i) to connect the backplane to, and that wasn’t the issue either.
Hey @JonCo,
Sorry to hear you are having issues with your HL15.
What exactly is the problem you are having? Is it a problem with the drives not showing or with the error code upon boot?
Please feel free to contact info@45homelab.com and we can get a support ticket entered for troubleshooting the hardware issue.
It is actually the drives in slots 5-8 constantly disconnecting; they’d show up for a bit, then drop. It works for a few days to a month, then they drop out again. I personally did everything that would be recommended: replaced the cables, installed an HBA, replaced the backplane, etc. The only thing that seemed to work was to put the drives into a JBOD, and the system has been stable ever since.
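Since drops like these come and go, it can help to log which slots are missing at a given moment rather than only looking when something breaks. Here is a minimal Python sketch, assuming the slot aliases (1-1 through 1-15) show up as entries under `/dev/disk/by-vdev`. That path is an assumption about how the aliases are set up on this system, and `missing_slots` is a made-up helper name.

```python
from pathlib import Path

# HL15 slot labels as used in this thread: 1-1 through 1-15.
HL15_SLOTS = [f"1-{n}" for n in range(1, 16)]


def missing_slots(alias_dir, expected):
    """Return the expected slot aliases that have no entry in alias_dir."""
    directory = Path(alias_dir)
    # A missing directory is treated as "no slots present".
    present = {p.name for p in directory.iterdir()} if directory.is_dir() else set()
    return [slot for slot in expected if slot not in present]


if __name__ == "__main__":
    # ASSUMPTION: slot aliases appear under /dev/disk/by-vdev on this system.
    print(missing_slots("/dev/disk/by-vdev", HL15_SLOTS))
```

Run from cron or a timer with a timestamp added, the output would show whether the same four slots always drop together, which is what you would expect if they share one cable and one power connector.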
My first instinct is the power being sent to the drives. Have you ever tried switching the power connectors that connect to the bottom of the backplane to see if the problem moves to another sector?
I would check the SMART statistics to see if the drives in the bad slots have more power cycles than drives in other slots, using
smartctl -a /dev/<drive>
I would run that command on a few drives in the bad slots and a few drives in the healthy slots to see if there is a big difference in the power cycle count (attribute ID 12, Power_Cycle_Count).
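To make that comparison less manual, here is a small Python sketch that reads attribute 12 out of the attribute table `smartctl` prints and flags drives well above the median. The sample row, the per-slot counts, and the `margin` of 10 cycles are all illustrative assumptions, not values from anyone's system in this thread.

```python
# Illustrative attribute row in the layout smartctl prints:
# ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
SAMPLE_ROW = (
    " 12 Power_Cycle_Count       0x0032   100   100   000"
    "    Old_age   Always       -       43"
)


def power_cycle_count(smartctl_output):
    """Return the raw value of SMART attribute 12, or None if it is absent."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with the ID.
        if len(fields) >= 10 and fields[0] == "12":
            return int(fields[9])
    return None


def outliers(counts, margin=10):
    """Return {slot: count} for drives more than `margin` above the median."""
    ordered = sorted(counts.values())
    median = ordered[len(ordered) // 2]
    return {slot: c for slot, c in counts.items() if c - median > margin}


# Made-up counts for illustration only.
example = {"1-1": 41, "1-2": 44, "1-3": 43, "1-5": 72, "1-6": 65}
print(power_cycle_count(SAMPLE_ROW))  # 43
print(outliers(example))              # {'1-5': 72, '1-6': 65}
```

If the drives in one group of four slots stand out, that points toward their shared power connection. Keep in mind that moving drives between slots during troubleshooting also adds power cycles.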
@Braeden-45Drives
Yes, I tried reversing the power connectors on the backplane. I also ran the smartctl command on some drives: the majority were in the 40s, and a few drives were in the 60-80 range (probably from troubleshooting back then).
It is probably either the harness that runs underneath the backplane or the board that the harness connects to next to the power supply. It’s either that or the drives were triggering the overpower protection that’s built into the power supply and killing one of the molex connectors from the PS.
I just got sick of troubleshooting the issue because I have over 100 TB of data in my pool and didn’t want to risk losing it.
Hey guys, just jumping in here. The drives are powered in such a way that each power connection on the backplane powers 4 drives. The same is true of the SATA data cables.
So each SATA cable connects to 4 drives, and those same 4 drives are powered by the same power connector.
This power connector is then part of 1 large harness that connects back to the midplate power distribution board.
I would suggest swapping the connectors on the bottom of the backplane for both the SATA and power connections, and then seeing whether the issue moves or stays the same. If it stays the same, then the backplane is bad. If it moves, then it is a power or SATA connection issue; from there we can swap the SATA connection back to the original and see if the issue moves or not.
This will then tell us if it’s a backplane, power connector, or SATA cable causing the issue.
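As a quick summary of that procedure, here is a tiny Python sketch of the decision logic. The second step is my reading of it: once the SATA cable is back in its original position and only the power connectors remain swapped, a fault that moves back is following the SATA cable, and a fault that stays put is following the power connector. The function and parameter names are made up for illustration.

```python
def diagnose(moved_after_swapping_both, moved_back_after_restoring_sata=None):
    """Map the outcomes of the two swap tests to the suspected part.

    moved_after_swapping_both: did the fault move to the other group of bays
        after swapping both the SATA and power connectors?
    moved_back_after_restoring_sata: with the SATA cable returned to its
        original position (power still swapped), did the fault move back?
    """
    if not moved_after_swapping_both:
        # The fault stayed with the bays, so the backplane is suspect.
        return "backplane"
    if moved_back_after_restoring_sata is None:
        return "power or SATA connection (restore the SATA cable to tell which)"
    # The fault follows whichever connection it keeps travelling with.
    return "SATA cable" if moved_back_after_restoring_sata else "power connector"


print(diagnose(False))        # backplane
print(diagnose(True, True))   # SATA cable
print(diagnose(True, False))  # power connector
```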
Because this has been ongoing for some time, if the above troubleshooting does not yield any additional clarity, I encourage you to reach out to info@45homelab.com so that we can get a support ticket entered and a troubleshooting call scheduled to look into this issue in more detail, instead of going back and forth on posts here.
Thank you again for choosing 45homelab as your preferred home lab server!
It sounds like you’ve been through a lot of troubleshooting. There do seem to be some other reports of flaky behavior with those drives, although they seem to be working for you in the QNAP JBOD.
Two things I didn’t see mentioned were: a) a firmware update for the drives, and b) swapping out the PSU. Based on what you’ve described, I agree that you should talk to support again to try to get them to rule out a problem with the power distribution board or wiring, but I just wanted to mention those two other items for testing.