I finished my HL-15 bare metal + backplane build in October and ran my first scrub. 6 out of 8 drives have checksum errors. I’m using the SAS cables provided with the case. Would this be symptomatic of the bad batch of cables, and is there any way to check? Did the bad cable batch only affect full builds?
I don’t think we were given much public information about the cable issue. The people who were having issues just contacted support and some sort of replacement was worked out; there wasn’t any preemptive recall or anything like that. I think we assume the bad cables were either removed from inventory by 45HL, or they had already worked through them on builds by the time the root cause got identified. I’d be really surprised if you got bad cables in October 2024, at least not any from the same bad batch.
Just to be clear, you are trying to do 12 Gbps SAS? That is where this issue presented itself.
Assuming you are, one way to check whether you are having a similar issue would be to find the place in this thread describing how to limit the link speed to 6 Gbps.
Another option would be to email info@45homelab.com without doing much more research and see what they do. They might be willing to ship a new set of cables, few questions asked, or they might ask you to do a little more digging.
Your other option would be to order a different set of cables yourself from Amazon or the like; 10Gtek ones are usually recommended. Front the cost to help identify whether the cables really are the issue. If they are, there should be either a) a return window for the cables you buy, or b) 45HL may be willing to give you a credit for the purchase price of the other cables (rather than the hassle of returning those and 45HL sending you a different set).
Correct - I worked with 45Drives to get a replacement. I sent my cables back to them so they could check them out. I followed up once via email but never did hear a resolution. Just given the amount of time that’s passed, I’m guessing only a few of us were unlucky enough to get cables far enough out of tolerance to cause issues.
I’m guessing you may be running SATA drives, so this thread may not be as applicable as it would otherwise be. In the default configuration, only 7 out of the 15 slots (9-15) are using the motherboard SAS controller. SATA runs at 6 Gbps, so if you do have a cable issue, that cable is probably even further out of tolerance than the ones I received last year. It’s possible, since manufacturing defects do happen, but it seems unlikely in my opinion.
@mango - if you do want to try slowing down the link speed to see if that helps, here is a guide I wrote to do that on the SAS controller. You should be able to take it all the way down to 3 Gbps. You’d also have to move your drives to slots 9-15 for this to work, or physically swap the cables so the SAS controller feeds slots 1-8 and the SATA controller feeds 9-15.
You can also clear the checksum errors via zpool clear POOLNAME. I think it’s also good practice to do a second scrub after finding ZFS errors.
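For reference, the sequence would be something like this (standard ZFS commands; substitute your actual pool name for POOLNAME, and the -v flag lists any files affected by the checksum errors):

zpool status -v POOLNAME   # note which drives are accumulating errors before clearing
zpool clear POOLNAME       # reset the error counters
zpool scrub POOLNAME       # kick off the follow-up scrub
zpool status POOLNAME      # check scrub progress and any new errors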
A few more details would also be helpful. I know the build is recent, but were the drives brand new, refurbished, or used? Brand and model? I assume it’s a Z2 configuration with those 8 drives, but could you confirm?
Thanks for the replies! I am not trying to do 12 Gbps SAS since I have the barebones build; it’s SATA drives only. I guess I missed that part of the thread.
To be clear, this was my first scrub. Two of the drives were bought new about 4 years ago and moved from a previous NAS, and the other 6 are manufacturer-recertified drives from Serverpartdeals.com. No SMART errors on any drives in short/long tests. Checksum errors show up on both the originally-new drives and the Serverpartdeals ones, so there’s no pattern pointing to only the used drives.
I’m running 4x2 vdev mirrors. I would have gone Z2 but I didn’t have all the drives to start with and wasn’t comfortable relying on the Z2 expansion feature with it being so new. Mirrors still seem the most flexible.
2 drives are 4TB WD Red CMR. The rest are split 3 and 3 between Exos X18 and X22, and each of those vdevs has one of each Exos type.
I agree that the cables being the issue is unlikely. I could first try reseating everything, but I’m afraid to scrub too many times if it’s a hardware issue. I ordered 10Gtek replacement cables for good measure; they are cheap enough and I’ve used their 10G SFPs with success.
I am also going to replace the HBA with a spare, swap the CableMod cables I got for the PSU <> backplane connection back to the original ones that came with the PSU, and then re-scrub. I got the CableMod ones for cable management purposes since the backplane is right next to the PSU; PSU-to-backplane cables come with the custom build but not with the barebones version without a PSU.
If the issue persists after all that, it will help isolate the culprit. I’ll swap the PSU and run memtest at that point, and maybe reseat the CPU.
I have separate pools for NVMe and a SATA SSD pool for boot. I haven’t seen any issues with those. The NVMe pool has quite a few Docker apps running on it 24x7 for various homelab activities, probably 15 or 20 heavily used containers, including IP monitoring. If those were acting up due to memory or the PSU, I think it would be more apparent, so I’m focused on the storage subsystem used only by the spinning disk pool.
What PSU do you have, and how many cables are connecting it to the PDU? Are you using four separate SATA/PATA-to-1x-Molex cables? FWIW, I have that setup and haven’t had any issues.
Not really. If you do too much at once, we won’t really know what the root cause was.
Awesome! I appreciate all the good details and validating some of our assumptions. Knowing you do have a mix of used and recertified drives, I would keep an eye on the SMART data. In fact, start a document with the current output from each one of your drives using smartctl -x /dev/sdX. That will give you all the SMART information, including values that won’t trigger a failure. You can then reset the ZFS errors and watch for any recurrences. Should you get checksum errors again, capture the smartctl information again to compare against.
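A quick loop makes that baseline easy to capture. A minimal sketch, assuming the spinning disks show up as /dev/sda through /dev/sdh (adjust the letters to your system; ls -l /dev/disk/by-id/ will tell you which device maps to which physical drive):

for d in a b c d e f g h; do
  sudo smartctl -x /dev/sd$d > smart_sd${d}_$(date +%F).txt   # one timestamped file per drive
done

Diffing those files after the next scrub makes any creeping SMART counters obvious.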
I get the hesitation but I also wouldn’t wait too long to try another scrub. Maybe a few days to a week? If you can afford it, maybe look to keep an extra drive on hand.
Memtest would be a good next step as well. It won’t cost you anything but downtime.
I have read that PSUs can be the cause of bad signalling, so this is possible. Did you happen to run power and data cabling side by side anywhere in the case? Crosstalk is a real thing, and if the shielding on the SAS cables is poor, that could also impact signals.
This is true. My view is that I’d rather run the next scrub under known-healthy hardware conditions to avoid data loss than pin down the exact issue. If I have to write off an HBA and some cables, so be it.
I’m using a Corsair RM850x, specifically the 850 because it has the 4x peripheral ports to achieve the 1:1 PSU <> Molex connections for the backplane. These are the CableMod cables I ordered. FWIW, I initially set up the system with 2:2 and had no issues; I think that is a completely acceptable approach as well.
It’s not that I’m hesitant to do another scrub; I definitely want to do one ASAP. I meant that I’m hesitant to do it before replacing some cables/HBA in case it’s a hardware issue, since another scrub might exacerbate the hardware fault.
The cables should get here Sunday so I’ll run the next scrub after I swap those out.
No crosstalk that I can think of but I’ll make sure to double check when I’m swapping things out.