I, too, have been error free at 12Gbps SAS-3 speeds with the same 6 wide Z2 pool on the 10Gtek mini SAS cables. I haven’t been hitting it quite as hard as I have in the past, but it has still been through quite a bit. I’m feeling bold enough to get it ready to production-ize in my home now.
So it appears to be the cables…
Now the question is: what’s the problem with the cables, and why did they appear to work with the 9300-8i?
I updated the original post to indicate that new cables have fixed the issue. My pool is now 8 wide Z2 and has been running error free at SAS-3 speeds.
@pxpunx any word from SuperMicro? How’s your setup been running? I haven’t heard anything more in my ticket.
I did hear back from Corey today and 45drives is sending out replacement cables.
Manufacturing issue leading to them being marginal for signal integrity at higher bandwidth speeds?
This is my prevailing theory. It’s quite possible @pxpunx and I “won” the lottery here and both received marginal cables. Hopefully, it’s not a larger issue with a “bad batch” and others haven’t noticed yet.
Likely … just feels weird when I can’t “see” the problem. I’d feel better if I could explain it with a bad crimp or solder joint.
Right now it’s basically “the magic has escaped the cables.”
Hats off to both of you @rymandle05 / @pxpunx for a job well done. I have learned quite a bit from reading your detective work.
Thank you
I ordered new cables direct from Supermicro.
Once I told them it looked like a cable issue, and they weren’t Supermicro cables, that pretty much sealed the deal on their end.
No response from 45drives, which is unfortunate. IMHO, they’re going to need to pick up a lot of slack in support and logistics if they’re going to make this whole “home labs” experiment work.
Hey @pxpunx,
I appreciate you bringing this up. I would like to apologize for not getting back to you sooner. Your email came in during the holiday season, when we were working with limited staff, and missing it was a simple mistake on my part.
Please be assured that I have since forwarded your request to our technical service team and someone from there will be in touch with you as soon as possible.
Thank you for your patience and support.
Well, first pass of the FIO benchmark: no errors with the same settings @rymandle05 used in post #1. More testing tomorrow. I’ve got 10 drives to test.
Definitely go for more passes with FIO! I would sometimes go almost 5 tests before seeing errors. You can also run head -c 100G </dev/random > /pool0/ at the same time. I found this would usually bring the errors forward to the first pass, but sometimes it still took up to 3 tests.
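The "keep running passes until it breaks" approach above could be scripted roughly as follows. This is only a sketch: the fio job parameters are placeholders (not the settings from post #1), the helper names are invented for illustration, and the health check assumes `zpool status -x <pool>` prints its usual "is healthy" message when nothing is wrong.

```python
import subprocess
from typing import Callable, Optional


def run_fio_pass(directory: str, runtime_s: int = 120) -> None:
    """Run one timed fio write pass against `directory`.

    The job parameters are illustrative placeholders, not the settings
    from post #1; substitute whatever job you normally run.
    """
    subprocess.run(
        [
            "fio",
            "--name=cable-stress",
            f"--directory={directory}",
            "--rw=write",
            "--bs=1M",
            "--size=4G",
            "--numjobs=4",
            "--time_based",
            f"--runtime={runtime_s}",
            "--group_reporting",
        ],
        check=True,
    )


def pool_is_healthy(pool: str) -> bool:
    """Return True if `zpool status -x <pool>` reports the pool as healthy."""
    result = subprocess.run(
        ["zpool", "status", "-x", pool],
        capture_output=True, text=True, check=False,
    )
    return "is healthy" in result.stdout


def first_failing_pass(
    run_pass: Callable[[], None],
    is_healthy: Callable[[], bool],
    max_passes: int = 5,
) -> Optional[int]:
    """Run up to `max_passes` passes; return the 1-based number of the
    first pass after which the pool is unhealthy, or None if all pass."""
    for n in range(1, max_passes + 1):
        run_pass()
        if not is_healthy():
            return n
    return None
```

With a hypothetical dataset path and pool name, a call would look like `first_failing_pass(lambda: run_fio_pass("/pool0/bench"), lambda: pool_is_healthy("pool0"))`. The pass runner and health check are passed in as callables so the loop itself can be tried out without touching a real pool.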
Morning update:
Last night’s run with no errors was because my target was on the host NVMe drive (I knew it was time for bed).
Today, I configured Samba and a dataset on my pool. Using FIO, with runs as short as 2 minutes, I see errors and zpool status shows the pool as degraded. The same blk_update_request I/O errors show up in syslog. So far they are only on slot 1-11 (I was not running any disks in 1-13).
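To see which disks the blk_update_request errors land on, the log can be tallied per device. The sketch below assumes the common kernel message shape `blk_update_request: I/O error, dev <name>, sector ...`; the exact wording varies between kernel versions, so the pattern may need adjusting, and `count_io_errors` is just an illustrative helper name.

```python
import re
from collections import Counter
from typing import Dict

# Assumed message shape (varies by kernel version):
#   blk_update_request: I/O error, dev sdc, sector 123456 ...
IO_ERROR_RE = re.compile(r"blk_update_request: I/O error, dev ([^,\s]+),")


def count_io_errors(log_text: str) -> Dict[str, int]:
    """Count blk_update_request I/O errors per block device name."""
    return dict(Counter(IO_ERROR_RE.findall(log_text)))
```

Feed it the contents of the system log (for example /var/log/messages or /var/log/syslog, depending on the distro) or the output of `dmesg`. Mapping the kernel device names back to slots such as 1-11 is a separate step that this sketch does not attempt.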
Time for the disk shuffle to see if problems persist.
This pass I moved the original disk out of 1-11 and also placed a disk in 1-13. Now there are errors in both.
        NAME        STATE     READ WRITE CKSUM
        test        DEGRADED     0     0     0
          raidz2-0  DEGRADED     0     0     0
            1-9     ONLINE       0     0     0
            1-10    ONLINE       0     0     0
            1-11    FAULTED      0   217     0  too many errors
            1-12    ONLINE       0     0     0
            1-13    FAULTED      0    28     0  too many errors

errors: No known data errors
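When comparing results across several disk shuffles, it can help to pull just the rows with non-zero counters out of the zpool status output. A minimal sketch, assuming the counters are printed as plain integers (rows with abbreviated counters are simply skipped); `devices_with_errors` is an illustrative name:

```python
from typing import List, Tuple

# (name, state, read errors, write errors, checksum errors)
Row = Tuple[str, str, int, int, int]


def devices_with_errors(status_text: str) -> List[Row]:
    """Return the config rows whose READ/WRITE/CKSUM counters are non-zero."""
    rows: List[Row] = []
    for line in status_text.splitlines():
        fields = line.split()
        # A config row looks like: NAME STATE READ WRITE CKSUM [note...]
        if len(fields) < 5 or not all(f.isdigit() for f in fields[2:5]):
            continue  # header, "errors:" line, abbreviated counters, etc.
        name, state = fields[0], fields[1]
        read, write, cksum = (int(f) for f in fields[2:5])
        if read or write or cksum:
            rows.append((name, state, read, write, cksum))
    return rows
```

Run against the output above, this returns the rows for 1-11 (217 write errors) and 1-13 (28 write errors), both FAULTED.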
I am assuming new cables are the solution?
I’ll keep running tests today if the power stays on here.
I’m pretty confident you have the same cable issue. First, definitely reach out to info@45homelab.com if you haven’t already. If nothing else, we want them to be aware. We still might be just the unlucky three, but there’s a chance of a bad batch of cables here.
For next steps, I would try to use lsiutil to set the controller to SAS-2 speeds and rerun the tests. I put together a guide in an earlier post. If it runs well at SAS-2 but errors at SAS-3, then I’m really confident you have the same bad cable issue.
If you are impatient like I was, the 10GTek Internal Mini SAS HD SFF-8643 to SFF-8643 cables are ~$25 on Amazon: https://www.amazon.com/10Gtek-Internal-SFF-8643-Sideband-0-5-Meter/dp/B01AOS4LES/ref=sr_1_4?th=1
FYI - received replacement cables yesterday from 45Drives. I noticed that replacement cables have a different part number (P/N: 45D-CBL60-013604-12G) than the original ones (P/N: 45D-CBL60-014148) in the HL15. At least, I’m assuming the label on it is the part number. Regardless, I’ll look to install them this weekend and try the benchmarking tests again to see if errors return.
@rymandle05,
Great write-up on lsiutil. I’m running Rocky 8 and it worked just fine. I was able to change the speed to 6 Gbps and verify it. Still running tests.
Well, after a 2-hour run with FIO, no errors running at 6 Gbps on my HGST drives. I’ll continue running for another day or two.
Thanks for your help so far.
@rymandle05, I removed the old cables. Those parts were tagged as “45D-CBL60-009654-12G” and “45D-CBL60-009622-12G”. I installed new 10GTek cables. I reset the LSI chipset back to 12Gbps via lsiutil and have been hammering away for 24 hours. No errors. I’ll probably do another 24 hours of testing.
I reached out to @Vikram-45HomeLab, just to let him know more people were having this issue. This was the reply: "If you have an all SAS system, and has the I/O to accommodate 4x Mini-SAS-HD connections, that would give you 12GB for all drive connections.
A stock HL15 build would not have the I/O to make this work, you’d need either a different motherboard, or an HBA card to make this work the way you want."
So now I am a little confused. SAS3 has a 12Gbps speed. I realize that with only 7 SAS3 drives in a RaidZ2, I might not saturate the bandwidth. Is this different from the errors when trying to write to the drives at 12Gbps (which is now possible without errors after replacing the cables)?
Sounds like @Vikram-45HomeLab is confused here. He’s not wrong that you can’t get 12Gbps speeds on all 15 drives with the full build. However, you can achieve this for 7 of the 15 drives. Moreover, speed isn’t the issue. It’s the ZFS write and checksum errors.
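To put rough numbers on the difference between link rate and throughput: the 12Gbps figure is the negotiated rate of each SAS lane, not something the drives have to fill. The sketch below assumes 8b/10b encoding on the link, each drive on its own lane, and a round 250 MB/s for one spinning disk; that last figure is an assumption for illustration, not a measurement from this thread.

```python
SAS3_LINE_RATE_GBPS = 12   # per-lane line rate negotiated at SAS-3
SAS2_LINE_RATE_GBPS = 6    # per-lane line rate negotiated at SAS-2
LANES_PER_SFF8643 = 4      # one Mini-SAS HD connector carries four lanes
ASSUMED_HDD_MB_S = 250     # assumption: rough sequential rate of one HDD


def lane_payload_mb_s(line_rate_gbps: int) -> float:
    """Approximate usable MB/s per lane after 8b/10b encoding overhead."""
    return line_rate_gbps * 1000 * 8 / 10 / 8


def lane_utilization(drive_mb_s: float, line_rate_gbps: int) -> float:
    """Fraction of one lane's payload rate that a single drive can fill."""
    return drive_mb_s / lane_payload_mb_s(line_rate_gbps)
```

Under those assumptions a lane carries about 1200 MB/s at SAS-3 and 600 MB/s at SAS-2, so one hard drive fills only around a fifth of a SAS-3 lane. That fits the point above: the link still signals at 12Gbps regardless of how busy the drive is, so a marginal cable can produce errors even when throughput is nowhere near the limit.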
So the third segment is a serial number? Will be interesting to know if the replacement cables with the lower number (013604) worked.