RESOLVED: ZFS Write Errors with HL15 Full Build and SAS Drives

pxpunx · December 26, 2023, 7:48pm

8TB IronWolf SATA re-silvered and online in slot 1-11 w/ zero errors.

I’m officially out of 8TB spares.

ALL of the HGST 8TB SAS disks have issues in slots 1-11 and 1-15. So far (knock on wood), there have been no issues in the other slots.

DigitalGarden · December 26, 2023, 9:32pm

It seems like it. It could be the Broadcom controller on the motherboard, but I didn’t see any issues with a very quick Google search. Has rymandle05 ruled out the cables? Maybe you can swap the cables. On the backplane side, swap the cable connected to 1-15 over to one of the two that connect 1-1 through 1-8, since 1-11 is also problematic?

pxpunx · December 26, 2023, 10:04pm

Damn it! The on-board 3008 is in IR mode.

I will re-flash to IT mode and start over.

pxpunx · December 26, 2023, 10:23pm

IR mode could have caused issues in the future, so IT mode was a requirement, but it doesn’t seem to have affected this particular issue.

I still see write errors when I attempt to re-silver a replacement drive.

pxpunx · December 26, 2023, 10:36pm

Rebuilt the pool after the re-flash to IT mode, errors on 1-11. None on 1-15 so far.

I have one SFF-8643 to SFF-8643 cable and will swap one of the cables out to see what happens.

rymandle05 · December 26, 2023, 10:55pm

I haven’t ruled out cables. I was hoping to hear back today from 45Drives on that suggestion and maybe they would send new ones to try out. I have to imagine it’s either the cables or the controller itself. If I had another SAS controller I’d try that out but, sadly, I don’t have a spare SAS card either.

rymandle05 · December 26, 2023, 10:56pm

How did you figure out the cable and port configuration? I assume my full build is the same but thought I’d check. 1-9 is more common to have an issue for me than 1-15.

pxpunx · December 26, 2023, 11:16pm

I didn’t map anything out in particular, it’s just the order of the drives vs. cables vs. individual ports or slots on the controller. The fact that it’s the 3rd out of 4 on both connectors could be a coincidence, or it could be a clue. Hard to tell at this point.

I have multiple SAS HBAs I can swap in and out to see if the problem follows, but it’ll take some time to work through all the different scenarios.

EDIT: I also haven’t ruled out swapping the backplane connections around, either.

pxpunx · December 26, 2023, 11:46pm

Swapped the cable that connected bank 3 (slots 1-13, 1-14, 1-15) w/ a Supermicro SFF-8643 to SFF-8643 cable I had spare and while there were some checksum errors (likely the result of the previous failed attempts to re-silver), 1-15 passed w/out issue, but 1-11 still threw write errors.

I cleared the errors on 1-11 and 1-15, which prompted 1-11 to attempt to re-silver, which clocked more write errors as expected, but no additional checksum errors on 1-11 or 1-15.

Currently dumping some more data onto the pool to see if I can cause 1-15 to error again. If it remains healthy, I’ll put the old cable back and make sure both ends are secure, then re-check it all.

pxpunx · December 27, 2023, 12:24am

Put the stock SAS cable back. At first there were no errors on 1-15, but after several attempts to re-silver 1-11, 1-15 decided to clock some read errors.

I have some replacement SAS cables coming that should be here Thursday so I can run additional tests.

rymandle05 · December 27, 2023, 12:48am

So your bank with the spare Supermicro cable must have been stable with tests? What cables did you end up buying? If it’s something I can return, I don’t mind forking over the money now.

I found these on Amazon from a brand I’ve used before. Unfortunately, it wouldn’t arrive until Saturday for me.
https://www.amazon.com/10Gtek-Internal-SFF-8643-Sideband-0-5-Meter/dp/B01AOS4LES/ref=sr_1_4?th=1

pxpunx · December 27, 2023, 2:42am

Good news! I moved the drives to an LSI 9300-8i in IT mode and the issue persists. So far I’ve clocked write errors on 1-15 … just waiting for 1-11 to catch up.

To sum up:

1. Swapped multiple SAS HDDs for same make/model in 1-11 and 1-15; errors persisted.
2. Swapped the SFF cable for bank 3 (slots 1-13, 1-14, 1-15) and the errors stopped on 1-15, but continued on 1-11.
3. Swapped 1-11 for a SATA HDD and the errors stopped, swapped back to another SAS HDD and the errors returned.
4. Installed an LSI 9300-8i (SAS 3008) and moved the connections from the on-board SAS controller (also a SAS 3008) to the 8i and the errors continued.

Because of #2 and #3, I don’t suspect the backplane, but haven’t ruled it out.

Because of #2 I suspect the SAS cables. It’s possible that a manufacturing defect hit a particular run and the same issue could affect others.

Because of #4, I’m less suspicious of the onboard SAS controller. However, since both the onboard and expansion card controllers are the same model, it makes me want to try one of the LSI 9200s I have. But this feels like a stretch since it would imply errors specific to my drives or their feature set.

pxpunx · December 27, 2023, 2:43am

That’s them. I’ll use them to test, but will probably replace long-term with some from Supermicro.

Supermicro Internal MiniSAS HD to MiniSAS HD 60cm Cable (CBL-SAST-0593)

Hutch-45Drives · December 27, 2023, 1:00pm

Hi All,

I’m just catching up on this thread and it’s a head-scratcher.

Being that the issue is on slots 1-11 and 1-15, these would be on 2 different HBA cables and 2 different sections of the backplane.

The only thing that I see is that slots 1-9 to 1-15 all use the SAS cables, so there could be a bad batch of HBA SAS cables as you mentioned. I’d be curious if you switched the cables on the backplane side to see if the 1-1 to 1-8 slots then have the same issues with the SAS cables plugged into them, and also if the 1-9 to 1-15 slots stop having issues with the regular SATA cables connected to them.

rymandle05 · December 27, 2023, 2:53pm

@Hutch-45Drives aren’t the cables connected to 1-1 to 1-8 different ends on the motherboard as those connect to the onboard sata controller and not the SAS controller? I can double check here in a minute.

Another test I’ll try: I noticed that the SAS cables are right up against the power cables underneath the backplane. Crosstalk could be part of the problem here at higher 12Gbps SAS speeds, so let me try to run the cables temporarily up above and over the drives to rule this out.

rymandle05 · December 27, 2023, 3:06pm

@Hutch-45Drives Nevermind – took me longer than I care to admit but I understand your ask now.

Hutch-45Drives · December 27, 2023, 3:30pm

@rymandle05, Haha that’s OK. And yes, they are different on the motherboard, but the connectors on the backplane side are all the same connections.

rymandle05 · December 27, 2023, 3:40pm

That’s a test I can do here. Right now, I’m running tests with a 6-wide pool using 1-9 to 1-11 and 1-13 to 1-15 with the SAS cables away from the backplane and power cables. The first FIO benchmark test is wrapping up and I threw a bunch of video files at it over Samba. No errors yet but I’ve been here before so testing continues.

pxpunx · December 27, 2023, 4:23pm

Yeah, I will move the cables on the backplane and test.

I’ve noticed over the last few tests that the drives can appear error-free for a while. In fact, at the moment, 1-11 is fine, but 1-15 has faulted twice on two drives.

That makes me feel like the SATA drive test and cable replacement test I did yesterday didn’t necessarily prove anything.

I think for this run I’ll reduce the pool size to “force” all the action on 4 drives in a RaidZ2 - two “bad,” two OK.

rymandle05 · December 27, 2023, 4:26pm

Funny you should mention it takes a while for errors. I’m halfway through the second FIO test and just now 1-11 experienced write errors - just not enough yet to FAULT the drive.

  pool: SEAGATE6Z2
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 2.66M in 00:00:01 with 0 errors on Wed Dec 27 08:11:06 2023
config:

        NAME        STATE     READ WRITE CKSUM
        SEAGATE6Z2  ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            1-9     ONLINE       0     0     0
            1-10    ONLINE       0     0     0
            1-11    ONLINE       0     5     0
            1-13    ONLINE       0     0     0
            1-14    ONLINE       0     0     0
            1-15    ONLINE       0     0     0

errors: No known data errors
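
Since the errors can take a while to show up, it may help to pull the non-zero counters out of saved `zpool status` text instead of eyeballing it after every run. A rough sketch in Python, assuming the plain column layout shown above; the function name `devices_with_errors` and the sample string are just for illustration, and rows where zpool abbreviates large counts (e.g. `1.2K`) would be skipped by this simple pattern.

```python
import re

# Config table copied from the status output above.
SAMPLE = """\
        NAME        STATE     READ WRITE CKSUM
        SEAGATE6Z2  ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            1-9     ONLINE       0     0     0
            1-10    ONLINE       0     0     0
            1-11    ONLINE       0     5     0
            1-13    ONLINE       0     0     0
            1-14    ONLINE       0     0     0
            1-15    ONLINE       0     0     0
"""

# One row: indented name, state, then three plain integer counters.
# The header row does not match because READ/WRITE/CKSUM are not digits.
ROW = re.compile(r"^\s+(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for rows with any non-zero counter."""
    flagged = {}
    for line in status_text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue
        name, _state, read, write, cksum = match.groups()
        counts = (int(read), int(write), int(cksum))
        if any(counts):
            flagged[name] = counts
    return flagged


print(devices_with_errors(SAMPLE))  # {'1-11': (0, 5, 0)}
```

Run against the output above, this flags only 1-11 with its 5 write errors, which matches what the table shows.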

Here are the dmesg logs showing the same error codes as before.

[Wed Dec 27 10:21:13 2023] mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
[Wed Dec 27 10:21:13 2023] sd 2:0:7:0: [sde] tag#689 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=0s
[Wed Dec 27 10:21:13 2023] sd 2:0:7:0: [sde] tag#689 CDB: Write(16) 8a 00 00 00 00 01 10 00 84 08 00 00 03 f8 00 00
[Wed Dec 27 10:21:13 2023] I/O error, dev sde, sector 4563436552 op 0x1:(WRITE) flags 0x700 phys_seg 49 prio class 2
[Wed Dec 27 10:21:13 2023] zio pool=SEAGATE6Z2 vdev=/dev/disk/by-vdev/1-11-part1 error=5 type=2 offset=2336477937664 size=1048576 flags=40080c80
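
For anyone comparing their own logs against these, the two hex values in the lines above can be unpacked by hand. The sketch below is in Python and rests on two assumptions: the `log_info` bit layout (bus type, originator, code, sub-code) is my reading of how the Linux mpt3sas driver splits the word, and the CDB layout is the standard SCSI WRITE(16) format. The function names are made up for illustration, and it does not try to interpret what sub-code 0x0303 means.

```python
def decode_mpt_loginfo(value):
    """Split a 32-bit mpt3sas log_info word into its bit fields.

    Assumed layout (high to low): bus_type:4, originator:4, code:8, sub_code:16.
    """
    originators = {0x0: "IOP", 0x1: "PL", 0x2: "IR"}
    originator = (value >> 24) & 0xF
    return {
        "bus_type": (value >> 28) & 0xF,
        "originator": originators.get(originator, hex(originator)),
        "code": (value >> 16) & 0xFF,
        "sub_code": value & 0xFFFF,
    }


def decode_write16_cdb(cdb_hex):
    """Return (lba, block_count) from a WRITE(16) CDB given as spaced hex bytes."""
    raw = bytes.fromhex(cdb_hex)
    if len(raw) != 16 or raw[0] != 0x8A:
        raise ValueError("not a 16-byte WRITE(16) CDB")
    lba = int.from_bytes(raw[2:10], "big")      # bytes 2-9: logical block address
    blocks = int.from_bytes(raw[10:14], "big")  # bytes 10-13: transfer length
    return lba, blocks


print(decode_mpt_loginfo(0x31120303))
print(decode_write16_cdb("8a 00 00 00 00 01 10 00 84 08 00 00 03 f8 00 00"))
```

The first call reproduces the originator(PL), code(0x12), sub_code(0x0303) breakdown the driver already prints. The second gives LBA 4563436552, the same sector reported in the I/O error line, with a transfer length of 1016 blocks.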