RESOLVED: ZFS Write Errors with HL15 Full Build and SAS Drives

orix · December 14, 2023, 1:34pm

So if I’m following correctly, when you did the 4 drive test you had errors, but the 3 drive tests so far haven’t had any? Could this be a PSU issue? I mean, I love me some Corsair and it’s just about all I’ll use nowadays but having a bad egg is bound to happen.

Those drives don’t have that pesky 3.3v pin on them do they? I can never remember if that’s a SATA only or if SAS has it as well. I don’t shuck drives so I have a lower chance of running into it.

rymandle05 · December 14, 2023, 2:18pm

You’re right on the drive pools. Because there are 7 SAS enabled slots, I went from 4 drives in the odd slot test to 3 drives in the even slot test. I’ve also been wondering about power. The label on the drive says “+5V / 0.85 A” and “+12V / 0.99 A” so I don’t think it has a 3.3V pin but I can look deeper into that.
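For the PSU question, the label figures quoted above give a rough per-drive power number. This is only a sketch: it assumes the label values are rated currents rather than measured draw (spin-up can briefly exceed them), and the drive count of 7 is an assumption based on the 7 SAS enabled slots.

```python
# Label ratings quoted above: (rail voltage in volts, rated current in amps).
LABEL_RATINGS = [(5.0, 0.85), (12.0, 0.99)]
DRIVE_COUNT = 7  # assumption: one drive in each of the 7 SAS enabled slots


def drive_watts(ratings):
    """Sum volts * amps over each rail to get a per-drive figure in watts."""
    return sum(volts * amps for volts, amps in ratings)


per_drive = drive_watts(LABEL_RATINGS)
total = per_drive * DRIVE_COUNT
print(f"per drive: {per_drive:.2f} W, {DRIVE_COUNT} drives: {total:.2f} W")
```

That works out to roughly 16 W per drive from the label, so total drive load on its own is unlikely to stress a healthy PSU; it says nothing about ripple or a marginal rail.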

Something else I noticed. SMART data doesn’t report bad sectors but it is showing “DWORD synchronization errors” increasing over time in the “Protocol Specific port log page for SAS SSP” section.

Invalid DWORD count = 75
Running disparity error count = 74
Loss of DWORD synchronization = 166

I’ve seen others say this can be the result of a marginal SAS controller, marginally bad cables, or dirty power.

[hard drive - How does loss of DWORD synchronization affect the health of an SAS disk? - Server Fault]
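To watch whether these phy counters keep climbing, one option is to pull them out of saved smartctl output and compare snapshots. The sketch below assumes the text was captured from the “Protocol Specific port log page for SAS SSP” section; the function names are made up for illustration, and the sample values are the ones quoted above.

```python
import re

# Counter names as they appear in the SAS SSP port log page.
COUNTER_RE = re.compile(
    r"^\s*(Invalid DWORD count|Running disparity error count|"
    r"Loss of DWORD synchronization|Phy reset problem) = (\d+)\s*$",
    re.MULTILINE,
)


def phy_counters(smart_text):
    """Return the first listed phy's counters as a dict of name -> int."""
    counters = {}
    for name, value in COUNTER_RE.findall(smart_text):
        # Keep only the first occurrence of each name (the first phy).
        counters.setdefault(name, int(value))
    return counters


def counter_growth(before, after):
    """Return only the counters that increased between two snapshots."""
    return {
        name: after[name] - before.get(name, 0)
        for name in after
        if after[name] > before.get(name, 0)
    }


SAMPLE = """\
Invalid DWORD count = 75
Running disparity error count = 74
Loss of DWORD synchronization = 166
"""

print(phy_counters(SAMPLE))
```

Feeding two captures taken before and after a benchmark run into counter_growth would show which counters moved during the test.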

I’m going to run through a few more benchmarks with what I have but here’s what I’m thinking to try next:

Try the other drives in a three wide Z1 with the same benchmark tests
Set the SAS speed to 6 Gbps (instead of Auto, which negotiates 12 Gbps) in the BIOS
Test with a spare 1000 W PSU (I think Corsair?) that I have lying around

orix · December 14, 2023, 2:36pm

Nice, you’re on a good track troubleshooting wise. I love seeing folks get interested and take a deep dive into “what the heck is causing this and how do I fix it”. I’ve gained the most knowledge and understanding when things go wrong, not when they go right!

The 3.3v pin is strange and is a datacenter/enterprise drive thing. IIRC, it’s used to do a clear or reset of the drive without requiring hands on the hardware. But I’m having a hard time remembering or finding if it’s SATA only or SAS as well (I need more coffee before I can use my brain). More info here (about the drives, not my brain).

The dirty power thing is really interesting. I know the RMe line of PSUs from Corsair is a bit newer, and I think there may have been some changes as it adopts the new PCIe specs. I have 6 spinners in right now so I won’t be of much help to compare to. Please do be careful, though, to check the pinout of the spare PSU against the current one, even if it’s Corsair. While Corsair is mostly standard across their models, there are a few different generations of cable types. This chart is really really helpful, and I ought to pester @Hutch-45Drives to start a sticky or pinned post with this and maybe a few other documents just to help anyone avoid the issue.

rymandle05 · December 15, 2023, 4:00am

“what the heck is causing this and how do I fix it”

This is the way (of homelab)

rymandle05 · December 15, 2023, 4:11am

Well I found something else interesting this evening. I had been noticing a recurring message in dmesg from mpt3sas. Looking around online, it once again pointed to the same possible root causes as the DWORD issues noted above. In moving drives around, I noticed the message stopped once I removed a particular drive which was plugged into slot 11.

[Thu Dec 14 07:39:28 2023] mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
[Thu Dec 14 07:41:10 2023] mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
[Thu Dec 14 08:09:29 2023] mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
[Thu Dec 14 09:48:30 2023] mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
[Thu Dec 14 09:55:57 2023] mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
[Thu Dec 14 10:58:27 2023] mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
[Thu Dec 14 11:04:42 2023] mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
[Thu Dec 14 11:15:34 2023] mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)

I think before I do all the other troubleshooting steps, I’m going to try again without this drive. I suppose it’s possible a “noisy” drive could be impacting communication of the other drives also on that same cable back to the SAS controller. That would account for errors showing up on the other drives in some of my tests.

rymandle05 · December 15, 2023, 6:08pm

I really thought I had it figured out but, alas, I just experienced write errors on slots 1-15 and 1-11 near the end of the fourth test with a 7 wide pool, without that one drive that was periodically outputting to dmesg.

Every 2.0s: sudo zpool status -v                                   hl15: Fri Dec 15 13:20:29 2023

  pool: TEST7Z2
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
config:

       	NAME        STATE     READ WRITE CKSUM
        TEST7Z2     DEGRADED	 0     0     0
          raidz2-0  DEGRADED	 0     0     0
            1-9     ONLINE	 0     0     0
            1-10    ONLINE	 0     0     0
            1-11    ONLINE	 0     1     0
            1-12    ONLINE	 0     0     0
            1-13    ONLINE	 0     0     0
            1-14    ONLINE	 0     0     0
            1-15    FAULTED	 0    25     0  too many errors

errors: No known data errors
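When running many benchmark passes, it can help to pull the non-zero rows out of saved `zpool status` text automatically. This is a minimal sketch, not a ZFS API: the function name is hypothetical, it only reads the config table, and it skips rows whose counters are not plain integers.

```python
def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for config rows with a nonzero counter."""
    flagged = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True  # the table starts after this header row
            continue
        if line.startswith("errors:"):
            in_config = False  # the table ends at the trailing errors line
        if not in_config or len(fields) < 5:
            continue
        name, _state, read, write, cksum = fields[:5]
        if not (read.isdigit() and write.isdigit() and cksum.isdigit()):
            continue  # skip anything that is not a plain integer count
        counts = (int(read), int(write), int(cksum))
        if any(counts):
            flagged[name] = counts
    return flagged


# Rows copied from the status output above.
STATUS_SAMPLE = """\
        NAME        STATE     READ WRITE CKSUM
        TEST7Z2     DEGRADED     0     0     0
          raidz2-0  DEGRADED     0     0     0
            1-11    ONLINE       0     1     0
            1-12    ONLINE       0     0     0
            1-15    FAULTED      0    25     0  too many errors

errors: No known data errors
"""

print(devices_with_errors(STATUS_SAMPLE))
```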

dmesg has the same log_info message from mpt3sas I saw from the other drive, but it only happened right before the blk_update_request message instead of periodically on its own.

[Fri Dec 15 12:52:08 2023] mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
[Fri Dec 15 12:52:08 2023] sd 1:0:15:0: [sda] tag#2525 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=0s
[Fri Dec 15 12:52:08 2023] sd 1:0:15:0: [sda] tag#2525 CDB: Write(16) 8a 00 00 00 00 00 55 cd 4f a8 00 00 04 10 00 00
[Fri Dec 15 12:52:08 2023] blk_update_request: I/O error, dev sda, sector 1439518632 op 0x1:(WRITE) flags 0x700 phys_seg 88 prio class 0
[Fri Dec 15 12:52:08 2023] zio pool=TEST7Z2 vdev=/dev/disk/by-vdev/1-15-part1 error=5 type=2 offset=737032491008 size=532480 flags=40080c80
[Fri Dec 15 13:08:56 2023] mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
[Fri Dec 15 13:08:56 2023] sd 1:0:21:0: [sdc] tag#247 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=0s
[Fri Dec 15 13:08:56 2023] sd 1:0:21:0: [sdc] tag#247 CDB: Write(16) 8a 00 00 00 00 00 5f 27 a4 80 00 00 00 18 00 00
[Fri Dec 15 13:08:56 2023] blk_update_request: I/O error, dev sdc, sector 1596433536 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
[Fri Dec 15 13:08:56 2023] zio pool=TEST7Z2 vdev=/dev/disk/by-vdev/1-11-part1 error=5 type=2 offset=817372921856 size=12288 flags=180880
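The CDB bytes in these log lines can be cross-checked against the sector and size the kernel reports. In a Write(16) command, byte 0 is the opcode (0x8a), bytes 2-9 hold the logical block address, and bytes 10-13 hold the transfer length in blocks, all big-endian. The sketch below assumes 512-byte logical blocks; the function name is made up for illustration.

```python
BLOCK_SIZE = 512  # assumption: drives formatted with 512-byte logical blocks


def decode_write16(cdb_hex):
    """Decode a SCSI Write(16) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x8A:
        raise ValueError("not a 16-byte Write(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2-9: logical block address
    blocks = int.from_bytes(cdb[10:14], "big")  # bytes 10-13: transfer length
    return lba, blocks


# First failed write from the log above (dev sda, slot 1-15).
lba, blocks = decode_write16("8a 00 00 00 00 00 55 cd 4f a8 00 00 04 10 00 00")
print(lba, blocks * BLOCK_SIZE)
```

For the first failed write this gives LBA 1439518632 and 532480 bytes, matching the `sector` and `size` fields in the blk_update_request and zio lines, so the failed command is exactly the write ZFS issued rather than some retry or split.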

I marked the three drives that passed 10 tests so I’ll take those out plus the one in slot 1-15 and run a series of tests against those in a smaller vdev. I think I’ll dig out that spare PSU as well.

rymandle05 · December 15, 2023, 6:12pm

Forget that last part about the three “good” drives plus what’s in 1-15. I just pulled out the server and was reminded that the drive in 1-15 was one of the drives that made it through 10 benchmarks in a three wide pool.

cathal1201 · December 16, 2023, 5:43pm

Interesting read, and I’m sorry you’re having problems, but I’ve had similar problems with several different SAS controllers. I have seen some funky things with SAS/SATA controllers after updating the firmware.

I would test the controller again and make sure it’s not the one causing problems. I read somewhere that Program SAS Address and SAS Address High are super important for some disks; I just can’t find that post again.

I myself ensured that the firmware was reloaded and that all parameters were set correctly - Look for SAS3Flash Utility, Quick Reference Guide, Version 1.0, October 2014. It explains all the commands in detail. Among other things, there are some really good test options for testing “test link stats” - try them

Otherwise, I am convinced that it is a fault on the controller, without being 100% sure, but it seems that way to me. Do you have another controller? If not, of course it’s hard to test.

The PSU theory sounds strange and very interesting; it might be worth looking into more. I just have no idea how to test it without trying another PSU, as it just doesn’t give any feedback on the problem.

Then it struck me about the odd-numbered bays: it makes no sense, other than that vibrations between disks may be the cause, and spreading them out means fewer vibrations? But I also think that is far-fetched.

rymandle05 · December 16, 2023, 6:29pm

I appreciate you weighing in @cathal1201. I’ve used sas3flash before to change some Dell controllers to IT mode so I’m familiar enough to be dangerous. I think it generally looks good, but I did notice that the SAS address alphabetical characters were lower case when they are upper case on the sticker. Unfortunately, I don’t have a PCIe SAS controller to rule out the onboard SAS3008 I’m using.

Avago Technologies SAS3 Flash Utility
Version 16.00.00.00 (2017.05.02) 
Copyright 2008-2017 Avago Technologies. All rights reserved.

        Adapter Selected is a Avago SAS: SAS3008(C0)

        Controller Number              : 0
        Controller                     : SAS3008(C0)
        PCI Address                    : 00:19:00:00
        SAS Address                    : 5003048-#-#x##-x##x
        NVDATA Version (Default)       : 0e.01.30.28
        NVDATA Version (Persistent)    : 0e.01.30.28
        Firmware Product ID            : 0x2221 (IT)
        Firmware Version               : 16.00.10.00
        NVDATA Vendor                  : LSI
        NVDATA Product ID              : LSI3008-IT
        BIOS Version                   : 08.37.00.00
        UEFI BSD Version               : 18.00.00.00
        FCODE Version                  : N/A
        Board Name                     : LSI3008-IT
        Board Assembly                 : N/A
        Board Tracer Number            : N/A

        Finished Processing Commands Successfully.
        Exiting SAS3Flash.

I found a link to the sas3flash documentation. I’ll look at those test parameters and give those a try. Thanks for the lead on that!

rymandle05 · December 20, 2023, 2:54am

It’s been a busy few days for me so I haven’t had a chance to try a different power supply but I have been plugging away at some other troubleshooting.

I tried some of the sas3flash tests (--testfw and --testlsall) and they all came back fine. I gather from the documentation they are pretty rudimentary tests so not really surprising.

As a Hail Mary, I tried firmware 16.00.12.00, documented in a truenas.com community post. The firmware notes specifically call out a fixed issue with SATA drives so, again, no surprise that it didn’t help my situation.

mpt3sas in dmesg always reports 0x31120303 alongside the I/O error from the drive (still always on odd slot drives).

Dec 17 19:13:45 hl15 kernel: sd 10:0:2:0: [sdc] tag#1772 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=0s
Dec 17 19:13:45 hl15 kernel: sd 10:0:2:0: [sdc] tag#1772 CDB: Write(16) 8a 00 00 00 00 00 b2 a9 71 e8 00 00 00 50 00 00
Dec 17 19:13:45 hl15 kernel: I/O error, dev sdc, sector 2997449192 op 0x1:(WRITE) flags 0x700 phys_seg 6 prio class 2
Dec 17 19:13:45 hl15 kernel: zio pool=TEST7Z2 vdev=/dev/disk/by-vdev/1-11-part1 error=5 type=2 offset=1534692937728 size=40960 flags=40080c80
Dec 17 19:30:24 hl15 kernel: mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
Dec 17 19:30:24 hl15 kernel: sd 10:0:2:0: [sdc] tag#2920 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=0s
Dec 17 19:30:24 hl15 kernel: sd 10:0:2:0: [sdc] tag#2920 CDB: Write(16) 8a 00 00 00 00 00 d4 6a 4a 90 00 00 00 38 00 00
Dec 17 19:30:24 hl15 kernel: I/O error, dev sdc, sector 3563735696 op 0x1:(WRITE) flags 0x700 phys_seg 3 prio class 2
Dec 17 19:30:24 hl15 kernel: zio pool=TEST7Z2 vdev=/dev/disk/by-vdev/1-11-part1 error=5 type=2 offset=1824631627776 size=28672 flags=180880
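To check the “always on odd slots” pattern across a long run, the zio lines can be tallied per by-vdev slot. This is a sketch that assumes the log lines follow the format shown above; the function name is hypothetical and the sample holds the two zio lines from this excerpt.

```python
import re
from collections import Counter

# Captures the slot alias from lines like ".../by-vdev/1-11-part1 error=5".
ZIO_SLOT_RE = re.compile(r"zio pool=\S+ vdev=/dev/disk/by-vdev/(\S+?)-part\d+ error=\d+")


def errors_per_slot(log_text):
    """Count zio error lines per by-vdev slot alias."""
    return Counter(ZIO_SLOT_RE.findall(log_text))


def is_odd_slot(alias):
    """True when the number after the last '-' in the alias is odd."""
    return int(alias.rsplit("-", 1)[1]) % 2 == 1


LOG_SAMPLE = """\
Dec 17 19:13:45 hl15 kernel: zio pool=TEST7Z2 vdev=/dev/disk/by-vdev/1-11-part1 error=5 type=2 offset=1534692937728 size=40960 flags=40080c80
Dec 17 19:30:24 hl15 kernel: zio pool=TEST7Z2 vdev=/dev/disk/by-vdev/1-11-part1 error=5 type=2 offset=1824631627776 size=28672 flags=180880
"""

for slot, count in errors_per_slot(LOG_SAMPLE).items():
    print(slot, count, "odd" if is_odd_slot(slot) else "even")
```

Running it over the full journal rather than an excerpt would show whether any even slot ever logs a write error.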

I found a utility that helps decode the message - lsi_decode_loginfo. I ran it and received back “PL_LOGINFO_SUB_CODE_WRONG_REL_OFF_OR_FRAME_LENGTH”.

~$ python3 ./lsi_decode_loginfo.py 0x31120303
Value     	31120303h
Type:     	30000000h	SAS 
Origin:   	01000000h	PL 
Code:     	00120000h	PL_LOGINFO_CODE_ABORT See Sub-Codes below (PL_LOGINFO_SUB_CODE)
Sub Code: 	00000300h	PL_LOGINFO_SUB_CODE_WRONG_REL_OFF_OR_FRAME_LENGTH 
Unparsed  	00000003h

A little googling around and I found this description for that sub code on page 28 of (PDF) LSI SAS Error Codes - DOKUMEN.TIPS.

Firmware detected unexpected relative offset or wrong frame length. Aborting the command.

I think this puts me squarely in the camp of hardware (sas controller \ cable \ power) or firmware problem (HP firmware on drives). I was able to find some new standard Seagate drives (st8000nm0075) on eBay. I grabbed four of them for $100 each. That’s about 30% more than the used drives cost me but that seemed like a fair price for being new. Unfortunately, they came formatted 512 with Protection Type 2 enabled so I’m working on reformatting them without the protection.
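The fields the kernel prints for this message come from splitting the 32-bit log_info value into bit fields: the top 4 bits are the bus type, the next 4 the originator, the next 8 the code, and the low 16 the sub code. Here is a small sketch of that split; the function name is made up, and the originator table covers only the three names I am sure of.

```python
ORIGINATORS = {0: "IOP", 1: "PL", 2: "IR"}


def split_loginfo(value):
    """Split a 32-bit mpt3sas log_info value into its bit fields."""
    originator = (value >> 24) & 0xF
    return {
        "bus_type": (value >> 28) & 0xF,
        "originator": ORIGINATORS.get(originator, hex(originator)),
        "code": (value >> 16) & 0xFF,
        "sub_code": value & 0xFFFF,
    }


fields = split_loginfo(0x31120303)
print({k: (hex(v) if isinstance(v, int) else v) for k, v in fields.items()})
```

For 0x31120303 this gives originator PL, code 0x12 and sub code 0x0303, the same values as in the dmesg line. The decoder output above further splits the sub code into 0x0300 plus a low byte it leaves unparsed.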

Hutch-45Drives · December 20, 2023, 12:44pm

Hi @rymandle05, did you order a full build from us? If so, only half the slots in the server support SAS drives.

The other half would be SATA connections. Yes SAS does work in SATA but you will not be getting the benefits of SAS and our backplane also does not support SAS features. Please let me know what the results are with the Seagate drives and if you are still having issues please reach out to info@45homelab.com

rymandle05 · December 20, 2023, 1:11pm

Hey @Hutch-45Drives - yep I ordered the full build from you. I emailed info@45homelab.com last Monday (Dec 11), but have yet to hear back beyond @Vikram-45HomeLab acknowledging my email.

I am aware that only half the slots are SAS enabled. I am using slots 1-09 through 1-15 which are connected to the motherboard’s SAS3008 controller. If/when I expand, I’d get sata drives for the other side of the backplane connected to the motherboard’s sata controller.

Hutch-45Drives · December 21, 2023, 12:54pm

I’ll poke @Vikram-45HomeLab and see what came of your email.

rymandle05 · December 25, 2023, 9:58pm

Merry Christmas Everyone!

I’ve been continuing to test, but now with brand new Seagate branded st8000nm0075 SAS drives as suggested by Corey with 45Drives. Unfortunately, results are pretty much the same. The drives in odd numbered slots, but mostly drives in slot 1-11, continue to experience ZFS write errors after a period of time. On December 21, I did a test with a 4 wide pool in all odd slots and after a matter of minutes had the 1-11 drive faulted.

Every 2.0s: zpool status -v                       hl15: Thu Dec 21 20:11:47 2023

  pool: SEAGATE4ODDZ2
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
config:

        NAME           STATE     READ WRITE CKSUM
        SEAGATE4ODDZ2  DEGRADED     0     0     0
          raidz2-0     DEGRADED     0     0     0
            1-9        ONLINE       0     0     0
            1-11       FAULTED      0    17     0  too many errors
            1-13       ONLINE       0     0     0
            1-15       ONLINE       0     0     0

errors: No known data errors

When I check dmesg, mpt3sas reports the same 0x31120303 error code documented above.

[Thu Dec 21 20:10:27 2023] mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
[Thu Dec 21 20:10:27 2023] sd 10:0:7:0: [sdc] tag#767 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=0s
[Thu Dec 21 20:10:27 2023] sd 10:0:7:0: [sdc] tag#767 CDB: Write(16) 8a 00 00 00 00 00 00 94 50 a8 00 00 04 00 00 00
[Thu Dec 21 20:10:27 2023] I/O error, dev sdc, sector 9719976 op 0x1:(WRITE) flags 0x700 phys_seg 5 prio class 2
[Thu Dec 21 20:10:27 2023] zio pool=SEAGATE4ODDZ2 vdev=/dev/disk/by-vdev/1-11-part1 error=5 type=2 offset=4975579136 size=524288 flags=180880
[Thu Dec 21 20:10:28 2023] mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
[Thu Dec 21 20:10:28 2023] sd 10:0:7:0: [sdc] tag#645 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=0s
[Thu Dec 21 20:10:28 2023] sd 10:0:7:0: [sdc] tag#645 CDB: Write(16) 8a 00 00 00 00 00 00 94 f0 f8 00 00 08 08 00 00
[Thu Dec 21 20:10:28 2023] I/O error, dev sdc, sector 9761016 op 0x1:(WRITE) flags 0x700 phys_seg 122 prio class 2
[Thu Dec 21 20:10:28 2023] zio pool=SEAGATE4ODDZ2 vdev=/dev/disk/by-vdev/1-11-part1 error=5 type=2 offset=4996591616 size=1052672 flags=40080c80
[Thu Dec 21 20:10:30 2023] mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
[Thu Dec 21 20:10:30 2023] sd 10:0:7:0: [sdc] tag#702 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=0s
[Thu Dec 21 20:10:30 2023] sd 10:0:7:0: [sdc] tag#702 CDB: Write(16) 8a 00 00 00 00 00 00 a4 a0 d0 00 00 04 00 00 00
[Thu Dec 21 20:10:30 2023] I/O error, dev sdc, sector 10789072 op 0x1:(WRITE) flags 0x700 phys_seg 9 prio class 2
[Thu Dec 21 20:10:30 2023] zio pool=SEAGATE4ODDZ2 vdev=/dev/disk/by-vdev/1-11-part1 error=5 type=2 offset=5522956288 size=524288 flags=180880
[Thu Dec 21 20:10:43 2023] mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
[Thu Dec 21 20:10:43 2023] sd 10:0:7:0: [sdc] tag#659 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=0s
[Thu Dec 21 20:10:43 2023] sd 10:0:7:0: [sdc] tag#659 CDB: Write(16) 8a 00 00 00 00 00 05 32 45 a8 00 00 08 00 00 00
[Thu Dec 21 20:10:43 2023] I/O error, dev sdc, sector 87180712 op 0x1:(WRITE) flags 0x700 phys_seg 65 prio class 2
[Thu Dec 21 20:10:43 2023] zio pool=SEAGATE4ODDZ2 vdev=/dev/disk/by-vdev/1-11-part1 error=5 type=2 offset=44635475968 size=1048576 flags=40080c80
[Thu Dec 21 20:10:44 2023] mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
[Thu Dec 21 20:10:44 2023] sd 10:0:7:0: [sdc] tag#735 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=0s
[Thu Dec 21 20:10:44 2023] sd 10:0:7:0: [sdc] tag#735 CDB: Write(16) 8a 00 00 00 00 00 05 3a 51 a8 00 00 08 08 00 00
[Thu Dec 21 20:10:44 2023] I/O error, dev sdc, sector 87708072 op 0x1:(WRITE) flags 0x700 phys_seg 35 prio class 2
[Thu Dec 21 20:10:44 2023] zio pool=SEAGATE4ODDZ2 vdev=/dev/disk/by-vdev/1-11-part1 error=5 type=2 offset=44905484288 size=1052672 flags=40080c80
[Thu Dec 21 20:10:57 2023] mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
[Thu Dec 21 20:10:57 2023] sd 10:0:7:0: [sdc] tag#658 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=0s
[Thu Dec 21 20:10:57 2023] sd 10:0:7:0: [sdc] tag#658 CDB: Write(16) 8a 00 00 00 00 00 07 0f bc 68 00 00 08 08 00 00
[Thu Dec 21 20:10:57 2023] I/O error, dev sdc, sector 118471784 op 0x1:(WRITE) flags 0x700 phys_seg 35 prio class 2
[Thu Dec 21 20:10:57 2023] zio pool=SEAGATE4ODDZ2 vdev=/dev/disk/by-vdev/1-11-part1 error=5 type=2 offset=60656504832 size=1052672 flags=40080c80

I’ve been running a 6 wide pool today and avoiding slot 1-11. It was looking really good and passed multiple FIO benchmark tests and file copies but just a little while ago experienced write errors on 1-9.

Every 2.0s: zpool status -v                            hl15: Mon Dec 25 15:49:23 2023

  pool: SEAGATE6Z2
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: resilvered 55.5G in 00:06:58 with 0 errors on Mon Dec 25 09:14:09 2023
config:

        NAME        STATE     READ WRITE CKSUM
        SEAGATE6Z2  DEGRADED     0     0     0
          raidz2-0  DEGRADED     0     0     0
            1-9     FAULTED      0    20     0  too many errors
            1-10    ONLINE       0     0     0
            1-12    ONLINE       0     0     0
            1-13    ONLINE       0     0     0
            1-14    ONLINE       0     0     0
            1-15    ONLINE       0     0     0

errors: No known data errors

Again, dmesg has the same error code reported by mpt3sas.

[Mon Dec 25 14:41:37 2023] mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
[Mon Dec 25 14:41:37 2023] sd 10:0:0:0: [sdc] tag#2883 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=0s
[Mon Dec 25 14:41:37 2023] sd 10:0:0:0: [sdc] tag#2883 CDB: Write(16) 8a 00 00 00 00 00 a8 9c 59 80 00 00 00 40 00 00
[Mon Dec 25 14:41:37 2023] I/O error, dev sdc, sector 2828818816 op 0x1:(WRITE) flags 0x700 phys_seg 2 prio class 2
[Mon Dec 25 14:41:37 2023] zio pool=SEAGATE6Z2 vdev=/dev/disk/by-vdev/1-9-part1 error=5 type=2 offset=1448354185216 size=32768 flags=180880
[Mon Dec 25 14:42:00 2023] mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
[Mon Dec 25 14:42:00 2023] sd 10:0:0:0: [sdc] tag#58 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=0s
[Mon Dec 25 14:42:00 2023] sd 10:0:0:0: [sdc] tag#58 CDB: Write(16) 8a 00 00 00 00 00 af 77 4c 80 00 00 04 80 00 00
[Mon Dec 25 14:42:00 2023] I/O error, dev sdc, sector 2943831168 op 0x1:(WRITE) flags 0x700 phys_seg 63 prio class 2
[Mon Dec 25 14:42:00 2023] zio pool=SEAGATE6Z2 vdev=/dev/disk/by-vdev/1-9-part1 error=5 type=2 offset=1507240509440 size=589824 flags=40080c80

I still haven’t tried a different power supply so I’ll give that a go until I hear back from Corey and 45Drives and their suggestion on what to try next.

pxpunx · December 26, 2023, 6:27pm

I have a similar issue, but with slots 1-11 and 1-15. The others appear to be fine.

I’ve rotated through numerous drives so far and continue to see the same results.

I will need to do some additional tests.

pxpunx · December 26, 2023, 6:41pm

Yep, same issue. Here’s a sample:

Dec 26 11:23:22 arnold kernel: mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
Dec 26 11:23:22 arnold kernel: sd 2:0:8:0: [sdf] tag#853 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=0s
Dec 26 11:23:22 arnold kernel: sd 2:0:8:0: [sdf] tag#853 CDB: Write(16) 8a 00 00 00 00 00 01 c0 88 f8 00 00 01 40 00 00
Dec 26 11:23:22 arnold kernel: blk_update_request: I/O error, dev sdf, sector 29395192 op 0x1:(WRITE) flags 0x700 phys_seg 18 prio class 0
Dec 26 11:23:22 arnold kernel: zio pool=pool0 vdev=/dev/disk/by-vdev/1-15-part1 error=5 type=2 offset=15049289728 size=163840 flags=40080caa
Dec 26 11:24:29 arnold kernel: mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
Dec 26 11:24:29 arnold kernel: sd 2:0:8:0: [sdf] tag#901 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=0s
Dec 26 11:24:29 arnold kernel: sd 2:0:8:0: [sdf] tag#901 CDB: Write(16) 8a 00 00 00 00 00 04 17 fd 10 00 00 01 40 00 00
Dec 26 11:24:29 arnold kernel: blk_update_request: I/O error, dev sdf, sector 68680976 op 0x1:(WRITE) flags 0x700 phys_seg 12 prio class 0
Dec 26 11:24:29 arnold kernel: zio pool=pool0 vdev=/dev/disk/by-vdev/1-15-part1 error=5 type=2 offset=35163611136 size=163840 flags=40080caa

Dec 26 11:29:17 arnold kernel: mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
Dec 26 11:29:17 arnold kernel: sd 2:0:9:0: [sda] tag#985 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=0s
Dec 26 11:29:17 arnold kernel: sd 2:0:9:0: [sda] tag#985 CDB: Write(16) 8a 00 00 00 00 00 01 74 28 00 00 00 01 40 00 00
Dec 26 11:29:17 arnold kernel: blk_update_request: I/O error, dev sda, sector 24389632 op 0x1:(WRITE) flags 0x700 phys_seg 7 prio class 0
Dec 26 11:29:17 arnold kernel: zio pool=pool0 vdev=/dev/disk/by-vdev/1-11-part1 error=5 type=2 offset=12486443008 size=163840 flags=40080caa
Dec 26 11:29:18 arnold kernel: mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
Dec 26 11:29:18 arnold kernel: sd 2:0:9:0: [sda] tag#965 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=0s
Dec 26 11:29:18 arnold kernel: sd 2:0:9:0: [sda] tag#965 CDB: Write(16) 8a 00 00 00 00 00 01 78 dd 40 00 00 01 40 00 00
Dec 26 11:29:18 arnold kernel: blk_update_request: I/O error, dev sda, sector 24698176 op 0x1:(WRITE) flags 0x700 phys_seg 18 prio class 0
Dec 26 11:29:18 arnold kernel: zio pool=pool0 vdev=/dev/disk/by-vdev/1-11-part1 error=5 type=2 offset=12644417536 size=163840 flags=40080caa
Dec 26 11:29:21 arnold kernel: mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
Dec 26 11:29:21 arnold kernel: sd 2:0:9:0: [sda] tag#1014 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=0s
Dec 26 11:29:21 arnold kernel: sd 2:0:9:0: [sda] tag#1014 CDB: Write(16) 8a 00 00 00 00 00 01 83 bb 00 00 00 01 40 00 00
Dec 26 11:29:21 arnold kernel: blk_update_request: I/O error, dev sda, sector 25410304 op 0x1:(WRITE) flags 0x700 phys_seg 18 prio class 0
Dec 26 11:29:21 arnold kernel: zio pool=pool0 vdev=/dev/disk/by-vdev/1-11-part1 error=5 type=2 offset=13009027072 size=163840 flags=40080caa
Dec 26 11:29:22 arnold kernel: mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
Dec 26 11:29:22 arnold kernel: sd 2:0:9:0: [sda] tag#968 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=0s
Dec 26 11:29:22 arnold kernel: sd 2:0:9:0: [sda] tag#968 CDB: Write(16) 8a 00 00 00 00 00 01 87 ba 00 00 00 01 40 00 00
Dec 26 11:29:22 arnold kernel: blk_update_request: I/O error, dev sda, sector 25672192 op 0x1:(WRITE) flags 0x700 phys_seg 15 prio class 0
Dec 26 11:29:22 arnold kernel: zio pool=pool0 vdev=/dev/disk/by-vdev/1-11-part1 error=5 type=2 offset=13143113728 size=163840 flags=40080caa
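Since kernel device names like sda and sdf can change between boots, it helps to tie each one to its physical slot before comparing against smartctl output. The sketch below pairs each “I/O error, dev” line with the zio line that follows it. It assumes the two lines always appear in that order, as they do in this log, and the function name is hypothetical.

```python
import re

IO_RE = re.compile(r"I/O error, dev (\w+),")
ZIO_RE = re.compile(r"zio pool=\S+ vdev=/dev/disk/by-vdev/(\S+?)-part\d+ ")


def device_slots(log_text):
    """Map each kernel device name to the slot(s) named on the following zio line."""
    mapping = {}
    pending_dev = None
    for line in log_text.splitlines():
        io_match = IO_RE.search(line)
        if io_match:
            pending_dev = io_match.group(1)  # remember until the zio line arrives
            continue
        zio_match = ZIO_RE.search(line)
        if zio_match and pending_dev is not None:
            mapping.setdefault(pending_dev, set()).add(zio_match.group(1))
            pending_dev = None
    return mapping


# Two error pairs copied from the sample above.
PAIR_SAMPLE = """\
Dec 26 11:23:22 arnold kernel: blk_update_request: I/O error, dev sdf, sector 29395192 op 0x1:(WRITE) flags 0x700 phys_seg 18 prio class 0
Dec 26 11:23:22 arnold kernel: zio pool=pool0 vdev=/dev/disk/by-vdev/1-15-part1 error=5 type=2 offset=15049289728 size=163840 flags=40080caa
Dec 26 11:29:17 arnold kernel: blk_update_request: I/O error, dev sda, sector 24389632 op 0x1:(WRITE) flags 0x700 phys_seg 7 prio class 0
Dec 26 11:29:17 arnold kernel: zio pool=pool0 vdev=/dev/disk/by-vdev/1-11-part1 error=5 type=2 offset=12486443008 size=163840 flags=40080caa
"""

print(device_slots(PAIR_SAMPLE))
```

A device that maps to more than one slot in the result would indicate the names shifted during the capture.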

SMART tests show OK; ack’d that ZFS can expose issues that SMART may not.

# smartctl -x /dev/sdf
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-4.18.0-513.9.1.el8_9.x86_64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUH728080AL5200
Revision:             A515
Compliance:           SPC-4
User Capacity:        8,001,563,222,016 bytes [8.00 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca23b025958
Serial number:        2EG191HJ
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Tue Dec 26 11:37:14 2023 MST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
Read Cache is:        Enabled
Writeback Cache is:   Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     30 C
Drive Trip Temperature:        85 C

Manufactured in week 48 of year 2014
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  8
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  127690
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 620007469875200

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0      13536       1645.504           0
write:         0        0         0         0     301955       1839.323           0
verify:        0        0         0         0      94507          0.000           0

Non-medium error count:        0

No Self-tests have been logged

Background scan results log
  Status: scan is active
    Accumulated power on time, hours:minutes 19288:53 [1157333 minutes]
    Number of background scans performed: 116,  scan progress: 1.61%
    Number of background medium scans performed: 116

Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 1
  number of phys = 1
  phy identifier = 0
    attached device type: SAS or SATA device
    attached reason: loss of dword synchronization
    reason: unknown
    negotiated logical link rate: phy enabled; 12 Gbps
    attached initiator port: ssp=1 stp=1 smp=1
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5000cca23b025959
    attached SAS address = 0x500304801cfed211
    attached phy identifier = 6
    Invalid DWORD count = 181
    Running disparity error count = 173
    Loss of DWORD synchronization = 12
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 181
     Running disparity error count: 173
     Loss of dword synchronization count: 12
     Phy reset problem count: 0
relative target port id = 2
  generation code = 1
  number of phys = 1
  phy identifier = 1
    attached device type: no device attached
    attached reason: unknown
    reason: power on
    negotiated logical link rate: phy enabled; unknown
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5000cca23b02595a
    attached SAS address = 0x0
    attached phy identifier = 0
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 0
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 0
     Running disparity error count: 0
     Loss of dword synchronization count: 0
     Phy reset problem count: 0

# smartctl -x /dev/sda
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-4.18.0-513.9.1.el8_9.x86_64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUH728080AL5200
Revision:             A515
Compliance:           SPC-4
User Capacity:        8,001,563,222,016 bytes [8.00 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca23b05d224
Serial number:        2EG367EJ
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Tue Dec 26 11:37:47 2023 MST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
Read Cache is:        Enabled
Writeback Cache is:   Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     29 C
Drive Trip Temperature:        85 C

Manufactured in week 48 of year 2014
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  8
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  127563
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 425305076400128

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0       7595       1612.876           0
write:         0        0         0         0     169386       1494.826           0
verify:        0        0         0         0      39699          0.000           0

Non-medium error count:        0

No Self-tests have been logged

Background scan results log
  Status: scan is active
    Accumulated power on time, hours:minutes 19289:05 [1157345 minutes]
    Number of background scans performed: 116,  scan progress: 0.65%
    Number of background medium scans performed: 116

Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 1
  number of phys = 1
  phy identifier = 0
    attached device type: SAS or SATA device
    attached reason: loss of dword synchronization
    reason: unknown
    negotiated logical link rate: phy enabled; 12 Gbps
    attached initiator port: ssp=1 stp=1 smp=1
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5000cca23b05d225
    attached SAS address = 0x500304801cfed212
    attached phy identifier = 2
    Invalid DWORD count = 812
    Running disparity error count = 661
    Loss of DWORD synchronization = 15
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 812
     Running disparity error count: 661
     Loss of dword synchronization count: 15
     Phy reset problem count: 0
relative target port id = 2
  generation code = 1
  number of phys = 1
  phy identifier = 1
    attached device type: no device attached
    attached reason: unknown
    reason: power on
    negotiated logical link rate: phy enabled; unknown
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5000cca23b05d226
    attached SAS address = 0x0
    attached phy identifier = 0
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 0
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 0
     Running disparity error count: 0
     Loss of dword synchronization count: 0
     Phy reset problem count: 0

I had an issue with 1-15 before this and swapped the drive out, then noticed the issues with both 1-11 and 1-15 this AM. So far drive swap count is:

1-11: 3x
1-15: 4x

All drives are the same make/manufacturer. They’re old HGST 8TB He drives from an OpenStack environment. They have some hours on them, but they are nowhere near their EOL.

I have a ton of other disks I can swap out - 2TB and 4TB WD Enterprise, an 8TB Seagate IronWolf (brand new), etc. but at some point we’ll have to admit it’s not the drives.

DigitalGarden · December 26, 2023, 7:19pm

I’m just watching this thread. You have the full build (i.e., the X11SPH-NCTF)? Have you tested, or can you test, any SATA drives in the affected slots, or is your issue only with SAS drives?

pxpunx · December 26, 2023, 7:38pm

Yeah, I can and will do so. The 8TB IronWolf is SATA.

rymandle05 · December 26, 2023, 7:45pm

I know it doesn’t help, but it’s good to know I’m not the only one with a full build experiencing this issue. I have two 1TB Seagate Constellation ES.3 (ST1000NM0033) SATA drives. It’s good to test with SATA too, but since SAS drives are generally less expensive for the same capacity on eBay, I’d really like to have that option.

pxpunx · December 26, 2023, 7:47pm

Errors are basically the same, repeated multiple times per drive swap until the disk is offline’d. Here’s a sample from the last disk swap I did:

Dec 26 11:49:18 arnold kernel: sda: sda1 sda9
Dec 26 11:49:18 arnold kernel: sda: sda1 sda9
Dec 26 11:49:19 arnold zed[1440498]: eid=573 class=vdev_attach pool='pool0' vdev=1-11-part1 vdev_state=ONLINE
Dec 26 11:49:19 arnold zed[1440527]: eid=574 class=resilver_start pool='pool0'
Dec 26 11:49:20 arnold zed[1440598]: eid=576 class=config_sync pool='pool0'
Dec 26 11:49:21 arnold zed[1441388]: eid=577 class=vdev.unknown pool='pool0' vdev=old
Dec 26 11:49:21 arnold zed[1441389]: eid=578 class=statechange pool='pool0' vdev=old vdev_state=UNAVAIL
Dec 26 11:49:21 arnold zed[1441404]: error: statechange-notify.sh: eid=578: "mail" not installed
Dec 26 11:49:22 arnold kernel: mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
Dec 26 11:49:22 arnold kernel: sd 2:0:10:0: [sda] tag#1037 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=0s
Dec 26 11:49:22 arnold kernel: sd 2:0:10:0: [sda] tag#1037 CDB: Write(16) 8a 00 00 00 00 00 01 74 23 80 00 00 01 40 00 00
Dec 26 11:49:22 arnold kernel: blk_update_request: I/O error, dev sda, sector 24388480 op 0x1:(WRITE) flags 0x700 phys_seg 12 prio class 0
Dec 26 11:49:22 arnold kernel: zio pool=pool0 vdev=/dev/disk/by-vdev/1-11-part1 error=5 type=2 offset=12485853184 size=163840 flags=40080caa
Dec 26 11:49:22 arnold zed[1446158]: eid=579 class=io pool='pool0' vdev=1-11-part1 size=163840 offset=12485853184 priority=3 err=5 flags=0x40080caa
Dec 26 11:49:22 arnold zed[1446168]: eid=589 class=io pool='pool0' size=28672 offset=12481683456 priority=3 err=5 flags=0x1008aa bookmark=404:23:0:1513
Dec 26 11:49:22 arnold zed[1446174]: eid=580 class=io pool='pool0' vdev=1-11-part1 size=24576 offset=12485992448 priority=3 err=5 flags=0x3808aa bookmark=404:23:0:1517
Dec 26 11:49:22 arnold zed[1446176]: eid=587 class=io pool='pool0' vdev=1-11-part1 size=28672 offset=12485877760 priority=3 err=5 flags=0x3808aa bookmark=404:23:0:1513
Dec 26 11:49:22 arnold zed[1446177]: eid=588 class=io pool='pool0' size=28672 offset=12481712128 priority=3 err=5 flags=0x1008aa bookmark=404:23:0:1514
Dec 26 11:49:22 arnold zed[1446179]: eid=583 class=io pool='pool0' size=28672 offset=12481769472 priority=3 err=5 flags=0x1008aa bookmark=404:23:0:1516
Dec 26 11:49:22 arnold zed[1446172]: eid=581 class=io pool='pool0' vdev=1-11-part1 size=28672 offset=12485963776 priority=3 err=5 flags=0x3808aa bookmark=404:23:0:1516
Dec 26 11:49:22 arnold zed[1446178]: eid=582 class=io pool='pool0' size=24576 offset=12481798144 priority=3 err=5 flags=0x1008aa bookmark=404:23:0:1517
Dec 26 11:49:22 arnold zed[1446182]: eid=590 class=io pool='pool0' vdev=1-11-part1 size=24576 offset=12485853184 priority=3 err=5 flags=0x3808aa bookmark=404:23:0:1512
Dec 26 11:49:22 arnold zed[1446183]: eid=586 class=io pool='pool0' vdev=1-11-part1 size=28672 offset=12485906432 priority=3 err=5 flags=0x3808aa bookmark=404:23:0:1514
Dec 26 11:49:22 arnold zed[1446180]: eid=584 class=io pool='pool0' vdev=1-11-part1 size=28672 offset=12485935104 priority=3 err=5 flags=0x3808aa bookmark=404:23:0:1515
Dec 26 11:49:22 arnold zed[1446187]: eid=591 class=io pool='pool0' size=24576 offset=12481658880 priority=3 err=5 flags=0x1008aa bookmark=404:23:0:1512
Dec 26 11:49:22 arnold zed[1446185]: eid=585 class=io pool='pool0' size=28672 offset=12481740800 priority=3 err=5 flags=0x1008aa bookmark=404:23:0:1515

And here is the SMART data for the same drive, with a successful short test.

# smartctl -x /dev/sda
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-4.18.0-513.9.1.el8_9.x86_64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUH728080AL5200
Revision:             A515
Compliance:           SPC-4
User Capacity:        8,001,563,222,016 bytes [8.00 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca23b04961c
Serial number:        2EG2J5ZG
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Tue Dec 26 12:02:01 2023 MST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
Read Cache is:        Enabled
Writeback Cache is:   Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     32 C
Drive Trip Temperature:        85 C

Manufactured in week 48 of year 2014
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  3
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  156282
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 308671431049216

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0       2391        983.628           0
write:         0        0         0         0      22788        443.790           0
verify:        0        0         0         0      43816          0.000           0

Non-medium error count:        0

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   19288                 - [-   -    -]

Long (extended) Self-test duration: 65535 seconds [1092.2 minutes]

Background scan results log
  Status: scan is active
    Accumulated power on time, hours:minutes 19289:00 [1157340 minutes]
    Number of background scans performed: 115,  scan progress: 1.09%
    Number of background medium scans performed: 115

Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 1
  number of phys = 1
  phy identifier = 0
    attached device type: SAS or SATA device
    attached reason: loss of dword synchronization
    reason: unknown
    negotiated logical link rate: phy enabled; 12 Gbps
    attached initiator port: ssp=1 stp=1 smp=1
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5000cca23b04961d
    attached SAS address = 0x500304801cfed212
    attached phy identifier = 2
    Invalid DWORD count = 2600
    Running disparity error count = 2226
    Loss of DWORD synchronization = 247
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 2600
     Running disparity error count: 2226
     Loss of dword synchronization count: 247
     Phy reset problem count: 0
relative target port id = 2
  generation code = 1
  number of phys = 1
  phy identifier = 1
    attached device type: no device attached
    attached reason: unknown
    reason: power on
    negotiated logical link rate: phy enabled; unknown
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5000cca23b04961e
    attached SAS address = 0x0
    attached phy identifier = 0
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 0
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 0
     Running disparity error count: 0
     Loss of dword synchronization count: 0
     Phy reset problem count: 0

Broke down the errors:

Dec 26 11:49:22 arnold kernel: sd 2:0:10:0: [sda] tag#1037 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=0s

DID_SOFT_ERROR is a low-level driver error; retries are OK. This could mean any number of different issues as it appears to be invoked often…and I’m not a kernel developer.

https://github.com/search?q=repo%3Atorvalds%2Flinux+DID_SOFT_ERROR&type=code

Dec 26 11:49:22 arnold kernel: sd 2:0:10:0: [sda] tag#1037 CDB: Write(16) 8a 00 00 00 00 00 01 74 23 80 00 00 01 40 00 00

The CDB line shows which command failed: Write(16), SCSI opcode 0x8a.
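For anyone who wants to read the rest of those CDB bytes, here is a minimal sketch that decodes them. It assumes the standard WRITE(16) layout (opcode in byte 0, an 8-byte big-endian LBA in bytes 2-9, a 4-byte transfer length in bytes 10-13) and the 512-byte logical block size that smartctl reports for these drives. The function name `decode_write16` is made up for illustration.

```python
def decode_write16(cdb_hex: str, block_size: int = 512) -> dict:
    """Decode a SCSI WRITE(16) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x8A:
        raise ValueError("not a 16-byte WRITE(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2-9: starting logical block
    blocks = int.from_bytes(cdb[10:14], "big")  # bytes 10-13: transfer length in blocks
    return {"lba": lba, "blocks": blocks, "bytes": blocks * block_size}


# CDB copied from the kernel log line above
info = decode_write16("8a 00 00 00 00 00 01 74 23 80 00 00 01 40 00 00")
print(info)  # {'lba': 24388480, 'blocks': 320, 'bytes': 163840}
```

The decoded LBA matches the `sector 24388480` in the 11:49:22 blk_update_request line from the earlier log sample, and the 163840 bytes match the `size=163840` in the zio line at the same timestamp, so all three messages appear to describe the same failed write.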

Which is probably what prompts this:

Dec 26 11:49:44 arnold kernel: blk_update_request: I/O error, dev sda, sector 31078632 op 0x1:(WRITE) flags 0x700 phys_seg 18 prio class 0

This appears to indicate that either there is a physical error on the media or there is a problem between the media and the controller. SMART reports no media errors (the only counters climbing are the link-level DWORD/disparity ones), multiple drives have been swapped, and the errors persist on specific slots. At this point I’m inclined to believe it’s the backplane or the cables.
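Since the link-level counters in the "Protocol Specific port log page for SAS SSP" section are the only numbers that move, it may help to pull them out of the `smartctl -x` text and compare them between runs. This is a rough sketch, assuming the output format shown earlier in the thread. The helper name `phy_counters` and the trimmed sample text are illustrative only.

```python
import re

# Match "phy identifier = N" but not "attached phy identifier = N"
PHY_RE = re.compile(r"^\s*phy identifier = (\d+)\s*$")
COUNTER_RE = re.compile(
    r"^\s*(Invalid DWORD count|Running disparity error count|"
    r"Loss of DWORD synchronization|Phy reset problem) = (\d+)\s*$"
)


def phy_counters(text: str) -> dict:
    """Return {phy_id: {counter_name: value}} from smartctl -x output."""
    phys = {}
    current = None
    for line in text.splitlines():
        m = PHY_RE.match(line)
        if m:
            current = int(m.group(1))
            phys[current] = {}
            continue
        m = COUNTER_RE.match(line)
        if m and current is not None:
            phys[current][m.group(1)] = int(m.group(2))
    return phys


# Trimmed excerpt of the second smartctl dump in this thread
sample = """\
  phy identifier = 0
    attached phy identifier = 2
    Invalid DWORD count = 2600
    Running disparity error count = 2226
    Loss of DWORD synchronization = 247
    Phy reset problem = 0
  phy identifier = 1
    attached phy identifier = 0
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 0
    Phy reset problem = 0
"""
counters = phy_counters(sample)
print(counters[0])
```

Saving the result before and after a benchmark run and diffing the two dictionaries would show whether a given slot keeps accumulating errors while the media-level counters stay at zero.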

And then there is this:

Dec 26 11:49:30 arnold kernel: zio pool=pool0 vdev=/dev/disk/by-vdev/1-11-part1 error=5 type=2 offset=14774411264 size=163840 flags=40080caa

error=5 is an I/O (read/write) error. I couldn’t find a ZFS-specific error list or other reference, but EIO (errno 5) is referenced in a number of places.
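A quick way to check what error number 5 maps to on a given machine is Python's standard `errno` table. This is only a lookup sketch; it assumes the `error=5` / `err=5` values in the zio and zed lines are ordinary errno codes, which is what the EIO references suggest.

```python
import errno
import os

code = 5  # value taken from "error=5" / "err=5" in the log lines above
name = errno.errorcode[code]        # symbolic errno name for this platform
print(name, "-", os.strerror(code)) # the message text is platform-dependent
```

On Linux the symbolic name comes back as `EIO`, the generic "the I/O failed" code, which fits the idea that ZFS is just passing along the failure reported by the block layer.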

What’s interesting is that in my case it’s the 3rd slot on both cables/ports.

1-9 : CABLE 1, SLOT/PORT 0
1-10 : CABLE 1, SLOT/PORT 1
1-11 : CABLE 1, SLOT/PORT 2
1-12 : CABLE 1, SLOT/PORT 3
1-13 : CABLE 2, SLOT/PORT 0
1-14 : CABLE 2, SLOT/PORT 1
1-15 : CABLE 2, SLOT/PORT 2
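The listing above follows a simple pattern, sketched here in case it helps others map their own slots. It assumes four lanes per cable and that slot 1-9 is lane 0 of cable 1, exactly as listed; the function name `slot_to_lane` is hypothetical, and the mapping may differ on other builds.

```python
PORTS_PER_CABLE = 4  # assumption: four lanes per cable, per the listing above
FIRST_SLOT = 9       # slot 1-9 is the first SAS slot in the listing
SAS_SLOTS = 7        # slots 1-9 through 1-15


def slot_to_lane(slot_label: str) -> tuple:
    """Map a slot label like '1-11' to (cable number, port on that cable)."""
    index = int(slot_label.split("-")[1]) - FIRST_SLOT
    if not 0 <= index < SAS_SLOTS:
        raise ValueError(f"{slot_label} is not one of the SAS slots")
    return index // PORTS_PER_CABLE + 1, index % PORTS_PER_CABLE


for slot in ("1-11", "1-15"):
    print(slot, slot_to_lane(slot))
```

Both failing slots come out as port 2 on their respective cables, which is the coincidence noted above.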

I have the 8TB IronWolf SATA disk in now and so far it hasn’t shown any issues, which I find very odd.