It’s been a busy few days for me, so I haven’t had a chance to try a different power supply, but I have been plugging away at some other troubleshooting.
I tried some of the sas3flash tests (--testfw and --testlsall) and they all came back fine. I gather from the documentation that they are pretty rudimentary tests, so that’s not really surprising.
As a Hail Mary, I tried firmware 16.00.12.00, which is documented in a truenas.com community post. That firmware specifically calls out a fix for an issue with SATA drives so, again, no surprise that it didn’t help my situation.
mpt3sas in dmesg always reports 0x31120303 alongside the I/O error from the drive (still always on odd-slot drives).
Dec 17 19:13:45 hl15 kernel: sd 10:0:2:0: [sdc] tag#1772 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=0s
Dec 17 19:13:45 hl15 kernel: sd 10:0:2:0: [sdc] tag#1772 CDB: Write(16) 8a 00 00 00 00 00 b2 a9 71 e8 00 00 00 50 00 00
Dec 17 19:13:45 hl15 kernel: I/O error, dev sdc, sector 2997449192 op 0x1:(WRITE) flags 0x700 phys_seg 6 prio class 2
Dec 17 19:13:45 hl15 kernel: zio pool=TEST7Z2 vdev=/dev/disk/by-vdev/1-11-part1 error=5 type=2 offset=1534692937728 size=40960 flags=40080c80
Dec 17 19:30:24 hl15 kernel: mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
Dec 17 19:30:24 hl15 kernel: sd 10:0:2:0: [sdc] tag#2920 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=0s
Dec 17 19:30:24 hl15 kernel: sd 10:0:2:0: [sdc] tag#2920 CDB: Write(16) 8a 00 00 00 00 00 d4 6a 4a 90 00 00 00 38 00 00
Dec 17 19:30:24 hl15 kernel: I/O error, dev sdc, sector 3563735696 op 0x1:(WRITE) flags 0x700 phys_seg 3 prio class 2
Dec 17 19:30:24 hl15 kernel: zio pool=TEST7Z2 vdev=/dev/disk/by-vdev/1-11-part1 error=5 type=2 offset=1824631627776 size=28672 flags=180880
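As a sanity check on those log lines, the Write(16) CDB bytes can be decoded by hand and compared with the sector the kernel reports. Here is a minimal sketch, assuming the standard WRITE(16) layout (opcode 0x8a, 8-byte big-endian LBA in bytes 2-9, 4-byte transfer length in bytes 10-13) and 512-byte logical blocks. The function name parse_write16 is just something I made up for illustration.

```python
from typing import Tuple

BLOCK_SIZE = 512  # assumption: 512-byte logical blocks


def parse_write16(cdb_hex: str) -> Tuple[int, int]:
    """Return (lba, block_count) from a WRITE(16) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)  # whitespace between bytes is ignored
    if len(cdb) != 16 or cdb[0] != 0x8A:
        raise ValueError("not a 16-byte WRITE(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2-9: logical block address
    blocks = int.from_bytes(cdb[10:14], "big")  # bytes 10-13: transfer length
    return lba, blocks


# The two CDBs copied from the dmesg output above
for cdb_hex in (
    "8a 00 00 00 00 00 b2 a9 71 e8 00 00 00 50 00 00",
    "8a 00 00 00 00 00 d4 6a 4a 90 00 00 00 38 00 00",
):
    lba, blocks = parse_write16(cdb_hex)
    print(f"lba={lba} blocks={blocks} bytes={blocks * BLOCK_SIZE}")
```

The decoded LBAs (2997449192 and 3563735696) match the sector numbers in the I/O error lines, and the byte counts (40960 and 28672) match the size fields on the zio lines, so the kernel, the CDB, and ZFS all agree on which writes failed.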
I found a utility that helps decode the message - lsi_decode_loginfo. I ran it and received back “PL_LOGINFO_SUB_CODE_WRONG_REL_OFF_OR_FRAME_LENGTH”.
~$ python3 ./lsi_decode_loginfo.py 0x31120303
Value 31120303h
Type: 30000000h SAS
Origin: 01000000h PL
Code: 00120000h PL_LOGINFO_CODE_ABORT See Sub-Codes below (PL_LOGINFO_SUB_CODE)
Sub Code: 00000300h PL_LOGINFO_SUB_CODE_WRONG_REL_OFF_OR_FRAME_LENGTH
Unparsed 00000003h
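For anyone curious how that breakdown falls out of the raw value, here is a minimal sketch of the bit-field split. The masks are inferred from the decoder output above (Type 30000000h, Origin 01000000h, Code 00120000h), not taken from the lsi_decode_loginfo source, and split_loginfo is a hypothetical name. I treat the whole low 16 bits as the sub code, which is how the mpt3sas dmesg line prints it (sub_code(0x0303)).

```python
# Field masks inferred from the decoder output; treat them as assumptions.
TYPE_MASK = 0xF0000000     # bus type (3 was reported as SAS)
ORIGIN_MASK = 0x0F000000   # originator (1 was reported as PL)
CODE_MASK = 0x00FF0000     # code
SUBCODE_MASK = 0x0000FFFF  # sub code, as printed by mpt3sas


def split_loginfo(value: int) -> dict:
    """Split a 32-bit log_info word into its four fields."""
    return {
        "type": (value & TYPE_MASK) >> 28,
        "origin": (value & ORIGIN_MASK) >> 24,
        "code": (value & CODE_MASK) >> 16,
        "sub_code": value & SUBCODE_MASK,
    }


for name, field in split_loginfo(0x31120303).items():
    print(f"{name}: 0x{field:x}")
```

That gives code 0x12 and sub code 0x303, which lines up with the code(0x12), sub_code(0x0303) that mpt3sas prints in dmesg.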
After a little googling around, I found this description for that sub code on page 28 of (PDF) LSI SAS Error Codes - DOKUMEN.TIPS:
Firmware detected unexpected relative offset or wrong frame length. Aborting the command.
I think this puts me squarely in the camp of a hardware problem (SAS controller / cable / power) or a firmware problem (HP firmware on the drives). I was able to find some new standard Seagate drives (ST8000NM0075) on eBay. I grabbed four of them for $100 each. That’s about 30% more than the used drives cost me, but that seemed like a fair price for being new. Unfortunately, they came formatted 512 with Protection Type 2 enabled, so I’m working on reformatting them without the protection.