RESOLVED: ZFS Write Errors with HL15 Full Build and SAS Drives

Hutch-45Drives · December 27, 2023, 4:50pm

That line in the dmesg output comes up before the drive has issues, which means the HBA driver is having an issue first. I’d be curious if you could update the controller firmware.

rymandle05 · December 27, 2023, 5:01pm

Hey @Hutch-45Drives, that was one of the early steps I took. I’m running the latest I could find from Supermicro: SAS controller firmware 16.00.10.00. If you know of and have access to a newer version, I’d be happy to give that a try.

Avago Technologies SAS3 Flash Utility
Version 16.00.00.00 (2017.05.02) 
Copyright 2008-2017 Avago Technologies. All rights reserved.

        Adapter Selected is a Avago SAS: SAS3008(C0)

        Controller Number              : 0
        Controller                     : SAS3008(C0)
        PCI Address                    : 00:19:00:00
        SAS Address                    : 5003048-#-#x##-x##x
        NVDATA Version (Default)       : 0e.01.30.28
        NVDATA Version (Persistent)    : 0e.01.30.28
        Firmware Product ID            : 0x2221 (IT)
        Firmware Version               : 16.00.10.00
        NVDATA Vendor                  : LSI
        NVDATA Product ID              : LSI3008-IT
        BIOS Version                   : 08.37.00.00
        UEFI BSD Version               : 18.00.00.00
        FCODE Version                  : N/A
        Board Name                     : LSI3008-IT
        Board Assembly                 : N/A
        Board Tracer Number            : N/A

        Finished Processing Commands Successfully.
        Exiting SAS3Flash.

rymandle05 · December 27, 2023, 5:17pm

Oh, and I also tried an “unreleased” version (16.00.12.00) that was made available on the TrueNAS forums for a bug specific to SATA drives. Still no luck, so I reverted back to 16.00.10.00.

pxpunx · December 27, 2023, 5:50pm

Same, I’m on the latest release from Supermicro. It came with 15.x and I flashed to 16.00.10.00 when I switched to IT mode.

pxpunx · December 27, 2023, 5:55pm

Swapped to banks 0 and 1 on the backplane, same issues. Resilver caused read errors on 1-3 and write/checksum errors on 1-7, which correspond to 1-11 and 1-15 (3rd slot) on banks 2 and 3.

I’ll run some more tests.

rymandle05 · December 27, 2023, 5:57pm

Since I think @pxpunx is trying the cable move to enable SAS on slots 1-1 to 1-9, I’m doing a test with the drives running at SAS-2 6Gb/s speeds, with the cables still routed away from power and the backplane. I was able to use SeaChestUtilities to accomplish this, as I couldn’t find an option in the BIOS.

sudo ./SeaChest_Configure_x86_64-alpine-linux-musl_static -d all --phySpeed 3 --onlySeagate

Now all the drives are running at a negotiated speed of 6Gb/s, as verified using the -i flag.

hl15:~$ sudo ./SeaChestUtilities/Linux/Non-RAID/x86_64/SeaChest_Configure_x86_64-alpine-linux-musl_static -d /dev/sg2 --onlySeagate -i
==========================================================================================
 SeaChest_Configure - Seagate drive utilities - NVMe Enabled
 Copyright (c) 2014-2023 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
 SeaChest_Configure Version: 2.3.1-4_1_1 X86_64
 Build Date: Mar 27 2023
 Today: Wed Dec 27 11:50:13 2023	User: root
==========================================================================================

/dev/sg2 - ST8000NM0075 - ZA10M24Q - E004 - SCSI
	Vendor ID: SEAGATE 
	Model Number: ST8000NM0075    
	Serial Number: ZA10M24Q
	PCBA Serial Number: 0000R622H78Z
	Firmware Revision: E004
	World Wide Name: 5000C50084D54EE7
	Date Of Manufacture: Week 51, 12598
	Copyright: Copyright (c) 2017 Seagate All rights reserved 
	Drive Capacity (TB/TiB): 8.00/7.28
	Temperature Data:
		Current Temperature (C): 24
		Highest Temperature (C): Not Reported
		Lowest Temperature (C): Not Reported
	Power On Time:  8 days 21 hours 3 minutes 
	Power On Hours: 213.05
	MaxLBA: 15628053167
	Native MaxLBA: Not Reported
	Logical Sector Size (B): 512
	Physical Sector Size (B): 4096
	Sector Alignment: 0
	Rotation Rate (RPM): 7200
	Form Factor: 3.5"
	Last DST information:
		DST has never been run
	Long Drive Self Test Time:  13 hours 7 minutes 
	Interface speed:
		Port 0 (Current Port)
			Max Speed (GB/s): 12.0
			Negotiated Speed (Gb/s): 6.0
		Port 1
			Max Speed (GB/s): 12.0
			Negotiated Speed (Gb/s): Not Reported
	Annualized Workload Rate (TB/yr): 340.00
	Total Bytes Read (TB): 4.63
	Total Bytes Written (TB): 3.64
	Encryption Support: Not Supported
	Cache Size (MiB): Not Reported
	Read Look-Ahead: Enabled
	Non-Volatile Cache: Enabled
	Write Cache: Enabled
	SMART Status: Good
	ATA Security Information: Not Supported
	Firmware Download Support: Full, Segmented, Deferred
	Number of Logical Units: 1
	Specifications Supported:
		SPC-4
		SAM-5
		SAS-3
		SPL-3
		SPC-4
		SBC-3
	Features Supported:
		Protection Type 1
		Protection Type 2
		Persistent Reservations
		Application Client Logging
		Self Test
		Automatic Write Reassignment [Enabled]
		Automatic Read Reassignment [Enabled]
		EPC [Enabled]
		Informational Exceptions [Mode 0]
		Translate Address
		Rebuild Assist
		Seagate In Drive Diagnostics (IDD)
		Format Unit
		Sanitize
	Adapter Information:
		Adapter Type: PCI
		Vendor ID: 1000h
		Product ID: 0097h
		Revision: 0002h
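
For anyone checking several drives, the negotiated link speed can be pulled out of the `-i` output with a small script instead of reading each dump by eye. This is only a sketch: the function name `negotiated_speeds` is made up for illustration, and it assumes the "Interface speed" section always has the layout shown in the output above.

```python
import re


def negotiated_speeds(info_text):
    """Return {port_number: negotiated speed in Gb/s, or None if not reported}.

    Assumes the layout of the SeaChest `-i` output pasted above:
    a "Port N" line followed by a "Negotiated Speed (Gb/s): ..." line.
    """
    speeds = {}
    port = None
    for raw in info_text.splitlines():
        line = raw.strip()
        m = re.match(r"Port (\d+)", line)
        if m:
            port = int(m.group(1))
            continue
        m = re.match(r"Negotiated Speed \(Gb/s\):\s*(.+)", line)
        if m and port is not None:
            try:
                speeds[port] = float(m.group(1))
            except ValueError:
                # e.g. "Not Reported" on the unused port
                speeds[port] = None
    return speeds


# Excerpt copied from the output above (indentation does not matter).
sample = """\
Interface speed:
    Port 0 (Current Port)
        Max Speed (GB/s): 12.0
        Negotiated Speed (Gb/s): 6.0
    Port 1
        Max Speed (GB/s): 12.0
        Negotiated Speed (Gb/s): Not Reported
"""

print(negotiated_speeds(sample))
```

Feeding it the captured output for each /dev/sgN device would show at a glance whether any drive is still negotiating 12Gb/s.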

pxpunx · December 27, 2023, 6:13pm

Built an 8-disk zpool w/ identical 8TB drives on banks 0-1 (slots 1-1 thru 1-8) and currently filling it with garbage files of various sizes locally via head -c 100G </dev/random > /pool0/test/test01 and over Samba.

EDIT: Whup, that was fast … errors on the same drives/positions.

I’ll continue to pound on this pool, then switch the cables to my stand-alone 9300-8i to see if the problem persists.
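
A variation on the fill test above is to hash the data while writing it and compare after reading it back. This is a rough sketch, not what was run in this thread: `write_and_verify` is a hypothetical helper, the demo writes to a temporary directory rather than a pool path, and a read-back shortly after the write may be served from cache rather than from the disks, so `zpool status` remains the real source of truth.

```python
import hashlib
import os
import tempfile


def write_and_verify(path, total_bytes, chunk_bytes=1 << 20):
    """Write random data to `path`, then re-read it and compare SHA-256 digests."""
    written = hashlib.sha256()
    remaining = total_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            chunk = os.urandom(min(chunk_bytes, remaining))
            f.write(chunk)
            written.update(chunk)
            remaining -= len(chunk)
        f.flush()
        os.fsync(f.fileno())  # push the data out of the page cache before reading

    read_back = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_bytes), b""):
            read_back.update(chunk)
    return written.hexdigest() == read_back.hexdigest()


if __name__ == "__main__":
    # Small demo in a temp directory; point `path` at a dataset to test a pool.
    with tempfile.TemporaryDirectory() as tmp:
        ok = write_and_verify(os.path.join(tmp, "test01"), 4 * 1024 * 1024)
        print("match" if ok else "MISMATCH")
```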

pxpunx · December 27, 2023, 6:30pm

bank 0-1 (1-1 thru 1-8)

initial pool spin-up - errors on 1-3 and 1-7 when data written to pool

replaced both drives (1)
resilver 1-3, write errors; faulted
resilver 1-7, write errors; faulted

replaced both drives (2)
resilver 1-3, write errors; faulted
resilver 1-7, write errors; faulted
also read errors on 1-4, which is new

checksum errors on multiple drives, but suspect this is due to all the disk replacements and resilvers

will move to 9300-8i

pxpunx · December 27, 2023, 7:14pm

Well, this is frustrating. After I moved to the 9300-8i, I’ve not observed any errors. I saw errors the last time I moved to the 9300-8i, so I’m not sure what to think.

I will keep peppering the pool with data in the hopes it’ll fault some drives.

rymandle05 · December 27, 2023, 8:45pm

I’m going on my fourth round of FIO benchmarks while also running head -c 100G </dev/random >/SEAGATE6Z2/random-test. No errors yet while running at SAS-2 6Gb/s speeds. I’m probably jinxing myself here, but better sooner than later I guess.

Every 2.0s: zpool status -v                       hl15: Wed Dec 27 14:40:33 2023

  pool: SEAGATE6Z2
 state: ONLINE
  scan: resilvered 2.75M in 00:00:01 with 0 errors on Wed Dec 27 10:45:12 2023
config:

        NAME        STATE     READ WRITE CKSUM
        SEAGATE6Z2  ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            1-9     ONLINE       0     0     0
            1-10    ONLINE       0     0     0
            1-11    ONLINE       0     0     0
            1-13    ONLINE       0     0     0
            1-14    ONLINE       0     0     0
            1-15    ONLINE       0     0     0

errors: No known data errors

@pxpunx feel free to give this a try if you’re able to. I know you have a variety of drives, but I assume they all have something similar to SeaChestUtilities, or work with SeaChest, to set the physical link speed. If 6Gb/s resolves the errors, then I think that still points to the cables being the issue.
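
During long soak tests like these, it can help to flag devices with non-zero counters automatically rather than watching the screen. The sketch below parses text in the `zpool status` layout shown above. The function name `devices_with_errors` and the degraded sample are made up for illustration (the counters in it are not from this thread), and counters are kept as strings because `zpool status` may abbreviate large values unless run with `-p`.

```python
def devices_with_errors(status_text):
    """Return {device: (read, write, cksum)} for rows with any non-zero counter."""
    flagged = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if not in_table:
            # The device table starts right after this header row.
            in_table = fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]
            continue
        if not fields:
            break  # a blank line ends the config section
        if len(fields) < 5:
            continue  # section labels such as "logs" or "spares"
        name, counters = fields[0], tuple(fields[2:5])
        if any(c != "0" for c in counters):
            flagged[name] = counters
    return flagged


# Hypothetical sample: the counter values are invented for the example.
sample = """\
  pool: pool0
 state: DEGRADED
config:

        NAME        STATE     READ WRITE CKSUM
        pool0       DEGRADED     0     0     0
          raidz2-0  DEGRADED     0     0     0
            1-1     ONLINE       0     0     0
            1-3     FAULTED      0    12     0  too many errors
            1-7     FAULTED      0     9     3  too many errors

errors: No known data errors
"""

print(devices_with_errors(sample))
```

The text could come from `subprocess.run(["zpool", "status", "-p"], capture_output=True, text=True).stdout` on a machine with ZFS installed.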

DigitalGarden · December 27, 2023, 8:55pm

From what you’ve posted so far, I’d say there is something marginal about the Mini-SAS HD cables. The 9300-8i is better able to handle whatever the flaw is than the Broadcom 3008 on the mobo. Would it be worth contacting Supermicro support? I’m not sure if Broadcom/LSI/Avago would have diagnostics to help identify a faulty controller.

pxpunx · December 27, 2023, 9:00pm

All my 8TB drives except one are HGST He8 SAS drives. Current connection speed with the 9300-8i is 12Gb/s and no errors so far after 1.5TB written to the pool.

pxpunx · December 27, 2023, 9:09pm

It’s possible. Both controllers are SAS3008 on similar firmware versions.

Onboard controller on v16.00.10.00, which used to be on 15.00.03.00 and had the same issues.

19:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02)
	DeviceName: LSI SAS 3008
	Subsystem: Super Micro Computer Inc AOC-S3008L-L8e
	Flags: bus master, fast devsel, latency 0, IRQ 40, NUMA node 0
	I/O ports at 7000 [size=256]
	Memory at c5e40000 (64-bit, non-prefetchable) [size=64K]
	Memory at c5e00000 (64-bit, non-prefetchable) [size=256K]
	Expansion ROM at c5d00000 [disabled] [size=1M]
	Capabilities: [50] Power Management version 3
	Capabilities: [68] Express Endpoint, MSI 00
	Capabilities: [a8] MSI: Enable- Count=1/1 Maskable+ 64bit+
	Capabilities: [c0] MSI-X: Enable+ Count=96 Masked-
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [1e0] Secondary PCI Express
	Capabilities: [1c0] Power Budgeting <?>
	Capabilities: [190] Dynamic Power Allocation <?>
	Capabilities: [148] Alternative Routing-ID Interpretation (ARI)
	Kernel driver in use: mpt3sas
	Kernel modules: mpt3sas

PCIe controller on v16.00.01.00:

b3:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02)
	Subsystem: Broadcom / LSI SAS9300-8i
	Flags: bus master, fast devsel, latency 0, IRQ 169, NUMA node 0
	I/O ports at f000 [size=256]
	Memory at fbe40000 (64-bit, non-prefetchable) [size=64K]
	Memory at fbe00000 (64-bit, non-prefetchable) [size=256K]
	Expansion ROM at fbd00000 [disabled] [size=1M]
	Capabilities: [50] Power Management version 3
	Capabilities: [68] Express Endpoint, MSI 00
	Capabilities: [a8] MSI: Enable- Count=1/1 Maskable+ 64bit+
	Capabilities: [c0] MSI-X: Enable+ Count=96 Masked-
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [1e0] Secondary PCI Express
	Capabilities: [1c0] Power Budgeting <?>
	Capabilities: [190] Dynamic Power Allocation <?>
	Capabilities: [148] Alternative Routing-ID Interpretation (ARI)
	Kernel driver in use: mpt3sas
	Kernel modules: mpt3sas

I could flash the onboard controller down to v16.00.01.00, but a jump from 15.00.03.00 to 16.00.10.00 w/ the same issues is suspect. The firmware difference between the two controllers doesn’t seem like a solid lead at the moment.

While the controller chip is the same, there’s a lot that isn’t. The onboard controller is effectively an AOC-S3008L-L8e built into the motherboard, while the 9300-8i is an LSI card. I assume the overall design is different between them.

I suppose it’s possible the LSI card has better onboard error correction than the Supermicro equivalent. The cables could be a factor.

I have not contacted Supermicro or 45Drives support. Considering this was purchased as a full build, I would start with 45Drives since I would hope they have a better relationship with Supermicro than I do.

rymandle05 · December 27, 2023, 9:25pm

I had reached out to 45Drives before the holidays and was put in contact with Corey. He suggested trying the new drives. I haven’t heard back since then, despite reporting several findings. I’m guessing staff is light with the holidays, but it would be nice to get more support here. I also noticed the part number on the SAS cables appears to be a 45Drives part number.

pxpunx · December 27, 2023, 9:42pm

I wouldn’t expect much until after the holidays are over. I’m sure a lot of people are on vacation.

pxpunx · December 27, 2023, 11:09pm

Well … it certainly seems to be connected to the onboard SAS controller.

Pool pool0 and associated test file systems were destroyed and rebuilt for each test.

| test | bank | controller | pool | drives | cables | read err | write err | status | notes |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0-1 | AOC-S3008L-L8e | RaidZ2 | 8x 8TB HGST SAS He8 | 45D SFF-8643 - SFF-8643 | | 1-3, 1-7 | 1-3, 1-7 FAULTED | |
| 1a | 0-1 | AOC-S3008L-L8e | RaidZ2 | 8x 8TB HGST SAS He8 | 45D SFF-8643 - SFF-8643 | | 1-3, 1-7 | 1-3, 1-7 FAULTED | Replaced 1-3, 1-4. (1) |
| 1b | 0-1 | AOC-S3008L-L8e | RaidZ2 | 8x 8TB HGST SAS He8 | 45D SFF-8643 - SFF-8643 | 1-4 | 1-3, 1-7 | 1-3, 1-7 FAULTED, 1-4 ONLINE | Replaced 1-3, 1-4. (2) Assume read errors on 1-4 are a random event. |
| 2 | 0-1 | LSI 9300-8i | RaidZ2 | 8x 8TB HGST SAS He8 | 45D SFF-8643 - SFF-8643 | | 1-1 | 1-1 ONLINE | Assume write errors on 1-1 are a random event. |
| 3 | 0-1 | AOC-S3008L-L8e | RaidZ2 | 8x 8TB HGST SAS He8 | 45D SFF-8643 - SFF-8643 | | 1-3, 1-7 | 1-3, 1-7 FAULTED | |

Table looks terrible … but basically, OK on the 9300-8i and faults on the onboard SAS3008. Same 45D cables.

I still have additional cables due tomorrow for more tests, including some SFF-8643 to SFF-8482 cables to bypass the backplane, though I don’t suspect it at this point.

I will flash the SAS3008 to 16.00.01.00 because … why not?
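
The test matrix in this post can also be tallied per controller, which makes the pattern easier to see as more runs are added. This is just a sketch: `RUNS` is transcribed by hand from the table (test label, controller, slots reported FAULTED in the status column), and `summarize` is a made-up helper name.

```python
from collections import Counter, defaultdict

# (test, controller, slots reported FAULTED), transcribed from the table above.
RUNS = [
    ("1", "AOC-S3008L-L8e", ["1-3", "1-7"]),
    ("1a", "AOC-S3008L-L8e", ["1-3", "1-7"]),
    ("1b", "AOC-S3008L-L8e", ["1-3", "1-7"]),
    ("2", "LSI 9300-8i", []),
    ("3", "AOC-S3008L-L8e", ["1-3", "1-7"]),
]


def summarize(runs):
    """Count runs, runs with faulted drives, and faulted slots per controller."""
    summary = defaultdict(
        lambda: {"runs": 0, "runs_with_faults": 0, "faulted_slots": Counter()}
    )
    for _test, controller, faulted in runs:
        entry = summary[controller]
        entry["runs"] += 1
        if faulted:
            entry["runs_with_faults"] += 1
        entry["faulted_slots"].update(faulted)
    return dict(summary)


for controller, entry in summarize(RUNS).items():
    slots = dict(entry["faulted_slots"])
    print(f"{controller}: {entry['runs_with_faults']}/{entry['runs']} runs faulted, slots {slots}")
```

With only one run on the 9300-8i so far, the tally is suggestive rather than conclusive; appending further runs to `RUNS` keeps the comparison honest.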

DigitalGarden · December 27, 2023, 11:21pm

That was part of my point.

It doesn’t seem like 45Drives has seen this issue before. And, it looks like they use an X11SPL board without onboard HBA on all their other current storage pods, so this may be the first product/experience they’ve had with this board (?).

The only similar thing I’ve been able to turn up (and a reason I suggested reaching out to the manufacturer, directly or indirectly) is:

https://www.truenas.com/community/threads/lsi-sas3008-hba-issues-with-one-drive-port-scbusx-target-2.101168/

Unfortunately, no resolution is provided.

pxpunx · December 27, 2023, 11:26pm

I do plan to reach out to support; I just want to run through as many scenarios and collect as much data as possible first.

pxpunx · December 28, 2023, 12:20am

Latest test with the onboard SAS3008 on 16.00.01.00 saw the same write errors. I also managed to get two other drives to enter a degraded state. The normal culprits are faulted.

I sent an email to 45Drives and opened a case w/ Supermicro.

DigitalGarden · December 28, 2023, 12:30am

Just curious: is there a revision number and/or manufacture date on the motherboard?