That line in dmesg comes up before the drive has issues, which means the HBA driver is having a problem first. I’d be curious if you could update the controller firmware.
Hey @Hutch-45Drives, that was one of the early steps I took. I’m running the latest I could find from Supermicro: SAS controller firmware 16.00.10.00. If you know of and have access to a newer version, I’d be happy to give that a try.
Avago Technologies SAS3 Flash Utility
Version 16.00.00.00 (2017.05.02)
Copyright 2008-2017 Avago Technologies. All rights reserved.
Adapter Selected is a Avago SAS: SAS3008(C0)
Controller Number : 0
Controller : SAS3008(C0)
PCI Address : 00:19:00:00
SAS Address : 5003048-#-#x##-x##x
NVDATA Version (Default) : 0e.01.30.28
NVDATA Version (Persistent) : 0e.01.30.28
Firmware Product ID : 0x2221 (IT)
Firmware Version : 16.00.10.00
NVDATA Vendor : LSI
NVDATA Product ID : LSI3008-IT
BIOS Version : 08.37.00.00
UEFI BSD Version : 18.00.00.00
FCODE Version : N/A
Board Name : LSI3008-IT
Board Assembly : N/A
Board Tracer Number : N/A
Finished Processing Commands Successfully.
Exiting SAS3Flash.
Oh, and I also tried an “unreleased” version (16.00.12.00) that was made available on the TrueNAS forums for a bug specific to SATA drives. Still no luck, so I reverted back to 16.00.10.00.
Same, I’m on the latest release from Supermicro. It came with 15.x and I flashed to 16.0.10.0 when I switched to IT mode.
Swapped to bank 0 and 1 on the backplane, same issues. Resilver caused read errors on 1-3 and write/checksum errors 1-7, which correspond to 1-11 and 1-15 (3rd slot) on banks 2 and 3.
I’ll run some more tests.
Since I think @pxpunx is trying the cable move to enable SAS on slots 1-1 to 1-9, I’m running a test with the drives at SAS-2 6Gbps speeds, with the cables still routed away from power and the backplane. I used SeaChestUtilities to set the link speed, as I couldn’t find an option in the BIOS.
sudo ./SeaChest_Configure_x86_64-alpine-linux-musl_static -d all --phySpeed 3 --onlySeagate
Now all the drives are running at a negotiated speed of 6Gbps, as verified using the -i flag.
hl15:~$ sudo ./SeaChestUtilities/Linux/Non-RAID/x86_64/SeaChest_Configure_x86_64-alpine-linux-musl_static -d /dev/sg2 --onlySeagate -i
==========================================================================================
SeaChest_Configure - Seagate drive utilities - NVMe Enabled
Copyright (c) 2014-2023 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
SeaChest_Configure Version: 2.3.1-4_1_1 X86_64
Build Date: Mar 27 2023
Today: Wed Dec 27 11:50:13 2023 User: root
==========================================================================================
/dev/sg2 - ST8000NM0075 - ZA10M24Q - E004 - SCSI
Vendor ID: SEAGATE
Model Number: ST8000NM0075
Serial Number: ZA10M24Q
PCBA Serial Number: 0000R622H78Z
Firmware Revision: E004
World Wide Name: 5000C50084D54EE7
Date Of Manufacture: Week 51, 12598
Copyright: Copyright (c) 2017 Seagate All rights reserved
Drive Capacity (TB/TiB): 8.00/7.28
Temperature Data:
Current Temperature (C): 24
Highest Temperature (C): Not Reported
Lowest Temperature (C): Not Reported
Power On Time: 8 days 21 hours 3 minutes
Power On Hours: 213.05
MaxLBA: 15628053167
Native MaxLBA: Not Reported
Logical Sector Size (B): 512
Physical Sector Size (B): 4096
Sector Alignment: 0
Rotation Rate (RPM): 7200
Form Factor: 3.5"
Last DST information:
DST has never been run
Long Drive Self Test Time: 13 hours 7 minutes
Interface speed:
Port 0 (Current Port)
Max Speed (GB/s): 12.0
Negotiated Speed (Gb/s): 6.0
Port 1
Max Speed (GB/s): 12.0
Negotiated Speed (Gb/s): Not Reported
Annualized Workload Rate (TB/yr): 340.00
Total Bytes Read (TB): 4.63
Total Bytes Written (TB): 3.64
Encryption Support: Not Supported
Cache Size (MiB): Not Reported
Read Look-Ahead: Enabled
Non-Volatile Cache: Enabled
Write Cache: Enabled
SMART Status: Good
ATA Security Information: Not Supported
Firmware Download Support: Full, Segmented, Deferred
Number of Logical Units: 1
Specifications Supported:
SPC-4
SAM-5
SAS-3
SPL-3
SPC-4
SBC-3
Features Supported:
Protection Type 1
Protection Type 2
Persistent Reservations
Application Client Logging
Self Test
Automatic Write Reassignment [Enabled]
Automatic Read Reassignment [Enabled]
EPC [Enabled]
Informational Exceptions [Mode 0]
Translate Address
Rebuild Assist
Seagate In Drive Diagnostics (IDD)
Format Unit
Sanitize
Adapter Information:
Adapter Type: PCI
Vendor ID: 1000h
Product ID: 0097h
Revision: 0002h
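For anyone checking more than one drive, here is a rough sketch for pulling the negotiated link speed out of that -i output. The function name negotiated_speed is made up for illustration, and it assumes the "Negotiated Speed (Gb/s)" label is printed exactly as in the output above, with the current port listed first.

```shell
# Hypothetical helper: reads SeaChest -i output on stdin and prints the
# first "Negotiated Speed (Gb/s)" value, i.e. the current port's speed
# (assumes the current port is listed first, as in the output above).
negotiated_speed() {
  awk -F': *' 'index($0, "Negotiated Speed (Gb/s)") { print $2; exit }'
}
```

It would be fed from the same command used above, e.g. sudo ./SeaChest_Configure_x86_64-alpine-linux-musl_static -d /dev/sg2 --onlySeagate -i | negotiated_speed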
Built an 8-disk zpool w/ identical 8TB drives on banks 0-1 (slots 1-1 thru 1-8) and am currently filling it with garbage files of various sizes, both locally via head -c 100G </dev/random > /pool0/test/test01 and over Samba.
EDIT: Whup, that was fast … errors on the same drives/positions.
I’ll continue to pound on this pool, then switch the cables to my stand-alone 9300-8i to see if the problem persists.
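The local fill step could be scripted along these lines. This is only a sketch: fill_with_random is a made-up name, the test01, test02, ... file naming just mirrors the path above, and it reads from /dev/urandom rather than /dev/random so it should not stall on older kernels.

```shell
# Hypothetical helper: write one random file per size argument into a directory.
# Usage: fill_with_random DIR SIZE [SIZE...]
# Size suffixes such as 100G rely on GNU head; plain byte counts work anywhere.
fill_with_random() {
  dir=$1
  shift
  i=1
  for size in "$@"; do
    # Files are named test01, test02, ... in the order the sizes are given.
    head -c "$size" </dev/urandom >"$dir/$(printf 'test%02d' "$i")" || return 1
    i=$((i + 1))
  done
}
```

Something like fill_with_random /pool0/test 1G 10G 100G would then write three files of different sizes into the test dataset.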
bank 0-1 (1-1 thru 1-8)
initial pool spin-up - errors on 1-3 and 1-7 when data written to pool
replaced both drives (1)
resilver 1-3, write errors; faulted
resilver 1-7, write errors; faulted
replaced both drives (2)
resilver 1-3, write errors; faulted
resilver 1-7, write errors; faulted
also read errors on 1-4, which is new
checksum errors on multiple drives, but suspect this is due to all the disk replacements and resilvers
will move to 9300-8i
Well, this is frustrating. After I moved to the 9300-8i, I’ve not observed any errors. I saw errors the last time I moved to the 9300-8i, so I’m not sure what to think.
I will keep peppering the pool with data in the hopes it’ll fault some drives.
I’m going on my fourth round of FIO benchmarks while also running head -c 100G </dev/random > /SEAGATE6Z2/random-test. No errors yet while running at SAS-2 6Gbps speeds. I’m probably jinxing myself here, but better sooner than later I guess.
Every 2.0s: zpool status -v hl15: Wed Dec 27 14:40:33 2023
pool: SEAGATE6Z2
state: ONLINE
scan: resilvered 2.75M in 00:00:01 with 0 errors on Wed Dec 27 10:45:12 2023
config:
NAME STATE READ WRITE CKSUM
SEAGATE6Z2 ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
1-9 ONLINE 0 0 0
1-10 ONLINE 0 0 0
1-11 ONLINE 0 0 0
1-13 ONLINE 0 0 0
1-14 ONLINE 0 0 0
1-15 ONLINE 0 0 0
errors: No known data errors
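Rather than eyeballing the watch output, a small filter can list only the devices with non-zero counters. This is a sketch, not anything official: zpool_error_devices is a made-up name, and it assumes the config section has the NAME STATE READ WRITE CKSUM layout shown above.

```shell
# Hypothetical helper: reads `zpool status` text on stdin and prints
# "name read write cksum" for every config row with a non-zero counter.
zpool_error_devices() {
  awk '
    $1 == "NAME" && $2 == "STATE" { in_config = 1; next }  # start of config table
    /^errors:/                    { in_config = 0 }        # end of config table
    in_config && NF >= 5 && ($3 != "0" || $4 != "0" || $5 != "0") {
      print $1, $3, $4, $5
    }
  '
}
```

Run it as zpool status -v | zpool_error_devices; no output means every counter is still zero.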
@pxpunx feel free to give this a try if you’re able to. I know you had a myriad of drives, but I assume they all have something similar to SeaChestUtilities or work with SeaChest to set the physical link speed. If 6Gbps resolves the errors, then I think that still points to cables being the issue.
From what you’ve posted so far, I’d say there is something marginal about the Mini-SAS HD cables, and the 9300-8i is better able to handle whatever the flaw is than the Broadcom 3008 on the mobo. Would it be worth contacting Supermicro support? I’m not sure if Broadcom/LSI/Avago would have diagnostics to help identify a faulty controller.
All my 8TB drives except one are HGST He8 SAS drives. Current connection speed with the 9300-8i is 12Gb/s and no errors so far after 1.5TB written to the pool.
It’s possible. Both controllers are SAS3008 on similar firmware versions.
Onboard controller on v16.00.10.00, which used to be on 15.00.03.00 and had the same issues.
19:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02)
DeviceName: LSI SAS 3008
Subsystem: Super Micro Computer Inc AOC-S3008L-L8e
Flags: bus master, fast devsel, latency 0, IRQ 40, NUMA node 0
I/O ports at 7000 [size=256]
Memory at c5e40000 (64-bit, non-prefetchable) [size=64K]
Memory at c5e00000 (64-bit, non-prefetchable) [size=256K]
Expansion ROM at c5d00000 [disabled] [size=1M]
Capabilities: [50] Power Management version 3
Capabilities: [68] Express Endpoint, MSI 00
Capabilities: [a8] MSI: Enable- Count=1/1 Maskable+ 64bit+
Capabilities: [c0] MSI-X: Enable+ Count=96 Masked-
Capabilities: [100] Advanced Error Reporting
Capabilities: [1e0] Secondary PCI Express
Capabilities: [1c0] Power Budgeting <?>
Capabilities: [190] Dynamic Power Allocation <?>
Capabilities: [148] Alternative Routing-ID Interpretation (ARI)
Kernel driver in use: mpt3sas
Kernel modules: mpt3sas
PCIe controller on v16.00.01.00:
b3:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02)
Subsystem: Broadcom / LSI SAS9300-8i
Flags: bus master, fast devsel, latency 0, IRQ 169, NUMA node 0
I/O ports at f000 [size=256]
Memory at fbe40000 (64-bit, non-prefetchable) [size=64K]
Memory at fbe00000 (64-bit, non-prefetchable) [size=256K]
Expansion ROM at fbd00000 [disabled] [size=1M]
Capabilities: [50] Power Management version 3
Capabilities: [68] Express Endpoint, MSI 00
Capabilities: [a8] MSI: Enable- Count=1/1 Maskable+ 64bit+
Capabilities: [c0] MSI-X: Enable+ Count=96 Masked-
Capabilities: [100] Advanced Error Reporting
Capabilities: [1e0] Secondary PCI Express
Capabilities: [1c0] Power Budgeting <?>
Capabilities: [190] Dynamic Power Allocation <?>
Capabilities: [148] Alternative Routing-ID Interpretation (ARI)
Kernel driver in use: mpt3sas
Kernel modules: mpt3sas
I could flash the onboard controller down to v16.00.01.00, but a jump from 15.00.03.00 to 16.00.10.00 w/ the same issues is suspect. The firmware difference between the two controllers doesn’t seem like a solid lead at the moment.
While the controller chip is the same, there’s a lot that isn’t. The onboard controller is effectively an AOC-S3008L-L8e built into the motherboard, while the 9300-8i is an LSI card. I assume the overall design is different between them.
I suppose it’s possible the LSI card has better onboard error correction than the Supermicro equivalent. The cables could be a factor.
I have not contacted Supermicro or 45Drives support. Considering this was purchased as a full build, I would start with 45Drives since I would hope they have a better relationship with Supermicro than I do.
I had reached out to 45Drives before the holidays and was put in contact with Corey. He suggested trying the new drives. I haven’t heard back since then, despite reporting back several findings. I’m guessing staff is light with the holidays, but it would be nice to get more support here. I also noticed the part number on the SAS cables appears to be a 45Drives part number.
I wouldn’t expect much until after the holidays are over. I’m sure a lot of people are on vacation.
Well … it certainly seems to be connected to the onboard SAS controller.
Pool pool0 and associated test file systems were destroyed and rebuilt for each test.
test | bank | controller | pool | drives | cables | read err | write err | checksum | status | notes |
---|---|---|---|---|---|---|---|---|---|---|
1 | 0-1 | AOC-S3008L-L8e | RaidZ2 | 8x 8TB HGST SAS He8 | 45D SFF-8643 - SFF-8643 | | 1-3, 1-7 | | 1-3, 1-7 FAULTED | |
1a | 0-1 | AOC-S3008L-L8e | RaidZ2 | 8x 8TB HGST SAS He8 | 45D SFF-8643 - SFF-8643 | | 1-3, 1-7 | | 1-3, 1-7 FAULTED | Replaced 1-3, 1-4. (1) |
1b | 0-1 | AOC-S3008L-L8e | RaidZ2 | 8x 8TB HGST SAS He8 | 45D SFF-8643 - SFF-8643 | 1-4 | 1-3, 1-7 | | 1-3, 1-7 FAULTED, 1-4 ONLINE | Replaced 1-3, 1-4. (2) Assume read errors on 1-4 are a random event. |
2 | 0-1 | LSI 9300-8i | RaidZ2 | 8x 8TB HGST SAS He8 | 45D SFF-8643 - SFF-8643 | | 1-1 | | 1-1 ONLINE | Assume write errors on 1-1 are a random event. |
3 | 0-1 | AOC-S3008L-L8e | RaidZ2 | 8x 8TB HGST SAS He8 | 45D SFF-8643 - SFF-8643 | | 1-3, 1-7 | | 1-3, 1-7 FAULTED | |
Table looks terrible … but basically, OK on 9300-8i and faults on SAS3008. Same 45D cables.
I still have additional cables due tomorrow for more tests, including some SFF-8643 to SFF-8482 cables to bypass the backplane, though I don’t suspect it at this point.
I will flash the SAS3008 to 16.00.01.00 because … why not?
That was part of my point.
It doesn’t seem like 45Drives has seen this issue before. And, it looks like they use an X11SPL board without onboard HBA on all their other current storage pods, so this may be the first product/experience they’ve had with this board (?).
The only similar report I’ve been able to turn up (and a reason I suggested reaching out to the manufacturer, directly or indirectly) is:
Unfortunately, no resolution is provided.
I do plan to reach out to support; I just want to run through as many scenarios and collect as much data as possible first.
Latest test with the onboard SAS3008 on 16.00.01.00 saw the same write errors. I also managed to get two other drives to enter a degraded state. The normal culprits are faulted.
I sent an email to 45Drives and opened a case w/ Supermicro.
Just curious; is there a revision number and/or manufacture date on the motherboard?