RESOLVED: ZFS Write Errors with HL15 Full Build and SAS Drives

Hello Everyone!

I just received my HL15 full build last week and I’ve been doing some benchmarking and testing to figure out the best setup for me. In doing so, I found that my raidz2 pool of seven (used) HP-branded Seagate Exos 8TB SAS hard drives becomes degraded due to ZFS write errors. I’m using FIO via the Benchmarks tool in Cockpit/Houston to reproduce the problem.

Here’s what I’ve tried so far to narrow down the problem:

  1. Ran long SMART tests; they came back good on all of the drives (example commands below this list).
  2. Re-seated the SAS cable connector on the motherboard
  3. Checked for newer firmware for the drives (none available)
  4. Disabled writeback cache on the SAS drives
  5. Moved hard drives to different slots. Errors usually happen on slot 1-11, but also on 1-13 and 1-9.
  6. Updated to BIOS 4.0
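
For reference, the SMART long test (step 1) and the write-cache change (step 4) were done with commands along these lines; the device name is just an example:

sudo smartctl -t long /dev/sdc          # start a long (extended) self-test
sudo smartctl -l selftest /dev/sdc      # check the result once it finishes
sudo sdparm --get=WCE /dev/sdc          # show the writeback cache (WCE) bit
sudo sdparm --clear=WCE --save /dev/sdc # disable writeback cache persistently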

Unfortunately, none of that has seemed to help. I think my next steps will be to re-seat the power cables on the PSU, re-seat the cables on the backplane, and/or update the SAS controller firmware to 16.00.10.00.

This is my first time engaging with the community here, so any suggestions are welcome! Thanks in advance for any and all help others are willing to provide.

UPDATE 1: The issue appears to be related to the SAS link speed running at SAS-3 12Gbps. The workaround is to reduce the link speed by setting the controller (or drives) to a maximum of SAS-2 6Gbps. A guide on doing this with lsiutil is linked here.

UPDATE 2: After replacing the SAS cables, the pool now runs at SAS-3 12Gbps without errors using the onboard SAS3008 controller.
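
To verify what rate each link actually negotiated (before and after the cable swap), something like this works; the sysfs glob and device name are examples:

grep . /sys/class/sas_phy/phy-*/negotiated_linkrate
sudo smartctl -l sasphy /dev/sdc | grep 'negotiated logical link rate'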

Benchmark Settings:
Tool: FIO
Benchmark Type: Performance Spectrum
File Size: 10GB
IO Depth: 16
Runtime: 120 seconds
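
For anyone wanting to reproduce this outside of Cockpit, a roughly equivalent standalone fio run is below; everything other than the file size, IO depth, and runtime is an assumption on my part (the Performance Spectrum benchmark sweeps several patterns), and the target path assumes the pool is mounted at /TESTZ2:

sudo fio --name=hl15-test --directory=/TESTZ2 --size=10G \
    --rw=randwrite --bs=128k --ioengine=libaio --iodepth=16 \
    --runtime=120 --time_based --group_reporting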

sudo zpool status
  pool: TESTZ2
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
config:

        NAME        STATE     READ WRITE CKSUM
        TESTZ2      DEGRADED     0     0     0
          raidz2-0  DEGRADED     0     0     0
            1-9     ONLINE       0     0     0
            1-10    ONLINE       0     0     0
            1-11    FAULTED      0    21     0  too many errors
            1-12    ONLINE       0     0     0
            1-13    FAULTED      0    17     0  too many errors
            1-14    ONLINE       0     0     0
            1-15    ONLINE       0     0     0

errors: No known data errors

Syslog I/O Errors
blk_update_request: I/O error, dev sdc, sector 427221816 op 0x1:(WRITE) flags 0x700 phys_seg 21 prio class 0
blk_update_request: I/O error, dev sdc, sector 63964080 op 0x1:(WRITE) flags 0x700 phys_seg 100 prio class 0
blk_update_request: I/O error, dev sde, sector 495373840 op 0x1:(WRITE) flags 0x700 phys_seg 74 prio class 0
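
These show up in the kernel log, so they can be watched live while a benchmark is running with something like:

sudo dmesg -Tw | grep -Ei 'i/o error|blk_update_request'
# or, via journald:
sudo journalctl -kf | grep -Ei 'i/o error|blk_update_request'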

sudo smartctl -x /dev/sdc
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-4.18.0-513.9.1.el8_9.x86_64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: HP
Product: MB8000JFECQ
Revision: HPD7
Compliance: SPC-4
User Capacity: 8,001,563,222,016 bytes [8.00 TB]
Logical block size: 512 bytes
Physical block size: 4096 bytes
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x5000c50085bd9beb
Serial number: ZA13MJHZ
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Sat Dec 9 14:39:21 2023 AST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled
Read Cache is: Enabled
Writeback Cache is: Disabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature: 26 C
Drive Trip Temperature: 60 C

Manufactured in week 31 of year 2016
Specified cycle count over device lifetime: 50000
Accumulated start-stop cycles: 94
Specified load-unload count over device lifetime: 600000
Accumulated load-unload cycles: 1879
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/      errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0      19758.685           0
write:         0        0         0         0          0      40455.634           0

Non-medium error count: 178

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                   -   41675                 - [-   -    -]

Long (extended) Self-test duration: 45300 seconds [755.0 minutes]

Background scan results log
Status: scan is active
Accumulated power on time, hours:minutes 41750:50 [2505050 minutes]
Number of background scans performed: 581, scan progress: 98.55%
Number of background medium scans performed: 581

Protocol Specific port log page for SAS SSP
relative target port id = 1
generation code = 14
number of phys = 1
phy identifier = 0
attached device type: SAS or SATA device
attached reason: unknown
reason: unknown
negotiated logical link rate: phy enabled; 12 Gbps
attached initiator port: ssp=1 stp=1 smp=1
attached target port: ssp=0 stp=0 smp=0
SAS address = 0x5000c50085bd9be9
attached SAS address = 0x500304801d01d40e
attached phy identifier = 2
Invalid DWORD count = 75
Running disparity error count = 74
Loss of DWORD synchronization = 166
Phy reset problem = 0
Phy event descriptors:
Invalid word count: 75
Running disparity error count: 74
Loss of dword synchronization count: 166
Phy reset problem count: 0
relative target port id = 2
generation code = 14
number of phys = 1
phy identifier = 1
attached device type: no device attached
attached reason: unknown
reason: unknown
negotiated logical link rate: phy enabled; unknown
attached initiator port: ssp=0 stp=0 smp=0
attached target port: ssp=0 stp=0 smp=0
SAS address = 0x5000c50085bd9bea
attached SAS address = 0x0
attached phy identifier = 0
Invalid DWORD count = 0
Running disparity error count = 0
Loss of DWORD synchronization = 0
Phy reset problem = 0
Phy event descriptors:
Invalid word count: 0
Running disparity error count: 0
Loss of dword synchronization count: 0
Phy reset problem count: 0
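
The Invalid DWORD, running disparity, and loss-of-sync counters above are link-level errors (cabling/backplane/HBA) rather than media errors, so a quick loop like this (the device glob is just an example) makes it easy to compare them across all seven drives:

for d in /dev/sd[a-g]; do
    echo "== $d =="
    sudo smartctl -l sasphy "$d" | grep -E 'Invalid|disparity|synchronization'
done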

After updating the SAS controller firmware from 15 to 16, I was having better luck with only a single error here and there, until just now. The pool went DEGRADED immediately following a scrub. Slot 1-11 is still the primary offender, with write and checksum errors.

Every 2.0s: sudo zpool status                               hl15: Sun Dec 10 13:23:19 2023

  pool: TESTZ2
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 120K in 00:09:56 with 0 errors on Sun Dec 10 13:22:42 2023
config:

        NAME                                         STATE     READ WRITE CKSUM
        TESTZ2                                       DEGRADED     0     0     0
          raidz2-0                                   DEGRADED     0     0     0
            1-9                                      ONLINE       0     0     0
            1-10                                     ONLINE       0     0     0
            1-11                                     FAULTED      0    49    33  too many errors
            1-12                                     ONLINE       0     0     0
            1-13                                     ONLINE       0     0     0
            1-14                                     ONLINE       0     0     0
            1-15                                     ONLINE       0     0     0
        cache
          nvme-eui.e8238fa6bf530001001b448b4a22e3f1  ONLINE       0     0     0

errors: No known data errors
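
To keep retesting without replacing anything, the fault can be cleared and a scrub kicked off, something like:

sudo zpool clear TESTZ2 1-11    # reset the error counters on that vdev
sudo zpool scrub TESTZ2         # verify everything scrubs cleanly afterwards
sudo zpool status -v TESTZ2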

Unfortunately, a side effect of ZFS’s data integrity checking is that it will find issues with drives that were previously deemed “healthy”. SMART tests and the like can only find so much; they are more of an overall drive health check, and a drive will only report an issue it knows about.

Are these HDDs new or refurbished? Many people like to use something like BadBlocks to run an integrity test on brand-new drives to weed out the early failures found in the bathtub curve.
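
A typical non-destructive read-write pass would be something along these lines (the device name is just an example; -b 4096 is needed because badblocks’ default 1 KiB block size overflows its 32-bit block counter on an 8TB drive):

sudo badblocks -b 4096 -nsv /dev/sdc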

The drives I bought on Amazon were listed as “Condition: Used - Like New”. I can return them through Jan 31, so that’s an option if it’s bad drive(s). I ran across another post that mentioned BadBlocks; I’ll check that out too.

What gives me pause is how the errors tend to favor a slot instead of following the drive. I did another test where I swapped the positions of two of the drives in the pool, and the errors still occurred on 1-11 instead of showing up on 1-12. I fully acknowledge that multiple or even all of the drives might be bad, but it might also be a cable or some other defect. Hopefully BadBlocks can help vet this out.

Right now I’m running a 6-disk pool without the drive in 1-11 to see what happens with the FIO test.


Hi @rymandle05, I would recommend checking all the drives’ SMART data and making sure there are no uncorrectable errors or bad sectors on the drives.

If there are no issues on the drives, could you try to space them out in the server so they are not next to each other? What is the amp rating on these drives for the 12V and 5V connections?


Hey @Hutch-45Drives! I went through the SMART data again for all the drives. Every drive reports 0 “Total uncorrected errors”. I don’t see any output specifically calling out bad sectors, but I do see all of them report a “Non-medium error count” in the hundreds. I’m using the -x flag with smartctl to get output like the example above. I’ll do some checking to see if there are any special flags or parameters needed to get more info out of SMART from these drives. I had to do that with the Sun F80s that Craft Computing featured to get accurate SMART results.
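
For what it’s worth, the key counters can be pulled from every drive at once with a quick loop like this (the device glob is just an example):

for d in /dev/sd[a-g]; do
    echo "== $d =="
    sudo smartctl -x "$d" | grep -E 'grown defect|Non-medium|^read:|^write:'
done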

As for amp ratings, the labels on the drives say “+5V / 0.85 A” and “+12V / 0.99 A”. The HPE model is MB8000JFECQ.

I’ll also try spacing out the drives across the 7 SAS-enabled slots after BadBlocks finishes on /dev/sdc. I’ll need to drop down to a 4-disk pool and probably RAIDZ1 for that test.

On the third run of the FIO benchmark, I see 1-13 FAULTED with write errors in the test without 1-11.

  pool: TEST6Z2
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
	Sufficient replicas exist for the pool to continue functioning in a
	degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
	repaired.
  scan: scrub repaired 0B in 00:05:33 with 0 errors on Sun Dec 10 22:29:20 2023
config:

	NAME                                         STATE     READ WRITE CKSUM
	TEST6Z2                                      DEGRADED     0     0     0
	  raidz2-0                                   DEGRADED     0     0     0
	    1-9                                      ONLINE       0     0     0
	    1-10                                     ONLINE       0     0     0
	    1-12                                     ONLINE       0     0     0
	    1-13                                     FAULTED      0    33     0  too many errors
	    1-14                                     ONLINE       0     0     0
	    1-15                                     ONLINE       0     0     0
	cache
	  nvme-eui.e8238fa6bf530001001b448b4a22e3f1  ONLINE       0     0     0

errors: No known data errors

If you continue to have issues and think it’s hardware-related, please reach out to info@45homelab.com so a support member can follow up and assist you.


I might take you up on that. I was able to complete the spaced-out test with four drives in the odd-numbered slots. One drive ended up FAULTED and another DEGRADED on the first benchmark. Out of curiosity, I set up a RAIDZ1 pool with three drives in the even-numbered slots. I’m currently on the second run of the benchmark test in that setup with no errors. I need to put it through more paces though.
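
For reference, the test pools are just plain RAIDZ pools created against the slot names, something like this (I’m assuming the 1-x names are vdev aliases defined on the HL15 image, e.g. in /etc/zfs/vdev_id.conf):

sudo zpool create TEST3Z1 raidz1 1-10 1-12 1-14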

Every 2.0s: sudo zpool status                         hl15: Mon Dec 11 13:07:47 2023

  pool: TEST4Z1
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
config:

        NAME        STATE     READ WRITE CKSUM
        TEST4Z1     DEGRADED     0     0     0
          raidz1-0  DEGRADED     0     0     0
            1-9     ONLINE       0     0     0
            1-11    FAULTED      0    69     0  too many errors
            1-13    ONLINE       0     0     0
            1-15    ONLINE       0     0     0

errors: No known data errors

Every 2.0s: sudo zpool status                         hl15: Mon Dec 11 13:33:34 2023

  pool: TEST4Z1
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
config:

        NAME        STATE     READ WRITE CKSUM
        TEST4Z1     DEGRADED     0     0     0
          raidz1-0  DEGRADED     0    24     0
            1-9     ONLINE       0     0     0
            1-11    FAULTED      0    69     0  too many errors
            1-13    DEGRADED     0    25     0  too many errors
            1-15    ONLINE       0     0     0

errors: No known data errors

Maybe there is something to the odd vs. even drive slots. I have completed three benchmark tests without errors with a zpool using only even slots and the SAS drives. I even made sure to replace one of the drives with a drive I know received errors in an earlier test. I’m currently running the fourth benchmark to see if errors occur.

Every 2.0s: sudo zpool status -v                      hl15: Mon Dec 11 20:26:20 2023

  pool: TEST3Z1
 state: ONLINE
  scan: resilvered 141G in 00:16:13 with 0 errors on Mon Dec 11 17:29:40 2023
config:

       	NAME        STATE     READ WRITE CKSUM
        TEST3Z1     ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            1-10    ONLINE       0     0     0
            1-12    ONLINE       0     0     0
            1-14    ONLINE       0     0     0

errors: No known data errors

That is very odd. There should not be anything different between the even and the odd slots. The way the backplane works, every 4 drives are paired together onto 1 HBA cable and 1 Molex power connector, so even vs. odd doesn’t change anything.
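
One way to double-check which HBA phy (and therefore which cable and backplane connector) each slot actually lands on is to pull the attached-phy info from each drive’s SAS port log, for example:

for d in /dev/sd[a-g]; do
    echo "== $d =="
    sudo smartctl -l sasphy "$d" | grep -E 'attached (SAS address|phy identifier)'
done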

I thought that would be the case, but it’s hard to argue with results. I am running the benchmark test for the 6th time this morning and the pool has yet to run into any issues. I ran a scrub a little bit ago just to see, and it also came back with 0B repaired. I’m going to keep running tests throughout the morning though.

Every 2.0s: sudo zpool status -v                      hl15: Tue Dec 12 07:45:07 2023

  pool: TEST3Z1
 state: ONLINE
  scan: scrub repaired 0B in 00:18:55 with 0 errors on Tue Dec 12 07:26:09 2023
config:

       	NAME        STATE     READ WRITE CKSUM
        TEST3Z1     ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            1-10    ONLINE       0     0     0
            1-12    ONLINE       0     0     0
            1-14    ONLINE       0     0     0

errors: No known data errors

BadBlocks on the 1-11 drive continues to run the non-destructive read-write test. It’s still only at 31.26% done almost 24 hours later, but no errors found yet. :exploding_head:

Documenting that the even-slot, 3-drive RAIDZ1 pool is still error-free after 10 FIO benchmark tests over the past 24 hours.

Every 2.0s: sudo zpool status -v                                     hl15: Tue Dec 12 17:38:58 2023

  pool: TEST3Z1
 state: ONLINE
  scan: scrub repaired 0B in 00:18:55 with 0 errors on Tue Dec 12 07:26:09 2023
config:

        NAME                                         STATE     READ WRITE CKSUM
        TEST3Z1                                      ONLINE       0     0     0
          raidz1-0                                   ONLINE       0     0     0
            1-10                                     ONLINE       0     0     0
            1-12                                     ONLINE       0     0     0
            1-14                                     ONLINE       0     0     0
        cache
          nvme-eui.e8238fa6bf530001001b448b4a22e3f1  ONLINE       0     0     0

errors: No known data errors

This is very odd. If you move the drives to the odd slots, do they start getting issues?

You do not need to delete the pool; simply move the drives to the other slots.

If you get the errors, could you post your “dmesg -T” output so we can see if there are hardware issues?

Sure - I can give that a try when I get home from work this evening!

Yesterday, I took a drive that had been experiencing errors in an odd slot, moved it to an even slot, and swapped it in for the drive previously in that slot in the zpool. That drive hasn’t experienced any write or checksum errors either as part of the even-slot pool.

Actually, one question - should I move all three drives from even slots to odd slots or move only 1 or 2 of the drives to odd slots?

Can you move all 3 drives? We already proved the drives are not the issue with the 10 tests you did. So if you move them and we start getting issues, we know there is something with the server.


Alright, I moved the drives in the Z1 pool from the even slots to the odd slots. I just started the first benchmark, so let’s see what happens.

Every 2.0s: sudo zpool status -v                                     hl15: Wed Dec 13 19:34:59 2023

  pool: TEST3Z1
 state: ONLINE
  scan: scrub repaired 0B in 00:11:00 with 0 errors on Wed Dec 13 19:34:55 2023
config:

        NAME                                         STATE     READ WRITE CKSUM
        TEST3Z1                                      ONLINE       0     0     0
          raidz1-0                                   ONLINE       0     0     0
            1-9                                      ONLINE       0     0     0
            1-13                                     ONLINE       0     0     0
            1-15                                     ONLINE       0     0     0
        cache
          nvme-eui.e8238fa6bf530001001b448b4a22e3f1  ONLINE       0     0     0

errors: No known data errors

Here’s my morning update! I was able to run the benchmark 3 times yesterday evening, but the pool (now in the odd slots) is still healthy. I did one other run after doing the SAS firmware update where I was only able to get about this far before seeing the errors. I have another in-office day at work, so I won’t be able to put it through the paces like I was able to earlier in the week, but I’ll keep plugging away at it.

Every 2.0s: sudo zpool status -v                          hl15: Thu Dec 14 07:21:02 2023

  pool: TEST3Z1
 state: ONLINE
  scan: scrub repaired 0B in 00:11:00 with 0 errors on Wed Dec 13 19:34:55 2023
config:

        NAME                                         STATE     READ WRITE CKSUM
        TEST3Z1                                      ONLINE       0     0     0
          raidz1-0                                   ONLINE       0     0     0
            1-9                                      ONLINE       0     0     0
            1-13                                     ONLINE       0     0     0
            1-15                                     ONLINE       0     0     0
        cache
          nvme-eui.e8238fa6bf530001001b448b4a22e3f1  ONLINE       0     0     0

errors: No known data errors