Disk Errors on both Motherboard SATA connection and on HBA connections

I have an interesting situation.

I am in the process of moving from multiple separate Synology NAS units to a new 45HomeLab HL15 (1.0) running TrueNAS SCALE 24.05.01.

ISSUE:
I started moving over 100TB of data over to the system off my Synology Units. I used NFSv4 mounts inside the Synology units to mount the NFS shares from the HL15.

The data copied over without issue, no errors, etc. It copied at only ~111 MB/s due to the 1Gb Ethernet connections on my Synology units.

I then started performing CRC checks of the files from Windows over SMB, using two computers concurrently. Again, this was only over 1Gb Ethernet, as my desktop and laptop each have a single 1Gb port, so together the CRC checks ran at about 2.0 Gb/s.

I began getting frequent disk errors on both the drives connected to the motherboard’s SATA controller and the drives connected to the SAS controller. It can be seen from the reporting page on TrueNAS that every time a drive timed out during SMB transfers, the data transfer stopped for that volume.

The drive errors MOSTLY occur when using SMB transfers. I did get one drive error when using NFS while mapping the same dataset and performing CRCs on the same files. I even stopped using SMB shares on the Windows machine and mounted my TrueNAS NFS share on Windows 11; using the NFS share, I again got one error. So the errors are heavily weighted toward SMB, but not exclusive to it.

I am not sure why I am getting these errors. I am leaning toward bad cables or a bad backplane.

This weekend, I plan to move one of my pools (the 5 drives that seemed to get the most frequent errors) to my Dell SAS expander, since the SSDs attached to it do not appear to have had any issues yet. This will allow me to bypass the HL15's factory cables and the backplane. If the errors go away, that would point to either a cable problem or a backplane issue. If the errors continue even after moving to the SAS expander, that would point more toward a drive issue or a power supply issue (the SAS expander has a separate Corsair XM1000 power supply).

System:
HL15 (1.0) fully burnt-in
SuperMicro X11SPH-nCTPF
Corsair XM1000
Xeon Silver
128GB ECC RAM

Added cards:
Nvidia RTX A400
LSI-9400-8e tied to DELL N4C2D JBOD 24 internal 12 external lane SAS2 6Gbps expander board
3x Micron 1.92TB SSDs
LSI-9400-8i replacing the motherboard's integrated Broadcom SAS3008 controller. The 3008 controller is physically disabled by a motherboard jumper pin
4-port 1Gb Ethernet card (Intel i350 chipset)

#########################################

HDD details
slots 10 through 15 [running through LSI 9400-8i]: WD Gold 18TB drives (WDC WD181KRYZ-01 Firmware: 1H01)
slots 1 through 5 [running through motherboard SATA controller]: WD Gold 18TB drives (WDC WD181KRYZ-01 Firmware: 1H01)
slot 6 [running through motherboard SATA controller]: WD purple 8TB drive (WDC WD82PURZ-85T Firmware: 0A82)

#########################################

root@truenas[~]# lsscsi
[0:0:0:0]    disk    ATA      WDC WD181KRYZ-01 1H01  /dev/sda	/volume2
[0:0:1:0]    disk    ATA      WDC WD82PURZ-85T 0A82  /dev/sdg	/volume5
[0:0:2:0]    disk    ATA      WDC WD181KRYZ-01 1H01  /dev/sdi	/volume2
[0:0:3:0]    disk    ATA      WDC WD181KRYZ-01 1H01  /dev/sdc	/volume2
[0:0:4:0]    disk    ATA      WDC WD181KRYZ-01 1H01  /dev/sdd	/volume2
[0:0:5:0]    disk    ATA      WDC WD181KRYZ-01 1H01  /dev/sdf	/volume2
[0:0:6:0]    disk    ATA      WDC WD181KRYZ-01 1H01  /dev/sde	/volume2
[0:0:7:0]    enclosu BROADCOM VirtualSES       03    -
[3:0:0:0]    enclosu AHCI     SGPIO Enclosure  2.00  -
[4:0:0:0]    disk    ATA      WDC WD181KRYZ-01 1H01  /dev/sdb	/volume3
[5:0:0:0]    disk    ATA      WDC WD181KRYZ-01 1H01  /dev/sdk	/volume3
**[6:0:0:0]    disk    ATA      WDC WD181KRYZ-01 1H01  /dev/sdl	/volume3
[7:0:0:0]    disk    ATA      WDC WD181KRYZ-01 1H01  /dev/sdh	/volume3
**[8:0:0:0]    disk    ATA      WDC WD181KRYZ-01 1H01  /dev/sdj	/volume3
[12:0:0:0]   enclosu AHCI     SGPIO Enclosure  2.00  -
[13:0:0:0]   disk    ATA      Micron_5400_MTFD U002  /dev/sdm	/volume1
[13:0:1:0]   disk    ATA      Micron_5400_MTFD U002  /dev/sdn	/volume1
[13:0:2:0]   disk    ATA      Micron_5400_MTFD U002  /dev/sdo	/volume1
[13:0:3:0]   enclosu Dell     SAS EXP V0110    0500  -
[14:0:0:0]   cd/dvd  JetKVM   Virtual Media          /dev/sr0
[N:0:1:1]    disk    KINGSTON SNV3S1000G__1                     /dev/nvme0n1

#########################################

root@truenas[~]# find -L /sys/bus/pci/devices/*/ata*/host*/target* -maxdepth 3 -name "sd*" 2>/dev/null | egrep block |egrep --colour '(ata[0-9]*)|(sd.*)'
/sys/bus/pci/devices/0000:00:17.0/ata3/host4/target4:0:0/4:0:0:0/block/sdb
/sys/bus/pci/devices/0000:00:17.0/ata4/host5/target5:0:0/5:0:0:0/block/sdk
/sys/bus/pci/devices/0000:00:17.0/ata5/host6/target6:0:0/6:0:0:0/block/sdl
/sys/bus/pci/devices/0000:00:17.0/ata6/host7/target7:0:0/7:0:0:0/block/sdh
/sys/bus/pci/devices/0000:00:17.0/ata7/host8/target8:0:0/8:0:0:0/block/sdj

#########################################

I have four pools (I manually added the pool names to the output of the lsscsi command above):

one is for my apps and uses the SSDs in RAIDZ1
one is for some of my data and uses 6x drives in RAIDZ1
one is for the rest of my data and uses 5x drives in RAIDZ1
the final is a single drive for Frigate surveillance recording

root@truenas[~]# zpool status
  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:04 with 0 errors on Sun Jul  6 03:45:06 2025
config:

        NAME         STATE     READ WRITE CKSUM
        boot-pool    ONLINE       0     0     0
          nvme0n1p3  ONLINE       0     0     0

errors: No known data errors

  pool: volume1
 state: ONLINE
  scan: scrub repaired 0B in 00:22:44 with 0 errors on Sun Jun 15 06:10:57 2025
config:

        NAME                                      STATE     READ WRITE CKSUM
        volume1                                   ONLINE       0     0     0
          raidz1-0                                ONLINE       0     0     0
            4596200f-3a52-4c76-a9db-9cc9faabaa9b  ONLINE       0     0     0
            49834b66-f515-4880-8109-b1f947a3365f  ONLINE       0     0     0
            4efb7611-233a-4c98-88c8-e325e4018666  ONLINE       0     0     0

errors: No known data errors

  pool: volume2
 state: ONLINE
  scan: scrub repaired 0B in 07:12:40 with 0 errors on Sun Jun 15 13:24:50 2025
config:

        NAME                                      STATE     READ WRITE CKSUM
        volume2                                   ONLINE       0     0     0
          raidz1-0                                ONLINE       0     0     0
            4c17130c-18e1-4039-a1c9-84746d532a64  ONLINE       0     0     0
            08d3a783-bd44-47ce-9fe2-c609d27b2068  ONLINE       0     0     0
            b3c7e3f4-86fe-45bd-8b10-c520ddf81377  ONLINE       0     0     0
            385b8d00-c5b3-4d72-a75d-3bffdcda5abe  ONLINE       0     0     0
            bcd7a35c-a74e-4dba-8033-912853232182  ONLINE       0     0     0
            c5d9c40a-0cee-480d-a1fd-170ae6efe486  ONLINE       0     0     0

errors: No known data errors

  pool: volume3
 state: ONLINE
config:

        NAME                                      STATE     READ WRITE CKSUM
        volume3                                   ONLINE       0     0     0
          raidz1-0                                ONLINE       0     0     0
            bc6cf883-dde3-4a60-9f74-8fc3f389faab  ONLINE       0     0     0
            8fe3c3d5-4213-4412-b162-081cd3580650  ONLINE       0     0     0
            d4140d27-be38-412f-843e-cf420187ee64  ONLINE       0     0     0
            004f830a-ac49-44e0-b0d3-d9f84a03288e  ONLINE       0     0     0
            1e1c7c7b-7c60-4602-8546-893c94f365bb  ONLINE       0     0     0

errors: No known data errors

  pool: volume5
 state: ONLINE
config:

        NAME                                    STATE     READ WRITE CKSUM
        volume5                                 ONLINE       0     0     0
          c7f8210f-a2a3-4e43-bf97-be50d99620da  ONLINE       0     0     0

errors: No known data errors

#########################################

root@truenas[~]# dmesg | grep ata5
[    2.764142] ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    2.764597] ata5.00: ATA-11: WDC WD181KRYZ-01AGBB0, 01.01H01, max UDMA/133
[    2.772909] ata5.00: 35156656128 sectors, multi 16: LBA48 NCQ (depth 32), AA
[    2.774587] ata5.00: Features: NCQ-sndrcv NCQ-prio
[    2.787591] ata5.00: configured for UDMA/133
[829544.395755] ata5.00: exception Emask 0x0 SAct 0x1880002 SErr 0x0 action 0x6 frozen
[829544.396278] ata5.00: failed command: READ FPDMA QUEUED
[829544.396760] ata5.00: cmd 60/00:08:f8:48:12/08:00:32:07:00/40 tag 1 ncq dma 1048576 in
[829544.397713] ata5.00: status: { DRDY }
[829544.398181] ata5.00: failed command: READ FPDMA QUEUED
[829544.398648] ata5.00: cmd 60/00:98:a8:34:0c/05:00:96:07:00/40 tag 19 ncq dma 655360 in
[829544.399597] ata5.00: status: { DRDY }
[829544.400086] ata5.00: failed command: READ FPDMA QUEUED
[829544.400566] ata5.00: cmd 60/40:b8:78:48:12/00:00:32:07:00/40 tag 23 ncq dma 32768 in
[829544.401580] ata5.00: status: { DRDY }
[829544.402080] ata5.00: failed command: READ FPDMA QUEUED
[829544.402575] ata5.00: cmd 60/40:c0:b8:48:12/00:00:32:07:00/40 tag 24 ncq dma 32768 in
[829544.403375] ata5.00: status: { DRDY }
[829544.403799] ata5: hard resetting link
[829544.718017] ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[829544.780035] ata5.00: configured for UDMA/133
[829544.780058] ata5: EH complete
[841426.885067] ata5.00: exception Emask 0x0 SAct 0xc84 SErr 0x0 action 0x6 frozen
[841426.885789] ata5.00: failed command: READ FPDMA QUEUED
[841426.886513] ata5.00: cmd 60/40:10:88:0c:e2/05:00:b2:07:00/40 tag 2 ncq dma 688128 in
[841426.887991] ata5.00: status: { DRDY }
[841426.888743] ata5.00: failed command: READ FPDMA QUEUED
[841426.889397] ata5.00: cmd 60/40:38:b0:d7:18/00:00:d2:01:00/40 tag 7 ncq dma 32768 in
[841426.890218] ata5.00: status: { DRDY }
[841426.890626] ata5.00: failed command: READ FPDMA QUEUED
[841426.891026] ata5.00: cmd 60/40:50:30:d8:18/00:00:d2:01:00/40 tag 10 ncq dma 32768 in
[841426.891843] ata5.00: status: { DRDY }
[841426.892245] ata5.00: failed command: READ FPDMA QUEUED
[841426.892645] ata5.00: cmd 60/00:58:48:11:e2/08:00:b2:07:00/40 tag 11 ncq dma 1048576 in
[841426.893482] ata5.00: status: { DRDY }
[841426.893890] ata5: hard resetting link
[841427.203255] ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[841427.260908] ata5.00: configured for UDMA/133
[841427.260954] ata5: EH complete
root@truenas[~]# dmesg | grep ata7
[    2.454588] ata7: SATA max UDMA/133 abar m524288@0xaa700000 port 0xaa700300 irq 186 lpm-pol 0
[    2.764061] ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    2.764516] ata7.00: ATA-11: WDC WD181KRYZ-01AGBB0, 01.01H01, max UDMA/133
[    2.772888] ata7.00: 35156656128 sectors, multi 16: LBA48 NCQ (depth 32), AA
[    2.774568] ata7.00: Features: NCQ-sndrcv NCQ-prio
[    2.787810] ata7.00: configured for UDMA/133
[838055.874386] ata7.00: exception Emask 0x0 SAct 0x6018001f SErr 0x0 action 0x6 frozen
[838055.875473] ata7.00: failed command: READ FPDMA QUEUED
[838055.876346] ata7.00: cmd 60/80:00:10:be:4b/00:00:d4:07:00/40 tag 0 ncq dma 65536 in
[838055.877996] ata7.00: status: { DRDY }
[838055.878864] ata7.00: failed command: READ FPDMA QUEUED
[838055.879708] ata7.00: cmd 60/80:08:10:bd:4b/00:00:d4:07:00/40 tag 1 ncq dma 65536 in
[838055.881462] ata7.00: status: { DRDY }
[838055.882372] ata7.00: failed command: READ FPDMA QUEUED
[838055.883284] ata7.00: cmd 60/40:10:98:ac:4b/00:00:d4:07:00/40 tag 2 ncq dma 32768 in
[838055.885161] ata7.00: status: { DRDY }
[838055.886095] ata7.00: failed command: READ FPDMA QUEUED
[838055.887068] ata7.00: cmd 60/40:18:58:ac:4b/00:00:d4:07:00/40 tag 3 ncq dma 32768 in
[838055.889482] ata7.00: status: { DRDY }
[838055.890917] ata7.00: failed command: READ FPDMA QUEUED
[838055.891912] ata7.00: cmd 60/00:20:d0:c3:4b/01:00:d4:07:00/40 tag 4 ncq dma 131072 in
[838055.893768] ata7.00: status: { DRDY }
[838055.894705] ata7.00: failed command: READ FPDMA QUEUED
[838055.895508] ata7.00: cmd 60/80:98:58:af:4b/00:00:d4:07:00/40 tag 19 ncq dma 65536 in
[838055.897075] ata7.00: status: { DRDY }
[838055.897911] ata7.00: failed command: READ FPDMA QUEUED
[838055.898737] ata7.00: cmd 60/c0:a0:d0:b3:4b/00:00:d4:07:00/40 tag 20 ncq dma 98304 in
[838055.900235] ata7.00: status: { DRDY }
[838055.900947] ata7.00: failed command: READ FPDMA QUEUED
[838055.901662] ata7.00: cmd 60/00:e8:00:a1:08/04:00:fd:06:00/40 tag 29 ncq dma 524288 in
[838055.903131] ata7.00: status: { DRDY }
[838055.903847] ata7.00: failed command: READ FPDMA QUEUED
[838055.904500] ata7.00: cmd 60/40:f0:90:b6:4b/00:00:d4:07:00/40 tag 30 ncq dma 32768 in
[838055.905837] ata7.00: status: { DRDY }
[838055.906517] ata7: hard resetting link
[838056.216634] ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[838056.293820] ata7.00: configured for UDMA/133
[838056.293845] ata7: EH complete

I am also getting power-on device reset errors for drives [0:0:2:0], [0:0:3:0], [0:0:4:0], and [0:0:5:0] which are different from the ata5 and ata7 errors above.

I’d send a note to info@45homelab.com and have them walk you through testing the backplane. From a few other posts here, it seems like that may be the direction this is headed. We could walk through removing cards and changing out cables here, but IIRC all the cable issues have been from people using SAS drives and the cables not handling the 12 gbps bursts well.

Although it's an aside to your issue, do you really need to read all the data back over the network to do the CRC checks? Is there a Windows program for this that you are most comfortable with, or something similar? I think you should be able to create checksums to compare from the command line on the HL15 and the Synology. Or perhaps set up a Windows VM on the HL15 that could at least access that copy locally in Windows.

Also, maybe I don't understand the file transfer process vs the checksum process. Were the laptop and desktop involved in the file copying or just the checksumming? I think you should be able to set up rsync to copy files from the Synology to the HL15.

The file copying ONTO the HL15 was over NFS from my three Synology units.

The checksumming is only on Windows.

I am using a program called ExactFile (a Windows program) that performs CRC checks. Agreed, the CRC checks were not required, but they also intentionally test the heck out of the drives, as all of the drives in the system are new. I did test all of the drives using HD Sentinel's "destructive write+read" test. I wanted to purposely test the system extensively before I retire the Synology systems, and based on my issues I am happy I did so, as I may have hardware issues.

Now, for full disclosure, this HL15 is ALREADY a replacement system. My first HL15 was RMA'ed due to system crashes where IPMI would record "catastrophic processor errors". I worked with 45Drives tech support to troubleshoot the issue, and it appeared to have been either a processor or motherboard problem.

The replacement system is not crashing, just getting these disk errors. BUT I was also getting disk errors on the old RMA'ed system… which might mean I have bad drives… but that would be quite a few bad drives if it were true…

Edit:
As stated, I am going to move one of my pools to the SAS expander that I am building into a JBOD, to see if I still get disk errors. I am really interested to see whether the errors occur or do NOT occur while connected to the SAS expander. I plan to do that tonight/this weekend and test the heck out of the drives again.

I didn’t say that at all. I was just suggesting there might be a more efficient way to do the checksumming.

So did the files (or 99.9% of them) actually copy correctly, and the issues are just with the CRC process, i.e., the CRC process is failing or giving false errors? Or is the checksum process reporting true errors where the files did not copy correctly? It sounds like what you're saying is that the load caused by the checksumming is what is causing your errors, not the load of the original file copies.

You should be able to use smartctl in the TrueNAS shell to see if the drives are healthy. There is also an app called Scrutiny you can load that GUIfies that data. You could also mount the drives via a USB/SATA adapter to one of the Windows machines and use CrystalDiskInfo to examine the SMART info. Scrutiny isn't quite as robust as CDI in properly decoding the data, as some of the interpretation is reverse-engineered and not published openly by the drive manufacturers.

All of the files appear to have copied over correctly. The CRC process is not "failing" per se; when the disk errors occur, the disk reset makes the data transfer error out for 30 seconds or so, and ExactFile then aborts, so it has not been able to complete its process.

I did extended SMART tests and I have checked all of the smartctl -x parameters. Everything looks good on all disks. The only thing in SMART that shows anything amiss is the number of drive resets incrementing, which of course makes sense since the controller is resetting the disks.

I have a custom script that runs hourly and records all of the SMART parameters and extended parameters available from smartctl -x into InfluxDB.
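If it helps anyone building something similar, the heart of such a script is just emitting InfluxDB line protocol (measurement, tags, fields, timestamp) and posting it to the database's write endpoint. A minimal bash sketch; `smart_line` is a hypothetical helper, not my actual script:

```shell
#!/bin/bash
# Hypothetical helper: format one SMART value as a line of InfluxDB
# line protocol:  measurement,tag=value field=value timestamp
# A real script would pull the values out of `smartctl -x` per disk
# and POST the collected lines to InfluxDB's HTTP write API.
smart_line() {
  local serial="$1" attr="$2" value="$3" ts="$4"
  printf 'smart,serial=%s %s=%s %s\n' "$serial" "$attr" "$value" "$ts"
}

smart_line 6FG1S6NL device_resets 3 1700000000
# -> smart,serial=6FG1S6NL device_resets=3 1700000000
```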

OK, well it seems like this isn’t your first rodeo, so I’m not sure I have much else that might be helpful. You seem to be digging into all the things that seem to cause that error (bad power, bad drives, bad cables, bad controller, bad firmware), and I see you also posted about this on the TN forums.

How is the HL15 connected to the network? LAG on the added 4-port NIC? How many cables are in use? Are the Windows machines accessing the HL15 over a physically different connection than the Synology? The errors showing up mostly under the concurrent Windows client load seems odd, but maybe it's a red herring. It doesn't seem like your build should have power issues, but maybe something about the NIC is causing problems.

One thing you might try is to run a scrub; does that generate any errors? If you can write to the disks under the load of the file transfer and can run a scrub, I don’t think it can be a cable or backplane issue.

Two other things I have seen mentioned that you might try for testing and debugging are disabling NCQ and limiting the SATA link speed to 3Gbps.

The HL15 is connected to my network using one of the 10GbE ports on my main VLAN; the second 10GbE port is on another VLAN for dedicated 10Gb connections, as I just received a Thunderbolt 10Gb Ethernet adapter.

The 4-port NIC is used to access my cameras for Frigate and to access my APC battery backups, and the third port is for my switch-management VLAN. The 4th port is not used.

I ran a scrub on the OLD system without error, and it ran at a good 900-1000MB/s for 7 or more hours. I have not run a scrub on this hardware yet.

Right now I have moved the 6x drives from one of my pools onto the Dell SAS expander I am using in a homemade JBOD. I am currently using SMB and transferring about 3-3.5Gbps from Windows (using the new Thunderbolt 10Gb adapter), and have been for the last 3 hours without issues.

I will let the CRC checks run overnight on that pool via the Dell SAS expander and see if I get any errors…


Well, I did have TWO drive power-on resets among the 6x drives in the pool running on the SAS expander, while averaging around 3Gbps using my new Thunderbolt 10Gb Ethernet adapter.

This is MUCH fewer errors than I was getting when I was pumping nearly 3Gbps using two separate computers, so "maybe" this is progress?

I have been tracking the serial numbers of the drives giving me errors (since the disk identifiers (sda, sdb, etc.) changed when I moved disks to the SAS expander).

Drives having errors
First config with errors (with original system):
volume2
6FG1S6NL
6EG3YAJL
6EG502LL

Second config with errors (with new system and all drives installed):
volume2
6FG1B84L
6EG4L39L
6FG1S6NL
6EG502LL
6EG4WMYL
volume3
6FG1JSZL
6FG1GVDL
4BJXKBGH

Third config with errors (with volume2 on the Dell SAS expander):
volume2
6EG4Z39L
6FG1B84L

Can you explain how to disable Native Command Queuing (NCQ) and how to force SATA speeds to 3Gbps?

How can I undo the setting changes later to re-enable NCQ and return speeds to 6Gbps?

Obviously throughput will be reduced, but are there any other negatives/concerns/gotchas that I should be aware of from disabling NCQ and lowering the speed to 3Gbps?

If it matters, all of my drives are currently running through either a 9400-8e (for the SAS expander) or the motherboard SATA ports.

Well, I was just repeating some things I saw. I think this is done via GRUB parameters. Most of the sources refer to a file called /boot/armbianEnv.txt that I don't think is present/applicable to TrueNAS, but conceptually the parameters are:

extraargs=libata.force=noncq,3.0, and
extraargs=libata.force=3.0

See, e.g., this post: https://forum.armbian.com/topic/15937-sata-issue-drive-resets-atax00-failed-command-read-fpdma-queued/?do=findComment&comment=121892

To update GRUB parameters in TrueNAS, refer, e.g., to this post: https://forums.truenas.com/t/how-to-limit-sata-link-speed/25019
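For context, on a plain Debian-based system the conceptual mechanism looks like the following (a sketch only, using the parameter value from this thread; TrueNAS SCALE manages its own boot configuration and may overwrite /etc/default/grub on upgrade, so follow the linked post for the supported method):

```shell
# In /etc/default/grub, append the option to the kernel command line:
#   GRUB_CMDLINE_LINUX="libata.force=noncq,3.0G"
# then regenerate grub.cfg and reboot:
update-grub
```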

So between the two you can probably figure it out or ask over on the TN forum.

Appreciate it, I will look into this.

OK, so I decided to spin up TrueNAS on an old Dell desktop I had, to play around.

I was able to temporarily set this on that system using libata.force=3.0G, following the instructions here (Kernel parameters - ArchWiki), by:

  1. rebooting, pressing e at the GRUB screen, adding libata.force=3.0G to the line starting with linux, and pressing Ctrl+X to let the system boot

I confirmed:

1.) it did set the SATA link speed to 3.0Gbps for the single SATA disk (the boot disk is a small NVMe disk) when checking with smartctl -x /dev/sda

2.) it was temporary: after rebooting normally, the SATA speed was back to 6.0Gbps
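For anyone scripting that check: smartctl's info section prints both the max and the currently negotiated speed on its "SATA Version is:" line, which is easy to filter. A small bash sketch (the sample line below is hard-coded for illustration; on a live system you would pipe `smartctl -x /dev/sda` into the filter):

```shell
#!/bin/bash
# Extract the currently negotiated link speed from a smartctl info line,
# e.g.:  SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 3.0 Gb/s)
current_speed() { grep -oE 'current: [0-9.]+ Gb/s'; }

# live usage would be:  smartctl -x /dev/sda | current_speed
printf 'SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 3.0 Gb/s)\n' | current_speed
# -> current: 3.0 Gb/s
```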

Right now I have LONG SMART tests running on my disks, which have around a 1950 min (32 hour) duration (18TB drives), so I am going to let those finish, and then I will try setting the libata.force=3.0G option temporarily on my troubled system and see what happens.

Does anyone know whether the libata.force=3.0G parameter controls disks on the motherboard SATA ports AND the ports on an HBA like my LSI 9400 cards, or will I need to configure the LSI 9400 cards using storcli or something?

I ask as I have seen two sets of errors:

Disks on my motherboard SATA controller give the following errors:

[829544.395755] ata5.00: exception Emask 0x0 SAct 0x1880002 SErr 0x0 action 0x6 frozen
[829544.396278] ata5.00: failed command: READ FPDMA QUEUED
[829544.396760] ata5.00: cmd 60/00:08:f8:48:12/08:00:32:07:00/40 tag 1 ncq dma 1048576 in
[829544.397713] ata5.00: status: { DRDY }
[829544.398181] ata5.00: failed command: READ FPDMA QUEUED
[829544.398648] ata5.00: cmd 60/00:98:a8:34:0c/05:00:96:07:00/40 tag 19 ncq dma 655360 in
[829544.399597] ata5.00: status: { DRDY }
[829544.400086] ata5.00: failed command: READ FPDMA QUEUED
[829544.400566] ata5.00: cmd 60/40:b8:78:48:12/00:00:32:07:00/40 tag 23 ncq dma 32768 in
[829544.401580] ata5.00: status: { DRDY }
[829544.402080] ata5.00: failed command: READ FPDMA QUEUED
[829544.402575] ata5.00: cmd 60/40:c0:b8:48:12/00:00:32:07:00/40 tag 24 ncq dma 32768 in
[829544.403375] ata5.00: status: { DRDY }
[829544.403799] ata5: hard resetting link
[829544.718017] ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[829544.780035] ata5.00: configured for UDMA/133
[829544.780058] ata5: EH complete

HOWEVER, disks on my LSI 9400 cards seem to only give these errors:

sd 0:0:2:0: attempting task abort!scmd(0x0000000064279725), outstanding for 30236 ms & timeout 30000 ms
sd 0:0:2:0: [sdi] tag#357 CDB: Read(16) 88 00 00 00 00 01 c1 b7 17 08 00 00 00 30 00 00
sd 0:0:2:0: task abort: SUCCESS scmd(0x0000000064279725)
sd 0:0:2:0: attempting task abort!scmd(0x00000000dc41a872), outstanding for 30460 ms & timeout 30000 ms
sd 0:0:2:0: [sdi] tag#354 CDB: Read(16) 88 00 00 00 00 01 c1 b7 17 e0 00 00 00 30 00 00
sd 0:0:2:0: No reference found at driver, assuming scmd(0x00000000dc41a872) might have completed
sd 0:0:2:0: task abort: SUCCESS scmd(0x00000000dc41a872)
sd 0:0:2:0: attempting task abort!scmd(0x00000000bbf55408), outstanding for 30460 ms & timeout 30000 ms
sd 0:0:2:0: [sdi] tag#353 CDB: Read(16) 88 00 00 00 00 06 f2 1e 4d f0 00 00 01 70 00 00
sd 0:0:2:0: No reference found at driver, assuming scmd(0x00000000bbf55408) might have completed
sd 0:0:2:0: task abort: SUCCESS scmd(0x00000000bbf55408)
sd 0:0:2:0: attempting task abort!scmd(0x0000000090bc8f3d), outstanding for 30244 ms & timeout 30000 ms
sd 0:0:2:0: [sdi] tag#383 CDB: Read(16) 88 00 00 00 00 07 10 10 93 b0 00 00 00 38 00 00
sd 0:0:2:0: No reference found at driver, assuming scmd(0x0000000090bc8f3d) might have completed
sd 0:0:2:0: task abort: SUCCESS scmd(0x0000000090bc8f3d)
sd 0:0:2:0: attempting task abort!scmd(0x00000000a8dd63ed), outstanding for 30248 ms & timeout 30000 ms
sd 0:0:2:0: [sdi] tag#382 CDB: Read(16) 88 00 00 00 00 07 10 10 94 88 00 00 07 e8 00 00
sd 0:0:2:0: No reference found at driver, assuming scmd(0x00000000a8dd63ed) might have completed
sd 0:0:2:0: task abort: SUCCESS scmd(0x00000000a8dd63ed)
sd 0:0:2:0: Power-on or device reset occurred

and I am not sure whether that means libata.force will affect only my motherboard drives or all drives?

My understanding is that it is specifically for the SATA controller. You need to use something like storcli, sas2ircu, or lsiutil for the SAS controller. I think storcli will work for you. This thread may help or give other ideas, but I think (?) part of the difference there was they were dealing with the onboard SAS3008, not a discrete HBA.

Really appreciate it.

I have also stumbled upon another possibility.

This article

talks about needing to turn off Native Command Queuing for all of his WD Gold 16TB drives by setting the queue depth to 1, and it links to other articles and discussions about flawed NCQ in WD Golds in the ZFS GitHub from 2020…

I am using WD Gold 18TB drives…

I am going to let the LONG SMART tests finish (they should be done late tonight), and then I am going to use this script to set the queue depth to 1 for ONLY my WD Gold 18TB drives. This will leave my Micron 1.92TB SSDs and my WD Purple drive (for Frigate surveillance) alone at their default queue depth of 32.

#!/bin/bash

# Set the queue depth to 1 for every disk except the Micron SSDs and the
# WD Purple. Note: [[ ]] and the ${i/.../} substitution are bash features,
# so the shebang must be bash, not plain sh.
for i in /dev/sd? ; do
	model=$(smartctl -i "$i" | grep "Device Model")
	if [[ "$model" =~ "Micron" ]] || [[ "$model" =~ "PURZ" ]]; then
		echo "skipping disk: $i --> $model"
	else
		echo "Disabling NCQ for disk $i"
		echo 1 > "/sys/block/${i/\/dev\/}/device/queue_depth"
	fi
done

So my plans are:

1.) set the queue depth to 1 and test the system
2.) if there are still errors, try setting libata.force=3.0G. While this will not affect one of my pools, since it will always be running off an HBA, it will allow me to test my other pool connected to the motherboard SATA controller. That pool also gets errors under heavy load, so it will be worth testing. Assuming this fixes the issue, I will then worry about configuring the HBA controllers.
3.) if there are still errors, I will try testing with BOTH libata.force=3.0G and the queue depth set to 1

I am hoping the issue will be fixed by option 1.
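As a sanity check after step 1, the queue depth can be read back from sysfs. A minimal bash sketch; `report_qd` is a hypothetical helper, and the sysfs root is a parameter only so the function can be dry-run tested against a fake directory tree:

```shell
#!/bin/bash
# Print the current queue_depth for every sd? disk found under a sysfs root
# (defaults to the real /sys on a live system).
report_qd() {
  local sysroot="${1:-/sys}"
  local devdir
  for devdir in "$sysroot"/block/sd?; do
    [ -e "$devdir/device/queue_depth" ] || continue
    printf '%s queue_depth=%s\n' "${devdir##*/}" "$(cat "$devdir/device/queue_depth")"
  done
}
```

On the live system, `report_qd` with no argument walks /sys/block and should show 1 for the WD Golds and 32 for the SSDs and the Purple.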


Yeah, that seems like it’s probably related. I’ve been using Seagate drives recently.

You might lose a bit of performance on the HDDs depending on your workload, but yes I think NCQ is more important for SSD performance, so good to try to leave those alone.

In a quick look, it didn't seem like WD publicly publishes firmware updates for their drives. It's possible you could contact their support and see if there are any by-request-only firmware fixes available. You could also load up their Dashboard software on one of your Windows machines, connect one of the problem WD Golds to it, and see if the utility picks up any firmware update.

Finally, I didn’t read the whole thread but it did mention the possibility of a TrueNAS Core vs Scale issue. If your workarounds didn’t work, you could maybe load up Core and see if you still get the same errors under Core. If you don’t, that may point to a TrueNAS Scale bug rather than a WD drive firmware bug (although I do suspect the latter given WD’s whole SMR debacle).

Coming from Synology, where even internally between volumes I only got about 250MB/s, and having lived with 1Gb Ethernet from the start, I would be happy even with 200-300 MB/s over SMB (on a 5x-wide and a 6x-wide array, that would only be about 60MB/s per disk).

While I would be happy with that, it is obviously not the greatest; the most important thing is that everything WORKS.

Well, all LONG SMART tests completed without issue; no unusual readings from SMART.

Last night around 9:00 PM I started my CRC-check data transfers. As of this morning at 5:00 AM (8 hours in), no errors yet. Keeping my fingers crossed that this corrects the issue.


Over 40 hours and counting, and zero errors on both pools while pulling around 2Gbps on one volume and 2.5Gbps on the second volume.

Before making the queue depth change, copying one file over SMB I was getting around 400 MB/s; after the change I still got about 400 MB/s over SMB, so at least for my needs no real speed impact occurred. That said, on my pool with 6x disks I ran a scrub a few weeks back and was getting over 1000 MB/s during it, so I plan to eventually re-run a scrub on that volume, run a scrub on the 5x-disk volume, and see what my data speeds are during the scrubs.

I am going to let my current CRC-check process complete, then move the 6x drives that I took out of the HL15 and attached to my JBOD expander back inside the HL15. I will be adding my queue-depth script to my TrueNAS config as a post-init script so it runs automatically at every boot, and I will do some more data-intensive activities to stress the drives.


Glad it’s working better. I’d consider it more of a work-around than a fix.

My impression was the performance impact, to the extent that there is one with HDDs, would present more with concurrent random write operations, not so much read.

Fair; I too do not think it is a true fix, but it does (seem to) alleviate the issue for now and allows me to use the system without risk of data corruption.

I am going to try reaching out to WD to see if they have any knowledge of this issue, since it seems to have been occurring since 2020.

Luckily for me, the two pools that are affected are used entirely for large-file storage for Plex, so random reads are not as much of an issue as they COULD be.

Well, I do seem to have "solved" the issue.

Transferring data to and from the server at around 600MB/s over SMB, I did not have any errors with the queue depth set to 1, while transferring 50TB off and 50TB back onto the system 10 times over, to both pools.

As soon as I set the queue depth to a value of 2 or higher, I begin getting the disk errors within a couple of TB of transfer.

I did reach out to WD support, and according to them I am running the latest firmware version available. Their support does not think the issue is with their drives; they think it is something with either the controllers or the software… they were not very helpful…

So, anyone with WD Gold drives who is seeing the same errors I am: try setting your queue depth to a value of 1 instead of the default (32, for me anyway).
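For reference, the sysfs change is runtime-only and reverts on reboot, so it is easy to try and easy to undo. A minimal bash sketch (`set_qd` is a hypothetical helper; the sysfs root is parameterized only so it can be tested against a fake tree):

```shell
#!/bin/bash
# Write a new queue_depth for one disk via sysfs (run as root on a live box).
set_qd() {
  local dev="$1" depth="$2" sysroot="${3:-/sys}"
  echo "$depth" > "$sysroot/block/$dev/device/queue_depth"
}

# live usage:          set_qd sda 1
# restore the default: set_qd sda 32
```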
