SAS = Too Slow, maybe NVMe?

Here’s a raidz1, lz4 compression, all 15 disks:

root@pve:/hdd# fio --name=seqwrite --ioengine=posixaio --rw=write --bs=1M --numjobs=1 --size=10G --iodepth=16 --filename=seqtest --runtime=60 --time_based --direct=1
seqwrite: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=posixaio, iodepth=16
fio-3.33
Starting 1 process
seqwrite: Laying out IO file (1 file / 10240MiB)
Jobs: 1 (f=1): [W(1)][100.0%][w=304MiB/s][w=304 IOPS][eta 00m:00s]
seqwrite: (groupid=0, jobs=1): err= 0: pid=287196: Tue Feb 6 13:39:36 2024
write: IOPS=354, BW=354MiB/s (371MB/s)(20.8GiB/60054msec); 0 zone resets
slat (usec): min=2, max=597, avg=16.32, stdev=10.49
clat (usec): min=1750, max=102539, avg=45130.73, stdev=19359.65
lat (usec): min=1754, max=102547, avg=45147.05, stdev=19360.24
clat percentiles (usec):
| 1.00th=[ 2671], 5.00th=[ 2802], 10.00th=[ 4490], 20.00th=[40109],
| 30.00th=[50594], 40.00th=[51643], 50.00th=[52167], 60.00th=[53216],
| 70.00th=[54789], 80.00th=[55837], 90.00th=[58459], 95.00th=[62129],
| 99.00th=[70779], 99.50th=[73925], 99.90th=[87557], 99.95th=[88605],
| 99.99th=[89654]
bw ( KiB/s): min=202752, max=4972544, per=100.00%, avg=362615.47, stdev=460314.17, samples=120
iops : min= 198, max= 4856, avg=354.12, stdev=449.53, samples=120
lat (msec) : 2=0.24%, 4=9.13%, 10=5.88%, 20=2.39%, 50=10.16%
lat (msec) : 100=72.19%, 250=0.01%
cpu : usr=0.68%, sys=0.04%, ctx=10647, majf=0, minf=25
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=50.0%, 16=49.9%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=95.8%, 8=0.0%, 16=4.2%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,21262,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
WRITE: bw=354MiB/s (371MB/s), 354MiB/s-354MiB/s (371MB/s-371MB/s), io=20.8GiB (22.3GB), run=60054-60054msec

root@pve:/hdd# fio --name=seqread --ioengine=posixaio --rw=read --bs=1M --numjobs=1 --size=10G --iodepth=16 --filename=seqtest --runtime=60 --time_based --direct=1
seqread: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=posixaio, iodepth=16
fio-3.33
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=5862MiB/s][r=5861 IOPS][eta 00m:00s]
seqread: (groupid=0, jobs=1): err= 0: pid=287448: Tue Feb 6 13:40:45 2024
read: IOPS=5849, BW=5849MiB/s (6133MB/s)(343GiB/60003msec)
slat (nsec): min=50, max=140845, avg=125.13, stdev=253.27
clat (usec): min=991, max=7524, avg=2734.30, stdev=112.45
lat (usec): min=991, max=7524, avg=2734.42, stdev=112.46
clat percentiles (usec):
| 1.00th=[ 2671], 5.00th=[ 2671], 10.00th=[ 2704], 20.00th=[ 2704],
| 30.00th=[ 2704], 40.00th=[ 2704], 50.00th=[ 2737], 60.00th=[ 2737],
| 70.00th=[ 2737], 80.00th=[ 2769], 90.00th=[ 2802], 95.00th=[ 2868],
| 99.00th=[ 2933], 99.50th=[ 3130], 99.90th=[ 3556], 99.95th=[ 3916],
| 99.99th=[ 4178]
bw ( MiB/s): min= 5568, max= 6368, per=100.00%, avg=5852.25, stdev=68.27, samples=119
iops : min= 5568, max= 6368, avg=5852.25, stdev=68.27, samples=119
lat (usec) : 1000=0.02%
lat (msec) : 2=0.26%, 4=99.70%, 10=0.03%
cpu : usr=0.75%, sys=0.52%, ctx=175491, majf=0, minf=25
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=50.0%, 16=50.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=95.8%, 8=4.1%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=350977,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
READ: bw=5849MiB/s (6133MB/s), 5849MiB/s-5849MiB/s (6133MB/s-6133MB/s), io=343GiB (368GB), run=60003-60003msec

And just for fun, here it is with RAID10 with 14 of the 15 drives:
WRITE: bw=1071MiB/s (1123MB/s), 1071MiB/s-1071MiB/s (1123MB/s-1123MB/s), io=62.8GiB (67.4GB), run=60019-60019msec
READ: bw=5866MiB/s (6151MB/s), 5866MiB/s-5866MiB/s (6151MB/s-6151MB/s), io=344GiB (369GB), run=60002-60002msec

--direct=1 applied

This feels like ZFS is just being crummy. Not an issue with the HBA or the disks?

This last test tells me it’s your RAID type slowing down the writes.

With a single raidz1 vdev, you only get one drive of performance. When you do the RAID10, you are getting seven drives of performance (each vdev gets one drive of performance for random IO).

But sequential IO should be able to benefit from the raidz1, where 14 drives would be used for data and 1 drive would be lost to parity.
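
For anyone recreating the comparison, the two layouts look roughly like this at the CLI (disk names are hypothetical; real pools should use /dev/disk/by-id paths):

# one 15-wide raidz1 vdev - roughly a single vdev's worth of random IOPS
zpool create hdd raidz1 sda sdb sdc sdd sde sdf sdg sdh sdi sdj sdk sdl sdm sdn sdo

# "RAID10": 7 two-way mirror vdevs (14 disks), 15th disk as a hot spare - ~7 vdevs of random IOPS
zpool create hdd mirror sda sdb mirror sdc sdd mirror sde sdf mirror sdg sdh mirror sdi sdj mirror sdk sdl mirror sdm sdn spare sdo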

I agree this has nothing to do with the hardware of the system.

Thank you so much for helping me study this! Hopefully, it is of some help to others too. So, it looks like I need to go with RAID10 since my workloads are almost all random.
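
Since the takeaway hinges on random IO, here is a random-read/write variant of the same fio test for anyone checking their own pool - same flags as above, just 4k blocks and a randrw pattern (the 70/30 read mix is my own assumption):

fio --name=randrw --ioengine=posixaio --rw=randrw --rwmixread=70 --bs=4k --numjobs=1 --size=10G --iodepth=16 --filename=randtest --runtime=60 --time_based --direct=1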

Cheers all!


@doodlemania Thanks for asking the question and following up so enthusiastically with all the great responses.

Bookmarking this for when I run into a similar issue next week. It’s my first time with Houston, an HBA, and anything in ZFS beyond “use raidz1 on these 4 drives” - which was fast enough to bottleneck a 1GbE NIC :grin:.

Anyways, yep. Already helpful.


I am bookmarking this post as well.

Sharing some thanks for this thread. Kudos to both of you for the friendly exchange that rarely seems to occur around ZFS for some reason…

I’m here right now because I’m using the commands @Hutch-45Drives provided to benchmark my previous storage box vs my HL15, mainly to rule out NFSv3 vs NFSv4.1 woes with vCenter.

Really great to hear that others may find use in this thread! I’m continuing to study and tweak as well!

So, I thought I’d be all smart and installed TrueNAS in a VM on Proxmox.
I created a single pool with the following characteristics:

  1. 7 VDevs
  2. Each VDev with 2 disks in “mirror” mode
    (leaves one spare)
  3. One dataset with compression and encryption off
  4. NFSd it wide open “root” “root” (we’re just playing)
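
Roughly what steps 3 and 4 amount to on plain OpenZFS - TrueNAS drives this from its UI, and the pool/dataset names here are hypothetical:

# dataset with compression off (encryption is off by default unless the pool was created encrypted)
zfs create -o compression=off tank/vmstore
# export it wide open for testing - not something to leave in production
zfs set sharenfs='rw,no_root_squash' tank/vmstore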

Back on the Proxmox side, I mounted the NFS export as Proxmox storage.
Created a Linux VM on this supposedly (in my brain) stupid-fast setup - but only 5.5TB of storage.
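
For reference, the Proxmox side of that mount can also be done from the shell (storage ID, server address, and export path are hypothetical):

# register the NFS export as a Proxmox storage backend for VM disks
pvesm add nfs truenas-vmstore --server 192.168.1.50 --export /mnt/tank/vmstore --content images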

Reads: 82GB IO and 1379MB/sec - shiny! ZFS cache kicked in even though I used --direct
Writes: 3020MB IO and 50MB/sec - …really?

That still seems…slow? Or am I just underwhelming myself?
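
Aside on that cache note: O_DIRECT has historically not bypassed ZFS’s ARC, so for benchmarking you can restrict caching on the test dataset instead (dataset name hypothetical):

# cache only metadata while benchmarking
zfs set primarycache=metadata tank/vmstore
# restore normal caching afterwards
zfs set primarycache=all tank/vmstore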

Did you pass through the whole controller from Proxmox to TrueNAS? That’s a HUGE thing to make sure you get correct. There’s tons of horror stories about just passing through the disks.

Is that related to your I/O concerns? Probably not.
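
For anyone double-checking their own setup, whole-controller passthrough on Proxmox looks roughly like this (VM ID and PCI address are hypothetical):

# find the HBA's PCI address
lspci | grep -i sas
# hand the entire controller to VM 100
qm set 100 --hostpci0 0000:01:00.0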

Are these numbers from fio run on Proxmox or on TrueNAS, or are they measured over the NFS mount? The more abstraction you put between the hardware and the OS handling ZFS, the more performance hits you take.

The writes being that much slower do seem strange, but again, I’m still learning daily about ZFS.

Great question - yes indeed, passed through both PCI entries of the controller. The fio is from a VM running on Proxmox whose storage is NFS’d to the TrueNAS (same box). I’ll repeat the fio from a shell on TrueNAS directly to compare!

OH! Holy buggers - NFS sucks, I guess. From the TrueNAS VM on Proxmox (controller passed through) and the above vdev alignment:

READ: bw=16.7GiB/s (17.9GB/s), 16.7GiB/s-16.7GiB/s (17.9GB/s-17.9GB/s), io=999GiB (1073GB)

WRITE: bw=990MiB/s (1038MB/s), 990MiB/s-990MiB/s (1038MB/s-1038MB/s), io=58.0GiB (62.3GB)

That seems … awesome. So…NFS = poo?

If you need NFS, for testing you could try disabling sync on the dataset and see if that helps? I think best practice is not to actually run that way. If that does help, you could try adding a dedicated ZIL device (SLOG) and re-enabling sync.

See, e.g.:
https://jrs-s.net/2019/07/20/zfs-set-syncdisabled/
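
The commands involved are short; a sketch against a hypothetical pool/dataset:

# testing only: acknowledge sync writes immediately (data-loss risk on power failure)
zfs set sync=disabled tank/vmstore
# put it back after testing
zfs set sync=standard tank/vmstore
# if disabling sync helped, a dedicated log device recovers much of that speed safely (device is hypothetical)
zpool add tank log /dev/nvme0n1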

What would be the most performant way? iSCSI? Surely not SMB, methinks?

iSCSI has the least overhead, but it has limitations; you can only mount that share on one client at a time, and you preallocate the space, so it doesn’t grow or shrink with the data. It’s a bit like plugging a drive into your computer with an Ethernet cable rather than a SATA cable.
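
For context, an iSCSI extent on ZFS is typically backed by a zvol rather than a dataset; creating one looks like this (names and size hypothetical), with the target/extent wiring then done in the TrueNAS Sharing UI:

# fixed-size 500G zvol to back an iSCSI extent
zfs create -V 500G tank/iscsi-vm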

What’s the recommendation for using it with VMware as a shared datastore for many hosts in an HA cluster?
I’ve been using NFS for a bit (NFSv3, as I can’t seem to get NFSv4 to work properly), and even with 2.5" SSDs it runs, but it is painfully slow on the VMs themselves.
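
For what it’s worth, mounting NFS 3 vs. 4.1 datastores from an ESXi shell looks like this (addresses and names hypothetical):

# NFSv3 datastore
esxcli storage nfs add -H 192.168.1.50 -s /mnt/tank/vmstore -v tank-nfs3
# NFSv4.1 datastore
esxcli storage nfs41 add -H 192.168.1.50 -s /mnt/tank/vmstore -v tank-nfs41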

I’m sorry, I’ve only really used VMware Workstation, not their ESXi/vSphere virtualization platform, so I’ll need to defer to the better experts here.