Mitch here from 45Drives - Let's dive in yall

Hey yall,

My name is Mitch Hall and I’m the Chief Architect here at 45Drives. In a video we released recently, I said the pool and tuning configs would be in a doc linked in the description - and unfortunately we never did post that!

We shot the video a few weeks before releasing it, and the doc was unfortunately just something that fell through the cracks.

So, with that being said, I thought I would spend some time making it up to our homelab community :slight_smile:

I had Hutch post the broad strokes of the tunings on the forum in another post earlier today, but for anyone who didn’t see that post, here is what it contained:


Motherboard: X11-SPH-NCTF (Supports IPMI)

Processor: Xeon Bronze 3204 (6 Core & 6 Thread)

Memory: 2x8GB DDR4 RDIMM

Boot Drive: 1TB Kingston M.2 NVMe

Power Supply: Corsair RM750e (750W Modular ATX)

Configuration:

14X Mirrored VDEV Zpool

64K Record Size

noatime, compression=lz4, sync=disabled, ARC max set to 70% of RAM (up 20% from the default of roughly 50%)

10G network RX/TX buffers increased to 4096

CPU on server set to disable C states/P states: tuned-adm profile network-throughput

cpupower governor set to performance (command: cpupower frequency-set --governor performance)

NOTE: The above configuration changes were made purely for performance and are not intended for a day-to-day workload. We would advise against these settings if you are storing important data on your server.
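For anyone who wants to try these at home, here’s a rough sketch of the commands involved. The pool name `tank` and the interface name `eth0` are placeholders - swap in your own pool/dataset and 10G NIC - and this assumes OpenZFS on Linux (the ARC cap lives under `/sys/module/zfs/parameters`):

```shell
# Dataset/pool properties from the list above ("tank" is a placeholder name)
zfs set atime=off tank
zfs set compression=lz4 tank
zfs set recordsize=64K tank
zfs set sync=disabled tank   # read the sync discussion below before using this

# Cap the ARC at 70% of installed RAM (zfs_arc_max is in bytes)
ARC_BYTES=$(( $(free -b | awk '/^Mem:/{print $2}') * 70 / 100 ))
echo "$ARC_BYTES" > /sys/module/zfs/parameters/zfs_arc_max

# Bump the NIC ring buffers to 4096 ("eth0" is a placeholder)
ethtool -G eth0 rx 4096 tx 4096

# CPU/power tuning
tuned-adm profile network-throughput
cpupower frequency-set --governor performance
```

The `zfs set` lines take effect for new writes immediately; the ARC and ring-buffer changes shown here don’t persist across a reboot unless you also put them in a module option or udev/systemd unit.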

Ok, so that gives the broad strokes of everything. Why the note on the end? Well, I don’t know if I fully agree with that note. There’s absolutely nothing wrong with using 95% of those settings out of the box, and you have no worry about incurring data loss or anything like that.

The one setting that requires just a bit more investigation is the “sync=disabled” setting.

This tells your pool/dataset to never use synchronous IO - even when an incoming IO request explicitly asks for a sync write.

What does this mean?

ZFS has two main ways to handle writes, async and sync.

ZFS is a transactional file system with some really neat tricks up its sleeve - one of them being that it turns random writes into sequential writes! How does it do this? With a standard async request, a write IO comes in, gets staged in a transaction group in memory (RAM), and the ACK (acknowledgement) is sent straight back to the client. ZFS keeps doing this until the transaction group fills up, at which point it flushes all of that write IO as one big transactional write - taking what might have been random writes and writing them out sequentially.
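Transaction groups also flush on a timer, not just when they fill. On OpenZFS on Linux that interval is exposed as a module parameter, which also tells you roughly how many seconds of async writes are at risk on a power cut (sketch below assumes Linux OpenZFS):

```shell
# ZFS commits an open transaction group at least every zfs_txg_timeout
# seconds (5 by default on current OpenZFS) - roughly the window of
# ACKed-but-unflushed async writes you could lose on a power cut.
cat /sys/module/zfs/parameters/zfs_txg_timeout

# It can be shortened to narrow that window, at some throughput cost:
echo 2 > /sys/module/zfs/parameters/zfs_txg_timeout
```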

Cool stuff right? Most desktop applications you use even over SMB will use plain-ol’ async writes. This is efficient and makes things run very smoothly.

But the keen eyed among you might already be seeing what could be the downside of this type of write.

What happens if your system loses power before the last async transaction group was flushed? Well, as you might imagine - those writes never get committed! So you now have an application that “thinks” the data it sent was committed - but in fact it was not.

This is not an issue for the vast majority of uses as when power resumes you just pick up where you left off with the state of the application a few seconds before the outage.

But then there is another type of write, called sync writes. These are reserved for applications/use cases that are much stricter about data consistency. The biggest use cases for these types of writes are things like transactional databases, or running virtual machines/containers.

In those cases many times it is necessary to ensure that data is committed to disk safely before acknowledgement is ever sent to the client.
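You can feel the cost of that guarantee yourself with plain `dd` - the `oflag=sync` run forces each block to stable storage before the next one is issued, and will typically report a much lower MB/s than the buffered run (file paths here are just examples):

```shell
# Buffered/async-style: writes land in the page cache and are ACKed fast;
# conv=fsync makes dd flush once at the end so the timing is honest.
dd if=/dev/zero of=/tmp/dd_async.bin bs=1M count=64 conv=fsync

# Sync: each 1 MiB block must reach stable storage before dd continues.
dd if=/dev/zero of=/tmp/dd_sync.bin bs=1M count=64 oflag=sync
```

On a ZFS dataset, the second run is exactly the kind of traffic that hits the ZIL.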

For sync writes, ZFS has something called the ZFS Intent Log, or ZIL. By default, the ZIL resides on the exact same disks your zpool is built on.

Okay, makes sense. But what does it do and why disable it? Great question Mitchell, I’m glad you asked :slight_smile:

So for these synchronous write requests, we know we need to treat them with more care. A sync write request goes into the same transaction groups in RAM that async writes go to - BUT ZFS does not immediately send an ACK back to the client.
The data is simultaneously written to the ZIL, which records the data the zpool intends to write for these sync requests. Once it hits the ZIL, the ACK is immediately sent back to the client - but on the ZFS server, the data still has to be written once more. As I said at the beginning, that data is still sitting in a transaction group in RAM, and it needs to be flushed to the main storage alongside the rest of the data in that transaction group.

This means that when we don’t have a dedicated device to offload the ZIL to, we have in effect produced something very similar to a “double write penalty”.

The same disks that had to do the work of writing the data to the ZIL now have to go back and write it for real this time into the ZPOOL from the transaction group.

At this point, the data in the ZIL can be dropped, as it has done its job.
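That double write is exactly why people add a dedicated SLOG (separate log) device. As a sketch - the device names are placeholders, and ideally you’d use fast SSDs with power-loss protection:

```shell
# Move the ZIL off the data disks onto a dedicated log vdev (SLOG).
# "/dev/nvme1n1" is a placeholder device.
zpool add tank log /dev/nvme1n1

# Or, since the ZIL is the thing protecting your sync writes, mirror it:
# zpool add tank log mirror /dev/nvme1n1 /dev/nvme2n1

# The log vdev shows up under a "logs" section in pool status:
zpool status tank
```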

The ZIL, then, will NEVER be read from except on pool import. What do I mean by that?

The ZIL’s main function is to act as a type of journal, telling the zpool “these are the blocks you’re going to write!” - at which point the ACK is sent to the client, and then the data is actually written to the zpool from the transaction group.

If the ZFS server were to lose power right after sending the ACK back to the client, we now have a bit of an issue. We told the client we safely wrote the data, but really all we did was write down “here are the changes I’m going to make” without actually making them yet.

So, when power resumes and we import the pool, ZFS looks at the last commits, sees that the ZIL still has data in it, and at THIS point it reads from the ZIL and writes out the data that would have been written from the transaction group had the power not died.

Ok - so we’ve explained what it does; now, why would I disable it? Simple answer: for some reason, the Blackmagic disk benchmark - which benchmarks a disk/SMB share and outputs all the resolutions, codecs, and frame rates it thinks your disk/SMB share can support - uses sync write requests, which sullies the numbers. Unfortunately, we ended up cutting that part of the video anyway due to time constraints.

Most SMB workloads will use async write requests anyway, so sync=disabled probably isn’t necessary for most of yall.
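If you do want to flip this knob, note that it’s a per-dataset property and easy to check and revert (the dataset name `tank/share` is a placeholder):

```shell
# Inspect the current setting - "standard" is the default (honor sync requests)
zfs get sync tank/share

# Valid values are standard | always | disabled
zfs set sync=disabled tank/share   # benchmark mode - see the caveats above
zfs set sync=standard tank/share   # back to the safe default
```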

ZFS is especially nice because it is so resistant to corruption, thanks to its atomic operations. If anyone is familiar with the ‘dreaded’ RAID write hole, you know what I mean. Because ZFS is a copy-on-write solution, it does not modify blocks in place. Instead, when editing an existing file, it writes a brand-new block and then, through the metadata layer, unlinks the old block and frees it up to be written by something else - unless a ZFS snapshot is holding it in place.

So, with ZFS async writes, even if you were to begin editing a file and the power shut off, you will never end up with half-written or half-overwritten blocks like is possible with traditional RAID. Atomic operations happen as one single operation - ZFS unlinking the old block and linking the new one happens in a single step. You will never come back from a power loss with ZFS and see corrupted blocks from this.

In traditional RAID, let’s say you have a 128K stripe size across 5 disks and you go to edit an existing file. You send the write request to the RAID array and it begins writing: 3 of the 5 disks overwrite their part of the stripe, but the remaining 2 didn’t get to it before the power loss. This is where we can experience data corruption.

Thankfully, most modern file systems these days have some form of journaling which can help to an extent.

I think at this point I might be rambling however.

So… to summarize… over SMB, if you aren’t running an application that is extremely sensitive to synchronous writes, and you understand what the possibilities are - setting sync=disabled isn’t the end of the world.

However, for the most part you will probably never need to do it, because most of your applications will use asynchronous IO anyways :slight_smile:

As for everything else I set - I think it’s self-explanatory, but if anyone wants me to expand on any of it, or has any other questions, drop them in the thread.

Like I said, this is me making up for not having a doc ready for yall when the video went live!


This was a very informative read and the type of stuff that I love. I love reading / watching / listening to people who are enthusiastic or knowledgeable about their fields. If you have any more random tidbits on anything storage related, I’d love to read them as additional topics. Thanks Mitch!


Mitch,
Were those drives SATA or SAS?

At 200MB/s each does it matter?

Is there a benefit like dual path you might be suspecting?

(Genuine inquiry)

good sir, I have enough random storage related tidbits to fill a book or five :stuck_out_tongue: However, if there is anything Linux, ZFS, Ceph related that is more pointed I would be happy to dive in when I have some time!

They are SATA! 4K-sector Seagate EXOS 16, I do believe.

I fear that I’m in the realm of “I don’t know what I don’t know”. I’d like to think that I have a little bit of knowledge on a fair bit of topics, but I love when I can dive deeper. I am really looking forward to the Plex guide that I’ve heard was coming later. That’s mostly what I use my current set up for. As for right now, I’ll continue to peruse all the topics posted here and glean what I can. :stuck_out_tongue_closed_eyes:

I was just saying to @Vikram-45Drives in another post that I would love to see this as a YouTube video too (get those views up!). What would be really interesting is doing each modification one by one and seeing the result each time - showing which mod gives the biggest enhancement vs. only squeezing out the last 1-2%, or whether each is only 1-2% on its own but adding them all together gets you the full result, etc.

You all are doing great, keep it up - we all really appreciate it.