HL15 – “Paranoid” Build

Hello everyone,

I’m looking to build a “paranoid” backup/data/Plex server, and I believe the HL15 could be an excellent solution. However, I’m not a ZFS expert.

My target performance is to saturate four 10GbE ports aggregated (LAG).

Here’s the layout I’m considering:

  • ZFS Pool: 3 vdevs, each with 5 drives in RAIDZ2 or RAIDZ3 (15 total drives)
  • Special VDEV: 2× Kioxia CM6 6.4 TB in mirror (metadata)
  • SLOG: 2× Intel Optane P5800X 400 GB in mirror
  • RAM: 256 GB–512 GB of RDIMM DDR4 (still deciding)
  • CPU: (Suggestions welcome)
  • Motherboard: (Suggestions welcome)
  • NIC: Intel X710-T quad port
  • HBA: Broadcom 9600-24i (edit)
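
For concreteness, here is a rough sketch of how I imagine the pool being laid out (device names are placeholders, and the RAIDZ2 variant is shown; I’d adjust once the hardware is final):

    zpool create tank \
      raidz2 sda sdb sdc sdd sde \
      raidz2 sdf sdg sdh sdi sdj \
      raidz2 sdk sdl sdm sdn sdo \
      special mirror nvme0n1 nvme1n1 \
      log mirror nvme2n1 nvme3n1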

My question: will this setup be able to fully saturate the 4×10 GbE LAG—not only for large sequential reads/writes, but also for random operations on smaller files (like Git repos, etc.)?

Any motherboard/CPU recommendations or other suggestions would be greatly appreciated!

Edit: What would be the right software solution to synchronize a directory from my computer to this storage machine, and then keep up to N versions of the synchronized files from that directory?

Thank you!

Safe to assume this will be the metadata type of special vdev?

1 Like

Yes, it will be metadata.

Looks like you’ve put some thought into your setup here to minimize tradeoffs, but I don’t think you’re going to achieve network saturation. I recommend reading through this article from iXsystems that walks through each RAIDZ layout and how to get a rough estimate of performance for different scenarios.

ZFS_Storage_Pool_Layout_White_Paper_2020_WEB.pdf

Certainly, reads from the ARC will saturate a 10 Gbps connection just due to DDR4 speeds. The P5800X SLOG will also perform well above 10 Gbps. I’m assuming you’re already aware that having four ports in a LAG configuration won’t buy more speed for any single client. You will have four uncongested network lanes to use, so, in theory, up to four client connections can each achieve full 10 Gbps bandwidth simultaneously.

Regarding your ZFS pool, I can’t say I’d recommend Z3 with a 5-wide vdev. At that point you’ll have more parity than usable space. Only two data drives per vdev also means the most you could expect for streaming reads is 200–250 MB/s per vdev, which is roughly 1.6–2 Gbps. Z2 would help in both regards, but you’re still short of the 10 Gbps you’re looking for on streams. You’re going to need something closer to 8-to-10-wide vdevs to max out a single connection with streaming reads from the hard drives themselves. Having three vdevs does mean you’ll get roughly 3× the IOPS of a single drive (~250 IOPS on average for a spinning HDD).
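
(For comparison under the same rough model: a 5-wide Z2 vdev has three data drives, so roughly 300–375 MB/s of streaming reads per vdev, which is still well short of 10 Gbps on its own.)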

The metadata vdev will help with those random operations and can also be used for storing small files up to a certain size. Just remember you don’t want to run out of space on those drives, so start conservative (a 1K file-size cutoff?) and adjust to something higher once you see how much of your workload moves to the special vdev.
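
As a rough illustration of what that tuning looks like (the pool and dataset names here are just placeholders):

    # Start with a conservative small-file cutoff on the datasets that hold the
    # random-I/O workloads, then raise it once you see how the special vdev fills.
    zfs set special_small_blocks=1K tank/projects

    # Check how much of the special vdev is used before raising the cutoff.
    zpool list -v tank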

2 Likes

Thank you very much for sharing that document—it’s quite interesting, and I see several appealing configurations.

However, I do wonder whether the lower I/O performance typical of a Z3 layout can be mitigated by having a large ARC plus SLOG. Also, what about the streaming write speed of a 5×3 mirror—can that be improved somehow?

Using Z3 with 15 drives appears to be a sensible approach. My understanding is that ZFS collects data in ARC, coalesces it into a larger block, and then writes it as a single stream to the array. This should help with write I/O performance, but for reads, random I/O performance will still be critical.

One specific concern: if I’m reading data while a write stream is in progress, does ZFS handle ARC usage in a way that prevents reads from being delayed until the writes complete? Is ZFS doing arbitration between the two?

Additionally, how many write blocks can the ARC cache, and can the SLOG help somehow? For instance, if a data block is written to the SLOG, does that free up the ARC (by “free” I don’t mean deleting the data from RAM) to buffer more writes while the disks are busy? Meanwhile, how does ZFS schedule reads and writes so that neither starves the other?

If I choose a specific array setup, how should I properly configure the machine (SLOG / ARC / etc.) to compensate for its shortcomings?

So many questions!

EDIT: I misread the network bandwidth requirement of “saturate four 10GbE ports aggregated (LAG)” as implying an expectation of 40 Gbps to a single client. It seems it is understood that this means 10 Gbps to each of four different clients. Sorry. I’ll leave the reply since it is quoted below.

I’m no expert on this, but my comments would be:

  • I assume when you say “drives” you mean single-actuator SATA 6 Gbps (typically 250 MB/s max) spinning rust, not dual-actuator or 12 Gbps (burst) SAS drives?
  • A 3-vdev RAIDZ2 setup is only going to give you the read speed of 9 drives (2 Gbps × 9 = 18 Gbps) and the write speed of just one. A 3-vdev RAIDZ3 setup is only going to give you the read speed of 6 drives (2 Gbps × 6 = 12 Gbps) and the write speed of just one.
  • A SLOG only helps with (sync) writes. I don’t see anything in the setup beyond RAM that acts as a read cache.
  • You will likely need an HBA, not listed. But you might not if you end up with an EPYC build or another CPU with lots of PCIe lanes exposed through MCIO connectors or the like.

Whatever CPU or motherboard you choose, I don’t think that will be the bottleneck. I think you will need to dig into the ZFS setup and evaluate your performance vs. redundancy requirements. You would need an array of something like 20 disks of spinning rust in a RAID0 setup to give you 40 Gbps. You may need some L2ARC as well as a SLOG, although obviously much of the caching is done in RAM (so it depends on the number of users and how long “sustained” means).

As you mention, random operations on smaller files also bring down throughput significantly. For that you would want to be operating against SSDs.

My guess is you’d probably want to look at two boxes, one focused on delivering the performance you’re spec’ing and the other focused on backup, and set up the main box with no or minimal redundancy. RAID0 or a set of mirrored vdevs gives you write performance equal to your read performance, but in a 15-bay chassis even that doesn’t get you to 40 Gbps with SATA-speed spinning rust. Or go with an all-SSD solution if “sustained” means more than one minute (256 GB RAM) or two minutes (512 GB RAM).

The other thing you could look at is a solution that involves NVMe SSDs, such as a 4× NVMe carrier card. A single 5,000 MB/s NVMe drive delivers about 40 Gbps of throughput.

I’m sure you’ll get some other responders here, but if this is part of a project with a budget, 45Drives does offer hourly consulting services that can draw on their experience with many clients to help configure a system that meets your requirements.

1 Like

It certainly can, but only to a certain extent. ARC mostly helps reads, and the SLOG is only a temporary landing spot; the data still has to be written out to the spinning hard drives.

Hopefully you saw David point out that Z3 is applied at the vdev level. I agree that Z3 is a good fit for a single pool with a single vdev comprising all 15 drives. I don’t see it making sense for a single pool with 3 vdevs of 5 drives each, where you’re giving up 3 drives in every vdev. Like I mentioned, at that point you have more parity than usable space.

Some of this is beyond my knowledge, but I’ll point out again that ARC isn’t really a write cache. RAM isn’t a safe place to hold your only copy of the data, so for synchronous writes ZFS, by default, waits until the data is on stable storage before moving on with the next operation, unless the sender says otherwise. That stable storage can be your spinning hard drives or a SLOG. There are options to override this behavior if you want, which will net more performance. Just make sure you’re OK with losing some data if the system crashes or loses power.
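
For reference, the knob for that is the per-dataset sync property; a minimal sketch (dataset names are placeholders):

    # Default: honor the application's sync requests (safe).
    zfs set sync=standard tank/backups

    # Faster but riskier: acknowledge sync writes immediately and rely on RAM
    # until the next transaction group commit; a crash can lose recent writes.
    zfs set sync=disabled tank/scratch

    # Opposite extreme: force every write through the ZIL/SLOG before it is
    # acknowledged.
    zfs set sync=always tank/critical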

I agree with David here on looking at two different pools to handle different workloads if you really want to maximize the performance of each type. I do this in my HL15 with SSDs set up in mirrored vdevs for random workloads and higher IOPS, and then a Z2 pool with a single vdev for streaming read workloads. You can do this in one system, but it could also be two separate builds.
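
If it helps, a rough sketch of that kind of split (device names and widths are placeholders, not exactly my layout):

    # Fast pool: mirrored SSDs for random, high-IOPS workloads.
    zpool create fast mirror nvme0n1 nvme1n1 mirror nvme2n1 nvme3n1

    # Bulk pool: a single RAIDZ2 vdev of spinning drives for streaming reads.
    zpool create bulk raidz2 sda sdb sdc sdd sde sdf sdg sdh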

2 Likes

Isn’t the 250 MB/s speed only good until the drive cache is exhausted? I’ve always thought an average 7200 RPM hard drive’s speed was around 100–125 MB/s. Always open to learning something new or different. :slight_smile:

1 Like

I went high to keep the math simple, but I seem to have misunderstood the network bandwidth requirement regarding an individual connection. It all depends on the specs of specific models. 100–125 MB/s is my real-world working number too, in my home environment where I’m not buying enterprise drives, but throughput has crept up over time. I think 150 MB/s is now taken as an average for higher-end/enterprise drive throughput, and some enterprise drives can do over 200 MB/s. There’s such a range that it’s hard to generalize. I could have been more pessimistic and used 125 MB/s, but then someone would likely call me out the other way and say that’s too low and to buy better drives.

2 Likes

Valid reasoning in my book. :joy:

1 Like

How much data (usable space not including any parity or anything) are we talking about for each use case (base and growth projection)?

  • plex (I assume this is mostly reads, with writes being limited to the initial load and relatively small incremental additions).
  • backups (I assume these are mostly writes, with some sort of incremental and periodic full snapshot process, and only occasional reads to recover from oopses).
  • “like git repos” (I assume this isn’t just archiving, but some sort of software development activity generating these random reads/writes).

It doesn’t sound like you’re trying to do video editing over the LAN.

1 Like

That’s a good point. For some reason I read the intent of the initial post as trying to get 40 gbps over a single connection to a client (although I know what LAG actually means). If the intent was truly that there are four different clients that only require a 10 gbps connection each, that does simplify the ZFS calculus.

Once the data gets into ZFS, ZFS has a snapshotting facility, so that is one way historical versions can be set up, but typically that is time-based (take a snapshot of the storage machine’s backup pool every 5 minutes, or every hour, day, or week), so it might not capture each and every incremental change. The nice thing about ZFS snapshots, though, is that they typically take up very little incremental space unless you’re making huge changes frequently.
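
As a rough sketch of the “keep N versions” idea with plain ZFS snapshots (the dataset name and retention count are placeholders; tools like zfs-auto-snapshot or sanoid automate this sort of thing):

    # Take a timestamped snapshot of the backup dataset (e.g. from cron).
    zfs snapshot tank/backups@$(date +%Y-%m-%d_%H%M)

    # Keep only the newest 30 snapshots of that dataset (GNU head syntax).
    zfs list -H -t snapshot -o name -s creation -d 1 tank/backups \
      | head -n -30 | xargs -r -n 1 zfs destroy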

For synchronizing from client machines to the storage server, one app that is mentioned frequently is Syncthing, although I have never used it. There is also more traditional backup software, but it sounds like you are asking about syncing a limited number of client folders, not backing up whole partitions or machines.
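
If I remember right, Syncthing also has a file versioning option you can enable on the receiving folder (e.g. “Simple File Versioning” keeps up to N old copies of changed files), which lines up with the versioning requirement in your edit above.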

1 Like

Perhaps it needs to be mentioned again that RAID is not a backup. RAID is a way of achieving high availability. If you are truly paranoid about losing data, then you need a proper backup strategy, like 3-2-1. You need the data on a separate machine in case something hardware-related corrupts the pool, and preferably off-site in case something bad physically happens to the server (fire, flood, tornado, a water pipe breaking).

This doesn’t have to be the same level of “paranoia” for all the data. For most people, losing their Plex library, for example, isn’t as devastating long-term as losing family photos, important personal documents, or business-related data if they are self-employed. You can segregate types of data into different tiers for both the level of high availability and the number and location of backups. Perhaps you have RAIDZ2 for the Plex library for HA but are willing to assume the risk of losing it in a disaster and don’t keep a separate backup, while for the other stuff you do want it available after a disaster, so you really need a second copy on a second box, and ideally off-site.

In my case I compromise and am comfortable with the most important stuff on an external drive in a fireproof/waterproof safe in a location in my home away from the main server, which I update every month. Not ideal, but I don’t really want to go to a safe deposit box, have a buddy to replicate to, or pay the egress fees of cloud backup services. I’m a bit iffy about putting personal data “in the cloud” at a third party, even if it’s encrypted on the client before sending, although I should probably get over that.

1 Like

Many thanks for the advice and for taking the time to share it. Unfortunately, adding a second box isn’t possible right now.

This is not a sponsored project; I’m building it with my own funds for both personal and work use. I’m the sort of person who monitors hardware and storage solutions for a long time, then acquires them once I find a deal that’s overwhelmingly in my favor. Over the years, I’ve collected several high-end SSDs and Optane drives at very good prices.

Server Purposes:

  • Plex.
  • Storing personal photos, documents, etc.
  • Acting as a Git server.
  • Serving as a backup for my workstation (I use some 15 TB Kioxia drives in RAID 1 for work).
  • Backing up phone photos and data.
  • Providing shared storage for my workstation, laptops, TV, etc.
  • Storing test traces from RV-Cores simulations.
  • Storing layouts / synthesized netlists.
  • Housing additional data generated from my work.

I’m calling this build “paranoid” because a second backup server isn’t something I can manage right now. Even this one is more than I’d like, since it will generate noise in my room—and I really need a quiet environment to focus.

Therefore, reliability is the most important feature. I need to ensure that I don’t lose data. I’m willing to sacrifice some storage capacity by using three Z3 vdevs. I’d rather purchase larger drives to offset capacity than compromise on reliability.

I also plan to use LAG because I don’t need more than 10GbE per client. As for overall storage, I don’t need a huge amount. I’d be fine with around 72 TB for a few years. Currently, I generate roughly 1–3 TB of data every 2–3 months, so if I go with 12 TB drives in three Z3 vdevs, that should cover my needs for quite some time.
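
(Rough check on that: with 5-wide Z3 each vdev has only two data drives, so 3 vdevs × 2 data drives × 12 TB is about 72 TB of usable space before ZFS overhead and free-space headroom.)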

I don’t need the 4x 10 GbE LAG to be at full capacity all the time. The workflow involves large bursts of data—ranging from a few tens of gigabytes up to around 200 GB (almost never)—after which activity subsides for several minutes.

Most of the time, it is many, many small random files.

Main Concern: Accelerating the Pool

L2ARC (Read Cache): I’ve watched multiple videos suggesting it’s better to add more RAM than to rely on an L2ARC. I’d love to see some test data on this—if anyone can share, that would be very helpful. My understanding is that data stays in the ARC until there’s no more available ARC space, at which point it’s replaced by new read data. (ARC is fast)
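
One thing I plan to do once it’s running is watch the ARC itself rather than rely only on videos; as far as I know, OpenZFS ships a couple of tools for this:

    arc_summary          # detailed ARC size, hit ratio, and MRU/MFU breakdown
    arcstat 5            # ARC hits/misses sampled every 5 seconds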

Write Performance: I’m more concerned about write performance. Regarding a ZIL, my thought was that a fast ZIL device would improve write performance because once data is written to the ZIL, it’s considered safe, freeing ZFS to handle subsequent writes in RAM. I’m not entirely sure how to resolve any bottlenecks here; I don’t 100% understand how writes work.

Drive Choice: I plan on consulting Backblaze’s AFR stats and picking drives with power-loss protection (PLP), a low failure rate, a SATA interface, a single actuator, and maybe non-helium if possible.

Thanks again for your time. If anyone can share test data on different setups or point me to resources showing how to calculate expected write performance, I would really appreciate it.

1 Like

Thanks for the clarifications. It helps with understanding the whole picture.

The minimum recommended number of drives for RAIDZ3 is eight. As Ryan pointed out, there is no reason to have a vdev with more parity than actual data. There are two possibilities if you don’t want to drop down to Z2:

  • Add a 16th drive, using a bracket such as this: 2.5" to 3.5" Hard Drive Tray Holder for PCI SSD HDD Metal Mounting Bracket Adapter, and create two 8-drive Z3 vdevs. This would give you 10× the read speed of a single drive and, I think, 2× the write speed of a single drive directly to the array. You could survive a failure of 3 out of the 8 drives in each vdev.
  • Use mirrors instead of RAIDZ to increase write performance. This would be a RAID 10 type setup, but with three disks per mirror. In this case you would have five vdevs with three drives in each vdev, each drive holding the same copy of the data (see the sketch after this list). This would give you the write performance of 5× a single drive directly to the array (and at least that for reads, since mirrors can read from any copy), and across the whole array up to ten disks could fail with no data loss. But mirrors are a bit more dependent on which disks fail: if, as those ten drives fail, three of them end up in the same vdev, then you lose the pool. Is this dicier than the Z3 option, where four failures landing in a single vdev could lose the pool? I doubt it.
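
For the mirror option, a rough sketch of what creating that pool would look like (device names are placeholders):

    zpool create tank \
      mirror sda sdb sdc \
      mirror sdd sde sdf \
      mirror sdg sdh sdi \
      mirror sdj sdk sdl \
      mirror sdm sdn sdo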

With an appropriately sized SLOG I think you should easily be able to keep up with that workload when it saturates all four 10 Gbps connections. 400 GB is probably way overkill, since each connection is only going to be delivering data at 1.25 GB/s. Even with all connections saturated, that’s only 5 GB/s, and I believe the SLOG is flushed every five seconds, so I imagine you wouldn’t need more than about 25 GB. More doesn’t hurt, but the remaining space might be better used for something else.

So what you’re saying is you are paranoid about data loss, but that data is only worth $1,000 to $2,000 to you and your employer. Hm. Most people who post here about doing employer work at home say that the data is somewhat transient: if there were a disaster at the house, the impact is only the time to recapture the inputs from backups at the employer, not the loss of the sole copy of the data. If you are holding the sole copy of data on this system for your employer, they should be helping pay for it.

1 Like

Thank you very much for your reply!

My employer maintains his own copy of the data, and I also keep a local copy to manage certain agreed-upon workloads. On my employer’s side, I do know that the data is maintained by experienced professionals — of which I am certainly NOT one.

Perhaps I’m being too optimistic, but I’m hoping no lightning strike will hit my apartment. In case of a flood, I’m not on the ground floor, and the server is on a shelf raised about 15–20 cm from the floor (in case of apartment flood). I also rarely leave the house unattended for more than 12 hours; if I’m not around, either my wife is here or my brother comes by to check on my cat.

Regarding a power surge, I’m using a UPS, and I’m hoping it will suffice.

What would you suggest for the SLOG, then? I’ve found Optane drives to be excellent, and I do have other potential uses for them.

Three vdevs of Z2, or five mirror vdevs of three drives each, looks like the way to go.

I really appreciate your feedback.

Unfortunately, most of the theory and testing tends to be for read performance and for lower levels of redundancy. There also seems to be a fair amount of conflicting information floating around, caused, I think, not just by misunderstanding, but also by improvements that have been made to ZFS over the years and by people assuming RAIDZ performs like traditional RAID.

I’ve been able to dig up two articles comparing test data on different setups, but one is from 2014 so I hesitate to link it here. The more recent one is here: https://arstechnica.com/gadgets/2020/05/zfs-versus-raid-eight-ironwolf-disks-two-filesystems-one-winner/ and it was linked in this thread: https://www.reddit.com/r/zfs/comments/z5myld/zfs_theoretical_readwrite_speeds/. Unfortunately, that article’s direct comparisons are of RAIDZn to the equivalent traditional RAID levels, so you have to look across multiple charts to compare the performance of RAIDZn to a striped mirror. One thing it does show is that RAIDZn writes scale in some ways with the number of drives in a vdev and the number of vdevs in the pool, to an extent, but not as fast as reads scale, nor in the same way. I don’t think I would trust the write performance returned by the RAIDZ calculator I linked above, as for almost all configurations except mirrors it claims no write improvement, and I think that’s incorrect.

I haven’t benchmarked or tuned my two RAIDZ2 systems, as I don’t stress them much, but anecdotally the author’s graphs and conclusions seem reasonable to me and follow conventional wisdom. I think with your workload you’ll want the “5 vdevs mirror of 3 each”.

1 Like

Hey Cpsv,

I am kind of jumping in late here, and it seems that everyone pretty much has you sorted out. If you are still confused about the way SLOGs work, here is a good article on sync vs. async writes.

If you really want the absolute best performance, you could turn off sync writes as explained in the article. That said, the ability to bounce back after any kind of misfortune, like a power loss or other unforeseen issue, is quite nice.
I would also like to reiterate, as you seem to have already adopted, that using RAIDZ2 in your configuration would be best. ZFS write performance also slows down considerably the fuller the pool gets, so through Houston we set aside 10% of the pool right out of the gate that cannot be used, which should also be factored into your total space.
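
If you’re setting this up by hand rather than through Houston, one common way to hold that headroom back is an empty dataset with a reservation (the names and size here are placeholders):

    # Reserve roughly 10% of the pool in a dataset nobody writes to, so the
    # pool can never be filled past about 90%.
    zfs create tank/reserved
    zfs set refreservation=10T tank/reserved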
Here is a video of ours explaining it

and a video on read and write caching as well

I hope this helped a little though it seems DigitalGarden and Rymandle05 had a lot of this covered!
If you have any other questions please let me know!

3 Likes