Tradeoff for installing DIMM 7 & 8 in full build?

Was reading the motherboard manual and saw:

“The blue slots must be populated first.
Only populate DIMMA2 and DIMMD2 if the extra memory support is needed.”

What’s the tradeoff for populating DIMMA2 and DIMMD2 vs not? Was originally thinking of populating them for additional ZFS ARC (that I probably don’t actually need), but if there is a perf hit I might reconsider.

The performance hit depends a bit on your workload, which CPU you chose, and what speed DIMMs 45HL provides with the full build (2933 MT/s DIMMs or something slower). I don’t have a full build.

Based on the table and Note 1 on page 34, the performance impact is probably zero unless you splurged for the 6230R upgrade. Without going into a whole explanation of slots and channels: when you populate A2/D2 you cause the memory controller to switch from 1 DPC (DIMM Per Channel) to 2 DPC mode on those channels. A second DIMM on a channel adds RAM capacity, but lowers the maximum transfer speed the controller will run that channel at. The Xeon/LGA3647 platform used by the X11SPH provides up to six memory channels per CPU, so to offer 8 DIMM slots, two of those channels get a second slot, hence A2 (paired with A1) and D2 (paired with D1).

If you were/are running a 62xx CPU or higher, then the maximum memory transfer speed supported by the CPU would drop from 2933 to 2666. This also depends on the speed rating of the RAM sticks: if they are only rated for 2133, for example, then all the RAM will only operate at 2133. The speed limit applies across all the RAM connected to a CPU, not per slot or per channel. Since, per Note 1, the 32xx and 42xx can’t run the memory at 2933 anyway, and that is more likely the CPU you have, adding RAM to A2/D2 shouldn’t slow anything down, assuming you use 2666 MT/s RAM.
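For a rough sense of scale, here is the theoretical ceiling that speed difference works out to across six channels. This is just back-of-envelope math; real throughput will be lower and workload dependent.

```python
# Theoretical peak bandwidth: transfers/s * channels * 8 bytes per 64-bit DDR4 transfer.
def peak_gb_per_s(mt_per_s, channels=6, bytes_per_transfer=8):
    return mt_per_s * 1e6 * channels * bytes_per_transfer / 1e9

print(f"2933 MT/s: {peak_gb_per_s(2933):.0f} GB/s")  # ~141 GB/s ceiling at 1 DPC on a 62xx
print(f"2666 MT/s: {peak_gb_per_s(2666):.0f} GB/s")  # ~128 GB/s ceiling at 2 DPC
```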

Also, even if you did have a 62xx and the memory speed dropped from 2933 to 2666, it probably wouldn’t be that noticeable for ARC, given its role as a cache. I think memory speed matters most for heavy compute tasks, like CPU-heavy (non-GPU) games, AI, or creator workloads.

Thanks for the excellent explanation! It certainly makes sense that multiplexing channels to incorporate DIMM 7+8 would reduce performance.

My WFH CI pipeline has a working set of ~3.5 TB in mostly <10 MB files. It also does occasional scans to ingest a larger historical set (>20 TB). This is what’s got me thinking about memory tradeoffs. I know that ZFS has protection against polluting the ARC with streaming scans, but I’m not sure how it does with more stochastic scans. This seems like it /might/ be a good use case for a large L2ARC (4 TB against 384G RAM), but L2ARC population speed seems like it could be the bottleneck.

Going from 384G to 512G is probably the better idea. Maybe I can talk the boss man into letting me use 128G DIMMs :slight_smile:

Anyhow, thanks much for the explanation!

I’m not sure if this article helps any:

Specifically, it mentions the arcstats file, which might give you some insight into cache hits and misses, although I’m not sure whether that is lifetime, since last power-on, etc., or whether you can reset it to monitor activity during a specific workload. I guess you could make a before copy, run your CI pipeline, and do a before/after comparison. If you’re running TrueNAS Scale, you can see this info in the Netdata reporting. I’m not sure if Netdata is also bundled with Houston or Proxmox, but I think it is installable separately if not.
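If it helps, here’s a minimal sketch of that before/after idea. The counters in arcstats are cumulative since the ZFS module was loaded, so snapshotting them around a run isolates that workload (path assumes OpenZFS on Linux; should work the same on Proxmox).

```python
# Snapshot ARC hit/miss counters before and after a workload and diff them.
def read_arcstats(path="/proc/spl/kstat/zfs/arcstats"):
    stats = {}
    with open(path) as f:
        for line in f.readlines()[2:]:        # skip the two kstat header lines
            name, _type, value = line.split()
            stats[name] = int(value)
    return stats

before = read_arcstats()
input("Run the CI pipeline, then press Enter... ")
after = read_arcstats()

hits = after["hits"] - before["hits"]
misses = after["misses"] - before["misses"]
print(f"ARC hit rate during the run: {hits / ((hits + misses) or 1):.1%}")
```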

I don’t know much about CI (Continuous Integration?) pipelines, but if it is like compiling large software projects I think you are right to want the RAM speed to be as high as possible. Do you know how multithreaded the process(es) is/are? Can it scale to more than the 6C/12T or whatever you have currently?

The first thing I would do is confirm in the BIOS what speed your current RAM is running at. Then, things I might look at besides more RAM and L2ARC would be upgrading the CPU to one of the 62xx SKUs, giving you both more threads and access to the faster RAM, and also your ZFS pool layout. You may want to do the pipeline processing against a RAID 0 area and then move the results to a redundant pool when complete. If the process is highly stochastic and getting a lot of cache misses, making the cache bigger may not help much.
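On the first point, you don’t even need to reboot into the BIOS to check; dmidecode reports both the rated and the configured speed per DIMM. A quick sketch (assumes Linux with dmidecode installed, run as root):

```python
# Print the speed-related lines for every populated DIMM.
import subprocess

out = subprocess.run(["dmidecode", "--type", "memory"],
                     capture_output=True, text=True, check=True).stdout
for line in out.splitlines():
    line = line.strip()
    if line.startswith(("Speed:", "Configured Memory Speed:", "Configured Clock Speed:")):
        print(line)
```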

Passmark.com isn’t the definitive place for pricing, but looking over all the 62xx chips available for the LGA3647 socket;

https://www.cpubenchmark.net/socketType.html#i23

it looks like the 6248 or 6230 might be a good upgrade that would give you access to faster memory and more threads, depending on your tolerance for some increased heat and fan noise.

| CPU | CPU Mark | Price (USD) | TDP |
| --- | --- | --- | --- |
| Intel Xeon Gold 6258R @ 2.70GHz | 40252 | $2,578.00 | 205W |
| Intel Xeon Gold 6238R @ 2.20GHz | 37511 | $1,346.00 | 165W |
| Intel Xeon Gold 6248R @ 3.00GHz | 36186 | $1,633.00 | 205W |
| Intel Xeon Gold 6268CL @ 2.80GHz | 34558 | NA | 205W |
| Intel Xeon Gold 6242R @ 3.10GHz | 34282 | $1,964.35 | 205W |
| Intel Xeon Gold 6230R @ 2.10GHz | 33724 | $1,047.00 | 150W |
| Intel Xeon Gold 6240R @ 2.40GHz | 33353 | $1,550.00 | 165W |
| Intel Xeon Gold 6278C @ 2.60GHz | 32835 | NA | 185W |
| Intel Xeon Gold 6254 @ 3.10GHz | 31513 | $1,498.95 | 200W |
| Intel Xeon Gold 6246R @ 3.40GHz | 30468 | $1,564.00 | 205W |
| Intel Xeon Gold 6248 @ 2.50GHz | 29831 | $489.00 | 150W |
| Intel Xeon Gold 6210U @ 2.50GHz | 28915 | $1,167.44 | 150W |
| Intel Xeon Gold 6238 @ 2.10GHz | 28668 | $739.89 | 140W |
| Intel Xeon Gold 6253CL @ 3.10GHz | 28549 | NA | 205W |
| Intel Xeon Gold 6240 @ 2.60GHz | 28369 | $655.15 | 150W |
| Intel Xeon Gold 6212U @ 2.40GHz | 27470 | $3,902.62* | 165W |
| Intel Xeon Gold 6252 @ 2.10GHz | 27148 | $1,006.40 | 150W |
| Intel Xeon Gold 6208U @ 2.90GHz | 27085 | $950.00 | 150W |
| Intel Xeon Gold 6230 @ 2.10GHz | 27069 | $834.00 | 125W |
| Intel Xeon Gold 6226R @ 2.90GHz | 26297 | $943.00 | 150W |
| Intel Xeon Gold 6242 @ 2.80GHz | 26288 | $783.02 | 150W |
| Intel Xeon Gold 6246 @ 3.30GHz | 24829 | $3,113.22* | 165W |
| Intel Xeon Gold 6222V @ 1.80GHz | 24764 | $1,710.89* | 115W |
| Intel Xeon Gold 6250 @ 3.90GHz | 20915 | $2,530.89 | 185W |
| Intel Xeon Gold 6226 @ 2.70GHz | 20429 | $885.86 | 125W |
| Intel Xeon Gold 6244 @ 3.60GHz | 18980 | $2,970.00* | 150W |
| Intel Xeon Gold 6234 @ 3.30GHz | 17752 | $1,082.84 | 130W |

Thanks for the information!

I’d read that article previously, and it’s definitely a good walkthrough of ARC profiling. Looks like Proxmox does have arcstats as well. I’ve also been reading the OpenZFS source, which is amazingly well documented, yet has 10k+ line source files :wink:

if you’re running TrueNAS Scale, you can see this info in the Netdata reporting. I’m not sure if Netdata is also bundled with Houston or Proxmox

Yeah … I’m such a homelab newb that this is the first I’ve heard of Netdata :). I still need to learn more about monitoring.

Do you know how multithreaded the process(es) is/are? Can it scale to more than the 6C/12T or whatever you have currently?

So I’ve got the HL-15 with a Xeon Silver 4216 on order, and it ended up with 384G of RAM.

For reference, the storage pool looks like the following (comments welcome!); there’s a rough zpool create sketch of it after the list.
(this assumes I can actually run 4-, 2- and 1-drive NVMe ‘passthrough’ adapters in the x16, x8 and x4 slots respectively)

  • Main Storage: 7x RAID 1 (mirror) VDEVs (SATA III spinners), 1 hot spare
  • ‘Special’ VDEV: 3x mirrored 4TB NVMes
  • SLOG: 2x mirrored 1TB NVMes
  • L2ARC: 2x 4TB NVMes, striped (‘sorta’ RAID 0)
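Roughly, that works out to a zpool create along these lines. The device names are hypothetical placeholders, and this just prints the command so I can sanity-check the topology rather than running anything:

```python
# Hypothetical /dev/disk/by-id names; prints the zpool create for review rather than running it.
spinners = [f"/dev/disk/by-id/ata-HDD_{i}" for i in range(15)]   # 15 SATA bays in the HL-15
nvmes    = [f"/dev/disk/by-id/nvme-SSD_{i}" for i in range(7)]   # 4 + 2 + 1 across the adapters

cmd = ["zpool", "create", "tank"]
for a, b in zip(spinners[0:14:2], spinners[1:14:2]):
    cmd += ["mirror", a, b]                      # 7x two-way mirror data vdevs
cmd += ["spare", spinners[14]]                   # 1 hot spare
cmd += ["special", "mirror", *nvmes[0:3]]        # 3-way mirrored special (metadata) vdev
cmd += ["log", "mirror", *nvmes[3:5]]            # mirrored SLOG
cmd += ["cache", *nvmes[5:7]]                    # L2ARC; cache devices always stripe
print(" ".join(cmd))
```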

You may want to do the pipeline processing against a RAID 0 area and then move the results to a redundant pool when complete. If the process is highly stochastic and getting a lot of cache misses, making the cache bigger may not help much.

“redundant pool” is the key here; it would have to be basically the same size as the main pool, which I don’t have the budget for :slight_smile:

Putting some thought into it, I think it may work with some tuning. As I mentioned, there are two things going on:

  1. Back-to-back ‘compiles’, basically forever.
  2. Periodic stochastic scans over a proprietary and essentially unbounded time-series DB. This needs to run about once an hour for about 40 mins: 90% random reads, 10% sequential writes. The dataset is essentially a log, so it’s mostly static except the tip.

So if #1 were the only thing running on the machine, I think it’d be ok, assuming the working set fits into L2ARC (which is sized at ~2x the expected working set).

As far as I can tell, #2 is always going to be missing the cache and dispatching random reads no matter what. An all-flash array would probably be ideal here, but it’s not in the budget :slight_smile: The idea behind the Special VDEV was to try to soak up some of the IOPS from the large number of small metadata source files.

The problem with #2 is that I am worried it may basically flood / thrash the cache with transient blocks.

However, I am thinking that setting l2arc_mfuonly=1 might protect the L2ARC, given the relative infrequency of #2.
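A sketch of how I’d flip that at runtime, assuming OpenZFS 2.x on Linux (Proxmox). It doesn’t persist across reboots; for that it would go in /etc/modprobe.d/zfs.conf as “options zfs l2arc_mfuonly=1”.

```python
# Check and set the l2arc_mfuonly module parameter (needs root).
PARAM = "/sys/module/zfs/parameters/l2arc_mfuonly"

with open(PARAM) as f:
    print("current:", f.read().strip())

with open(PARAM, "w") as f:
    f.write("1")      # only feed the L2ARC from the MFU list, so one-off scan blocks stay out
```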

… On a related note, I think I’ve been set up. My HL-15 req was approved last week. I just checked, and my next sprint seems to have a lot of tasks related to optimizing #2 :wink:

Thanks again for the advice

Sounds like fun. I’d get it running with what you’ve ordered, install Netdata on Proxmox and quantify if and where the bottleneck(s) are. With the Xeon 4216 I don’t think populating slots 7/8 will slow your RAM down, but you might find it beneficial to resell the 4216 and get a 6230 or something if your RAM supports 2933 MT/s. 384G of 10% faster RAM might be better than 512G at the slower speed.

The RAID 10 / striped-mirrors layout should perform well for your pipeline, but as people say, remember RAID is not a backup. Lose two disks in the same mirror and the pool is gone. I assume the company has other 3-2-1 backup strategies, but I always feel compelled to give this PSA when people say they don’t have budget for X.

you might find it beneficial to resell the 4216 and get a 6230

Def on my radar now, thanks for pointing that out.

I assume the company has other 3-2-1 backup strategies, but I always feel compelled to give this PSA when people say they don’t have budget for X.

This machine is sort of a mega thick client :wink: All of the data on it is 100% expendable. Worst case, I lose uncommitted source a few hours old, and I have to drive 3 hours to the office to make a new baseline sync.

Just an update on the perf tuning progress.

Finally got the unit in and installed the additional parts.

Originally I was seeing about 2G/s reads and 1G/s writes on my CI workload, with about 500K/s of SLOG writes, and the write bursts in general seemed to be ‘colliding’ with the constant read seeking on the spinners.

Looking into it, I found a few things to tune:

  • Turns out that Proxmox by default clamps the ARC at 16G, which was causing a lot of read-cache thrashing. I cranked the max up to 300G.

  • Also discovered that containers perform much better with the ‘special’ VDEV. I guess this makes sense, since Proxmox implements VMs as zvols, which hides some of the higher-level metadata from the special VDEV. I suspect it was also ‘double caching’ the data. I managed to convert a few of my larger VMs into containers.

  • The synchronous write load was concerning. After some investigation I discovered that about 80% of it came from a build process generating target files, and the remaining 20% was DB modifications. I split the two processes into separate containers and datasets, set sync=disabled on the build container’s dataset, and left sync writes on for the DB container.

  • I saw that the L2ARC was slightly lagging the read stream, so I doubled its population rate limit (see the tunables sketch after this list).
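For reference, the runtime knobs behind the first and last items above look roughly like this. The values are just examples of what I used, not recommendations, the paths assume OpenZFS on Linux, and persistent settings belong in /etc/modprobe.d/zfs.conf.

```python
# Write ZFS module parameters via sysfs (needs root).
TUNABLES = {
    "zfs_arc_max": str(300 * 1024**3),      # cap ARC at ~300 GiB instead of Proxmox's 16G clamp
    "l2arc_write_max": str(16 * 1024**2),   # L2ARC feed rate per interval; default is 8 MiB
}

for name, value in TUNABLES.items():
    with open(f"/sys/module/zfs/parameters/{name}", "w") as f:
        f.write(value)
    print(f"{name} -> {value}")
```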

After the above I started seeing 4G/s sustained reads during the builds, with 2.5G/s sustained writes and an 82% hit rate on the L2ARC (this is on a ~20TB working set).

A few things I learned along the way:

  • There is a ton of stale and incorrect ZFS information out there. Some of the things people repeat in forums, etc. don’t square with what’s actually in the code. For example:
  • Large L2ARC caches are now relatively cheap memory-wise (rough math after this list).
  • PLP drives are not required for SLOG devices to prevent data corruption; they can give a performance benefit, however. There are also some interesting code-only flags for the SLOG.
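Rough math on the first point. The per-record header size is an assumption (it varies by OpenZFS version), as is the 128 KiB average record size:

```python
# Each block cached on L2ARC keeps a small header in ARC (RAM); estimate that overhead.
l2arc_bytes  = 4 * 1024**4       # 4 TiB of L2ARC
record_size  = 128 * 1024        # assumed average cached record size
header_bytes = 96                # assumed per-record header kept in RAM

records = l2arc_bytes / record_size
print(f"{records / 1e6:.0f}M records -> {records * header_bytes / 1024**3:.1f} GiB of ARC overhead")
# ~34M records -> ~3 GiB, i.e. small next to 384G of RAM
```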

At the moment I’m still bottlenecking on the SLOG. I’ve got a small Mongo container cluster with change monitoring that causes a synchronous write storm and maxes out the IOPS on the SLOG NVMe. It also stretches the periodic transaction group flush (the “write back” cycle) long enough to interfere with the regular read-seek load.

I’m not exactly sure what to do here; maybe there is some tuning I can do. I was also thinking something like 45Drives’ “auto tier” might help to concentrate writes onto an all-NVMe “tier”. I’m also curious about NVDIMMs as SLOG and whether the motherboard supports them.
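One thing I might try first is quantifying the sync pressure while the Mongo change streams are running, by diffing the ZIL kstats over a window. The path assumes OpenZFS on Linux, and the exact counter names vary a bit between versions, so this just prints whatever moved:

```python
# Diff the ZIL kstat counters over a 10 second window to gauge sync write pressure.
import time

def read_kstat(path="/proc/spl/kstat/zfs/zil"):
    stats = {}
    with open(path) as f:
        for line in f.readlines()[2:]:        # skip the two kstat header lines
            name, _type, value = line.split()
            stats[name] = int(value)
    return stats

before = read_kstat()
time.sleep(10)
after = read_kstat()
for name in sorted(after):
    delta = after[name] - before[name]
    if delta:
        print(f"{name}: +{delta}")    # e.g. a fast-climbing zil_commit_count means heavy sync traffic
```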

In any case, it’s a fun learning experience. Of course the real fix is to fix the MongoDB queries, but unfortunately BI is constantly generating naive new queries on an ongoing basis, so it would be nice to figure out how to get the filesystem to absorb that.

Sounds like you can probably get it worked out. I think 45Drives offers an hourly-rate technical consulting service. If you are still having bottlenecks, they may be able to provide help or ideas.