HL15 2.0 10GbE Performance Testing & Optimization

Finally received my HL15 2.0 and I'm looking for some tips & tricks to help optimize 10GbE networking.

TL;DR: I get about 9.45 Gbits/sec from the HL15 to the Mac, but only about half that (4.39 Gbits/sec) going from the Mac to the HL15, and I'm looking to optimize the setup.

This is my setup flow:

Mac Studio M2 Ultra —> TRENDnet 6-Port 10G Switch —> USW Pro XG 24 PoE —> HL15 2.0

Here are the HL15 2.0 Specs:

Motherboard:	ASRock ROMED8-2T
CPU:  			AMD EPYC 7452 
Memory: 		256GB
Boot Drive:		Kingston NV2 1TB M.2
Add-on Card:	Supermicro AOC-SLG3-2M2-O (Qty 2)
NVMe Pool:		Samsung 990 Pro 4TB NVMe (Qty 4 - ZFS RAIDZ1)
HBA Card:		LSI 9400-16i
HDD Pool:		Seagate Exos 24TB (Qty 5 - ZFS RAIDZ2)
GPU:			Nvidia GeForce RTX 4070 Super

Mac Studio Specs:

CPU:			M2 Ultra
Memory:			64GB
HDD:			2TB SSD

Test Environment:

Hypervisor:		Proxmox 8.4.5

Windows VM 100:

OS:			Windows 11
CPU:			12 Cores
RAM:			32GB
Location:		ZFS RAIDZ1 NVMe Pool
Drivers:		VirtIO Drivers Installed

TrueNAS VM 101:

OS:			TrueNAS Community 25.04.2
Host Location:	ZFS RAIDZ1 NVMe Pool (Host OS)
CPU:			12 Cores
RAM:			32GB
Passed Through:	LSI 9400-16i
Data vDev:		Seagate Exos 24TB (Qty 5 - ZFS RAIDZ2)

AmorphousDiskMark (Mac to VMs)

Test Specs:		4GiB File - 5 Times Average

Windows VM Hosted on NVMe RAIDZ1 Pool
QD8 Sequential Read:		8,944 Mbps
QD8 Sequential Write:		720 Mbps
QD1 Sequential Read:		992 Mbps
QD1 Sequential Write:		719 Mbps

TrueNAS VM Data vDev on HDD ZFS RAIDZ2 Pool
QD8 Sequential Read:		9,400 Mbps
QD8 Sequential Write:		2,072 Mbps
QD1 Sequential Read:		3,656 Mbps
QD1 Sequential Write:		2,200 Mbps

iPerf3 30-second TCP test with 4 parallel streams

Windows VM —> Mac Ultra = 9.45 Gbits/sec
Mac Ultra —> Windows VM = 4.39 Gbits/sec
TrueNAS VM —> Mac Ultra = 9.41 Gbits/sec
Mac Ultra —> TrueNAS VM = 5.47 Gbits/sec

Notes:

  • I've tested with Jumbo Frames turned both on and off and noticed no change in the results (see the MTU check sketch after this list).
  • Tomorrow I'm going to repeat the test with the Mac Ultra connected directly to the switch, just to rule out any issues with the cables or patch panel. Since I do see 9.45 Gbits/sec in some tests, I don't believe it's a cable issue, but I'll test directly to verify.
  • I've also increased the CPU count to 24 and the memory to 64GB on the Windows VM with no change in performance; I still get about 9.4 Gbits/sec from the VM to the Mac and around 4.4 Gbits/sec from the Mac to the VM.
  • On the Mac, I closed all other apps to ensure maximum performance, although the M2 Ultra should be more than enough to handle the task.
  • I tried iPerf3 with 1 to 10 parallel streams and saw no noticeable difference, although with fewer than 4 streams I wasn't able to fully saturate the line.
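
For reference, one way to confirm whether jumbo frames actually make it end to end is a don't-fragment ping at the jumbo payload size. A rough sketch with placeholder addresses (flag names differ between the macOS and Linux ping):

# From the Mac: -D sets the don't-fragment bit, -s is the ICMP payload size
# 8972 bytes of payload + 28 bytes of headers = a 9000-byte packet
ping -D -s 8972 -c 5 <hl15-ip>

# From the Proxmox host or a Linux guest (Linux ping syntax)
ping -M do -s 8972 -c 5 <mac-ip>

If these fail with "message too long" while a normal ping works, some hop in the path is still at MTU 1500.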

Of course, 4.4 Gbps from my Mac to the HL15 is plenty for 99% of scenarios, so it's not something I'm going to lose sleep over. But I spent more than I want to admit to the wife, and I'd like to get an optimal setup and configuration out of this. I just can't figure out why sending from the Mac into the VMs is so much slower, roughly half the speed of pulling from the VMs into the Mac.

Any advice, tips, or tricks to try? Or is this just some kind of limitation within macOS that I need to accept?

What version of iperf3 is the Mac running?
What version of iperf3 is the Windows VM running?
What version of iperf3 is the TrueNAS VM running?

How is networking set up for the VMs? Is it bridged, or are you assigning NICs to specific VMs? The test I would start with, if possible, is iperf3 running on the Proxmox host itself rather than in one of the VMs, perhaps even temporarily reconfiguring the networking for testing even if it's not the final desired config.
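
For reference, a minimal sketch of the host-side test (assuming iperf3 isn't already on the Proxmox host, which is Debian underneath):

# On the Proxmox host
apt update && apt install -y iperf3
iperf3 -s        # listens on the default port 5201

# Then point the Mac's iperf3 client at the host's IP instead of a VM's IP.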

Are you setting up the HL15 as the server and the Mac as the client, and then using -R or --bidir on the Mac to determine the HL15-to-Mac transfer rate? Or are you switching which machine you start as the server and which as the client?
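
To illustrate the difference, with iperf3 -s left running on the HL15 side, both directions can be measured from the Mac; a sketch with a placeholder address:

# Mac -> HL15 (client sends)
iperf3 -c <hl15-ip> -t 30 -P 4

# HL15 -> Mac (-R reverses the direction so the server sends)
iperf3 -c <hl15-ip> -t 30 -P 4 -R

# The alternative is swapping roles: iperf3 -s on the Mac, then iperf3 -c <mac-ip> from the HL15.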

Thanks for taking the time to respond.

I can't believe I didn't test iPerf3 on the actual Proxmox host. Interesting results: I get 9.4 Gbits/sec in BOTH directions when testing between the Mac Ultra and the Proxmox host. I tested both directions with each host taking turns being the client and the server, and got a solid 9.4 Gbits/sec on every test in every direction.

So the problem is a slowdown when Proxmox forwards traffic from the host into the actual VM, costing almost 50% of the bandwidth, right?

My Windows & TrueNAS VMs simply use vmbr0, which is slaved to eno1np0, the default setup when I installed Proxmox 8.4.5. I've made no tweaks to the networking after the install.
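
For context, that default vmbr0 setup usually looks something like this in /etc/network/interfaces on the Proxmox host (the addresses below are placeholders, not my actual config):

auto lo
iface lo inet loopback

iface eno1np0 inet manual

auto vmbr0
iface vmbr0 inet static
        address 192.168.1.10/24
        gateway 192.168.1.1
        bridge-ports eno1np0
        bridge-stp off
        bridge-fd 0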

Any ideas on where to go from here?

Sorry, my specific knowledge of 10G networking and Proxmox is limited. The best my Google-Fu could come up with is:

https://www.reddit.com/r/Proxmox/comments/rat6po/slow_iperf_results_to_vm_host_to_host_i_get_full/

But it's an old thread and doesn't really come to a solution. Are you able to tell the VMs in Proxmox to emulate certain NICs? Perhaps try changing the NIC model being emulated?
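
From what I can tell, the emulated NIC model is set per VM, either in the GUI (VM -> Hardware -> Network Device -> Model) or from the host shell with qm. A rough, untested sketch using the VM ID from your post:

# Show the current NIC line for VM 100
qm config 100 | grep ^net

# Change the model (virtio is usually fastest; e1000 and vmxnet3 are the common alternatives)
# Note: omitting macaddr=... here makes Proxmox generate a new MAC for the NIC
qm set 100 --net0 virtio,bridge=vmbr0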

Although not a solution, there is a 2-part deep dive on iPerf3 on YouTube that might help with debugging: seeing extended test information on where packets are being dropped, how the connection is being throttled, etc.:

HOMELAB: A definitive guide to using iPerf3 to measure network performance! (Part 1)
HOMELAB: A definitive guide to using iPerf3 to measure network performance! (Part 2)

In Proxmox, have you tried enabling Multiqueue? Your Windows VM may not be using the VirtIO driver out of the box, but TrueNAS probably does. It's supposed to allow more CPU threads to be dedicated to network packet processing.

QEMU/KVM Virtual Machines

Multiqueue

If you are using the VirtIO driver, you can optionally activate the Multiqueue option. This option allows the guest OS to process networking packets using multiple virtual CPUs, providing an increase in the total number of packets transferred.

When using the VirtIO driver with Proxmox VE, each NIC network queue is passed to the host kernel, where the queue will be processed by a kernel thread spawned by the vhost driver. With this option activated, it is possible to pass multiple network queues to the host kernel for each NIC.

When using Multiqueue, it is recommended to set it to a value equal to the number of vCPUs of your guest. Remember that the number of vCPUs is the number of sockets times the number of cores configured for the VM. You also need to set the number of multi-purpose channels on each VirtIO NIC in the VM with this ethtool command:

ethtool -L ens1 combined X

where X is the number of vCPUs of the VM.

To configure a Windows guest for Multiqueue install the Redhat VirtIO Ethernet Adapter drivers, then adapt the NIC’s configuration as follows. Open the device manager, right click the NIC under “Network adapters”, and select “Properties”. Then open the “Advanced” tab and select “Receive Side Scaling” from the list on the left. Make sure it is set to “Enabled”. Next, navigate to “Maximum number of RSS Queues” in the list and set it to the number of vCPUs of your VM. Once you verified that the settings are correct, click “OK” to confirm them.

You should note that setting the Multiqueue parameter to a value greater than one will increase the CPU load on the host and guest systems as the traffic increases. We recommend setting this option only when the VM has to process a large number of incoming connections, such as when the VM is running as a router, reverse proxy or a busy HTTP server doing long polling.
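
As a rough sketch of the steps above applied to the 12-vCPU VMs in this thread (untested on my end, so treat it as a starting point rather than a recipe):

# On the Proxmox host: give the VirtIO NIC of VM 100 twelve queues
# (re-add macaddr=... and any other existing net0 options so they are kept)
qm set 100 --net0 virtio,bridge=vmbr0,queues=12

# Inside a Linux guest such as the TrueNAS VM, match the queue count
# (the interface name varies; ens18 is a common default in Proxmox guests)
ethtool -L ens18 combined 12

# In the Windows guest, enable RSS and set "Maximum number of RSS Queues" to 12
# in Device Manager as described above.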