Storage VM: Part 6 – Final Optimization and Backup Pool

As mentioned in the previous post, I did end up buying an LSI SAS 9305-16e [1] Host Bus Adapter.  This wasn’t driven by a desire to actually test the one-card solution so much as by the PCI slot crunch.  With the developments around the Virtual Reality VM, I had to use two PCIe x16 slots for smaller hardware: one for the USB controller card, and one for the Intel WiGig wireless card.

With the two video cards for the Virtual Reality VM and the Streaming VM, I had to find a way to reclaim a PCI slot.  The best options, in my estimation, were collapsing the storage VM’s HBA cards down to one, or abandoning the wireless WiGig adapter.  Since I think there is a strong chance I will need to abandon the WiGig card anyway for a third video card slot at some point, I chose to start by collapsing the storage VM’s HBAs down to just one.

Alright, since I decided to start here, I have some more optimization to work through.  I wanted to test the best configurations for the single HBA.  If you need a refresher on these tests, please check out the previous optimization section for a full description.  Briefly, these tests involve copying a specific directory and all of its files to the remote pool, copying them back to local storage, and deleting the remote directory.  This directory contains approximately 57K files totaling 130GB.  I run the two copies and the delete five times each and average the results.

The first set of tests I ran was to determine whether, with a single HBA, I should keep the two-drive mirrors on the same backplane or on different backplanes.  Remember, way back when I did the teardown of the Chenbro 34438 chassis, I discovered it has two independent backplanes.  Given that the previous tests used two independent HBAs, each controlling an entire backplane on its own, I did not think it would be instantly obvious which method would be preferred.

There are a couple of points I want to cover first.  The first is that, compared to the old numbers from part 3 of this series, these are about 25% higher.  I am not certain of the cause, but I believe it is related to using the Intel x550 10G ports instead of the Solarflare SFF 5122F fiber card.  As previously mentioned, fiber latency is approximately 2.5 microseconds to get data through.  What I didn’t say is that my previous experience has shown me that RJ-45 copper ethernet has a latency of approximately 20-30 microseconds, an order of magnitude difference.  This isn’t as big a deal when transferring large files, where the bandwidth number is more important, but when transferring a large number of small files, that latency difference adds up quickly.  As a back-of-the-envelope check: an extra ~20 microseconds across roughly 57K files is over a second of pure latency per round trip, and each small file needs several round trips.

The other possibility I considered is that adding three more two-drive mirrors somehow caused a 25% performance hit.  I don’t put much stock in that idea.  Adding mirrors should improve performance, not hurt it.  At least, if that ZFS developer is to be believed.

The second thing to note is that one HBA is definitely better than two HBAs in this setup.  In all cases one HBA is faster.  Comparing the different-backplane mirrors to the same-backplane mirrors, it is also clear that same-backplane mirrors are preferred.

There is a secondary benefit to dividing the mirrors across the backplanes, and it’s the same as the benefit of having two HBAs: redundancy.  If one backplane goes out, the pool will still be up, and I will still be able to use it.  In addition, if one backplane goes out, the current operations could be completed on the other backplane until I can get a fix in.  This isn’t quite as good as two HBAs, but from a reliability point of view it is definitely better than mirrors on the same backplane.

Now I’m going to show some of the results from when the Optane and memory caching were turned on.

Holy opposite day Batman!  With the Optane and memory caching on, the performance flips.  Two HBA cards are better in all cases.  The next best is one HBA with mirrors on the same backplane, followed by different backplanes.  I would point out that the difference between same- and different-backplane mirrors is very small in most cases, though the latter is clearly still worse.

So, what is going on here?  The performance flipped when I added caching.  I am not certain I have the answer, but my best guess is that with caching, the storage system as a whole is able to make better use of the hardware.  Most of these copies tend to be single-threaded.  I once coded up a multithreaded version of the test:

workflow parallel-experiment-run {
	Param ($local_source, $remote_destination, $threads)	
	$local_destination = "$local_source - Copy"
	$thread_lim = $threads
	
	echo "$local_source, $remote_destination, $local_destination, threads $thread_lim"
	$copy_to_1 = Measure-Command {
		workflow copyfiles {
			# nested workflow: copy each file in parallel, up to $threadLimit at a time
			param($sourceRootDir, $targetRootDir, $threadLimit)
			$sourcePaths = [System.IO.Directory]::GetFiles($sourceRootDir, "*.*", "AllDirectories")
			foreach -parallel -throttlelimit $threadLimit ($sourcePath in $sourcePaths) {
				$targetPath = $sourcePath.Replace($sourceRootDir, $targetRootDir)
				$targetDir = $targetPath.Substring(0, $targetPath.Length - [System.IO.Path]::GetFileName($targetPath).Length - 1)
				if(-not (Test-Path $targetDir))
				{
					$x = [System.IO.Directory]::CreateDirectory($targetDir)
					# $z = [Console]::WriteLine("new directory: $targetDir")
				}
				# $z = [Console]::WriteLine("copy file: $sourcePath => $targetPath")
				$x = [System.IO.File]::Copy($sourcePath, $targetPath, "true")
			}
		}
		copyfiles -sourceRootDir $local_source -targetRootDir $remote_destination -threadLimit $thread_lim
	}
	$copy_to_1_ms = $copy_to_1.TotalMilliseconds
	echo "Copy-To-1   : $copy_to_1_ms"
	Start-Sleep -Seconds 20
	$copy_from_1 = Measure-Command {
		workflow copyfiles {
			# nested workflow: copy each file in parallel, up to $threadLimit at a time
			param($sourceRootDir, $targetRootDir, $threadLimit)
			$sourcePaths = [System.IO.Directory]::GetFiles($sourceRootDir, "*.*", "AllDirectories")
			foreach -parallel -throttlelimit $threadLimit ($sourcePath in $sourcePaths) {
				$targetPath = $sourcePath.Replace($sourceRootDir, $targetRootDir)
				$targetDir = $targetPath.Substring(0, $targetPath.Length - [System.IO.Path]::GetFileName($targetPath).Length - 1)
				if(-not (Test-Path $targetDir))
				{
					$x = [System.IO.Directory]::CreateDirectory($targetDir)
					# $z = [Console]::WriteLine("new directory: $targetDir")
				}
				# $z = [Console]::WriteLine("copy file: $sourcePath => $targetPath")
				$x = [System.IO.File]::Copy($sourcePath, $targetPath, "true")
			}
		}
		copyfiles -sourceRootDir $remote_destination -targetRootDir $local_destination -threadLimit $thread_lim
	}
	$copy_from_1_ms = $copy_from_1.TotalMilliseconds
	echo "Copy-From-1 : $copy_from_1_ms"
	Start-Sleep -Seconds 20
	$delete_1 = Measure-Command {
		workflow deletefiles {
			# nested workflow: delete each file in parallel, up to $threadLimit at a time
			Param ($files, $threadLimit)
			foreach -parallel -throttlelimit $threadLimit ($file in $files) {
				Remove-Item $file -Force -recurse
			}
		}
		
		$files = Get-ChildItem -Path $remote_destination -Recurse -File
		deletefiles -files $files.FullName -threadLimit $thread_lim
		
		# clean up the folders the parallel file deletes leave behind and remove the remote directory itself
		Remove-Item $remote_destination -Recurse -Force
	}
	$delete_1_ms = $delete_1.TotalMilliseconds
	echo "Delete-1    : $delete_1_ms"
}
#example run
parallel-experiment-run -local_source 'D:\zfs_test' -remote_destination 'Y:\zfs_test' -threads 3

When running on the old Solarflare SFF 5122F, the multithreaded version was literally slower, by a factor of 2x.  This didn’t make a lot of sense to me, but I just assumed that the other workflow copying the files, or the individual commands, were already somewhat optimized for this.

It sounds somewhat contradictory, but when I added more threads for copying without the cache, I was blanketing the system and leaving it no room to do any queuing or optimization on its end.  With the caching, both memory and Optane, it was able to respond and use the resources available to it better.  It could schedule the writes optimally, read files into memory, and predict what might be needed next.

That is my best guess as to why two HBAs beat one when caching is allowed.  Underneath, the one-HBA version has a faster response for a straight, uncached write-through.  But with caching, more resources are better.  This is somewhat contradicted by the two-backplane, one-HBA test with caching, but those numbers are not actually that different from plain one HBA.

There was one remaining set of test cases I wanted to run.  When I first ran tests in part 3, I had only 8 processors allocated.  While running them, I reviewed the CPU usage and noticed, on multiple occasions, that the system spiked to 100%.  Not wanting to penalize any particular parameter, I kept that allocation throughout the testing for that article.  For this one, however, I wanted to actually test the effect of adding more CPUs.  So I ran the test for mirrors on the same card, with Optane and memory caching enabled, while varying the CPU count.

As can be seen, the result here is much clearer than in some of my other tests.  More CPUs is very clearly better when writing.  This is likely because LZ compression is a CPU-intensive operation, but it is good to confirm it.  I also noted that in both the 16 and 24 CPU tests, the CPU never maxed out at 100%.  More CPUs is clearly just better.

Okay, from here I made my final decisions.  I decided to go with 24 CPUs and a base record size of 32KB, with mirrors on different backplanes.

Throughout the tests, 32KB has been something of a sweet spot.  It is also the last of the major jumps in performance in the newer tests: beyond that point things improve by only about 1%, whereas going from 16KB to 32KB was a 3-4% improvement.  I decided to sacrifice a little performance for the redundancy of different backplanes, mostly because I don’t want to lose data more than I want a 2% performance improvement.  At least this way I know I am quantifying the exact cost of doing it.  It would also let me shift back to the two-HBA configuration if I manage to free up a PCI slot at some point, without much upheaval underneath.
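
For reference, here is roughly what that final layout amounts to at the command line.  This is just a sketch: the pool name (Stacks) is the one I use later in this post, the device names are placeholders, and only a few of the mirror pairs are shown.

# sketch of the final layout (device names are placeholders);
# each mirror pairs one drive from each backplane
zpool create Stacks \
  mirror da0 da8 \
  mirror da1 da9 \
  mirror da2 da10

# 32KB base record size for the whole pool
zfs set recordsize=32K Stacks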

Datasets

The base structure has been decided, and I think I have a pretty good feel for what all of the parameters of this equation look like.  So it was time to make the final decisions on how I was going to access this pool.

In ZFS, as in most modern storage solutions, the underlying pool of storage is abstracted away from the end users in some way.  That way I don’t need multiple pools doing similar things; I can have one big pool presented as if it were two or more.  I just need to pay attention to the overall storage space rather than the space of each piece.  Let’s take a quick look at my first stab at this.

Okay, first thing to note: this is not the same FreeNAS install I did all of the analysis on.  A new version of FreeNAS came out, called TrueNAS Core.  It is the same project; iXsystems just didn’t want to maintain two forks and is opting for a community version and a paid version [2].

I am not thrilled about this, as very often this is the first step towards charging for all versions of the software.  However, I still have the last FreeNAS release ISO, so I can fall back if needed.  When FreeNAS was taken over by iXsystems, some in the community also maintained the old version as CoreNAS for a bit [3], until it was absorbed into OpenMediaVault [4].  That being said, it is not like there aren’t other options should this become an issue.  Like I said at the beginning, this is what is great about virtualization.  If I want to migrate my ZFS, it is literally as simple as deploying a new VM, installing the new software, and pointing it at the ZFS pool I already have.
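
In ZFS terms, “pointing it at the pool” is essentially a pool import.  A rough sketch from a shell on the replacement NAS (the pool name here matches the one I use later):

# see which pools are visible on the attached disks, then import by name
zpool import
zpool import Stacks

TrueNAS wraps the same operation in its pool import wizard, so normally you would do this through the UI rather than the shell.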

Alright, on to the specifics of the datasets.  I started with four datasets, each with a corresponding share.  I intend to use NFS for applications and operating systems, and CIFS for general sharing, since Windows is more common in my household.

First was the Application dataset.  The goal of this dataset is to be an NFS mount for any applications that need to run against remote storage.  The rough idea here was that maybe Zoneminder would want more native access to storage space.

Second was the Data dataset.  It’s not a very creative name, but this is what I term critical data.  It will be part of what is backed up in the backup pool. 

Third is the Media dataset.  As its name implies, this is all the media I wish to serve up to my computers and the house in general.  It is expected to use the most space by far, but will have a different file profile than the rest (more large files).

Last is the VirtualMachines dataset.  I wanted to serve this space back to VMware so I can be less restrictive with hard drive space on VMs that don’t need speed.  Mostly I am thinking of the vSphere vCenter server, which takes up 500+GB of hard drive space on my 2TB NVMe drives.  That’s too much for something I don’t use that often.  This is a nice solution; it will be a bit slower, but it should still work.
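
Creating these is a few clicks each in the TrueNAS UI; the raw ZFS equivalent is roughly the following (the shares themselves are set up separately under the Sharing section):

# the four datasets, roughly as the UI creates them
zfs create Stacks/Application
zfs create Stacks/Data
zfs create Stacks/Media
zfs create Stacks/VirtualMachines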

Now some of you may be asking, why not just one big dataset?  The truth is that this is about secure access.  Each of these has a different use case in mind.  I may have granted my own user access to all of them, but I want to separate out the server users.  For example, if I want to serve up the house media on a Plex server, it needs access to my Media dataset, but it specifically doesn’t need access to the VirtualMachines dataset.  If, somehow, the Plex server were compromised, all that user has access to is the Media dataset.  The damage is at least contained for a while, and hopefully I will notice something in the network monitoring, which does inspection for me to identify threats.

This is also the same reason I prefer to deploy multiple VMs rather than running everything in the FreeNAS add-ons (jails, services, etc).  If a service running there is compromised, the attacker could get access to the entire data pool, since FreeNAS has access to everything.  Inside a VM, there is still one more layer to get through (the hypervisor) before a malicious intruder gets access to everything.  It is good general practice to encapsulate services as narrowly as possible for exactly these cases.

There is a second reason I added multiple datasets: I can tune the attributes per dataset.  The Media dataset uses a 512KB record size, because these are very large files on average and will benefit from larger operations.  This is possible because of the logical separation.

Lastly, I can also enable or disable LZ compression.  I ended up disabling it on the Application and VirtualMachines datasets, where response time is more important than storage space.  The cost of skipping compression there is somewhat mitigated by the smaller 32KB record size.
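
As a sketch, that per-dataset tuning amounts to something like this (compression stays at the default everywhere else):

# per-dataset overrides
zfs set recordsize=512K Stacks/Media
zfs set compression=off Stacks/Application
zfs set compression=off Stacks/VirtualMachines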

Backup Pool

With the main array configured, I decided it was time to implement my backup pool.  I bought a pair of 16TB Seagate IronWolf NAS drives [5] for this.  These drives are designed for capacity, not speed.  They will be slow.  But that is the point of the backup; I don’t need speed in any real sense, that is what the primary pool is for.  Being hard disk drives, they are also more likely to fail from physical factors, such as head failures, spindle failures, fluid bearing leakage, etc.

All of that makes them sound unreliable.  They really aren’t; this is solid, if old, technology.  I am just covering my bases here.  With the backup, I will end up with 4 copies of my critical data.  For both the HDDs and the SSDs to fail at the same time would be a very odd scenario, as they have different failure characteristics and causes.  This is a different kind of redundancy than just extra drives: different technologies.

Now to do this, I did not want to use the same HBA as the primary pool.  I want as much resource independence as I can get.  That way if the HBA goes down, I still have access to my critical data while I order a replacement card.  So I plugged these two hard drives into the free hard drive caddies I set up way back in the build section.  There were eight unused SATA ports; this uses just two of them.

Now, mapping these hard drives to the storage VM isn’t as simple as toggling passthrough on an HBA card, where all of the disks on the other end of the card come along with it.  I learned that the hard way when passing through the AHCI controller left VMware unable to make any state changes on the VM.  Instead I need to use something called Raw Device Mapping (RDM).  This is basically a way to fake a VMware VMDK disk by presenting the entire 16TB HDD as one big virtual disk.  Here is a support article on how to do this [6].

Let’s go through this.  I started by validating that the disks are detected correctly in VMware.  Navigating to the storage section, I can see both of them.

With that validated, I used SSH to get to a command prompt on ESXi and listed the attached disks with:

ls -l /vmfs/devices/disks

I have highlighted the disk I am going to create a Raw Device Mapping for first.  It matches the first hard drive I found in the VMware storage section.  Next, I executed the command to create an RDM.

vmkfstools -z /vmfs/devices/disks/t10.ATA_____ST16000NE0002D2RW103_________________________________ZL22PGWQ /vmfs/volumes/Primary/Alexandria/archives_rdm1.vmdk

That creates the device mapping file in the same datastore folder as the storage VM’s existing virtual hard drive (the VM is called Alexandria).

I then executed the same command for the second hard drive.

vmkfstools -z /vmfs/devices/disks/t10.ATA_____ST16000NE0002D2RW103_________________________________ZL22QXXP /vmfs/volumes/Primary/Alexandria/archives_rdm2.vmdk

This creates two RDM files for me, named archives_rdm1.vmdk and archives_rdm2.vmdk.  I intend to name the ZFS pool Archives.

Then, I needed to add the RDM drives to the VM.  I started by navigating to the Virtual Machine screen for Alexandria.

I clicked edit.

Then under Add hard disk, I selected Existing hard disk.

That opens the datastore browser, where I navigated to the Primary -> Alexandria folder and found the two RDMs created above, archives_rdm1.vmdk and archives_rdm2.vmdk.  I selected archives_rdm1.vmdk.

I repeated the procedure for archives_rdm2.vmdk.

At the end you can see I ended up with three virtual hard disks: the 100GB install disk and two 14.55TB extras.

Now, before I proceed to create the Archives pool, I want to go on a brief digression.  In VMware, creating a virtual hard disk doesn’t just create a drive; it also requires a virtual hard drive controller.  This is virtual hardware that governs how data is sent to and received from the disk, and where optimizations can happen.

By default, only one virtual disk controller is created.  If I were to leave things as above, there would be only one controller for all three drives.  This isn’t usually an issue with real hardware, but in our case there are legitimately two underlying disk controllers, the NVMe one and the AHCI one.  Keeping both of these under one virtual controller will limit the resources and queuing available to the actual disk accesses.

There is even an issue with what exactly is being emulated.  The LSI SAS controller that VMware emulates by default is built for compatibility, not speed.  To address this, VMware built its own virtual controller that takes advantage of how VMware works, for about a 30% performance boost [7].  VMware calls this the Paravirtual controller.

In effect, there are two things we could do to improve performance for this RDM setup: first, add multiple controllers so the HDDs are treated independently by the underlying VM; second, use the VMware Paravirtual controller.  I attempted both.

First I created extra SCSI controllers with the add hardware option.  

Then I assigned Hard disk 2 and Hard disk 3 to the newly created virtual controllers.
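
Under the hood, those GUI changes boil down to a few entries in the VM’s .vmx file.  The sketch below is illustrative rather than an exact copy of mine; the controller numbering and file paths will differ.

scsi1.present = "TRUE"
scsi1.virtualDev = "lsisas1068"
scsi1:0.present = "TRUE"
scsi1:0.fileName = "archives_rdm1.vmdk"

Switching scsi1.virtualDev to "pvscsi" is what selecting the Paravirtual controller would do, which matters in a moment.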

That took care of adding extra controllers for these disks.  When I loaded up TrueNAS to create the pool, I saw them just fine.

However, when I attempted to switch to the Paravirtual controller, the disks did not show up correctly.  I am guessing this is just a FreeBSD driver support issue.  I fell back on the LSI Logic SAS controller.

Let’s go ahead and step through the creation of the Archives pool.  First I navigated to the Storage -> Pools section and clicked Add.

Then I clicked Create Pool.

I can see my two 14.55 TB drives just fine now.  I named the new pool Archives.

Then I added them to the Data VDevs.

Then I clicked Create.  Everything worked.
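
For reference, the GUI is doing roughly this underneath.  This is a sketch with placeholder device names, and it assumes the two drives go into a single mirrored vdev so either one can fail without losing the backups.

# rough CLI equivalent of the pool creation (da10/da11 are placeholders)
zpool create Archives mirror /dev/da10 /dev/da11
zpool status Archives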

Data Migration

Alright, I wanted to start migrating the data over from the old FreeNAS to the new TrueNAS.  At first I wanted to just mount the new drives from the old FreeNAS, but I quickly realized I had an issue.  The version of FreeNAS running on the old box was out of date and didn’t have an easy console to create and use mount points (the rough equivalent of a mapped network drive in Windows).  So I ended up mounting the old FreeNAS from the new TrueNAS.

I created the mount point for the old ZFS system (this is just a directory in FreeBSD and Linux).

mkdir /mnt/oldzfs

Then I attempted to mount the old CIFS shares.

mount -t cifs -o username=<SECRET>,password=<SECRET> //192.168.2.14/bigstore_zfs_nas /mnt/oldzfs
mount: /192.168.2.14/bigstore_zfs_nas: Operation not supported by device

I tried several iterations around this.  At first, I thought it was a CIFS version issue [8], so I specified a version number.

mount -t cifs -o username=<SECRET>,password=<SECRET>,vers=3.0 //192.168.2.14/bigstore_zfs_nas /mnt/oldzfs
mount: /192.168.2.14/bigstore_zfs_nas: Operation not supported by device

I tried every version of CIFS and tried to look up what the old system was running.  None of that worked.  I eventually gave up on CIFS, enabled an NFS sharing service on the old FreeNAS, and was then able to get this command to work.

mount -t nfs 192.168.2.14:/mnt/bigstore /mnt/oldzfs

The next thing I attempted was to copy the media files from the old ZFS NAS over to the Media dataset on the new one.

mkdir /mnt/Stacks/Media/Music
cp -R /mnt/oldzfs/Music/* /mnt/Stacks/Media/Music

This created a weird problem I was not expecting: it gave all of the files a new timestamp!  I don’t strictly need to preserve the old timestamps, but they would be very useful to keep.
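
In hindsight, cp’s preserve flag would most likely have kept them; something along these lines should have done it:

cp -Rp /mnt/oldzfs/Music/* /mnt/Stacks/Media/Music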

I also noticed that for reasons I still don’t know, this was capped at 1Gbps despite the fact that I have a 10G Solarflare SFF 5122F on the old ZFS NAS and the 10G Intel x550 on the new storage system.  

Upon noticing this, I decided to copy the media files through my Windows workstation instead.  This is highly inefficient!  I mapped both shares as drives and then started the copy.  This was not limited to 1Gbps and got the full 10Gbps.  It also preserved the timestamps.  I’m not sure why going through the workstation was necessary to get the full 10Gbps, but it worked.  Since this migration is a one-time event, it didn’t seem like a big deal to me.

I did notice that the NFS-mounted route was better at IOPS than going through the Windows workstation, although the workstation achieved higher bandwidth.  Because of this, I decided to use the NFS-mounted route for migrating the old backups and data files, and to copy the media through the workstation for the bandwidth.

It occurred to me that I should be using rsync for this instead of cp, so I set out to figure that out.  At first rsync wasn’t working, mostly because the new datasets have CIFS share points, whose permissions are incompatible with Unix file permissions (NFS shares use Unix permissions) [9].  I eventually settled on the following command.

rsync -rltd /mnt/oldzfs/<directory> /mnt/Stacks/<dataset>/<directory>

That worked, and it preserved timestamps and symbolic links.  Normally I would use rsync's archive flag -a, but that also copies file permissions, which made the whole operation fail.  While doing this, I decided to expand and change the datasets for the Archives and Stacks pools (because they are both parts of Alexandria, a library 😉 ).

I created two extra datasets on the SSD-based pool (Stacks): Certificates and Backup.  Certificates is for distributing certificates that require manual action from me.  I can keep it separate and secure by limiting access.

Backup is where I want to stick the VM backups.  Again, I am expecting there will be a program that needs access to a dataset, and I want it limited to just the backup data.  I have also edited the compression on Application and VirtualMachines to turn off LZ compression, which should improve latency on those datasets.  I also noticed a huge improvement in storage space used after the migration.  For example, let’s look at the Backup dataset.

The old ZFS pool used 4.4TB for the old backups; the new one uses 1.09TB.  That’s a huge improvement, and it explains why I had 15TB free after the migration completed.  I think I might have made a mistake when purchasing the extra mirror pairs; it looks like I may genuinely not need them.  I did not know how effective LZ compression would be on my data.  The answer appears to be: very.

Looking at the datasets, only the Media dataset doesn’t compress well.  That actually makes sense: most media is already compressed, and layering LZ compression on top of it can even make things worse if you aren’t careful.
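
If you want to check this on your own pool, ZFS tracks the ratio per dataset.  Something like the following shows the effect (dataset names as above):

# compare compression ratio, on-disk usage, and logical (uncompressed) size
zfs get compressratio,used,logicalused Stacks/Backup Stacks/Media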

There was only one more thing to do: rsync the Data and Backup datasets over to the Archives pool.  I created the corresponding datasets on Archives and copied them over.  I didn’t change the record size, so Archives is left with the default 128KB record size.  That appears to have helped the LZ compression a bit, as it is about 4-5% more effective than on the main pool.
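
A rough sketch of that final step, assuming the dataset names used above (the trailing slashes matter to rsync):

# create matching datasets on the backup pool and sync into them
zfs create Archives/Data
zfs create Archives/Backup
rsync -rltd /mnt/Stacks/Data/ /mnt/Archives/Data/
rsync -rltd /mnt/Stacks/Backup/ /mnt/Archives/Backup/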

Conclusion

At this point, the two ZFS pools I want have been created, the migration is complete, and I have a pretty good understanding of the performance trade-offs.  In addition, I have saved quite a bit of space by using the default LZ compression, and I have partitioned the space into datasets in preparation for security encapsulation.  This is a big success in my book.  Storage VM complete!

References

[1] https://www.broadcom.com/products/storage/host-bus-adapters/sas-9305-16e

[2] https://www.servethehome.com/freenas-is-dead-long-live-truenas-core/

[3] https://techenclave.com/threads/corenas-the-future-of-freenas.79992/

[4] https://www.openmediavault.org/

[5] https://www.newegg.com/seagate-ironwolf-st16000vn001-16tb/p/N82E16822184805

[6] https://kb.vmware.com/s/article/1017530

[7] https://www.davidklee.net/2018/02/01/multiple-disk-controllers-in-vms-can-mean-improved-performance/

[8] https://unix.stackexchange.com/questions/144522/mounting-cifs-operation-not-supported

[9] https://unix.stackexchange.com/questions/61586/how-to-tell-rsync-to-preserve-time-stamp-on-files-when-source-tree-has-a-mounted
