Storage VM: Part 3 – Installation, Configuration and Optimization

Storage Management System:

I did not do a great deal of new research before deciding to continue with FreeNAS [1] as my storage solution.  I have, from time to time, considered building a Ceph cluster [2], GlusterFS [3], or SnapRAID [4].  Heck, a quick open source search yields quite a few competitors to FreeNAS [5].  Honestly, after a little searching I noticed a herding effect: all of these pages advertise an object storage solution. 

This makes sense, S3 basically redefined the way we think about enterprise storage.  There will always be cases for needing high speed low latency storage when doing processing, but the end user need not be too concerned about what the storage system does underneath.  It’s a directory, an object storage server, a block storage solution (iSCSI), and more, all based on the same deployment.  All that matters is how that particular system wants to use it. 

All of these solutions are coalescing around a set of interfaces, and there isn’t a huge difference on the support front.  A few quick searches and, yup, FreeNAS supports object storage [6].  It has even picked up a corporate sponsor, iXsystems, which sells servers based on FreeNAS.  It is difficult to get a handle on market share for something like this.  AWS, Google Cloud, and Azure all operate large storage systems.  In addition, HPE, Dell, Cisco, and Oracle all sell storage solutions based on their own take on what is desired.  However, from where I am sitting, they are all based on a few sets of technologies that everyone has been using for years.  They basically sell hardware, support, and configuration, not some new unique technology.  

First, they all use some kind of RAID.  RAID stands for Redundant Array of Independent (sometimes Inexpensive) Disks.  Basically, in order to reach large amounts of storage space, I need to have more than one disk.  Although there exists a 100 TB SSD [7], it is an absurd $40,000 [8].  By chunking a bunch of drives together I can treat them as one big drive with their combined space.  Typically this is called RAID 0, or striping.  However, this is rarely used in isolation.  It has no data protection, and data protection is where RAID really stands out.  Instead of investing in a $40,000 SSD, I can buy $40,000 worth of 4TB Western Digital NAS SSDs, get 100 of them, use 40 of them for data protection and redundancy, and still come out ahead.  
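To put rough numbers on that trade-off, here is a quick back-of-envelope sketch; the $400 price per 4TB drive is my assumed round number for illustration:

```python
# Back-of-envelope: one 100 TB SSD vs. many 4 TB SSDs for the same money.
# The $400 price per 4 TB NAS SSD is an assumed round number.
budget = 40_000
drive_tb = 4
drive_cost = 400

drives = budget // drive_cost        # 100 drives for the budget
redundancy_drives = 40               # reserved for parity/redundancy
usable_tb = (drives - redundancy_drives) * drive_tb

print(drives, usable_tb)             # 100 drives, 240 TB usable vs. the single 100 TB SSD
```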

There are several RAID types, and I don’t want to get too into the weeds here; there are plenty of good overviews.  Briefly though, RAID 1 is known as mirroring.  That is where I have 2 drives that are exact copies of each other.  If either drive fails, the other can serve the data until the failed drive is replaced and the data re-copied. 

RAID 3 is where one drive is used for parity [9].  For RAID 3 the parity is typically computed with the Exclusive-Or operator, since in the event of a failure, with the knowledge of which drive failed (data 1, data 2, or the parity drive), the missing data can be determined. 

RAID 5 also uses one drive for parity.  Parity was envisioned as a way to recover from bit errors [10].  It comes in even and odd flavors: the parity bit attempts to make the total number of 1’s across all drives at the same block even or odd.  For example, if I have 5 drives, labeled A, B, C, D, and Parity, then:

Drive A Data  Drive B Data  Drive C Data  Drive D Data  Total 1s  Even Parity  Odd Parity
1             0             1             1             3         1            0
1             1             0             0             2         0            1
0             0             0             0             0         0            1

Take out any specific bit, and with the knowledge of which drive failed, it is possible to determine what the missing data is.  This is not exactly what RAID 5 does.  Parity as described includes a bit more information than is needed, and can potentially even detect corruption (where one of the bits is wrong).  In reality, we have an extra piece of information: which drive failed.  Most RAID 5 implementations simply use the Exclusive-Or operator for speed.  With knowledge of exactly which drive is missing, reconstruction is a simple Exclusive-Or of the working drives.  This does introduce processing overhead, but it tends to be quite small overall. 
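As a concrete sketch of that XOR reconstruction (illustrative Python, not how a real controller is implemented):

```python
# Single-parity reconstruction with XOR: the parity block is the XOR of
# the data blocks at the same offset. Knowing WHICH drive failed, the
# XOR of all survivors (including parity) reproduces the missing block.
def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

data = [b"\x0f\x10", b"\xf0\x01", b"\xaa\x55"]  # three "data drives"
parity = xor_blocks(data)

# Drive 1 fails; rebuild its contents from the survivors plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
```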

RAID 5 fell out of favor something like 12 years ago [11].  This was a side effect of the unrecoverable read error rate.  Essentially, drives got so big that an unrecoverable read error rate of 1 in 10^14 bits (typical for HDDs of the time) would actually occur when rebuilding the array after a failure.  Due to magnetic characteristics, a drive would end up failing to read a certain sector.  Thus, in a RAID rebuild, where all sectors need to be read, arrays would lose data when trying to calculate the missing data. 
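To see why this became a problem, here is the back-of-envelope probability; the 12 TB rebuild read and the 1-in-10^14-bits spec rate are illustrative assumptions:

```python
# Probability of at least one unrecoverable read error (URE) during a
# rebuild, if each bit read independently fails at the spec-sheet rate.
import math

ure_per_bit = 1e-14          # 1 error per 10^14 bits read
tb_read = 12                 # assumed data read during the rebuild
bits_read = tb_read * 1e12 * 8

p_any_ure = 1 - math.exp(bits_read * math.log1p(-ure_per_bit))
print(f"{p_any_ure:.0%}")    # roughly a 62% chance the rebuild hits a URE
```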

RAID 6 is still commonly used; it uses 2 drives for parity.  I’m not going to go into exactly what the calculation looks like, as it ends up in a Galois field, but here is a good explanation with nice visuals [12].  With RAID 6, an unrecoverable read during a rebuild can be recovered from.  One is simply betting they don’t get 2 unrecoverable read errors on the same sector.  Assuming the error is independent of the data written there (it almost certainly is), this is a good bet. 

RAID 7 uses 3 drives for parity calculations, again based on Galois field calculations. 

A very common high performance RAID is RAID 1+0, which is where the drives are divided in half, each half is striped, then mirrored to the other.  This also works with the other RAID levels, so RAID 3+0 means building a bunch of RAID 3 arrays and striping across them. 

There is a specific set of RAIDs used in ZFS that are more fault tolerant than a standard RAID.  Briefly, ZFS tries to be able to recover from failures even when it is not known which drive is failing.  It covers data corruption in addition to drive failure.  Typically it is easy to determine which drive failed.  For basic RAIDs it’s the one that won’t turn on or stops returning data altogether.  However, what if the drive is returning incorrect data (data corruption)?  Like an alpha strike from back in the memory discussion, or a write hole, where power is lost in the middle of a write operation. 

Well, if I knew which copy was wrong, RAID could recover it with the parity algorithm, but that kind of failure is unlikely to even be detected.  RAID-Z includes a checksum on the data stored.  Whenever a piece of data is retrieved, the checksum can be calculated and verified against the stored checksum.  Except in the case of collisions (it should be absurdly rare for a random corruption to create a checksum collision), the corruption can be detected.  With the right checksum algorithm, it can even be recovered from. 
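Here is a toy sketch of that idea: the checksum detects silent corruption that plain RAID would miss, and parity then repairs it.  zlib.crc32 stands in for ZFS's real checksums (fletcher4/sha256), and the blocks and flipped bit are made up for the example:

```python
# Checksums detect silent corruption; parity repairs it once detected.
import zlib

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

block_a = b"important data!!"
block_b = b"more data here.."
parity = xor(block_a, block_b)               # single-parity "third drive"
stored_crc_b = zlib.crc32(block_b)           # checksum kept with metadata

# A drive silently returns bad data for block_b: one flipped bit, no I/O error.
corrupted_b = xor(block_b, b"\x01" + b"\x00" * (len(block_b) - 1))

assert zlib.crc32(corrupted_b) != stored_crc_b   # corruption detected
repaired_b = xor(block_a, parity)                # rebuilt from parity
assert zlib.crc32(repaired_b) == stored_crc_b
```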

ZFS also does not stripe all data across the entire array.  There are advantages and disadvantages to this.  Advantages include that the system can recover from a wider number of known causes of data loss.  This comes at the cost of losing the RAID simplicity.  When using ZFS, I must know it is ZFS, or the drives can’t be used.  My personal opinion here is that ZFS is something I would have been hesitant to adopt early, but at this point it is a very mature, well-understood project.  I have no issues with using any RAID-Z. 

The last thing of note about RAID-Z is that it does have a rough translation to the previous RAIDs.  RAID-Z1 is approximately RAID-5 (1 parity drive), RAID-Z2 is approximately RAID-6 (2 parity drives), and RAID-Z3 is approximately RAID-7 (3 parity drives).  A good overview of the differences between RAID and RAID-Z is here [13]. 

From a high level, all of this detail is probably not that important.  There are a ton of RAID reliability calculators out there [14].  If I needed an official document stating exactly what the MTTF (mean time to failure) would be, I would go that route.  For me though, I knew I wanted to use 2 or 4 TB SSDs, so that narrowed the field a bit. 

I will discuss exactly which options I tested for the storage VM later, but from here?  A RAID of 3-5 drives would be RAID-z1 or RAID-5 (personally I don’t trust this and would sooner use RAID-3; RAID-5/z1 gives me a false sense of security I’d prefer to avoid), a RAID of 6-10 drives should be RAID-z2 or RAID-6, and a RAID of 11-18 drives should be RAID-z3 or RAID-7.  Above that, the drives should be chopped up into multiple arrays, but I will discuss how that works later.
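That rule of thumb, encoded as a tiny helper; this is just my guideline from above, not an official recommendation:

```python
# Suggested parity level by vdev width, per the guideline in the text.
def suggested_raidz(drives: int) -> str:
    if drives < 3:
        return "mirror"
    if drives <= 5:
        return "raidz1"      # 3-5 drives (or RAID-5)
    if drives <= 10:
        return "raidz2"      # 6-10 drives (or RAID-6)
    if drives <= 18:
        return "raidz3"      # 11-18 drives (or RAID-7)
    return "split into multiple vdevs"

print(suggested_raidz(12))   # raidz3
```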

The second technology in use is the RAID management system.  There are 2 main ways that RAIDs are managed.  First is a RAID controller card.  These are almost identical to HBAs underneath, so much so that I have literally flashed a new BIOS onto a RAID controller to turn it into an HBA.  The second is with software. 

Dedicating a computer to managing the storage system used to be a much more difficult decision.  Back in the day (like high school for me), software RAID could cost as much as 10% overhead.  That choked resources for everything else.  Nowadays, with as many cores as our common processors have, dedicating one whole core (or more) to this is kind of a no-brainer. 

I do not like RAID controller solutions.  They introduce a single point of failure for the whole system.  This isn’t to say that an HBA doesn’t also introduce a single point of failure, but if the HBA fails, it doesn’t, as a side effect, potentially destroy the data behind it.  The drives simply become unavailable to the system. 

I know that there is also the data corruption type of failure, but as I have just covered, there are other options for that, and since an HBA doesn’t typically do any RAID calculations, that removes a potential source of errors.  That being said, I do not think RAID cards are a bad idea; I just think the cost of using software is so low nowadays that it is the better solution, unless something else requires a physical RAID card configuration to get working.

The third technology all of these storage solutions implement is the set of interfaces.  They all have a method to serve NFS (Network File System, the Linux sharing protocol); CIFS (Common Internet File System, a specific implementation of Microsoft’s SMB, or Server Message Block, protocol); object storage based on compatibility with AWS’s S3 protocol; and iSCSI, or Internet Small Computer Systems Interface [15].  I am certain there are more protocols out there, but these 4 will pretty much guarantee that any application I wish to install will be able to use the storage system.  This standardization has been very helpful for me, as it has allowed a great deal of encapsulation for the storage system.

The fourth technology these solutions employ is some kind of file system.  I remember in high school, I first cared about file systems when FAT could no longer format my large hard drives.  FAT is File Allocation Table, and at the time FAT16 could only allocate 2 GB drives [16].  Remember when 2 GB was “a lot”?  Anyways, I then learned about FAT32, NTFS, and EXT3, all file system types that extended the ability to have drives beyond that limit.  I even learned about gpart and partitioning.  I am much more familiar with EXT4 than I am with other file systems, so I will use it as an example of exactly what file systems do.  

EXT4 determines how, exactly, a file is stored on a drive. In EXT4, blocks are written to a drive based on whatever the block size of the drive is, and identified by a block type. 

At the top, there is a superblock node that identifies all of the base information about the partition and filesystem (partitions are built on top of raw drive data).  This is usually stored in multiple places since if it cannot be located or read, the file system has failed. 

The next block type is the block group descriptor; these describe how the partition space is used.  For example, they denote where the free inode tables and free blocks begin, as well as the used inodes and bitmaps.  Inodes are what denote files and directories (directories are really just a special case of inode, just like files are). 

Continuing down this chain we get to the inode, which contains specific information about a particular file.  This is what is commonly referred to as file metadata.  This includes the last time it was accessed, the file permissions, and more (but not the file name, that is in the directory block).  This also contains a list of inode data blocks, which point on the disk to raw data for this file. 

There are four main types of data block pointers in an inode: direct, indirect, double indirect, and triple indirect.  Direct pointers point straight at blocks of raw data, indirect pointers point to blocks that solely contain pointers to data blocks, double indirect pointers point to blocks of pointers to indirect blocks, and triple indirect pointers point to blocks of pointers to double indirect blocks.  See the picture below to get an idea of how all of these nodes map together. 

Basics of an EXT4 File System Block Structure
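One consequence of this pointer scheme is a hard ceiling on file size.  Assuming 4 KiB blocks and 4-byte block pointers (the classic ext2/ext3 layout described above; modern EXT4 typically uses extents instead), the arithmetic looks like:

```python
# Maximum file size under the direct/indirect pointer scheme.
block_size = 4096
ptrs_per_block = block_size // 4   # 1024 four-byte pointers per block
direct_ptrs = 12                   # direct pointers in the inode

addressable_blocks = (direct_ptrs
                      + ptrs_per_block          # single indirect
                      + ptrs_per_block ** 2     # double indirect
                      + ptrs_per_block ** 3)    # triple indirect
max_file_tib = addressable_blocks * block_size / 2 ** 40
print(f"{max_file_tib:.2f} TiB")   # ~4 TiB ceiling
```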

The directory structure of EXT4 was an add-on that came later in EXT development.  It started as a strict linear array of names and inodes, but was later converted to a hash tree (HTree, a B-tree variant).  Essentially, starting from the block group descriptor, it points to the root directory, which then has an array of name and inode pointers.  The tree was added by making these directory name/pointer groups become nodes that point to another directory node, which then points to inodes.  The same diagram shows this as well.

Alright, that shows the basic layout of a file system built on top of a block structure; all of these blocks are of the same size, padded out to fill them. 

I have left out a lot of details; this is an entire technology.  I didn’t cover journaling, special features, lazy allocation, just flipping a bit to signify deletion, or any of the rest of the things necessary to make an actual file system work.  Here is a good overview of the blocks one can find in EXT4 [17]. 

ZFS needs to be viewed through this lens.  It is a complete file system, and without the software that defines how to read and operate on all of the blocks it creates, one cannot read the files.  All of this is rolled up into a technology I’m calling file systems.  Dell, HPE, etc. do implement their own file systems, but those are built around data clusters, which is far larger than what I am attempting here.  AWS has started venturing into actual file system implementations recently, in its attempts to increase performance and redundancy.  Their AuroraDB structure is quite impressive.  I wish I had a chance to do a deep dive with it sometime.

The last technology all of these solutions employ is a caching system.  I am hesitant to call this a technology here.  Caching has been around for ages.  It’s a known quantity, and it has a lot of uses outside of storage.  However, they all utilize some kind of caching system.  ZFS, the storage system underneath FreeNAS, uses caching to improve performance.  It is a bit of a joke among ZFS users that the best thing one can do to improve performance is add more RAM.  Even something as simple as a Redis cache [18] or Memcached [19] can be thrown in front of an object storage server to improve performance.  It doesn’t need to be as low level as the ZFS caches are.

Briefly, in case one is not familiar with caching: whenever a certain block of data is read or written, it needs to be read from or written to the master record.  I like to call this the source of truth (and in fact a great number of problems come from having more than one source of truth).  If it takes ~10ms to read the data from disk, that is a significant amount of time for an application to be waiting, whereas it is on the order of ~10ns to read data from memory.  Caching is the act of storing a copy of the data in a faster-to-access form and using that copy for reads or writes, while syncing with the master record at a later point in time. 
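A minimal read cache along those lines, with a dictionary standing in for RAM and a function standing in for the slow master record:

```python
# Read-through cache: serve hits from the fast copy, go to the source
# of truth on a miss and remember the answer for next time.
class ReadCache:
    def __init__(self, backing_read):
        self.backing_read = backing_read  # the ~10ms "disk" path
        self.store = {}                   # the ~10ns "memory" path
        self.hits = 0
        self.misses = 0

    def read(self, key):
        if key in self.store:
            self.hits += 1
            return self.store[key]
        self.misses += 1
        value = self.backing_read(key)
        self.store[key] = value
        return value

cache = ReadCache(lambda block: f"data-for-{block}")
cache.read(7)   # miss: goes to the master record
cache.read(7)   # hit: served from the fast copy
cache.read(8)   # miss
print(cache.hits, cache.misses)   # 1 2
```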

When applying caching to file systems, ideally the system would load a file into memory and handle modification there, with follow-up writes to disk to maintain coherence with the master record.  However, most caches live in memory, which fails in a power loss scenario (memory is volatile storage), so writing changes to the cache tends to be followed up by actually writing those changes to persistent storage.  There are quite a few more topics related to caching, but those are probably better addressed in a computer architecture class.

Combined together, these five technologies form the foundation of most storage products, and the truth is that many of them were solidified in the 1980s.  They are not new, and they are very mature. 

The newest changes are around iSCSI and S3, and even those are old and backed by still older technologies.  Because of this, and the herding effects of the software around them, there really isn’t a huge difference between them all.  I did not feel the need to do a deep search. 

This isn’t to say there is no differentiation between them all, a point I’m sure they would argue, but this isn’t like the difference between a hard disk drive and a solid state drive.  This isn’t several orders of magnitude.  It is also mostly around specific applications.  A general storage solution, independent of specific application, isn’t where current gains are made.  AuroraDB, AWS’s PostgreSQL-compatible database, has made gains by viewing the whole storage system holistically and taking over parts commonly associated with operating systems.  This tells me that general storage systems are very mature, and only specific application optimizations are really seeing big gains.  That is not necessary for my case. 

I am not tuning to a specific application; I want a general storage solution.  A couple of quick looks, and FreeNAS is still considered one of the best for home labs [20] (this guide from 2019 shows it’s still up to date), supports all the needed protocols, and will serve my purposes.

Configuration and Install:

This really won’t take long here.  I just wanted to show my work a bit, in case someone is trying to follow along for their own setup purposes.  First I went ahead and toggled passthrough on the 2 LSI 3008-8e PCI cards, and one of the 10G x550 network controllers:

This may or may not necessitate a system reboot.  In my case it did, but it does not always.  Then I went through the VM creation process, clicking create new VM in the virtual machines section.

I named the storage system Alexandria, after the famous library, that storehouse of human knowledge ;).  I also selected the Guest OS of FreeBSD 11 [21], which is what the current version of FreeNAS (version 11.3) is based on.

I selected the datastore that the VM will use for its install hard drive.

Then, I selected 8 processors and 64 GB of RAM for this machine.  I can change this later if I choose to (and I will).  For now I just want to get things running.

Then there is a review page (This says Alexandria2 because I already had the original created and selected a new name for this demonstration purpose)

Next I needed to enable passthrough of the 2 HBA cards and the network card.  From the Alexandria main page, I clicked edit, then add other device, selecting PCI device.

This will auto select from the passthrough device list.  I did not need to change the selections because I only have 3 devices on passthrough right now, but I will have to be more careful in the future.

Next, I needed to reserve the memory to the VM.  This is because of the PCI passthrough.  PCI devices have what is called DMA, or Direct Memory Access.  This is when the operating system allows a PCI device to write directly to memory.  As controlling this direct writing is not easy for the hypervisor without assuming control of the PCI device directly, it makes sense that the memory needs to be reserved from the rest of the system, since a PCI device could be writing to it at any time.

At this point I needed to load the installation ISO to the datastore.  I downloaded the FreeNAS installation ISO from their website [22].  I then navigated to the storage section, selected the primary datastore and clicked on the datastore browser.  Then I uploaded the ISO to the datastore.  I have already done this, that is why the ISO is already listed.  In fact you can see the other directories of other VMs I have previously created.  This is after my reinstall, which you can read about in the previous post.

Next I went back to the VM screen and edited Alexandria again.  I selected add device and choose to add a new CD/DVD drive.

Then I selected the option to switch it from a host drive to an ISO file.

Then it opened a datastore browser, and I selected the FreeNAS ISO.

Afterwards, I verified the selection, and saved changes.

Then, I was ready to boot and step through the FreeNAS installer.

I should note that, for whatever reason, I almost always needed to select connect on power on twice and boot up twice to get the VM to register the ISO.  I am not sure why this is the case; I assume I am missing something simple.  But, as install is a one-time event, it didn’t seem worth investigating with such a short workaround.

Next I booted up the system, and got the FreeNAS installer. (I had to use Alexandria2 for this, as the first one already has a FreeNAS install).

Then I selected Install/Upgrade (hit enter)

I got lucky on this one, the VM drive is at the top, but you may need to navigate down to find the VM drive.  Hit the space bar to select it (that adds the * next to it), then hit enter to continue.

Next I agreed to the deletion on the VM drive.

Then I created a root password.

Next I needed to select the boot option.  At first I tried UEFI, but that does not seem to work in this VM; I had to restart the installer and select BIOS.

The system will go ahead and install FreeNAS.

Lastly, I agreed to completion.

I then shutdown the VM, and removed the CD/DVD drive.

Then I booted the VM up and eventually it finished initializing.

This concludes my basic setup and configuration.  I kept 2 network controllers because one is for the VM network; that path is best when VMs want to use the storage system without traffic ever actually leaving the server.  The 10G card will be for all systems not on the virtualization server.


I didn’t want to just take the default parameters of ZFS.  The last time I put a ZFS pool together I was more focused on just ending up with a set of vdevs (the ZFS term for a specific RAID array) that I could expand.  I was worried about being stuck with less space than I needed. 

After 6 years with the old pool in one form or another, I am much more comfortable with how this works, and I want to focus a little on optimization this time.  I am not sure if I want to turn this pool into a straight up iSCSI pool for ESXI, but I do think that would be a project I want to look into at some point.  Given that, I don’t want to just throw together a collection of vdevs with the primary goal being to reduce the capacity I lose to redundancy.  I want to actually optimize and run a set of tests to figure out what this solution space looks like.

When setting up a ZFS pool there are 3 primary variables that determine pool performance (well, 4, but I will be ignoring caching for this set of tests, as caching will be worked on separately). 

First is how the vdevs are configured.  ZFS pools stripe across all of the available vdevs.  Note, the vdevs do not need to be of the same size like RAID arrays; when one vdev is out of space, ZFS will simply stripe across fewer vdevs.  Therefore the number of vdevs is a variable. 

Technically the type of vdev is a variable too: RAID-z1 vs RAID-z2 vs RAID-z3, etc.  I do not consider this an open configuration option, though.  The vdev type is more a factor of how much redundancy I am willing to accept.  I’m not willing to put up with a 24-drive RAID-z1, so from that point of view it isn’t really variable for me, and I am not treating it as a primary variable. 

The next variable is the method of allocating the vdevs.  The expansion chassis has 2 SAS3 backplanes.  I also have 2 SAS3 HBA cards.  So I can stripe the vdevs between the cards or keep them contained within one card. 

The last variable I am considering is the record size.  Essentially this is the largest unit ZFS will read or write for a dataset.  Ideally it relates to the block size of the drive itself, but since these commands can be pooled or collected, the best value tends to be a factor of the type of file access: small random accesses want a small record size, while large sequential accesses want a big one.  The ZFS people describe the default 128KB record size as an uneasy compromise between the various users.  I wanted to get a general idea of what this looked like.
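A rough way to see the compromise: with copy-on-write, a small random write can force the whole record to be read, modified, and rewritten.  Sketching that rewrite amplification for a 4 KiB update (illustrative arithmetic only; real ZFS behavior has more moving parts):

```python
# Approximate rewrite amplification for a 4 KiB random write at various
# record sizes: the full record is rewritten to change 4 KiB of it.
write_kib = 4
for record_kib in (8, 32, 128, 1024):
    amplification = record_kib / write_kib
    print(f"{record_kib:>4} KiB record -> rewrites {amplification:>4.0f}x the data")
```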

I set up a series of tests to copy an installation directory from my local workstation to the remote pool.  The directory is of size 130 GB with 57K files.  File sizes vary from complete ISOs down to individual 4k install files, and should be a good mix of the files I would use for backup.  That was my primary thinking here.  This test would look like a system backup. 

This isn’t the only case for use, and I know there is a danger of over optimization for a specific case here.  But as long as I use these results in a general sense rather than just blindly picking the best performance I should be able to glean useful information. 

I thought a bit about it, and decided to disable the caching on ZFS with this command:

zfs set primarycache=none pool

I was concerned that the cache might be able to detect that I was copying the same files over and over and optimize for it.  Then I would be measuring how intelligent the cache was, not the performance of the pool.  I also considered copying the installation directory to a local drive on the VM, then copying to the pool from the native command line.  I didn’t like the idea of taking the network out of the loop; I am unsure whether using the CIFS protocol has effects of its own.  If I always use CIFS, through the network, then I have accurate comparisons between all of the data points.  I prefer to test in the same kind of environment that I will actually use the system in.

Okay, the first set of tests I did was to determine whether I should have arrays built across the backplanes or within them.  Before running this I was kinda thinking across, where more resources would be activated resulting in better performance.  I am going to step through the first pool creation in detail here, but I won’t be doing that for all future configurations.  

First I navigated to one of the IP addresses of the system in a web browser. They can be identified from the booted system screen.  There was a login page, so I logged in as root.

The very first time I created a pool, it appeared to hang on the pool creation.  It spun for a good 30 minutes.  I did some research and figured out that when ZFS first creates a pool with SSDs it will attempt to run the trim command on them.  Trim tells an SSD which blocks are no longer in use so the controller can erase them ahead of time, which allows some nice optimizations for new space allocations.  As a one time event, done only at pool creation, this is quite useful.  However, for my purposes, where I will be creating and destroying pools quite a bit during my testing, I do not want to wait for it to complete.  It can take upwards of 13 hours to trim a 4TB SSD.  To disable this I navigated to the System -> Tunables section and clicked add.

The specific setting I need to disable to prevent the trim command from being run on pool create is vfs.zfs.vdev.trim_on_init and it needs to be set to 0 (boolean for false).

Afterwards this is now set.

This needs to be done before pool creation to prevent the trim issue.  Next I navigated to the pool section and click on the add button in the top right.

I clicked create pool.

This brings up the pool creation screen.

I clicked on the add data at the bottom to add a second vdev.

I selected the first 12 hard drives that I wanted to create a vdev with.  Note, the drives are named daXX.  I had to go and physically take notes on the location and serial number of each drive.  The arrow expanded each entry so I could see the serial for each drive.  There is a pattern; in my case the first 12 were the left backplane, the next 12 were the right backplane, and they were in order from first row to last row.

Once selected, I clicked the arrow next to the first data vdev.

Then I selected the next set of drives and clicked the arrow next to the second data vdev.

At this point I could change the raid level for each vdev.  I am leaving these as the default recommended Raid-z3 for this demonstration, but when I am running tests, this is how I can change the RAID level I am using.

Next I typed in the name of the pool.

I scrolled down and clicked create pool.

I got a warning about deleting all data on these hard drives.  Then, I clicked confirm and create pool.

Here the pool is created.  Next I need to add a dataset.  Most of the protocols will operate on a dataset, which is an encapsulation mechanism that can allow a pool to be shared by multiple users, ACLs (Access Control List), or protocols without needing to use the entire pool.  It also allows tuning on the level of datasets for optimization purposes.

I clicked on the snack bar and add dataset.

Here I can see all of the options that can be set or changed.  I am just going to type in a dataset name for now, and click save.

Now I can see the dataset created.  Next I changed the recordsize to 32KB.

I clicked on the snack bar for the base pool and select edit options.

Then I expanded the options, found the recordsize near the bottom, and selected 32KB from the drop-down.

Then I clicked save.

Next I will want to enable CIFS sharing, so I can run my tests.  I navigated to the Windows Share (SMB) under Sharing.  I clicked add.

I navigated to the test dataset and selected it for the share.  I also typed in the name of the share.  CIFS uses a standard \\<server>\<share name> path to identify specific shares.  I clicked save.

When I clicked save it asked me to Configure ACL (Access Control Lists can also be configured from the snack bar of the dataset in the pool section).  I clicked configure new.

From this screen I selected the groups and users I want to be able to access this dataset.  I have pre-created a user for myself named nweaver and added them to the users group.  I selected the group option of Group, with the Users group selected below it.  I also selected a specific user of nweaver to give myself specific access.  Both are not necessary; I am just using them for demonstration.

I then deleted the everyone ACL. I don’t want this open to all users by default.

I made a small mistake here by not selecting the inherit option anywhere.  The system needs a default ACL, or nobody would have access.  I clicked save.  As an aside, do not use root for day-to-day access; root includes a lot of extra privileges.  Just create a user that can be further restricted later if need be, and keep root for configuration.

Next I needed to disable caching for the ZFS pool.  So I navigated to the shell, which gives me command line access, and I entered the command from before.

Alright, this is the method I used to create all of my pools.  It does leave a little bit of a chance that the trim optimizations won’t kick in, but this should be lost in the aggregate if it has any effect at all (I didn’t see one).  The only other item worth mentioning is that unlike in the creation and setup, for all of these tests I used the Solarflare SFN-5122F 10G fiber ports, not the Intel x550.  

The first set of tests compared two layouts of 2 x 12 RAID-z2 vdevs.  In the first option the two vdevs (of RAID-z2) were each divided with 6 drives on each backplane.  In the second option each vdev’s 12 drives were contained within a single backplane. 

Alright, I thought about how I wanted to run this test.  It is going to be similar to a sensitivity analysis [23], but I don't think I will end up with enough data points for a full-on analysis to yield what I want.  A simple comparative test, where I can figure out generally what is going on, should suffice.

The first big question I had was whether one sample was enough or whether I needed more.  Most benchmarks run ~5 tests and average them out to give a final number; that way, if there is a particularly bad or good result, it gets mixed in with the rest.  I ran a quick test of copying that folder (the directory is 130 GB with 57K files) from my local workstation to the remote pool.  Then I copied the resulting directory back to the local workstation.  Finally, I deleted the remote directory, so I ended up exactly where I started.  I had never used Windows PowerShell before this point.  A bit of googling and I came up with the following set of commands, run in order, to achieve this test:

  1. Measure-Command {Copy-Item -Path "D:\Individual Installs" -Destination "Y:\Individual Installs" -Recurse}
  2. Measure-Command {Copy-Item -Path "Y:\Individual Installs" -Destination "D:\Individual Installs-Copy-1" -Recurse}
  3. Measure-Command { Get-ChildItem "Y:\" -Recurse | Remove-Item -force -recurse }

I repeated this five times and got the following results (time is in seconds). 

Config         | Record Size | Copy To Time | Copy From Time | Delete Time | Notes
2 x 12 Raidz2  | 8K          | 1341.594     | 1544.802       | 820.798     | Array Per SAS Card
2 x 12 Raidz2  | 8K          | 1463.543     | 1505.588       | 791.648     | Array Per SAS Card
2 x 12 Raidz2  | 8K          | 1519.658     | 1319.205       | 796.781     | Array Per SAS Card
2 x 12 Raidz2  | 8K          | 1344.355     | 1257.201       | 834.563     | Array Per SAS Card
2 x 12 Raidz2  | 8K          | 1381.665     | 1309.285       | 813.428     | Array Per SAS Card

I calculated a standard deviation of ~78 seconds for the copy to test, ~122 seconds for the copy from test, and ~17 seconds for the delete.  That is on the order of 5% of the mean, which tells me I really do need to run 5 of these, not just one data point.  I kept these as my first data point.  I also wanted to run a test that changed record size, to make sure I can see other effects and that the results stay consistent when I change the record size variable.
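
The spread figures are just sample standard deviations over the five runs.  A quick sketch (Python here, purely illustrative and not part of my test harness) recomputes them from the table above:

```python
import statistics

# Times in seconds from the five 2 x 12 Raidz2 (8K record size) runs above.
copy_to   = [1341.594, 1463.543, 1519.658, 1344.355, 1381.665]
copy_from = [1544.802, 1505.588, 1319.205, 1257.201, 1309.285]
delete    = [820.798, 791.648, 796.781, 834.563, 813.428]

for name, times in [("copy to", copy_to), ("copy from", copy_from), ("delete", delete)]:
    mean = statistics.mean(times)
    sd = statistics.stdev(times)  # sample standard deviation (n - 1 denominator)
    print(f"{name:9s}: mean {mean:8.1f} s, stdev {sd:6.1f} s ({100 * sd / mean:.1f}% of mean)")
```

Copy-to and delete land right at the ~78 and ~17 second figures; the copy-from spread comes out a bit above my rounded quote, but the conclusion is the same: the run-to-run noise is big enough that a single sample would be misleading.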

So here I made a mistake.  I attempted to correct it later, but that created further mistakes so let’s just cover the initial one now and I will discuss the compounding nature later. 

In the beginning, I ran these commands manually, 5 times each.  I also knew that when running a test, I want to repeat the exact same sequence of commands, so that comparisons are fair.  What this doomed me to was sitting here and, every 25-30 minutes, popping back to the workstation to start the next section.  In between the 3 runs I would delete the local copy directory.

Some of you may be wondering about local access time.  How do I know that my limitation isn't how quickly the workstation can get the files?  I think this is a small effect, but it certainly exists.  The local drive is actually a RAID 0 of 6x NVMe Sabrent Rocket drives (PCIe 3.0, not 4.0).  I have measured its performance, and with metrics like this it really should push the bottleneck to the network.  Even if that is false, the effect should be consistent for every single iteration of the test, and the averaging of 5 runs would still yield useful results.

Since we are already talking about drive performance, I am sure some of you are wondering about my listing both the Intel DC-S4500 and the Intel D3-S4510 as drives I am using.  Most RAIDs want to use the exact same drive with the exact same firmware so that read and write characteristics are consistent.  This is good standard practice; that way, if there is a major performance difference, the array isn't limited by the slowest drive on every operation.  Here are the performance comparisons for these two drives:

[Spec comparison: Intel DC-S4500 vs. Intel D3-S4510]

The performance characteristics are close, really close.  So close that I am convinced they are the same flash technology underneath, with probably a minor controller update that makes sequential accesses better; that is the only big difference here.  I am not worried about slowing down to the DC-S4500 speed; this is akin to replacing an old drive with a slightly newer one of the same class.  That being said, if it is possible to keep drives of the same model in the same vdevs, I will endeavour to do so.

Back to the testing, below are the results for my first test.  The copy to and copy from color scales are calculated together.  I really didn’t want to get lost in minor performance improvements (1231 vs 1215 just isn’t that significant).

Two patterns emerged here.  First, copying to the remote system improves as the record size gets larger; I will explore this observation further later.  Second, save for the large record sizes, keeping the arrays within a single backplane and HBA yields a ~5% performance increase when copying from.

This was surprising to me, but should not have been.  I slightly misunderstood how ZFS works.  At first, I thought it endeavored to keep the blocks of its RAIDs and files on the same vdev.  I no longer believe that is true; I think it is always using all of the vdevs.  Thus, splitting a vdev across backplanes just makes the system wait for two HBAs and their drive writes to confirm rather than one, whereas keeping a vdev on one backplane lets that HBA manage the queues for complete writes to its array.  This is my best hypothesis for the observed performance improvement, though it is not a slam dunk; some of what I see could just be statistical variance.

The second set of tests I ran were to test all of the raid configurations I could.  I took the observation of keeping all the arrays within the same backplane for the next set of tests (I also re-used the results from the first test here as well).  

Alright, a few general observations I made here.  Adding more vdevs definitely improves performance.  The copy-from on the 3 x 8 Raidz2 is one that stands out.  I think this is because those vdevs span both backplanes: ZFS needs to pull the data through both HBAs to collate the files.  This is minor evidence in favor of my hypothesis from before.

My second observation is that there appears to be a performance peak around a 32K record size.  Too small a record size kills performance altogether; too large and writes get better, probably because more files can fit into one write, whereas reads get worse, again because each read needs to pull more data from the drive.

There is a drive block size, and the record size does not necessarily match it.  If ZFS needs to read a full 512K record to pull a file, that may end up being multiple separate drive reads.  This explains the performance hit here.  I think the block size for these drives is 4KB, but I have not been able to confirm this.  This hypothesis also fits the delete times being better for larger record sizes, since more of the deletes can be rolled up into one record delete instead of multiple.
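
To put rough numbers on the record size tradeoff: the test set averages about 130 GB over 57K files, roughly 2.3 MB per file, and the number of records needed per average file falls fast as the record size grows.  A back-of-the-envelope sketch (Python, illustrative only; the average file size is derived from the figures above):

```python
import math

# Average file size in my test set: 130 GB spread over 57,000 files.
avg_file = 130 * 10**9 // 57_000   # ~2.28 MB per file

for record_kib in (8, 32, 128, 512):
    record = record_kib * 1024
    records = math.ceil(avg_file / record)
    print(f"{record_kib:3d}K record size: ~{records} records per average file")
```

Fewer records per file means fewer round trips on writes, but each individual read pulls more data than the request may need, which matches the write-up/read-down pattern I observed.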

Now, to my mistake.  I was getting sick of staying near the workstation to run through this loop, so when I started the 1 x 24 Raidz3 I scripted up a function that would do all 5 tests at once.  This freed me up to go do other things.  Here is the code I used; it is quick and dirty and I wouldn't use it in production, so judge less please.

workflow experiment-run {
	$copy_to_1 = Measure-Command {Copy-Item -Path "D:\Individual Installs" -Destination "Y:\Individual Installs" -Recurse}
	$copy_to_1_ms = $copy_to_1.TotalMilliseconds
	echo "Copy-To-1   : $copy_to_1_ms"
	Start-Sleep -Seconds 20
	$copy_from_1 = Measure-Command {Copy-Item -Path "Y:\Individual Installs" -Destination "D:\Individual Installs-Copy-1" -Recurse}
	$copy_from_1_ms = $copy_from_1.TotalMilliseconds
	echo "Copy-From-1 : $copy_from_1_ms"
	Start-Sleep -Seconds 20
	$delete_1 = Measure-Command { Get-ChildItem "Y:\" -Recurse | Remove-Item -force -recurse }
	$delete_1_ms = $delete_1.TotalMilliseconds
	echo "Delete-1    : $delete_1_ms"
	Start-Sleep -Seconds 20
	$copy_to_2 = Measure-Command {Copy-Item -Path "D:\Individual Installs" -Destination "Y:\Individual Installs" -Recurse}
	$copy_to_2_ms = $copy_to_2.TotalMilliseconds
	echo "Copy-To-2   : $copy_to_2_ms"
	Start-Sleep -Seconds 20
	$copy_from_2 = Measure-Command {Copy-Item -Path "Y:\Individual Installs" -Destination "D:\Individual Installs-Copy-2" -Recurse}
	$copy_from_2_ms = $copy_from_2.TotalMilliseconds
	echo "Copy-From-2 : $copy_from_2_ms"
	Start-Sleep -Seconds 20
	$delete_2 = Measure-Command { Get-ChildItem "Y:\" -Recurse | Remove-Item -force -recurse }
	$delete_2_ms = $delete_2.TotalMilliseconds
	echo "Delete-2    : $delete_2_ms"
	Start-Sleep -Seconds 20
	$copy_to_3 = Measure-Command {Copy-Item -Path "D:\Individual Installs" -Destination "Y:\Individual Installs" -Recurse}
	$copy_to_3_ms = $copy_to_3.TotalMilliseconds
	echo "Copy-To-3   : $copy_to_3_ms"
	Start-Sleep -Seconds 20
	$copy_from_3 = Measure-Command {Copy-Item -Path "Y:\Individual Installs" -Destination "D:\Individual Installs-Copy-3" -Recurse}
	$copy_from_3_ms = $copy_from_3.TotalMilliseconds
	echo "Copy-From-3 : $copy_from_3_ms"
	Start-Sleep -Seconds 20
	$delete_3 = Measure-Command { Get-ChildItem "Y:\" -Recurse | Remove-Item -force -recurse }
	$delete_3_ms = $delete_3.TotalMilliseconds
	echo "Delete-3    : $delete_3_ms"
	Start-Sleep -Seconds 20
	$copy_to_4 = Measure-Command {Copy-Item -Path "D:\Individual Installs" -Destination "Y:\Individual Installs" -Recurse}
	$copy_to_4_ms = $copy_to_4.TotalMilliseconds
	echo "Copy-To-4   : $copy_to_4_ms"
	Start-Sleep -Seconds 20
	$copy_from_4 = Measure-Command {Copy-Item -Path "Y:\Individual Installs" -Destination "D:\Individual Installs-Copy-4" -Recurse}
	$copy_from_4_ms = $copy_from_4.TotalMilliseconds
	echo "Copy-From-4 : $copy_from_4_ms"
	Start-Sleep -Seconds 20
	$delete_4 = Measure-Command { Get-ChildItem "Y:\" -Recurse | Remove-Item -force -recurse }
	$delete_4_ms = $delete_4.TotalMilliseconds
	echo "Delete-4    : $delete_4_ms"
	Start-Sleep -Seconds 20
	$copy_to_5 = Measure-Command {Copy-Item -Path "D:\Individual Installs" -Destination "Y:\Individual Installs" -Recurse}
	$copy_to_5_ms = $copy_to_5.TotalMilliseconds
	echo "Copy-To-5   : $copy_to_5_ms"
	Start-Sleep -Seconds 20
	$copy_from_5 = Measure-Command {Copy-Item -Path "Y:\Individual Installs" -Destination "D:\Individual Installs-Copy-5" -Recurse}
	$copy_from_5_ms = $copy_from_5.TotalMilliseconds
	echo "Copy-From-5 : $copy_from_5_ms"
	Start-Sleep -Seconds 20
	$delete_5 = Measure-Command { Get-ChildItem "Y:\" -Recurse | Remove-Item -force -recurse }
	$delete_5_ms = $delete_5.TotalMilliseconds
	echo "Delete-5    : $delete_5_ms"
	echo "`n`n"
	echo "Summary"
	echo "Run 1 - CopyTo    : $copy_to_1_ms"
	echo "Run 1 - CopyFrom  : $copy_from_1_ms"
	echo "Run 1 - Delete    : $delete_1_ms"
	echo "Run 2 - CopyTo    : $copy_to_2_ms"
	echo "Run 2 - CopyFrom  : $copy_from_2_ms"
	echo "Run 2 - Delete    : $delete_2_ms"
	echo "Run 3 - CopyTo    : $copy_to_3_ms"
	echo "Run 3 - CopyFrom  : $copy_from_3_ms"
	echo "Run 3 - Delete    : $delete_3_ms"
	echo "Run 4 - CopyTo    : $copy_to_4_ms"
	echo "Run 4 - CopyFrom  : $copy_from_4_ms"
	echo "Run 4 - Delete    : $delete_4_ms"
	echo "Run 5 - CopyTo    : $copy_to_5_ms"
	echo "Run 5 - CopyFrom  : $copy_from_5_ms"
	echo "Run 5 - Delete    : $delete_5_ms"
}

As can be seen, this ends up creating 5 local copies that I can then clean up at the end.  However, it uses PowerShell workflows, which I am not all that familiar with.  It appeared to work right, and I didn't have any reason to suspect it had created a problem until I ran my next set of tests.

I then asked myself a question: with these tests run, what are the options I am actually considering?  I axed the 8K and 512K record sizes outright.  There are two more record sizes available from ZFS around 32K and 128K that I had not tested, 16K and 64K, so I went ahead and ran a test for those numbers.  Below are the results.

One thing about the previous test is that it was very internally consistent on a per-configuration basis.  The CopyTo always got better as the record size increased (save one exception on 4 x 6 Raidz2).  The CopyFrom went 8K < 32K > 128K > 512K.  This sort of repeating pattern gives me confidence that I am detecting meaningful results.  When I added 16K and 64K, it bounced around; the pattern is not discernible anymore.  It looks like 16K is better than 8K in general, but 64K bounces everywhere.

There are a couple of explanations that come to mind.  First, this could just be statistical variance; there isn't a huge difference between the 64K results and the tests around them.  Second, my switch to a function and a workflow may have actually altered the copy pattern.

I am not sure which it is, but I did ask myself: what is my goal here?  I am not looking to completely optimize this specific copy; I am trying to glean general principles.  This data is meaningful enough to support a couple of important points:

  1. It is probably better to keep the vdevs within an HBA.
  2. There appears to be an inflection point at around a 32K record size for these kinds of files, but I would not be sacrificing much performance by moving around this point either.
  3. The more vdevs I have the better.

This kinda matches what I was expecting in a very general sense.  I have found a few forum posts on optimization for all-SSD ZFS pools [24], which suggest I'm in the same range here.  They also bring up the point that lz4 compression works better with larger record sizes, which also matches my experience.  I found a very good quote here [25] from a ZFS developer:

“For best performance on random IOPS, use a small number of disks in each RAID-Z group. E.g, 3-wide RAIDZ1, 6-wide RAIDZ2, or 9-wide RAIDZ3 (all of which use ⅓ of total storage for parity, in the ideal case of using large blocks). This is because RAID-Z spreads each logical block across all the devices (similar to RAID-3, in contrast with RAID-4/5/6). For even better performance, consider using mirroring.”

That blog article is worth reading for ZFS performance in general.  My observations are very much in line with theirs: IOPS scale with the number of vdevs, so more vdevs, such as an 8 x 3 Raidz1, would give even better IOPS here.  In addition, they recommend mirroring for the best performance.  They also mention something I hadn't considered: the resilver.
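
The parity arithmetic behind that quote is straightforward: a RAID-Z vdev with p parity drives out of n spends p/n of its raw space on redundancy (in the ideal large-block case), while IOPS scale with the vdev count.  A quick sketch comparing the quoted shapes against my candidates (the layout list is my own selection):

```python
# (name, drives per vdev, parity drives per vdev, vdev count on ~24 drives)
layouts = [
    ("3-wide Raidz1",  3, 1, 8),    # 8 x 3 Raidz1
    ("6-wide Raidz2",  6, 2, 4),    # 4 x 6 Raidz2
    ("9-wide Raidz3",  9, 3, 2),    # hypothetical, needs 18 drives
    ("12-wide Raidz2", 12, 2, 2),   # my 2 x 12 Raidz2
    ("2-way mirror",   2, 1, 12),   # mirroring all 24 drives
]

for name, width, parity, vdevs in layouts:
    frac = parity / width   # share of raw capacity spent on redundancy
    print(f"{name:14s}: {frac:.0%} redundancy overhead, {vdevs} vdevs of IOPS")
```

The quoted narrow shapes all pay the same one-third capacity tax but buy many more vdevs of IOPS, with mirrors at the extreme end of that trade.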

Unlike RAID arrays striped purely for performance, ZFS spreads each file's data across every vdev (like my hypothesis above).  When recovering from a disk failure, it actually needs to read every bit of written data in the pool.  This is a cumbersome task at the best of times, and a dangerous one if drives are near their end of life.  This recovery rebuild is called a resilver.  (The related scrub operation reads all data periodically to detect corruption.)  Luckily, drive failure is much more of an issue for HDDs than SSDs.

However, the time it takes to resilver is a performance hit.  With mirrors, the resilver is just a straight copy underneath: ZFS doesn't need to do anything other than copy the surviving drive of the pair.  That is a boon here.

After considering this, I decided to use mirroring.  This is the most expensive option: since I need at least 37 TB of space to copy the current pool over, mirroring my 24 drives would only yield 41 TB.  I decided to order 7 extra drives: 6 to create 3 extra mirrors, and one more to serve as a hot spare.  This gives me 50 TB to work with for applications, and plenty of reliability.  It also lets me continue to expand in pairs, much cheaper than the other options.
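
As a sanity check on those capacity numbers (assuming 3.84 TB drives and capacities reported in TiB; both assumptions are mine, since I haven't restated the drive size here):

```python
DRIVE_TB = 3.84e12   # assumed drive size: 3.84 TB (decimal bytes)
TIB = 2**40

def mirror_capacity_tib(drives):
    # Two-way mirrors: half the drives hold data, the other half are copies.
    return (drives // 2) * DRIVE_TB / TIB

print(f"24 drives mirrored: {mirror_capacity_tib(24):.1f} TiB")   # the ~41 TB figure
print(f"30 drives mirrored: {mirror_capacity_tib(30):.1f} TiB")   # the ~50 TB figure
```

The 30-drive figure lands a touch above 50, which likely reflects rounding or filesystem overhead, but it is close enough to confirm the arithmetic.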

With this setup, my window for losing data is very small (the time it takes for the hot spare to be filled), and it would take not just a second random drive failure, but the failure of the exact mirror partner.  The odds are probably slightly worse than pure chance, since the failures may be correlated (both drives in a pair are used identically), but remembering back, SSD failures don't appear to correlate with use.  I think this should be okay.
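
For intuition on "the exact mirror failure": with 30 drives in 15 two-way mirrors, once one drive dies, only 1 of the 29 survivors is its partner.  A tiny sketch, assuming independent failures (which, as noted, is optimistic):

```python
def second_failure_loses_data(total_drives):
    # After one drive fails, data is lost only if the next failure
    # (before the hot spare is resilvered in) hits that drive's mirror partner.
    survivors = total_drives - 1
    return 1 / survivors

p = second_failure_loses_data(30)
print(f"P(second random failure hits the partner) = {p:.3f}")  # ~0.034
```

So even before accounting for the spare, a second failure during the rebuild window only loses data about 3.4% of the time under this assumption.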

They also recommend creating a backup of the pool.  After considering it, I decided I can keep some of the old 8TB drives from the old pool and convert them into a straight backup mirror of the critical data in the main pool.  This pushes the number of failures before loss of critical data up to something like 4 or 5.  That should be very rare, and I have an extra 8 SATA ports with nothing to use them for right now.

Overall I like this plan.  I think my own investigation of performance traits led me down the correct path, and although it is possible I could have simply looked this up, I am not certain I would have asked all the right questions without actually going through the process.  This is something I have thought about before: there is value in the engineering investigation.

Performance optimization is a skill, and I need to keep it in practice.  Not every question I am asked to solve has an exact answer out there already; in fact, most won't.  Here I learned both from first-hand experience and from examining the experiences of others, and that is the best possible result for a home lab.

The only question I have left is whether I want to mirror within the same HBA or split each mirror between the two so that I can survive an HBA failure.  I may even consider buying the bigger HBA; this is not a high-availability production deployment.  It is my home lab, and that extra PCIe slot might be more important than the extra redundancy two HBAs give me.


























