Server Components

Okay, so I’ve got a general idea of what I will need to make this whole home lab upgrade work; now I need to work through exactly how to build this virtualization server.  I always like to start with the basics.  All servers and computers need the following hardware: processor, motherboard, memory, video card, networking, and storage.  I don’t necessarily mean these all have to be separately purchased components; for instance, the networking is almost always built into the motherboard.  I just mean that this is a sort of checklist: if any of these aren’t accounted for, something is not going to work right.  I tend to start with the core three, processor, motherboard, and memory, and work my way out from there.

Processor:   

Well, as described in a previous blog post, this will be the EPYC 7702P, the 64-core/128-thread part I won on eBay.  This determines the kinds of motherboards available, since whatever board I pick has to support AMD’s server platform.

Motherboard:  

Now, motherboards aren’t quite as important as they used to be.  Both Intel and AMD long ago moved the real capabilities of the North Bridge chipset onto the CPU die.  That’s why there exists the concept of CPU PCIe lanes: the on-die PCIe controller drives those lanes directly, so they don’t need to be routed through a separate chipset.  Intel still likes to route a fair number of PCIe lanes through its chipset, but they’ll always provide a nice architecture diagram of how that works [1], and even modern Intel designs still send quite a few lanes that way.

I remember back when Intel first moved the memory controller on-die (the Nehalem generation, right after the Core 2 days) and beefed up the prefetch logic, which did an amazing job at reducing memory access latency.  Then we got integrated video on Intel (the Core i5 versions are pretty good for integrated graphics).  AMD’s recent Ryzen chips took this even further, moving not just the memory, video, and PCIe controllers but also the storage controllers on-die.  It’s why Ryzen supports RAID in the platform configuration.

We all see where this is ending up: one big SoC (System on a Chip).  It used to be a bit debated whether the SoC or the discrete CPU would win out.  Some people liked having their I/O hardware separated, but the speed gains from going on-die were too much to give up.  Or rather, the only way to keep seeing 10-15% improvements while IC (integrated circuit) fabrication technology stagnates is to start looking at things more holistically than just a faster CPU.  In addition, this on-die controller movement had the side effect of consolidating the I/O ports on the motherboard.  Maybe 15 years ago I remember having an AGP port, a PCI port, and a PCIe port all on the same motherboard.  That has since died out; the truth is that hardware interfaces are pretty darn standardized around PCIe nowadays.

Now, I actually have a suspicion that, other than socket pins, AMD and Intel could probably standardize pretty well if the companies weren’t trying to lock people into their platforms.  So much of what used to be done by the North Bridge is on-die that I think current motherboard design is a lot more about layout, wire routing, and actual board fabrication techniques than about needing a specific chipset to make it all go right.

So much of current chip production isn’t actually different.  The biggest difference between an Intel Core i5 and Core i7 is (or was; I’m not completely sure about the current generation, since they shuffle CPU naming around for marketing rather than design reasons) whether SMT works or not.

Essentially, chip manufacturing is a crapshoot.  They run the same manufacturing process for the whole lot.  One chip ends up with 4 working cores, one with 8, one with 12, and so on.  A 4-core with SMT working gets labeled an i7; without, an i5.  If the chip keeps working at 4 GHz it’s a 700; if it stops passing tests before 3.5 GHz it’s a 500.  This is what is meant by binning: they look for the chips that keep working as the frequency goes higher, stamp them at a certain level, and increase the price.  After all, a CPU is just a rock we tricked into thinking.  It is kind of insane this whole process works at all.  I can go into more detail if people are interested (I even manufactured some circuits on the old 486 integrated circuit process in college, so I have direct, hands-on experience).

Anyways, one thing that EPYC chips do well, and Intel is sorely lagging in, is providing PCIe lanes.  The 7702P has 128 PCIe 4.0 lanes.  That capacity was going to be necessary to get all the hardware I need onto this one machine.

A standard video card uses 16 PCIe lanes.  I want 3 of them, which means 48 PCIe lanes.  Intel has the ability to compete, sorta, by using their North Bridge, but their Xeons tend to max out at 48 PCIe lanes total [2].  North Bridge PCIe lanes are not as good as on-die PCIe lanes, and there is a limit to how many can be added.

Now, there are some cool chips out there that can act as a kind of PCIe switch and provide more lanes.  For example, I own an ASUS X99-E WS [3].  At first glance it would appear to have 112 PCIe lanes, plus more for the hard drives and so on.  The trick is a PLX 8747 chip (two of them, actually), which is a high-speed PCIe switch [4].  These boards are banking on not maxing out the PCIe bandwidth all at once.  That is almost certainly a good bet, which is why the switches are used somewhat frequently, though they do add ~30 watts to the system.  However, these kinds of workarounds shouldn’t really be necessary; Intel just needs to up its game.

On the same tangent, most high-speed NVMe drives nowadays require 4 PCIe lanes.  The M.2 standard is essentially an attempt to dedicate some PCIe lanes explicitly to storage rather than leaving them generally available, though the plethora of adapters that let those slots carry things other than straight-up NVMe shows how easy it is to use them more generally.  For my purposes, I will almost certainly want to use these for faster drives.  I am planning on at least 1 and possibly up to 6.
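
To keep myself honest about the lane math, here is a quick back-of-the-envelope tally.  The device list and counts are my own assumptions for this build (the NIC shows up later in this post), so treat it as a sketch rather than a spec.

# Rough PCIe lane budget for this build; lane counts per device are the usual figures.
devices = {
    "GPU (x16)": (3, 16),       # three full-size video cards
    "NVMe M.2 (x4)": (6, 4),    # planning on 1, possibly up to 6, fast drives
    "10G NIC (x8)": (1, 8),     # the Solarflare card discussed later sits in an x8 slot
}

total = 0
for name, (count, lanes) in devices.items():
    subtotal = count * lanes
    total += subtotal
    print(f"{name:15s} {count} x {lanes:2d} = {subtotal:3d} lanes")
print(f"Total: {total} of the EPYC 7702P's 128 PCIe 4.0 lanes")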

These NVMe drives are important because I don’t want to boot and run all the VMs off the storage array.  I could do that, and sometimes it is even recommended, since the storage array is quite redundant and, if built right, has the IOPS.  I personally just prefer the workstation VMs to have dedicated-ish resources here; more accurately, I want more IOPS than I would get through a storage system that could be overloaded.  That being said, I will almost certainly want to use some of these for caching from the bigger storage array as well.

Alright, with the PCIe considerations and drive considerations in mind, I want to examine the actual possibilities.

ASRockRack EPYCD8

https://www.asrockrack.com/general/productdetail.asp?Model=EPYCD8#Specifications

There are 4 PCIe x16 slots and 3 PCIe x8 slots.  However, these are all PCIe 3.0.  As previously mentioned, the EPYC supports PCIe 4.0, and I would very much like that for future proofing.  It’s a small thing, but these boards aren’t cheap as is.

It also has a dedicated port for IPMI, which is remote management.  This lets me turn the server on and off without having to go to the rack and hit any buttons.  I prefer a dedicated rather than a shared network port for this, as doubling up adds failure opportunities that would otherwise not exist.  For example, if someone is DoSing the shared port for any reason, all of a sudden access to basic power management to shut it down is gone too.  Not ideal.
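
For what it’s worth, here is roughly what using that dedicated IPMI port looks like from my desk; a minimal sketch that shells out to ipmitool, with a placeholder BMC address and credentials.

# Minimal sketch: control server power over the dedicated IPMI port via ipmitool.
# The BMC address, user, and password below are placeholders.
import subprocess

BMC = ["ipmitool", "-I", "lanplus", "-H", "10.0.0.50", "-U", "admin", "-P", "changeme"]

def power(action: str) -> str:
    """action is one of: status, on, off, soft, reset."""
    result = subprocess.run(BMC + ["chassis", "power", action],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()

if __name__ == "__main__":
    print(power("status"))   # e.g. "Chassis Power is on"
    # power("soft")          # graceful shutdown without walking to the rack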

Also, I would be remiss not to state that ASRockRack has somewhat of a cheap reputation.  I have used 2 of their motherboards before: Linux ran just fine, but Windows lost the ability to update and kept crashing.  I replaced every part but the motherboard and Windows still couldn’t update, so in my personal experience they are touchy.  I would kinda prefer to avoid ASRockRack if I can find something more acceptable.

ASUS KRPA-U16

https://www.asus.com/Commercial-Servers-Workstations/KRPA-U16/

This one has a few PCIe 4.0 slots.  The x24 is almost certainly for a riser card, as x24 is not a standard card connector.  A riser card is for a server that is not tall enough to fit a full-sized card upright; instead, it provides a set of slots that put I/O cards on the same plane as the motherboard to keep the vertical footprint low.  It’s for stacking a lot of them.  I can almost certainly use it as a standard x16.  Beyond that, though, there is still a lot of PCIe 3.0 here.

In addition it has an OCP 2.0 mezzanine slot.  This is a good thing!  It provides a low-profile way to add networking, which in my case would be a 10G fiber port.  It also has a dedicated remote management port.  The motherboard is the EEB form factor, which is aimed more at rackmount server chassis.  It is possible that with a big enough chassis I could make this work, and I do think this is a solid option.

Gigabyte MZ32-AR0

https://www.gigabyte.com/us/Server-Motherboard/MZ32-AR0-rev-10#ov

This looks like another solid option.  It has a lot of PCIe 4.0 slots available and an OCP 2.0 mezzanine slot.  The form factor is EATX, making it much better for what I am hoping to do.  It has a dedicated remote management port as well. 

This looks like a motherboard that would suit my needs.  The only real concern is that I would not be able to access all the PCIe slots.  Video cards almost universally take up two slots’ worth of space, so my 3 video cards would cover 6 of the PCIe slots overall, leaving only one for everything else.  That’s a bit risky, but I am willing to bet I could make it work.

Tyan Tomcat HX S8030 (S8030GM4NE-2T)

https://www.tyan.com/Motherboards_S8030_S8030GM4NE-2T

This is also a strong contender.  The Tyan has the dedicated remote management port I like.  It is the standard ATX form factor, which means I will have more options for my chassis (the formal term for a computer case).  It also has nice spacing around the bottom PCIe x16 slots, so I won’t lose access when plugging in those big video cards.  In addition, all of these PCIe slots are 4.0 slots.

It has plenty of extra hard drive connections.  It has two 10-gigabit ports, even if I slightly prefer 10G fiber over 10G RJ-45.  I think this one works as well.

Motherboard Conclusion

The Tyan Tomcat is the winner for me.  I strongly considered the Gigabyte, as I do have a preference for fiber over RJ-45, mostly because fiber is much lower power.  The real deal-breaker was that the Gigabyte was sold out everywhere and nobody had any idea when it would be back in stock.  I was willing to wait, but not for an “unknown time period”.  Even then, I still think I would prefer the Tyan; ATX is a better option for fitting many chassis.  However, I do not believe waiting for the Gigabyte would have been a bad choice either.  Ultimately I got caught by COVID delays on this motherboard anyway and had to wait several months for it to arrive.

I should note that at the time I did my research and purchasing there were no standard offerings from Supermicro.  They are my preferred manufacturer by a wide margin.  They have since come out with a motherboard I would have gone with, but it is still not generally available, and the best guess for when it could have arrived was July.  Given the shipping delays, who knows when, or if, it would actually have arrived.  For reference, I would have gone with this one:

Supermicro H12SSL-CT

https://www.supermicro.com/en/products/motherboard/H12SSL-CT

Now, I should say, this is not a slam-dunk decision for Supermicro either.  There are a lot of considerations here.  The Supermicro adds 2 x8 slots that would ultimately be covered by the video cards.  The Tyan replaces those x8s with 2 slimline SAS ports that are available for NVMe drives (which I will end up using later).  Therefore I am using basically every interface on the Tyan, while 2 of those interfaces would not be usable on the Supermicro.

It’s important to make sure I really have a grasp on the hardware I want to use and the various methods available for making them work.  I won’t go so far as to say I planned everything out perfectly, but I did think about how best to use the PCIe slots and leave myself with the most options from what I had available.  Like any computer project, app development or hardware, I think it best to plan for flexibility rather than a strict definition of exactly what to do.

Memory:

I think a brief overview of the various aspects of memory helps here.  There are about 5 properties I think are important to know when researching memory.  I am not going to say only 5, but if I get these right, I don’t think the other considerations will matter as much.  

First and foremost is the frequency.   My previous research into the Ryzen platform strongly suggests that the Infinity Fabric connecting the chiplets in EPYC/Ryzen processors wants higher-frequency memory [5][6][7].  The Infinity Fabric is probably best when it syncs exactly with the RAM clock.  Even though it is theoretically possible to unlink them, I think that is a foolhardy thing to do in a virtualization server.  That kind of fine-tuning belongs in optimizing one type of application; a game VM, storage, home automation, and more are not going to share the same optimization profile.  I will stick with the highest-frequency RAM I can reasonably acquire for this.

The second point is RAM timings, of which one commonly encounters four: CL (CAS latency), tRCD (row-address-to-column-address delay), tRP (row precharge time), and tRAS (row active time).  From a high level, deeply understanding each of these isn’t that important.  One wants them to be lower, but the actual memory response time is a function of both the timings and the RAM frequency.

Setting aside the Infinity Fabric, which I want to keep at a 1:1 ratio between the memory clock and the fabric clock, these timings can make a ~2-3% difference in overall performance, but the cost is extreme.  For a virtualization server I do not want to go with overclocked RAM; I want RAM I know will work and not fail for as long as reasonably possible.  Wikipedia has a more in-depth overview of timings [8].
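
As a concrete example of how frequency and timings combine (and where the 1:1 Fabric ratio fits in), here is the usual back-of-the-envelope math; the kits listed are just illustrative.

# First-word latency (ns) = CAS cycles / memory clock; memory clock = data rate / 2.
# At a 1:1 ratio, the Infinity Fabric clock (FCLK) matches that same memory clock.
def cas_latency_ns(data_rate_mts: int, cl: int) -> float:
    mem_clock_mhz = data_rate_mts / 2        # DDR transfers twice per clock
    return cl / mem_clock_mhz * 1000         # cycles * nanoseconds per cycle

for rate, cl in [(3200, 22), (3200, 16), (2666, 19)]:
    print(f"DDR4-{rate} CL{cl}: FCLK {rate // 2} MHz at 1:1, "
          f"~{cas_latency_ns(rate, cl):.2f} ns to first word")

A plain DDR4-3200 CL22 RDIMM lands around 13.75 ns to first word while keeping the fabric at 1600 MHz, which fits my preference for frequency over exotic timings.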

The third thing to be aware of is the difference between buffered and unbuffered memory.  Unbuffered memory is common consumer memory: what’s in my laptop, desktop, or anything I carry around.  These are called UDIMMs.

Buffered, also called registered, memory includes a register between the DRAM chips and the memory bus for the command and address signals.  This is done to boost the memory request signal and allow higher-density RAM chips, at the expense of one clock cycle of responsiveness.  These are referred to as RDIMMs.

Fully buffered modules (FBDIMMs) are something I am aware exist but have no actual experience with.  They use too much power and generate too much heat, so they are not used often, though they can increase capacities by quite a bit.

Next are load-reduced modules (LRDIMMs), which add another buffer on the data lines as well.  This is a sort of power compromise between RDIMM and FBDIMM.  I tend to use RDIMMs or LRDIMMs, but these require specific motherboard support.

The fourth memory property is ECC, which stands for Error Checking and Correction.  In standard desktop memory, there is no error checking.  When a fault occurs, it has to be detected externally and accounted for by the software or the memory controller.  Faults occur for all kinds of reasons, such as stuck-at-0 or stuck-at-1 errors.  These mean that a transistor is permanently on or off, or that some interaction within a more complicated logic circuit results in the circuit always outputting 1 or 0.

The most common way these happen is through hard faults, like electromigration, where a wire within the chip has physically developed a void through which no electron flow is possible.  This results in the wire not working and the circuit being permanently “stuck” at a particular value.

Another common form of this is when the transistor channel has been weakened by use to the point that the leakage current is high enough for the output to always flow, regardless of whether the gate is on or off.  If that doesn’t make sense, don’t worry; it just means a broken transistor.  I did a brief stint at Intel working on circuit aging, and these are the kinds of effects I worked on, all of them hard faults.

Another kind of error is the soft fault, like an alpha particle strike.  This is when alpha radiation hits a transistor or storage cell and deposits a bit of energy, and that interaction results in the circuit flipping from one to zero or vice versa.

This is a much bigger issue in outer space, where there is no atmosphere to absorb radiation.  NASA will use TMR (triple modular redundancy) when building circuits to combat this: they put in three of every chip and use the majority result for each operation.  Tripling the hardware is expensive, as is planning around the extreme environment of outer space.

Anyways, alpha strikes happen all the time on Earth as well.  Most of the time they hit a place that isn’t vital to the operation of the system, such as one pixel being wrong for one frame.  Mostly, who cares; it corrects itself before a human notices.  That is called a masked effect.  Think about an alpha strike flipping a bit in someone’s bank account balance, though, and it’s obvious this quickly becomes a concern.

Hard faults are a bigger concern in general.  They also impose a lifespan on a processor: too many hard faults and it stops working.  Smaller fabrication nodes make hard faults more likely, both during fabrication and over time.  That is why my old 486 processor still runs fine over 20 years later; it was built on a giant (roughly 1 micron) process node.  The current 7 nm parts are far smaller and are expected to last only some years of use (Intel tends to target ~7 years).

Digression over, but ECC uses a form of Hamming code [9] to detect and correct errors, both hard and soft.  Hamming code is a pretty cool arrangement of parity bits that lets the system not only detect an error in the bits (which a single parity bit can do) but also locate exactly where in the bit-vector the error is, so it can be corrected.  This comes at the expense of extra storage: an ECC DIMM is 72 bits wide instead of 64, with the extra chips holding the check bits.  That trade is important for servers, where adding a few extra storage bits and Hamming calculations is preferable to a silent error that takes a long time to detect.
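
For the curious, here is a toy Hamming(7,4) sketch.  It is not what a memory controller actually runs (real ECC DIMMs use a wider SECDED code over 64-bit words), but it shows the locate-and-flip idea.

# Toy Hamming(7,4): 4 data bits -> 7-bit codeword that can locate and fix a single flipped bit.
def encode(d):                       # d = [d1, d2, d3, d4]
    word = [0] * 8                   # positions 1..7; index 0 unused
    word[3], word[5], word[6], word[7] = d
    for p in (1, 2, 4):              # parity bit p covers positions whose index has bit p set
        word[p] = sum(word[i] for i in range(1, 8) if i & p and i != p) % 2
    return word[1:]

def correct(code):                   # code = 7 bits, possibly with one bit flipped
    word = [0] + list(code)
    syndrome = 0
    for i in range(1, 8):
        if word[i]:
            syndrome ^= i            # XOR of set positions = error location (0 means clean)
    if syndrome:
        word[syndrome] ^= 1          # flip the bad bit back
    return word[1:], syndrome

codeword = encode([1, 0, 1, 1])
corrupted = list(codeword)
corrupted[4] ^= 1                    # simulate a soft fault flipping position 5
fixed, where = correct(corrupted)
print(codeword, corrupted, fixed, f"(error located at position {where})")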

The last important property, to me, is an understanding of the memory manufacturers.  There are three primary manufacturers of server memory: Hynix, Micron, and Samsung.  My order of preference is Samsung > Micron > Hynix, though the preference is slight.  Hynix is NOT bad, though some of my peers may disagree, and I would not feel bad purchasing it.  That being said, I do believe that, in general, one gets what one pays for; if I buy a cheap car, I should not be surprised if it doesn’t last a decade.

Alright, with everything I just covered in mind, I can define what I am looking for.  I want DDR4-3200 or faster, with whatever timings I can get, because I prefer frequency.  It should be either RDIMM (which is slightly faster) or LRDIMM, with ECC.  Given that my motherboard has 8 slots and I want ~424 GB of RAM, these will need to be 64 GB DIMMs.  Also, me being who I am, I wanted Samsung or Micron manufacturing.  I ended up finding 8x Micron MTA36ASF8G72PZ 64 GB modules [10] on eBay.  They were not cheap, but they ticked all the right boxes.

Video Cards:

This was the easiest one.  I have two NVIDIA GeForce GTX 1080s I bought ~4 years ago.  Part of the goal of this project is to see if I can get PCI passthrough for these working correctly, so I will use these two.  That being said, there is a minor complication for this section: my streaming setup blinks and sometimes goes to static on the bedroom TV.

I recently replaced the old streaming media PC, upgrading it from a Core i7-6700K / C236-based system to a Ryzen 7 3700-based system.  This involved a complete overhaul of everything except the GTX 1080.

The old system got stuck and couldn’t update Windows at all.  It failed for over 3 years to install a single update, and the weekly chore of letting it attempt to do so, fail, and roll back was getting quite annoying.  I suspected the ASRockRack C236-based motherboard, and indeed the new system has no issues updating.  However, the blinking remained.  At this point the only thing left to change is the GTX 1080.  Therefore, looking at my original requirements, I think the GTX 1080 will go to my partner’s workstation, and I will get a new video card.

My usual plan is to just buy a new video card and let my old one become the next hand-me-down.  Checking the release schedule, it looks like both NVIDIA and AMD have new architectures planned this year, which serves my purposes.  So for now, I’ll just use the two GTX 1080s, wait to see which of this year’s cards is best, and then let the 2080 Ti become a hand-me-down to the media streaming VM.  I would like to end the blinking if possible.  I suspect it is an EDID issue related to 4K / 1080p streaming playback, so I have asked for a ticket to be opened with Just Add Power about it.

Network Cards:  

As I discussed before, I have all the setup for a 10G fiber network.  This is based on multi-mode fiber, as this is a small deployment in my rack room, not a large one; single-mode fiber is best for 300+ meter runs, multi-mode for less than that.  I based this on my old high-frequency trading days, so I mostly have Solarflare SFN5122Fs.  These are definitely getting old at this point, and their primary selling point was TCP offloading: the TCP processing is offloaded to the user application, so there is no memory copying from kernel space to user space, which speeds things up by skipping the kernel entirely.  This isn’t an important detail here; I just think it’s interesting, and I enjoyed my time at the HFT firm.

One of my 6 SFN 5122Fs.

One thing to note: HFT companies will often overclock and burn out their hardware.  They have no problem whatsoever doing even crazy stuff to eke out extra frequency and speed; it is just the cost of doing business.  This means that when they need to upgrade, which is frequently, the old hardware finds its way to eBay for dirt cheap because the components don’t have any warranty left.

I would caution against buying a processor they are reselling, as it has likely been burned out pretty hard.  However, networking equipment doesn’t “overclock” the way a CPU does (and that is part of the reason the industry moved to FPGAs).  One can potentially overclock the PCIe bus, but the fiber and SFPs don’t get pushed that way.  So when an HFT firm sells networking equipment, it is used, but it is not burned out in quite the same way a processor is.  I bought 8 of these for ~$30 each 5 years ago and have never had a single problem.

I briefly considered updating to a Solarflare SFN5322F or even one of their flagship options.  But the reality is that the improved latency isn’t really important to me; I don’t much care about the difference between 2.5 microseconds and 2.0 microseconds for my home lab.  I also don’t need the built-in timing chips.  In HFT they care about something called tick-to-trade: the time from when the packet of market data that triggers an order hits the first company-controlled network switch until that order leaves the last company-controlled switch.  To help with this, Solarflare added timing options to provide a uniform clock across the cards.  It is useful for double checking, but most of the big companies build out GPS-based precision clocks in their networks for this.

In addition, if I replaced these, I might have to purchase new SFP+ transceivers to make the new card work.  The ports on a card aren’t usable on their own; they are SFP+ cages, and they require SFP+ transceivers before networking equipment can actually be plugged in.  I have a couple dozen that are certified by Solarflare and work quite well, and I don’t want to buy more in case they don’t work with whatever new card I would buy.

Ultimately, I will just stick with what I am currently using.  10G is still considered pretty high speed for a home lab.  I could try to jump up to 40G, but the networking switches for that are still pretty expensive.  I updated my networking equipment a couple of years ago, which I may turn into another post at some point.  Suffice it to say, outside the datacenter it is pretty difficult to encounter 40G; 10G is fine for now.

The motherboard also has a pair of 10G Intel X550-based RJ-45 copper ports.  This is more of a fallback; I don’t know how much equipment I will need to stuff into this server.  The SFN5122F uses an x8 slot despite being PCIe Gen 2.0, which is a bit of a waste of a slot in my opinion.  I may see if I can get away with just the built-in X550s.

There are other options out there, such as Mellanox, Aquantia, or even InfiniBand gear.  I have considered working with these as well, but Intel still dominates the baseline RJ-45 copper ports, and for good reason: their drivers work well, as they have put serious effort into maintaining their standing.  It is even difficult to find AMD systems that don’t use Intel networking.  The more exotic kinds of networking require extra equipment to get working.

At the end of the day, I will have two SFP+ ports, two 10G RJ-45 copper ports, two 1G RJ-45 copper ports for the server, and one 1G RJ-45 dedicated remote management port.  This should be enough networking, and I might be able to pare it down a bit.

Storage:

This is a complicated section.  I have thought about this and I think I will cover the hardware of the storage VM subsystem in its own series of posts, including what I plan to build out and problems encountered.  Here, I will cover only the virtualization server’s dedicated storage.  

One thing to note: with the advent of PCIe 4.0, dedicated M.2 slots are being updated to PCIe 4.0 as well, which opens up several possibilities.  In addition, there are several MiniSAS ports on the motherboard for storage options.  I want to briefly cover all of the options worth considering.

The first kind of storage is the hard disk drive (HDD).  This is the old-style spinny-disc, as I call it.  Its primary purpose is to provide a great deal of storage capacity, but it is quite slow (access times on the order of 1-10 ms).

The primary manufacturers here are Seagate, Western Digital, and Hitachi.  Hitachi focuses on the high end, which is a good bet for servers.  Western Digital tends to focus on the mid-range, with a large line of products aimed at both consumers and enterprise; I would say they aren’t quite as high quality as Hitachi, but they are solid.  Seagate is probably the lowest quality of the three.  I do think their recent line of hard drives is more reliable than their reputation suggests, but they did have a number of years where they deserved that reputation.

Backblaze does a great study on this where you can get some decent real-world numbers on failure rates [11].  They do not cover all drives or all firmware options, so I tend to use it as more of a gut check than anything else.  It does conform to what I would typically expect: Hitachi > Western Digital > Seagate.  I think people believe there is a larger difference between them than actually exists, and I have used and will continue to use Seagate.  I probably would not use them in this case for the primary drive.  However, for backing things up in a RAID?  Yes, easy choice.

The next kind of storage is the solid state drive (SSD).  This is more recent, but not cutting edge anymore.  The SSD has no moving parts, unlike the HDD, which gives rise to its own set of issues.  They both tend to use the same interfaces (SATA III / SAS3) and have comparable bandwidths, although that is a result of the interface, not necessarily capability.  However, SSDs have MUCH lower latency: a typical SSD responds in the 10s to 100s of microseconds, not the 1-10 milliseconds of an HDD.  This is a serious performance increase.  It is also why, a decade ago, everyone said the best upgrade for a more responsive computer was to replace the HDD with an SSD.

The defining technology of an SSD is the kind of cell it is built from.  SLC, or single-level cell, stores one bit of information per cell.  MLC, multi-level cell, stores 2 bits per cell; TLC, triple-level cell, stores 3; and the recently introduced QLC, quad-level cell, stores 4.  Fewer bits per cell is generally thought of as higher performance, which is true, but not by nearly as much as people think.  They also differ in write endurance, which for an SSD means how many times a particular cell can be rewritten; more bits per cell reduces endurance.  However, these numbers are rarely important, as they are high enough that they don’t get hit even in heavy enterprise applications.

SSD failure rates are a bit more difficult to come up with hard data on; I have not been able to find anything nearly as good as the Backblaze numbers.  I can say this: SSD failure rates are not a function of use.  I’ll repeat that, as it is a common misconception.  SSD failure rates are NOT A FUNCTION OF USE [12][13].  Google has stated this multiple times.  There is a common belief that SLC, MLC, and so on represent increasing failure rates; this does not appear to be true in practice.  It may be true in theory, based on what manufacturers rate drives for, but that isn’t as important.

That being said, I am not of the opinion that SLC has no advantages over MLC or QLC.  I do think SLC is, in general, a tad more reliable in common usage: lower error rates, less corrupted data, fewer unreadable sectors.  Its addressing can also be faster, as there is less to address.  MLC and QLC make the same trade that large RAM chips did, adding capacity through more addressability at a small cost in speed; they make larger drives cheaper for a small performance cost.

It is also important to discern whether an SSD is targeted at enterprise or consumer use cases.  I think the enterprise design philosophy also makes for more reliable operation.  A good example is the power-loss write protection enterprise SSDs have: backup power (a bank of capacitors or a small battery) that ensures the SSD’s cache is flushed to non-volatile storage before power is fully lost.  It is these kinds of design decisions that make them more reliable.

The overprovisioning they have does not seem to decrease the failure rate.  However, overprovisioning (extra storage cells beyond what the drive is rated as having) does let the drive recover better from unreadable sectors and generally improves performance, because the controller has more spare flash to work with.  For example, a write operation on flash is frequently a read -> modify -> write cycle; having a blank section, or a section whose contents no longer matter, lets the controller do a plain write instead of the full read -> modify -> write chain.

Last of the modern storage options is the NVMe SSD.  The first generation of SSDs basically reused the HDD interface and simply had faster access times.  The biggest issue with this is that the interface was designed with spinning discs in mind: SATA caps out at 6.0 Gbps and SAS at 12.0 Gbps.  These are very slow compared to what flash can provide, let alone RAM-buffered flash drives.

The NVMe protocol is specifically designed to get around the bottlenecks of the SATA interface by using PCIe lanes directly.  In practical terms, that currently gives about 25 Gbps read and about 10 Gbps write, and it will continue to increase because the maximum bandwidth of a PCIe 3.0 x4 link is ~32 Gbps.  The biggest limitation on this protocol?  The lack of available PCIe lanes in modern systems, hence why AMD is pushing lane counts so hard.
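
The ~32 Gbps ceiling falls straight out of the link math (per-lane transfer rate, times lane count, times encoding efficiency).  A quick sketch:

# Usable PCIe bandwidth = per-lane transfer rate * lanes * encoding efficiency.
GENERATIONS = {                     # GT/s per lane, encoding efficiency
    "PCIe 2.0": (5.0, 8 / 10),      # 8b/10b encoding
    "PCIe 3.0": (8.0, 128 / 130),   # 128b/130b encoding
    "PCIe 4.0": (16.0, 128 / 130),
}

def bandwidth_gbps(gen: str, lanes: int) -> float:
    rate, eff = GENERATIONS[gen]
    return rate * eff * lanes

for gen in GENERATIONS:
    gbps = bandwidth_gbps(gen, 4)
    print(f"{gen} x4: {gbps:5.1f} Gbps (~{gbps / 8:.2f} GB/s)")

So a PCIe 3.0 x4 drive tops out just under 4 GB/s, and PCIe 4.0 doubles that, which is part of why I care about having 4.0 slots.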

The old SATA interface and AHCI protocol are likely nearing the end of their life.  Or rather, I think they will end up being used only where a large amount of bulk storage is the point, like my storage VM!  Anyways, the NVMe SSD is a relatively new technology, probably hitting the mainstream around 2018.

One very large advantage of NVMe over the SATA SSD is IOPS, which stands for Input/Output Operations Per Second.  This is very important for applications like databases, where most data being fetched and modified is small, not large.  Another item with a similar profile?  The operating system.  It has a large number of small library files that constantly need to be loaded and unloaded from memory.  In addition, there is a reason random 4K read/write is a standard test for storage drives: 4 KB is the memory page size for x86.  When a system runs out of available physical RAM, it starts spilling memory to disk in what is called a page file.  Linux is more explicit about it than Windows, giving it its own partition, but make no mistake, Windows does it too, and when it writes, the writes are 4K-sized.

This large number of small operations is why SSDs, and now NVMe SSDs, are vastly preferred as OS drives.  Theoretically, NVMe SSDs achieve ~500,000 IOPS versus ~90,000 IOPS for SATA SSDs.  In practical terms these are never achieved; they are demonstration benchmarks.  I treat NVMe SSDs as achieving about 2-5x the IOPS of SATA SSDs, and really, that’s all I need for comparisons here; I don’t need to know exactly how much better.
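
To make “4K random read” concrete, here is a crude single-threaded probe I might run against a drive.  The test-file path is a placeholder, the page cache will flatter the numbers unless you work around it, and a real tool like fio is the right way to benchmark; this is a sketch of the access pattern, not a benchmark.

# Crude 4 KiB random-read probe (single thread, queue depth 1). Not a real benchmark.
import os, random, time

PATH = "/tmp/iops_test.bin"          # placeholder; put this on the drive under test
SIZE = 1 << 30                       # 1 GiB test file
BLOCK = 4096                         # matches the x86 page size discussed above

if not os.path.exists(PATH):         # build the test file once, 1 MiB at a time
    with open(PATH, "wb") as f:
        chunk = os.urandom(1 << 20)
        for _ in range(SIZE >> 20):
            f.write(chunk)

fd = os.open(PATH, os.O_RDONLY)
deadline = time.time() + 5
ops = 0
while time.time() < deadline:
    offset = random.randrange(SIZE // BLOCK) * BLOCK
    os.pread(fd, BLOCK, offset)      # one random 4 KiB read
    ops += 1
os.close(fd)
print(f"~{ops / 5:,.0f} reads/second at 4 KiB, queue depth 1")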

There are literally hundreds of SSD/flash manufacturers.  I don’t want to go over all of them, but I will list the ones I trust to do a good job (and one I don’t, with the reason).  Samsung, Crucial, Corsair, Toshiba, Intel, and Micron are solid.  I am sure I left out some good ones, but those are the big ones that come to mind.  I would avoid SanDisk, especially their SkyHawk line, because they are not consistent in manufacturing.  It appears they grab whatever they can from secondary sources and rebrand it, so I cannot expect that, even when buying the same model, I would end up with identical hardware.  That is an issue for RAID arrays.

Alright, looking at my motherboard selection, I have two M.2 slots.  I think this is the better interface for all of the VMs, as it offers the most IOPS.  With two of them, I would like to mirror them in RAID 1 for redundancy.  I would also set up a backup service, but that is probably better discussed later.  Given this, it’s time to do some research.
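
When the drives arrive, the mirroring itself is straightforward.  Here is a hedged sketch of one way to do it on Linux with mdadm; the device names are placeholders, and the hypervisor I end up on may well have its own preferred mechanism (a ZFS mirror, for example).

# One way to mirror the two M.2 drives: Linux software RAID 1 via mdadm.
# Device names are placeholders; double-check them before running anything destructive.
import subprocess

DEVICES = ["/dev/nvme0n1", "/dev/nvme1n1"]   # the two PCIe 4.0 M.2 drives

subprocess.run(
    ["mdadm", "--create", "/dev/md0", "--level=1", "--raid-devices=2", *DEVICES],
    check=True,
)
# /dev/md0 can then be formatted and used as the VM datastore;
# watch /proc/mdstat for the initial resync.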

The truth is I use a few review sites for this kind of research; AnandTech, ServeTheHome, and StorageReview are my favorites.  I found a general overview of the space [14] and saw a new brand I had never heard of, Sabrent, being talked about favorably.  There were only three brands offering PCIe 4.0 M.2 drives at the time: Sabrent, Corsair, and Gigabyte.  The Sabrent was getting favorable reviews, and there was some hype about the new Phison E16 controller chip.  I wanted to try the new PCIe 4.0 interface to maximize the potential IOPS.  I considered going with an older, more enterprise-grade SSD; ultimately I want to go enterprise on the storage VM, but RAID 1 mirroring these for the shared storage space ought to be enough for my purposes.  I went with the Sabrent.

Chassis:

Alright, when it comes to the chassis I need to clarify something.  Most server chassis include the power supply units, which are, in fact, custom designed for the chassis they go into.  There are a few standard power supply form factors, such as the standard ATX power supply, each of which aligns with a particular type of chassis.  However, with the kind of system I am building, I very much doubt that a standard PSU size will work.  Therefore I will select the chassis first and worry about the power supply second.

Now, how much power do I actually need?  There is a nice power supply calculator on pcpartpicker.com where one can add any particular consumer component and it will estimate the power needed.  It is difficult to do this for an EPYC-based system, since pcpartpicker focuses on consumer parts, but I can use it to gather estimated draws for specific components.

Starting from the top, the EPYC 7702p has a TDP of 200 watts [15]. 

The memory is difficult to estimate; neither the datasheet nor the product description page provides hard numbers.  On pcpartpicker there is a DDR4-3200 64 GB stick (the Corsair Vengeance LPX) with an estimated draw of 14.5 watts, so a reasonable estimate for my 8 sticks (512 GB) is 116 watts.

The motherboard has the same issue, as its actual power consumption isn’t in the datasheet.  Based on a few of the high-end X570 motherboards on pcpartpicker, it should be around 100 watts.

Next, I looked into the NVIDIA GTX 1080.  It appears to draw around 180 watts when active [16], although NVIDIA recommends budgeting 230 watts per card [17].  I calculated that by subtracting the single-card power supply recommendation from the SLI recommendation.  Not exact, but it should be pretty darn close.

Next, the Solarflare SFN5122F is 4.9 watts [18], pretty darn low power, but most network adapters are at this point.

The Sabrent Rocket 4.0 2TB is 10 watts according to pcpartpicker.

I’ll also skip ahead a little on the storage VM and include the power for its 2 HBAs here, at 13 watts each [19].

At the end, I also need to include the efficiency loss of the power supply, since that determines how much capacity is actually available.  It is also most likely the source of the gap between NVIDIA’s recommendation and the observed power draw.  So the calculation is:

Part                               Power Consumption
EPYC 7702P                         200 W
Tyan S8030                         100 W
Micron DDR4-3200 (8x 64 GB)        116 W
NVIDIA GTX 1080                    230 W
NVIDIA GTX 1080                    230 W
SolarFlare SFN 5122F               5 W
Sabrent Rocket 4.0 2TB             10 W
Sabrent Rocket 4.0 2TB             10 W
HBAs (2x)                          26 W
Total                              927 W
Assuming efficiency loss (85%)     1090 W

So I will be looking for at least an 1100-watt power supply.  There is a bit of fudge factor, as I am not 100% certain whether the listed figures are pre- or post-PSU-efficiency numbers.  If I want to add another video card to this system, I will need approximately 1450 watts, which might be a problem.  Now, I should note these are all maximum values, and the actual draw will scale down to what the system really needs at any point.  Between EPYC shutting off unused cores, the 1080s not running at full power all the time, and a host of other factors, I should not expect to hit that maximum.  That being said, if I don’t plan right, one brief spike and the whole system goes down.
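
The same arithmetic in script form, including the what-if for a third card.  The ~300 W figure for a future card is my assumption, which is roughly how you get to a ~1450 W requirement.

# Power budget from the table above, divided by assumed PSU efficiency to size the supply.
parts = {
    "EPYC 7702P": 200, "Tyan S8030": 100, "8x Micron DDR4-3200": 116,
    "2x GTX 1080": 2 * 230, "SolarFlare SFN5122F": 5,
    "2x Sabrent Rocket 4.0": 2 * 10, "2x HBA": 26,
}
EFFICIENCY = 0.85                    # assumed PSU efficiency

draw = sum(parts.values())
print(f"Component max draw: {draw} W -> PSU of at least {draw / EFFICIENCY:.0f} W")

# What-if: a third video card later, assuming ~300 W for a next-generation part.
print(f"With a ~300 W third card: {(draw + 300) / EFFICIENCY:.0f} W")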

846E1-R1200B

Okay, now I made a mistake here.  I did not do that calculation at first.  I figured 1100 watts should be enough, as it matched a rather cheap chassis I found on eBay.  Following through with that thought, I put out a couple of offers on a CSE-846E1-based chassis [20] and a 748TQ-R1400B [21].  I ended up winning them both.  I used neither.

The first one worked fine, but the power supplies were just too loud for me.  The second didn’t work at all: the power supplies never turned on.  Heck, they only sent me one of the two promised power supplies, and only after I complained did they send the second; it never turned on either.  Unfortunately, I didn’t figure that out until much later because the motherboard was delayed a couple of months.  Luckily these were pretty darn cheap because they were used and old.

748TQ-R1400B

But I should have calculated the power requirements first, and I should have realized one more thing about myself.  Essentially: keep your customer’s priorities foremost in the design process.  In this case, that customer is me, and I hate noise.

I’m the guy who moved all of my computers into a rack and paid people to wire a Thunderbolt 3 cable, two DVI 1.4 cables, and two USB 3.0 cables through my house so I didn’t even have to be in the same room as the computers.  This is even after I replaced every fan in every system, server or not, with Noctua fans of the appropriate size.  I buy giant CPU heatsinks to make doubly sure I get a 140 mm fan that can be swapped for a Noctua and made very quiet.

I would rather buy a giant room fan than listen to the aircraft-carrier fans that come in standard server chassis.  Data centers don’t really care about noise: they pump in a lot of cold air, build what are basically air tunnels through the building structure, and use small high-RPM fans to push air through systems that are anything but open-air designs.  Data center technicians can attest to this (as can I; I’ve been in about a dozen different data centers and supercomputing facilities).  These chassis fans were audible through multiple walls.  This was a mistake born of not knowing myself.

745BTQ-R1K28B-SQ

After realizing the depth of these compounded errors, I ended up giving a buddy the 846E1 chassis so he could update his media server.  Then I did the above calculation and bought a 745BTQ-R1K28B-SQ [22].  This is from Supermicro’s super-quiet (SQ) series of workstation chassis.  I have an old one from this series for my workstation as well; this series is my favorite by a wide margin.

I looked for a copy of my workstation chassis, which has a 1400-watt redundant quiet power supply, but unfortunately I could not locate one for sale.  Barring that, this is the largest power supply in the SQ series I could find.  It might not be enough power for the final build, and I am considering scaling the project back, maybe cutting something.  However, power requirement calculations and peak power aren’t exact; it is entirely possible the 1280 watts I have here is enough.

I did try to swap with my workstation chassis; unfortunately, the workstation motherboard didn’t fit in the new chassis, and I don’t see a solution at this point.  Maybe I will run across another of my workstation chassis at some point; for now I will simply plow forward with what I have.

References

[1] https://www.anandtech.com/show/15723/the-intel-z490-motherboard-overview

[2] https://ark.intel.com/content/www/us/en/ark/products/205684/intel-xeon-platinum-8380hl-processor-38-5m-cache-2-90-ghz.html

[3] https://www.asus.com/Commercial-Servers-Workstations/X99E_WS/

[4] https://www.broadcom.com/products/pcie-switches-bridges/pcie-switches/pex8747

[5] https://www.youtube.com/watch?v=38fOF41emtI&t=19s

[6] https://www.youtube.com/watch?v=I0zGeb9M8zM

[7] https://www.youtube.com/watch?v=iHJ16hD4ysk

[8] https://en.wikipedia.org/wiki/Memory_timings

[9] https://en.wikipedia.org/wiki/Hamming_code

[10] https://www.micron.com/products/dram-modules/rdimm/part-catalog/mta36asf8g72pz-3g2

[11] https://www.backblaze.com/blog/hard-drive-stats-for-2019/

[12] https://www.zdnet.com/article/ssd-reliability-in-the-enterprise/

[13] https://www.zdnet.com/article/ssd-reliability-in-the-real-world-googles-experience/

[14] https://www.anandtech.com/show/15269/anandtech-year-in-review-2019-solid-state-drives

[15] https://www.amd.com/en/products/cpu/amd-epyc-7702p

[16] https://www.extremetech.com/computing/310217-how-much-power-do-gpus-actually-consume

[17] https://www.realhardtechx.com/index_archivos/Page362.htm

[18] https://www.bhphotovideo.com/c/product/1017852-REG/solarflare_solr_5122f_10g_sfn5122f_plus_2_solr_sfm10g_sr_bundle.html

[19] https://www.storagereview.com/review/lsi-sas-9300-8i-and-9300-8e-hbas-review

[20] https://www.supermicro.com/products/chassis/4U/846/SC846E1-R1200.cfm

[21] https://www.supermicro.com/en/products/chassis/4U/748/SC748TQ-R1400B

[22] https://www.supermicro.com/en/products/chassis/4U/745/SC745BTQ-R1K28B-SQ