Introducing Fail File #1, where I admit to screwing something up and reflect on what I’ve learned
SAN2.daisettalabs.net, the NAS4Free server I built to simulate some of the functions I perform at work with big boy SANs, crashed last night.
Or, to put it another way, I pushed that little AMD-powered, FreeBSD-running, Broadcom-connected, ZFS-flavored franken-array to the breaking point:
Love the directness of BSD. The iSCSI target process was killed in cold blood, resulting in the death of several child partitions. What’s more, in just a few words I have the suspect (the kernel), the motive (swap space) & the victim (iSCSI). Windows would have said, “The service terminated unexpectedly…error 0x081942ad-SOL”
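(If the screenshot doesn’t load for you: the console line in question is FreeBSD’s standard out-of-swap kill notice, and you can fish it back out of the kernel message buffer any time. The PID below is illustrative, not what my box actually printed.)

```sh
# Pull the kernel's verdict back out of the message buffer
# (example output; your PID/UID will differ):
dmesg | grep -i "was killed"
#   pid 2301 (istgt), uid 0, was killed: out of swap space
```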
Such are the perils of concentrated block storage, amiright? Instantly my Hyper-V Cluster Shared Volumes + the 8 or 9 VMs inside them dropped:
So what happened here?
I failed to grok the grub or fsck the fdisk or something and gave BSD an inadequate amount of swap space on the root 10GB partition slice. Then I lobbed some iSCSI packets its way from multiple sources and the kernel, starved for resources (because I’m using about 95% of my RAM for the ARC), decided to kill istgt, the iSCSI target service.
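With hindsight, the fix is a two-parter: give the kernel a sane amount of swap, and cap the ARC so it can’t crowd everything else out of RAM. Here’s a minimal sketch of what I mean; the 12G figure is purely illustrative, and I’m assuming your appliance lets you edit loader tunables (NAS4Free exposes a loader.conf page under System > Advanced).

```sh
# 1. See how much (or how little) swap the installer actually left behind
swapinfo -h

# 2. Put a ceiling on the ARC so the kernel keeps some breathing room.
#    This is a FreeBSD loader tunable; the value here is illustrative.
cat >> /boot/loader.conf <<'EOF'
vfs.zfs.arc_max="12G"
EOF

# 3. Confirm the cap took effect after a reboot
sysctl vfs.zfs.arc_max
```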
Thinking back to the winter, when I ran Nexenta (derived from Sun’s Solaris, not BSD-based), the failure sequence was different, but I’m not sure it was better.
When I was pounding the Nexenta SAN2 back in the winter, volleying 175,000+ iSCSI packets per second at hardware that was even more ghetto, Nexenta did what any good human engineer does: compensate for the operator’s errors & abuses.
It was kind of neat to see. Whether I was running SQLIO simulations, an iometer run, robocopy or eseutil, or just turning on a bunch of VMs simultaneously, Nexenta services would start to drop, one by one, as resources were exhausted.
First the GUI (NMV, it’s called). Then SSH. And finally, sometimes, the console itself (NMC) would lock up.
But never iSCSI, the disk subsystem, the ARC or L2ARC…those pieces never dropped.
Now, to be fair, the GUI, SSH & console services never really turned back on either; you might end up with a durable storage system you couldn’t interact with at all until a hard reset, but at least the LUNs stayed online.
This BSD box, in contrast, kills the most important service I’m running on it, but it has the courtesy to admit to it and doesn’t make me get up out of my seat: the GUI, SSH & all other processes are running fine, I’ve instantly identified the problem, and I can engineer against it.
One model is resilient, bending but not breaking; the other is durable up to a point, and then it just snaps.
Which model is better for a given application?
Fail File Lesson #1: It’s just as important to understand how things fail as it is to understand why they fail, so that you can properly engineer against it. I never thought inadequate swap space would result in a homicidal kernel gunning for the most important service on the box…now I know.
Labworks #1: Building a durable, performance-oriented ZFS box for Hyper-V, VMware
Primary Goal: To build a durable and performance-oriented storage array using Sun’s fantastic, 128-bit, high-integrity Zettabyte File System for use with lab Hyper-V CSVs & Windows clusters, VMware ESXi 5.5, and other hypervisors.
The ARC: My RAM makes your SSD look like a couple of old, wheezing 15k drives
Secondary Goal: Leverage consumer-grade SSDs to multiply performance by using them as a ZFS Intent Log (ZIL) write cache and an L2ARC read cache (there’s a pool-layout sketch after the disk table below).
Bonus: The Windows 7 PC in the living room running Windows Media Center with CableCARD & HDHomeRun was running out of DVR disk space; it can’t record to SMB shares, but it can record to iSCSI LUNs.
Technologies used: iSCSI, MPIO, LACP, Jumbo Frames, IOMETER, SQLIO, ATTO, Robocopy, CrystalDiskMark, FreeBSD, NAS4Free, Windows Server 2012 R2, Hyper-V 3.0, Converged switch, VMware, standard switch, Cisco SG300
I picked the Gigabyte board above because it’s got eight SATA 6Gbit/s ports, all running on the native AMD A88X Bolton-D4 chipset, which, it turns out, isn’t well supported in Illumos (see Lab Notes below).
I added to that a cheap $20 Marvell 9128-based two-port SATA 6Gbit/s PCIe card, which hosts the boot volume & the SanDisk SSD.
Disk Type | Quantity | Size | Format | Speed | Function
WD Red 2.5″ with NASWARE | 6 | 1TB | 4KB AF | SATA 3, 5400 RPM | Zpool members
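For the curious, here’s roughly how those parts map onto a pool, which is the layout the goals above imply: the six Reds as a RAIDZ2 data vdev, with slices of the SanDisk SSD pressed into service as the ZIL (log) and L2ARC (cache). Pool and device names below are made up for the sketch; check what FreeBSD actually calls yours first.

```sh
# List the disks FreeBSD sees (the ada* names below are assumptions)
camcontrol devlist

# Six WD Reds as a RAIDZ2 data vdev
zpool create tank raidz2 ada0 ada1 ada2 ada3 ada4 ada5

# SSD partitions as dedicated log (ZIL) and cache (L2ARC) devices
zpool add tank log ada6p1
zpool add tank cache ada6p2

zpool status tank
```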
I’m not finished with all the benchmarking, which is notoriously difficult to get right, but here’s a taste. Expect a followup soon.
All shots below involved lzp2 compression on SAN2
SQLIO Short Test:
Obviously seeing the benefit of ZFS compression & the ARC at the front end. IOPS become more realistic toward the middle and right as the read cache is exhausted. Throughput sits consistently around 150-240MB/s, though, which is about the ceiling of two 1GbE links (roughly 2 x 117MB/s once protocol overhead is accounted for).
ATTO standard run:
I’ve got a big write problem somewhere. Is it the ZIL, which doesn’t seem to be performing under BSD the way it did under Nexenta? Something else? It could also be related to the Test Volume being formatted NTFS with a 64KB allocation unit size. Still trying to figure it out.
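Here’s my short list of suspects and how I plan to interrogate them; the pool name is an assumption, and none of this is a diagnosis yet:

```sh
# Is the log (ZIL) device actually absorbing sync writes, or are they
# landing straight on the spinners? Watch per-vdev activity during a run:
zpool iostat -v tank 5

# How are sync & compression set on the dataset behind the test LUN?
zfs get sync,compression tank

# Per-disk busy/latency on the FreeBSD side while the benchmark runs
gstat -a
```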
NFS Tests:
None so far. From a VMware perspective, I want to rebuild the Standard switch as a Distributed switch now that I’ve got a vCenter appliance running. But that’s not my priority at the moment.
Durability Tests:
Pulled two drives (the limit on RAIDZ2) under normal conditions. Put them back in, saw some alerts about the “administrator pulling drives” and the Zpool being in a degraded state. My CSVs remained online, however. Following a short zpool online command, both drives rejoined the pool and the degraded error went away.
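For the record, the whole recovery was about three commands; pool and device names here are assumptions, not what the box actually calls them:

```sh
zpool status tank            # pool shows DEGRADED with two members missing
zpool online tank ada2 ada3  # re-admit the two returned drives
zpool status tank            # resilver kicks off, then back to ONLINE
```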
Fun shots:
Because it’s not all about repeatable lab experiments. Here’s a Gifcam shot from Node-1 as it completely saturates both of its 1GbE Intel NICs:
and some pretty blinking lights from the six 2.5″ drives:
Lab notes & Lessons Learned:
First off, I’d like to buy a beer for the unknown technology enthusiast/lab guy who uttered these sage words of wisdom, which I failed to heed:
You buy cheap, you buy twice
Listen to that man, would you? Because going consumer, while tempting, is not smart. Learn from my mistakes: if you have to buy, buy server boards.
Secondly, I prefer NexentaStor to NAS4Free with ZFS, but like others, I worry about (and have been stung by) OpenSolaris/Illumos hardware support. Most of that is my own fault, cf. the note above, but still: does Illumos have a future? I’m hopeful. NexentaStor is going to appear at next month’s Storage Field Day 5, which is a good sign, and version 4.0 is due out any time.
The Illumos/Nexenta command structure is much more intuitive to me than FreeBSD’s. In place of your favorite *nix commands, Nexenta employs some great verb-noun show commands, and dtrace, the excellent diagnostic/performance tool included in Solaris, is baked right into Nexenta. In NAS4Free/FreeBSD 9.1, you’ve got to add a few packages to get the equivalent stats for the ARC, L2ARC and ZFS, and adding dtrace involves a make & kernel modification, something I haven’t been brave enough to try yet.
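That said, the raw ARC and L2ARC counters are sitting in FreeBSD’s sysctl tree even before you add any packages; the add-on scripts mostly just pretty-print these. A quick sample:

```sh
# Current ARC size plus hit/miss counters for the ARC and L2ARC
sysctl kstat.zfs.misc.arcstats.size
sysctl kstat.zfs.misc.arcstats.hits
sysctl kstat.zfs.misc.arcstats.misses
sysctl kstat.zfs.misc.arcstats.l2_hits
sysctl kstat.zfs.misc.arcstats.l2_misses
```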
Next: Jumbo Frames for the win. From Node-1, the desktop in my office, my Core i5-4670K CPU would regularly hit 35-50% utilization during my standard SQLIO benchmark before I configured jumbo frames end-to-end. Now, after enabling jumbo frames on the Intel NICs, the Hyper-V converged switch, the SG-300 and the ZFS box, utilization peaks at 15-20% during the same SQLIO test, and the benchmarks have shown an increase as well. Unfortunately, in FreeBSD world, adding jumbo frames is something you have to do on the interface & routing table, and it doesn’t persist across reboots for me, though that may be due to a driver issue on the Broadcom card.
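For anyone fighting the same thing, here’s the by-hand version and where the persistent setting would normally live on stock FreeBSD; the interface name is an assumption (Broadcom NICs typically show up as bge0 or bce0), and NAS4Free keeps its own config database, so its web GUI is probably the safer place to set MTU on the appliance.

```sh
# One-off, non-persistent: bump the MTU on the storage-facing interface
ifconfig bge0 mtu 9000

# On stock FreeBSD the persistent form goes in /etc/rc.conf, e.g.:
#   ifconfig_bge0="inet 10.0.0.5 netmask 255.255.255.0 mtu 9000"
# On NAS4Free, prefer the MTU field in the web GUI (or a post-init
# command) so the appliance doesn't overwrite the change.
```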
The Western Digital 2.5″ drives aren’t stellar performers and they aren’t cheap, but boy are they quiet and well-built, and they run cool, asking politely for only 1 watt under load. I’ve returned the hot, loud & failure-prone HGST 3.5″ 2TB drives I borrowed from work; it’s too hard to fit them in a short-depth chassis.
Lastly, ZFS’ Adaptive Replacement Cache, which I’ve enthused over a lot in recent weeks, is quite the value & performance multiplier. I’ve tested Windows Server 2012 R2 Storage Spaces’ tiered storage model, and while I was impressed with its responsiveness, ReFS, and ability to pool storage in interesting ways, nothing can compete with ZFS’ ARC model. It’s simply awesome: deceptively simple, but awesome.
The lesson is that if you’re going to lose an entire box to storage in your lab, your chosen storage system had better use every last ounce of that box, including its RAM, to serve storage up to you. 2012 R2 doesn’t, but I’m hopeful that soon it may (Update 1, perhaps?).
Here’s a cool screenshot from Nexenta, my last build before I re-did everything, showing ARC hits following a cold boot of the array (top) and, a few days later, when things are really cooking for my stored Hyper-V VMs, which are getting tagged with ZFS’ “Most Frequently Used” category and thus getting the benefit of fast RAM & L2ARC:
Next Steps:
Find out why my writes suck so bad.
Test NAS4Free’s NFS performance
Test SMB 3.0 from a virtual machine inside the ZFS box
Sell some stuff so I can buy a proper SLC SSD drive for the ZIL
Re-build the rookie Standard Switch into a true Distributed Switch in ESXi
Links/Knowledge/Required Reading Used in this Post: