Introducing Fail File #1, where I admit to screwing something up and reflect on what I’ve learned
SAN2.daisettalabs.net, the NAS4Free server I built to simulate some of the functions I perform at work with big boy SANs, crashed last night.
Or, to put it another way, I pushed that little AMD-powered, FreeBSD-running, Broadcom-connected, ZFS-flavored franken-array to the breaking point:
Such are the perils of concentrated block storage, amiright? Instantly my Hyper-V Cluster Shared Volumes + the 8 or 9 VMs inside them dropped:
So what happened here?
I failed to grok the grub or fsck the fdisk or something and gave BSD an inadequate amount of swap space on the root 10GB
partition slice. Then I lobbed some iSCSI packets its way from multiple sources and the kernel, starved for resources (because I’m using about 95% of my RAM for the ARC), decided to kill istgt, the iSCSI target service.
Thinking back to the winter, when I ran Nexenta -derived from Sun’s Solaris, not BSD-based- the failure sequence was different, but I’m not sure it was better.
When I was pounding the Nexenta SAN2 back in the winter, volleying 175,000+ iSCSI packets per second its way onto hardware that was even more ghetto, Nexenta did what any good human engineer does: compensate for the operator’s errors & abuses.
It was kind of neat to see. Whether I was running SQLIO simulations, an iometer run, robocopy or eseutil, or just turning on a bunch of VMs simultaneously, one by one, Nexenta services would start to drop as resources were exhausted.
First the gui (NMV it’s called). Then SSH. And finally, sometimes the console itself would lock up (NMC).
But never iSCSI, the disk subsystem, the ARC or L2ARC…those pieces never dropped.
Now to be fair, the GUI, SSH & console services never really turned back on either….you might end up with a durable storage system you couldn’t interact with at all until hard reset, but at least the LUNs stayed online.
This BSD box, in contrast, kills the most important service I’m running on it, but has the courtesy to admit to it and doesn’t make me get up out of my seat: GUI/SSH all other processes are running fine and I’ve instantly identified the problem and will engineer against it.
One model is resilient, bending but not breaking; the other is durable up to a point, and then it just snaps.
Which model is better for a given application?
Fail File Lesson #1: It’s just as important to understand how things fail as it is to understand why they fail, so that you can properly engineer against it. I never thought inadequate swap space would result in a homicidal kernel gunning for the most important service on the box…now I know.