#StorageGlory Achieved: 30 Days on a Windows SAN

 Behold, these three remain. File. Block. Object. And the greatest of these is block.  – Sr. Systems Engineer St. Paul, in a letter to confused storage engineers in Thessalonika

Right. So a couple weeks back I teased the hardware specs of the new storage array I built for the Daisetta Lab at home.

Software-defined. x86. File and block. Multipath. Intel. And some Supermicro. Storage utopia up in the Daisetta Lab

My idea was to combine all types of disks -rotational 3.5″ & 2.5″ drives, SSDs, mSATAs, hell, I considered USB- into one tight, well-built storage box for my lab and home data needs. A sort of Storage Ark, if you will; all media types were welcome, but only if they came in twos (for mirroring & parity's sake, of course) and only if they rotated at exactly 7200 RPM and/or leveled their wear evenly across the silica.

And onto this unholy motley crue of hard disks I slapped a software architecture that promised to abstract all the typical storage driver, interface, and controller nonsense away, far, far away in fact, to a land where the storage can be mixed, the controllers diverse, and by virtue of the software-definition bits, network & hypervisor agnostic. In short, I wanted to build an agnostic #StorageGlory box in the Daisetta Lab.

Right. So what did I use to achieve this? ZFS and Zpools?

Hell no, that’s so January.

VSAN? Ha! I’m no Chris Wahl.

I used Windows, naturally.

That’s right. Windows. Server 2012 R2 to be specific, running Core + Infrastructure GUI with 8GB of RAM, and some 17TB of raw disk space available to it. And a little technique developed by the ace Microsoft server team called Tiered Storage Spaces.
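
If you've never stood one of these up, the whole build is a handful of cmdlets. Here's a minimal sketch of a tiered, mirrored space; the pool name, tier names, and sizes below are placeholders, not the exact recipe I used on san.daisettalabs.net:

```powershell
# Gather every blank disk that's eligible for pooling
$disks = Get-PhysicalDisk -CanPool $true
$subsystem = Get-StorageSubSystem | Where-Object FriendlyName -like "*Storage Spaces*"

# One big pool across the spinners & SSDs
New-StoragePool -FriendlyName "DaisettaPool" -StorageSubSystemFriendlyName $subsystem.FriendlyName -PhysicalDisks $disks

# Define the SSD & HDD tiers Storage Spaces will shuffle hot/cold data between
$ssdTier = New-StorageTier -StoragePoolFriendlyName "DaisettaPool" -FriendlyName "SSDTier" -MediaType SSD
$hddTier = New-StorageTier -StoragePoolFriendlyName "DaisettaPool" -FriendlyName "HDDTier" -MediaType HDD

# Carve out a tiered, two-way mirrored virtual disk (sizes are illustrative)
New-VirtualDisk -StoragePoolFriendlyName "DaisettaPool" -FriendlyName "TieredSpace" `
    -StorageTiers $ssdTier, $hddTier -StorageTierSizes 400GB, 4TB `
    -ResiliencySettingName Mirror -WriteCacheSize 1GB
```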

Was a #StorageGlory Achievement Unlocked, or was it a dud?

Here’s my review after 30 days on my Windows SAN: san.daisettalabs.net.

The Good

It doesn’t make you pick a side in either storage or storage-networking: Do you like abstracted pools of storage, managed entirely by software? Put another way, do you hate your RAID controller and crush on your old-school NetApp filer, which seemingly could do everything but object storage?

When I say block, do you instinctively say file? Or vice-versa?

Well then my friend, have I got a storage system for your lab (and maybe production!) environment: Windows Storage Spaces (now with Tiering!) offers just about everything guys like you or me need in a storage system for lab & home media environments. I love it not just because it’s Microsoft, but also because it doesn’t make me choose between storage & storage-networking paradigms. It’s perhaps the ultimate agnostic storage technology, and I say that as someone who thinks about agnosticism and storage.

A lot.

You know what I’m talking about. Maybe today, you’ll need some block storage for this VM or that particular job. Maybe you’re in a *nix state of mind and want to fiddle with NFS. Or perhaps you’re feeling bold & courageous and decide to try out VMware again, building some datastores on both iSCSI LUNs and NFS shares. Then again, maybe you want to see what SMB 3.0 is all about; the MS fanboys sure seem to be talking it up.

The point is this: I don’t care what your storage fancy is, but for lab-work (which makes for excellence in work-work) you need a storage platform that’s flexible and supportive of as many technologies as possible and is, hopefully, software-defined.

And that storage system is, hard to believe I’ll grant you, Windows Server 2012 R2.

I love storage and I can’t think of one other storage system -save for maybe NetApp- that lets me do crazy things like store .vmdks inside of .vhdxs (oh the vIrony!), use SMB 3 multichannel over the same NICs I’m using for iSCSI traffic, and create snapshots & clones just like the big filers, all while giving me the performance-multiplier benefits of SSDs and caching and a reasonable level of resiliency.

File this one under WackWackStorageGloryAchievedWindows boys and girls.

I can do it all with Storage Spaces in 2012 R2.

As I was thinking about how to write about Storage Spaces, I decided to make a chart, if only to help me keep it straight. It’s rough but maybe you’ll find it useful as you think about storage abstraction/virtualization tech:

Storage-Compared
And yes. Ex post facto dedupe is a made-up term. By me. It’s Latin for “After the fact, dedupe,” because I always scheduled my dedupes for Saturday night, when the IO load on the filer was low. Ex post facto dedupe is in contrast to some newer storage companies that offer inline compression & dedupe, but none of the ones above offer this, sadly.
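
(For what it’s worth, Storage Spaces itself doesn’t dedupe; in 2012 R2 that’s the separate Data Deduplication role, and it’s ex post facto by design. If you want the Saturday-night flavor, it’s a two-liner; the volume letter, schedule name, and times below are placeholders:)

```powershell
# Turn on post-process dedupe for a data volume, then schedule the optimization job
# for Saturday night when the IO load is low (volume & schedule name are placeholders)
Enable-DedupVolume -Volume "D:" -UsageType Default
New-DedupSchedule -Name "SaturdayNightDedupe" -Type Optimization -Days Saturday -Start "23:00" -DurationHours 6
```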

It’s easy to build and supports your disks & controllers: This is a Microsoft product. Which means it’s easy to deploy & build for your average server guy. Mine’s running on a very skinny, re-re-purposed SanDisk ReadyCache SSD. With Windows Server 2012 R2 running the Infrastructure Management GUI (no explorer.exe, just Server Manager + your favorite snap-ins), it’s using about 6GB of space on the boot drive.

And drivers for the Intel C226 SATA controller, the LSI 9218si SAS card, and the extra ASMedia 1061 controller were all installed automagically by Windows during the build.

The only other system that came close to being this easy to install -as a server product- was Oracle Solaris 11.2 Beta. It found, installed drivers for, and exposed all controllers & disks, so I was well on my way to going the ZFS route again, but figured I’d give Windows a chance this time around.

Nexenta 4, in contrast, never loaded past the Install Community Edition screen.

It’s improved a lot over 2012: Storage Spaces debuted in Server 2012 almost two years ago now, and I remember playing with it at work a bit. I found it to be a mind-f*** as it was a radically different approach to storage within the Windows server context.

I also found it to be slow, dreadfully slow even, and not very survivable. Though it did accept any disk I gave it, it didn’t exactly like it when I removed a USB drive during an extended write test. And it didn’t take the disk back at the conclusion of the test either.

Like everything else in Microsoft’s current generation, Storage Spaces in 2012 R2 is much better, more configurable, easier to monitor, and more tolerant of disk failures.

It also has something for the IOPS speedfreak inside all of us.

Storage Spaces, abstract this away

Tiered Storage Spaces & Adjustable write cache: Coming from ZFS & the Adaptive Replacement Cache, the ZFS Intent Log, the SLOG, and L2ARC, I was kind of hooked on the idea of using massive amounts of my ECC RAM to function as a sort of poor man’s NVRAM.

Windows can’t do that, but with Tiered Storage Spaces, you can at least drop a few SSDs in your array (in my case three x 256GB 840 EVO & one 128GB Samsung 830), mix them into your disk pool, and voila! Fast read-cache, with a Microsoft-flavored MRU/LFU algorithm of some type keeping your hottest data on the fastest disks and your old data on the cheep ‘n deep rotationals.

What’s more, going with Tiered Storage Spaces gives you a modest 1GB write cache, but as I found out, you can increase that up to 10GB.

Which I naturally did while building this guy out. I mean, who wouldn’t want more write-cache?

But there’s a huge gotcha buried in the TechNet articles and blog posts I found about this. I wanted to pool all my disks together into as large a single virtual disk as possible, then pack iSCSI-connected .vhdxs, SMB 3 shares, and more inside that single, durable & tiered virtual disk. What I didn’t want was several virtual disks (it helped me to think of virtual disks as a sort of Aggregate) with SMB 3 shares and .vhdx files stored haphazardly between them.

Which is what you get when you adjust the write-cache size. Recall that I have a capacity of about 17TB raw among all my disks. Building a storage pool, then a virtual disk with a 10GB write cache, gave me a tiered virtual disk with a maximum size of about 965GB. More on that below.
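
For reference, the write cache is, as far as I can tell, only adjustable at creation time via New-VirtualDisk; the 10GB version of that one-liner looks roughly like this, reusing the pool & tier variables from the sketch above (tier sizes are illustrative):

```powershell
# Bumping the write-back cache past the 1GB default means specifying it (and the tier sizes) yourself
New-VirtualDisk -StoragePoolFriendlyName "DaisettaPool" -FriendlyName "BigCacheSpace" `
    -StorageTiers $ssdTier, $hddTier -StorageTierSizes 200GB, 700GB `
    -ResiliencySettingName Mirror -WriteCacheSize 10GB
```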

It can be wicked fast, but so is RAID 0: Check out my standard SQLIO benchmark routine, which I run against all storage technologies that come my way. The 1.5-hour test is by no means comprehensive (and I’m not saying the IOPS counter is accurate at all; it shows max values across all tests, by the way), but I like this test because it lets me kick the tires on my array, take her out for a spin, and see how she handles.
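
For the curious, a single pass out of that routine looks something like this; the flags, duration, and test file here are illustrative stand-ins, not the exact recipe:

```powershell
# 64KB random writes, 8 threads, 8 outstanding IOs per thread, 5 minutes, latency stats on,
# run against a pre-created test file sitting on the array
.\sqlio.exe -kW -t8 -o8 -s300 -frandom -b64 -BH -LS E:\sqliotest.dat
```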

And with a “Simple” layout (no redundancy, probably equivalent to RAID 0), she handles pretty damn well, but even I’m not crazy enough to run tiered storage spaces in a simple layout config:

storage spaces
These three tests (1.5 hours each, identical setup against multiple configs) were done locally on the array, not over my home network

What’s odd is how poorly the array performed with 10GB of “Write Cache.” Not sure what happened here, but as you can see, latency spiked higher during the 10GB write cache write phase of the test than just about every other test segment.

Something to do with parity no doubt.

For my lab & home storage needs, I settled on a two-way mirror setup that gives me moderate performance with durability in mind, though not as much durability as you’d hope, as you’ll see below.

Making the most of my lab/home network and my NICs: Recall that I have six GbE NICs on this box. Two are built into the Supermicro board itself (Intel), and the other four come by way of a quad-port Intel I350-T4 server NIC.

Anytime you’re planning to do a Microsoft cluster in the 1GbE world, you need lots of NICs. It’s a bit of a crutch in some respects, especially in iSCSI. Typically you VLAN off each iSCSI NIC for your Hyper-V hosts and those NICs do one thing and one thing only: iSCSI, or Live Migration, or CSV etc. Feels wasteful.

But on my new storage box at home, I can use them for double-duty: iSCSI (or LM/CSV) as well as SMB 3. Yes!

Usually I turn off Client for Microsoft Networks (the SMB file sharing toggle in NIC properties) on each dedicated NIC (or vEthernet), but since I want my file cake & my block cake at the same time, I decided to turn SMB on for all iSCSI vEthernet adapters (from the physical & virtual hosts) and leave SMB on the iSCSI NICs on san.daisettalabs.net as well.
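
In PowerShell terms, that just means leaving the SMB bindings enabled on the iSCSI-facing adapters; a quick sketch, using the adapter names from my table below:

```powershell
# Leave "Client for Microsoft Networks" and "File and Printer Sharing" bound on the
# iSCSI-facing NICs so they can carry SMB 3 alongside iSCSI traffic
Get-NetAdapter -Name "iSCSI-1*" | Set-NetAdapterBinding -ComponentID ms_msclient -Enabled $true
Get-NetAdapter -Name "iSCSI-1*" | Set-NetAdapterBinding -ComponentID ms_server -Enabled $true
```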

The end result? This:

[table caption=”Storage Networking-All of the Above Approach” width=”500″ colwidth=”20|100|50″ colalign=”left|left|center|left|right”]
nic,Name,VLAN,IP,Function
1,MGMT,100,192.168.100.15,MGMT & SMB3
2,CLNT,102,192.168.102.15,Home net & SMB3
3,iSCSI-10,10,172.16.10.x,iSCSI & SMB3
4,iSCSI-11,10,172.16.11.x,iSCSI & SMB3
5,iSCSI-12,10,172.16.12.x,iSCSI & SMB3
[/table]

That’s five, count ’em, five NICs (or discrete channels, more specifically) I can use to fully soak in the goodness that is SMB 3 multichannel, at the cost of only a slightly unsettling epistemological question about whether iSCSI NICs are truly iSCSI if they’re doing file storage protocols.

Now SMB 3 is so transparent (on by default) you almost forget that you can configure it, but there are quite a few ways to adjust file share performance. Aidan Finn argues for constraining SMB 3 to certain NICs, while Jose Barreto details how multichannel works on standalone physical NICs, a pair in a team, and multiple teams of NICs.
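
Whichever model you follow, the cmdlets for inspecting or constraining multichannel are straightforward; the server name and interface aliases below are from my own setup, so adjust accordingly:

```powershell
# See which NICs SMB 3 multichannel is actually using against the SAN
Get-SmbMultichannelConnection | Where-Object ServerName -eq "san.daisettalabs.net"

# Or, per Aidan Finn's approach, pin SMB traffic to specific interfaces on the client
New-SmbMultichannelConstraint -ServerName "san.daisettalabs.net" -InterfaceAlias "MGMT", "CLNT"
```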

I haven’t decided which model to follow (though on san.daisettalabs.net, I’m not going to change anything or use Converged switching…it’s just storage), but SMB 3 is really exciting and it’s great that with Storage Spaces, you can have high performance file & block storage. I’ve hit 420MB/sec on synchronous file copies from san to host and back again. Outstanding!

I finally got iSNS to work and it’s…meh: One nice thing about san.daisettalabs.net is that that’s all you need to know…the FQDN is now the resident iSCSI Name Server, meaning it’s all I need to set on an MS iSCSI Initiator. It’s a nice feature to have, but probably wasn’t worth the 30 minutes I spent getting it to work (hint: run Set-WmiInstance before you run the iSNS cmdlets in PowerShell!) as iSNS isn’t so great when you have…
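
For the record, the Set-WmiInstance bit is the initiator-side registration, something along these lines (the class and property names are from the iSCSI initiator's WMI documentation as I recall it, so treat this as a sketch):

```powershell
# Point the local Microsoft iSCSI initiator at the iSNS server; run this before the iSNS cmdlets
Set-WmiInstance -Namespace root\wmi -Class MSiSCSIInitiator_iSNSServerClass `
    -Arguments @{ iSNSServerAddress = "san.daisettalabs.net" }
```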

SMI-S, which is awesome for Virtual Machine Manager fans: SMI-S, you’re thinking, what the hell is that? Well, it’s a standardized framework for communicating block storage information between your storage array and whatever interface you use to manage & deploy resources on your array. Developed by no less an august body than the Storage Networking Industry Association (SNIA), it’s one of those “standards” that seem like a good idea, but you can’t find it much in the wild, as it were. I’ve used SMI-S against a NetApp Filer (in the Classic DoT days, not sure if it works against cDoT) but your Nimbles, your Pures, and other new players in the market get the same funny look on their face when you ask them if they support SMI-S.

“Is that a vCenter thing?” they ask.

Sigh.

Microsoft, to its credit, does. Right on Windows Server. It’s a simple feature you install and two or three PowerShell commands later, you can point Virtual Machine Manager at it and voila! Provision, delete, resize, and classify iSCSI LUNs on your Windows SAN, just like the big boys do (probably) in Azure, only here, we’re totally enjoying the use of our corpulent .vhdx drives, whereas in Azure, for some reason, they’re still stuck on .vhds like rookies. Haha!
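
To give you a feel for what VMM is driving over SMI-S, here's the by-hand equivalent using the iSCSI Target cmdlets; the path, target name, and IQN are made-up placeholders:

```powershell
# Carve a LUN (a .vhdx under the hood), create a target, and map the two together;
# these are the provision/resize/delete operations VMM performs for you via SMI-S
New-IscsiVirtualDisk -Path "D:\LUNs\vmm-lun01.vhdx" -SizeBytes 200GB
New-IscsiServerTarget -TargetName "hyperv-cluster" -InitiatorIds "IQN:iqn.1991-05.com.microsoft:node1.daisettalabs.net"
Add-IscsiVirtualDiskTargetMapping -TargetName "hyperv-cluster" -Path "D:\LUNs\vmm-lun01.vhdx"
```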

Single Pane o’ glass in VMM with SMI-S, GUIDs galore and more for the Hyper-V set

It’s a very stable storage platform for Microsoft Clustering: I’ve built a lot of Microsoft Hyper-V clusters. A lot. More than half a dozen in production, and probably three times that in dev or lab environments, so it’s like second nature to me. Stable storage & networking are not just important factors in Microsoft clusters, they are the only factors.

So how is it building out a Hyper-V cluster atop a Windows SAN? It’s the same, and different at the same time, but, unlike so many other cluster builds, I passed the validation test on the first attempt with green check marks everywhere. And weeks have gone by without a single error in the Failover Clustering snap-in; it’s great.

The Bad

It’s expensive and seemingly not as redundant as other storage tech: When you build your storage pool out of offlined disks, your first choice is going to involve (just like other storage abstraction platforms) disk redundancy. Microsoft makes it simple, but doesn’t really tell you the cost of that redundancy until later in the process.

Recall that I have 17TB of raw storage on san.daisettalabs.net, organized as follows:

[table]

Disk Type, Quantity, Size, Format, Speed, Function

WD Red 2.5″ with NASWARE, 6, 1TB, 4KB AF, SATA 3 5400RPM, Cheep ‘n deep

Samsung 840 EVO SSD, 3, 256GB, 512byte, 250MB/read, Tiers not fears

Samsung 830 SSD, 1, 128GB, 512byte, 250MB/read, Tiers not fears

HGST 3.5″ Momentus, 6, 2TB, 512byte, 105MB/r/w, Cheep ‘n deep

[/table]

Now, according to my trusty IOPS Excel calculator, if I were to use traditional RAID 5 or RAID 6 on that set of spinners, I’d get about 16.5TB usable in the former and 15TB usable in the latter (assuming RAID penalties of 5 & 6, respectively).

For much of the last year, I’ve been using ZFS & RAIDZ2 on the set of six WD Red 2.5″ drives. Those have a raw capacity of 6TB. In RAIDZ2 (roughly analogous to RAID 6), I recall getting about 4.2TB usable.

All in all, traditional RAID & ZFS’ RAIDZ cost me between 12% and 35% of my capacity, respectively.

So how much does the Windows Storage Spaces resiliency model (two-way mirror) cost me? A lot. We’re in RAID-DP territory here, people:

 

storagespaces5

Ack! With 17TB of raw storage, I get about 5.7TB usable, a cost of about 66%!

And for that, what kind of resiliency do I get?

I sure as hell can’t pull two disks simultaneously, as I did live with my ZFS box. I can suffer the loss of only a single disk. And even then, other Windows bloggers point to some pain as the array tries to adjust.

Now, I’m not the brightest on RAID & parity and such, so perhaps there’s a more resilient, less costly way to use Storage Spaces with Tiering, but wow…this strikes me as a lot of wasted disk.

Not as easy to de-abstract the storage: When a disk array is under load, one of my favorite things to do is watch how the IO hits the physical elements in the array. Modern disk arrays make what your disks are doing abstract, almost invisible, but to truly understand how these things work, sometimes you just want the modern equivalent of lun stats.

In ZFS, I loved just letting gstat run, which showed me the load my IO was placing on the ARC, the L2ARC and finally, the disks. Awesome stuff:

In this Gifcam, watch ada0-6 as they struggle under load with the “Always Sync” option enabled.

As best as I can tell, there’s no live PowerShell equivalent to gstat for Storage Spaces. There are teases, though; you can query your disks, get their SMART vitals, and more, but peeling away the onion layers and actually watching how Windows handles your IO would make Storage Spaces the total package.
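
Here's what I mean by teases; a few queries get you part of the way there (counter paths and property names as I recall them, so consider this a sketch):

```powershell
# Per-disk inventory and the SMART-ish vitals Storage Spaces exposes
Get-PhysicalDisk | Sort-Object FriendlyName |
    Format-Table FriendlyName, MediaType, Size, HealthStatus -AutoSize
Get-PhysicalDisk | Get-StorageReliabilityCounter |
    Format-Table DeviceId, Temperature, Wear, ReadErrorsTotal -AutoSize

# The closest thing I've found to a live gstat: sampling perfmon's physical disk counters
Get-Counter -Counter "\PhysicalDisk(*)\Disk Bytes/sec" -SampleInterval 2 -Continuous
```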

Bottom line

So that’s about it: this is the best storage box I’ve built in the Daisetta Lab. No regrets going with Windows. The platform is mature, stable, offers very good performance, and decent resiliency, if at a high disk cost.

I’m so impressed I’ve checked my Windows SAN skepticism at the door and would run this in a production environment at a small/medium business (clustered, in the Scale-Out File Server role). Cost-wise, it’s a bargain. Check out this array: it’s the same exact hardware a certain upstart storage vendor I like (that rhymes with Gymbal Porridge) sells, but for a lot less!

#StorageGlory achieved. At home. In my garage.

Labworks 1:4-7 – The Last Word in ZFS Labworks

Greetings to you Labworks readers, consumers, and conversationalists. Welcome to the last verse of Labworks Chapter 1, which has been all about building a durable and performance-oriented ZFS storage array for Hyper-V and/or VMware.

Let’s review where we’ve been:

[table]

Labworks Chapter, Verse, Subject, Title & URL

Labworks 1:, 1, Storage, Building a Durable and Performance-Oriented ZFS Box for Hyper-V & VMware

,2-3, Storage, I Heart the ARC & Let’s Pull Some Drives!

[/table]

Today we’re going to circle back to the very end of Labworks 1:1, where I assigned myself some homework: find out why my writes suck so bad. We’re going to talk about a man named ZIL and his sidekick the SLOG and then we’re going to check out some Excel charts and finish by considering ZFS’ sync models.

But first, some housekeeping: SAN2, the ZFS box, has undergone minor modification. You can find the current array setup below. Also, I have a new switch in the Daisetta Lab, and as switching is intimately tied to storage networking & performance, it’s important I detail a little bit about it.

Labworks 1:4 – Small Business SG300 vs Catalyst 2960S

Cisco’s SG-300 & SG-500 series switches are getting some pretty good reviews, especially in a home lab context. I’ve got an SG-300 and really like it as it offers a solid spectrum of switching options at Layer 2 as well as a nice Layer 3-lite mode, all for a tick under $200. It even has a real web interface if you’re CLI-shy, which

Small Business Cisco != Linksys

I’m not but some folks are.

Sadly for me & the Daisetta Lab, I need more ports than my little SG-300 has to offer. So I’ve removed it from my rack and swapped it for a 2960S-48TS-L from the office, but not just any 2960S.

No, I have spiritual & emotional ties to this 2960S, this exact one. It’s the same 2960S I used in my January storage bakeoff of a Nimble array, the same 2960S on which I broke my Hyper-V & VMware cherry in those painful early days of virtualization. Yes, this five-year-old switch is now in my lab:

The pride of Cisco’s 2009 Desktop Switching series, the 2960s

Sure, it’s not a storage switch; in fact, it’s meant for IDFs and end-users. And if the guys on that great storage networking podcast from a few weeks back knew I was using this as a storage switch, I’d be finished in this industry for good.

But I love this switch and I’m glad it’s at the top of my rack. I saved 1U, the energy costs of this switch vs two smaller ones are probably a wash, and though I lost Layer 3 Lite, I gained so much more: 48 x 1GbE ports and full LAN-licensed Cisco IOS 15.2, which, agnostic computing goals aside for a moment, just feels so right and so good.

And with the increased amount of full-featured switch ports available to me, I’ve now got LACP teams of three on agnostic_node_1 & 2, jumbo frames from end to end, and the same VLAN layout.
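
On the Hyper-V hosts, the team and jumbo-frame work boils down to a couple of cmdlets; the NIC names below are placeholders, and the vEthernet adapters on the converged switch need the same jumbo treatment:

```powershell
# Three-member LACP team on each Hyper-V node (team member names are placeholders)
New-NetLbfoTeam -Name "ConvergedTeam" -TeamMembers "NIC1", "NIC2", "NIC3" `
    -TeamingMode Lacp -LoadBalancingAlgorithm TransportPorts

# Jumbo frames on the physical NICs; repeat for the vEthernets hanging off the converged switch
Get-NetAdapter -Name "NIC*" |
    Set-NetAdapterAdvancedProperty -RegistryKeyword "*JumboPacket" -RegistryValue 9014
```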

Here’s the updated Labworks schematic and the disk layout for SAN2:

Lab 1-4-5 - Daisetta Labs

[table]

Disk Type, Quantity, Size, Format, Speed, Function

WD Red 2.5″ with NASWARE, 6, 1TB, 4KB AF, SATA 3 5400RPM, Zpool Members

Samsung 840 EVO SSD, 1, 128GB, 512byte, SATA 3, L2ARC Read Cache

Samsung 830 SSD, 1, 128GB, 512byte, SATA 3, L2ARC Read Cache

Seagate 2.5″ Momentus, 1, 500GB, 512byte, 80MB/r/w, Boot/swap/system

[/table]

Labworks 1:5 – A Man named ZIL and his sidekick, the SLOG

Labworks 1:1 was all about building durable & performance-oriented storage for Hyper-V & VMware. And one of the unresolved questions I aimed to solve out of that post was my poor write performance.

Review the hardware table and you’ll feel like I felt. I got me some SSD and some RAM, I provisioned a ZIL, so write-cache that inbound IO already, ZFS, amiright? Show me the IOPS money, Jerry!

Well, about that. I mischaracterized the ZIL and I apologize to readers for the error. Let’s just get this out of the way: The ZFS Intent Log (ZIL) is not a write-cache device as I implied in Labworks 1:1.

ZFS storage layout in excellent Good/Better/Best format courtesy of Nexenta, which has some outstanding documentation & guides

The ZIL, whether spread out among your rotational disks by ZFS design, or applied to a Separate Log Device (a SLOG), is simply a synchronous-writes mechanism: a log designed to ensure data integrity and report back to the application layer (the IO ACK) that its writes are safe on stable storage. The ZIL & SLOG are also disaster-recovery mechanisms/devices; in the event of power loss, the ZIL, or the ZIL functioning on a SLOG device, will ensure that the writes it logged prior to the event are committed to your spinners when your disks are back online.

Now there seem to be some differences in how the various implementations of ZFS look at the ZIL/SLOG mechanism.

Nexenta Community Edition, based on Illumos (the open-source descendant of Sun’s Solaris), says your SLOG should just be a write-optimized SSD, but even that’s more best practice than hard & fast requirement. Nexenta touts the ZIL/SLOG as a performance multiplier, and their excellent documentation has helpful charts and graphics reinforcing that.

In contrast, the documentation for the most popular FreeBSD ZFS implementation paints the ZIL as likely more trouble than it’s worth. FreeNAS actively discourages you from provisioning a SLOG unless it’s enterprise-grade, accurately pointing out that the ZIL & a SLOG device aren’t write-cache and probably won’t make your writes faster anyway, unless you’re NFS-focused (which I’m proudly, defiantly even, not) or operating a large database at scale.

ZIL me

What’s to account for the difference in documentation & best practice guides? I’m not sure; some of it’s probably related to *BSD vs Illumos implementations of ZFS, some of it’s probably related to different audiences & users of the free tier of these storage systems.

The question for us here is this: Will you benefit from provisioning a SLOG device if you build a ZFS box for Hyper-V and VMWare storage for iSCSI?

I hate sounding like a waffling storage VAR here, but I will: it depends. I’ve run both Nexenta and NAS4Free; when I ran Nexenta, I saw my SLOG being used during random & synchronous write operations. In NAS4Free, the SSD I had dedicated as a SLOG never showed any activity in zfs-stats, gstat or any other IO disk tool I could find.

One could spend weeks of valuable lab time verifying under which conditions a dedicated SLOG device adds performance to your storage array, but I decided to cut bait. Check out some of the links at the bottom for more color on this, but in the meantime, let me leave you with this advice: if you have $80 to spend on your FreeBSD-based ZFS storage, buy an extra 8GB of RAM rather than a tiny, used SLC or MLC device to function as your SLOG. You will almost certainly get more performance out of a larger ARC than by dedicating a disk as your SLOG.

Labworks 1:6 – Great…so, again, why do my writes suck? 

Recall this SQLIO test from Labworks 1:1:

sqlio lab 1 short test

As you can see, read or write, I was hitting a wall at around 235-240 megabytes per second during much of “Short Test”, which is pretty close to the theoretical limit of an LACP team with two GigE NICs.

But as I said above, we don’t have that limit anymore. Whereas there were once 2x1GbE Teams, there are now 3x1GbE. Let’s see what the same test on the same 4KB block/4KB NTFS volume yields now.

SQLIO short test, take two, sort by Random vs Sequential writes & reads:

labworks147

By Jove, what’s going on here? This graph was built off the same SQLIO recipe, but looks completely different than Labworks 1. For one, the writes look much better, and the reads look much worse. Yet step back and the patterns are largely the same.

It’s data like this that makes benchmarking, validating & ultimately purchasing storage so tricky. Some would argue with my reliance on SQLIO and those arguments have merit, but I feel SQLIO, which is easy to script/run and automate, can give you some valuable hints into the characteristics of an array you’re considering.

Let’s look at the writes question specifically.

Am I really writing 350MB/s to SAN2?

storagenetworkingforthewin
On the one hand, everything I’m looking at says YES: I am a Storage God and I have achieved #StorageGlory inside the humble Daisetta Lab HQ on consumer-level hardware:

  • SAN2 is showing about 115MB/s to each Broadcom interface during the 32KB & 64KB samples
  • Agnostic_Node_1 perfmon shows about the same amount of traffic egressing the three vEthernet adapters
  • The 2960S is reflecting all that traffic; I’m definitely pushing about 350 megabytes per second to SAN2; interface port channel 3 shows TX load at 219 out of 255 and maxing out my LACP team

On the other hand, I am just an IT Mortal and something bothers:

  • CPU is very high on SAN2 during the 32KB & 64KB runs…so busy it seems like the little AMD CPU is responsible for some of the good performance marks
  • While I’m a fan of the itsy-bitsy 2.5″ Western Digital RED 1TB drives in SAN2, under no theoretical IOPS model is it likely that six of them, in RAIDZ-2 (RAID 6 equivalent), can achieve 5,000-10,000 IOPS under traditional storage principles. Each drive by itself is capable of only 75-90 IOPS
  • If something is too good to be true, it probably is

Sr. Storage Engineer Neo feels really frustrated at this point; he can’t figure out why his writes suck, or even if they suck, and so he wanders up to the Oracle to get her take on the situation and comes across this strange Buddha Storage kid.

Labworks 1:7 – The Essence of ZFS & New Storage model

In effect, what we see here is just a sample of the technology & techniques that have been disrupting the storage market for several years now: compression & caching multiply the performance of storage systems beyond what they should be capable of, in certain scenarios.

As the chart above shows, the test2 volume is compressed by SAN2 using lzjb. On top of that, we’ve got the ZFS ARC, L2ARC, and the ZIL in the mix. And then, to make things even more complicated, we have some sync policies ZFS allows us to toggle. They look like this:

sync policy

The sync toggle documentation is out there and you should understand it; it is crucial to understanding ZFS. But I want to demonstrate the choices as well.

I’ve got three choices + the compression options. Which one of these combinations is going to give me the best performance & durability for my Hyper-V VMs?

SQLIO Short Test Runs 3-6, all PivotTabled up for your enjoyment and ease of digestion:

compressionsync

As is usually the case in storage, IT, and hell, life in general, there are no free lunches here people. This graph tells you what you already know in your heart: the safest storage policy in ZFS-land (Always Sync, that is to say, commit writes to the rotationals post haste as if it was the last day on earth) is also the slowest. Nearly 20 seconds of latency as I force ZFS to commit everything I send it immediately (vs flush it later), which it struggles to do at a measly average speed of 4.4 megabytes/second.

Compression-wise, I thought I’d see a big difference between the various compression schemes, but I don’t. Lzjb, lz4, and the ultra-space-saving/high-CPU-cost gzip-9 all turn in about equal results from an IOPS & performance perspective. It’s almost a wash, really, and that’s likely because of the predictable nature of the IO SQLIO is generating.

Labworks 1:Epilogue

Last point: ZFS, as Chris Wahl pointed out, is a sort of virtualization layer atop your storage. Now if you’re a virtualization guy like me or Wahl, that’s easy to grasp; Windows 2012 R2’s Storage Spaces concept is similar in function.

But sometimes in virtualization, it’s good to peel away the abstraction onion and watch what that looks like in practice. ZFS has a number of tools and monitors that look at your Zpool IO, but to really see how ZFS works, I advise you to run gstat. gstat shows what your disks are doing, and if you’ve carefully set up your environment, you ought to be able to see the effects of your settings on each individual spindle.

In this Gifcam, watch ada0-5 (the Western Digitals) as they struggle under load with the “Always Sync” option enabled. Notice that the zvol/Alpha-Pool/Test2 volume (the logical volume construct) is at 100% busy and the ops/s are not very stellar.


Now look at this gstat sample. Under SQLIO-load, the zvol is showing 10,000 IOPS, 300+MB/s. But ada0-5, the physical drives, aren’t doing squat for several seconds at a time as SAN2 absorbs & processes all the IO coming at it.

That, friends, is the essence of ZFS.

 Links/Knowledge/Required Reading Used in this Post:

[table]
Resource, Author, Summary

Nexenta’s awesome whitepapers and guides, Nexenta, Find ’em and collect ’em; good stuff on MPIO config and ZFS performance

Comparing SSD vs NoSSD in Nexenta w/NFS, Larry Smith, A fellow ZFS fan with more focus on NFS & VMware

Get the Most out of ZFS SSD, Sebastian “vBagpipes” Laubscher, Sebastian finds a different way to provision the ZIL/SLOG

Nexenta & Scale, Hans DeLeenHeer, Fellow #TFD delegate looks at ZFS tiers in superhero context

SLOG/ZIL Insight, FreeNAS forum, Great forum-focused post on SLOG/ZIL in BSD ZFS

SLOG Blog, Oracle, 2007 post about the ZIL & SLOG heralding storage di

 Zpool and ZIL management, Magnus Strahlert, Excellent how-to guide for ZIL/L2ARC provisioning

[/table]

 

Fail File: SAN down! SAN down! All Nodes respond

Introducing Fail File #1, where I admit to screwing something up and reflect on what I’ve learned

SAN2.daisettalabs.net, the NAS4Free server I built to simulate some of the functions I perform at work with big boy SANs, crashed last night.

Or, to put it another way, I pushed that little AMD-powered, FreeBSD-running, Broadcom-connected, ZFS-flavored franken-array to the breaking point:

Untitled picture
Love the directness of BSD. The iSCSI Target process was killed in cold blood, resulting in the death of several child partitions. What’s more, in just a few words, I have the suspect (Kernel), the motive (swap space) & the victim (iSCSI). Windows would have said, “The service terminated unexpectedly…error 0x081942ad-SOL”

 

Such are the perils of concentrated block storage, amiright? Instantly my Hyper-V Cluster Shared Volumes + the 8 or 9 VMs inside them dropped:

csvs

So what happened here?

I failed to grok the grub or fsck the fdisk or something and gave BSD an inadequate amount of swap space on the root 10GB partition slice. Then I lobbed some iSCSI packets its way from multiple sources and the kernel, starved for resources (because I’m using about 95% of my RAM for the ARC), decided to kill istgt, the iSCSI target service.

Thinking back to the winter, when I ran Nexenta -derived from Sun’s Solaris, not BSD-based- the failure sequence was different, but I’m not sure it was better.

When I was pounding the Nexenta SAN2 back in the winter, volleying 175,000+ iSCSI packets per second its way onto hardware that was even more ghetto, Nexenta did what any good human engineer does: compensate for the operator’s errors & abuses.

It was kind of neat to see. Whether I was running SQLIO simulations, an iometer run, robocopy or eseutil, or just turning on a bunch of VMs simultaneously, one by one, Nexenta services would start to drop as resources were exhausted.

First the gui (NMV it’s called). Then SSH. And finally, sometimes the console itself would lock up (NMC).

But never iSCSI, the disk subsystem, the ARC or L2ARC…those pieces never dropped.

Now to be fair, the GUI, SSH & console services never really turned back on either….you might end up with a durable storage system you couldn’t interact with at all until hard reset, but at least the LUNs stayed online.

This BSD box, in contrast, kills the most important service I’m running on it, but has the courtesy to admit to it and doesn’t make me get up out of my seat: the GUI, SSH, and all other processes are running fine, I’ve instantly identified the problem, and I will engineer against it.

One model is resilient, bending but not breaking; the other is durable up to a point, and then it just snaps.

Which model is better for a given application?

Fail File Lesson #1: It’s just as important to understand how things fail as it is to understand why they fail, so that you can properly engineer against it. I never thought inadequate swap space would result in a homicidal kernel gunning for the most important service on the box…now I know.

Labworks #1: Building a durable, performance-oriented ZFS box for Hyper-V, VMware

Welcome to my first Labworks post in which I test, build & validate a ZFS storage solution for my home Hyper-V & VMware lab.

Be sure to check out the followup lab posts on this same topic in the table below!

[table]

Labworks Chapter, Section, Subject, Title & URL

Labworks 1:, 1, Storage, Building a Durable and Performance-Oriented ZFS Box for Hyper-V & VMware

,2-3, Storage, I Heart the ARC & Let’s Pull Some Drives!

[/table]

Labworks #1: Building a durable, performance-oriented ZFS box for Hyper-V, VMware

Primary Goal: To build a durable and performance-oriented storage array using Sun’s fantastic, 128-bit, high-integrity Zettabyte File System for use with lab Hyper-V CSVs & Windows clusters, VMware ESXi 5.5, and other hypervisors.

 

The ARC: My RAM makes your SSD look like a couple of old, wheezing 15k drives

Secondary Goal: Leverage consumer-grade SSDs to increase/multiply performance by using them as ZFS Intent Log (ZIL) write-cache and L2ARC read cache

Bonus: The Windows 7 PC in the living room, which runs Windows Media Center with CableCARD & HDHomeRun, was running out of DVR disk space; it can’t record to SMB shares, but it can record to iSCSI LUNs.

Technologies used: iSCSI, MPIO, LACP, Jumbo Frames, IOMETER, SQLIO, ATTO, Robocopy, CrystalDiskMark, FreeBSD, NAS4Free, Windows Server 2012 R2, Hyper-V 3.0, Converged switch, VMware, standard switch, Cisco SG300

Schematic: 

Click for larger.

Hardware Notes:
[table]
System, Motherboard, Class, CPU, RAM, NIC, Hypervisor
Node-1, Asus Z87-K, Consumer, Haswell i-5, 24GB, 2x1GbE Intel I305, Hyper-V
Node-2, Biostar HZZMU3, Consumer, Ivy Bridge i-7, 24GB, 2x1GbE Broadcom BC5709C, Hyper-V
Node-3, MSI 760GM-P23, Consumer, AMD FX-6300, 16GB, 2x1GbE Intel i305, ESXi 5.5
san2, Gigabyte GA-F2A88XM-D3H, Consumer, AMD A8-5500, 24GB, 4x1GbE Broadcom BC5709C, NAS4Free
sw01, Cisco SG300-10 Port, Small Business, n/a, n/a, 10x1GbE, n/a
[/table]

Array Setup:

I picked the Gigabyte board above because it’s got an outstanding eight SATA 6Gbit ports, all running on the native AMD A88x Bolton-D4 chipset, which, it turns out, isn’t supported well in Illumos (see Lab Notes below).

I added to that a cheap $20 Marvell 9128 two-port SATA 6Gbit PCIe card, which hosts the boot volume & the SanDisk SSD.

[table]

Disk Type, Quantity, Size, Format, Speed, Function

WD Red 2.5″ with NASWARE, 6, 1TB, 4KB AF, SATA 3 5400RPM, Zpool Members

Samsung 840 EVO SSD, 1, 128GB, 512byte, 250MB/read, L2ARC Read Cache

SanDisk Ultra Plus II SSD, 1, 128GB, 512byte, 250MB/read & 250MB/write?, ZIL

Seagate 2.5″ Momentus, 1, 500GB, 512byte, 80MB/r/w, Boot/swap/system

[/table]

Performance Tests:

I’m not finished with all the benchmarking, which is notoriously difficult to get right, but here’s a taste. Expect a followup soon.

All shots below involved lzp2 compression on SAN2

SQLIO Short Test: 

sqlio lab 1 short test
Obviously seeing the benefit of ZFS compression & ARC at the front end. IOPS become more realistic toward the middle and right as read cache is exhausted. Consistently around 150-240MB/s though, the limit of two 1GbE cables.

 

ATTO standard run:

atto
I’ve got a big write problem somewhere. Is it the ZIL, which doesn’t seem to be performing under BSD as it did under Nexenta? Something else? It could also be related to the Test Volume being formatted NTFS 64KB. Still trying to figure it out.

 

NFS Tests:

None so far. From a VMware perspective, I want to rebuild the Standard switch as a Distributed Switch now that I’ve got a vCenter appliance running. But that’s not my priority at the moment.

Durability Tests:

Pulled two drives -the limit on RAIDZ2- under normal conditions. Put them back in, saw some alerts about the “administrator pulling drives” and the Zpool being in a degraded state. My CSVs remained online, however. Following a short zpool online command, both drives rejoined the pool and the degraded error went away.

Fun shots:

Because it’s not all about repeatable lab experiments. Here’s a Gifcam shot from Node-1 as it completely saturates both 2x1GbE Intel NICs:

test

and some pretty blinking lights from the six 2.5″ drives:

0303141929-MOTION

Lab notes & Lessons Learned:

First off, I’d like to buy a beer for the unknown technology enthusiast/lab guy who uttered these sage words of wisdom, which I failed to heed:

You buy cheap, you buy twice

Listen to that man, would you? Because going consumer, while tempting, is not smart. Learn from my mistakes: if you have to buy, buy server boards.

Secondly, I prefer NexentaStor to NAS4Free with ZFS, but like others, I worry about and have been stung by OpenSolaris/Illumos hardware support. Most of that is my own fault, cf. the note above, but still: does Illumos have a future? I’m hopeful: NexentaStor is going to appear at next month’s Storage Field Day 5, so that’s a good sign, and version 4.0 is due out anytime.

The Illumos/Nexenta command structure is much more intuitive to me than FreeBSD. In place of your favorite *nix commands, Nexenta employs some great, verb-noun show commands, and dtrace, the excellent diagnostic/performance tool included in Solaris is baked right into Nexenta. In NAS4Free/FreeBSD 9.1, you’ve got to add a few packages to get the equivalent stats for the ARC, L2ARC and ZFS, and adding dtrace involves a make & kernel modification, something I haven’t been brave enough to try yet.

Next: Jumbo Frames for the win. From Node-1, the desktop in my office, my Core i5-4670K CPU would regularly hit 35-50% utilization during my standard SQLIO benchmark before I configured jumbo frames from end to end. Now, after enabling jumbo frames on the Intel NICs, the Hyper-V converged switch, the SG-300, and the ZFS box, utilization peaks at 15-20% during the same SQLIO test, and the benchmarks have shown an increase as well. Unfortunately, in FreeBSD world, adding jumbo frames is something you have to do on the interface & routing table, and it doesn’t persist across reboots for me, though that may be due to a driver issue on the Broadcom card.

The Western Digital 2.5″ drives aren’t stellar performers and they aren’t cheap, but boy are they quiet, well-built, and cool-running, asking politely for only 1 watt under load. I’ve returned the hot, loud & failure-prone HGST 3.5″ 2TB drives I borrowed from work; it’s too hard to put them in a chassis that’s short-depth.

Lastly, ZFS’ adaptive replacement cache, which I’ve enthused over a lot in recent weeks, is quite the value & performance multiplier. I’ve tested Windows Server 2012 R2 Storage Spaces’ tiered storage model, and while I was impressed with its responsiveness, ReFS, and ability to pool storage in interesting ways, nothing can compete with ZFS’ ARC model. It’s simply awesome; deceptively simple, but awesome.

The lesson is that if you’re going to lose an entire box to storage in your lab, your chosen storage system had better use every last ounce of that box, including its RAM, to serve storage up to you. 2012 R2 doesn’t, but I’m hopeful that soon it may (Update 1 perhaps?).

Here’s a cool screenshot from Nexenta, my last build before I re-did everything, showing ARC hits following a cold boot of the array (top), and a few days later, when things are really cooking for my stored Hyper-V VMs, which are getting tagged with ZFS’ “Most Frequently Used” category and thus getting the benefit of fast RAM & L2ARC:

cache

Next Steps:

  • Find out why my writes suck so bad.
  • Test Nas4Free’s NFS performance
  • Test SMB 3.0 from a virtual machine inside the ZFS box
  • Sell some stuff so I can buy a proper SLC SSD drive for the ZIL
  • Re-build the rookie Standard Switch into a true Distributed Switch in ESXi

Links/Knowledge/Required Reading Used in this Post:

[table]
Resource, Author, Summary
Three Example Home Lab Storage Designs using SSDs and Spinning Disk, Chris Wahl, Good piece on different lab storage models
ZFS, Wikipedia, Great overview of ZFS history and features
Activity of the ZFS Arc, Brendan Gregg, Excellent overview of ZFS’ RAM-as-cache
Hybrid Storage Pool Performance, Brendan Gregg, Details ZFS performance
FreeBSD Jumbo Frames, NixCraft, Applying MTU correctly
Hyper-V vEthernet Jumbo Frames, Darryl Van der Peijl, Great little powershell script to keep you out of regedit
Nexenta Community Edition 3.1.5, NexentaStor, My personal preference for a Solaris-derived ZFS box
Nas4Free, Nas4Free.org, FreeBSD-based ZFS; works with more hardware
[/table]