Labworks 1:4-7 – The Last Word in ZFS Labworks

Greetings to you Labworks readers, consumers, and conversationalists. Welcome to the last verse of Labworks Chapter 1, which has been all about building a durable and performance-oriented ZFS storage array for Hyper-V and/or VMware.

Let’s review where we’ve been:

[table]

Labworks Chapter, Verse, Subject, Title & URL

Labworks 1:, 1, Storage, Building a Durable and Performance-Oriented ZFS Box for Hyper-V & VMware

,2-3, Storage, I Heart the ARC & Let’s Pull Some Drives!

[/table]

Today we’re going to circle back to the very end of Labworks 1:1, where I assigned myself some homework: find out why my writes suck so bad. We’re going to talk about a man named ZIL and his sidekick the SLOG and then we’re going to check out some Excel charts and finish by considering ZFS’ sync models.

But first, some housekeeping: SAN2, the ZFS box, has undergone minor modification. You can find the current array setup below. Also, I have a new switch in the Daisetta Lab, and as switching is intimately tied to storage networking & performance, it’s important I detail a little bit about it.

Labworks 1:4 – Small Business SG300 vs Catalyst 2960S

Cisco’s SG-300 & SG-500 series switches are getting some pretty good reviews, especially in a home lab context. I’ve got an SG-300 and really like it as it offers a solid spectrum of switching options at Layer 2 as well as a nice Layer 3-lite mode, all for a tick under $200. It even has a real web interface if you’re CLI-shy, which

Small Business Cisco != Linksys

I’m not, but some folks are.

Sadly for me & the Daisetta Lab, I need more ports than my little SG-300 has to offer. So I’ve removed it from my rack and swapped it for a 2960S-48TS-L from the office, but not just any 2960S.

No, I have spiritual & emotional ties to this 2960S, this exact one. It’s the same 2960S I used in my January storage bakeoff of a Nimble array, the same 2960S on which I broke my Hyper-V & VMware cherry in those painful early days of virtualization. Yes, this five-year-old switch is now in my lab:

The pride of Cisco’s 2009 Desktop Switching series, the 2960s

Sure, it’s not a storage switch; in fact, it’s meant for IDFs and end-users, and if the guys on that great storage networking podcast from a few weeks back knew I was using this as a storage switch, I’d be finished in this industry for good.

But I love this switch and I’m glad it’s at the top of my rack. I saved 1U, the energy costs of this switch vs two smaller ones are probably a wash, and though I lost Layer 3 Lite, I gained so much more: 48 x 1GbE ports and full LAN-licensed Cisco IOS 15.2, which, agnostic computing goals aside for a moment, just feels so right and so good.

And with the increased amount of full-featured switch ports available to me, I’ve now got LACP teams of three on agnostic_node_1 & 2, jumbo frames from end to end, and the same VLAN layout.
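
For the curious, here’s roughly what that looks like on the 2960S side. This is a minimal sketch, with hypothetical interface numbers and VLAN IDs rather than my actual running-config:

    ! Jumbo frames are set globally on the 2960S (takes effect after a reload)
    system mtu jumbo 9000
    !
    ! Three member ports bundled into one LACP team for agnostic_node_1
    interface range GigabitEthernet1/0/1 - 3
     description agnostic_node_1 LACP team
     switchport mode trunk
     switchport trunk allowed vlan 10,20,30
     channel-group 1 mode active
    !
    interface Port-channel1
     switchport mode trunk
     switchport trunk allowed vlan 10,20,30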

Here’s the updated Labworks schematic and the disk layout for SAN2:

Lab 1-4-5 - Daisetta Labs

[table]

Disk Type, Quantity, Size, Format, Speed, Function

WD Red 2.5″ with NASWARE, 6, 1TB, 4KB AF, SATA 3 5400RPM, Zpool Members

Samsung 840 EVO SSD, 1, 128GB, 512byte, SATA 3, L2ARC Read Cache

Samsung 830 SSD, 1, 128GB, 512byte, SATA 3, L2ARC Read Cache

Seagate 2.5″ Momentus, 1, 500GB, 512byte, 80MB/r/w, Boot/swap/system

[/table]

Labworks 1:5 – A Man named ZIL and his sidekick, the SLOG

Labworks 1:1 was all about building durable & performance-oriented storage for Hyper-V & VMware. And one of the unresolved questions I aimed to solve out of that post was my poor write performance.

Review the hardware table and you’ll feel like I felt. I got me some SSD and some RAM, I provisioned a ZIL, so write-cache that inbound IO already, ZFS, amiright? Show me the IOPS money, Jerry!

Well, about that. I mischaracterized the ZIL and I apologize to readers for the error. Let’s just get this out of the way: The ZFS Intent Log (ZIL) is not a write-cache device as I implied in Labworks 1:1.

ZFS storage layout in excellent Good/Better/Best format courtesy of Nexenta, which has some outstanding documentation & guides

The ZIL, whether spread out among your rotational disks by ZFS design, or applied to a Separate Log Device (a SLOG), is simply a synchronous-write mechanism: a log designed to ensure data integrity and report back to the application layer (the IO ACK) that its writes are safe on stable storage. The ZIL & SLOG are also disaster-recovery mechanisms; in the event of power loss, the ZIL, or the ZIL functioning on a SLOG device, will ensure that the writes it logged prior to the event are committed to your spinners when your disks are back online.
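
If you do decide to hang the ZIL off a dedicated device, the mechanics are trivially simple. A minimal sketch, assuming a pool named tank and spare SSDs at ada7/ada8 (your pool and device names will differ):

    # Attach a separate log device (SLOG) to an existing pool
    zpool add tank log ada7

    # Or mirror the SLOG across two SSDs if you're worried about losing it
    zpool add tank log mirror ada7 ada8

    # The log vdev now shows up under a "logs" section
    zpool status tank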

Now there seem to be some differences in how the various implementations of ZFS look at the ZIL/SLOG mechanism.

Nexenta Community Edition, based on Illumos, the open-source descendant of Sun’s Solaris, says your SLOG should just be a write-optimized SSD, but even that’s more best practice than hard & fast requirement. Nexenta touts the ZIL/SLOG as a performance multiplier, and their excellent documentation has helpful charts and graphics reinforcing that.

In contrast, the most popular FreeBSD ZFS implementation’s documentation paints a dedicated SLOG as likely more trouble than it’s worth. FreeNAS actively discourages you from provisioning a SLOG unless it’s enterprise-grade, accurately pointing out that the ZIL & a SLOG device aren’t write-cache and probably won’t make your writes faster anyway, unless you’re NFS-focused (which I’m proudly, defiantly even, not) or operating a large database at scale.

ZIL me

What’s to account for the difference in documentation & best practice guides? I’m not sure; some of it’s probably related to *BSD vs Illumos implementations of ZFS, some of it’s probably related to different audiences & users of the free tier of these storage systems.

The question for us here is this: will you benefit from provisioning a SLOG device if you build a ZFS box serving iSCSI storage for Hyper-V and VMware?

I hate sounding like a waffling storage VAR here, but I will: it depends. I’ve run both Nexenta and NAS4Free; when I ran Nexenta, I saw my SLOG being used during random & synchronous write operations. In NAS4Free, the SSD I had dedicated as a SLOG never showed any activity in zfs-stats, gstat, or any other disk IO tool I could find.
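
If you want to check whether your SLOG is doing anything at all, zpool iostat breaks activity out per vdev, and the log device gets its own line. A quick sketch, again assuming a pool named tank:

    # Per-vdev IOPS & bandwidth, refreshed every 5 seconds; watch the
    # device listed under "logs" while you throw sync writes at the pool
    zpool iostat -v tank 5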

One could spend weeks of valuable lab time verifying under which conditions a dedicated SLOG device adds performance to your storage array, but I decided to cut bait. Check out some of the links at the bottom for more color on this, but in the meantime, let me leave you with this advice: if you have $80 to spend on your FreeBSD-based ZFS storage, buy an extra 8GB of RAM rather than a tiny, used SLC or MLC device to function as your SLOG. You will almost certainly get more performance out of a larger ARC than by dedicating a disk as your SLOG.
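
Before you spend that $80, it’s worth checking how your ARC is doing today. On FreeBSD the raw counters live in sysctl; the kstat names below are what I’d expect on a 9.x box, so verify them on your build:

    # Current ARC size and its ceiling, in bytes
    sysctl kstat.zfs.misc.arcstats.size vfs.zfs.arc_max

    # Hits vs misses; a poor hit rate is a hint that more RAM will help
    sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses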

Labworks 1:6 – Great…so, again, why do my writes suck? 

Recall this SQLIO test from Labworks 1:1:

sqlio lab 1 short test

As you can see, read or write, I was hitting a wall at around 235-240 megabytes per second during much of “Short Test”, which is pretty close to the theoretical limit of an LACP team of two GigE NICs (2 x ~125MB/s = 250MB/s).

But as I said above, we don’t have that limit anymore. Whereas there were once 2x1GbE Teams, there are now 3x1GbE. Let’s see what the same test on the same 4KB block/4KB NTFS volume yields now.
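
For reference, the “Short Test” is a series of SQLIO passes along these lines. Consider this a sketch of the recipe; the target file, thread count and queue depth here are placeholders rather than my exact script:

    REM 4KB random writes: 60 seconds, 8 threads, 8 outstanding IOs, no buffering
    sqlio -kW -frandom -b4 -t8 -o8 -s60 -LS -BN T:\testfile.dat

    REM Same pattern for 64KB sequential reads
    sqlio -kR -fsequential -b64 -t8 -o8 -s60 -LS -BN T:\testfile.dat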

SQLIO short test, take two, sort by Random vs Sequential writes & reads:

labworks147

By Jove, what’s going on here? This graph was built off the same SQLIO recipe, but looks completely different from the Labworks 1:1 results. For one, the writes look much better, and the reads look much worse. Yet step back and the patterns are largely the same.

It’s data like this that makes benchmarking, validating & ultimately purchasing storage so tricky. Some would argue with my reliance on SQLIO and those arguments have merit, but I feel SQLIO, which is easy to script/run and automate, can give you some valuable hints into the characteristics of an array you’re considering.

Let’s look at the writes question specifically.

Am I really writing 350MB/s to SAN2?

storagenetworkingforthewin

On the one hand, everything I’m looking at says YES: I am a Storage God and I have achieved #StorageGlory inside the humble Daisetta Lab HQ on consumer-level hardware:

  • SAN2 is showing about 115MB/s to each Broadcom interface during the 32KB & 64KB samples
  • Agnostic_Node_1 perfmon shows about the same amount of traffic egressing the three vEthernet adapters
  • The 2960S is reflecting all that traffic; I’m definitely pushing about 350 megabytes per second to SAN2, and interface Port-channel 3 shows a TX load of 219 out of 255, pretty much maxing out my LACP team

On the other hand, I am just an IT Mortal and something bothers me:

  • CPU is very high on SAN2 during the 32KB & 64KB runs…so busy it seems like the little AMD CPU is responsible for some of the good performance marks
  • While I’m a fan of the itsy-bitsy 2.5″ Western Digital RED 1TB drives in SAN2, under no theoretical IOPS model is it likely that six of them, in RAIDZ-2 (RAID 6 equivalent), can achieve 5,000-10,000 IOPS under traditional storage principles. Each drive by itself is capable of only 75-90 IOPS
  • If something is too good to be true, it probably is

49286241

Sr. Storage Engineer Neo feels really frustrated at this point; he can’t figure out why his writes suck, or even if they suck, and so he wanders up to the Oracle to get her take on the situation and comes across this strange Buddha Storage kid.

Labworks 1:7 – The Essence of ZFS & New Storage model

In effect, what we see here is just a sample of the technology & techniques that have been disrupting the storage market for several years now: compression & caching multiply the performance of storage systems beyond what they should be capable of, in certain scenarios.

As the chart above shows, the test2 volume is compressed by SAN2 using lzjb. On top of that, we’ve got the ZFS ARC, L2ARC, and the ZIL in the mix. And then, to make things even more complicated, we have some sync policies ZFS allows us to toggle. They look like this:

sync policy

The sync toggle documentation is out there and you should understand it; it is crucial to understanding ZFS, but I want to demonstrate the choices as well.

I’ve got three choices + the compression options. Which one of these combinations is going to give me the best performance & durability for my Hyper-V VMs?
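
Here’s how those knobs get flipped on the zvol itself. A minimal sketch, assuming the zvol lives at Alpha-Pool/Test2 as shown in the gstat shots further down:

    # The three sync choices
    zfs set sync=standard Alpha-Pool/Test2   # default: honor the application's sync requests
    zfs set sync=always   Alpha-Pool/Test2   # treat every write as synchronous
    zfs set sync=disabled Alpha-Pool/Test2   # ignore sync requests entirely (fast & scary)

    # And the compression options tested below
    zfs set compression=lzjb Alpha-Pool/Test2
    zfs set compression=lz4 Alpha-Pool/Test2
    zfs set compression=gzip-9 Alpha-Pool/Test2

    # Verify what's currently in effect
    zfs get sync,compression Alpha-Pool/Test2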

SQLIO Short Test Runs 3-6, all PivotTabled up for your enjoyment and ease of digestion:

compressionsync

As is usually the case in storage, IT, and hell, life in general, there are no free lunches here, people. This graph tells you what you already know in your heart: the safest storage policy in ZFS-land (Always Sync, that is to say, commit writes to the rotationals post-haste as if it were the last day on earth) is also the slowest. Nearly 20 seconds of latency as I force ZFS to commit everything I send it immediately (vs flushing it later), which it struggles to do at a measly average speed of 4.4 megabytes/second.

Compression-wise, I thought I’d see a big difference between the various compression schemes, but I don’t. Lzjb, lz4, and the ultra-space-saving/high-CPU-cost gzip-9 all turn in about equal results from an IOPS & performance perspective. It’s almost a wash, really, and that’s likely because of the predictable nature of the IO SQLIO is generating.

Labworks 1:Epilogue

Last point: ZFS, as Chris Wahl pointed out, is a sort of virtualization layer atop your storage. Now if you’re a virtualization guy like me or Wahl, that’s easy to grasp; Windows 2012 R2’s Storage Spaces concept is similar in function.

But sometimes in virtualization, it’s good to peel away the abstraction onion and watch what that looks like in practice. ZFS has a number of tools and monitors that look at your Zpool IO, but to really see how ZFS works, I advise you to run gstat. gstat shows what your disks are doing, and if you’ve carefully set up your environment, you ought to be able to see the effects of your settings on each individual spindle.
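
A sketch of the invocation I mean, filtering the output down to the spinners and the zvol so the screen isn’t drowned in device noise (the filter regex is just an example; match it to your own device names):

    # Refresh every second; show only the WD spinners and the test zvol
    gstat -I 1s -f 'ada[0-5]|zvol'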

In this Gifcam, watch ada0-5 (the Western Digitals) as they struggle under load with the “Always Sync” option enabled. Notice that the zvol/Alpha-Pool/Test2 volume (the logical volume construct) is at 100% busy and the ops/s are not very stellar.

Now look at this gstat sample. Under SQLIO load, the zvol is showing 10,000 IOPS, 300+MB/s. But ada0-5, the physical drives, aren’t doing squat for several seconds at a time as SAN2 absorbs & processes all the IO coming at it.

That, friends, is the essence of ZFS.

 Links/Knowledge/Required Reading Used in this Post:

[table]
Resource, Author, Summary

Nexenta’s awesome whitepapers and guides, Nexenta, Find ’em and collect ’em; good stuff on MPIO config and ZFS performance

Comparing SSD vs NoSSD in Nexenta w/NFS, Larry Smith, A fellow ZFS fan with more focus on NFS & VMware

Get the Most out of ZFS SSD, Sebastian “vBagpipes” Laubscher, Sebastian finds a different way to provision the ZIL/SLOG

Nexenta & Scale, Hans DeLeenHeer, Fellow #TFD delegate looks at ZFS tiers in superhero context

SLOG/ZIL Insight, FreeNAS forum, Great forum-focused post on SLOG/ZIL in BSD ZFS

SLOG Blog, Oracle, 2007 post about the ZIL & SLOG heralding storage di

 Zpool and ZIL management, Magnus Strahlert, Excellent how-to guide for ZIL/L2ARC provisioning

[/table]

 

Fail File: SAN down! SAN down! All Nodes respond

Introducing Fail File #1, where I admit to screwing something up and reflect on what I’ve learned

SAN2.daisettalabs.net, the NAS4Free server I built to simulate some of the functions I perform at work with big boy SANs, crashed last night.

Or, to put it another way, I pushed that little AMD-powered, FreeBSD-running, Broadcom-connected, ZFS-flavored franken-array to the breaking point:

Untitled picture
Love the directness of BSD. The iSCSI Target process was killed in cold blood, resulting in the death of several child partitions. What’s more, in just a few words, I have the suspect (the kernel), the motive (swap space) & the victim (iSCSI). Windows would have said, “The service terminated unexpectedly…error 0x081942ad-SOL”

 

Such are the perils of concentrated block storage, amiright? Instantly my Hyper-V Cluster Shared Volumes + the 8 or 9 VMs inside them dropped:

csvs

So what happened here?

I failed to grok the grub or fsck the fdisk or something and gave BSD an inadequate amount of swap space on the root 10GB partition slice. Then I lobbed some iSCSI packets its way from multiple sources and the kernel, starved for resources (because I’m using about 95% of my RAM for the ARC), decided to kill istgt, the iSCSI target service.

Thinking back to the winter, when I ran Nexenta (derived from Sun’s Solaris, not BSD-based), the failure sequence was different, but I’m not sure it was better.

When I was pounding the Nexenta SAN2 back in the winter, volleying 175,000+ iSCSI packets per second its way onto hardware that was even more ghetto, Nexenta did what any good human engineer does: compensate for the operator’s errors & abuses.

It was kind of neat to see. Whether I was running SQLIO simulations, an iometer run, robocopy or eseutil, or just turning on a bunch of VMs simultaneously, one by one, Nexenta services would start to drop as resources were exhausted.

First the gui (NMV it’s called). Then SSH. And finally, sometimes the console itself would lock up (NMC).

But never iSCSI, the disk subsystem, the ARC or L2ARC…those pieces never dropped.

Now to be fair, the GUI, SSH & console services never really turned back on either…you might end up with a durable storage system you couldn’t interact with at all until a hard reset, but at least the LUNs stayed online.

This BSD box, in contrast, kills the most important service I’m running on it, but it has the courtesy to admit to it and doesn’t make me get up out of my seat: the GUI, SSH & all other processes are running fine, and I’ve instantly identified the problem and will engineer against it.

One model is resilient, bending but not breaking; the other is durable up to a point, and then it just snaps.

Which model is better for a given application?

Fail File Lesson #1: It’s just as important to understand how things fail as it is to understand why they fail, so that you can properly engineer against it. I never thought inadequate swap space would result in a homicidal kernel gunning for the most important service on the box…now I know.
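
For the record, engineering against it amounts to a couple of lines. A sketch of what I mean; the 4GB cap is an arbitrary example, not a recommendation:

    # /boot/loader.conf -- cap the ARC instead of letting it eat ~95% of RAM,
    # so the kernel keeps some breathing room for everything else
    vfs.zfs.arc_max="4G"

    # And sanity-check that swap actually exists and is sized sensibly
    swapinfo -h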

Labworks #1: Building a durable, performance-oriented ZFS box for Hyper-V, VMware

Welcome to my first Labworks post in which I test, build & validate a ZFS storage solution for my home Hyper-V & VMware lab.

Be sure to check out the followup lab posts on this same topic in the table below!

[table]

Labworks Chapter, Section, Subject, Title & URL

Labworks 1:, 1, Storage, Building a Durable and Performance-Oriented ZFS Box for Hyper-V & VMware

,2-3, Storage, I Heart the ARC & Let’s Pull Some Drives!

[/table]

Labworks #1: Building a durable, performance-oriented ZFS box for Hyper-V, VMware

Primary Goal: To build a durable and performance-oriented storage array using Sun’s fantastic, 128-bit, high-integrity Zettabyte File System for use with lab Hyper-V CSVs & Windows clusters, VMware ESXi 5.5, and other hypervisors.

 

The ARC: My RAM makes your SSD look like a couple of old, wheezing 15k drives

Secondary Goal: Leverage consumer-grade SSDs to increase/multiply performance by using them as ZFS Intent Log (ZIL) write-cache and L2ARC read cache

Bonus: The Windows 7 PC in the living room that’s running Windows Media Center with CableCARD & HDHomeRun is running out of DVR disk space; it can’t record to SMB shares but can record to iSCSI LUNs.

Technologies used: iSCSI, MPIO, LACP, Jumbo Frames, IOMETER, SQLIO, ATTO, Robocopy, CrystalDiskMark, FreeBSD, NAS4Free, Windows Server 2012 R2, Hyper-V 3.0, Converged switch, VMware, standard switch, Cisco SG300

Schematic: 

Click for larger.

Hardware Notes:
[table]
System, Motherboard, Class, CPU, RAM, NIC, Hypervisor
Node-1, Asus Z87-K, Consumer, Haswell i-5, 24GB, 2x1GbE Intel I305, Hyper-V
Node-2, Biostar HZZMU3, Consumer, Ivy Bridge i-7, 24GB, 2x1GbE Broadcom BC5709C, Hyper-V
Node-3, MSI 760GM-P23, Consumer, AMD FX-6300, 16GB, 2x1GbE Intel i305, ESXi 5.5
san2, Gigabyte GA-F2A88XM-D3H, Consumer, AMD A8-5500, 24GB, 4x1GbE Broadcom BC5709C, NAS4Free
sw01, Cisco SG300-10 Port, Small Business, n/a, n/a, 10x1GbE, n/a
[/table]

Array Setup:

I picked the Gigabyte board above because it’s got an outstanding eight SATA 6Gbit ports, all running on the native AMD A88x Bolton-D4 chipset, which, it turns out, isn’t supported well in Illumos (see Lab Notes below).

I added to that a cheap $20 Marvell 9128se two-port SATA 6Gbit PCIe card, which hosts the boot volume & the SanDisk SSD.

[table]

Disk Type, Quantity, Size, Format, Speed, Function

WD Red 2.5″ with NASWARE, 6, 1TB, 4KB AF, SATA 3 5400RPM, Zpool Members

Samsung 840 EVO SSD, 1, 128GB, 512byte, 250MB/read, L2ARC Read Cache

SanDisk Ultra Plus II SSD, 1, 128GB, 512byte, 250MB/read & 250MB/write?, ZIL

Seagate 2.5″ Momentus, 1, 500GB, 512byte, 80MB/r/w, Boot/swap/system

[/table]

Performance Tests:

I’m not finished with all the benchmarking, which is notoriously difficult to get right, but here’s a taste. Expect a followup soon.

All shots below involved lzjb compression on SAN2

SQLIO Short Test: 

sqlio lab 1 short test
Obviously seeing the benefit of ZFS compression & ARC at the front end. IOPS become more realistic toward the middle and right as read cache is exhausted. Consistently around 150-240MB/s though, the limit of two 1GbE cables.

 

ATTO standard run:

atto
I’ve got a big write problem somewhere. Is it the ZIL, which doesn’t seem to be performing under BSD as it did under Nexenta? Something else? It could also be related to the Test Volume being formatted NTFS 64KB. Still trying to figure it out.

 

NFS Tests:

None so far. From a VMware perspective, I want to rebuild the standard switch as a distributed switch now that I’ve got a vCenter appliance running. But that’s not my priority at the moment.

Durability Tests:

Pulled two drives (the limit on RAIDZ2) under normal conditions. Put them back in, saw some alerts about the “administrator pulling drives” and the Zpool being in a degraded state. My CSVs remained online, however. Following a short zpool online command, both drives rejoined the pool and the degraded error went away.
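
The recovery steps, for reference; a minimal sketch assuming the pool is named Alpha-Pool and the reinserted disks came back as ada1 and ada2 (yours will differ):

    # See which members are REMOVED/FAULTED and why
    zpool status -v Alpha-Pool

    # Tell ZFS the reinserted disks are back in play
    zpool online Alpha-Pool ada1
    zpool online Alpha-Pool ada2

    # Watch the resilver finish and the DEGRADED state clear
    zpool status Alpha-Pool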

Fun shots:

Because it’s not all about repeatable lab experiments. Here’s a Gifcam shot from Node-1 as it completely saturates its 2x1GbE Intel NICs:

test

and some pretty blinking lights from the six 2.5″ drives:

0303141929-MOTION

Lab notes & Lessons Learned:

First off, I’d like to buy a beer for the unknown technology enthusiast/lab guy who uttered these sage words of wisdom, which I failed to heed:

You buy cheap, you buy twice

Listen to that man, would you? Because going consumer, while tempting, is not smart. Learn from my mistakes: if you have to buy, buy server boards.

Secondly, I prefer NexentaStor to NAS4Free with ZFS, but like others, I worry about and have been stung by OpenSolaris/Illumos hardware support. Most of that is my own fault, cf. the note above, but still: does Illumos have a future? I’m hopeful: NexentaStor is going to appear at next month’s Storage Field Day 5, so that’s a good sign, and version 4.0 is due out anytime.

The Illumos/Nexenta command structure is much more intuitive to me than FreeBSD’s. In place of your favorite *nix commands, Nexenta employs some great verb-noun show commands, and DTrace, the excellent diagnostic/performance tool included in Solaris, is baked right into Nexenta. In NAS4Free/FreeBSD 9.1, you’ve got to add a few packages to get the equivalent stats for the ARC, L2ARC and ZFS, and adding DTrace involves a make & kernel modification, something I haven’t been brave enough to try yet.
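
For the curious, the handful of packages I mean. A sketch assuming the FreeBSD 9.1-era package name I recall (the port lives in sysutils/zfs-stats), so treat the exact invocation as approximate:

    # Old-style package tools on FreeBSD 9.1 / NAS4Free
    pkg_add -r zfs-stats

    # ARC, L2ARC and overall summaries
    zfs-stats -A
    zfs-stats -L
    zfs-stats -a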

Next: Jumbo Frames for the win. From Node-1, the desktop in my office, my Core i5-4670K CPU would regularly hit 35-50% utilization during my standard SQLIO benchmark before I configured jumbo frames from end to end. Now, after enabling jumbo frames on the Intel NICs, the Hyper-V converged switch, the SG-300 and the ZFS box, utilization peaks at 15-20% during the same SQLIO test, and the benchmarks have shown an increase as well. Unfortunately, in FreeBSD world, adding jumbo frames is something you have to do on the interface & routing table, and it doesn’t persist across reboots for me, though that may be due to a driver issue on the Broadcom card.
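
On a stock FreeBSD box the persistence problem is normally solved in /etc/rc.conf; a sketch, assuming the Broadcom shows up as bce0 and using an example address (NAS4Free manages its config its own way, so this may not map cleanly):

    # /etc/rc.conf -- set the MTU as part of the interface line so it survives reboots
    ifconfig_bce0="inet 192.168.1.50 netmask 255.255.255.0 mtu 9000"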

The Western Digital 2.5″ drives aren’t stellar performers and they aren’t cheap, but boy are they quiet, well-built, and cool-running, asking politely for only 1 watt under load. I’ve returned the hot, loud & failure-prone HGST 3.5″ 2TB drives I borrowed from work; it’s too hard to fit them in a short-depth chassis.

Lastly, ZFS’ adaptive replacement cache, which I’ve enthused over a lot in recent weeks, is quite the value & performance multiplier. I’ve tested Windows Server 2012 R2 Storage Spaces’ tiered storage model, and while I was impressed with its responsiveness, ReFS, and ability to pool storage in interesting ways, nothing can compete with ZFS’ ARC model. It’s simply awesome; deceptively simple, but awesome.

The lesson is that if you’re going to lose an entire box to storage in your lab, your chosen storage system had better use every last ounce of that box, including its RAM, to serve storage up to you. 2012 R2 doesn’t, but I’m hopeful that soon it may (Update 1, perhaps?).

Here’s a cool screenshot from Nexenta, my last build before I re-did everything, showing ARC hits following a cold boot of the array (top), and a few days later, when things are really cooking for my stored Hyper-V VMs, which are getting tagged with ZFS’ “Most Frequently Used” category and thus getting the benefit of fast RAM & L2ARC:

cache

Next Steps:

  • Find out why my writes suck so bad.
  • Test Nas4Free’s NFS performance
  • Test SMB 3.0 from a virtual machine inside the ZFS box
  • Sell some stuff so I can buy a proper SLC SSD drive for the ZIL
  • Re-build the rookie Standard Switch into a true Distributed Switch in ESXi

Links/Knowledge/Required Reading Used in this Post:

[table]
Resource, Author, Summary
Three Example Home Lab Storage Designs using SSDs and Spinning Disk, Chris Wahl, Good piece on different lab storage models
ZFS, Wikipedia, Great overview of ZFS history and features
Activity of the ZFS Arc, Brendan Gregg, Excellent overview of ZFS’ RAM-as-cache
Hybrid Storage Pool Performance, Brendan Gregg, Details ZFS performance
FreeBSD Jumbo Frames, NixCraft, Applying MTU correctly
Hyper-V vEthernet Jumbo Frames, Darryl Van der Peijl, Great little powershell script to keep you out of regedit
Nexenta Community Edition 3.1.5, NexentaStor, My personal preference for a Solaris-derived ZFS box
Nas4Free, Nas4Free.org, FreeBSD-based ZFS; works with more hardware
[/table]