Labworks 1:4-7 – The Last Word in ZFS Labworks

Greetings to you Labworks readers, consumers, and conversationalists. Welcome to the last  verse of Labworks Chapter 1, which has been all about building a durable and performance-oriented ZFS storage array for Hyper-V and/or VMware.

Let’s review where we’ve been:

[table]

Labworks Chapter, Verse, Subject, Title & URL

Labworks 1:, 1, Storage, Building a Durable and Performance-Oriented ZFS Box for Hyper-V & VMware

,2-3, StorageI Heart the ARC & Let’s Pull Some Drives!

[/table]

Today we’re going to circle back to the very end of Labworks 1:1, where I assigned myself some homework: find out why my writes suck so bad. We’re going to talk about a man named ZIL and his sidekick the SLOG and then we’re going to check out some Excel charts and finish by considering ZFS’ sync models.

But first, some housekeeping: SAN2, the ZFS box, has undergone minor modification. You can find the current array setup below. Also, I have a new switch in the Daisetta Lab, and as switching is intimately tied to storage networking & performance, it’s important I detail a little bit about it.

Labworks 1:4 – Small Business SG300 vs Catalyst 2960S

Cisco’s SG-300 & SG-500 series switches are getting some pretty good reviews, especially in a home lab context. I’ve got an SG-300 and really like it as it offers a solid spectrum of switching options at Layer 2 as well as a nice Layer 3-lite mode all for a tick under $200. It even has a real web-interface if your CLI-shy, which

Small Business Cisco != Linksys
Small Business Cisco != Linksys

I’m not but some folks are.

Sadly for me & the Daisetta Lab, I need more ports than my little SG-300 has to offer. So I’ve removed it from my rack and swapped it for a 2960S-48TS-L from the office, but not just any 2960S.

No, I have spiritual & emotional ties to this 2960s, this exact one. It’s the same 2960s I used in my January storage bakeoff of a Nimble array, the same 2960s on which I broke my Hyper-V & VMware cherry in those painful early days of virtualization, yes, this five year old switch is now in my lab:

The pride of Cisco's 2009 Desktop Switching series, the 2960s
The pride of Cisco’s 2009 Desktop Switching series, the 2960s

Sure it’s not a storage switch, in fact it’s meant for IDFs and end-users and if the guys on that great storage networking podcast from a few weeks back knew I was using this as a storage switch, I’d be finished in this industry for good.

But I love this switch and I’m glad its at the top of my rack. I saved 1U, the energy costs of this switch vs two smaller ones are probably a wash, and though I lost Layer 3 Lite, I gained so much more: 48 x 1GbE ports and full LAN-licensed Cisco IOS v 15.2, which, agnostic computing goals aside for a moment, just feels so right and so good.

And with the increased amount of full-featured switch ports available to me, I’ve now got LACP teams of three on agnostic_node_1 & 2, jumbo frames from end to end, and the same VLAN layout.

Here’s the updated Labworks schematic and the disk layout for SAN2:

Lab 1-4-5 - Daisetta Labs

[table]

Disk Type, Quantity, Size, Format, Speed, Function

WD Red 2.5″ with NASWARE, 6, 1TB, 4KB AF, SATA 3 5400RPM, Zpool Members

Samsung 840 EVO SSD, 1, 128GB, 512byte, SATA 3, L2ARC Read Cache

Samsung 830 SSD, 1, 128GB, 512byte, SATA 3, L2ARC Read Cache

Seagate 2.5″ Momentus, 1, 500GB, 512byte, 80MB/r/w, Boot/swap/system

[/table]

Labworks 1:5 – A Man named ZIL and his sidekick, the SLOG

Labworks 1:1 was all about building durable & performance-oriented storage for Hyper-V & VMware. And one of the unresolved questions I aimed to solve out of that post was my poor write performance.

Review the hardware table and you’ll feel like I felt. I got me some SSD and some RAM, I provisioned a ZIL so write-cache that inbound IO already ZFS, amiright? Show me the IOPSMoney Jerry!

Well, about that. I mischaracterized the ZIL and I apologize to readers for the error. Let’s just get this out of the way: The ZFS Intent Log (ZIL) is not a write-cache device as I implied in Labworks 1:1.

ZFS storage in excellent Good/Better/Best format
ZFS storage layout in excellent Good/Better/Best format courtesy of Nexenta, which has some outstanding documentation & guides

The ZIL, whether spread out among your rotational disks by ZFS design, or applied to a Separate Log Device (a SLOG), is simply a synchronous writes mechanism, a log designed to ensure data integrity and report (IO ACK) back to the application layer that writes are safe somewhere on your rotational media. The ZIL & SLOG are also a disaster recovery mechanisms/devices ; in the event of power-loss, the ZIL, or the ZIL functioning on a SLOG device, will ensure that the writes it logged prior to the event are written to your spinners when your disks are back online.

Now there seem to be some differences in how the various implementations of ZFS look at the ZIL/SLOG mechanism.

Nexenta Community Edition, based off Illumos which is the open source descendant of Sun’s Solaris, says your SLOG should just be a write-optimized SSD, but even that’s more best practice than hard & fast requirement. Nexenta touts the ZIL/SLOG as a performance multiplier, and their excellent documentation has helpful charts and graphics reinforcing that.

In contrast, the most popular FreeBSD ZFS implementations documentation paints the ZIL as likely more trouble than its worth. FreeNAS actively discourages you from provisioning a SLOG unless it’s enterprise-grade, accurately pointing out that the ZIL & a SLOG device aren’t write-cache and probably won’t make your writes faster anyway, unless you’re NFS-focused (which I’m proudly, defiantly even, not) or operating a large database at scale.

ZIL me

What’s to account for the difference in documentation & best practice guides? I’m not sure; some of it’s probably related to *BSD vs Illumos implementations of ZFS, some of it’s probably related to different audiences & users of the free tier of these storage systems.

The question for us here is this: Will you benefit from provisioning a SLOG device if you build a ZFS box for Hyper-V and VMWare storage for iSCSI?

I hate sounding like a waffling storage VAR here, but I will: it depends. I’ve run both Nexenta and NAS4Free; when I ran Nexenta, I saw my SLOG being used during random & synchronous write operations. In NAS4Free, the SSD I had dedicated as a SLOG never showed any activity in zfs-stats, gstat or any other IO disk tool I could find.

One could spend weeks of valuable lab time verifying under which conditions a dedicated SLOG device adds performance to your storage array, but I decided to cut bait. Check out some of the links at the bottom for more color on this, but in the meantime, let me leave you with this advice: if you have $80 to spend on your FreeBSD-based ZFS storage, buy an extra 8GB of RAM rather than a tiny, used SLC or MLC device to function as your SLOG. You will almost certainly get more performance out of a larger ARC than by dedicating a disk as your SLOG.

Labworks 1:6 – Great…so, again, why do my writes suck? 

Recall this SQLIO test from Labworks 1:1:

sqlio lab 1 short test

As you can see, read or write, I was hitting a wall at around 235-240 megabytes per second during much of “Short Test”, which is pretty close to the theoretical limit of an LACP team with two GigE NICs.

But as I said above, we don’t have that limit anymore. Whereas there were once 2x1GbE Teams, there are now 3x1GbE. Let’s see what the same test on the same 4KB block/4KB NTFS volume yields now.

SQLIO short test, take two, sort by Random vs Sequential writes & reads:

labworks147

By jove, what’s going on here? This graph was built off the same SQLIO recipe, but looks completely different than Labworks 1. For one, the writes look much better, and reads look much worse. Yet step back and the patterns are largely the same.

It’s data like this that makes benchmarking, validating & ultimately purchasing storage so tricky. Some would argue with my reliance on SQLIO and those arguments have merit, but I feel SQLIO, which is easy to script/run and automate, can give you some valuable hints into the characteristics of an array you’re considering.

Let’s look at the writes question specifically.

Am I really writing 350MB/s to SAN2?

storagenetworkingforthewinOn the one hand, everything I’m looking at says YES: I am a Storage God and I have achieved #StorageGlory inside the humble Daisetta Lab HQ on consumer-level hardware:

  • SAN2 is showing about 115MB/s to each Broadcom interface during the 32KB & 64KB samples
  • Agnostic_Node_1 perfmon shows about the same amount of traffic eggressing the three vEthernet adapters
  • The 2960S is reflecting all that traffic; I’m definitely pushing about 350 megabytes per second to SAN2; interface port channel 3 shows TX load at 219 out of 255 and maxing out my LACP team

On the other hand, I am just an IT Mortal and something bothers:

  • CPU is very high on SAN2 during the 32KB & 64KB runs…so busy it seems like the little AMD CPU is responsible for some of the good performance marks
  • While I’m a fan of the itsy-bitsy 2.5″ Western Digitial RED 1TB drives in SAN2, under no theoretical IOPS model is it likely that six of them, in RAIDZ-2 (RAID 6 equivalent) can achieve 5,000-10,000 IOPS under traditional storage principles. Each drive by itself is capable of only 75-90 IOPS
  • If something is too good to be true, it probably is

49286241Sr. Storage Engineer Neo feels really frustrated at this point; he can’t figure out why his writes suck, or even if they suck, and so he wanders up to the Oracle to get her take on the situation and comes across this strange Buddha Storage kid.

Labworks 1:7 – The Essence of ZFS & New Storage model

In effect, what we see here is is just a sample of the technology & techniques that have been disrupting the storage market for several years now: compression & caching multiply performance of storage systems beyond what they should be capable of, in certain scenarios.

As the chart above shows, the test2 volume is compressed by SAN2 using lzjb. On top of that, we’ve got the ZFS ARC, L2ARC, and the ZIL in the mix. And then, to make things even more complicated, we have some sync policies ZFS allows us to toggle. They look like this:

sync policy

The sync toggle documentation is out there and you should understand it it is crucial to understanding ZFS, but I want to demonstrate the choices as well.

I’ve got three choices + the compression options. Which one of these combinations is going to give me the best performance & durability for my Hyper-V VMs?

SQLIO Short Test Runs 3-6, all PivotTabled up for your enjoyment and ease of digestion:

compressionsync

As is usually the case in storage, IT, and hell, life in general, there are no free lunches here people. This graph tells you what you already know in your heart: the safest storage policy in ZFS-land (Always Sync, that is to say, commit writes to the rotationals post haste as if it was the last day on earth) is also the slowest. Nearly 20 seconds of latency as I force ZFS to commit everything I send it immediately (vs flush it later), which it struggles to do at a measly average speed of 4.4 megabytes/second.

Compression-wise, I thought I’d see a big difference between the various compression schemes, but I don’t. Lzgb, lz4, and the ultra-space-saving/high-cpu-cost gzip-9 all turn in about equal results from an IOPS & performance perspective. It’s almost a wash, really, and that’s likely because of the predictable nature of the IO SQLIO is generating.

Labworks 1:Epilogue

Last point: ZFS, as Chris Wahl pointed out, is a sort of virtualization layer atop your storage. Now if you’re a virtualization guy like me or Wahl, that’s easy to grasp; Windows 2012 R2’s Storage Spaces concept is similar in function.

But sometimes in virtualization, it’s good to peel away the abstraction onion and watch what that looks like in practice. ZFS has a number of tools and monitors that look at your Zpool IO, but to really see how ZFS works, I advise you to run gstat. GStat shows what your disks are doing and if you’re carefully setting up your environment, you ought to be able to see the effects of your settings on each individual spindle.

In this Gifcam, watch ada0-6 as they struggle under load with the "Always Sync" option enabled.
In this Gifcam, watch ada0-5 (the western digitals)as they struggle under load with the “Always Sync” option enabled. Notice that the zvol/Alpha-Pool/Test2 volume (The logical volume construct) is at 100% busy and the ops/s are not very stellar.

Now look at this gstat sample. Under SQLIO-load, the zvol is showing 10,000 IOPS, 300+MB/s. But ada0-5, the physical drives, aren't doing squat.

Now look at this gstat sample. Under SQLIO-load, the zvol is showing 10,000 IOPS, 300+MB/s. But ada0-5, the physical drives, aren’t doing squat for several seconds at a time as SAN2 absorbs & processes all the IO coming at it.

That, friends, is the essence of ZFS.

 Links/Knowledge/Required Reading Used in this Post:

[table]
Resource, Author, Summary

Nexenta’s awesome whitepapers and guides, Nexenta, Find ’em and collect ’em good stuff on MPIO config and ZFS performance

Comparing SSD vs NoSSD in Nexenta w/NFS, Larry Smith, A fellow ZFS fan with more focus on NFS & VMware

Get the Most out of ZFS SSD, Sebastian “vBagpipes” Laubscher, Sebastian finds a different way to provision the ZIL/SLOG

Nexenta & Scale, Hans DeLeenHeer, Fellow #TFD delegate looks at ZFS tiers in superhero context

SLOG/ZIL Insight, FreeNAS forum, Great forum-focused post on SLOG/ZIL in BSD ZFS

SLOG Blog, Oracle, 2007 post about the ZIL & SLOG heralding storage di

 Zpool and ZIL management, Magnus Strahlert, Excellent how-to guide for ZIL/L2ARC provisioning

[/table]

 

Fail File : SAN down! SAN down! All Nodes respond

Introducing Fail File #1, where I admit to screwing something up and reflect on what I’ve learned

SAN2.daisettalabs.net, the NAS4Free server I built to simulate some of the functions I perform at work with big boy SANs, crashed last night.

Or, to put it another way, I pushed that little AMD-powered, FreeBSD-running, Broadcom-connected, ZFS-flavored franken-array to the breaking point:

Untitled picture
Love the directness of BSD. The iSCSI Target process was killed in cold blood, resulting in the death of several child partitions. What’s more, in just a few words, I have the suspect (Kernel) the motive (swap space) & the victim (iSCSI). Windows would have said, “The service terminated unexpectedly…error 0x081942ad-SOL”

 

Such are the perils of concentrated block storage, amiright? Instantly my Hyper-V Cluster Shared Volumes + the 8 or 9 VMs inside them dropped:

csvs

So what happened here?

I failed to grok the grub or fsck the fdisk or something and gave BSD an inadequate amount of swap space on the root 10GB partition slice. Then I lobbed some iSCSI packets its way from multiple sources and the kernel, starved for resources (because I’m using about 95% of my RAM for the ARC), decided to kill istgt, the iSCSI target service.

Thinking back to the winter, when I ran Nexenta -derived from Sun’s Solaris, not BSD-based- the failure sequence was different, but I’m not sure it was better.

When I was pounding the Nexenta SAN2 back in the winter, volleying 175,000+ iSCSI packets per second its way onto hardware that was even more ghetto, Nexenta did what any good human engineer does: compensate for the operator’s errors & abuses.

It was kind of neat to see. Whether I was running SQLIO simulations, an iometer run, robocopy or eseutil, or just turning on a bunch of VMs simultaneously, one by one, Nexenta services would start to drop as resources were exhausted.

First the gui (NMV it’s called). Then SSH. And finally, sometimes the console itself would lock up (NMC).

But never iSCSI, the disk subsystem, the ARC or L2ARC…those pieces never dropped.

Now to be fair, the GUI, SSH & console services never really turned back on either….you might end up with a durable storage system you couldn’t interact with at all until hard reset, but at least the LUNs stayed online.

This BSD box, in contrast, kills the most important service I’m running on it, but has the courtesy to admit to it and doesn’t make me get up out of my seat: GUI/SSH all other processes are running fine and I’ve instantly identified the problem and will engineer against it.

One model is resilient, bending but not breaking; the other is durable up to a point, and then it just snaps.

Which model is better for a given application?

Fail File Lesson #1: It’s just as important to understand how things fail as it is to understand why they fail, so that you can properly engineer against it. I never thought inadequate swap space would result in a homicidal kernel gunning for the most important service on the box…now I know.

Fresh ZFS on Hyper-V : Nexenta 4.01 CE or I have my evening planned

nexenta4

There’s been some changes in the Daisetta Lab.

Details soon, but sometimes, the fun just can’t wait on my ability to blog it.

Here’s a hint:

  • Nexenta 4.01 Community Edition LIVE
  • 2x Samsung 256GB SSD in RAID 0
  • 3x HGST 2TB in RAID 0
  • Core i5-4670k
  • 16GB RAM on the VM
  • Hyper-V 3.0 & a Legacy virtual NIC because, sadly, Nexenta doesn’t see the Hyper-V Synthetic NIC

The Sammy SSDs put in this outstanding effort last night as I was breaking down into fits of maniacal laughter:

2ssdraid0

1GB/second writes is impressive, but what has ZFS taught us through the course of Labworks 1?

The ARC is still the king.