It’s been a bit quiet here on the AC blog because I’m neck deep in thinking about storage at work. Aside from the Child Partition at home (now 14 months and beginning to speak and make his will known, love the little guy), all my bandwidth over the last three weeks has been set to Priority DSCP values and directed at reading, testing, thinking, and worrying about a storage refresh at work.
You could say I’m throwing a one-man Storage Field Day, every day, and every minute, for the last several weeks.
And finally, this week: satisfaction. Testing. Validation. Where the marketing bullshit hits the fan and splashes back onto me, or magically arrays itself into a beautiful Van Gogh on the server room wall.
Yes. I have some arrays in my shop. And some servers. And the pride of Cisco’s 2009 mid-level desktop switching line connecting them all.
Join me as I play.
My employer is a modest-sized company with a hard-on for value, so while we’ve tossed several hundred thousand dollars at incumbent in the last four years (only to be left with a terrifying upgrade to clustered incumbent OS that I’ll have to fit into an 18-hour window), I’m being given a budget about the size of a well-equipped mid-level Mercedes sedan to offset some of the risk our stank-ass old DS14MK2 shelves represent.
We’re not replacing our incumbent, we’re simply augmenting it. But with what?
After many months, there are now only two contenders left. And I racked/stacked/cabled them up last week in preparation for a grand bakeoff, a battle royale between Nimble & incumbent.
Meet the Nimble Storage CS260 array. Sixteen total drives, comprised of 12x3TB 7.2K spinners + 4x300GB SSDs, making for around 33TB raw, and depending on your compression rates, 25-50TB usable (crazy I know).
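For anyone checking the math on those capacity numbers, here is a rough back-of-the-envelope sketch in Python. It is only one plausible reading: the "around 33TB raw" figure lines up with the twelve spinners expressed in binary TiB, and treating 25TB as the uncompressed usable figure is purely an assumption made for illustration, not a vendor spec. The helper names are made up for this sketch.

```python
# Back-of-the-envelope capacity math for the drive mix described above.
TB = 10**12    # decimal terabyte, as printed on drive labels
GB = 10**9
TIB = 2**40    # binary tebibyte, as most operating systems report

hdd_bytes = 12 * 3 * TB      # 12 x 3TB 7.2K spinners
ssd_bytes = 4 * 300 * GB     # 4 x 300GB SSDs


def to_tib(n_bytes):
    """Convert a byte count to binary tebibytes."""
    return n_bytes / TIB


def effective_tb(usable_tb, compression_ratio):
    """Effective capacity for a compression ratio (2.0 means 2:1)."""
    return usable_tb * compression_ratio


print(f"spinners only:   {hdd_bytes / TB:.1f} TB = {to_tib(hdd_bytes):.1f} TiB")
print(f"spinners + SSDs: {(hdd_bytes + ssd_bytes) / TB:.1f} TB")

# ASSUMPTION: 25TB is the usable figure with no compression at all.
for ratio in (1.0, 1.5, 2.0):
    print(f"{ratio:.1f}:1 compression -> {effective_tb(25, ratio):.1f} TB effective")
```

Under that assumption, the 25-50TB range simply corresponds to compression ratios between 1:1 and 2:1, which is why the usable number swings so much with the workload.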
Nimble appeals to me for so many reasons: it’s a relatively simple, compact and extremely fast iSCSI target, just the kind of thing I crave as a virtualization engineer. The 260 sports dual controllers with 6x1GbE interfaces on each and has a simple upgrade path to 10GbE (if I ever get that), new controllers and more. On the downside, the controllers are Active/Passive, there’s no native support for M/CS (but plenty for MPIO) and, well, it doesn’t have a big blocky N logo on the front, which is a barrier to entry because who ever got fired for buying the blue N?
On top of the Nimble is a relatively odd array: an incumbent (incumbent now) incumbent array with 2x800GB SanDisk Enterprise SSDs & 10x1TB 7200 RPM spinning disks. This guy sports dual controllers as well, 2x1GbE iSCSI interfaces & 2x1GbE MGMT interfaces per controller and something like 9TB usable, but each volume you create can have its own RAID policy. Oh, did I mention it’s got an operating system very different from good old Data ONTAP?
So that’s what I got. When you’re a SME with a very limited budget but a very tight and Microsoft-oriented stack, your options are limited.
Anyway, on to the fun and glory at 30,000 IOPS
Here’s the bakeoff setup, meant to duplicate as closely as possible my production stack. Yes it’s a pretty pathetic setup, but I’m doing what I can here with what I got:
Nimble v. incumbent Bakeoff
- 1x Cisco Catalyst 2960s with IOS
- 1x 2011 Mac Pro tower with 2x Xeon 5740, 16GB RAM, and 2xGbE
- 1x Dell PowerEdge 1950 with old-ass Xeon (x1), 16GB RAM, and 2xGbE
- 1x Dell PowerEdge R900, 4x Xeon 5740, 128GB RAM, 4xGbE
- OS: Server 2012 R2
- Hypervisor: Hyper-V 3.0
- VMs: 2012 R2
- NICs: I adopted the Converged Fabric architecture that’s worked out really well for us in our datacenter, only instead of clicking through and building out vSwitches in System Center VMM, I did it in PowerShell without System Center. So I essentially have this:
- pServer 1 & 2: 1 LACP team (2x1GbE) with converged virtual switch and five virtual NICs (each tagged for appropriate VLANs) on the management OS
- pServer3: 1 LACP team (4x1GbE) on this R900 box, which is actually a production Hyper-V server at our HQ. So pServer3 is not a member of my Hyper-V Cluster, but just a simple host with a 4Gb teamed interface and a f(#$*(@ iSCSI vswitch on top (yes, yes, I know, don’t team iSCSI they say, but haven’t you ever wanted to?)
- All the virtual switch performance customization you can shake a stick at. Seriously, I need to push some packets. And I want angry packets, jacked up on PCP, ready to fight the cops. I want to break that switch, make smoke come out of it even. The Nimble & incumbent sport better CPUs than any of these physical servers so I looked for every optimization on virtual & physical switches
- Cisco Switch: Left half is Nimble, right half is incumbent, host teams are divided between the two sides with -hopefully- equal amounts of buffer memory, ASIC processing power etc. All ports trunked save for the iSCSI ports. incumbent is on VLAN 662, Nimble is on VLAN 661. One uplink to my MDF switches.
- VM Fleet: Seven total (so far) with between 2GB and 12GB RAM, 2-16 vCPUs and several teams of teams. Most virtual machines have virtual NICs attached to both incumbent & Nimble VLANs
- Volumes: 10 on each array. 2x CSV, 4xSQL, and 4xRDM (raw disk maps, general purpose iSCSI drives intended for virtual machines). All volumes equal in size. The incumbent, as I’m learning, requires a bit more forethought into setting it up, so I’ve dedicated the 2x800GB SSDs as SSD cache across a disk pool, which encompasses every spinner in the array
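On the "don’t team iSCSI" advice mentioned in the setup above: the usual argument is that a LACP team pins each TCP flow to a single member link, so one iSCSI session never goes faster than one 1GbE port no matter how wide the team is. Here is a minimal Python sketch of that behavior. The CRC32 hash is a stand-in (real switches and teaming drivers use their own vendor-specific hashes), and the IP addresses and source ports are hypothetical.

```python
import zlib
from collections import Counter

LINK_GBPS = 1.0      # each team member is a 1GbE port
ISCSI_PORT = 3260    # standard iSCSI target port


def pick_link(src_ip, src_port, dst_ip, dst_port, n_links):
    """Map one TCP flow onto one team member (illustrative CRC32 hash)."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % n_links


def spread(flows, n_links):
    """Count how many flows land on each member link."""
    return Counter(pick_link(*flow, n_links) for flow in flows)


# Hypothetical addresses: one host talking to one array portal.
one_session = [("10.66.1.10", 50000, "10.66.1.50", ISCSI_PORT)]
many_sessions = [
    ("10.66.1.10", 50000 + i, "10.66.1.50", ISCSI_PORT) for i in range(8)
]

for name, flows in (("1 session", one_session), ("8 sessions", many_sessions)):
    used = spread(flows, n_links=4)
    print(f"{name}: {len(used)} of 4 links carry traffic, "
          f"per-flow ceiling {LINK_GBPS:.0f} Gb/s")
```

A single session always lands on exactly one link, which is the case MPIO handles better by opening multiple sessions on purpose. With several sessions the hash may or may not spread them evenly, which is the other half of the complaint.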
I wish I could post a grand, sweeping and well-considered benchmark routine à la AnandTech, but to be honest, this is my first real storage bakeoff in years, and I’m still working on the nice and neat Excel file with results. I can do a follow-up later, but so far, here are the tools I’m using and the concepts I’m trying to test/prove:
- SQLIO: Intended to mimic as closely as possible our production SQL servers, workloads & volumes
- IOMETER: same as above plus intended to mimic terminal services login storms
- Robocopy: Intended to break my switch
- Several other things I suffer now in my production stack
- Letting the DBA have his way with a VM and several volumes as well
All these are being performed simultaneously. So one physical host will be robocopying 2 terabytes of ISO files to a virtual machine parked inside a CSV on the Nimble, the same CSV that holds another VM running a mad SQLIO test against a Nimble RDM. You get the idea. Basically everything you always wanted to do on your production SAN but couldn’t.
So far, from the Nimble, I’m routinely sustaining 20,000 IOPS with four or five tests going on simultaneously (occasionally I toss in an ATTO 2GB random throughput test just for shits, grins, and drama) and sometimes peaking at 30,000 IOPS.
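To put those IOPS figures in wire terms, here is a small Python sketch that converts IOPS at a given block size into Gb/s and into a count of saturated 1GbE ports. The block sizes are hypothetical (the real mix across SQLIO, IOMETER and robocopy is not broken out here), and protocol overhead is ignored, so treat the output as a rough lower bound on link usage.

```python
import math

LINK_GBPS = 1.0   # one 1GbE port, ignoring iSCSI/TCP/Ethernet overhead


def gbits_per_sec(iops, block_kib):
    """Payload bandwidth in Gb/s for a given IOPS rate and block size in KiB."""
    return iops * block_kib * 1024 * 8 / 1e9


def links_needed(iops, block_kib, link_gbps=LINK_GBPS):
    """Smallest whole number of links that could carry that payload."""
    return math.ceil(gbits_per_sec(iops, block_kib) / link_gbps)


for iops in (20_000, 30_000):
    for block_kib in (4, 8, 64):   # hypothetical block sizes
        print(f"{iops:>6} IOPS @ {block_kib:>2} KiB: "
              f"{gbits_per_sec(iops, block_kib):5.2f} Gb/s, "
              f"{links_needed(iops, block_kib)} x 1GbE")
```

At 64 KiB the payload alone would need more than the six 1GbE ports on a controller, so a sustained 20,000 IOPS over this fabric most likely means the mix is dominated by small blocks.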
The Nimble isn’t missing a beat:
What else can we throw at this thing? ATTO, the totally non-predictive, non-enterprise storage benchmarking application! I ran this Saturday night in the midst of 2 SQLIO runs, one IOMETER SQL-oriented run, and two robocopies.
So yeah. The Nimble is taking all that my misfit army can throw at it (Mac hardware, a PowerEdge 1950 that practically begs us to send it to a landfill in rural China every time I power it on, and a heavyweight R900 whose glory days were last decade), and it’s laughing at me.
Choke on this I say.
Please sir may I have another? the Nimble responds.
So what did we do? What any mid-career, sufficiently caustic and overly-cynical IT Pro would do in this situation: yank drives. Under load. 2xSSD and 1xHDD to be specific.
And then pull the patch cables out of the active controller.
Take that Nimble! How you like me now and so forth.
And lo, what does the Nimble do?
Behold the 3U wonder box that you can set up in an afternoon, that sustains 25-30,000 IOPS, draws about 5.2 amps, and yet doesn’t lose a single one of my VMs after my boss violently and hysterically starts pulling shit out of its handsome, SuperMicro-built enclosure.
Sure some of the SQLIO results paused for about 35-40 seconds. And I still prefer M/CS over MPIO. But I can’t argue with the results. I didn’t lose a VM in a Nimble CSV. I dropped only one or two pings during the handover, and IO resumed post-gleeful drive pulls.
I mean, this is crazy, right? There are only 16 drives in there, 12 of which spin. I can feel the skepticism in you right now... there’s no replacement for displacement, right? Give me spindle count or give me death. My RAID DP costs me a ton of spindles, but that’s the way God intended it, you’re thinking.
So in the end (incumbent tests forthcoming), what I/we really have to choose is whether to believe the Nimble magic.
I’m sold on it and want that array in my datacenter post-haste. Sure, it’s not a Filer. I’ll never host a native SMB 3.0 share on it. I’ll miss my breakfast confection command line (Nimble CLI feels like BusyBox by the way, but can’t confirm), but I’ll have CASL to play with. I can even divvy out some “aggressive” cache policies to my favorite developer guys and/or my most painful, highest-cost user workloads.
As far as the business goes? From my seat, it’s the smart bet. The Nimble performs extremely well for our workloads and is cost effective.
For a year now I’ve been reading Nimble praise in my Enterprise IT feed. Neat to see that, for once, reality measured up to the hype.
More on my bakeoff at work and storage evolution at home later this week. Cheers.
Editor note: This post has been edited and certain information removed since its original posting on Jan 21.