Storage Glory at 30,000 IOPS

It’s been a bit quiet here on the AC blog because I’m neck deep in thinking about storage at work. Aside from the Child Partition at home (now 14 months and beginning to speak and make his will known, love the little guy), all my bandwidth over the last three weeks has been set to Priority DSCP values and directed at reading, testing, thinking, and worrying about a storage refresh at work.

You could say I’m throwing a one-man Storage Field Day, every day, and every minute, for the last several weeks.

And finally, this week: satisfaction. Testing. Validation. Where the marketing bullshit hits the fan and splashes back onto me, or magically arrays itself into a beautiful Van Gogh on the server room wall.

Yes. I have some arrays in my shop. And some servers. And the pride of Cisco’s 2009 mid-level desktop switching line connecting them all.

Join me as I play.

My employer is a modest-sized company with a hard-on for value, so while we’ve tossed several hundred thousand dollars at incumbent in the last four years (only to be left with a terrifying upgrade to clustered incumbent OS that I’ll have to fit into an 18 hour window), I’m being given a budget of well-equipped mid-level Mercedes sedan to offset some of the risk our stank-ass old DS14MK2 shelves represent.

We’re not replacing our incumbent, we’re simply augmenting it. But with what?

After many months, there are now only two contenders left. And I racked/stacked/cabled them up last week in preparation for a grand bakeoff, a battle royale between Nimble & incumbent.

Meet the Nimble Storage CS260 array. Sixteen total drives, comprised of 12x3TB 7.2K spinners + 4x300GB SSDs, making for around 33TB raw, and depending on your compression rates, 25-50TB usable (crazy I know).
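A quick sanity check on that raw number. This is only a sketch, and it assumes "around 33TB raw" means the twelve spinners counted in binary units (TiB), with the SSDs treated as cache rather than capacity. That reading is an assumption, not something from a Nimble spec sheet.

```python
# Raw capacity of the CS260 drive complement described above.
SPINNERS = 12        # from the post
SPINNER_TB = 3       # decimal terabytes per spinner, from the post
SSDS = 4             # from the post
SSD_GB = 300         # decimal gigabytes per SSD, from the post


def tb_to_tib(tb: float) -> float:
    """Convert decimal terabytes (10**12 bytes) to binary tebibytes (2**40 bytes)."""
    return tb * 10**12 / 2**40


hdd_raw_tb = SPINNERS * SPINNER_TB      # spinning tier, decimal TB
hdd_raw_tib = tb_to_tib(hdd_raw_tb)     # same capacity in binary units
ssd_raw_tb = SSDS * SSD_GB / 1000       # flash tier, decimal TB

print(f"Spinners: {hdd_raw_tb} TB decimal, {hdd_raw_tib:.1f} TiB binary")
print(f"SSDs:     {ssd_raw_tb:.1f} TB decimal")
```

The 25-50TB usable range then depends entirely on the compression ratio your data actually achieves, which this sketch does not try to model.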

Nimble appeals to me for so many reasons: it’s a relatively simple, compact and extremely fast iSCSI target, just the kind of thing I crave as a virtualization engineer. The 260 sports dual controllers with 6x1GbE interfaces on each, and has a simple upgrade path to 10GbE if I ever get that, new controllers and more. On the downside, the controllers are Active/Passive, there’s no native support for M/CS (but plenty for MPIO) and, well, it doesn’t have a big blocky N logo on the front, which is a barrier to entry because who ever got fired for buying the blue N?

On top of the Nimble is a relatively odd array: an incumbent (incumbent now) incumbent array with 2x800GB SanDisk Enterprise SSDs & 10x1TB 7200 RPM spinning disks. This guy sports dual controllers as well, 2x1GbE iSCSI interfaces & 2x1GbE MGMT interfaces per controller and something like 9TB usable, but each volume you create can have its own RAID policy. Oh, did I mention it’s got an operating system very different from good old Data ONTAP?

So that’s what I got. When you’re a SME with a very limited budget but a very tight and Microsoft-oriented stack, your options are limited.

Anyway, on to the fun and glory at 30,000 IOPS.

Here’s the bakeoff setup, meant to duplicate as closely as possible my production stack. Yes it’s a pretty pathetic setup, but I’m doing what I can here with what I got:

Nimble v. incumbent Bakeoff

  • 1x Cisco Catalyst 2960s with IOS
  • 1x 2011 Mac Pro tower with 2x Xeon 5740, 16GB RAM, and 2xGbE
  • 1x Dell PowerEdge 1950 with old-ass Xeon (x1), 16GB RAM, and 2xGbE
  • 1x Dell PowerEdge R900, 4x Xeon 5740, 128GB RAM, 4xGbE
  • OS: Server 2012 R2
  • Hypervisor: Hyper-V 3.0
  • VMs: 2012 R2
  • NICs: I adopted the Converged Fabric architecture that’s worked out really well for us in our datacenter, only instead of clicking through and building out vSwitches in System Center VMM, I did it in PowerShell without System Center. So I essentially have this:
    • pServer 1 & 2: 1 LACP team (2x1GbE) with converged virtual switch and five virtual NICs (each tagged for appropriate VLANs) on the management OS
    • pServer3: 1 LACP team (4x1GbE) on this R900 box, which is actually a production Hyper-V server at our HQ. So pServer3 is not a member of my Hyper-V Cluster, but just a simple host with a 4Gb teamed interface and a f(#$*(@ iSCSI vswitch on top (yes, yes, I know, don’t team iSCSI they say, but haven’t you ever wanted to?)
    • All the virtual switch performance customization you can shake a stick at. Seriously, I need to push some packets. And I want angry packets, jacked up on PCP, ready to fight the cops. I want to break that switch, make smoke come out of it even. The Nimble & incumbent sport better CPUs than any of these physical servers so I looked for every optimization on virtual & physical switches
  • Cisco Switch: Left half is Nimble, right half is incumbent, host teams are divided between the two sides with -hopefully- equal amounts of buffer memory, ASIC processing power etc. All ports trunked save for the iSCSI ports. incumbent is on VLAN 662, Nimble is on VLAN 661. One uplink to my MDF switches.
  • VM Fleet: Seven total (so far) with between 2GB and 12GB RAM, 2-16 vCPUs and several teams of teams. Most virtual machines have virtual NICs attached to both incumbent & Nimble VLANs
  • Volumes: 10 on each array. 2x CSV, 4xSQL, and 4xRDM (raw disk maps, general purpose iSCSI drives intended for virtual machines). All volumes equal in size. The incumbent, as I’m learning, requires a bit more forethought into setting it up, so I’ve dedicated the 2x800GB SSDs as SSD cache across a disk pool, which encompasses every spinner in the array

The tests:

I wish I could post a grand, sweeping and well-considered benchmark routine à la AnandTech, but to be honest, this is my first real storage bakeoff in years, and I’m still working on the nice and neat Excel file with results. I can do a follow-up later, but so far, here are the tools I’m using and the concepts I’m trying to test/prove:

  • SQLIO: Intended to mimic as closely as possible our production SQL servers, workloads & volumes
  • IOMETER: same as above plus intended to mimic terminal services login storms
  • Robocopy: Intended to break my switch
  • Several other things I suffer now in my production stack
  • Letting the DBA have his way with a VM and several volumes as well

All of these are being run simultaneously. So one physical host will be robocopying 2 terabytes of ISO files to a virtual machine parked in a CSV on the Nimble, the same CSV as another VM that is running a mad SQLIO test on a Nimble RDM. You get the idea. Basically everything you always wanted to do on your production SAN but couldn’t.

So far, from the Nimble, I’m routinely sustaining 20,000 IOPS with four or five tests going on simultaneously (occasionally I toss in an ATTO 2GB random throughput test just for shits, grins, and drama) and sometimes peaking at 30,000 IOPS.
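To put those IOPS numbers in wire terms, here is a rough sketch of the throughput they imply on 1GbE. The I/O sizes are hypothetical (the post does not state the block sizes in play), protocol overhead is ignored, and the 6 Gb/s ceiling simply assumes all six 1GbE ports on the active controller carry iSCSI traffic.

```python
def throughput_gbps(iops: int, block_kib: int) -> float:
    """Payload throughput in Gb/s for a given IOPS rate and I/O size in KiB."""
    bytes_per_second = iops * block_kib * 1024
    return bytes_per_second * 8 / 1e9


# 6x1GbE on the active controller, per the post; assumes every port carries data.
ACTIVE_CONTROLLER_GBPS = 6 * 1.0

for iops in (20_000, 30_000):          # sustained and peak figures from the post
    for block_kib in (4, 8, 64):       # hypothetical I/O sizes, not from the post
        gbps = throughput_gbps(iops, block_kib)
        verdict = "fits" if gbps <= ACTIVE_CONTROLLER_GBPS else "exceeds the links"
        print(f"{iops:>6} IOPS at {block_kib:>2} KiB = {gbps:5.2f} Gb/s ({verdict})")
```

The takeaway is only directional: small-block workloads can post big IOPS numbers well within gigabit links, while the same IOPS rate at large block sizes would not physically fit.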

The Nimble isn’t missing a beat:


What else can we throw at this thing? ATTO, the totally non-predictive, non-enterprise storage benchmarking application! I ran this Saturday night in the midst of 2 SQLIO runs, one IOMETER SQL-oriented run, and two robocopies.

So yeah. The Nimble is taking everything that my misfit army of Mac hardware, a PowerEdge 1950 that practically begs us to send it to a landfill in rural China every time I power it on, and a heavyweight R900 whose glory days were last decade can throw at it, and it’s laughing at me.

Choke on this I say.

Please sir may I have another? the Nimble responds.

So what did we do? What any mid-career, sufficiently caustic and overly-cynical IT Pro would do in this situation: yank drives. Under load. 2xSSD and 1xHDD to be specific.

And then pull the patch cables out of the active controller.

Take that Nimble!  How you like me now and so forth.

Results:

And lo, what does the Nimble do?

Behold the 3U wonder box that you can set up in an afternoon, that sustains 25-30,000 IOPS, draws about 5.2 amps, and yet doesn’t lose a single one of my VMs after my boss violently and hysterically starts pulling shit out of its handsome, SuperMicro-built enclosure.

Sure some of the SQLIO results paused for about 35-40 seconds. And I still prefer M/CS over MPIO. But I can’t argue with the results. I didn’t lose a VM in a Nimble CSV. I dropped only one or two pings during the handover, and IO resumed post-gleeful drive pulls.
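If you want to put a number on a pause like that rather than eyeball it, one approach is to log I/O completion timestamps during the pull and look for the biggest gap. The sketch below is illustrative only: `longest_stall` is a hypothetical helper and the sample timestamps are made up, not data from this bakeoff.

```python
from typing import Sequence


def longest_stall(completions: Sequence[float]) -> float:
    """Return the largest gap, in seconds, between consecutive I/O completions."""
    if len(completions) < 2:
        return 0.0
    ordered = sorted(completions)
    # Pair each timestamp with its successor and keep the widest gap.
    return max(later - earlier for earlier, later in zip(ordered, ordered[1:]))


# Made-up completion times (seconds since test start) with a failover in the middle.
sample = [0.0, 0.5, 1.0, 38.5, 39.0]
print(f"Longest stall: {longest_stall(sample):.1f} s")
```

Comparing that figure against your iSCSI and guest disk timeouts is what tells you whether a VM would have survived the handover.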

Storage Glory.

I mean this is crazy right? There’s only 16 drives in there. 12 of which spin. I can feel the skepticism in you right now….there’s no replacement for displacement right? Give me spindle count or give me death. My RAID DP costs me a ton of spindles, but that’s the way God intended it, you’re thinking.
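The skepticism is easy to put into numbers. This back-of-the-envelope sketch assumes the common rule of thumb of roughly 75-100 random IOPS per 7.2K spindle. That per-drive figure is an assumption for illustration, not a measurement from this array.

```python
import math

SPINNERS = 12                      # spinning drives in the array, from the post
OBSERVED_PEAK_IOPS = 30_000        # peak figure from the post
IOPS_PER_SPINDLE = (75, 100)       # rule-of-thumb range for 7.2K drives (assumption)


def spindles_for(iops: int, per_spindle: int) -> int:
    """Spindles needed to serve `iops` purely from disk at `per_spindle` each."""
    return math.ceil(iops / per_spindle)


low, high = (SPINNERS * rate for rate in IOPS_PER_SPINDLE)
print(f"12 spinners alone: roughly {low}-{high} random IOPS")
for rate in IOPS_PER_SPINDLE:
    needed = spindles_for(OBSERVED_PEAK_IOPS, rate)
    print(f"At {rate} IOPS per spindle, {OBSERVED_PEAK_IOPS} IOPS needs {needed} drives")
```

Under those assumptions the spinners alone account for a small fraction of what the array delivered, so the rest has to come from the flash and the way the array lays out writes.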

So in the end (incumbent tests forthcoming), what I/we really have to choose is whether to believe the Nimble magic.

I’m sold on it and want that array in my datacenter post-haste. Sure, it’s not a Filer. I’ll never host a native SMB 3.0 share on it. I’ll miss my breakfast confection command line (Nimble CLI feels like BusyBox by the way, but I can’t confirm), but I’ll have CASL to play with. I can even divvy out some “aggressive” cache policies to my favorite developer guys and/or my most painful, highest-cost user workloads.

As far as the business goes? From my seat, it’s the smart bet. The Nimble performs extremely well for our workloads and is cost effective.

For a year now I’ve been reading Nimble praise in my Enterprise IT feed. Neat to see that, for once, reality measured up to the hype.

More on my bakeoff at work and storage evolution at home later this week. Cheers.

 

Editor note: This post has been edited and certain information removed since its original posting on Jan 21.

16 thoughts on “Storage Glory at 30,000 IOPS”

  1. What happens when your working set grows larger than the SSDs?

    Also, have you considered the poor reliability of commodity hardware, # of 9s support?

    App Integration?


    • What happens when your working set grows larger than the SSDs?

      Performance craters? Everything goes to HDD? Is that what you’re getting at?

      Also, have you considered the poor reliability of commodity hardware, # of 9s support?

      Of course. The drives are enterprise. Chassis seems solid enough. The two controllers are built well. Then again, I’ve never thought of “It’s a Supermicro!” as a particularly good insult or slam. I’d rather have the R&D poured into file system, storage, etc. than into the 3U chassis, wouldn’t you?

      As far as reliability: I could tell you tales on the NetApp side. Sure, who doesn’t love receiving a replacement drive thanks to ASUP before they knew there was a problem? But InfoSight is getting some rave reviews and after digging into it for a week, I find Nimble asup to be at least equal to NetApp, and InfoSight to be a bit better than My AutoSupport.

      App integration: System Center. It’s lacking, currently. But for the price?

      We’ve replaced most NetApp software products with Microsoft equivalents. Sadly SnapManager for Hyper-V never measured up to the equivalent VMware products.


  2. Jeff we just moved from NetApp FAS3040s to a Nimble solution also, a pair of CS260G arrays with 3 shelves. The stuff is amazing. Replaced literally a rack and a half of storage and generates at minimum 3x the performance (not that we need it). We played the pricing game with NetApp for awhile also. They started at nearly double the cost of our Nimble solution. But seemed like they were scared because they dropped the price on 3 occasions all the way down to what they said was a ‘price match’ not even knowing what our solution from Nimble was going to cost. I think the blood is in the water.

    As for the response above I have a feeling that was a NetApp person, because that’s the exact same commentary we got on 2-3 occasions going down this road. Supermicro generic chassis… not enterprise grade hardware… we develop our own firmware for the drives… Nimble doesn’t have 5 9’s certification (by the way there’s a # of 9s ‘certification’?) It really got to the point to where it was just obvious they were on the attack the whole time. Which in the end helped drive us away.

    Good choice, we love the simplicity and we love the capabilities. Not to mention the future outlook for the company considering they went public right after we made our purchase.


  3. I currently administer a couple of NetApp filers (entry-level FAS2xxx series), one of which is due for support renewal in a couple of months. They are pretty full and although they have performed well, the cost of additional disk trays is eye watering.

    So I’ve read up on Nimble storage and I’m extremely impressed with what it provides at a decent price.

    The plan for us is to buy Nimble instead of adding an additional NetApp shelf. It will be a big performance improvement as well.


  4. Wow. This really is the ‘battle royale’ to compare performance of a hybrid array with an end of life array with no cache. Now back to looking for bake off between 10GbE and 10 Base-2…


    • I’ve driven an entry level 4 node Nutanix cluster at 60,000 IOPS and 2GBps, and on the higher models you are looking at 150,000 IOPS and 6GBps, and when you want to expand and add a node it adds a controller so you don’t bottleneck the storage.


  5. There are also other added benefits of Nimble, built in snapshot backups/replication, instant snapshots, no snapshot reserve requirements, zero copy cloning of LUNs/snapshots, OS upgrades without downtime, etc etc.

    I personally have very little experience with storage vendors, and I was highly skeptical, but we’ve had this system for well over a year without a single drive failure or outage of any sort.

    Don’t get me wrong, it has some challenges but this is truly a phenomenal technology that has to be seen to be believed.


  6. I thought that the active / standby controller configuration was better than active / active. In active / standby, I can drive the array to 100% and not worry about what happens in a failover event. On a NetApp active / active setup, when you’re driving any controller over 50% you run the risk of degradation … Near 80% on each and failover becomes pointless as latency becomes so high you may as well switch the thing off. Where are the NetApp tools to help you understand when you’re at risk? It seems to me that there are none and you have to revert to perfstats and analysis for an opinion on the matter.

    In addition, active / standby means you’re not splitting the available disk between two nodes. You’ve saved yourself two parity disks and possibly another spare. At 1 to 3 TB a pop, that’s a hell of a lot of capacity.


  7. All I can say is wow. Used a single Nimble to replace three old and new Equallogic units (running 50+ VMs). Performance is stellar, and according to the unit’s stats, it’s barely being taxed.


    • We have approximately 162 VMs (including multiple SQL and Exchange servers) running on 7 ESXi 5.5 hosts, using a single Nimble 460G and 2 expansion shelves. About the only performance hit we’re taking is on our SSD cache, which is now close to needing to be expanded (though we have everything being cached currently and I’m sure we can pull a significant load off, e.g. SQL/Exchange logs).

