As you’ll recall from part 1, much of my time at work lately has been consumed by planning, testing and executing mass Storage Live Migration of 65+ .vhdx files from our old filer (built by a company that rhymes with PetTap) & its end-of-life 7200 RPM shelves to our new hotness, a Nimble CS260.
Essentially I’ve been a sort of IT Moses, planning the Exodus of my .vhdxs out of harsh, intolerably slow conditions, through some hazards & risks (Storage Migration during production), and into the promised land of speed, 1-5ms latency (rather than 20ms+!!), and user happiness.
Now VMware guys have a ton of really awesome tools to vMotion their .vmdks around their vCenters & vSpheres and their ESXis and now their VSANS too. They can tune their NFS for ludicrous speed and their Distributed Switches, now with Extra Power LACP support, can speak CDP & even tell you what physical port the SFP+ 10GbE adapter is plugged into.
Do I sound envious? Green with it even? Well I am.
In comparison, I got me some System Center Virtual Machine Manager (2012 SP1), Microsoft Failover Cluster mmc snap-in, a 6509e with two awesome x6748 performance blades (February 2008’s centerpiece switch mod in the popular “Hot Switch Racks” pin-up calendar), Hyper-Vs converged fabric design, 8x1GbE LACP teams, Manage Engine’s NetFlow, boat-loads of enthusiasm and a git ‘r done attitude.
And this is what I’ve got to git done:
And it has to be done with zero downtime because we have a 24/6 operational tempo, and I like my Saturdays.
One of my main worries as I’ve tried to quarterback this transition has been the switch. Recall from part 1 how I’m oversubscribed to hell & back on my two 6748s:
I fear the harsh judgment of my networking peers (You’re doing that with that?!?!) so let me just get it out there: yes, I’m essentially using my 6509 & these two blades as a bus for storage IO. In fact, iSCSI traffic accounts for about 90% of all traffic on the switch in any given 24 hour period:
Perhaps I’m doing things with this switch & with iSCSI that no sane engineer would, but I have to say, this has proven to be pretty durable and adequate as far as performance goes. Would I like some refreshing multi-channel SMB 3 file storage, some relief from the block IO blues & Microsoft clustering headaches? Yes of course, but I’ve got to shepherd the VMs to their new home first.
And to do that, I’ve got to master what this switch is doing on an hour by hour basis as my users log in around the clock.
So I pulled some Netflow data together, cracked my knuckles, and got busy with Excel and Pivot tables.
I’m glad I went through this exercise because it changed the .vhdx parade route & timing. What I thought was the busiest part of my infrastructure’s day was wrong, by a large factor. Here’s 8 days worth of Netflow traffic on the iSCSI & Live Migration VLANs, averaged out. Few live migrations were made during this period:
What you see here are the three login storms (Times on the graph are MST, they start early down under) as my EU, North America & Australia/New Zealand users login to their session virtualization VMs or hit the production SQL databases or run their reports.
I always thought EU punched my stack the hardest; our offices there have as many employees as North America, but only one or two time zones rather than three in North America.
But EU & North America and Australia combined don’t hit my switch fabric as hard as I do. Yes, the monkey on my back is…me. Well, me & the DBA and his incurable devotion to SQL backups in the evening. My crutch is DPM.
I won’t go into too much detail here but this was pretty surprising. At times over the eight days, Netflow would record more than 1 billion packets traversing the switch in one evening hour; the peak “payload” was north of 1 terabyte of iSCSI bytes/hour on some days!
Now I’m not a networking guy (though I do love Wifi experimenting), but what I saw here concerned me, gave me pause. Between the switch blades, I’ve supposedly got a 40 Gigabit/s backplane, to my Supervisor 720 modules, but is that real 40Gbit/s or marketing 40Gbit/s?
The key question: Am I stressing this 6509e, or does it have more to give?
Show fabric utilization detail said I was consuming only 20% of the switch fabric during my exploratory storage migrations, and that was at peak. 4 gigabit/second per port group.
Is that all you got? the 6509e seemingly taunted me.
But oh my stars, better check the buffers:
ACK! One dropped buffer call or whatever you call it, 7 weeks ago, way before I had the Nimble in place. Still….that’s one drop too many for me.
Stop pushing me so hard, the 6509e pleaded with me.
So I did what any self-respecting & fearful admin would do: call TAC. Show me the way home TAC, get me out of the fix I’m in or at least sooth my worry. Tell me I have nothing to worry about, or tell me I need to buy a Supe 2T to do what I want to do, just give me some certainty!
A few show tech supports, one webex session and one telephonic call with a nice engineer from Costa Rica later, I had some certainty. The config was sound, the single buffer drop was concerning but wasn’t repeating, even when I thought I was stressing the switch.
And I didn’t need to buy a Supe 2T.
On to the Exodus/.vhdx parade.
In all my fretting about the switch, I was forgetting one thing: the feeble filer is old and slow and can’t push that much IO to the Nimble in the first place.
As best I can figure it, I can do about five storage live migrations simultaneously. Beyond that, I fear that luns will drop on the filer.
To the Nimble, it’s no sweat:
Netflow’s view from the same period:
Love it when a plan comes together! I should have this complete within a few days, then I can safely upgrade my filer.