Defending IT amidst the novel WannaCry worm

It’s been a hell of a few days here in the trenches of Information Technology in 2017. Where to begin?

Between explaining how this all works to concerned friends & family, answering my employer’s questions about our patching posture & status, and reading the news & analysis, I think it’s safe to say that WCry has been in my thoughts for every one of the last 72 hours, including the 24 hours of Mother’s Day and all the hours I spent in restless slumber.

Yes, that's right. WCry was on my mind even as I celebrated Mother's Day for the three women I'm close to in my life who are mothers. Wow. Just wow.

Having had the chance to catch my breath, I've got some informed observations about this global incident from my perspective as an IT Pro. Why is WCry as interesting & novel as it is potent and effective in 2017? And is there any defense one might make of an IT team whose organization got pwned by WCry?

I contemplate both questions below.

WCry successfully chains a social engineering attack with a technical exploit resulting in automated organization pwnage
WCry begins as a social engineering/phishing attack on users in the place they love and hate in equal measure: their Inbox. Using Subject lines that draw the eye, the messages include malicious attachments. This facet of WCry is not new, of course; it's been routine in IT for at least two decades.

How WannaCry works

Once the attachment is clicked, WCry pivots, unleashing an NSA-built cyberweapon upon the enterprise by scanning port 445 across the local /24, cycling through cached RDP accounts and calling special attention to SQL & Exchange services, presumably to price the ransom accordingly.

Then it encrypts. Nearly everything.

All of this from a single email opened by a gullible user.

This behavior -socially engineered attack on human meatbag + scan + pivot to the rest of the network- is also not novel, new or remarkable. In fact, security pros call this behavior "moving laterally" through an enterprise, and they usually talk about it being done from a "jump box" or "beachhead" that's been compromised via social engineering. Typically, security pros will reserve those terms to describe the behavior of a skilled & hostile hacker meatbag intent on pwning a targeted organization.

Where WCry is novel is that it in effect automates the hacker out of the picture, making the whole org pwnage process way more efficient. This is organization-crippling, self-replicating malware at scale. Think Sony Pictures 2014, applied everywhere automatically, minus the North Korean hacker units at the keyboard.

 

The red WCry "Ooops" message is both informative and visually impressive, which multiplies its influence beyond its victims
As these things go, I couldn't help but be impressed with WCry's incredibly detailed and anxiety-inducing UI announcing a host's infection:

This image, or some variant thereof, has appeared on everything from train station arrival/departure boards to manufacturing floor PCs to hospital MRIs to good old-fashioned desktop PCs in Russia's Interior Ministry. The psychological effects of seeing this image on infected hardware, then seeing it again on popular social media sites, the evening news, and newspapers around the world over the last few days are hard to determine, but I know this: it had an effect on normal consumers and users of technology across the globe. Sitting on my lap Saturday, my four-year-old saw the image in my personal OneNote pastebin and asked me, "Daddy, is that an alarm? Why does it show a lock? Do you have key?"

What's interesting is that while computer users saw this or a screensaver version of this image, in reality you could click past it or minimize it in some way. Yet images of this application have proliferated on Twitter, FaceTube and elsewhere. Ransomware used to just announce itself in the root of your file share or your c:\users\username\documents folder: now it poses for screen caps and cell phone pics, which multiplies its effectiveness as a PsyOps weapon. By Saturday I was reading multiple articles in my iPad's Apple News about how regular people could protect themselves from the 'global cyberattack.'

Its function is not just about encrypting file shares like earlier ransomware campaigns, but about owning Enterprises
If my organization or any organization I was advising got hit by WCry, my gut feeling is that I wouldn't feel secure about my Forest/Domain integrity until I burned it down and started over. Why? Well, big IT security organizations like Verizon's Enterprise Security group typically don't classify ransomware as a 'data breach' event. Yet, as we know, WCry installs the DoublePulsar backdoor, which enables persistent access in the future. This feels like a very effective escalation of what it means to be ransomed in modern IT organizations, so yeah, I wouldn't feel secure until our forest/domain was burned to the ground.

It is the manifestation of a Snoverism: Today's nation-state cyberweapon is tomorrow's script-kiddie attack
I was listening to Jeff Snover, the father of PowerShell, once, and he implanted yet another Snoverism in my brain. He said, paraphrasing here, that today's nation-state attack is tomorrow's script-kiddie attack. What the what?

Jeff Snover, speaker of wisdom

Let's unpack: the democratization of technology, the shift to agile, DevOps, and other development disciplines, along with infrastructure automation, has led to a lot of great things being developed, released and consumed by users very quickly. In the consumer world this has been great -Alexa is always improving with new skills, Apple can release security patches rapidly, and FaceTube can instantly perform A/B testing on billions of people simultaneously. But not well understood by many is the fact that enterprises and even individuals can harness these tools and techniques to instantly build and operate data systems globally, to get their product, whatever it may be, to market faster. The classic example of this is Shadow IT, wherein someone in your finance team purchases a few seats on Salesforce to get around the slow & plodding IT team.

I think Snover was observing that bad guys get the same benefits from modern technology techniques & the cloud as consumers and business users do.

And as I write this on Monday, what are we seeing? WCry is posted on GitHub and new variants are being created without the kill-switch/sandbox detection domain. EternalBlue, the component of WCry that exploits SMBv1, was just a few months ago a specialized tool in the NSA's cyber weapons arsenal. By tomorrow it will be available to any kid who wants it, or, even worse, as a push-button, turn-key service anybody can employ against anybody else.
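On the defensive side, at least checking for that attack surface is mercifully short. A hedged sketch, assuming Server 2012 or later where the SMB cmdlets exist:

# Is the SMBv1 attack surface EternalBlue needs even present on this box?
Get-SmbServerConfiguration | Select-Object EnableSMB1Protocol

# If nothing legacy depends on it, turn it off
Set-SmbServerConfiguration -EnableSMB1Protocol $false -Force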

The democratization of technology means that no elite or special knowledge, techniques or tools are required to harness technology to some end. All you need is motive and motivation to do things at scale. This week, we learned that the democratization of technology is a huge double-edged sword.

It was blunted by a clever researcher for about $11
Again on the democratization of technology front, I find it fascinating that MalwareTech was able to blunt this attack by spending $11 of his own money to purchase the domain he found encoded in the output of his decompile. He's the best example of what a can-do technologist can do, given the right tools and the freedom to pursue his craft.

It has laid bare the heavy costs of technical debt for which there is no obvious solution
Technical debt is a term used in software engineering circles and computer science curricula, but I also think it can and should apply to infrastructure thinking. What’s technical debt? Take it away Wikipedia:

Technical Debt is a metaphor referring to the eventual consequences of poor system design, software architecture, or software development within a codebase. The debt can be thought of as work that needs to be done before a particular job can be considered proper or complete. If the debt is not repaid, then it will keep on accumulating interest, making it hard to implement changes later on.

I can't tell you how many times and at how many organizations I've seen this play out. Technical debt, from an IT Pro's perspective, can be the refusal to correct a misconfiguration of an important device upon which many services depend; a poorly-designed security regime that takes bad practice and cements it into formal process & habit; a refusal to give IT the necessary political cover & power to change bad practices or bad design into something durable and agile; or a refusal to patch your systems out of fear, or out of a desire to kick the can down the road a bit.
Over time, efforts will be made to pay that technical debt down, but unless a conscious effort is made consistently to keep it low, technical debt eventually -inevitably- becomes just as crippling to an organization as credit card debt becomes to a consumer. Changes to IT systems that are routine & easy in other organizations become hard in yours; and changes that are hard in other companies become close to impossible.

This is a really bad place to be for an IT Pro, and now WCry has made it even worse by exploiting organizations that carry high technical debt, particularly as it relates to patching. Indeed, it's almost as if the author of this malware understood at a fundamental level how much technical debt real-world organizations carry.

There is no obvious solution to this. We can't force people to use technology a certain way, or even to think of technology in a certain way. The point of going into business is to make money, not to build durable, secure & flexible technology systems, unless that is your business. Cloud services are the obvious answer, but they can't do things like run MRI machines or interface with robots on the Nissan assembly line. At least not yet. And nobody wants regulation, but that's a topic for another post.

It has shown how hard it is to maintain & patch systems that are in use for more than a typical workday
If we ignore the way WCry rampaged through Russia, China and other places where properly licensing your software is considered optional, something else interesting emerges: the organizations hardest hit by WCry were ones in which technology is in use well beyond the standard 8-hour workday, which likely makes patching those technology systems all the more difficult.

While reporting on the NHS fiasco has zoomed in on the fact that the UK's healthcare system had Windows XP widely deployed, I don't think that tells the whole story, even if it's true that 100% of NHS systems ran XP. I can easily see how patching in such environments could be difficult based on how much those systems are used. Hospitals and even out-patient facilities typically operate more than 8 hours a day; finding a slot of time in a given 24-hour period in which you can, with the consent of the hospital, take healthcare devices like MRI machines offline to update & reboot them is probably more difficult than it is in a company where systems are only required to be up between 7am and 6pm, for instance.

On and on down the list of WCry's corporate victims this pattern continues:

  • Nissan: factory control machines were infected with WCry. How easy is it to patch these systems amid what is surely a fast-paced, multi-shift, high-volume operating tempo?
  • German train system: literally the computers that make the trains run on time were hit by WCry. Trains and planes operate more than 8 hours a day, making them difficult to patch.
  • Telefonica & Portugal Telecom: more infrastructure companies that operate beyond a standard 8-hour day and got hit by WCry.

I know banks & universities were hit as well, but they're the exception that points at the rule emerging: security is hard enough in an 8-hour-a-day organization. But it's extra, extra hard when half of a 24-hour day, or even 2/3rds of it, is off-limits for patching. Without well-understood processes, buy-in and support from management, and discipline and focus on the part of a talented IT team, such high-tempo operating environments will inevitably fall behind the security curve and be preyed upon by WCry and its successors.

It has demonstrated dramatically the perpetual tension between uptime, security and the incentives thereof for IT
This is similar to the patching-is-hard-in-high-tempo-organizations claim, but focuses on IT incentives. For the first 20 or 30 years of Information Technology, our collective goal and mission in life was to create, build and maintain business systems that have as much uptime as possible. We call this '9s' as in, "how many ya got?!?", and it's about the only useful objective measure by which management continues to sign our checks.

Here, I’ll show you how it works:

IT Pro #1: I got five 9s of uptime this month, that's less than 26 seconds of unplanned downtime!

IT Pro #2: Still doesn’t touch my record in March of 2015, where I had six 9s (2.59 seconds of downtime) for this service!
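If you want to check our bragging, the arithmetic is simple; here's a sketch in PowerShell, assuming a 30-day month:

# Downtime budget for N nines of uptime in a 30-day month
$monthSeconds = 30 * 24 * 3600
foreach ($nines in 3..6) {
    $allowed = $monthSeconds * [math]::Pow(10, -$nines)
    '{0} nines: {1:N2} seconds of unplanned downtime allowed' -f $nines, $allowed
}
# five nines: ~25.92 seconds; six nines: ~2.59 seconds, matching the numbers above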

Uptime is our raison d'être, the thing we get paid to deliver the most. We do not get paid, in general, to practice our craft the right way, or the best-practice way, per se. We certainly do not get paid to guard against science-fiction tales of security threats involving cyber-weapon worms that encrypt all our data.

We are paid to keep things up and running because, at the end of the day, we're a cost center in the business. It takes a rare, unique and charismatic manager with support from the business to change that mindset, to get an organization beyond a place where it merely views IT as a cost center and a place to call when things that are supposed to be up are down.

And that's part of the reason why WCry was so effective around the globe.

It has spawned a bunch of ignorant commentary from non-technical people who are outraged at Microsoft

Zeynep Tufekci, an outstanding scholar of good reputation studying the impact of technology on society, wrote a piece in the NYT this weekend that had my blood boiling. Effectively, she blames Microsoft and incompetent IT teams for this mess:

First, companies like Microsoft should discard the idea that they can abandon people using older software. The money they made from these customers hasn’t expired; neither has their responsibility to fix defects. Besides, Microsoft is sitting on a cash hoard estimated at more than $100 billion (the result of how little tax modern corporations pay and how profitable it is to sell a dominant operating system under monopolistic dynamics with no liability for defects).

This is absurd on its face. She’s essentially arguing that software manufacturers extend warranties on software forever. She continues:

For example, Chromebooks and Apple’s iOS are structurally much more secure because they were designed from the ground up with security in mind, unlike Microsoft’s operating systems.

Tufekci, whom I really like and enjoy reading, is trolling us. 93% of Google's handsets don't run the latest Google OS, which means many people -close to a billion by my count- are, through no fault of their own, carrying around devices that aren't up to date. Should they be supported forever too? And Apple's iPhone, as much as I love it, can't run an assembly line that manufactures cars, never mind coordinate an MRI machine.

Rubbish. Disappointed she wrote this.

For all the reasons above, WCry is not the fault of Microsoft any more than it's the fault of the element Copper. If anything, the fault for this lies in the way we think about and use technology as businesses and as individuals. Certainly, IT shares some of the blame in these organizations, but there are mitigating factors, as I spoke about above.

Mostly, I lay the blame at the NSA for losing these damned things in the first place. If they can’t keep things secure, what hope do most IT shops have?

It has inspired at least one headline writer to say your data is safer with FaceTube than with your hospital
Again, more rubbish and uninformed nonsense from the normals. Sure, my data might be safer from third party hackers if I were to house it inside FaceTube, but then again, adtech companies might just buy that same dataset, anonymized, connect dots from that set to my online behavior dataset, and figure out who I really am. That’s FaceTube’s business, after all!

Find Office problems before they find you with Telemetry server

I've not always had a bromance with Microsoft's Office suite. I cut my word processing teeth on WordPerfect 5.1, did most of my undergrad papers in BeOS' one productivity suite ((GoBe Productive, still the best Office suite name)), and touch-typed my way to graduating cum laude in grad school with countless Turabian-style Google Docs papers.

Office?

That was for corporate suits, man. Rich corporate suits.

But all that’s ancient history. Or maybe I’ve become a suit. Either way, I’m loving Office today.

In 2015, Office has transformed into the ultimate agnostic git 'r done productivity package. It's free to use in many cases, but if you want to 'own' it, you can subscribe to it, just like HBO ((For the IT Pro, this is a huge advantage, as a cheap E-class sub gives you access to your own Exchange instance, your own Sharepoint server, and your own Office tenant. It's awesome!)). It's also available on just about any device or computing system you can think of, works just as well inside a browser as Google Docs does, and has an enormous install base.

From the Office Telemetry PDF guide, linked below

Office has become so impressive and so ubiquitous that it’s truly a platform unto itself, consumed a la carte or as part of a well-balanced Microsoft meal. I’m bullish on Windows but if Office’s former partner ever sunsets, I’m convinced my kid and his kid will still grow up in an Office world.

All of that makes Office really important for IT, so important that you as an IT guy should consider standing up some easy instrumentation around it.

Enter Office Telemetry, a super-simple package that flows your Office data to a SQL collector, mashes it up, and gives you important insight into how your users are using Office. It also surfaces the problems in Office -or Office documents- before your users do, and it’s free.

Oh, did I mention it’s called Office Telemetry? This thing makes you feel like an astronaut when you’re using it!

Here’s how you deploy it. Total time: about an hour.

  1. Download the Office 2013 ADMX/ADML files for Group Policy and deploy them to your Domain Controllers.
  2. Spin-up a 2008 R2 or 2012 VM, or find a modestly-equipped physical box that at least has Windows Management Framework 3.0/Powershell 3.0 on it. If it has a SQL 2012 instance on it that you can use, even better. If not, don’t stress and proceed to the next step.
  3. Set-aside a folder on a separate volume (ideally) for the telemetry data. If you're going to flow data from hundreds of Office users, plan for 5-25 megabytes per user, at a minimum.
    • If your users are on the WAN, plan accordingly. Telemetry data is pretty lightweight (50k chunks for older Office clients, 64k chunks for Office 2013)
  4. Install Office ProPlus 2013 or 365 on the VM. You do not need to use an Office 365 license for it to run.
  5. Download the Deploy Office Telemetry powershell script package from TechNet or via Script Browser in Powershell ISE.
  6. Because it's a script, you'll need to temporarily change your server's execution policy, self-sign it, or configure Group Policy as appropriate to run it. TechNet has instructions; a sketch follows this list.
  7. Run the script; it will download SQL 2012 express and install it for you if you don’t have SQL. It will also set proper SMB read/modify permissions on that folder you set up earlier.
  8. As if that wasn't enough, the script will give you a single registry keyfile you can use to deploy to your users' machines.
  9. But I prefer the Group Policy/SCCM route. Remember the ADMX files you deployed? Flip the switches as appropriate under User Configuration>Administrative Templates>Microsoft Office 2013> Telemetry Dashboard.
  10. Sit back, and watch the data flow in, and pat yourself on the back because you’re being a proactive IT Pro!
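For steps 6 and 7, here's a hedged sketch of what that looks like at the console; the script name below is illustrative, so use whatever the TechNet package actually ships:

# Relax execution policy for this session only, then run the deployment script
Set-ExecutionPolicy -Scope Process -ExecutionPolicy RemoteSigned
.\Deploy-TelemetryDashboard.ps1   # downloads & installs SQL 2012 Express if absent, sets share permissions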

As I’ve deployed this solution, I’ve found broken documents, expensive add-ons that delay Office, and multiple other issues that were easy to resolve but difficult to surface. It’s totally worth your time to install it.

Office Telemetry PDF

It's been a while since I posted about my home lab, Daisettalabs.net, but rest assured, though I've been largely radio silent on it, I've been busy.

If 2013 saw the birth of Daisetta Labs.net, 2014 was akin to the terrible twos, with some joy & victories mixed together with teething pains and bruising.

So what’s 2015 shaping up to be?

Well, if I had to characterize it, I’d say it’s #LabGlory, through and through. Honestly. Why?

I’ve assembled a home lab that’s capable of simulating just about anything I run into in the ‘wild’ as a professional. And that’s always been the goal with my lab: practicing technology at home so that I can excel at work.

Let’s have a look at the state of the lab, shall we?

Hardware & Software

Daisetta Labs.net 2015 comprises the following:

  • Five (5) physical servers
  • 136 GB RAM
  • Sixteen (16) non-HT Cores
  • One (1) wireless access point
  • One (1) zone-based Firewall
  • Two (2) multilayer gigabit switches
  • One (1) Cable modem in bridge mode
  • Two (2) Public IPs (DHCP)
  • One (1) Silicon Dust HD
  • Ten (10) VLANs
  • Thirteen (13) VMs
  • Five (5) Port-Channels
  • One (1) Windows Media Center PC

That’s quite a bit of kit, as a former British colleague used to say. What’s it all do? Let’s dive in:

Physical Layout

The bulk of my lab gear is in my garage on a wooden workbench.

Nodes 2-4, the core switch, my Zywall edge device, modem, TV tuner, Silicon Dust device and Ooma phone all reside in a secured 12U two-post rack I picked up on eBay about two years ago for $40. One other server, core.daisettalabs.net, sits inside a mid-tower case stuffed with nine 2TB Hitachi HDDs and five 256GB SSDs below the rack.

Placing my lab in the garage has a few benefits, chief among them: I don’t hear (as many) complaints from the family cluster about noise. Also, because it’s largely in the garage, it’s isolated & out of reach of the Child Partition’s curious fingers, which, as every parent knows, are attracted to buttons of all types.

Power & Thermal

Of course you can’t build a lab at home without reliable power, so I’ve got one rack-mounted APC UPS, and one consumer-grade Cyberpower UPS for core.daisettalabs.net and all the internet gear.

On average, the lab gear in the garage consumes about 346 watts, or about 3 amps. That’s significant, no doubt, costing me about $38/month to power, or about 2/3rds the cost of a subscription to IT Pro TV or Pluralsight. 🙂

Thermals are a big challenge. My house was built in 1967, has decent insulation and holds temperature fairly well in the habitable parts of the space. But none of that is true about the garage, where my USB lab thermometer has recorded temps as low as 3C last winter and as high as 39C in Summer 2014. That's air temperature at the top of the rack, mind you, not at the CPU.

One of my goals for this year is to automate the shutdown/powerup of all node servers in the garage based on the temperature reading of the USB thermometer. The $25 thermometer is something I picked up on Amazon awhile ago; it outputs to .csv but I haven't figured out how to automate its software interface with PowerShell yet.
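If I ever do crack that interface, the logic itself is trivial. A sketch of where I'm headed, assuming the thermometer's .csv has Timestamp and TempC columns and lives at the path below (both are assumptions):

# Read the latest temp; shut the garage nodes down if it's too hot at the top of the rack
$latest = Import-Csv 'C:\TemperLog\temps.csv' | Select-Object -Last 1
if ([double]$latest.TempC -gt 38) {
    'node2','node3','node4' | ForEach-Object {
        Stop-Computer -ComputerName $_ -Force   # graceful-ish; VMs should be set to save/stop
    }
}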

Anyway, here’s my stack, all stickered up and ready for review:


Beyond the garage, the Daisetta Lab extends to my home’s main hallway, the living room, and of course, my home office.

Here’s the layout:

homelab2015

Compute

On the compute side of things, it’s almost all Haswell with the exception of core and node3:

[table]

Server, Architecture, CPU, Cores, RAM, Function, OS, Motherboard

Core, AMD A-series, A8-5500, 2, 8GB, Tiered Storage Spaces & DC/DHCP/DNS, Server 2012 R2, Gigabyte D4

Node1, Haswell, i7-4770k, 4, 32GB, Main PC/Office/VM host/storage, 2012R2, Supermicro X10SAT

Node2, Haswell, Xeon E3-1241, 4, 32GB, Cluster node, 2012r2 core, Supermicro X10SAF

Node3, Sandy Bridge, i7-2600, 4, 32GB, Cluster node, 2012r2 core, Biostar

Node4, Haswell, i5-4670, 4, 32GB, Cluster node/storage, 2012r2 core, Asus

[/table]

I love Haswell for its speed, thermal properties and affordability, but damn! That’s a lot of boxes, isn’t it? Unfortunately, you just can’t get very VM dense when 32GB is the max amount of RAM Haswell E3/i7 chipsets support. I love dynamic RAM on a VM as much as the next guy, but even with Windows core, it’s been hard to squeeze more than 8-10 VMs on a single host. With Hyper-V Containers coming, who knows, maybe that will change?

Node1, the pride of the fleet and my main productivity machine, boasting 2×850 Pro SSDs in RAID 0, an AMD FirePro, and Tiered Storage Spaces

While I included it in the diagram, TVPC3 is not really a lab machine. It's a cheap Ivy Bridge Pentium with 8GB of RAM and 3TB of local storage. Its sole function in life is to decrypt the HD stream it receives from the Silicon Dust tuner and display HGTV for my mother-in-law with as little friction as possible. Running Windows 8.1 with Media Center, it's the only PC in the house without battery backup.

Physical Network
About 18 months ago, I poured gallons of sweat equity into cabling my house. I ran at least a dozen CAT-5e cables from the garage to my home office, bedrooms, living room and to some external parts of the house for video surveillance.
I don’t regret it in the least; nothing like having a reliable, physical backbone to connect up your home network/lab environment!

Meet my underlay

At the core of the physical network lies my venerable Cisco 2960S-48TS-L switch. Switch1 may be a humble access-layer switch, but in my lab, the 2960S bundles 17 ports into five port channels, serves as my default gateway, routes with some rudimentary Layer 3 functions ((Up to 16 static routes, no dynamic route features are available)) and segments 9 VLANs and one port-security VLAN, a feature that's akin to PVLAN.

Switch2 is a 10 port Cisco Small Business SG-300 running at Layer 3 and connected to Switch1 via a 2-port port-channel. I use a few ports on switch2 for the TV and an IP cam.

On the edge is redzed.daisettalabs.net, the Zyxel USG-50, which I wrote about last month.

Connecting this kit up to the internet is my Motorola Surfboard router/modem/switch/AP, which I run in bridge mode. The great thing about this device and my cable service is that for some reason, up to two LAN ports can be active at any given time. This means that CableCo gives me two public DHCP addresses simultaneously. One of these goes into a WAN port on the Zyxel, and the other goes into a downed switchport.

Love Meraki's RF Spectrum chart!

Lastly, there's my Meraki MR-16, an access point a friend and Ubiquiti Networks fan gave me. Though it's a bit underpowered for my tastes, I love this device. The MR-16 is trunked to switch1 and connects via an 802.3af power injector. I announce two SSIDs off the Meraki, both secured with WPA2 Personal ((WPA2 Enterprise is on the agenda this year)). Depending on which SSID you connect to, you'll end up on the Device or VM VLANs.

Virtual Network

The virtual network was built entirely in System Center VMM 2012 R2. Nothing too fancy here, with multiple Gigabit adapters per physical host, one converged logical vSwitch and a separate NIC on each host fronting for the DMZ network:

Nodes 1, 2 & 4 are all Haswell, and are clustered. Node3 is standalone.

Thanks to VMM, building this out is largely a breeze, once you’ve settled on an architecture. I like to run the cmdlets to build the virtual & logical networks myself, but there’s also a great script available that will build a converged network for you.

A physical host typically looks like this (I say typically because I don’t have an equal number of adapters in all hosts):

I trust VLANs and VMM's segmentation abilities, but chose to build what is in effect an air-gapped vSwitch for the DMZ/DIA networks

We’re already several levels deep in my personal abstraction cave, why stop here? Here’s the layout of VM Networks, which are distinguished from but related to logical networks in VMM:

labnet13

I get a lot of questions on this blog about jumbo frames and Hyper-V switching, and I just want to reiterate that it’s not that hard to do, and look, here’s proof:

jumbopacket
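For the curious, the moving parts amount to a couple of cmdlets and a don't-fragment ping. A sketch, assuming your NIC driver exposes the standard *JumboPacket keyword and using my lab's vEthernet names and IPs:

# Enable jumbo frames on a vEthernet, then verify end-to-end
Set-NetAdapterAdvancedProperty -Name 'vEthernet (iSCSI-1)' -RegistryKeyword '*JumboPacket' -RegistryValue 9014
ping 172.16.11.11 -f -l 8000   # -f forbids fragmentation; this fails if any hop lacks jumbo support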

Good stuff!

Storage

Last, and certainly most interestingly, we arrive at Daisetta Lab's storage resources.

My lab journey began with storage testing, in particular ZFS via NexentaCore (Illumos), NAS4Free and Solaris 11. But that’s ancient history; since last summer, I’ve been all Windows, all the time in my lab, starting with SAN.Daisettalabs.net ((cf #StorageGlory : 30 Days on a Windows SAN)).

Now?

Well, I had so much fun -and importantly so few failures/pains- with Microsoft’s Tiered Storage Spaces that I’ve decided to deploy not one, or even two, but three Tiered Storage Spaces. Here’s the layout:

[table]Server, #HDD, #SSD, StoragePool Capacity, StoragePool Free, #vDisks, Function

Core, 9, 6, 16.7TB, 12.7TB, 6, SMB3/iSCSI target for entire lab (so far)

Node1, 2, 2, 2.05TB, 1.15TB, 2, SMB3 target for Hyper-V replication

Node4, 3, 1, 2.86TB, 1.97TB, 2, SMB3 target for Hyper-V replication

[/table]

I have to say, I continue to be very impressed with Tiered Storage Spaces. It’s super-flexible, the cmdlets are well-documented, and Microsoft is iterating on it rapidly. More on the performance of Tiered Storage Spaces in a subsequent post.

Thanks for reading!

Microsoft is the original & ultimate hyperconverged play

The In Tech We Trust Podcast has quickly become my favorite enterprise technology podcast since it debuted late last year. If you haven't tuned into it yet, I advise you to get the RSS feed into your podcast player of choice ASAP.

The five gents ((Nigel Poulton, Linux trainer at Pluralsight; Hans De Leenheer, datacenter/storage guy and one of my secret crushes; Gabe Chapman; Marc Farley; and Rick Vanover)) putting on the podcast are among the sharpest guys in infrastructure technology, have great on-air chemistry with each other, and consistently deliver an organized & smart format that hits my player on-time as expected every week. Oh, and they've equalized the Skype audio feeds too!

And yet….I can’t let the analysis in the two most recent shows slip by without comment. Indeed, it’s time for some tough love for my favorite podcast.

Guys, you totally missed the mark discussing hyperconvergence & Microsoft over the last two shows!

For my readers who haven’t listened, here’s the compressed & deduped rundown of 50+ minutes of good stimulating conversation on hyperconvergence:

  • There’s little doubt in 2015 that hyperconverged infrastructure (HCI) is a durable & real thing in enterprise technology, and that that thing is changing the industry. HCI is real and not a fad, and it’s being adopted by customers.
  • But if HCI is real, it's also different things to different people; for Hans, it's about scale-out, node-based architecture; for others on the show, it's more or less the industry definition: unified compute & storage with automation & management APIs and a GUI framework over the top.
  • But that loose definition is also evolving, as Rick Vanover sharply pointed out: EMC's new offering, vSpex Blue, offers something more than what we'd traditionally (like two weeks ago) think of as hyperconvergence

Good stuff and good discussion.

And then the conversation turned to Microsoft. And it all went downhill. A summary of the guys’ views:

  • Microsoft doesn’t have a hyperconverged pony in the race, except perhaps Storage Spaces, which few like/adopt/bet on/understand
  • MS has ceded this battlefield to VMware
  • None of the cool & popular hyperconverged kids, save for Nutanix and Gridstore, want to play with Microsoft
  • Microsoft has totally blown this opportunity to remain relevant and Hyper-V is too hard. Marc Farley in particular emphasized how badly Microsoft has blown hyperconvergence

I was, you might say, frustrated as I listened to this sentiment on the drive into my office today. My two cents below:

The appeal of Hyperconvergence is a two-sided coin. On the one side are all the familiar technical & operational benefits that are making it a successful and interesting part of the market.

  • It’s an appliance: Technical complexity and (hopefully) dysfunction are ironed out by the vendor so that storage/compute/network just work
  • It’s Easy: Simple to deploy, maintain, manage
  • It’s software-based and it’s evolving to offer more: As the guys on the show noted, newer HCI systems are offering more than ones released 6 months or a year ago.

The other side of that coin is less talked about, but no less powerful. HCI systems are rational cost centers, and the success of HCI marks a subtle but important shift in IT & in the market.

    • It’s a predictable check cut to fewer vendors: Hyperconvergence is also about vendor consolidation in IT shops that are under pressure to make costs predictable and smoother (not just lower).
    • It’s something other than best-of-breed: The success of HCI systems also suggests that IT shops may be shying away from best-of-breed purchasing habits and warming up to a more strategic one-throat-to-choke approach ((EMC & VMware, for instance, are titans in the industry, with best-in-class products in storage & virtualization, yet I can’t help but feel there’s more going on than the chattering classes realize. Step back and think of all the new stuff in vSphere 6, and couple it with all the old stuff that’s been rebranded as new in the last year or so by VMware. Of all that ‘stuff’, how much is best of breed, and how much of it is decent enough that a VMware customer can plausibly buy it and offset spend elsewhere?))
    • It’s some hybrid of all of the above: HCI in this scenario allows IT to have its cake and eat it too, maybe through vendor consolidation, or cost-offsets. Hard to gauge but the effect is real I think.

((As Vanover noted, EMC’s value-adds on the vSpex Blue architecture are potentially huge: if you buy vSpex Blue architecture, you get backup & replication, which means you don’t have to talk to or cut yearly checks to Commvault, Symantec or Veeam. I’ve scored touchdowns using that exact same play, embracing less-than-best Microsoft products that do the same thing as best-in-class SAN licenses))

And that’s where Microsoft enters the picture as the original -and ultimate- Hyperconverged play.

Like any solid HCI offering, Microsoft makes your hardware less important by abstracting it, but where Microsoft is different is that they scope supported solutions to x86 broadly. VMware, in contrast, only hands out EVO:RAIL stickers to hardware vendors who dress x86 up and call it an appliance, which is more or less the Barracuda Networks model. ((I'm sorry. I know that was a cheapshot, but I couldn't resist))

With your vanilla, Plain Jane whitebox x86 hardware, you can then use Microsoft’s Hyperconverged software system (or what I think of as Windows Server) to virtualize & abstract all the things from network (solid NFV & evolving overlay/SDN controller) to compute to storage, which features tiering, fault-tolerance, scale-out and other features usually found in traditional SAN systems.

But it doesn’t stop there. That same software powers services in an enormous IaaS/PaaS cloud, which works hand-in-hand with a federated productivity cloud that handles identity, messaging, data-mining, mail and more. The IaaS cloud, by the way, offers DR capabilities today, and you can connect to it via routing & ipsec, or you can extend your datacenter’s layer 2 broadcast domain to it if you like.

On the management/automation side, I understand/sympathize with the ignorance of non-'softies. Microsoft fans enthuse about Powershell so much because it is -today- a unified management system across a big chunk of the MS stack, either masked by GUI systems like System Center & Azure Pack or exposed as naked cmdlets. Powershell alone isn't cool though, but Powershell & Windows Server aligned with truly open management frameworks like CIM, SMI-S and WBEM is very cool, especially in contrast to feature-packed but closed APIs.

On the cost side, there's even more to the MS hyperconverged story: customers can buy what is in effect a single SKU (the Enterprise Agreement) and get access to most if not all of the MS stack.

Usually, organizations pay for the EA in small, easier-to-digest bites over a three year span, which the CFO likes because it's predictable & smooth. ((Now, of course, I'm drastically simplifying Microsoft's licensing regime and the process of buying an EA, as you can't add an EA to your cart & checkout; it's a friggin negotiation. And yes I know everyone hates the true-up. And I grant that an EA just answers the software piece; organizations will still need the hardware, but I'd argue that de-coupling software from hardware makes purchasing the latter much, much easier, and how much hardware do you really need if you have Azure IaaS to fill in the gaps?))

Are all these Microsoft things you've bought best of breed? No, of course not. But you knew that ahead of time, if you did your homework.

Are they good enough in a lot of circumstances?

I’ll let you judge that for yourself, but, speaking from experience here, IT shops that go down the MS/EA route strategically do end up in the same magical, end-of-the-rainbow fairy-tale place that buyers of HCI systems are seeking.

That place is pretty great, let me tell you. It's a place where the spend & costs are more predictable and bigger checks are cut to fewer vendors. It's a place where there are fewer debutante hardware systems fighting each other and demanding special attention & annual maintenance/support renewals in the datacenter. It's also a place where you can manage things by learning verb-noun pairs in Powershell.

If that’s not the ultimate form of hyperconvergence, what is?

System Center is Dead, Long Live System Center?

Change is afoot for System Center, Microsoft's stack of enterprise technology management applications that guys like me install, use, manage, and build great careers on top of. And not just little change. Big, sweeping change, I'm convinced, thanks largely to Satya Nadella, but also thanks to a new & healthy culture of pragmatism inside Microsoft.

But that pragmatic culture began with a bit of fear & intimidation for the System Center team. I'm told by a source ((Not really)) that it went down like this: Nadella strolled over to the office building where System Center is built by segregated development teams. I'm told that the ConfigMan & VMM teams, as creators of the most popular programs in the suite, get corner offices with views of the Cascades, while the Service Manager & DPM teams fight over cubes in the interior.

Anyway, Nadella walked in one day, called them all around a handsome, gigantic, rectangular redwood work table in the center of their space. He looked at each of them quietly, then -with a roar that’s becoming legendary throughout the greater Seattle metroplex- he bent over and with enormous strength, flipped the table on its side, spilling coffee, laptops, management packs, DPM replicas, System Center Visio shapes and the pride/pain of so many onto the cold, grey marble floor.

“Some of this is going to stay. And some of it’s going to go,” he said to them, motioning to the mess on the floor.

And then, he vanished, like a ninja.

But seriously, look at all the change happening at Microsoft. Surely the System Center we love/hate/want to name our kid after is not going to escape 2015 without some serious, deep, and heartbreaking/joy-inducing change, depending on your perspective. It's already happening. To wit:

  • Parts of System Center are dead as of Windows Server Technical Preview: App Controller, the self-service Silverlight & HTTP front-end to VMM, has been dropped from System Center Technical Preview. Farewell, oddly-named App Controller, can't say I'll miss you. In its place? Azure Pack, baby.
  • In the last 45 days, the whole System Center team has been busy begging and pleading with us to give them some feedback. VMM put up a Survey Monkey, and the DPM, Orchestrator, and Service Manager blogs all have been asking readers to give them more feedback. VMM even has a Customer Panel whose purpose is to take the pulse of working virtualization stiffs like me. That's awesome -and reflects the broader changes in the company- but it's also a bit scary because I love my VMM & ConfigMan and I'm not used to being asked what I think of it; I'm used to just taking it, warts and all. ((Since they asked, I'm running SCVMM Technical Preview in the lab at home and though its changes mostly amount to removal of features in the production version, I view it as a great advancement for one reason: I can now automate the re-naming of vNICs through VMM itself, rather than some obscure netsh command/batch file thingy. Awesome))
  • There are many configuration management products out there, but ConfigMan is mine, and it has remained suspiciously absent from System Center Technical Preview. Now I'm not suggesting that MS is going to kill off the crown jewel of its System Center suite, but crazier things have happened. Jeffrey Snover, father of Powershell, isn't giving up on his Desired State Configuration cmdlets, the DSC sect within the Microsoft professional community is gaining influence & strutting about the datacenter floor with some swagger, and DSC is a tool that with some maturity could largely make ConfigMan unnecessary in many environments (see the minimal DSC sketch after this list). It probably scales to Azure better, though it doesn't have anything in MDM as far as I know.
  • Though much improved, SCOM still strikes me as too hard to build out compared to Monitoring-as-a-Service offerings like New Relic. Granted, SCOM's cloud story was pretty strong; just two months ago I got a taste of #MonitoringGlory when I piped an endless train of SCOM alerts/events directly into Azure Operational Insights and got, well, some insight into my stack. But guess what, SCOM fans? You no longer need SCOM for that. Ok then. Why would I use it?
  • There are no sacred cows at Microsoft anymore: My precious Lync? Gone. Renamed Skype for Business. The Start Screen, which I was strangely beginning to like? I’m suffering Stockholm Syndrome as I play with the latest Windows 10 build; it’s been axed! Sharepoint online public-facing websites? Starting March 9, new customers won’t have to go through the crucible some of us have gone through to stand-up a dynamic corporate website back-ended by Sharepoint in Office 365. They get to go through someone else’s crucible, like Drupal or something.
  • Nadella has a talent for picking the obvious, and he's clear: Apparently it was Nadella who told the Microsoft HoloLens team that what they were building was more akin to the Enterprise's Holodeck than a new way to play shooters on Xbox Live. It's been Nadella repeating the call that there should be One Windows across all products, not an RT here and a Windows Phone there. Like him or not, the man has some clarity on where he wants Microsoft to be; and I think that's exactly what MS needed.
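To make that DSC point concrete, here's a minimal sketch of the kind of baseline ConfigMan has traditionally enforced; the feature choice is illustrative:

# Declare the state you want; the Local Configuration Manager makes it so
Configuration Baseline {
    Node 'localhost' {
        WindowsFeature NoSMB1 {
            Name   = 'FS-SMB1'
            Ensure = 'Absent'
        }
    }
}
Baseline                                          # compiles a .mof into .\Baseline
Start-DscConfiguration -Path .\Baseline -Wait -Verbose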

So, I have no evidence that System Center is going to get all shook up in 2015 -and I mean seriously shaken up- but it seems pretty obvious to me that with Nadella came a healthy & powerful introspection that’s really bearing some fruit in parts of Microsoft’s business.

Now it’s System Center’s turn. And it’s good. We should look at that suite holistically, in the context of our time & and the marketplace. Parts of it are undoubtedly great & market-leading; other parts of it are, in my opinion, beyond fixing. The former will be strengthened, the latter will be cut off and discarded. System Center, whether it lives on or gets swallowed up by the Azure Pack, will get better, and I’m pumped about that!

Labworks 2:5-8 – Get-Me -ConvergedSwitching -For “Hyper-V” | Now-Please

Hello Labworks fans, detractors and partisans alike, hope you had a nice Easter / Resurrection / Agnostic Spring Celebration weekend.

Last time on Labworks 2:1-4, we looked at some of the awesome teaming options Microsoft gave us with Server 2012 via its multiplexor driver. We also made the required configuration adjustments on our switch for jumbo frames & VLAN trunking, then we built ourselves some port channel interfaces flavored with LACP.

I think the multiplexor driver/protocol is one of the great (unsung?) enhancements of Server 2012/R2 because it’s a sort of pre-virtualization abstraction layer (That is to say, your NICs are abstracted & standardized via this driver before we build our important virtual switches) and because it’s a value & performance multiplier you can use on just about any modern NIC, from the humble RealTek to the Mighty Intel Server 10GbE.

But I’m getting too excited here; let’s get back to the curriculum and get started shall we?

Goals

5.  Understand what Microsoft’s multiplexor driver/LBFO has done to our NICs

6. Build our Virtual Machine Switch for maximum flexibility & performance

7. The vEthernets are Coming

8. Next Steps: Jumbo frames from End-to-end and performance tuning

Schematic:

Lab 2 - Daisetta Labs overview

2:5 Understand what Microsoft’s Multiplexor driver/LBFO has done to our NICs

So as I said above, the best way to think about the multiplexor driver & Microsoft’s Load Balancing/Failover tech is by viewing it as a pre-virtualization abstraction layer for your NICs. Let’s take a look.

Our Network Connections screen doesn’t look much different yet, save for one new decked-out icon labeled “Daisetta-Team:”

daisettateam

Meanwhile, this screen is still showing the four NICs we joined into a team in Labworks 2:3, so what gives?

A click on the properties of any of those NICs (save for the RealTek) reveals what’s happened:

Egads! My Intel NIC has been neutered by LBFO

The LBFO process unbinds many (though not all) settings, configurations, protocols and certain driver elements from your physical NICs, then binds the fabulous Multiplexor driver/protocol to the NIC as you see in the screenshot above.

In the dark days of 2008 R2 & Windows core, when we had to walk uphill to school both ways in the snow, I had to download and run a cmd tool called nvspbind to get this kind of information.

Fortunately for us in 2012 & R2, we have some simple cmdlets:

daisettateam3
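If you want to poke at the same info yourself, a hedged pair of one-liners, using the team and NIC names from my lab:

Get-NetLbfoTeam                       # shows the team, its member NICs, and the teaming & LB modes
Get-NetAdapterBinding -Name 'Ethernet 4' | Format-Table ComponentID, DisplayName, Enabled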

So notice Microsoft has essentially stripped “Ethernet 4” of all that would have made it special & unique amongst my 4x1GbE NICs; where I might have thought to tag a VLAN onto that Intel GbE, the multiplexor has stripped that option out. If I had statically assigned an IP address to this interface, TCP/IP v4 & v6 are now no longer bound to the NIC itself and thus are incapable of having an IP address.

And the awesome thing is you can do this across NICs, even NICs made by separate vendors. I could, for example, mix the sacred NICs (Intel) with the profane NICs (RealTek)…it don’t matter, all NICs are invited to the LBFO party.

No extra licensing costs here either; if you own a Server 2012 or 2012 R2 license, you get this for free, which is all kinds of kick ass, as this bit of tech has allowed me in many situations to delay hardware spend. Why go for 10GbE NICs & switches when I can combine some old Broadcom NICs, leverage LACP on the switch, and build 6x1 or 8x1GbE converged LACP teams?

LBFO even adds up all the NICs you’ve given it and teases you with a calculated LinkSpeed figure, which we’re going to hold it to in the next step:

A 4Gb/s LACP team sounds great, but is it really 4Gb/s?
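You can pull that teased figure yourself with a one-liner, using my team's name:

Get-NetAdapter -Name 'Daisetta-Team' | Select-Object Name, LinkSpeed   # reports 4 Gbps for my 4x1GbE team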

2:6 Build our Virtual Machine Switch for maximum flexibility & performance

If we just had the multiplexor protocol & LBFO available to us, it’d be great for physical server performance & durability. But if you’re deploying Hyper-V, you get to have your LBFO cake and eat it too, by putting a virtual switch atop the team.

This is all very easy to do in Hyper-V manager. Simply right click your server, select Virtual Switch Manager, make sure the Multiplexor driver is selected as the NIC, and press OK.

Bob’s your Uncle:

daisettaconverged1

But let’s go a bit deeper and do this via powershell, where we get some extra options & control:

PS C:\users\jeff.DAISETTALABS> New-VMSwitch -NetAdapterInterfaceDescription "Microsoft Network Adapter Multiplexor Driver" -AllowManagementOS 1 -MinimumBandwidthMode Weight -Name "Daisetta-Converged"

Let’s go through each of these:

  • New-VMSwitch : the cmdlet we're invoking to build the switch. Run get-help new-vmswitch for a rundown of the cmdlet's structure & options
  • -NetAdapterInterfaceDescription : here we're telling Windows which NIC to build the VM Switch on top of. Get the precise name from Get-NetAdapter and enclose it in quotes
  • -AllowManagementOS 1 : Recall the diagram above. This boolean switch (1 yes, 0 no) tells Windows to create the VM Switch & plug the Host/Management Operating System into said Switch. You may or may not want this; in the lab I say yes; at work I've used no.
  • -MinimumBandwidthMode Weight : We lay out the rules for how the switch will apportion some of the 4Gb/s bandwidth available to it. By using "Weight," we're telling the switch we'll assign some values later (see the sketch below)
  • -Name : Name your switch

A few seconds later, and congrats Mr. Hyper-V admin, you have built a converged virtual switch!
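And here's the sketch promised above for assigning those weights later, once the vEthernets we'll build in 2:7 exist; the numbers are illustrative, not a recommendation:

# Reserve relative shares of the 4Gb/s pipe for host vEthernets & default VM traffic
Set-VMNetworkAdapter -ManagementOS -Name LM -MinimumBandwidthWeight 20
Set-VMNetworkAdapter -ManagementOS -Name CSV -MinimumBandwidthWeight 10
Set-VMSwitch -Name 'Daisetta-Converged' -DefaultFlowMinimumBandwidthWeight 30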

2:7 The vEthernets are Coming

Now that we’ve built our converged virtual switch, we need to plug some things into it. And that starts on the physical host.

If you're building a Hyper-V cluster or stand-alone Hyper-V host with VMs on networked storage, you'll approach vEthernet adapters differently than if you're building Hyper-V for VMs on attached/internal storage or on SMB 3.0 share storage. In the former, you're going to need storage vEthernet adapters; in the latter you won't need as many vEthernets unless you're going multi-channel SMB 3.0, which we'll cover in another Labworks session.

I’m going to show you the iSCSI + Failover Clustering model.

In traditional Microsoft Failover Clustering for Virtual Machines, we need a minimum of five discrete networks. Here’s how that shakes out in the Daisetta Lab:

[table]

Network Name, VLAN ID, Purpose, Notes

Management, 1, Host & VM management network, You can separate the two if you like

CSV, 14, Host cluster communication & coordination, Important for clustering Hyper-V hosts

LM, 15, Live Migration network, When you must send VMs from the broke host to the host with the most, LM is there for you

iSCSI 1-3, 11-13, Storage, Somewhat controversial but supported

[/table]

Now you should be connecting the dots: remember in Labworks 2:1, we built a trunked port-channel on our Cisco 2960S for the sole purpose of these vEthernet adapters & our converged switch.

So, we're going to attach tagged vEthernet adapters to our host via powershell. Pay attention here to the "-ManagementOS" flag; though our converged switch is for virtual machines, we're using it for our physical host as well.

You can script this out of course (and VMM does that for you), but if you just want to copy/paste, do it in this order:

  • Add the vEthernets
add-vmnetworkadapter -managementos -name CSV -switchname Daisetta-converged
add-vmnetworkadapter -managementos -name iSCSI-1 -switchname Daisetta-converged
add-vmnetworkadapter -managementos -name iSCSI-2 -switchname Daisetta-converged
add-vmnetworkadapter -managementos -name iSCSI-3 -switchname Daisetta-converged
add-vmnetworkadapter -managementos -name LM -switchname Daisetta-converged
  • Tag those vEthernets!
Set-VMNetworkAdapterVlan -ManagementOS -Access -VlanId 15 -VMNetworkAdapterName LM
Set-VMNetworkAdapterVlan -ManagementOS -Access -VlanId 14 -VMNetworkAdapterName CSV
Set-VMNetworkAdapterVlan -ManagementOS -Access -VlanId 13 -VMNetworkAdapterName iSCSI-3
Set-VMNetworkAdapterVlan -ManagementOS -Access -VlanId 12 -VMNetworkAdapterName iSCSI-2
Set-VMNetworkAdapterVlan -ManagementOS -Access -VlanId 11 -VMNetworkAdapterName iSCSI-1
  • Now set IPs
New-NetIPAddress -IPAddress 172.16.14.12 -InterfaceAlias "vEthernet (CSV)" -AddressFamily IPv4 -PrefixLength 24
 
New-NetIPAddress -IPAddress 172.16.15.12 -InterfaceAlias "vEthernet (LM)" -AddressFamily IPv4 -PrefixLength 24
New-NetIPAddress -IPAddress 172.16.13.12 -InterfaceAlias "vEthernet (iSCSI-3)" -AddressFamily IPv4 -PrefixLength 24
New-NetIPAddress -IPAddress 172.16.12.12 -InterfaceAlias "vEthernet (iSCSI-2)" -AddressFamily IPv4 -PrefixLength 24
New-NetIPAddress -IPAddress 172.16.11.12 -InterfaceAlias "vEthernet (iSCSI-1)" -AddressFamily IPv4 -PrefixLength 24
 

Notice we didn't include a gateway in the New-NetIPAddress cmdlet; that's because when we built our virtual switch with "-AllowManagementOS 1", Windows automatically provisioned a vEthernet adapter for us, which either got an IP via DHCP or took an APIPA address.
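A quick hedged way to see both that auto-created adapter and where its address came from:

Get-VMNetworkAdapter -ManagementOS                 # lists every host-facing vEthernet
Get-NetIPAddress -InterfaceAlias 'vEthernet (Daisetta-Converged)' | Select-Object IPAddress, PrefixOrigin
# PrefixOrigin reads Dhcp for a leased address, WellKnown for APIPA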

So now we have our vEthernets and their appropriate VLAN tags:

daisettaconverged2
Ignore the DMZ vEthernet for now. Notice Daisetta-Converged, our VM Switch, is seen as a VMNetworkAdapter and is untagged. In my lab, this interface functions as my Host Management interface. In a production scenario, you'll probably use separate vEthernet adapters for Host Management and not expose the switch itself to the management OS.
2:8 Next Steps: Jumbo Frames from End-to-End & Performance Tuning

So if you’ve made it this far, congrats. If you do nothing else, you now have a converged Hyper-V virtual switch, tagged vEthernets on your host, and a virtualized infrastructure that’s ready for VMs.

But there’s more you can do; stay tuned for the next labworks post where we’ll get into jumbo frames & performance tuning this baby so she can run with all the bandwidth we’ve given her.

Links/Knowledge/Required Reading Used in this Post:

[table]
Resource, Author, Summary
New-VMSwitch Technet, Microsoft, Always good to have Technet reference
Building a Converged Fabric with Server 2012, Hans “The Hyper-Dutchman” Vredevoort, A 2012 post which helped me when I was struggling through 2008 R2 to 2012 Hyper-V migration

Hyper-V 3.0 Converged Networks with Force 10 and DCB, Dell, Neat Wiki & diagram with iSCSI as separate virtual switch but with DCB

[/table]

Live Migration Performance in Theory & Practice -or- In Which I take on Aidan Finn

Aidan Finn, upstanding Irishman, apparent bear-cub puncher, hobbyist photog, MVP all-star and one of my favorite Hyper-V bloggers (seriously, he's good, and along with DidierV & the Hyper-Dutchman has probably saved my vAss more times than I can vCount) appeared on one of my favorite podcasts last week, RunAs Radio with Canuck Richard Campbell.

Which is all sorts of awesome as these are a few of my favorite things piled on top of each other (Finn on RunAs).

The subject? Hyper-V, scale out file servers (SoFS) in 2012 R2, SMB 3.0 multichannel and Microsoft storage networking, which are just about my favoritest subjects in the whole wide world. I mean what are the odds that one of my favorite Hyper-V bloggers would appear on one of my favorite tech podcasts? Remote. And talk about storage networking tech, Redmond-style, during that podcast?

Where Perfmon is king, you will find Hyper-V bloggers like DidierV, who gets to play with 10G RDMA NICs

All that and an adorable Irish brogue?

This is Instant nerdgasm territory here people; if you’re into these black arts as I am, it’s a must-listen.

Anyway, Finn reminded me of his famous powershell demos in which he demonstrates all the options we Hyper-V admins have at our disposal now when it comes to Live Migrating VMs from host to host.

And believe me, we have so many now it's almost embarrassing, especially if you cut your teeth on Hyper-V 2.0 in 2008 R2, where successfully Live Migrating VMs off a host (or draining one during production) involved a few right clicks, a chicken sacrifice, Earth-Jupiter-Moon alignment, a reliable Geiger counter by your side and a tolerance for Pucker Factor Values greater than 10* **.

Nowadays, we can:

  • Live Migrate VMs between hosts in a cluster (.vhdx parked in a Cluster Shared Volume, VM config, RAM & CPU on a host….block storage, the Coke Classic option)
  • Live Migrate VMs parked on SMB 3.0 shares, just like you NFS jockeys do
  • Shared-nothing Live Migration, either storage + VM, just storage, or just VM!
    • A for instance:  from my Dell Latitude i7 ultrabook with Windows 8.1 and client hyper-v installed (natch), I can storage Live Migrate a .vhdx off my skinny but fast 256GB SSD to a spacious SMB share at work, then drop it back on my laptop at the end of the day, all via Scheduled Task or powershell with no downtime for the VM
    •  With Server 2012/2012 R2 you get all those options + SMB 3.0 multichannel

Not only that, but we have some cool new toys with which to make the cost of Live Migrating a VM to the host with the most a little less painful (a hedged sketch follows this list):

  • Standard TCP/IP : I like this because I’m old school and anything that stresses the network and LACP is fun because it makes the network guy sweat
  • Compression: Borrow spare cycles from the host CPU, compress the VM’s RAM, and Live Migrate your way out of a tight spot
  • SMB via Remote Direct Memory Access : the holy of holies in Live Migration. As Finn points out, this bit of tech can scale beyond the bandwidth capabilities of the PCIe 3 bus. SMB 3.0 + RDMA makes you hate your Northbridge
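In PowerShell terms, picking among those toys is a couple of cmdlets; a sketch, with a hostname, VM name and path from my lab standing in for yours:

# Choose the transport, enable migration, then do a shared-nothing move
Set-VMHost -VirtualMachineMigrationPerformanceOption Compression   # or TCPIP, or SMB for the RDMA folks
Enable-VMMigration
Move-VM -Name 'testvm' -DestinationHost 'node2' -IncludeStorage -DestinationStoragePath 'D:\VMs\testvm'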

Finn*** of course provided some Live Migration start:finish times resulting from the various methods above, which I then, of course, interpreted as Finn daring me personally over the radio to try and beat those times in my humble Daisetta Lab.

Now this is just for fun people; not a Labworks-style list of repeatable results, so let’s not nerd-out on how my testing methodology isn’t sound & I’m a stupidhead, ok?

Anyway, Sysinternals has a nice little tool to redline the RAM in your Windows VM. I don’t know how Finn does it, but I don’t have workloads (yet!) in the Lab that would fill 4GB of RAM with non-random data on a VM, so off to the cmd we go:

You type this (haven’t played with all the switches yet) in this navy blue screen:

ramtest

And then this happens and the somewhat pink graph goes full pink:

ramthevm

Then we press this button to test Live Migration w/ compression, as the Daisetta Lab doesn’t have fancy RDMA NICs like certain well-connected Irish Hyper-V bloggers:

Wish I had some RDMA NICs :sadface:

Which makes this blue celeste denim Azure colored line get all spikey:

Oh! My NUMA! The second spike was somewhat higher than the first. Why?

All of which results in wicked-fast Live Migrations & really cool orange-colored charts in my totally non-random, non-scientific but highly enjoyable laboratory experiment:

ramthevm2

Still, in the end, I like my TCP/IP uncompressed Live Migrations because 1) sackcloth & ashes, and 2) I didn’t go to the trouble of building a multiplexed LACP team -with a virtual switch on top!- just to let the Cat5es in my attic have an easy day at the office:

livemigration1gbe

But at work: yes. I love this compression stuff and echo Finn’s observations on how Hyper-V doesn’t slam your host CPUs beyond what the host & its VM fleet could bear.

Anyway, did I beat Finn’s Live Migration times in this fun little test? Will the Irish MVP have to admit he’s not so esteemed after all and surrender his Hyper_V_MVP_badge.gif to me?

Of course I did and yes he will.

But not really.

[table caption=”Daisetta Lab LM vs Finn’s Powershell LM Scripts – 4GB VM” width=”500″ colwidth=”20|100|50″ colalign=”left|left|left|left|left”]
Who,TCP/IP LM,Compressed LM,RDMA & SMB 3 LM,Notes
Finn,78 seconds, 15 seconds,6.8 seconds, “Mr. I once moved a VM with 56GB of RAM in 35 seconds probably has a few Xeons”
D-Lab,38 seconds,Like 12 or something,Whose ass do I need to kiss to get RDMA/iWARP?,But seriously my VM RAM was probably not random
[/table]

Finn notes in his posts that he's dedicating an entire 1GbE NIC for his Live Migration demos, whereas I'm embracing the converged switch model and haven't even played with bandwidth or QoS settings on my Hyper-V switch.

How do my VMware colleagues & friends measure this stuff & think about vMotion performance & reliability? I know NFS can scale & perform, but am ignorant on the nuances of v3 vs v4, how it works on the host and Distributed vSwitch and your “Shared nothing” storage vMotion. And what’s this I hear that vSphere won’t begin a vMotion without knowing it will complete? How’s that determined?

I mean I could spend an hour or two googling it, or you could, I don’t know, post a comment and save me the time and spread some of your knowledge 😀

I'm jazzed about SMB 3.0, but there are only a handful of storage vendors who have support for the new stack, and among them, as Finn points out, Microsoft is the #1 storage vendor for SMB 3 fans, with NetApp probably in 2nd place.

* Just kidding, it wasn’t that bad. Most days. 

** Pucker Factor Value can be measured by querying obscure wmi class win32_pfv

*** Finn is a consultant. So you can hire him. I have no relationship with him other than admiration for his scripting skillz