An engineer in a VMware shop running VMware's new VSAN converged storage/compute tech had a nearly 12-hour outage this week. He reports it in vivid detail on Reddit, making me feel like I'm right there with him:
At 10:30am, all hell broke loose. I received almost 1000 alert emails in just a couple minutes, as every one of the 77 VM’s in the cluster began to die – high CPU, unresponsive, applications or websites not working. All of the ESXi hosts started emitting a myriad of warnings, mostly for high CPU. DRS attempted to start migrating VM’s but all of the tasks sat “In progress”. After a few minutes, two of the ESXi hosts became “disconnected” from vCenter, but the machines were still running.
Everything appeared to be dead or dying – the VM’s that didn’t immediately stop pinging or otherwise crash had huge loads as their IO requests sat and spun. Trying to perform any action on any of the hosts or VM’s was totally unresponsive and vCenter quickly filled up with “In progress” tasks, including my request to turn off DRS in an attempt to stop it from making things worse.
I’m a Hyper-V guy and (admittedly) barely comprehend what DRS is, but wow. I’ve got 77 VMs in my 6-node cluster too. And I’ve been in that same position, when something unexpected…rare…almost impossible to wargame…happens and the whole cluster falls apart. For me it was an ARP storm in the physical switch, thanks in part to an immature understanding of 2008 R2’s virtual switching.
I’m not ashamed to say that in such situations intuition plays a part. Logs are an incomprehensible firehose, not useful, and they may even distract you from the real problem. Your ops manager VM, if it’s stored within the cluster (cf. the observer effect), is useless. So what do you have?
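That observer-effect point deserves a concrete illustration: if the only thing watching the cluster lives inside the cluster, it drowns with everything else. Below is a minimal sketch of an out-of-band probe run from a box outside the cluster; the host names are hypothetical, the ping flags assume Linux, and none of it comes from the Reddit post or from any vendor tooling.

```python
#!/usr/bin/env python3
# Minimal out-of-band health probe (a sketch only). Run it from a machine
# that is NOT hosted on the cluster, so it keeps observing while the cluster
# is busy dying. Host names are hypothetical placeholders.
import datetime
import subprocess

HOSTS = ["hv-node01", "hv-node02", "hv-node03",
         "hv-node04", "hv-node05", "hv-node06"]

def probe(host: str) -> str:
    """Send one ICMP ping (Linux ping flags) and return a terse status line."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        capture_output=True, text=True,
    )
    status = "up" if result.returncode == 0 else "DOWN or not answering"
    return f"{datetime.datetime.now().isoformat()} {host}: {status}"

if __name__ == "__main__":
    # Append to a local log so you end up with an independent timeline of
    # the outage: something to reason against when the ops manager VM and
    # vCenter are both underwater.
    with open("outage-probe.log", "a") as log:
        for host in HOSTS:
            line = probe(host)
            print(line)
            log.write(line + "\n")
```

Even something that crude gives you an independent record of when each host stopped answering, which is exactly the kind of known you want when you start building a mental model of the outage.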
You have what lots of us have, no matter the platform. A support contract. You spend valuable minutes explaining your situation to a guy on the phone who handles many such calls per day. Minutes, then a half hour, then a full hour tick by. The business is getting restless & voices are being raised. If your IT group has an SLA, you’re now violating it. Your pulse is rising; you’re sweating now.
So you escalate. You engage the sales team who sold you the product…you’re desperate. This guy got a vExpert on the phone. At times, I’ve had MVPs helping me. Yet with some problems, there are no obvious answers, even for the diligent & extraordinary.
But if you’re good, you’ve got a keen sense of what you know versus what you don’t know (cf. Donald Rumsfeld for the win), and you know when to abandon one path in favor of another. This engineer knew the timing of his outage exactly…what he did, when he finished that work, and when the outage started. Maybe he didn’t have it down in a spreadsheet, and proving it empirically in court would never work, but he knew: he was thinking about what he knew during his outage, putting his knowns and unknowns together, and building a model of the outage in his head.
I feel simpatico with this guy…and I’m not too proud to say that sometimes, when nothing’s left, you’ve got to run to the server room (if it’s nearby, which it isn’t in my case, or in this engineer’s case, I think) and check the blinky lights on the hard drives in each of your virtualization nodes. Are they going nuts? Does it look odd? The CPUs are redlined and the PuTTY session on the switch is slow…why’s that?
Is this signal, or is this noise?
Observe the data, no matter how you come by it. Humans are good at pattern recognition. Observe all you can, and then deduce.
Bravo to this chap for doing just that and feeling (yes, at times feeling) his way through the outage, even if he couldn’t solve it.
High five from a Hyper-V guy.