Tuesday, July 18, 2017

When things go horribly wrong …

How a few cents of wire lying unnoticed on the road can cripple a vehicle as large as an RV; we continue to value availability and it’s time to double down on the benefits of NonStop!

The most essential attribute of NonStop today is its fault tolerance. Availability is as highly valued as it has always been, and yet there are many parties arguing that it really isn’t an issue any longer. Push apps and data into the cloud – public or private, it matters little at this point – and the infrastructure on offer from cloud providers ensures your apps, and indeed your data, are protected and available 24 x 7. But is this really the situation, and should CIOs contemplating a future for their IT centered on cloud computing consider themselves immune to the many ways apps and data can be taken offline?

Unintended consequences! We read a lot about such outcomes these days and it is a further reflection on just how complex our interdependencies have become. Push a button over here and suddenly way over there, something just stops working. They weren’t even on the same network, or were they? Throw malware onto a Windows server looking after building infrastructure and suddenly, the data on a mainframe is compromised – who knew that they shared a common LAN? Ouch – but it happened as we all know oh so well.

For the past two months, Margo and I have been full-time RVers. That is, we are without a permanent address and have been living out of our company command center. We have driven to numerous events, all of which have been covered in previous posts to this blog. Our travels have continued and this past week we headed down to Southern California to meet with a client, a trip that took us through Las Vegas. In the heat of summer in the deserts of Nevada we hit temps exceeding 110°F. Overnighting at our regular RV site, we found a collection of fluids pooling underneath the RV and sheer panic set in. After all, this is our home; what had happened?

It turned out that, unknowingly, we had run over wire mesh that was completely invisible to the naked eye. But those strands of very thin wire managed to wrap themselves around the drive shaft of the RV where they became an efficient “weed whacker” – you know, those appliances we often see being used to trim hedges and lawn borders. In a matter of seconds our own drive shaft whipped these thin wires around with such force that the result was multiple shredded hydraulic lines and air hoses – who could have imagined that such innocent strands of wire could be so disruptive, or that they could completely cripple a 15 plus ton coach so quickly? Yes, unintended consequences are everywhere and, for the most part, lie outside any of our plans and procedures, where detection of the event comes too late.

It is exactly the same with all platforms and infrastructure, on-premise or in the cloud, or even hybrid combinations of both! If you don’t design for failure – even the most far-fetched – then you are destined for failure. It is as simple as that. In my time at Tandem Computers we often referred to an incident that led to Tandem systems always being side-vented and never top-vented. The reason for this was that, at an early demo of a NonStop system, coffee was accidentally spilt on top of the machine, effectively stopping the NonStop. Now, I am not sure of the authenticity of this event and would welcome anyone’s input as to the truth behind it, but it does illustrate the value of experience. Few designers would have caught on beforehand to the possibility that coffee would be spilt on a system the day it was being demoed, but for Tandem engineers it led to changes that exist to this day.

Experience has led to more observations, which in turn have generated more actions. This is all part of the heritage of NonStop and, in many respects, is part of the reason why there aren’t any competitors to NonStop today. You simply cannot imagine all of the unintended consequences and then document them in their entirety within the space of a two page business plan. But design for them you must, and as I look at how the platforms and infrastructure being hawked by vendors selling cloud computing today depend solely on the value proposition that comes with redundancy (which is all they ever point to), my head hits the table along with a not-too-subtle sigh of disbelief. Redundancy plays a part, of course, but it is just one part in negating potential outages; availability needs so much more. But at what cost?

The whole argument for cloud computing today revolves around greatly reduced IT costs – there is an elasticity of provisioning unlike anything we have experienced before but more importantly, given the virtualization that is happening behind the scenes, we can run many more clients on a cloud than was ever conceived as possible back when service bureaus and time-sharing options were being promoted to CIOs as the answer to keeping costs under control. With the greatly reduced costs came the equally important consideration of greatly reduced staff. And this is where the issue of unintended consequences really shows its face. Experience? Observations? Even plans and procedures? Who will be taking responsibility for ensuring the resultant implementations are fully prepared to accommodate elements that fail?

There is a very good reason why pilots run through checklists prior to takeoff, landing, changes of altitude, etc. Any time an action is to be taken there are procedures that must be followed. When I turn on the ignition of the RV, a checklist appears on the digital display for the same reason that pilots have checklists – too many bad things can happen if you miss something. I have managed to inflict considerable damage on our RV through the years when I forgot to follow all the items on the checklist. And there are best practices in place today at every data center that have been developed over time based, yet again, on experience – so when next we talk about availability as we head to clouds, who is preparing the next generation of checklists?

It is pleasing to me to see the effort that OmniPayments is putting into providing cloud computing based on NonStop. For the moment it is solely providing payments solutions to select financial institutions but even now, the number of clients opting to run their OmniPayments on the basis of SaaS rather than investing in platforms and infrastructure themselves sends a very powerful message to the community. Don’t discount the value of NonStop as it has been demonstrated through the ages – get to virtualized NonStop (vNS) as quickly as you can and go champion within your enterprise that yes, you now have the best possible solution, one that can survive even the strangest of unintended consequences. It’s just what NonStop was designed to do and it keeps on doing it.

You run on NonStop X so you will run on vNS. There is much that can go wrong with traditional physical systems just as there is much that can go wrong with clouds. Simply going for more clouds and leaving it to redundant banks of servers isn’t the safety net any enterprise should rely upon, so take it to the next level. Let everyone know how NonStop is taking its most prized attribute, availability, high and wide into the clouds! After all, these clouds are every bit as vulnerable to failure as any primitive hardware built in the past, and NonStop knows failures when it encounters them and just doesn’t stop!
