What’s not broken and just keeps on running? NonStop delivers!

As I look back at the past year, perhaps the best way to describe it is as a year when many things broke. I am not talking about the occasional drinking glass or dinner plate but about objects that mattered to us. When things are broken, their performance is hindered to the point where, in some cases, they no longer serve any useful purpose. Replacement appears to be the order of the day.

Mind you, I am not talking about the weather here in Colorado, although I have lost track of how many days have passed without a meaningful snowfall. This morning may have been the exception, as a little rain together with what might pass for sleet did fall, but not for long enough to stick to anything. December days that have climbed into the 70s°F (20s°C) would surely pass as unusual even if we didn’t talk about weather patterns that look to be broken.

However, before developing this story line further, please be happy for us, as remedies for almost everything have been found. But what was broken? This time last year Margo broke her leg badly, and the remedy included the insertion of metal rods and nails. In summer our Range Rover was rear-ended on the freeway and the insurance company wrote it off. The new sectional we had waited for did finally show up, but its central portion had been scratched in transport.

When it comes to IT, to digital transformation and to the pivot to everything-as-a-service, it’s hard to make light of the fact that clouds aren’t proving as rock-solid as their promoters would have you believe. To them, an outage here or there is nothing for us to worry about, but when a major cloud services provider like Amazon Web Services (AWS) breaks then yes, we should all be concerned. In fact, there were enough outages for CRN to publish The 10 Biggest Cloud Outages Of 2021 (So Far). As for the tag line, it was rather long but managed to sum up the predicament of many affected at the time:

“‘Outages can mean the end for companies, depending on their choices in design and deployment, or they can be complete non-events,’ Miles Ward, chief technology officer at Los Angeles-based Google partner SADA Systems, tells CRN. ‘Cloud has changed the nature of outages.’”

But then, CRN highlights something that should warm the hearts of many in NonStop, particularly at this time of year. Consider it the early arrival of your Christmas gift:

“‘Every cloud engineering team has seen how impossible it is for customers to engineer around these kinds of outages and is working hard to distribute, subdivide, and make fault-tolerant these central services,’ Ward said.”

This CRN article was published back in late July, so it missed reporting on the big AWS outage. Even so, it’s worth noting that among the top three worst outages were:

In third place – Fastly Outage in June. “Fastly impacted bulletin board website Reddit, video streaming service Twitch and a number of news sites including CNN and The New York Times.” Among the comments reported at the time by CRN was this particular gem:

“Michael Goldstein, CEO of LAN Infotech, a Fort Lauderdale, Fla.-based solution provider, told CRN at the time that the global outage shows how critical it is for customers to properly architect their cloud and on-premises network.

“‘Cloud isn’t any different than on-premises—with both cloud and on-premises you need to make sure you have the right architecture,’ Goldstein said. ‘We make sure that when we put mission-critical applications in [Microsoft] Azure for our customers we have multiple data center regions to prevent an outage like this. You need a fail-safe and a continuity plan to prevent outages.’”

Rising to second place, under the generalized heading of More Microsoft Issues, was an outage that this time centered on Microsoft Teams. Apparently, “Teams’ calling service sent calls straight into some users’ voicemails.” Now, depending on your level of tolerance for virtual meetings, this may have been a blessing in disguise, but in reality it all came back to issues with the infrastructure, according to updates Microsoft provided via the Microsoft 365 Status Twitter account:

“…Microsoft ‘isolated a recent change that has caused portions of infrastructure to send some Microsoft Teams calls straight to voicemail.’”

But then, Rosalyn Arntzen, president and CEO of Microsoft partner Amaxra, told CRN that “over the past few years, Microsoft had gotten ‘dramatically better’ at updating partners ‘as soon as they are aware of an issue and listing when they expect the issue to be solved—or at least provide a status.’”

Coming in with the blue-ribbon winning outage of the year (so far) was the Akamai Outage, June 17. Remember this outage? As CRN told it, “Nine days after the Fastly outage, a system issue with Cambridge, Mass.-based Akamai Technologies caused internet outages for global airlines, banks, and stock exchanges. The company saw service disruptions for its hosting platform, which helps defend against Distributed Denial-of-Service (DDoS) attacks.”

CRN’s report of this outage went on to highlight that:

“The disruption affected several large companies around the globe, including Southwest Airlines, United Airlines, Commonwealth Bank of Australia, Westpac Bank, and Australia and New Zealand Banking Group, as well as the Hong Kong Stock Exchange’s website. Services for many of the companies impacted were restored within the day.

“Downdetector.com showed spikes in complaints about service outages for websites of companies inside the U.S. as well as in a number of other countries including Australia, Germany and India.”

And remember, among the also-rans was the outage at Verizon that reports blamed on a fiber cut in Brooklyn, but that was later confirmed as being “a software issue triggered during routine network management activities.” And then there was the issue at Google when “The Google Drive cloud storage service—and associated cloud apps including Google Docs and Google Sheets—suffered multiple service issues … While users could still access Google Drive, affected users could not create new documents and were ‘seeing error messages, high latency, and/or other unexpected behavior,’ according to the company.”

And there you have it: The myth of the infallibility of clouds. Amazon, Microsoft and Google. Of course, it was left to Larry Ellison to capitalize on their circumstances by virtue of his claim that Oracle cloud didn’t fail. Surely, you cannot be serious, Larry?

For all the upside associated with capitalizing on cloud services, there is still the fundamental issue that resilience, and indeed reliability, at the levels we associate with NonStop is simply mythical in the cloud. Fail-safe continuity, and indeed fault tolerance for “central services”, is being openly discussed even as we know that with today’s modern languages, tools and services there is a lot that can be done to deliver a kind of pseudo fault tolerance. To think that all those years ago the original Tandem Computers understood the issues better than any other vendor.

And yet, when the cloud services vendors that provide the underlying infrastructure and, most important of all, the networking and integration services get it so hopelessly wrong, how can users deploying mission-critical applications know for certain that these services will always be there, 24 x 7? The reality is a lot more sobering: the vendors cannot provide anything close to ironclad guarantees. There is a reason why NonStop continues to thrive four decades after first being introduced: it’s fault tolerant in so many ways that it should be hard to ignore its contribution to cloud computing.

I am not entering into this conversation lightly. After all, we aren’t discussing how best to fix a broken toy, of which there will be many reports over the holidays. Two opportunities come to mind that I will be exploring in more detail in the coming months. They have to do with how we think about NonStop going forward and whether our own ideas about the role of NonStop may indeed be outdated.

There is the potential to have NonStop play a guardian role – no pun intended. Should there be a central NonStop essentially polling the hybrid multi-cloud environment common among enterprises today, so that exposure to any one cloud can be marginalized to the point where outages have no impact on the running of mission-critical applications? This is clearly an oversimplification, but models that feature NonStop in this way readily come to mind.
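To make the guardian idea a little more concrete, here is a minimal sketch in Python of a central poller that probes each cloud and steers work to the first one still considered healthy. Everything in it is an assumption made for illustration: the names `CloudTarget` and `Guardian`, the probe callables and the `max_failures` threshold are hypothetical, and nothing here reflects an actual NonStop or cloud provider API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class CloudTarget:
    """Hypothetical description of one cloud the guardian watches."""
    name: str
    probe: Callable[[], bool]  # returns True when the cloud answers correctly
    failures: int = 0          # consecutive failed probes


class Guardian:
    """Hypothetical central poller; targets are listed in order of preference."""

    def __init__(self, targets: List[CloudTarget], max_failures: int = 2):
        self.targets = targets
        self.max_failures = max_failures

    def poll_once(self) -> Dict[str, bool]:
        """Probe every cloud once and report which are still considered healthy."""
        status = {}
        for target in self.targets:
            try:
                ok = bool(target.probe())
            except Exception:
                ok = False  # a probe that raises counts as a failed probe
            target.failures = 0 if ok else target.failures + 1
            status[target.name] = target.failures < self.max_failures
        return status

    def pick_active(self) -> Optional[str]:
        """Return the first healthy cloud, or None when every cloud is down."""
        for target in self.targets:
            if target.failures < self.max_failures:
                return target.name
        return None
```

In a real deployment the probes would be network health checks and the loop would run on a timer; tolerating a single failed probe before switching is just one possible policy for avoiding flapping between clouds.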

There is also the potential for NonStop itself, virtualized as we now have the option to deploy it, to treat the world of hybrid clouds as no different from either converged NonStop processors or virtual machines. Consider one cloud as being CPU0 and another cloud as CPU1, etc., and you get the idea. This too is clearly an oversimplification, one that perhaps throws a spotlight on the capabilities of the cloud services providers’ interconnects with each other, but the idea is still simple in principle: a single-image NonStop system spanning multiple clouds, with the ability to perform its industry-leading take-over whenever a cloud misbehaves.
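The "one cloud is CPU0, another is CPU1" picture can be sketched as a toy process pair, where the primary copies its state to a backup on the other cloud after each update and the backup takes over from the last checkpoint when the primary's cloud fails. This is a simplified illustration in Python, not how NonStop is implemented; the `Cloud` and `ProcessPair` classes and the account-balance example are hypothetical.

```python
from typing import Dict


class Cloud:
    """Hypothetical stand-in for one cloud, treated here like a single CPU."""

    def __init__(self, name: str):
        self.name = name
        self.up = True


class ProcessPair:
    """Toy primary/backup pair: the primary checkpoints state to the backup."""

    def __init__(self, primary: Cloud, backup: Cloud):
        self.primary = primary
        self.backup = backup
        self.state: Dict[str, int] = {}       # working state held by the primary
        self.checkpoint: Dict[str, int] = {}  # last copy sent to the backup

    def apply(self, account: str, amount: int) -> int:
        """Apply an update, taking over first if the primary's cloud is down."""
        if not self.primary.up:
            self.take_over()
        self.state[account] = self.state.get(account, 0) + amount
        if self.backup.up:
            # Checkpoint to the backup before replying to the caller.
            self.checkpoint = dict(self.state)
        return self.state[account]

    def take_over(self) -> None:
        """Swap roles and resume from the last checkpointed state."""
        if not self.backup.up:
            raise RuntimeError("both clouds are down")
        self.primary, self.backup = self.backup, self.primary
        self.state = dict(self.checkpoint)
```

The sketch sidesteps exactly the hard part the paragraph hints at: in practice the checkpoint has to cross the interconnect between clouds, so its latency and reliability would decide how usable such a pairing is.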

Once we get past the idea that yes, like real CPUs and even virtual machines, clouds are just as unreliable, we can warm to the opportunity this represents for the future of NonStop. The mere fact that one publication is already producing an annual Top 10 Outages article should be evidence enough that enterprises need to consider more seriously what the cloud experience really entails.

For Margo and me, this is just the beginning of a theme that we will revisit in 2022, so stay tuned. But again, the items that broke for us in 2021 have all been addressed. Having said that, can you all say the same about your own hybrid IT and its supporting infrastructure? Even as we wish you the very best for the coming year, perhaps it is time to ponder that ultimate question about NonStop: When was availability ever not the issue of the day?
