As I look back at the past year, perhaps the best way
to describe it is as a year when many things were broken. I am not
talking about the occasional drinking glass or dinner plate but objects that
mattered to us. When things are broken their performance is hindered to the
point where, in some cases, they no longer serve any useful purpose. Replacement
appears to be the order of the day.
Mind you, I am not talking about the weather here in
Colorado, as I have lost track of how many days have passed without a meaningful
snowfall. This morning may have been the exception as a little rain together
with what might pass as a little sleet did fall but not for long enough to
stick to anything. Enjoying December days that have climbed into the 70sF
(20sC) would surely pass as unusual even if we didn’t talk about weather
patterns that look to be broken.
However, before developing this storyline further, please
be happy for us as remedies for almost everything have been found. But what was
broken? This time last year Margo broke her leg badly, and the remedy
included the insertion of metal rods and nails. In summer our Range Rover was
rear-ended on the freeway and the insurance company wrote it off. That new
sectional we had waited for did finally show up but its central portion
had been scratched in transport.
When it comes to IT, to digital
transformation and to the pivot to everything-as-a-service, it’s hard to make
light of the fact that the role clouds are playing isn’t proving as rock-solid
as promoters would have you believe. To hear them tell it, an outage here or there
is nothing for us to worry about, but when a major cloud services provider like
Amazon Web Services (AWS) breaks then yes, we should all be concerned. In fact
there were enough outages for CRN to publish The 10 Biggest Cloud Outages Of 2021 (So
Far). As for the tag line, it was rather long but managed to sum up the
predicament of many affected at the time:
“‘Outages can mean the end for companies, depending on their
choices in design and deployment, or they can be complete non-events,’ Miles
Ward, chief technology officer at Los Angeles-based Google partner SADA
Systems, tells CRN. ‘Cloud has changed the nature of outages.’”
But then, CRN highlights something that should warm the hearts of many
in the NonStop community, particularly at this time of year. Consider it the
early arrival of your Christmas gift:
“‘Every cloud engineering team has seen how
impossible it is for customers to engineer around these kinds of outages and is
working hard to distribute, subdivide, and make fault-tolerant these central
services,’ Ward said.”
This article by CRN was published back in late July, so it missed
reporting on the big AWS outage. Even so, it’s worth noting the top three
worst outages on its list:
In third place – Fastly Outage in June. “Fastly impacted
bulletin board website Reddit, video streaming service Twitch and a number of
news sites including CNN and The New York Times.” Among the comments reported
at the time by CRN was this particular gem:
“Michael
Goldstein, CEO of LAN Infotech, a Fort Lauderdale, Fla.-based solution
provider, told CRN at the time that the global outage shows how critical it is
for customers to properly architect their cloud and on-premises network.
“‘Cloud isn’t
any different than on-premises—with both cloud and on-premises you need to make
sure you have the right architecture,’ Goldstein said. ‘We make sure that when
we put mission-critical applications in [Microsoft] Azure for our customers we
have multiple data center regions to prevent an outage like this. You need a
fail-safe and a continuity plan to prevent outages.’”
In second place, under the generalized heading of More Microsoft Issues,
was an outage that this time centered on Microsoft Teams. Apparently, “Teams’ calling
service sent calls straight into some users’ voicemails.” Now depending on your
level of tolerance of virtual meetings this may have been a blessing in
disguise, but in reality it all came back to issues with the
infrastructure, according to updates Microsoft provided via the Microsoft
365 Status Twitter account:
“…Microsoft ‘isolated a recent change
that has caused portions of infrastructure to send some Microsoft Teams calls
straight to voicemail.’”
But then, Rosalyn Arntzen, president and CEO of Microsoft partner Amaxra,
told CRN that over the past few years,
Microsoft had gotten “dramatically better” at updating partners “as soon as
they are aware of an issue and listing when they expect the issue to be
solved—or at least provide a status.”
Coming in as the blue-ribbon winning outage of the year (so
far) was the Akamai Outage, June 17. Remember this outage? Turns out it
happened nine days after the Fastly outage, when “a system issue
with Cambridge, Mass.-based Akamai Technologies caused internet
outages for global airlines, banks, and stock exchanges. The company saw
service disruptions for its hosting platform, which helps defend against
Distributed Denial-of-Service (DDoS) attacks.”
The way CRN reported this outage was to highlight that:
“The disruption
affected several large companies around the globe, including Southwest
Airlines, United Airlines, Commonwealth Bank of Australia, Westpac Bank, and
Australia and New Zealand Banking Group, as well as the Hong Kong Stock
Exchange’s website. Services for many of the companies impacted were restored
within the day.
“Downdetector.com
showed spikes in complaints about service outages for websites of companies
inside the U.S. as well as in a number of other countries including Australia,
Germany and India.”
And remember, among the also-runs was the outage at Verizon that reports blamed on a
fiber cut in Brooklyn, but that was later confirmed as being “a software issue triggered during routine network
management activities.” And then there was the issue at Google when “The Google
Drive cloud storage service—and associated cloud apps including Google Docs and
Google Sheets—suffered multiple service issues … While users could still access
Google Drive, affected users could not create new documents and were ‘seeing
error messages, high latency, and/or other unexpected behavior,’ according to
the company.”
And there you have it: The myth of the infallibility of
clouds. Amazon, Microsoft and Google. Of course, it was left to Larry Ellison
to capitalize on their circumstances by virtue of his claim that Oracle cloud
didn’t fail. Surely, you cannot be serious, Larry?
For all the upside associated with capitalizing on cloud services, there is still the fundamental issue that resilience, and indeed reliability, at the levels we associate with NonStop are simply mythical. Fail-safe continuity and indeed fault tolerance for “central services” are being openly discussed, even as we know that with today’s modern languages, tools and services there is a lot that can be done to deliver a kind of pseudo fault tolerance. To think that all those years ago, the original Tandem Computers understood the issues better than any other vendor.
And yet, when those cloud services vendors, providing the underlying
infrastructure and, most important of all, the networking and integration
services, get it so hopelessly wrong, how can users deploying mission-critical
applications know for certain that these services will always be there, 24 x 7?
The reality is a lot more sobering; the vendors cannot provide anything close to
ironclad guarantees. There is a reason why NonStop continues to thrive four
decades after being first introduced; it’s fault tolerant in so many ways that
it should be hard to ignore its contribution to cloud computing.
I am not entering into this conversation lightly. However, we aren’t
discussing how best to fix a broken toy, of which there will be many reports
over the holidays. Two opportunities come to mind that I will be exploring in
more detail in the coming months. They have to do with how we think about
NonStop going forward and whether our own ideas about the role of NonStop may
indeed be outdated.
There is the potential to have NonStop play a guardian role – no pun
intended. Should there be a central NonStop essentially polling the hybrid
multi-cloud environment common today among enterprises, so that exposure to any
one cloud is minimized to the point where outages have no impact on the running
of mission-critical applications? This is clearly an oversimplification but there
are models that feature NonStop in this way that readily come to mind.
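To make the guardian idea a little more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the function name `choose_cloud`, the cloud names and the simulated `status` table are all hypothetical, and a real guardian would replace the probe with actual health checks against each provider.

```python
from typing import Callable, Iterable, Optional


def choose_cloud(clouds: Iterable[str],
                 is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Poll each cloud in priority order and return the first healthy one.

    Returns None when every cloud fails its probe.
    """
    for cloud in clouds:
        try:
            if is_healthy(cloud):
                return cloud
        except Exception:
            # A probe that raises is treated the same as an unhealthy cloud.
            continue
    return None


# Simulated probe results (hypothetical): cloud-a is suffering an outage.
status = {"cloud-a": False, "cloud-b": True, "cloud-c": True}

active = choose_cloud(["cloud-a", "cloud-b", "cloud-c"], lambda c: status[c])
print(active)
```

The point of the sketch is simply that the guardian, not the application, decides where work is routed, so an outage in any one cloud becomes a routing decision rather than an application failure.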
There is also the potential for NonStop itself, virtualized as we now
have the option to deploy it, to treat the world of hybrid clouds as no
different from either converged NonStop processors or virtual machines.
Consider one cloud as being CPU0 and another cloud as CPU1, etc. and you get
the idea. This too is clearly an oversimplification, and one that perhaps throws a
spotlight on the ability of the cloud services providers to interconnect with
each other, but the idea is still simple in principle: a single-image NonStop
system spanning multiple clouds, with the ability to perform its
industry-leading take-over whenever a cloud misbehaves.
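The clouds-as-CPUs idea can also be sketched as a toy model. This is not how NonStop is implemented; it is a simplified primary/backup illustration under my own assumptions, with a hypothetical `CloudPair` class in which the primary checkpoints its state to a backup and the backup takes over from the last checkpoint when the primary's cloud fails.

```python
class CloudPair:
    """Toy primary/backup pair where each 'CPU' is a cloud (hypothetical)."""

    def __init__(self, cpus):
        if len(cpus) < 2:
            raise ValueError("a pair needs at least two CPUs")
        self.cpus = list(cpus)
        self.primary = self.cpus[0]
        self.state = 0        # working state held by the primary
        self.checkpoint = 0   # last state copied to the backup

    def apply(self, amount, checkpointed=True):
        """Primary applies an update and, by default, checkpoints it."""
        self.state += amount
        if checkpointed:
            self.checkpoint = self.state

    def fail(self, cpu):
        """Mark a CPU (cloud) as down; take over if it hosted the primary."""
        self.cpus.remove(cpu)
        if cpu == self.primary:
            if not self.cpus:
                raise RuntimeError("no surviving CPU to take over")
            self.primary = self.cpus[0]
            # The backup resumes from the last checkpoint it received.
            self.state = self.checkpoint


pair = CloudPair(["cloud-0", "cloud-1"])
pair.apply(100)                      # checkpointed update
pair.apply(50, checkpointed=False)   # in-flight work, not yet checkpointed
pair.fail("cloud-0")                 # the cloud acting as CPU0 misbehaves
print(pair.primary, pair.state)
```

In this toy model the take-over loses only the work that had not yet been checkpointed, which is the trade-off any such scheme has to manage across the interconnect between clouds.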
Once we accept that yes, clouds are just as unreliable as real CPUs and even
virtual machines, then those shaping the future of NonStop will warm to
the opportunity this represents. The mere fact that one publication is already
producing an annual Top 10 Outages article should be evidence enough that
enterprises need to consider more seriously what the cloud experience really
entails.
For Margo and me, this is just the beginning of
a theme that we will revisit in 2022, so stay tuned. But again, the items that
broke for us in 2021 have all been addressed. Having said that, can you all
say the same about your own hybrid IT and its supporting infrastructure? Even
as we wish you the very best for the coming year, perhaps it is time to ponder
that ultimate question about NonStop: When was availability ever not the
issue of the day?