Many of the road warriors we know all too well expect that when they head to the airport to board their flight and attend to business, they will return home safely. It happens all the time. For anyone who has walked through an airport of late, there is no doubting that the desire to get back out into the world has returned. Gates are packed, with check-in and security lines back to their former slow-crawl patterns.
What is not expected is to see planes parked at the gates, along taxiways and worse. Who knew that a simple computer glitch could take down the air traffic control network across the United States? When news of a glitch spreads, it almost never turns out to be a minor irritation. Of late we have read about such glitches affecting banks, trading houses and more – so much so that for the community at large, it has become almost a case of well, here we go again. But bringing air travel to a halt, that’s a glitch on a completely different plane! Everyone, it seems, is left hurting.
So what did happen? Was it a breach in security? Ransomware? Someone flipping the wrong switch? Apparently it had something to do with the database. As CNN reported in an update provided on January 12, 2023, a corrupt file led to the FAA ground stoppage, and the same corrupt file was also found in the backup system. “The computer system that failed was the central database for all NOTAMs (Notice to Air Missions) nationwide. Those notices advise pilots of issues along their route and at their destination. It has a backup, which officials switched to when problems with the main system emerged, according to the source.”
Just this Friday, in a further update reported by BBC News in the article FAA outage: US airline regulators blame contractor for travel chaos, “The FAA said that their contract employee, who was not identified, deleted the files while working to synchronize the primary and backup Notam databases.” Imagine that: a problem with synchronizing apparently replicated databases. Once again, it can be said that it was human error, but then again, who was looking over the shoulder of this contract employee?
There is a reason why we don’t let individuals engage directly, with no supervision, in the maintenance of mission-critical systems. There was a time when we talked about putting a big dog between operators and the console or, in the case of NonStop, two dogs. As I was reminded by NonStop technical consultant and longtime supporter of the NonStop community, Bill Honaker, it is important to remember that steps can be taken to avoid human error.
At one customer that was upgrading all of its NonStop servers while its application stayed available, there was a rule that “any time an engineer was to shut down a cabinet, two people had to lay hands on it and separately identify it as offline and ready to go down. And all my config changes were reviewed by the other project manager.”
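That two-person rule is easy enough to capture in tooling. What follows is a minimal, purely illustrative sketch in Python (the function and names are hypothetical, not anyone’s actual operations procedure) of refusing a destructive maintenance step until two different people have signed off:

    # Minimal sketch of a "two-person rule" guard before a destructive action.
    # All names here (shutdown_cabinet, the approver lists) are hypothetical.

    def shutdown_cabinet(cabinet_id, confirmations):
        """Refuse to power down a cabinet unless two different people have
        independently confirmed it is offline and ready to go down."""
        distinct_approvers = set(confirmations)
        if len(distinct_approvers) < 2:
            raise PermissionError(
                f"Cabinet {cabinet_id}: sign-off from two different people "
                f"required, got {sorted(distinct_approvers)}")
        print(f"Cabinet {cabinet_id} confirmed offline by "
              f"{sorted(distinct_approvers)}; proceeding.")

    shutdown_cabinet("CAB-07", ["alice", "bob"])       # proceeds
    # shutdown_cabinet("CAB-07", ["alice", "alice"])   # raises PermissionError

The same pattern applies to configuration changes: a second reviewer is cheap insurance against exactly the kind of single-operator mistake the FAA incident illustrates.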
For the NonStop community, to reference the well-known refrain from NASA during the Apollo program, failure is not an option. NonStop systems have been running for decades without outages, all the while masking failures, disaster-related or otherwise, such that the general public was never impacted. That is the true value of depending upon a fault-tolerant platform that already comes with a built-in Plan B.
[From the archives (circa 1980s); courtesy Ceferino Perez, Madrid, Spain; as appeared this week on LinkedIn]
However, even with the most robust application deployment on NonStop, there is almost uniform recognition that beyond fault tolerance there is disaster tolerance. Natural disasters, along with the efforts of malicious individuals or institutions to impact operations, have become almost routine occurrences. The need for a second and perhaps even a third (or more) geographically separated site, in case the primary data center burns down, has taken on even greater importance. With this in mind, there is the need for a broader Plan B than one focused solely on a single system.
For the NonStop user the options are out there and available from multiple sources. “Our business for nigh on four decades has been all about ensuring disasters do not take down the operations of any of our NonStop customers,” said Tim Dunne, NTI’s Global Director of Worldwide Sales. “When you look at the complexity of today’s modern NonStop deployments, it is of paramount importance that recovery can be achieved in real time to where the impact on business goes unnoticed.”
NTI is among a group of vendors who have specialized in ensuring data centers relying on NonStop aren’t subject to outages for any reason. It would be so easy for all involved with NonStop to cast stones at those IT organizations not as robust as what is delivered with NonStop, but are there vulnerabilities in the way NonStop systems engage in dialogues with one another? Are we really sure that when that switch is flipped, it will all work?
“We know that many of our customers do indeed have procedures to ensure their backups are up to the task,” said Dunne. “We have helped many NonStop customers develop plans that take advantage of NTI’s DRNet®/Unified product suite and, yes, we continue to provide this service as part of our license agreement, as much as necessary, without ever charging a professional services fee. NonStop customer support of our Active/Active option has been experiencing significant growth over the past decade.”
From firsthand experience, Dunne also noted the value that comes with a direct line of sight to those implementing a broader approach to Plan B. “Having embraced Change Data Capture (CDC) methodologies earlier than anyone else, NTI is experienced in ensuring whatever Plan B is being considered can be delivered. Furthermore, as the need for simplified approaches to software maintenance and upgrades remains a priority, there is an emerging trend in support of not only robust solutions but of those that are economically viable.”
My discussions on the subject of primary and backup systems and databases were not limited to just NTI. When the experience at the FAA was raised with Bill Honaker, his response was to the point. “Many customers spend the extra effort to run their sites as Active/Active. To do a ‘transparent’ switchover, that’s the best way,” said Bill. “IT organizations may say that because the NonStop platform is ‘old’ (from the 90s) it’s risky, but a lot of us have experience with handling this kind of environment dating back to those times. It is all about designing for fault tolerance everywhere (Power, Network, Switches, Routers, Servers, Database). Companies (and agencies) don’t know it’s possible because of the ‘good enough’ mindset that pervades the industry, in my opinion, so they don’t even try.”
As for the view of others who provide services supporting disaster recovery, when I reached out to Collin Yates of TCM Solutions, he made an even stronger case for involving experienced personnel in creating working Plan B configurations. “Being able to switch from one site (one DC) to another seamlessly to the users, especially in a 24x7x365 environment, really comes down to two factors. The first is that the application in question is running in an Active/Active (A:A) configuration. That is to say, transactions are being processed on BOTH nodes simultaneously,” said Collin.
“As for the second factor, it’s all about customers having a very good and reliable Data Replication (DR) product installed that supports the A:A application. On the NonStop that could be DRNet, Shadowbase or Oracle’s GoldenGate as the main protagonists. In a nutshell, running the right platform (NonStop, of course) with the right Data Replication engine (DRNet or Shadowbase), things like the outage at the FAA should never occur,” observed Collin before adding, “We are experts in Data Replication and configuring such in either A:A or Active/Passive (A:P) on your NonStop platform.” And yes, part of those services is to ensure there are steps in place to test that the processes all work to plan.
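For readers less familiar with the terminology, the active/active concept Collin describes can be sketched in a few lines. The sketch below is a conceptual illustration only, written in Python, and is not a representation of how DRNet, Shadowbase or GoldenGate actually work internally: each node processes its own transactions and forwards every change to its peer, so either site holds a complete copy and can carry the full load on its own.

    # Conceptual sketch of active/active (A:A): both nodes accept transactions
    # and every change is replicated to the peer. Illustration only; not the
    # internals of any commercial replication engine.

    class Node:
        def __init__(self, name):
            self.name = name
            self.data = {}      # this node's copy of the database
            self.peer = None    # the other active node

        def apply_local(self, key, value):
            """A transaction processed on this node, then replicated out."""
            self.data[key] = value
            if self.peer is not None:
                self.peer.apply_replicated(key, value)

        def apply_replicated(self, key, value):
            """A change arriving on the replication stream from the peer."""
            self.data[key] = value

    site_a, site_b = Node("A"), Node("B")
    site_a.peer, site_b.peer = site_b, site_a

    site_a.apply_local("acct-1", 100)    # transaction lands on site A
    site_b.apply_local("acct-2", 250)    # transaction lands on site B
    assert site_a.data == site_b.data    # both sites now hold the same data

What the real replication engines add, of course, is everything this toy version leaves out: queuing and retry when the peer is unreachable, and resolution of the collision scenarios that arise when both sites update the same record at the same time.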
Perhaps that is the most important aspect of all for the NonStop community. Access to a generous mix of service providers, consultants and software vendors ensures that no Plan B, whatever the complexity, need give rise to anxieties over how best to ensure your most mission-critical application of all can continue serving the business through any disaster scenario imaginable. It isn’t a simple deployment to get right – there are all sorts of potential collision scenarios to work through – but the experience gained from real-world deployments has made even the most unlikely of occurrences manageable.
And when it comes to ensuring primary and backup are indeed synchronized, the NonStop community is uniquely blessed by the presence of industry-leading, usable software products that do exactly that. Think TANDsoft with its high-performance FS Compare and Repair utility. There is no reason to leave to chance, or to inexperienced technical staff, any opportunity to perform operations that could lead to events such as that experienced by the FAA.
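The underlying idea of compare-and-repair tooling is simple, even if doing it at scale against a live, replicating database is not. A rough sketch of the concept, again hypothetical and not a description of TANDsoft’s FS Compare itself, is to fingerprint each record on both copies and report any key that differs or is missing on one side:

    # Sketch of the compare step: hash each record on primary and backup and
    # report keys whose contents differ or that exist on only one side.
    # Illustration of the concept only; not any vendor's actual utility.

    import hashlib

    def fingerprint(record):
        return hashlib.sha256(record).hexdigest()

    def compare(primary, backup):
        """Return the keys whose contents differ between the two copies."""
        mismatches = {}
        for key in primary.keys() | backup.keys():
            p = fingerprint(primary[key]) if key in primary else None
            b = fingerprint(backup[key]) if key in backup else None
            if p != b:
                mismatches[key] = (p, b)
        return mismatches

    primary = {"NOTAM-1": b"runway 09L closed", "NOTAM-2": b"taxiway B lights out"}
    backup  = {"NOTAM-1": b"runway 09L closed"}   # NOTAM-2 never made it across
    print(compare(primary, backup))               # flags NOTAM-2 as missing from backup

The repair step then re-sends whatever the compare step flags, from whichever copy is treated as the source of truth, which is exactly the kind of verified reconciliation you want in place of an unsupervised manual fix.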
The FAA gained considerable notoriety over the failure, and the manner in which it went about explaining what happened only made the situation worse. The press was particularly unkind about the way the agency went about solving the problem. For the NonStop community, even as the decades continue to slide past, with a fault-tolerant platform on hand as a starting point and access to software and services that minimize the potential for such outages to occur, Plan B is never an afterthought but an integral part of what makes NonStop the robust solution it is today.