Thursday, April 17, 2008

Relying on Routines!

After five days of driving it’s actually quite good to be back at my desk. Over the weekend I drove close to 1,000 miles up to the Sonoma wine country and back, only to turn around and drive across to Scottsdale, Arizona for the DUST user group meeting. While in Sonoma I spent some time at the Infineon race track at Sears Point and captured a shot of a SMART car, the picture I have included here, mixing it up with Miatas, Mustangs, and Corvettes out on the race track.

I have always enjoyed driving long distances. And today, it is far more relaxing than flying the short hops – with security lines and cancelled flights. A few years back I was in Singapore standing alongside a carousel waiting for my bags and, as the system quite unexpectedly stopped, a passenger next to me just sighed and said “don’t you just love the romance of travel!”

Well, the romance of travel has long gone. Today it’s more about fatigue, frustration, minimal connect times, and bad food for $5 (even in first class, I’m told by the Delta faithful)! These days I find any opportunity to hit the open road particularly enjoyable. But after running up the miles, I drove to the local dealer for an oil change – a routine I religiously maintain – and had the service manager check out the car. “You have to take a look at this,” he said to me as the car was returned. Pointing at the front tires, he added “this is pretty dangerous; the inside of both tires are badly worn and the cord is showing – you better replace these tires if you want to avoid a disaster.” So on went a set of Pirelli pZero Rosso tires, the front wheels were re-aligned, and what a difference!

As well as attending RUG meetings, I continue to stay in touch with the SIGs. Last year I was the Business Continuity (BC) SIG leader but I am pleased to say Mike Heath has now stepped in to lead this group. I have known Mike for years - indeed, I can recall sitting in a bar in La Defense to the west of Paris with Mike, after one of the Tandem road-shows of the early ‘90s, listening to the CD “Waking Up the Neighbours” by Bryan Adams, released the year before as I recall, while we both enjoyed our cognacs.

This past week we held a virtual BC SIG meeting on the subject of “Active-Active”. On this call were folks from GoldenGate, Gravic, and Network Technologies (NTI). When it came to NTI’s 15-minute presentation, I was very much amused when I heard Jim McFadden explain he was going to talk about “disaster-recovery avoidance” before going on to add “this is the first time I have laid claim to being in the ‘guaranteed disaster business’ - even when (our customers) purchase and implement (our software), they will fail when they don't implement the solution across the operation.”

Many years ago, I was sitting in a restaurant on Stevens Creek, Cupertino with Roger Matthews. Roger and I were winding down from a week of meetings and we saw the chalk board suggesting we try the “giant shrimp”. Of course, this kicked off a lively discussion about oxymorons and how many of them had made it into everyday usage. We talked about “military intelligence”, “common sense”, “Dallas culture” as well as my all-time favorite “user friendly”, particularly when used in the same sentence as “customer service”! While I still have problems with “manufactured customs”, seeing a “SMART race-car” on the weekend stopped me right in my tracks. But it looks to me like we now need to add “disaster-recovery avoidance” to the list!

Disasters are becoming commonplace – almost routine. Whether it’s a natural disaster, the local contractor tearing through a conduit with their backhoe, or just as often (as it now seems) a terrorist attack, disasters will happen and we need to be able to recover. At the DUST meeting this week, one participant pointed out how a major financial institution was moving their second site out of California and setting it up close to Phoenix. Having decided that they were a lot better off with two computer centers, they had built them both in California, either side of major fault lines, and according to the experts, this didn’t exactly look like the optimal deployment.

One of the tracks on the Bryan Adams album that Mike and I listened to in Paris was “Vanishing”, and it opens with the lines “People all over build on solid ground; they build it up and then they tear it down. Take it or leave it; who cares how much it costs. They'll never know how much is gone until it's lost!” Something about “solid-ground (in) California” strikes me as another obvious oxymoron we should have included in the list!

However, meeting the requirement for business continuity by simply building two computer centers is just a starting point. As Jim highlighted, unless it’s implemented at all levels – applications, networking, data bases, etc. - it will not function as expected should a situation arise where the second computer center is needed. The whole philosophy of accommodating failures, and making any such outage as transparent to the user as possible, takes a lot of work and attention to detail.

Active – Active is not an oxymoron, but it’s not a tautology either! Distributing computer centers across multiple sites just makes good business sense. But the former practice, where one center was designated as the emergency back-up site and left idle for most of the time, is not a cost-effective option. And simply having it powered-up receiving a steady stream of data base updates, in an “Active – Passive” fashion, may not always survive scrutiny as there’s simply way too much compute power being wasted. Today, it makes more sense to have all the computing power on hand available to the business! Active – Active configurations, where both centers are equally engaged in supporting mission critical transactions, each with ample provision for taking up the slack should its partner center fail for any reason, are what businesses demand.
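For those who like to see an idea in a few lines of code, here is a minimal sketch of the Active – Active principle, and nothing more than that. The `Center` class, the `route` function, and the "west" and "east" names are all my own inventions for illustration; they do not come from any vendor's product, and real configurations also have to keep the data bases in step, which this sketch ignores entirely.

```python
from dataclasses import dataclass


@dataclass
class Center:
    """A hypothetical computer center; the fields are illustrative only."""
    name: str
    healthy: bool = True
    handled: int = 0


def route(transaction_count, centers):
    """Spread transactions round-robin across whichever centers are healthy."""
    for i in range(transaction_count):
        active = [c for c in centers if c.healthy]
        if not active:
            raise RuntimeError("no healthy center available")
        active[i % len(active)].handled += 1


west = Center("west")
east = Center("east")

route(100, [west, east])   # both centers share the load
east.healthy = False       # simulate losing one center
route(100, [west, east])   # the survivor takes up the slack

print(west.handled, east.handled)
```

The point of the sketch is simply that no center sits idle: both do useful work every day, and the loss of one changes where the work goes rather than whether it gets done.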

At a recent gathering of HP sales folks in Prague, Scott Healy, now with GoldenGate but most recently with Sabre, was asked how Sabre would respond if their main computer system took an outage. “It depends,” started Scott, as he explained all the steps that would have to be taken in order to successfully switch from one active system to another, adding “the key point is that we need to be as confident in executing a takeover as a switchover.”

Before any system as complex as that deployed at Sabre could support a take-over by a second system (programmed to be looking for failures), management had to walk through many scenarios and have scripts developed to automate as much as possible. The scripts then had to be tested routinely, even if this meant throwing a switch and creating an outage, to ensure all parties knew the ropes and followed the procedures, as these would be the same as those followed during any real outage. “The only way you do (testing, and real outages) is to have the procedures the same for both cases.”
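Scott's point about keeping the procedures the same for both cases can be sketched in code, with the loud caveat that everything here is hypothetical: the step names, the `state` fields, and the `execute` function are my own illustration and not a description of Sabre's actual scripts. The idea is only that a drill and a real outage call exactly the same procedure, and that the procedure refuses to proceed if the second system is not ready.

```python
def make_steps(state):
    """Build the ordered takeover steps; all names are illustrative only."""

    def check_code_levels():
        if state["primary_code"] != state["backup_code"]:
            raise RuntimeError("backup is not running the current code")

    def check_database_sync():
        if state["replication_lag_seconds"] > 0:
            raise RuntimeError("backup data base is behind the primary")

    def redirect_traffic():
        state["active"] = "backup"

    return [check_code_levels, check_database_sync, redirect_traffic]


def execute(steps):
    """Run every step in order; a drill and a real outage both come here."""
    completed = []
    for step in steps:
        step()  # any failed check raises and stops the takeover
        completed.append(step.__name__)
    return completed


def new_state():
    return {
        "primary_code": "release-42",   # hypothetical release label
        "backup_code": "release-42",
        "replication_lag_seconds": 0,
        "active": "primary",
    }


drill_state = new_state()
drill_log = execute(make_steps(drill_state))      # the routine test

outage_state = new_state()
outage_log = execute(make_steps(outage_state))    # the real thing

print(drill_log == outage_log, outage_state["active"])
```

Because there is only one path through `execute`, every drill exercises the very steps that would run in anger, which is where the confidence in a takeover has to come from.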

However, even with this knowledge, Scott said this was not always the case, as sometimes weeks, and perhaps months, went by between tests. He suggested that, if pushed, he would actually take an outage of 10 – 15 minutes rather than cut over to a system that may not have all the right code in place, or the latest table implementations running; after all, senior management would take much longer than 10 – 15 minutes to make their way down to his office! He then added “the script should always be (available) and updated. If I could have executed a prepared script in test, and if successful, have confidence executing it in production, then yes, I would have done that.”

Active – Active implementation requires addressing many areas – from the libraries where the executables reside, to the network, and to the data bases and files. Keeping the data bases fully in synch is a big part of the equation, particularly when it comes to deploying mission critical applications, but so is the development of scripts and the training on operational procedures. And letting network traffic switch between computer centers has to become routine – with no surprises in store for anyone. In the end, as Jim so rightly pointed out, the aim is for complete disaster-recovery avoidance where tapping the resources of other systems in times of necessity is a built-in and automated procedure, and where the users are oblivious to any transition.

As Bryan Adams went on to write, “think I hear thunder, ain't no sign of rain; danger signs flashin' in my brain! Ridin' on empty - lights are turnin' red …” So what are our practices when it comes to business continuity? What will happen when lights start flashing red and when we see the danger signs? I am sure none of us likes to hear someone else telling us to “change the tires” if we really want to avoid a disaster.

