Wednesday, September 12, 2007

Glitches, the norm?

Still in Sydney – but the headlines of the past few weeks have been bothering me somewhat. Have you seen them all? It looks like computer glitches are hitting us hard! I have included a picture of Sydney in case you aren't sure what it looks like!

In my August 27th blog posting – Is 30 mins too long? – I remarked that “I have little patience for any retailer or financial institution that skimps on their infrastructure”. But now I am seeing whole sectors of the community being affected. I have to start wondering – are we becoming desensitized to all of this?

What caught my attention were the headlines here last week – well, actually, a small article in one of the financial papers – “Glitch shuts out Westpac online customers”! It turned out that about 30 percent of the bank’s 400,000 internet banking customers could not access the online banking service. The paper I was reading went on to add that, according to Westpac, “it appears to be related to an internal systems error which we’re still trying to isolate” and then added a comment that the bank wasn’t sure whether this was related to a recent website revamp at the bank.

Now, in isolation, this would have just been something I read and had a brief chuckle about. But unfortunately, I had only moments earlier read on my BlackBerry about Barclays having a big problem in the UK that forced them to borrow 1.6 billion pounds. The headline read “Barclays blames technical glitch for 1.6 Bn Pound emergency loan”! The link between its electronic settlement system and the CREST settlement house broke down on Wednesday … for an hour!

Going back to my August 27th blog posting, you may recall that I mentioned, in passing, that Wells Fargo had suffered a serious outage on the West Coast that affected not only ATMs but a major portion of the branch banking business as well. I just went back and googled the Wells Fargo outage, and the first link I was directed to carried a heading that simply stated “Wells Fargo ATM, other glitches last longer than first reported”. The report also put the timing in perspective when it added “Wells’ computer glitch came at a poor time for nervous banking customers, considering the recent turmoil in the mortgage and stock markets.”

I began to look at this after I met with a former colleague of mine, Dieter Monch. Dieter was the Australian Managing Director of Nixdorf Computer when I worked for Nixdorf, back in the early ’80s. Dieter is an investor in, and now manages, the company that sells red-light and speeding cameras around Australia. He recently attended a state government presentation that asked potential vendors to look into providing a camera network that wouldn’t fail – borrowing words from NASA, failure was not an option. Dieter simply, and I have to believe politely, asked – how much are you prepared to pay?

Now, I am not all that sympathetic to the loss of a speeding camera – and the revenue opportunity missed. I don’t think many of us are – nor do we look positively on this form of revenue generation. But looking at it from a different perspective – if these were cameras tracking vital security operations and they went down at the time a key illegal or terrorist activity was being executed – then I can see a time in the future when even these types of networks just have to remain operational at all costs.

So, glitches and their implied outages, as well as the implications of lost revenue, are beginning to show up across all industries and markets. We say we are taking the issue pretty seriously, and we seem to understand the problem. But with the news coverage I have seen over the past couple of days – I am not sure how seriously we are taking the fall-out from today’s glitches. Surely, the loss of credibility in a marketplace of 400,000, as was the case in Australia, or of millions, I would have to believe, in the US – as well as the real cost in terms of interest on the short-term borrowing of 1.6 Bn Pounds – is pretty serious. Again, have we become desensitized to the issue of computer glitches? Has the term become an easy way out – a catch-all phrase to cover up any infrastructure stuff-up we may make?

Do we aggressively promote the value of applications and databases that survive single (and now, multiple) points of failure? Do we explain how all this works and the value we can provide? Or do we simply leave it to others – the comms guys? the web server guys? – to explain why an element of the infrastructure failed?

Do we still believe that some subset of these applications is so fundamentally important to us that we view them as "mission critical applications", and are we prioritizing and routing these "mission critical transactions" to a platform that is orders of magnitude more reliable than the other servers we may have deployed?

While we, as users of NonStop, have come a long way in removing many sources of outages – how strong a voice do we have in other areas of infrastructure? And are we still strongly advocating NonStop in support of mission critical applications, or have we elected to just sit back and watch as less reliable platforms siphon off these transactions? In other words, have glitches become the norm, and have we reached a time where it’s OK to simply attribute a service interruption to the dreaded glitch?

1 comment:

Al Hoss said...

Great post! Actually, I DO think the financial sector is starting to take it more seriously than ever, but I think it shows just how complex the problem space is.

It goes beyond hardening even the subsystems that you mentioned (processors, database, web/application servers, etc.). That's necessary but not sufficient (as RT Writer would say). The organization has to 'plan for failure' in establishing operational processes/procedures as well. That means taking the time to draft the procedures, but it also means doing 'just enough' testing of those to validate them.

Of course, the costs of that are hard to quantify. The value proposition of the enterprise class systems (NonStop being one of the leading contenders there) is that at least you can remove the hardware/OS/database from your list of worries!