Wednesday, April 30, 2008

Keepers of the inner sanctum!

I am back from Vienna, but it’s hard to put thoughts of this city too far away. I first went to Vienna when I was with Nixdorf Computers, back in the early ’80s, so I have been visiting Vienna for 25 years! The picture I have included here is just a quick snapshot looking down a lane where I watched construction workers restore an old four-story building; what caught my attention was how the scaffolding was being erected.

Stationed one above the other and separated by about eight feet, the scaffold riggers were manhandling the scaffold tubing up a human chain as they extended the scaffold up the side of the building. From the sidewalk, they lifted the tubes and simply passed them up to the next level of the scaffold (there were five or six riggers in the chain, one per level) before the tubes were clamped to the framework to support the next level. Balanced precariously, the riggers showed no fear of heights as they confidently pushed their framework higher. I am certain there were architects and building site managers present, but it was the riggers getting the real work done.

Whenever I am in a city, I take time to look around me and to appreciate the many unique styles that different architects have developed. Whether I am in Chicago, Sydney, or elsewhere around the world, I always notice the pride citizens take in their city skyline and the pleasure they get from showing me the sights. Walking through data centers these days generates much the same feeling. There’s not a data center operations manager I have encountered who isn’t absolutely thrilled to walk you through their domain.

The nerve centers are often located apart from the data centers themselves and, more often than not, the team is overseeing the operations of a couple of sites. At the bigger installations, these may be split functionally, with multiple nerve centers each supporting multiple data centers. With layers of console screens and, just as frequently today, large flat-screen displays showing everything from network and power grids to weather maps and even CNN news feeds, these facilities are abuzz with activity as information transits the enterprise.

But it is in the work of the operators, watching over the processing of information critical to the business amid what appears to be routine near-chaos, that I see the real face of IT. For those corporations with a heterogeneous systems environment (and it’s hard not to have one these days), the complexity is absolutely mind-boggling! Transactions arriving on many different networks are routed to the appropriate applications platform, data is pulled from many different operational data bases, and logs are updated, all at a furious pace. While it is not the same as passing metal tubing hand over hand from one rigger to another, you sense a similar protocol all the same!

Architects responsible for the design, and technicians sorting out infrastructure, move from one opportunity to the next. They enjoy the moment but rarely stay around for the complete lifecycle of the project. Operations staff not only have to live day in and day out with the consequences of the project, but must also continually adjust to the arrival of other projects that often come with conflicting or incomplete operating instructions.

I recently had the opportunity to talk with a senior operations executive at a financial services company, and came away highly impressed with all that he had on his plate. I was particularly interested in real-world feedback on the true cost of operating today’s HP NonStop server. I had decided to dig into this because I continue to be told how expensive the NonStop platform remains, and I have become puzzled over exactly where these costs originate. For operations managers there is no single priority but rather a whole slew of them; even so, cost savings is firmly at the top of his list. Automating as much as possible, in order to meet ever-more-aggressive availability metrics within the Service Level Agreements (SLAs), continues to be a priority for management.

Of all the systems in any data center, the IBM mainframe still requires the most support. Steps have been taken over the years, through technologies such as Workload Management (WLM) and advanced scripting tools, but turning a reliable platform like the System z into a highly available platform remains elusive, and even enabling Parallel Sysplex (the mainframe’s cluster solution) is no panacea. While failover support can be added via scripting for selected applications, it is not a takeover technology, and it requires far more awareness of the environment to be programmed directly into the application. The fallout is often far more pressure on the operations teams as they try to avoid making mistakes in times of crisis!

And it’s not just large corporations that wrestle with SLAs and availability; software companies face many of the same issues. When I talked with Marc Paley, Director of Global IT Operations at GoldenGate, he made it clear that uptime has become paramount. GoldenGate runs one of everything, from large System z and HP NonStop servers to Sun and Windows servers, as well as every flavor of Linux, and as Marc put it, “availability becomes even more demanding than in commercial shops! We literally have one of every data base product, and often a number of different versions needing support. Achieving uptime involves deploying redundant servers, and the use of load balancing, in order to achieve the high availability our company requires.”

By comparison, the elegance of NonStop’s internal design, together with its newfound fondness for openness, makes it so much easier to operate. A long time ago, when visiting a retailer in Texas, I was curious how many folks were assigned to operate the NonStop server. The response I was given was that there was actually no dedicated operations staff. Instead, I was told, “we are listening to the console printer and if it gets really loud and we can see it spewing paper, then someone will go over and take a look. Otherwise, it just runs!” That was a few years back and times have changed, but among many NonStop users, retelling this story draws very little contradiction.

The other items on the list are only slightly less important than reducing costs and improving uptime: recruiting and retaining quality staff; selecting and deploying tools and utilities, including Business Process Management (BPM) products and scripting languages like Perl; and ensuring complete security of the environment. As in the commercial world, Marc noted, operators at GoldenGate develop scripts, and scripting has now become a must-have skill for every operator. Marc went on to add, “as for recruiting operators, we really do need them to be familiar with the operating systems and data bases we support and we don’t hire them unless there’s competency already.” But for me, what came across most from these conversations were the challenges of making sure senior management is fully on board with the value that good data center operations provide.

There’s not a data center manager, however, who isn’t looking over their shoulder to see if an outsourcing company is coming through the doors! The recent decision by ACI to outsource its internal IT operations is just the latest example of a company going down this path. There’s no question that, as data centers increasingly move from manual to automated processes (while maintaining the same SLAs for the same applications), the incentive is to reduce the number of people involved. And getting someone else to do it at a much lower price certainly becomes seductive, at least on paper, and at least until the second round of negotiation begins.

I have covered other management functions in previous blog postings (the CIO, the CTO, Software Architects, and the gifted artisans spread throughout development) and have commented on the value that comes from well-functioning teams whose strong communication skills keep them well connected to the business. But in the end, all of these areas are under tremendous pressure to reduce costs while improving uptime, and I see a marketplace growing more aware of the true advantages of the HP NonStop platform.

With so much capability built into the NonStop server, and so few resources needed to manage it, I remain puzzled by claims that the platform is expensive. Certainly, Jim Johnson, Chairman of The Standish Group, has always maintained that operating NonStop requires fewer operations resources than pretty much any other platform. In our most recent email exchange, Jim remarked, “that was the major cost advantage that NS had for years, and right now, (compared with the) IBM mainframe, it remains a 3 to 1 advantage!”

Will we ever see completely automated data centers? Will data centers evolve to the point where human oversight is no longer needed and operators are no longer required? Probably not in my lifetime! Users will continue to see more tools on their desktops allowing them to assemble and tailor the solutions they need, but out there somewhere, there will be smart people looking after it all. For the NonStop user, there may not need to be as many as on other platforms, but knowing there are other sets of eyes watching over the information flow, stationed “securely” deep within the inner sanctum, brings a whole lot of comfort to many of us. And they don’t have to be afraid of heights!

Mark Whitfield said...

Fascinating reading. It will be an interesting space to watch in the next 12-24 months. No sooner am I back from Vienna than I feel like EBUG 2008 was a cliff-hanger to an even more interesting follow-up episode in 2009. The best cliff-hanger since Han Solo was frozen in carbonite in 'The Empire Strikes Back'. What we need now is a 'Return of the HP NonStop'...