Friday, January 13, 2012

The forgotten attribute …

In prior posts I have covered the key attribute of “availability” and within numerous other posts I have written about “scalability” but, when it comes to the key attributes that contribute to success of the HP NonStop Server platform, it’s time to address “data integrity”!

With winter firmly entrenched I have been returning to our garage routinely, and what a sad sight it is – battery trickle-feed chargers scattered around the floor keeping cold batteries alive. Having chalked up a lot of miles in 2010 it seems quite strange to see vehicles left this way – brooding almost, seemingly ignored and forgotten, as conditions ill suit rear-wheel roadsters.

Should you look at my social blog, Buckle-Up-Travel, and the June 27th, 2010, post “…finally succumbing to heat” you will read of an incident at the Willow Springs track that sidelined our car based on something that for many is simply ignored. Power-steering fluid – when was it that you last heard someone talking passionately about something as inane? And it was NonStop Enterprise Development (NED) engineering Director, Mike Plum, who reminded me that “power steering fluid takes a beating on a track like Willow Springs with the long sweeping turns. The fluid is under extreme pressure and once it boils the observed failures can be: fluid expulsion, blown cap, blown reservoir, blown hose or pump lock up.”

The picture at the top of the page? It was taken at Willow Springs but earlier in the day as I returned to the paddock and it would be the following session when everything went horribly wrong. That day remains the low point of that year and a reminder that sometimes it can be the little things that, left unmonitored, wreak havoc at the worst possible times!

It was a little deeper within that same post where perhaps the most blatant of observations was made by my good friend, Brian Kenny, who pointed out to me that “power steering fluid may indeed be the ‘forgotten fluid’ (and that) the extra grip the Toyos provided overwhelmed the standard offering!” Indeed, I had completely forgotten to check with anyone about the likely negative impact on the power steering when running stickier track tires.

While there continues to be considerable coverage in forums and blogs today when it comes to scalability, particularly the almost linear scalability that comes with deploying the HP NonStop Server platform (and the many business benefits derived from this key NonStop characteristic in terms of being able to scale up, and down), there are other very important attributes as well, none more written about perhaps than availability and how the NonStop Server platform remains available despite failing components. In attaining this all important attribute the chosen architecture addressed scalability as well, and in a highly intelligent manner. To read more about my observations on the importance of availability, check the post of October 31st, 2011, “What price availability?”

However, the attribute that I have come to appreciate of late, even as I have begun referring to it as “the forgotten attribute,” is data integrity. A quick check of groups on LinkedIn revealed little about the topic, and while there will be some within the NonStop community who will take exception to my observation, all the same, it seems of late that while we all accept it as belonging in the mix of key attributes of the NonStop Server – availability, scalability and data integrity – it’s more from a historical perspective than anything else. It’s always been associated with the NonStop Server I have to admit, but beyond that? Very little of the floodlights that are directed at availability and scalability fall on data integrity, the forgotten attribute.

And yet, in recent exchanges with NED product management it came up a couple of times, and in each instance was tightly coupled with availability in a manner I had not previously given enough consideration – after all, availability needed little by way of supporting attributes, I reasoned. “Data Integrity is indeed related to availability. After all, what would tend to be the ultimate outage (short of a fire or natural disaster)? Answer: A data integrity problem that took hours or days to recover a database to its proper state,” was how product management’s software boss, Tim Keefauver, explained the relationship.

In the same exchange Keefauver then added “It is with this thought in mind that data integrity is such a high priority for NonStop and always has been. For example, if due to a data corruption a numeric becomes non-numeric then some programs will issue a fatal error and end. This can result in an outage of the application even though no bad data got to disk or to the eyes of end-users.” And with this explanation, Keefauver had my undivided attention.

From the moment data arrives on the NonStop Server measures are taken at every step to ensure there’s no loss of data integrity. Opportunities to corrupt data as it is manipulated, stored and subsequently retrieved have been examined and addressed through a combination of what today’s modern Intel chipsets provide as well as the implementation of more accurate CheckSum algorithms.

When it comes to the role played by CheckSum algorithms it was Bill Highleyman, well-known commentator and author / editor of the Availability Digest, who observed how “in order for data corruption to occur, it presumably would have to be in the path through ServerNet, the CLIMs, and the disk units themselves. That is what a good CheckSum (algorithm) protects against.”
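For readers who like to see the idea in code, here is a minimal sketch of what a checksum does along a data path. It is purely illustrative and makes no claim about how NonStop implements its own checks: it assumes the CRC-32 routine from Python's standard zlib module as a stand-in algorithm, and the function names seal and unseal, as well as the sample record, are hypothetical ones I made up for this post.

```python
import zlib


def seal(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 of the payload before it leaves the sender."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def unseal(frame: bytes) -> bytes:
    """Return the payload if its CRC-32 still matches; raise otherwise."""
    payload, stored = frame[:-4], int.from_bytes(frame[-4:], "big")
    if zlib.crc32(payload) != stored:
        raise ValueError("checksum mismatch: data corrupted in transit")
    return payload


# Hypothetical record travelling from application to disk.
frame = seal(b"balance=0001250.00")
assert unseal(frame) == b"balance=0001250.00"

# Simulate a fault somewhere along the path by flipping a single bit.
corrupted = bytearray(frame)
corrupted[8] ^= 0x01
try:
    unseal(bytes(corrupted))
except ValueError as err:
    print(err)
```

The point of the sketch is simply that the receiving end refuses the damaged record rather than passing it along, which is the kind of protection Highleyman describes.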

This morning, the waitress at our local diner asked me if I could wait a spell as they had just shut down the computer! Yes, it was another reminder of how we have all experienced, at one time or another, the frustration that comes with data that is just not right – for years, “computer error” was synonymous with a modern workplace. You wanted computers, right? Well, you just have to live with the occasional computer error and perhaps you need to go so far as to retain manual processes as a back-up.

But those days are long gone, and increasingly as traffic between computers escalates the obligation to provide accurate data is paramount – corrupted data not only impacts our own computers, of course, but can lead to cascading failures that reach well beyond our own business pursuits and bring with them a lot of potentially unwanted headlines.

Keefauver’s observation was later confirmed with NED product management Director, Randy Meyer, who explained that “Tim’s exactly right – data integrity issues are often what cause the hours/days of downtime that we read about in the papers. A fault occurs, creating a corrupted database. Then it takes hours or more to recover that database.” A computer system may indeed be available but if the confidence in the accuracy of the data is lost, then it’s no more available than a system that has crashed. Perhaps worse – actions may have already been initiated that then require considerable negotiation to back out.

It may not always be on the minds of users, but the data integrity provisions of the NonStop Server and their contribution to much greater levels of availability certainly are something we shouldn’t be oblivious to – there’s no reason at all to overlook the work that is done in this area and the engineering time spent in ensuring there’s never any loss in data integrity.

As for our end-users, never having to deal with bad data is of course its own reward and perhaps that is the most important aspect that comes from supporting this key NonStop Server attribute. Far from being the forgotten attribute, all of us should breathe a collective sigh of relief over just how seriously NED takes data integrity.
