Friday, October 2, 2015

How many DBAs does it take to change a light-bulb should it not be NonStop?

Taking a look at street signs may not be the best way to tell in which direction you are headed but on the other hand, NonStop roadmaps contain strong signage as to what is coming next …

Standing at the intersection of Confidence Drive and 100 Year Party Court, I knew I was back in Boulder County. It’s near our home and a place we routinely visit for early morning muffins and bagels. Apart from having a good dose of confidence, it would be presumptuous for most of us to suggest that we could keep any party kicking along for a hundred years, and yet, after some forty-plus years in the marketplace, NonStop is doing just fine and might even make that anniversary. Irrespective of that possibility, it did get me thinking this week about what is involved in ensuring something just keeps on going – just like the famous Energizer batteries in the television commercials.

Picking up on this theme, no matter the month or even the year, it seems as though I can never escape routine maintenance. Whether it’s a car needing new tires, the outside BBQ needing a thorough clean (as fall approaches) or changing light bulbs inside the house! I just finished pulling out the old bulbs and replacing them with a newer, more eco-friendly variety, and once I started I was surprised at just how many light bulbs had failed over the summer. Then again, I was reminded that maintenance is never a one-time shot – no one has ever suggested that any apparatus needs to be looked at only once.

Since time immemorial, data centers have been subject to regular maintenance. At first, it was simply a case of cleaning the card reader and removing paper dust from the printers. Part of any service agreement for a major system included scheduled downtime for maintenance, which enterprises paid a small fortune to ensure was performed regularly and to a plan. But can we safely ignore some tasks today and save the money? Surely, just rip out the server that’s failed and replace it with a new one – what’s the point of spending time trying to fix a computer that is little more than a board? Has the price of industry-standard components dropped to the point where we really do enjoy the luxury of having disposable systems?

Just recently I attended the MATUG user group meeting in Herndon, Virginia. This was held on the HP property that is located on EDS Drive, close to Washington, D.C.’s Dulles airport. Even though our navigation system couldn’t help but call the thoroughfare Ed’s Drive, we were reminded how times change and how street names once thought impervious to change harken back to former glory days. Near where we live, here in Boulder, CO, are the former premises of the once-mighty Storage Technology, and while the buildings were all demolished a while ago, off the arterial highway that now takes you to a shopping mall there are signposts for both Disk Drive and Tape Drive that lead nowhere at all.

HP Product Manager Mark Pollans did an excellent job of reviewing the NonStop product roadmap, and while he had the audience “Oohing” and “Ahhing” on a regular basis, one item did catch my attention. I know I have heard reference made to it in the past, but for some reason this time it had me thinking. When it came to supporting Solid State Disks (SSDs) – essentially extensions of the thumb-drive technology we have all come to depend upon of late – it wasn’t a straightforward task for the engineers at HP.

The problem is that they just wear out. Just when you least expect it, they’re not going to let you write anything more. Who knew; it makes me take a second glance at all the thumb drives I have stashed in a side drawer of my desk, each carrying a different PowerPoint presentation. The wear-out monitor is a feature of the drive and it is externalized via Open System Management (OSM); it is OSM that will alert you as an SSD gets close to wearing out – giving customers ample time to replace the drive.

But here’s the thing, as I understood it from Mark: when it comes to HP NonStop systems using SSDs, there are now new capabilities incorporated into the drive that provide feedback on just how long it can be used, so that monitoring software can graph the potential failure time and enterprises will not be caught out by surprise.
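The arithmetic behind that kind of graphing is simple linear extrapolation. As an illustrative sketch only – the OSM interface itself isn’t public here, so the sample readings and the 90% replacement threshold below are hypothetical – monitoring software might project a replacement date from periodic wear-percentage samples like this:

```python
from datetime import datetime, timedelta

def project_wearout(samples, threshold=90.0):
    """Linearly extrapolate when the SSD wear indicator will cross the
    replacement threshold, given (timestamp, percent_worn) samples.
    Returns None if the drive shows no measurable wear."""
    (t0, w0), (t1, w1) = samples[0], samples[-1]
    elapsed_days = (t1 - t0).total_seconds() / 86400.0
    rate = (w1 - w0) / elapsed_days          # percent worn per day
    if rate <= 0:
        return None
    days_left = (threshold - w1) / rate
    return t1 + timedelta(days=days_left)

# Hypothetical monthly readings from a drive's wear-level indicator:
samples = [
    (datetime(2015, 7, 1), 40.0),
    (datetime(2015, 8, 1), 42.0),
    (datetime(2015, 9, 1), 44.0),
]
eta = project_wearout(samples)
print(eta.date())  # → 2017-08-14: ample notice to schedule a drive swap
```

Real monitoring products will fit the trend more carefully than a two-point line, but the principle is the same: turn a wear counter into a date the operations team can plan around.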

Vendors working in the application monitoring space are also aware of this property of SSDs on NonStop and assure me that they have this base well and truly covered. It all sounds rather simple when you think about it – letting us know when you can no longer write data to an SSD – but making sure this is brought to our attention before it happens was a requirement of the HP NonStop team. Ooh! And yes, Ahh!

And this cuts to the very core of why we have faith in the NonStop engineering team. Not for them the easy path; rather, they tackle every problem from the perspective of the user – not just individual items in isolation, but how they impact the total operation of a NonStop system. I am often told of just how good the hardware has become, even as I am questioned about the continuing relevance of NonStop.

To many folks, it’s once again a case of thinking that good enough is, well, yes, good enough. But it isn’t, and that’s proven time and time again in the real world. Outages hurt and there’s no ducking the issue – and yes, planned outages hurt every bit as much as unplanned outages. I still become highly agitated when my online banking application tells me that it will be down for maintenance Sunday between the hours of 4:00pm and midnight. What the heck is that all about? But now, for users of NonStop systems with SSDs, it’s safe to run even the most heavily accessed NonStop SQL (NS SQL) tables on the latest in SSD offerings from HP.

When the NonStop developers first started discussing the need to provide an SQL database on NonStop, one of the most important properties covered was how to keep SQL up and running even during times of maintenance. As I am so often reminded, given the very nature of SQL and the relational database manager supporting it, database administrators (DBAs) need to run certain utilities that check just how fragmented the database has become, gather statistics, and then perform routine maintenance. With the other popular SQL implementations, the database is offline all the while – you have to take down the database and have some other option for handling queries that may continue arriving at the application.
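By way of illustration only – the table names, page counts and 30% threshold here are hypothetical, not any vendor’s actual utility – the decision such a maintenance check boils down to is a simple fragmentation ratio: how much of each table’s allocated space is actually holding rows.

```python
def fragmentation(used_pages, allocated_pages):
    """Fraction of allocated space that is wasted; 0.0 means fully packed."""
    return 1.0 - used_pages / allocated_pages

def needs_reorg(tables, threshold=0.3):
    """Flag tables whose wasted space exceeds the threshold.
    On most SQL platforms these are the tables the DBA must take
    offline to reorganize; on NonStop the reorg can run online."""
    return [name for name, used, alloc in tables
            if fragmentation(used, alloc) > threshold]

# Hypothetical catalog statistics: (table, used pages, allocated pages)
stats = [("ACCOUNTS", 800, 1000),   # 20% wasted - leave alone
         ("HISTORY",  500, 1000),   # 50% wasted - reorganize
         ("AUDIT",    950, 1000)]   # 5% wasted - leave alone
print(needs_reorg(stats))  # → ['HISTORY']
```

The check itself is trivial; the point of the paragraph above is what happens next – on most platforms the reorganization that follows costs an outage, on NonStop it doesn’t.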

Several years ago I wrote a research note on NS SQL for HP (no longer available on the HP web site, but it can be provided upon request), and the fact that NS SQL was a part of the “integrated HW, SW and OS stack” simplified NS SQL in ways other implementations simply couldn’t emulate. In that research note I observed how, from the server’s hardware and disk storage subsystems to the operating system itself, on up through the platform’s low-level access methods and audit, logging and recovery features, at every turn the DBA faces compromises and trade-offs when it comes to tuning an SQL database.

Whether it’s simple maintenance or more complex modeling to cater for growth; troubleshooting because of user input errors and unexpected resource locks; or monitoring performance, running statistics, and updating query plans, there’s no let-up in the demands placed on DBAs. Perhaps central to what drives much of the activity of the DBA is the underlying problem that the SQL database instance is but one of many technology “layers” the DBA needs to be aware of. Even with the tools on offer today, there’s still much that simply relies on the judgment calls of skilled DBAs.

“I think ease-of-management is a valid argument,” said Sami Akbay, formerly VP of Marketing, GoldenGate Software, and now Cofounder and EVP of Striim (née WebAction). “Having fewer systems instead of ‘fragmented’ infrastructure is something that favors the NonStop SQL offerings!” Just as importantly, and highly valued by DBAs supporting NS SQL/MX, is the ability to run mixed workloads as a byproduct of this tight integration – without, for instance, competing resource management schemes. “We update statistics and query plans on a monthly basis, for most objects, and we do it on the fly!” Rob Lesan, formerly of AOL and now part of the vendor community, confirmed all of the above before adding, “maintenance? Truly, we run reorgs, statistics, splits, column adds, etc. all without taking anything down. It’s the NonStop fundamentals!”

Of all the attributes of NS SQL, the one I know the NonStop community values most of all is that there’s no need to break for routine maintenance – it can all be done on the fly while the database is being accessed by NS SQL applications anywhere in the network. Ooh! And yes, again, Ahh! Try that with Oracle or even SQL Server without resorting to complicated cluster options together with background data replication, all glued together with complex scripts demanding a whole lot of operator attention. Gee whiz, hope nothing breaks right now! Of course, the answer to the question of just how many DBAs you need when not running on NonStop becomes a sore point for enterprises.

SSDs that degrade with warnings and SQL that doesn’t have to be shut down both help reduce the maintenance load expected of NonStop systems, and this is proving to be a major consideration going forward. If you want to enjoy that 100 year party you need to look very seriously at all that NonStop offers and, yes, have the confidence to promote it internally! When it really matters most, NS SQL, and the integrated stack it is part of, remains unmatched by any competitor’s offering in terms of underlying technology, and for this the community can sit back and exhale – ooh! Ahh!


Anonymous said...

Speaking of street names:

Tandem Blvd in Austin

Bill Honaker

Chris Ager said...

Good article Richard !
Worth noting that "predictive failure analysis" of NonStop disks has been a feature for HDDs too for a while - although in my experience, this doesn't always work in all circumstances - HDDs sometimes just fail without any predictive failure events.
In relation to SQL/MP and updating statistics - this sometimes does involve an outage, depending on whether the affected SQL programs have been compiled with the norecompile option - obviously this comes down to the DB and application design.
Regards, Chris