Friday, January 30, 2009

The Ever Elusive Service Level Agreement

SLA: the Service Level Agreement.

This is often a metric assigned by a management oversight committee to an internal organization, or an agreement between a vendor and a client. It dictates one of potentially many key measurement objectives that help determine quality of service. For example: when you walk into McDonald's, they might employ a service level agreement that says that a cup of coffee will be served at a particular level of quality. Maybe it's something as simple as this: a cup of coffee will be served to a waiting customer in no more than 2 minutes, on average, from door entry to coffee-in-hand.

This hypothetical SLA would be tracked via some observable mechanism and a particular set of controls would intervene if service quality wasn't met. For example, more front counter staff added, colder coffee served, manager replacement or in the most dire of situations, some sort of automated coffee launching mechanism employed that would hurl the product across the counter to minimize that final "cup-pour-cap-hand-off" sequence. This final solution would potentially kick off a new series of lawsuits -- I'm guessing that's why I have yet to see it in operation. Still, you can bet that it's in the prototype stage in markets with extremely demanding customers.
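To make the "observable mechanism" a bit more concrete, here is a minimal sketch in Python of how the hypothetical coffee SLA might be checked. Everything in it is an assumption for illustration: the 120-second target is just the 2-minute figure from the example, and the function name and sample timings are made up.

```python
from statistics import mean

# Hypothetical target: the 2-minute door-to-coffee figure from the example.
SLA_TARGET_SECONDS = 120


def sla_met(service_times_seconds, target=SLA_TARGET_SECONDS):
    """Return True when the average service time is within the target."""
    return mean(service_times_seconds) <= target


# Made-up door-to-coffee timings, in seconds, for five customers.
observed = [95, 110, 150, 130, 100]

if sla_met(observed):
    print("SLA met on average")
else:
    print("SLA missed: time to add counter staff")
```

Note that two of the five customers in this sample waited longer than 2 minutes even though the average passes. An SLA written as an average says nothing about how any one customer's wait felt.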

Regardless of the SLA that's being tracked or that exists at the contract level, there's a potentially unspoken SLA in effect at almost all times when it comes to service and especially when it comes to infrastructure.

And there are complications surrounding real versus perceived SLA. McDonald's may have a 2 minute SLA that they're tracking -- but customers of McDonald's may think that the SLA is more like 1 minute. This is why it's extremely important to have dialog with customers and carefully listen to this kind of expectation. It's why social media is a great marketing tool -- it's possible to have a conversation with a group of customers over important aspects of quality. Social media can give the company providing that service a direct insight into the good, the bad and the ugly experiences of their customers.

I've said it here before and I'll say it again: infrastructure is difficult in part because when it's working it is hidden out of sight. It sits under all of the things that everyone else views as bright and shiny. Complicating the hidden aspects is the fact that being good infrastructure is partly what makes it invisible.

This is another reason why good infrastructure people are good communicators: listening is more important at the infrastructure level than talking. Arrogance can net you blind spots or, worse, lead to difficult customer relationships. People with crappy communication skills make difficult managers in general; in the context of infrastructure, however, communication is doubly crucial.

When it comes to providing infrastructure service, opportunities to interact with customers are rare. These crucial interactions are made all the more difficult by personalities that empty the emotional bank account in advance. A good relationship with your customer is crucial because a lot of the basis for success or failure potentially rests upon SLA metrics that may not be evident or are incorrectly perceived by the customer. Worse, when the desired SLA metrics are discovered, they may be impossible, or meeting them may not make business sense. Your customer often doesn't know what they need. In a lot of cases they don't know why something works. They may want 100% service but only 99.5% is possible without going out of business.
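The gap between 100% and 99.5% is easier to discuss with a customer once it's turned into hours. Here is a rough sketch in Python of that arithmetic; the function name is made up for illustration, and it assumes a plain 365-day year with no allowance for planned maintenance windows.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent, period_hours=HOURS_PER_YEAR):
    """Hours of downtime permitted in the period at a given availability."""
    return period_hours * (1 - availability_percent / 100)


for target in (99.5, 99.9, 100.0):
    hours = allowed_downtime_hours(target)
    print(f"{target}% availability allows about {hours:.1f} hours down per year")
```

Under those assumptions, 99.5% works out to roughly 43.8 hours of downtime a year, while 100% leaves none at all. That's the conversation to have before the outage rather than after it.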

One thing is for certain: they will be more than happy to tell you when something is broken or an SLA is not being met.

For example, I woke up this morning snoring. This may sound like an event that is completely unrelated to the topic of SLAs, but humor me.

Last night I flew in on U.S. Airways and upon arrival, my luggage was missing. In my luggage: a CPAP machine that helps me deal with my own SLA -- it keeps me from snoring too much. If I snore too much I miss sleep and wake up. Too much missed sleep and I have a crappy day. My body and I have this service level agreement: keep the snoring to a minimum and it won't have a heart attack on me before age 50.

I had a feeling that something like this was going to happen when I turned loose of the handle of my suitcase at the departing airport. Why? Because my usual claim ticket looked like a hall pass for a 3rd grader. U.S. Airways' computers were down, and instead of the usual neat claim ticket with bar-code, my name and other things encoded on it, I had something with carbon paper and scribbled numbers on it. I had a boarding pass with some numbers and letters jotted on it that I prayed were in the right locations. Maybe 9F was my seat? (It was near a box that said something to that effect.) Or maybe it was the departing gate? Was the gate 35A, or was it terminal A and gate 35? Fortunately I had about an hour and a half to kill ahead of the flight, so I was easily able to make it onto the plane.

Too bad my luggage wasn't so lucky. Riding to the plane on a shuttle bus (it was an outside boarding situation) I could see the crew loading bags into the belly of the plane. I didn't see my luggage there (it's amazing how many times I've caught sight of my bag being loaded from the inside of a plane). I could tell that with all of the normal automation down my luggage was going to be a long shot. We sat on the tarmac for another 30 minutes or so because the pilot was unable to file a flight plan and had to do it "manually". Upon landing, about half the suitcases for the flight were missing. Fortunately I was home, so I had things to wear.

Other customers were not so happy.

U.S. Airways violated an SLA -- the one it has with its customer in this context. Is it truly in violation of an SLA on a contractual basis? Possibly not. They may have carefully calculated statistics that account for a certain amount of delay and a particular percentage of lost bags every year. So far this year they may have only lost some small percentage of bags. In other words, their SLA might be fine because the current number of lost bags is potentially still below the number that they were expecting to lose. I don't know for sure if this is the case because I don't know what SLAs (if any) they're maintaining, internal or otherwise.

One common saying applies here though: while a lot of people may question a system that's working and wonder out loud if everything has been done the right way, everyone can easily be in agreement over a system that's broken. U.S. Airways failed to deliver service -- their computers were down and therefore my bag was not there at the end of a delayed flight.

And what about their SLA? Has it been met? Maybe they're fine at a statistical level. In other words, they've possibly accounted for a certain amount of this as business as usual. But the moment a piece of luggage is lost at the end of a trip, you can bet that any customer, me included, is going to think otherwise.
--Paul Ferris

Tuesday, January 13, 2009

Launching All Things Infrastructure

The Internet is full of opinions on a lot of topics, mostly exciting.

And that's a problem -- infrastructure rarely conjures up exciting images in the collective minds of computing professionals. Infrastructure is that boring stuff that sits just below the exciting stuff -- below the applications, below the pretty graphics -- below the stuff that people consider creative.

Exciting infrastructure is actually an oxymoron -- by definition, you don't want your infrastructure to be exciting. Oh, sure, fast CPUs are all the rage -- and some people get very excited about the speed of computation of, say, a farm of computers operating as a Beowulf cluster -- but that's not the excitement I'm talking about here. The exciting moments of infrastructure typically involve the unexpected. Unexpected loss. Unexpected down-time. Unexpected performance.

No, if you do a lot of infrastructure, you want it to be boring as paint drying. I've done my fair share of enterprise-class implementations of things. I've been around this equation for a really long time, actually. I've based a lot of my career on the best ways to deploy and manage solid infrastructure. Along the way, I've come to some conclusions that have motivated me to publish this blog.

Being boring, under the radar and behind the scenes makes for a challenge of its own. Management rarely understands what's going on (until things get exciting) -- and then it's too late. There isn't a whole lot of blog or interesting print devoted to building and maintaining the stuff. Complicating matters, the shifting landscape of computing changes the infrastructure equation frequently (albeit a bit slower than the development side of the house). All of these vectors contribute to what I perceive as a vacancy when it comes to the interesting (to me) topic of enterprise infrastructure management.

I had an epiphany when I realized this vacuum existed. I realized that I had been one of the people happily computing along, building infrastructure and not really sharing techniques, solutions, observations or challenges -- and I have no excuse for not sharing.

Infrastructure? Boring?

I realized that I could talk about this a lot -- and that probably there would be an audience of people that would rarely find the topic boring.

All Things Infrastructure will be about the challenges of building and maintaining not just the infrastructure, but the needed human resources that must inevitably accompany the work. People are needed -- and as the enterprise grows more complex, so will the work and so will the resumes of the people being managed. The culture must grow in lock-step as well. Complication breeds more complication.

It would probably help a bit at this point to explain that I won't be talking much about small shops or even medium-sized companies. My experience is with medium to large implementations of J2EE application infrastructure. This isn't cheap stuff, by definition. It requires planning and forethought or things will be more expensive. Complicating matters, the metrics to prove this prior statement are embedded in choices that are not simple. People may make these choices based upon broken assumptions or guesses, which are costly and, unlike other choices, very hard to undo.

I'll close this post with a promise: What I relay here will be from the gut, based upon my experiences and observations. It won't be based upon anything sales-related or tainted by advertising whim. Infrastructure done right lasts a long time. The teams that manage that infrastructure must be stable units. Stability requires longevity and longevity requires truthfulness. There are no shortcuts when it comes to the truth.

--Paul Ferris January, 2008