Cranky Series: Plan for when things go Bump in the Night


August 21, 2012 by Mike Hillwig

If I’ve learned anything in my years of working as a technology professional, it’s that telephones work around the clock. If systems run at all hours, it means they can go down at all hours, too. And because we do a lot of our I/O-intensive work off-hours, it means problems can occur off-hours.

When I give this presentation, I love to tell the story of how my BlackBerry got beaten up at my former employer. People do stupid things at all hours, meaning we have to fix them at all hours.

When I first started with my current employer, I was the only SQL Server DBA. We have a small army of Oracle DBAs, but they didn’t know much about SQL Server. And shortly after I started with the company, I put in some basic alerts to tell us when things weren’t going according to plan.

That’s when my phone started to ring in the middle of the night. And you know what? It was my own darn fault. I was pushing the panic button without telling people why the panic button was being pushed. Worse yet, the people receiving my alerts were being told about things that didn’t need immediate resolution, and nothing in the alerts told them which ones could wait for next-business-day resolution.

Let me give a classic example: Log shipping. We have two data centers, one on the east coast and one in western Pennsylvania. Sometimes, we can run into network latency in the middle of the night, especially when our backup devices are replicating. We’re sending terabytes of data almost a thousand miles away, and in both directions. That means the occasional log shipping copy job will fail. My phone would ring in the middle of the night because the network burped. That got old really fast.

Now, when my log shipping jobs fail, they don’t fire off the failure alert. Instead, I have a job that scans msdb.dbo.sysjobhistory, looking for log shipping jobs that failed in the last fifteen minutes. The alert clearly indicates that if it’s the first time the alert is seen, it’s not cause for alarm. If the alert persists, it says that the first thing to do is confirm network connectivity between data centers. It’s very likely that other servers will be complaining as well. If the network can be ruled out as the cause of the problem, THEN someone needs to get me out of bed.
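
Here is a rough sketch of that idea in Python, not the actual job I run. It assumes the history rows have already been pulled from msdb.dbo.sysjobhistory into simple (job name, run time, run_status) tuples, and that the log shipping jobs can be picked out by an "LS" name prefix. The function names, the sample job names, and the alert wording are all made up for illustration.

```python
from datetime import datetime, timedelta

FAILED = 0  # run_status value SQL Server Agent records for a failed run
WINDOW = timedelta(minutes=15)


def recent_log_shipping_failures(history, now, prefix="LS"):
    """Return failed log shipping runs from the last fifteen minutes.

    `history` is a list of (job_name, run_time, run_status) tuples, a
    simplified stand-in for rows read from msdb.dbo.sysjobhistory.
    The "LS" prefix is an assumption about how the jobs are named.
    """
    cutoff = now - WINDOW
    return [
        (name, run_time)
        for name, run_time, status in history
        if name.startswith(prefix) and status == FAILED and run_time >= cutoff
    ]


def build_alert(failures):
    """Build alert text that says what to do, not just what broke."""
    if not failures:
        return None
    jobs = ", ".join(sorted({name for name, _ in failures}))
    return (
        f"Log shipping failures in the last 15 minutes: {jobs}\n"
        "1. First time seeing this? Not cause for alarm. Wait for the next run.\n"
        "2. Still failing? Confirm network connectivity between data centers.\n"
        "3. Network ruled out? Call the DBA.\n"
    )


# Hypothetical sample data: one recent failure, one old failure,
# one success, and one failed job that is not log shipping.
now = datetime(2012, 8, 21, 3, 0)
history = [
    ("LSCopy_PROD_Sales", now - timedelta(minutes=5), 0),
    ("LSCopy_PROD_Sales", now - timedelta(minutes=40), 0),
    ("LSRestore_PROD_Sales", now - timedelta(minutes=3), 1),
    ("NightlyBackup", now - timedelta(minutes=2), 0),
]
print(build_alert(recent_log_shipping_failures(history, now)))
```

The point is that the triage steps travel with the alert, so whoever reads it at 3 AM knows what to check before picking up the phone.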

Another example here is data file management. I have a job that looks for data files with the ability to autogrow. If they reach a certain size threshold, the alert fires off saying that we need to take action. It also states that the file will continue to grow until someone takes action, and that this can wait for next-business-day resolution.
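
A minimal sketch of that check, again in Python rather than the job itself. It assumes the file list has already been read from something like sys.master_files into (name, size in 8 KB pages, growth) tuples, where a growth value of 0 means the file cannot autogrow. The 100 GB threshold, the file names, and the function names are hypothetical.

```python
PAGE_BYTES = 8 * 1024  # SQL Server reports data file size in 8 KB pages


def files_needing_attention(files, threshold_gb):
    """Return (name, size_gb) for autogrow files at or above the threshold.

    `files` is a list of (name, size_pages, growth) tuples. A growth
    value of 0 means the file is fixed size, so it is skipped.
    """
    flagged = []
    for name, size_pages, growth in files:
        size_gb = size_pages * PAGE_BYTES / 1024**3
        if growth != 0 and size_gb >= threshold_gb:
            flagged.append((name, round(size_gb, 1)))
    return flagged


def build_growth_alert(flagged, threshold_gb):
    """Spell out the urgency so nobody gets called in the middle of the night."""
    if not flagged:
        return None
    lines = [f"{name}: {size_gb} GB" for name, size_gb in flagged]
    return (
        f"Data files over {threshold_gb} GB with autogrow enabled:\n"
        + "\n".join(lines)
        + "\nThese files will keep growing until someone takes action.\n"
        "This can wait for next-business-day resolution.\n"
    )


# Hypothetical sample data.
files = [
    ("Sales_data", 15_728_640, 65_536),  # 120 GB, autogrow on
    ("Archive_data", 26_214_400, 0),     # 200 GB, fixed size
    ("Small_data", 1_310_720, 65_536),   # 10 GB, autogrow on
]
print(build_growth_alert(files_needing_attention(files, 100), 100))
```

Only the first file gets flagged here. The second is bigger but can’t grow on its own, and the third is nowhere near the threshold.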

Next-business-day resolution. In other words, don’t get my butt out of bed.