Using sysjobhistory to Find What’s Running Longer than Usual

3

September 8, 2009 by Mike Hillwig

We all have SQL Agent jobs that run periodically. Wouldn’t it be nice to know that something ran longer than average last night? When something runs longer than average, this could indicate a looming problem. I love finding these things before they cause serious problems because that prevents my phone from ringing in the middle of the night.

In larger environments, monitoring software can do this for you. I bet this could be done in System Center Operations Manager. In the meantime, though, I’m doing this the old-fashioned way.

The MSDB database has a vault of information that I’m just starting to explore. I’m sure somebody else has figured this out somewhere else. I figured this one out on my own and thought it was worth sharing.

The sysjobhistory table shows the execution of all of your SQL Agent jobs, broken down by step. If the step_id is 0, that means it’s the job as a whole. I filter those out. By getting an average of the run_duration, we can see how long things usually take.

select job_id, step_id, avg(run_duration) average
into #recentaverages
from sysjobhistory
where run_date > @historydate
and step_id > 0
group by job_id, step_id
order by job_id, step_id

There is a @historydate variable that I use to hold a date of 30 days ago.

Job history average execution times

Screenshot: Job history average execution times

Next I took that temp table and joined it back to sysjobhistory, filtering out anything that runs in less than ten seconds. I also use @analysisdate for everything since yesterday.

select sjh.job_id, sjh.step_id, sjh.run_date, sjh.run_time, sjh.run_duration, ra.average
from sysjobhistory sjh
inner join #recentaverages ra
on sjh.job_id = ra.job_id
and sjh.step_id = ra.step_id
where run_date >= @analysisdate
and sjh.run_duration > ra.average
and sjh.run_duration > 10

Shows Job executions running above average execution times

Screenshot: Job executions running above average execution times

Already, I can see one job that has several executions that are running longer than usual. This appears to be a job that runs hourly and six runs in the past 24 hours ran longer than usual. This could be a problem and warrants investigation.

Finally, I join this data to be sysjobs and sysjobsteps to find out who the culprits are.

select sj.name, sjs.step_id, sjs.step_name, run_date, run_time, sjh.run_duration, ra.average
from sysjobhistory sjh
inner join #recentaverages ra
on sjh.job_id = ra.job_id
and sjh.step_id = ra.step_id
inner join sysjobsteps sjs
on sjh.job_id = sjs.job_id
and sjh.step_id = sjs.step_id
inner join sysjobs sj
on sj.job_id = sjs.job_id
where run_date >= @analysisdate
and sjh.run_duration > ra.average
and sjh.run_duration > 10
order by sj.name, sjs.step_id

Shows the job and step names for jobs that are taking longer than average

Screenshot: Shows the job and step names for jobs that are taking longer than average

I now know my jobs and steps that may need to be investigated. Unfortunately, I had to block out some proprietary info, but you certainly get the idea of what’s going on here.

This could certainly be tweaked to look at the job as a whole instead of steps. Do that simply by looking only at step_id = 0 in the first query. Another option is to add a threshold that looks at anything like 10% above average.

Drop this into a SQL job that e-mails you when this script finds data, and you have a pretty powerful tool in your arsenal. You can find my entire script on my site. In the meantime, I have a longer-than-average running job to investigate.