Wednesday 3 October 2007

The SQL Troubleshooting Checklist

read this handy article from SQL Server magazine and thought I'd share it with everyone... a good one to stick on the pinboard!

The Troubleshooting List

By Andy Warren, 2007/10/03
Total article views: 177 Views in the last 30 days: 162

Most DBA's grow into their jobs and part of that process is the baptism by fire that occurs when things go wrong, especially on the performance side of things. I started to make this performance centric, but so many of the steps and thoughts are common with other problems that I tried to make it a little more generic. In this article I've listed some tips and thoughts about what to do - and why - when the sudden performance dip strikes or other serious problem occurs. It's probably not a perfect list, but the key is to build a process that keeps you on track.

#1 - It's Usually Blocking
When I get a call that 'everyone is locked up', 'everyone is slow', etc, the first thing I look for is blocking. Various ways to check, but the tried and true technique is to execute 'select * from master..sysprocesses where blocked > 0'. Works on SQL 7 forward and takes about 15 seconds to type in and run, then you can decide what action to take. This usually occurs because someone is doing something bad like bulk loading records during the day. You can find and kill the blocking spid, but it's worth looking to see what the spid is doing first. If it's been running for three hours you might be better off to let it finish rather than wait for it to roll back.

#2 - It's Not Blocking, Is It Limited to One Application/Database/Physical Location?
I'm usually asking this question while I'm headed to a machine to work on step #1. If it's NONE of those things I can probably rule out recent application deployments, recent proc changes, network problems to a WAN site. If it is one physical location, I immediately get an email out to the network team to have them check all their hardware. Sometimes a router will blip, or maybe they just rebooted it and forgot to tell everyone. In practice it's usually not the network, but sometimes it is! If it's just one application and/or database, I'm immediately thinking 'what has changed'? If you know how to do wait state analysis this is a good technique to think about applying here but it takes time and you can often shortcut with the right questions.

#3 - If it's One Application, Can They Reproduce the Issue?
This is key. If they can show you it's always slow when they add a customer, look up an order, etc, you can quickly zoom in by profiling and filtering to a single host name. That will let you see exactly what is being executed and look for something with a long duration - long being subjective, but anything close to 30 seconds is a good candidate for any immediate look and anything on down to 5 seconds might need to be checked. Ideally here you can refer back to a baseline so you know what those procedures usually cost. Usually application problems can be traced back to either a change (app code or database), or SQL has generated a new plan for a query that isn't performing well, or both. In the worst case get a developer to step through the code against the production database because it's not always a SQL problem. Check whatever error logging system the application uses and the event log of the machine they are using it on.

#4 - It's One Application, But They Cannot Reproduce on Demand
These can be miserable to solve. I usually start with running separate profiler sessions on a couple users I can trust to call or email when the problem happens and to be able to tell me a lot about what they were doing at the time. I've seen some of these solved by the low tech technique of using a camcorder to capture everything that happened, more recently I've used Morae to let me go back in time to see what happened. Both of those are more often needed for code issues rather than SQL issues, but sometimes you're just looking for clues. If you can get the users to log when it happens you can try to match up in Profiler, the hard part is the query plans may be fine but they are being blocked or bogged down by something happening elsewhere on the server. So you have to have a deep focus on the user AND be looking at the server holistically.

#5 - Sometimes it's the data
This might happen because a lot of data has been added and the statistics don't match the current state of the data, or just because in some cases there is a lot more data than the developer expected. I dealt with a problem like this once where IT users had not flipped a bit in a table that allowed users to cache the results of an expensive query on the client, as a result they kept running that query over and over and over, killing performance. Flipped the bit and magically all was well!

#6 - Rarely Is it the Hardware
Perhaps most common in this category is a drive in a RAID set failing and no one noticing, resulting in a performance decrease because parity is being used to restore the missing data on the fly. I've also seen more than once incidents where 'the server is down' but someone doing work in the server room unplugged/knocked loose the wrong power/network cable. For the most part the hardware will either work or it won't. Doesn't hurt to have some check, but unless the server is actually unavailable this should be further down the list.

#7 - It's Usually Because Someone Changed Something!
I bet we all have had a few of these happen. Having a good change control process that is logged can really help when you're under stress. Something as minor as deploying a proc change where they added a column or order by can have a huge performance impact if the query is no longer covered. Sometimes the rules change after a hotfix (less often) or a service pack (more often) and while hopefully the changes are good, sometimes it isn't. Larger companies may be applying patches on a regular basis and after a while you start to ignore them, until something goes wrong. I'll also add a plug here for rebooting one extra time after applying a service pack because I want to be absolutely sure there are no pending changes waiting to be made live on the next reboot which might be months away - who would think of the last service pack as being an issue when it was applied three months ago?

#8 - It's Incredibly Important to Communicate
As soon as you know an incident is going to last more than few minutes someone needs to get the word to operations. Work may need to be redirected to other sites, hourly employees put on break or sent for an hour of required but not all that important training, or in worse cases sent home for the day. Even smaller companies may have hundreds of employees/customers affected and the cost per hour of down time can be significant. Tell operations what you know via your boss, let them decide. It's also important to immediately communicate problems to everyone in IT. A lot of times you'll get an immediate reply back from someone who was bulk loading data, rebooting a switch, etc, that will immediately explain the problem. You also will need people to help you chase down leads so that you can stay focused, and someone to make sure you get something to eat and drink while you work.

#9 - Know when to ask for Help
I think the more experienced you are the easier this is to do. If you're a brand new DBA you hate to get the credit card from your boss to call PSS, or ask for a consultant to come, because you don't want to look inept. Absolutely there is a risk to that, but on the other hand, anything beyond a few hours and the cost of either of those pales in comparison. Try setting the stage before the incident by having one call to PSS pre-approved and consider establishing a relationship with a SQL consultant just in case. And it doesn't have to just be PSS, lot's of good challenging problems get posted in the forums here and on MSDN every day and the people that answer them like a good challenge and the learning it brings. Lean on your peers too, whether within the company or someone you met at the local user group. With experience you get more comfortable with the idea that you know what you need to know, you know what you don't know, and that it is not a sign of weakness to ask for help. Sometimes easier to say than to do!

#10 - Learn Your Lesson
I bet easily 8 out of 10 times when I'm finally done resolving a performance problem that I can look back and see that I didn't ask the right question, didn't really listen to an answer, took off following the first lead rather than being methodical, or any of a bunch of other things I should have done - in hindsight. This isn't about beating yourself up or holding yourself to some incredibly high standard, but rather stepping past our pride enough to really learn a lesson.

#Extra Credit - The Challenge
It's been a while since one was posted, but Steve and I had a lot of fun writing about many of the near disasters we've been involved in, often of our making. So my challenge to you is post your own performance issue checklist, or tell us about the problem you should have solved sooner and why. Yeah, we may smile a bit when we read them, but I bet more of us smile because we've done the same thing, or because we know that's one trap we won't fall in to. As always I hope you've learned something or sparked an idea or two, and look forward to an interesting discussion!

No comments: