In the troubleshooting guide we went over the different physical bottlenecks that can cause SQL Server slowness and showed how to identify them. In this guide, we’ll drill down one step deeper to help identify the actual settings or queries that are contributing to the bottleneck. By relieving these bottlenecks we will start to see our batch requests / sec increase. While it may seem you will be able to go through this article once, what is more likely is that you will need to continue to iterate through some of these principles multiple times. This is because many times when one bottleneck is relieved, another is revealed.
One thing we will not cover much in this article is architecture. Unfortunately this is a much more complex subject. Suffice it to say, either you have it or you don’t. I can’t so much give architecture principles in an optimization guide, but it is probably the single most important factor in determining throughput in a system. Without a good foundation you can only build a house so large. With that said, there are still luckily plenty of optimization techniques we can employ.
The biggest way to be a hero in your organization is to create the missing indexes. This task is simple ever since the introduction of DMV’s. When an index is missing it contributes directly to IO usage, and the higher the IO typically means higher CPU. Each time the SQL Server service is restarted the system tables begin getting populated and missing index information is collected. Needless to say, the longer the SQL Server service has been running, the more accurate this information will be. Once an index is determined as missing, the number of times it could have been used is counted. Fortunately we are able to query this data to help determine what which new indexes will have the most positive impact.
This query will show the missing indexes sorted by most impactful first. Implementing the top 10 typically has an enormous impact on a system that is not very optimized.
High IO Queries
After taking care of the indexes, we can begin to focus on the high IO queries. There are two different categories high IO queries fall into.
- High Average IO
- High Cumulative IO
The difference between these two is that a high average IO query will (hopefully) be run infrequently but when it does run, it uses an enormous amount of resources. This is typical behavior we see in reports or job related processes.
The High Cumulative IO procedure will typically be a procedure that is executed a lot. While the average amount of IO per execution is low when it is added up throughout the day we find it can also consume an enormous amount of resources.
The same DMV is used to determine both these types of High IO queries. The only difference is that we are sorting the results by average IO vs cumulative IO. The first two queries in the above link show how to find these. Once these are found you can begin to figure out what is causing the biggest issue using the execution plan. Also reference How to Write Optimized Stored Procedures.
High CPU Queries
If your biggest bottleneck is CPU, I actually consider this to be the best bottleneck you can have. You will know if your CPU is bottlnecked because you are able to hit (95%+) usage. If this is the case, then you need to tackle the queries that take higher CPU rather than higher IO first. First of course, make sure this is being caused by SQLServer.exe and not some other process. With that out of the way, we can run yet another DMV. This DMV is basically the same one we ran for the IO, however we are sorting by worker time instead. The same “average vs cumulative” case comes into play with the CPU queries just like the IO queries. We need to focus on both the intensive “job” type queries AND the “operational” queries that utilize a lot of CPU cumulatively. This DMV will show you both.
Most Executed Queries
One of the least recognized optimization techniques focuses on the most executed queries. Often times in systems you will find that many procedures will depend on one particular query or procedure. If this single query can be sped up, then the entire system would see a big increase in throughput. A gross example of this was one system I worked on that had a calculated column defined as a user defined function. While architecturally this is a big mistake, after optimizing this function this system that previously hit max batch requests/sec of 400, now hit over 3500. Suffice it to say my yearly salary could have been justified with that one fix. To find the most executed queries run the following DMV:
Keep an Eye on Your Server
While it’s possible management may think you’re wasting time, it is very important to keep an eye on your server to become familiar with the processes that are running. If you do not do this you won’t be able to recognize new & slow processes. By watching the server you can discover blocking, adhoc queries run by users, increases in connection count, queries run at a certain time of day. There are so many positive discoveries that can come about by watching your server I can’t begin to mention them all here. There are two tools I can recommend to manually watch your server.
One of the least understood and improperly configured settings in SQL Server has to be parallelism. My take on parallelism for a customer facing OLTP system, is that any user initiated process should not qualify for parallelism. Long story short is the risk is not worth the reward. Parallelism is basically a multi-threaded call for a single query. If the cost threshold is too low, then you will get everything under the sun qualifying for parallelism.
You can read more about it here: What is parallelism.
Configuring it here: Configuring parallelism.
This will help you find queries qualifying for parallelism.
This can help you control a single query.
Blocking can also be a hindrance to throughput. Blocking occurs because the default isolation level for SQL Server is read-committed. This basically means that while a record is being updated it is locked so that it cannot be read until the update is complete. At least that is one example. Most of the time blocking is a symptom of missing or non optimal indexes. With improper indexes, scans occur. When scans occur too many records are touched and locked. To help get around blocking you need to add indexes or change the default isolation level for your procedure. To discover the blocker you can run finding the lead blocker. Run this to find the queries that are most blocked.