Choosing the Best Clustered Index

Posted on June 6, 2009 by Derek Dieter
2
5

One of the most common database performance issue most companies face is the improper choice of indexes or the lack of indexes. Most essential, is the clustered index. A well-chosen clustered index will save CPU, Memory, and IO. Choosing the clustered index for a table goes hand-in-hand with the writing of the queries for a table. It is best to design the queries and the index simultaneously to keep consistency. The majority of the time this is not an option. When we have not written the queries, the best way to understand the optimal clustered index is to study the queries that are run against the table.

Once you understand the queries that run against a table, it is helpful to think of each of these queries broken down into three parts:

The select list – Be on the lookout for the queries that return the most columns from the table.
The join columns – The most common columns used to join other tables against the table.
The where clause – The most common columns used to filter the result sets.

Avoiding Fragmentation

Before going into depth in choosing a clustered index we need to know what causes fragmentation.

We ideally want to answer no to the following:

Will the clustered value(s) be updated?
Can new rows be inserted between already inserted clustered rows?

The reason for answering no above is because we do not want fragmentation at the clustered level. The clustered index is like the foundation of a house. If it is built on a hill, it can easily collapse. Unless you have a great maintenance plan or a big maintenance window to rebuild. Size of the table also comes into play. If the table is smaller (100K rows) it doesn’t matter as much because you can potentially fix the fragmentation relatively quickly. I would say ideally you want to stay about 60% fragmentation by the end of the day. Of course this could be completely wrong given specific scenarios.

Since all nonclustered indexes point to the clustered index, if new values get inserted in between existing clustered keys, these rows may be pushed to new pages, causing all the nonclustered indexes to update their pointer values to the new page instead. This can cause a fragmentation mess. This is why many times you will find the clustered key as an identity column. While this is the safest bet, it is also not always the most optimal choice. Especially if the table is not going to be large, and/or if you have a nice maintenance window. In the end you really need to know how the table is going to be queried the most.

Fill Factor

Another alternative to defining a clustered index that can allow rows to be inserted between existing clustered keys is to define a custom fill factor. The default fill factor for a table is set to 0, which is the same as 100%. What this means is that each data page is filled to capacity prior to filling the next data page. The negative aspect of this is that if a row needs to be inserted on an already filled data page, then other rows need to be moved off to accommodate these new rows. By defining the fill factor as say, 70% we are in essence saying, let’s leave 30% of the page empty to accommodate for new rows being inserted. This can buy you time in order to run a more comprehensive maintenance plan.

What to Choose

With the given information we can now say the primary keys and the columns used in the where clause are typically appropriate candidates for the clustered index. However the question still arises as to which columns to use. The short answer is to use the columns that get queried more frequently, and if both are queried an equal number of times, use the query that returns the largest select list. The majority of times this will yield the best results but sometimes you will not know this information until the system is live. Keep in mind fragmentation however. Do not sacrifice the integrity of your table. This is really a catch 22. But there are other options to choosing the clustered index you really want. You can always create covering indexes using the new INCLUDE method for indexes.

Examples

A couple example tables I can already tell you what the keys should be. In working with Employees or Customers, there is typically always an EmployeeID or CustomerID. This is always the appropriate index for an OLTP system. They may not be ideal for a reporting system however. Also, Orders will have an OrderID which is the appropriate clustered index. Then there is the OrderDetail table that contains the order items; while this may contain an OrderDetailID as a primary key (identity) my personal choice for the clustered index is the OrderID – since this is how it is always referenced.

Here is a more comprehensive tutorial on indexes. If you are not clear on indexes, I recommend studying the topic.

CLUSTERED INDEX DMV

Luckily, dynamic management views (DMVs) were introduced in the new versions of SQL Server that store meta data about the queries themselves.

One of the most useful DMVs are the sys.indexes and sys.dm_db_index_usage_stats. Using these two DMVs alone can help optimize a SQL Server enough to take it from poor performance to exceptional performance. Using these views, we can differentiate what is likely the biggest mistake made in indexing. Improper clustered indexes.

The clustered index is the actual data of the table, sorted according to the columns that are selected to be the clustered index. The columns that are selected to be the clustered index, should generally be the most commonly queried fields in the table. Once the rows used in a WHERE clause or a JOIN are appropriately isolated in the query, the actual data that is inline to the row typically needs to be gathered for the SELECT list. If the rows are isolated using a non-clustered index, and the non-clustered index does not contain the appropriate columns for the select list, a lookup to the clustered index occurs. This lookup itself takes a fair amount of resources. It is much better to be able to isolate the rows using the clustered index because once the rows are isolated, the data is already inline because the clustered index is the data.

The following query helps determine improper clustered indexes by comparing the number of seeks the clustered index receives with the number of seeks the most used nonclustered index receives. It then compares the number of seeks the most used nonclustered index receives with the number of lookups the clustered index receives. If this particular nonclustered index is responsible for more that 50% of the lookups to the clustered index, it is a very likely candidate to be the more appropriate clustered index.

[cc lang=”sql”]
DECLARE @NonClusteredSeekPct float
DECLARE @ClusteredLookupFromNCPct float

— Define percentage of usage the non clustered should
— receive over the clustered index
SET @NonClusteredSeekPct = 1.50 — 150%

— Define the percentage of all lookups on the clustered index
— should be executed by this non clustered index
SET @ClusteredLookupFromNCPct = .75 — 75%

SELECT
TableName = object_name(idx.object_id)
,NonUsefulClusteredIndex = idx.NAME
,ShouldBeClustered = nc.NonClusteredName
,Clustered_User_Seeks = c.user_seeks
,NonClustered_User_Seeks = nc.user_seeks
,Clustered_User_Lookups = c.user_lookups
,DatabaseName = db_name(c.database_id)
FROM sys.indexes idx
LEFT JOIN sys.dm_db_index_usage_stats c
ON idx.object_id = c.object_id
AND idx.index_id = c.index_id
–AND c.database_id = @DBID
JOIN (
SELECT
idx.object_id
,nonclusteredname = idx.NAME
,ius.user_seeks
FROM sys.indexes idx
JOIN sys.dm_db_index_usage_stats ius
ON idx.object_id = ius.object_id
AND idx.index_id = ius.index_id
WHERE idx.type_desc = ‘nonclustered’
AND ius.user_seeks =
(
SELECT MAX(user_seeks)
FROM sys.dm_db_index_usage_stats
WHERE object_id = ius.object_id
AND type_desc = ‘nonclustered’
)
GROUP BY
idx.object_id
,idx.NAME
,ius.user_seeks
) nc
ON nc.object_id = idx.object_id
WHERE
idx.type_desc IN (‘clustered’,’heap’)
— non clustered user seeks outweigh clustered by 150%
AND nc.user_seeks > (c.user_seeks * @NonClusteredSeekPct)
— nc index usage is primary cause of clustered lookups 80%
AND nc.user_seeks >= (c.user_lookups * @ClusteredLookupFromNCPct)
ORDER BY nc.user_seeks DESC
[/cc]