Clustered Index
-
Posted on June 6, 2010 by Derek Dieter
-
2
The clustered index on a table can be defined as: the sort order for how the data for the table is actually stored. Being that the clustered index is the actual data itself, you cannot have two clustered indexes. You can however have many non clustered indexes. These non clustered indexes are a subset of the columns for the table that point back to the clustered index itself.
If a clustered index is not defined, then the table is considered to be a heap. Or just a scattered array of rows with no particular sort order. The only time a heap is considered useful, is when insert speed is a factor. When new rows are inserted into a heap, they do not need to get stored in any particular order and thus do not need to be squeezed into any particular “page” of an index. Whereas, the clustered index insert needs to be placed (or squeezed) into a specific location in order to maintain the underlying sort order.
This process of being squeezed occurs on what is called a page of the index. Multiple pages comprise the entire index, and there is also a setting defining how “full” pages should be kept. This fullness percentage is called “fill factor” and a lower percentage accommodates inserts into the proper sort order without having to push rows to the next page in order to fit the incoming rows. So if insert speed is not a factor, or inserts are not happening often on the table, then a clustered index is always useful.
The choice of the clustered index should pivot around how the table is going to be queried. The clustered index should accommodate the majority of queries where possible. This is because querying the clustered index directly is akin to finding the page of a book directly without having to use the index. When using the index in the back of the book, there is always a cost because it takes time to find the proper page using the index, and then you have to find the page the index references. Compare this to already knowing the page number directly and you can understand the difference.
Clustered indexes can be defined using multiple columns. Meaning, more than one column can define the sort order for the underlying storage of data. An example would be when defining the clustered index by LastName, then FirstName. This would help the speed of a query when looking for a specific LastName, then FirstName. Please note, that the column order in which the clustered is defined is important. Creating a clustered index by setting FirstName first, then LastName second, will not accommodate a query that just wants to find a record by LastName.
Another consideration when creating a clustered index is that you want the first column of the clustered index to be the more unique column. Notice I said more unique and not necessarily unique. Why? Because generally clustered indexes should not be based on selecting a unique column first. In the majority of cases a clustered index based on a primary key is useless, unless that table is referenced via it’s primary key the majority of time. This creates confusion as to why when creating a primary key, the default operation is to utilize it as the clustered index. While this is probably the safest bet (in order to assist newcomers) in my experience it is rarely the best choice.
- Comments (RSS)
- Trackback
- Permalink