SQL Server Join Algorithms
-
Posted on February 20, 2011 by Derek Dieter
-
4
If you read execution plans enough, you’ve probably realized that when SQL Server joins tables together, it uses different internal algorithms. The three algorithms are:
- Loop Join
- Merge Join
- Hash Join
These alogorithms that are used are based upon factors of the underlying data.
Merge Join
For the most part, this is the most efficient method of joining tables. As the name implies, both tables are essentially merged together, much like a zipper being zipped. It typically occurs when both tables that are being joined, are joined on keys that are presorted and contain all the same keys in both tables (for example joining a primary key with a foreign key). When one table contains keys that the other table does not have, the chance of the merge join being used is less likely.
The physical profile of a merge join is very little CPU usage, and very little reads compared to other types of joins.
Loop Join
The loop join is more CPU intensive than a merge join. This join typically occurs when one worktable is quite a bit smaller than the other. As the word loop implies, the smaller table being joined is looped until it finds the matching key in the outer (larger) table. This join is most efficient when the resulting output is smaller than 5000 rows. When larger than that, the CPU and reads make the join less efficient.
Hash Join
A hash join is the least efficient of all joins, however that does not mean it is not necessary. When large tables are being joined and not all the common keys being joined exist in both sides and/or they are not sorted, this join type may be chosen. As the word Hash implies, a hash is created against all rows for the common key in each table, essentially breaking the table into approximately 7 buckets. After the table has been broken into these separate buckets, the buckets themselves are joined together resulting in a smaller comparison of a result set.
Within the Hash Join, there are 3 separate types of hash joins (In memory, Grace, and recursive). These hash joins are segragated according to how much memory is required to create them. Essentially the hash joins follow the same methodology as lock escalation. When a hash join is taking too much memory, then the next highest type of hash join is escalated to and pushes the buckets to disk.
Using Query hints for Join types
Increasingly, I do not recommend using query hints for SQL Server. With SQL Server 2000 I was more apt to use query hints. However, as the optimizer gets smarter, and the more you upgrade, you never know what the query optimizer is going to choose for the query or why it is choosing it. That being said, yes you can specify query hints for join types. To do so, see the following queries:
[cc lang=”sql”]
SELECT *
FROM Sales.SalesOrderDetail sod
INNER MERGE JOIN Sales.SalesOrderHeader soh
ON sod.SalesOrderID = soh.SalesOrderID
[/cc]
[cc lang=”sql”]
SELECT *
FROM Sales.SalesOrderDetail sod
INNER LOOP JOIN Sales.SalesOrderHeader soh
ON sod.SalesOrderID = soh.SalesOrderID
[/cc]
[cc lang=”sql”]
SELECT *
FROM Sales.SalesOrderDetail sod
INNER HASH JOIN Sales.SalesOrderHeader soh
ON sod.SalesOrderID = soh.SalesOrderID
[/cc]
- Comments (RSS)
- Trackback
- Permalink
Hi,
Good article – short and precise
one point though: for loop join, isn’t that the small table is the one that will be used for outer loop and the bigger table the inner table?
Your means of explaining the whole thing in this piece of writing is actually good, all be capable of easily be aware of it, Thanks a lot.
I loved your article, very concise!. But let me add something, and please correct me if I am wrong. MERGE JOIN requires at least one equijoin predicate and sorted inputs from both sides, this means that the merge join should be supported by indexes on both tables involved in the join operation. This is a very important limitation. Cheers
Hey Martin,
That is the way I understand it. I think we said the same thing in different ways. For the record, I never force MERGE join hints because they can easily error out if those requirements are not met. If however you can setup the two inputs with a likelihood they will utilize a merge join, that makes for a good scenario.
Hi Derek – This post is nothing short of amazing. I’ve been trying to get a query return in less than 5 seconds which normally executes for 3 mins 45 seconds. Having tried all options I could possibly think of, what the ‘inner merge join’ query hint you explained above did pull the trick for me. The query now executes in exactly one second. I’m flabbergasted. Thanks so much.
Great T C,
Glad that helped. These join types in my experience have the greatest influence on the overall execution plan generated. I’ve found lately that the queries that benefit from hinting are typically also join a lot of different tables and are complicated to the optimizer. I’m trying more to create short queries that may dump results to table vars, then join the table vars.