Using Union Instead of OR
-
Posted on October 7, 2012 by Derek Dieter
-
5
Sometimes slow queries can be rectified by changing the query around a bit. One of these examples can be illustrated when multiple values are compared within a WHERE clause using an OR or IN statement. Often times, OR can cause a scan against an index or table which may not be the preferable execution plan in terms of IO consumption, or overall query speed.
A lot of variables come into play when the query optimizer creates an execution plan. These variables include a multitude of hardware specs, instance settings, database settings, statistics (table, index, auto-generated), and also the way the query is written. The one we are changing here is the way the query is written. As unsuspecting as it may seem, even though two different queries can return the exact same results, the path at which they take can be entirely different just based upon the format of the query.
UNION vs OR
In most of my experience with SQL Server, the OR is generally less efficient than a UNION. What tends to happen with an OR is it causes a scan more often. Now this sometimes may actually be a better path to take in some cases and I’ll leave that to a separate article but overall I’ve found that touching more records is the biggest cause of slowness. So let’s start our comparison.
Here is our OR statement:
[cc lang=”sql”]
SELECT SalesOrderID, *
FROM sales.SalesOrderDetail
WHERE ProductID = 750 OR ProductID = 953
[/cc]
From this execution plan we see that we are performing a scan against 121,000 rows. (You can’t see the row count, but it is).
Now let’s perform the same query, but written using a UNION instead of OR:
[cc lang=”sql”]
SELECT [SalesOrderID], *
FROM sales.SalesOrderDetail
WHERE ProductID = 750
UNION
SELECT [SalesOrderID], *
FROM sales.SalesOrderDetail
WHERE ProductID = 953
[/cc]
Here we see two branches of operations. One branch is touching 358 rows, while the other branch is 346 rows. Both branches meet to perform the concatenation operation bringing both result sets together. We have two separate seeks, but we also have a key lookup in order to retrieve the SELECT list we need. This wasn’t necessary for the scan operation because we were touching all rows in the scan operation anyway so the data was retrieved at the time of the scan, rather than afterwards. This has to do with the index and the rows we need rather than UNION vs OR. I will say however that the select list is also a factor in the choice of a seek vs scan but we’ll ignore that in this post.
Explanation
Why UNION causes more seeks instead of scans is because each operation needs to meet a certain selectivity requirement in order to qualify for a seek. (Selectivity is the uniqueness of the particular column being filtered). OR’s happen in a single operation so when the selectivity for each column is combined and it goes over a certain percentage then a scan is deemed more efficient.
Since a UNION by default performs a separate operation for each statement, the selectivity of each column is not combined giving it a greater chance of performing a seek. Now since the UNION performs two operations, they need to match their result sets using a concatenation operation above. Generally this is not an expensive operation.
It should also be noted that the OR clause operates just like an IN statement.
Hopefully this tip helps out. I find it to be pretty valuable when working with systems that require high concurrency.
- Comments (RSS)
- Trackback
- Permalink
Just had this issue and splitting it into a union took the number of reads down from 15,000 to 9. Thanks!
I saw a frequently-used query in an EHR go from 25 seconds in duration with tons of locking consistently down to 1 second by converting a huge JOIN statement (from the vendor’s stored procedure) consisting of 20 ORs and 20 numbered foreign keys (yeah) into 20 queries separated apart and re-joined with UNION ALL. This was on a multiple-million row table on a single older Xeon box around 2007 and was the sort of scale problem that seemed to jump out of nowhere after 4 years’ worth of data accumulated, but I’ve never forgotten the improvement it made.
Is it more efficient to run an operation of 346 executions on 346 rows + 358 executions on 358 rows or to run 1 execution on 121,000 rows? The UNION is the former while the OR is the later.