Selecting Top N Records per Group
-
Posted on September 5, 2012 by Derek Dieter
-
4
When you have multiple rows for let’s say a productID, there will probably be times when you need to select the top n subset of each those productid’s based upon a sort criteria. This is a relatively common TSQL need and there are a couple different methods to do this.
First let’s create some dummy data.
[cc lang=”sql”]
CREATE TABLE [#productzabub](
[SalesOrderID] [int] NOT NULL,
[ProductID] [int] NOT NULL
)
INSERT [#productzabub] ([SalesOrderID], [ProductID]) VALUES (71782, 707)
INSERT [#productzabub] ([SalesOrderID], [ProductID]) VALUES (71783, 707)
INSERT [#productzabub] ([SalesOrderID], [ProductID]) VALUES (71784, 707)
INSERT [#productzabub] ([SalesOrderID], [ProductID]) VALUES (71782, 708)
INSERT [#productzabub] ([SalesOrderID], [ProductID]) VALUES (71783, 708)
INSERT [#productzabub] ([SalesOrderID], [ProductID]) VALUES (71784, 708)
INSERT [#productzabub] ([SalesOrderID], [ProductID]) VALUES (71782, 711)
INSERT [#productzabub] ([SalesOrderID], [ProductID]) VALUES (71783, 711)
INSERT [#productzabub] ([SalesOrderID], [ProductID]) VALUES (71784, 711)
INSERT [#productzabub] ([SalesOrderID], [ProductID]) VALUES (71782, 712)
INSERT [#productzabub] ([SalesOrderID], [ProductID]) VALUES (71783, 712)
INSERT [#productzabub] ([SalesOrderID], [ProductID]) VALUES (71784, 712)
INSERT [#productzabub] ([SalesOrderID], [ProductID]) VALUES (71782, 714)
INSERT [#productzabub] ([SalesOrderID], [ProductID]) VALUES (71783, 714)
INSERT [#productzabub] ([SalesOrderID], [ProductID]) VALUES (71784, 714)
INSERT [#productzabub] ([SalesOrderID], [ProductID]) VALUES (71782, 715)
INSERT [#productzabub] ([SalesOrderID], [ProductID]) VALUES (71783, 715)
INSERT [#productzabub] ([SalesOrderID], [ProductID]) VALUES (71784, 715)
INSERT [#productzabub] ([SalesOrderID], [ProductID]) VALUES (71783, 716)
INSERT [#productzabub] ([SalesOrderID], [ProductID]) VALUES (71784, 716)
[/cc]
Now that we have our data created, let’s select the top 2 rows for each productid based upon the earliest salesorderid. This salesorderid could actually represent any sequential column in the table, be it a date, or a primary key.
The first method we’ll use is to select using the new window function ROW_NUMBER() introduced in SQL 2005. What we’re doing here is simply creating a numbering system that enumerates each instance of the productid ordered by the salesorderid. We then do another select on this query.
[cc lang=”sql”]
SELECT TOP 20
t.SalesOrderID,
t.ProductID
FROM (
SELECT
sod.SalesOrderID,
sod.ProductID,
rowid = ROW_NUMBER() OVER (PARTITION BY ProductID ORDER BY SalesOrderID)
FROM #productzabub sod
) t
WHERE rowid <= 2
[/cc]
The next method we'll use is compatible with all versions of SQL. Here we are doing a self-join using a correlated subquery. Meaning we are correlating the productid from the outer query, with the productid from the inner query. We then select the top two salesorderid's from each instance, constraining our outer result set to the top 2 rows for product.
[cc lang="sql"]
SELECT
sod.SalesOrderID,
sod.ProductID
FROM #productzabub sod
WHERE SalesOrderID IN
(
SELECT TOP 2 SalesOrderID
FROM #productzabub
WHERE ProductID = sod.ProductID
ORDER BY SalesOrderID
)
[/cc]
That's pretty much it. Personally I prefer using the window functions when I can, they typically have a speed advantage over other methods because they don't join anything, just sort by doing an order by internally. You may see other methods using a CTE. This is not actually another method, it’s simply the same as the windowed function execution-wise.
- Comments (RSS)
- Trackback
- Permalink
I just took a looping query’s execution time from 40 seconds to 1 second with this info. THANK YOU! Awesome stuff. I learned something new today.