Transferring Large Amounts of Data using Batch Inserts
Posted on May 31, 2009 by Derek Dieter
Below is a technique used to transfer a large number of records from one table to another. This scales pretty well, for a couple of reasons. First, it will not fill up the entire transaction log before committing; instead, it populates the table in chunks of 10,000 records. Second, it’s generally much quicker. You will have to play around with the batch size: sometimes it’s more efficient at 10,000, sometimes at 500,000, depending on the system.
If you do not need to insert into an existing table and just need a copy of the table, it is better to do a SELECT INTO. However, for this example, we are inserting into an existing table.
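For reference, a SELECT INTO version is just a couple of lines. This is only a sketch; the [dbo].[DestinationCopy] name is hypothetical, and the statement creates that table for you:
[cc lang="sql"]
-- Creates the destination table on the fly; minimally logged under
-- the SIMPLE or BULK_LOGGED recovery model.
SELECT *
INTO [dbo].[DestinationCopy]
FROM [dbo].[Source];
[/cc]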
Another trick is to change the recovery model of the database to simple. This way, there will be much less logging in the transaction log.
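The switch itself is a one-liner; the database name below is a placeholder, so adjust it for your environment:
[cc lang="sql"]
-- Placeholder database name. Switch back to FULL (and take a full backup)
-- afterwards if you need point-in-time recovery.
ALTER DATABASE [YourDatabase] SET RECOVERY SIMPLE;
[/cc]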
The WITH (TABLOCK) hint shown commented out below only works in SQL Server 2008.
[cc lang="sql"]
DECLARE @BatchSize int = 10000

WHILE 1 = 1
BEGIN

	INSERT INTO [dbo].[Destination] --WITH (TABLOCK)  -- Uncomment for 2008
	(
		FirstName
		,LastName
		,EmailAddress
		,PhoneNumber
	)
	SELECT TOP(@BatchSize)
		s.FirstName
		,s.LastName
		,s.EmailAddress
		,s.PhoneNumber
	FROM [dbo].[Source] s
	WHERE NOT EXISTS
	(
		SELECT 1
		FROM dbo.Destination
		WHERE PersonID = s.PersonID
	)

	IF @@ROWCOUNT < @BatchSize BREAK

END
[/cc]
With the above example, it is important to have at least a nonclustered index on PersonID in both tables; otherwise the NOT EXISTS check has to scan the destination table on every batch.
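As a rough sketch, the supporting indexes could look like this (the index names are made up; adjust to your own naming convention):
[cc lang="sql"]
-- The Destination index is the one the NOT EXISTS check relies on;
-- the Source index helps the SELECT side of each batch.
CREATE NONCLUSTERED INDEX IX_Destination_PersonID ON [dbo].[Destination] (PersonID);
CREATE NONCLUSTERED INDEX IX_Source_PersonID ON [dbo].[Source] (PersonID);
[/cc]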
Another way to transfer records is to use multiple threads, each handling its own range of records, as shown below:
[cc lang="sql"]
INSERT INTO [dbo].[Destination]
(
	FirstName
	,LastName
	,EmailAddress
	,PhoneNumber
)
SELECT
	s.FirstName
	,s.LastName
	,s.EmailAddress
	,s.PhoneNumber
FROM [dbo].[Source] s
WHERE s.PersonID BETWEEN 1 AND 5000
GO

INSERT INTO [dbo].[Destination]
(
	FirstName
	,LastName
	,EmailAddress
	,PhoneNumber
)
SELECT
	s.FirstName
	,s.LastName
	,s.EmailAddress
	,s.PhoneNumber
FROM [dbo].[Source] s
WHERE s.PersonID BETWEEN 5001 AND 10000
[/cc]
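If the PersonID values are not evenly distributed, one way to work out the boundaries for each session is to bucket the source first. This is just a sketch, assuming PersonID is a unique integer key and four parallel sessions:
[cc lang="sql"]
-- Returns one row per bucket with the PersonID boundaries to plug into
-- each session's BETWEEN clause; change NTILE(4) to the number of sessions.
;WITH Buckets AS
(
	SELECT PersonID, NTILE(4) OVER (ORDER BY PersonID) AS BucketNo
	FROM [dbo].[Source]
)
SELECT
	BucketNo
	,MIN(PersonID) AS RangeStart
	,MAX(PersonID) AS RangeEnd
FROM Buckets
GROUP BY BucketNo
ORDER BY BucketNo;
[/cc]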
For the fastest performance, however, I'd recommend using SSIS, especially in SQL Server 2008. We recently transferred 17 million records in 5 minutes with an SSIS package executed on the same server as the two databases it transferred between.
SQL Server 2008
SQL Server 2008 has made changes with regard to its logging mechanism when inserting records. Previously, to do an insert that was minimally logged, you would have to perform a SELECT ... INTO. Now, you can perform a minimally logged insert if you can lock the table you are inserting into (the commented-out TABLOCK hint in the first example shows this). The exception to this rule is if you have a clustered index on the table AND the table is not empty. If the table is empty, you acquire a table lock, and you have a clustered index, the insert will be minimally logged; however, if the table already has data, the insert will be fully logged. If you have a nonclustered index on a heap and you acquire a table lock, then only the nonclustered index will be logged. It is always better to drop indexes prior to inserting records.
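As a standalone sketch, a table-locked insert on SQL Server 2008 looks like this (same hypothetical tables as the first example):
[cc lang="sql"]
-- Minimally logged on SQL Server 2008, subject to the index and
-- empty-table rules described above.
INSERT INTO [dbo].[Destination] WITH (TABLOCK)
(
	FirstName
	,LastName
	,EmailAddress
	,PhoneNumber
)
SELECT
	s.FirstName
	,s.LastName
	,s.EmailAddress
	,s.PhoneNumber
FROM [dbo].[Source] s;
[/cc]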
To determine the amount of logging, you can use the following statement:
[cc lang="sql"]
SELECT * FROM ::fn_dblog(null, null)
[/cc]
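To make the output easier to compare between runs, you can also aggregate it. This is a sketch assuming the standard Operation and [Log Record Length] columns that fn_dblog returns:
[cc lang="sql"]
-- Fewer log records and fewer logged bytes for the insert operations
-- indicate the statement is being minimally logged.
SELECT
	Operation
	,COUNT(*) AS LogRecords
	,SUM([Log Record Length]) AS LogBytes
FROM ::fn_dblog(null, null)
GROUP BY Operation
ORDER BY LogBytes DESC;
[/cc]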
I used your code to transfer about 500 million rows of data, but I ended up with null values in some of the fields on the new table that aren't in the source table. Is there a way to avoid this? Thanks
Thank you for the tip with BREAK on @@ROWCOUNT! Until now I was using only the WHILE condition to catch the exit point. It's safer, but not as expressive as using BREAK.
And what do you think about this approach?
DECLARE @BatchSize INT = 10000;
WHILE @BatchSize > 0 BEGIN
(…)
IF @@ROWCOUNT < @BatchSize BREAK;
END
When you test @BatchSize against zero, you get the same logical result as an infinite loop condition, BUT you get a bonus: the condition also checks that the batch size is a nonzero "constant". "1 = 1" has no purpose other than to always say yes.
My test is a potential source of failure if @BatchSize is not a constant like it is in these examples. In that case the trivial condition ("1 = 1") is the safer approach.
Guys, here is another quick way to copy huge data from one table to another in small chunks.
declare @rowCount int=1
while @rowCount != 0
begin
insert into destinationTable
select top (100) * from sourceTable st
where not exists(select 1 from destinationTable dt where st.Key=dt.key)
set @rowCount=@@rowcount
end
SSIS is great, but not when selecting millions of rows prior to inserting. I like the first example, but I have a question concerning the line:
IF @@ROWCOUNT < @BatchSize BREAK
This will end up with some records below 10,000 not inserted into the destination table. Am I right?
If so can I replace this line by the following:
IF @@ROWCOUNT = 0 BREAK
Actually, I think using SSIS for large dataset migrations makes the most sense. It's good at breaking down millions of rows into manageable batch sizes, which can be tuned accordingly. It also makes cross-server migrations much easier.
Even right-clicking on the database and selecting Import/Export Data (which invokes an SSIS component) makes it even simpler. (Then you can save the package for later re-use.)
I thought the same thing at first, but think of the loop like this:
@BatchSize = 10,000
Dataset to load = 15,000
First iteration: inserts 10,000; @@ROWCOUNT = 10,000
Second iteration: inserts 5,000; @@ROWCOUNT = 5,000, which means there is nothing left to load
@@ROWCOUNT < 10,000, so it breaks