Detecting Row Level Changes Using HASHBYTES

A common requirement in a data warehouse is detecting changes in data in order to track which rows need to be imported. The traditional method of comparing the values of each field is performance intensive. Luckily, there are other methods that track changes quickly by creating a hash (or fingerprint) of a particular data row. Using this method, if we want to synchronize two separate tables, we can simply join on the primary key and compare only the fingerprint column to determine what has changed.

There are two major methods I’ve used to create a row-valued hash key. The first is the CHECKSUM function. The other is the HASHBYTES function. Both functions return a single value representing a hash, but their parameters differ. CHECKSUM accepts a list of columns to evaluate and returns an integer value, whereas HASHBYTES accepts only a single input value (along with the name of the hashing algorithm) and returns a varbinary value (16 bytes when the MD5 algorithm is used). The trick to getting HASHBYTES to accept multiple column values is to use the FOR XML clause, which generates a single value to pass in.

The obvious difference between the two functions is the size and data type of the hash being returned. To make a long story short, there are rare occasions (that I have witnessed more than once) where passing different column values into CHECKSUM returns the exact same value. Granted […]
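As a rough illustration of the two approaches described above, the sketch below computes both a CHECKSUM and a HASHBYTES fingerprint for each row, then compares two tables on the fingerprint. The table names (dbo.Person_Source, dbo.Person_Target), the RowHash column, and the choice of columns are assumptions made for illustration. MD5 is used only because it yields the 16-byte value mentioned above.

[cc lang="sql"]
-- Hypothetical tables and columns, for illustration only.
-- CHECKSUM takes a column list and returns an int.
-- HASHBYTES takes one input, so FOR XML RAW folds the columns into a single string.
SELECT
    s.PersonID,
    CHECKSUM(s.FirstName, s.LastName, s.EmailAddress) AS RowChecksum,
    HASHBYTES('MD5',
        (SELECT s.FirstName, s.LastName, s.EmailAddress FOR XML RAW)
    ) AS RowHash
FROM dbo.Person_Source AS s;

-- Rows whose fingerprint differs between source and target.
-- Assumes dbo.Person_Target stores the hash in a RowHash column.
SELECT s.PersonID
FROM dbo.Person_Source AS s
JOIN dbo.Person_Target AS t
    ON t.PersonID = s.PersonID
WHERE t.RowHash <> HASHBYTES('MD5',
        (SELECT s.FirstName, s.LastName, s.EmailAddress FOR XML RAW));
[/cc]

Note that on versions before SQL Server 2016, the input to HASHBYTES is limited to 8000 bytes, so very wide rows may need to be split or trimmed before hashing.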


How to Create a Type2 SCD (Slowly Changing Dimension)

This article could just as well be called “creating a historical snapshot table.” This type of table is also referred to as a dimension, depending on what kind of data repository it’s located in. Personally, I prefer to keep a historical snapshot table in a normalized data store that contains history. This normalized data store is typically the first stopping point from the source system. It is useful because it keeps historical snapshots of what the data looked like in the source system at any point in time.

To get started, let’s create a history table we will use to store the historical values. From above, we see that we have 4 additional columns:

  • Person_HistoryID: a surrogate key specific to our new table.
  • ChkSum: a CHECKSUM of all the columns, used to compare data discrepancies.
  • DateTime_From: the date from which this record is effective.
  • DateTime_To: the date from which this record is no longer effective.

First, let’s create our sample source table and populate it with some data:

[cc lang="sql"]
-- Sample source table
CREATE TABLE Person(
    PersonID int IDENTITY(1,1) NOT NULL,
    Title nvarchar(8) NULL,
    FirstName nvarchar(50) NOT NULL,
    MiddleName nvarchar(50) NULL,
    LastName nvarchar(50) NOT NULL,
    EmailAddress nvarchar(50) NULL,
    Phone nvarchar(25) NULL,
    ModifiedDate datetime NOT NULL
)

-- Allow explicit values for the identity column while loading sample rows
SET IDENTITY_INSERT [dbo].[Person] ON

INSERT [dbo].[Person] ([PersonID], [Title], [FirstName], [MiddleName], [LastName], [EmailAddress], [Phone], [ModifiedDate])
VALUES (1, N'Mr.', N'Gustavo', NULL, N'Achong', N'[email protected]', N'398-555-0132', CAST(0x000096560110E30E AS DateTime))

INSERT [dbo].[Person] ([PersonID], [Title], [FirstName], [MiddleName], [LastName], [EmailAddress], [Phone], [ModifiedDate])
VALUES (2, N'Ms.', N'Catherine', N'R.', […]
[/cc]
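The history table definition itself does not appear in this excerpt. A minimal sketch of what it might look like, built from the source table plus the four additional columns described above, is shown here. The table name dbo.Person_History, the nullability of each added column, and the use of NULL in DateTime_To for the current version are all assumptions, not the article’s actual definition.

[cc lang="sql"]
-- Sketch only: the table name, nullability, and key choices are assumptions.
CREATE TABLE dbo.Person_History(
    Person_HistoryID int IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- surrogate key
    PersonID         int NOT NULL,       -- key carried over from the source Person table
    Title            nvarchar(8) NULL,
    FirstName        nvarchar(50) NOT NULL,
    MiddleName       nvarchar(50) NULL,
    LastName         nvarchar(50) NOT NULL,
    EmailAddress     nvarchar(50) NULL,
    Phone            nvarchar(25) NULL,
    ModifiedDate     datetime NOT NULL,
    ChkSum           int NOT NULL,       -- CHECKSUM of the tracked columns
    DateTime_From    datetime NOT NULL,  -- when this version became effective
    DateTime_To      datetime NULL       -- assumed NULL while the version is current
)
[/cc]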


Extracting Data from a Source System to History Tables

This is a topic I haven’t found much written about, yet nearly every system I’ve worked with needs this exact functionality. It is important that data be extracted in a way that does not hinder the performance of the source system. In this example, the goal is to extract data from a source system into another database (or server) while using as few resources as possible. This is why I choose to pull from a source system in two separate stages.

First Stage – Staging Import

The first step is to run a very simple select statement into a staging table. This first select statement may do some ETL, mostly lookups that are needed from the source system. There could be multiple select statements pulling data into multiple staging tables. I prefer to pull tables from the source to staging in a one-to-one relationship, so for every table we need, we also have a corresponding staging table. See the diagram below:

The reason for pulling one to one is simple. First, the query is a very simple select. Second, it makes troubleshooting very simple. After importing into staging, the next step is to move the records to the history table(s).

Second Stage – Historical Import

In the historical import, we compare what we have in our history table with what is in staging. Each record in staging is joined with the corresponding current record in the history […]
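The comparison in the second stage might be sketched roughly as follows. The excerpt ends before the actual query is shown, so this is only an assumed shape: dbo.Person_Staging and dbo.Person_History are hypothetical table names, the current history row is assumed to be the one with a NULL DateTime_To, and changes are assumed to be detected with a stored ChkSum column.

[cc lang="sql"]
-- Hypothetical names: dbo.Person_Staging mirrors the source table one to one,
-- dbo.Person_History keeps one current row per PersonID (DateTime_To IS NULL).
SELECT s.PersonID,
       CASE WHEN h.PersonID IS NULL THEN 'New' ELSE 'Changed' END AS RowStatus
FROM dbo.Person_Staging AS s
LEFT JOIN dbo.Person_History AS h
    ON h.PersonID = s.PersonID
   AND h.DateTime_To IS NULL          -- join only to the current history record
WHERE h.PersonID IS NULL              -- no history yet: a new row
   OR h.ChkSum <> CHECKSUM(s.Title, s.FirstName, s.MiddleName, s.LastName,
                           s.EmailAddress, s.Phone);  -- fingerprint differs: a changed row
[/cc]

Because the comparison runs against staging rather than the source system, the source only ever sees the simple select from the first stage.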


SQL Server 2008 Minimally Logged Inserts

SQL Server 2008 introduced minimally logged inserts into tables that already contain data and have a clustered index. The initial inserts may still be fully logged if the data pages they fill already contain data. However, any new data pages added to the table will be minimally logged if all of the requirements below are met:

  • Trace flag 610 must be on
  • The database recovery model must be Bulk-Logged or Simple
  • Inserted data must be ordered by the clustered index

To turn on the trace flag for your current session:

[cc lang="sql"]
-- Enable trace flag 610 for the current session only
DBCC TRACEON (610)

-- The source table name is missing from the original listing;
-- dbo.MySourceTable is a placeholder. ORDER BY 1 assumes the first
-- column is the clustered index key of dbo.MyTable.
INSERT INTO dbo.MyTable
SELECT *
FROM dbo.MySourceTable
ORDER BY 1

DBCC TRACEOFF (610)
[/cc]

This change differs dramatically from the previous requirements for minimal logging. Previously, a target table with a clustered index had to be empty, and a table lock had to be acquired on the target table. For more information, visit: Minimal Logging Changes – MSDN Blog
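Before relying on minimal logging, it can help to confirm the first two requirements listed above. A small sketch, assuming it is run in the target database:

[cc lang="sql"]
-- Recovery model of the current database (SIMPLE, BULK_LOGGED, or FULL)
SELECT name, recovery_model_desc
FROM sys.databases
WHERE name = DB_NAME();

-- Reports whether trace flag 610 is enabled for the session or globally
DBCC TRACESTATUS (610);
[/cc]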


Quick Table Transfers (Imports) using SSIS, Bulk Insert or BCP

Ever wonder why a data transfer can sometimes be lightning fast, while other times you’re watching sp_who2 wondering when it’s going to finish? It’s likely you’re noticing the difference between minimal logging and full logging. Even under the Simple recovery model, inserted rows can be written in full to both the transaction log and the data pages. The easiest way to take advantage of minimal logging is to set the database recovery model to Simple, drop all indexes on the target table, then use SSIS, DTS, or BULK INSERT to transfer the data in.

The speed of inserting data in SQL Server depends largely on how many writes occur to the transaction log. These writes occur in two different modes, minimal logging and full logging. Minimal logging writes directly to the data page, then writes only a pointer to the data page in the transaction log, while full logging writes the content of all the rows to the transaction log prior to inserting them into the data page. Needless to say, in order to take advantage of quick inserts, you will want to employ minimal logging. There are, however, a few prerequisites:

  • The recovery model of the target table’s database must be either Simple or Bulk-Logged
  • If the target table contains a clustered index, it cannot contain data
  • A table lock must be acquired on the target table
  • The table cannot be part of a replication scheme
  • If the table contains a nonclustered index, the index itself will be […]
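As a hedged sketch of the BULK INSERT route under the prerequisites listed above, the statements below switch the database to the Simple recovery model and load a file while requesting a table lock. The database name, table name, file path, and terminators are all hypothetical, and the target is assumed to be an empty table (or a heap) that is not replicated.

[cc lang="sql"]
-- Hypothetical database, table, and file names.
ALTER DATABASE MyDatabase SET RECOVERY SIMPLE;

BULK INSERT dbo.TargetTable
FROM 'C:\import\target_table.csv'
WITH (
    TABLOCK,                 -- take a table lock on the target for the load
    FIELDTERMINATOR = ',',   -- assumed column delimiter in the file
    ROWTERMINATOR = '\n'     -- assumed row delimiter in the file
);
[/cc]

Changing the recovery model affects what can be restored from log backups, so it is worth confirming that the change is acceptable before applying it to a production database.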

