Detecting Row Level Changes Using HASHBYTES

A common situation in a data warehouse is the requirement to detect changes in data in order to track what rows need to be imported. The traditional method of comparing the values of each field is performance intensive. Luckily there are other methods to quickly track the changes that involve creating hashes (or a fingerprint) of a particular data row. In using this method, if we want to synchronize two separate tables, we can simply join on the primary key and only compare this fingerprint column in order to determine what has changed.

There are two major methods I’ve used to create a row-level hash: the CHECKSUM function and the HASHBYTES function. Both of these functions return a single value representing a hash, however their parameters differ. With CHECKSUM you can pass in a list of columns to evaluate and it returns an integer value, whereas HASHBYTES requires a single parameter to be passed in and returns a varbinary value (16 bytes when using MD5). The trick to forcing HASHBYTES into accepting multiple column values is to use the FOR XML clause, which combines the columns into a single value to pass in. The obvious difference between the two functions is the size and datatype of the hash being returned. To make a long story short, there are rare occasions (which I have witnessed more than once) where passing different column values into CHECKSUM returns the exact same value. Granted, this is maybe 2 times for a billion rows, but it has happened. The chances of this happening with HASHBYTES are extremely rare.
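To make the difference concrete, here is a minimal sketch of both calls side by side, assuming the employee_table columns defined later in this article. Note the FOR XML RAW subquery used to feed multiple columns into HASHBYTES:

```sql
SELECT
    -- CHECKSUM: accepts a column list directly, returns an int (collision-prone)
    CHECKSUM(first_name, last_name) AS checksum_value,
    -- HASHBYTES: accepts a single value, so FOR XML RAW combines
    -- the columns into one string before hashing
    HASHBYTES('md5', (SELECT first_name, last_name FOR XML RAW)) AS hashbytes_value
FROM employee_table
```

The correlated subquery has no FROM clause, so first_name and last_name resolve to the outer row, producing one XML fragment (and therefore one hash) per row.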

When columns are added to or removed from an existing HASHBYTES expression, it is important to regenerate all stored hash values for the source and target tables. Otherwise all records will be duplicated on import.
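A sketch of what that regeneration looks like, assuming the example tables defined below (run the equivalent statement against both source and target so the hashes are computed over the same column list):

```sql
-- recompute the stored hash for every row after the hashed column list changes
UPDATE employee_table
SET row_value = HASHBYTES('md5',
    (SELECT first_name, last_name FOR XML RAW))
```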

So, for this example we will use HASHBYTES. I don’t plan to use CHECKSUM anymore, however if space is a concern I won’t rule it out.

For this example we will perform a one-way synchronization of the employee_table to the employee_sync_table by joining their primary keys and comparing the hash. We’ll put a trigger on the employee table to automatically update or insert the hash for the employee values.

Let’s start by creating the DDL for our example tables.

[cc lang="sql"]
-- source table
CREATE TABLE employee_table
(
employee_id int NOT NULL,
first_name varchar(50),
last_name varchar(50),
row_value varbinary(8000)
)
GO
-- trigger to calculate row_value
-- (joins on the inserted pseudo-table so only affected rows are rehashed)
CREATE TRIGGER row_value_trigger ON employee_table
FOR UPDATE, INSERT AS
UPDATE et
SET row_value = HASHBYTES('md5',
    (SELECT et.first_name, et.last_name FOR XML RAW))
FROM employee_table et
JOIN inserted i
    ON i.employee_id = et.employee_id
GO

— target table
CREATE TABLE employee_sync_table
(
employee_sync_table_id int identity(1,1),
employee_id int NOT NULL,
first_name varchar(50),
last_name varchar(50),
row_value varbinary(8000),
datetime_updated datetime
)
[/cc]

Our trigger will fire on insert or update and modify the row_value accordingly. Now let’s insert some values into the employee_table to see the results.
[cc lang="sql"]
INSERT INTO employee_table
(
employee_id,
first_name,
last_name
)
SELECT
employee_id = 1
,first_name = 'Chris'
,last_name = 'Johnson'
UNION
SELECT
employee_id = 2
,first_name = 'Jack'
,last_name = 'Sullivan'
UNION
SELECT
employee_id = 3
,first_name = 'Terri'
,last_name = 'Duffy'

GO

— let’s also insert these rows into our sync table
INSERT INTO employee_sync_table
(
employee_id,
first_name,
last_name,
row_value,
datetime_updated
)
SELECT
employee_id,
first_name,
last_name,
row_value,
datetime_updated = GETDATE()
FROM employee_table et

— view the results
SELECT * FROM employee_table

[/cc]

row_value is the result of HASHBYTES

Now we’ll simulate an employee name change and run our update process to capture changes. Also below is an insert process to capture any inserts that may have happened.

[cc lang="sql"]
— simulate employee name change
UPDATE employee_table
SET last_name = 'Buffy'
WHERE employee_id = 3

— update rows that have changed
UPDATE eht
SET first_name = et.first_name,
last_name = et.last_name,
row_value = et.row_value,
datetime_updated = GETDATE()
FROM employee_sync_table eht
JOIN employee_table et
ON et.employee_id = eht.employee_id
WHERE eht.row_value != et.row_value

— insert rows that have been added
INSERT INTO employee_sync_table
(
employee_id,
first_name,
last_name,
row_value,
datetime_updated
)
SELECT
employee_id,
first_name,
last_name,
row_value,
datetime_updated = GETDATE()
FROM employee_table et
WHERE NOT EXISTS
(
SELECT 1
FROM employee_sync_table eht
WHERE eht.employee_id = et.employee_id
)

— finally, check our results
SELECT * FROM employee_sync_table
[/cc]

There, now we see the name Duffy has been changed to Buffy in our target sync table.
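As a side note, the separate update and insert steps above can also be expressed as a single MERGE statement (available in SQL Server 2008 and later). This is just a sketch using the same example tables, not necessarily better for every workload:

```sql
MERGE employee_sync_table AS target
USING employee_table AS source
    ON source.employee_id = target.employee_id
-- row exists in both tables but the fingerprint differs: update it
WHEN MATCHED AND target.row_value != source.row_value THEN
    UPDATE SET
        first_name = source.first_name,
        last_name = source.last_name,
        row_value = source.row_value,
        datetime_updated = GETDATE()
-- row exists only in the source: insert it
WHEN NOT MATCHED BY TARGET THEN
    INSERT (employee_id, first_name, last_name, row_value, datetime_updated)
    VALUES (source.employee_id, source.first_name, source.last_name,
            source.row_value, GETDATE());
```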

11 comments
madhu 24 May 2017 at 10:17 am

This will work on single source table to destination table.

but incremental load

1.source to staging data pass
2.staging to destination

in that every day staging table will format .

but your scenario will work for that???
can you tell me

Jonas Brock 24 Nov 2016 at 11:12 am

So where in your example, do you update the hashbytes value when the last_name is changed?

John 10 Jun 2016 at 6:38 pm

I have 2000 tables mostly without keys I am currently just pushing and updating just short of a billion rows.
I was thinking I could create a view and add a hashed column as an key and manage updates that way but think the overhead might be too much .. i.e. just as easy to dump and reload.

Thoughts?

Sara 17 Jun 2015 at 1:46 am

Do you have any idea about how to handle change if table does not have any primary key.
For example, I want to update my current_flag=N ,if I found any change in combination of few columns within the table and insert new row for new values with current_flag=Y in table.

Thanks in advance..

Someone 19 Nov 2013 at 2:23 am

Why use a trigger instead of a persisted computed column??

Derek Dieter 10 Dec 2013 at 4:36 am

I just don’t want the hassle of a persisted computed column to be honest.

Christopher Musu` 25 Aug 2016 at 9:32 am

can you please explain the advantages of a trigger as opposed to a persisted computed column? is there a cost difference ?
thanks
Chris

Elicx Villaseca 24 Apr 2013 at 4:20 pm

The idea is good, but CDC (Change Data Capture) exists: enable it in the DB and you’re ready, and then just sync. SSIS has CDC objects for working with that. CDC works from the log file, which is better because the performance is good compared with triggers.

Derek Dieter 13 May 2013 at 8:28 pm

Gone that route too. This is really just an example of using HASHBYTES. But regarding CDC, I have used that before in the past, but don’t assume it has no overhead. It has to read the log file and create tables from it. It also depends on the same technology (replmon reader) that powers transactional replication. So if you run out of log space, you’re SOL and you won’t be getting the changes.

codea 19 Mar 2013 at 1:06 pm

I am not sure about your trigger declaration, I gave a try and it does not work…

Derek Dieter 12 May 2013 at 11:09 pm

Hmm.. I just tested and it works for me.
