Manjuke's Blog: Removing Duplicate Records From a MS SQL Table

Saturday, 25 September 2010

Removing Duplicate Records From a MS SQL Table – (MS SQL 2005 or above)

Have you ever found that a SQL table contains duplicate records because no primary key or auto-increment column was defined, and needed to keep one copy of each record and delete the rest?

The usual approach is to use a temporary table or a cursor. In SQL Server 2005 or above, however, it can be done with a single query.

To illustrate this, I will first create the following table.

create table SampleTable(
    id        int not null,
    name    varchar(20) not null,
    age        int not null
    )

Now I will insert some duplicate records into the table created above.

insert into SampleTable    (id,name,age) values (1,'John',30)
insert into SampleTable    (id,name,age) values (1,'John',30)
insert into SampleTable    (id,name,age) values (1,'John',30)
insert into SampleTable    (id,name,age) values (1,'John',30)
insert into SampleTable    (id,name,age) values (1,'John',30)
insert into SampleTable    (id,name,age) values (2,'Mary',26)
insert into SampleTable    (id,name,age) values (2,'Mary',26)
insert into SampleTable    (id,name,age) values (2,'Mary',26)
insert into SampleTable    (id,name,age) values (2,'Mary',26)
insert into SampleTable    (id,name,age) values (3,'Ann',25)
insert into SampleTable    (id,name,age) values (3,'Ann',25)
insert into SampleTable    (id,name,age) values (3,'Ann',25)
insert into SampleTable    (id,name,age) values (3,'Ann',25)
insert into SampleTable    (id,name,age) values (3,'Ann',25)
insert into SampleTable    (id,name,age) values (4,'James',21)
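
Before counting anything, it can help to see how many copies of each record the inserts above produced. The following is a small sketch of such a check; it is not part of the original method, and the column alias 'copies' is only an illustrative name.

```sql
-- List each distinct record together with the number of copies in the table.
-- Grouping on the real columns (id, name, age) shows the duplicate groups directly.
SELECT id, name, age, COUNT(*) AS copies
FROM SampleTable
GROUP BY id, name, age
ORDER BY id;
```

For the sample data this lists four groups: John with 5 copies, Mary with 4, Ann with 5 and James with 1.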

The following query counts the duplicates, that is, the number of surplus records.

select SUM(rec_count) as rec_count from(
select COUNT (*) - 1 as rec_count from SampleTable group by CHECKSUM(*)
having COUNT(*) > 1
) T

In the above query I subtract one from each group's count (COUNT (*) - 1), since one record per group should remain as the valid record. Strictly speaking you don’t need ‘having COUNT(*) > 1’: for a non-duplicated record COUNT(*) returns 1, so COUNT(*) - 1 is 0 and adds nothing to the sum. It is there only for readability, and it sits in the inner query so that it filters individual groups. If you execute the query you will get 11 as the record count (15 records in total, of which 4 are valid, so 15 - 4 = 11).

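Notice that I have used ‘CHECKSUM(*)’. This avoids typing all the column names; without it, the query would use ‘group by id,name,age’. Be aware that CHECKSUM is a hash, and different rows can occasionally produce the same value, so grouping on the actual columns is the safer choice when the data matters.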
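
As a sketch of the longer form mentioned above, the duplicate count can be written with the column names spelled out instead of CHECKSUM(*). It assumes the three columns of SampleTable are the complete definition of a duplicate.

```sql
-- Count surplus rows by grouping on the actual columns rather than a checksum.
-- Each group contributes (copies - 1); groups with a single row contribute 0.
SELECT SUM(rec_count) AS rec_count
FROM (
    SELECT COUNT(*) - 1 AS rec_count
    FROM SampleTable
    GROUP BY id, name, age
) AS T;
```

For the sample data this also returns 11, since the groups contribute 4, 3, 4 and 0.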

Finally, we can build the query to delete the duplicates. First we must find the valid records, which should not be deleted. To do this we use the ‘ROW_NUMBER’ function: we assign a unique row number to each record, ordered by checksum so that identical records get consecutive numbers, and then select the maximum row number in each group. That gives exactly one record per group.

select MAX(row_num) from (
select ROW_NUMBER() over (order by checksum(*)) as row_num, CHECKSUM(*) as ChkSum  
from SampleTable
) as T Group By ChkSum

For the sample data this returns row numbers 5, 9, 14 and 15, the valid records which we must keep. (The exact numbers depend on how the checksum values happen to sort.) We must delete only the records whose row number is not among those returned by the above query. First we’ll select those records, purely as a check, using the following query.

select T.* from (
    select ROW_NUMBER() over (order by checksum(*)) as row_num,
           CHECKSUM(*) as ChkSum
    from SampleTable
) as T
where T.row_num not in (
    select MAX(row_num) from (
        select ROW_NUMBER() over (order by checksum(*)) as row_num,
               CHECKSUM(*) as ChkSum
        from SampleTable
    ) as G group by ChkSum
)

If you look closely at the result, row numbers 5, 9, 14 and 15 are not there, so we can be sure that we are deleting the correct set of records. To delete the duplicates we can use the following query.

    
delete T from (
    select ROW_NUMBER() over (order by checksum(*)) as row_num,
           CHECKSUM(*) as ChkSum
    from SampleTable
) as T
where T.row_num not in (
    select MAX(row_num) from (
        select ROW_NUMBER() over (order by checksum(*)) as row_num,
               CHECKSUM(*) as ChkSum
        from SampleTable
    ) as G group by ChkSum
)

If you now query the table, you will find only one record each for John, Mary, Ann and James.
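
For comparison, here is a sketch of a commonly used alternative that is not the method described in this post: number the rows within each duplicate group using PARTITION BY and delete everything after the first. It assumes a duplicate means identical id, name and age; the CTE name 'Numbered' is only an illustrative choice. Because it partitions on the real columns, it does not rely on CHECKSUM.

```sql
-- Number the rows inside each group of identical (id, name, age) values.
-- ORDER BY (SELECT NULL) is used because the copies are indistinguishable,
-- so it does not matter which one ends up as row 1.
WITH Numbered AS (
    SELECT ROW_NUMBER() OVER (
               PARTITION BY id, name, age
               ORDER BY (SELECT NULL)
           ) AS row_num
    FROM SampleTable
)
-- Deleting through the CTE removes the underlying rows from SampleTable,
-- keeping exactly one row per group.
DELETE FROM Numbered
WHERE row_num > 1;
```

If this runs as part of a larger batch, the statement before WITH must be terminated with a semicolon.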
