Using Ranking Functions to Deduplicate Data

he introduction of ranking functions in SQL Server 2005 allowed the generation of listings with generated numbers based on sort orders providing keys such as row numbers and rank. These can be used for the deduplication of data in a simple way.

Consider the following simple example of a list of characters:

A
A
B
B
C
D
D
E

Let's put this into a table. This example creates a table variable.

declare @AlphaList table (AlphaKey char);

insert into @AlphaList(AlphaKey) values ('A');
insert into @AlphaList(AlphaKey) values ('A');
insert into @AlphaList(AlphaKey) values ('B');
insert into @AlphaList(AlphaKey) values ('B');
insert into @AlphaList(AlphaKey) values ('C');
insert into @AlphaList(AlphaKey) values ('D');
insert into @AlphaList(AlphaKey) values ('D');
insert into @AlphaList(AlphaKey) values ('E');

select AlphaKey from @AlphaList order by 1;
If we now use the ROW_NUMBER function we can get a listing of @AlphaList table with a sequential list of row numbers.

select
ROW_NUMBER() over (order by AlphaKey) as RowNumber
, AlphaKey
from @AlphaList;
1 A
2 A
3 B
4 B
5 C
6 D
7 D
8 E

Now if we add RANK onto this query, we get an additional column ranking the data. For example, the values of 'A' are assigned a rank of 1 as they are equal, the values of B are assigned the next rank, C the next, etc. This is a simple way of assiging a value by class or category depending on how we order the data - the rank value of 3 simply means that all data with this rank number is the same.

select
RANK() over (order by AlphaKey) as Rank
, ROW_NUMBER() over (order by AlphaKey) as RowNumber
, AlphaKey
from @AlphaList;

1 1 A
1 2 A
3 3 B
3 4 B
5 5 C
6 6 D
6 7 D
8 8 E

From this you can see that the first rows containing the AlphaKey column values of 'A', 'B' and 'D' have the same values generated by the RANK and ROW_NUMBER functions giving us a very simple way of deduplicating the data.

What is happening is that the data is now being ordered and classified or categorised simultaneously so that equal values are adjacent to each other. It all works because

the 'order by' clauses are used to bring the data back as a specific sorted list
the ROW_NUMBER function allocates sequential numbers to each row of data
the RANK function allocates a value to each block of data matching the order criteria

with AlphaRank(Rank, RowNumber, AlphaKey) as (
select
RANK() over (order by AlphaKey) as Rank
, ROW_NUMBER() over (order by AlphaKey) as RowNumber
, AlphaKey
from @AlphaList
)
select AlphaKey
from AlphaRank
where Rank=RowNumber
A
B
C
D
E

Practical Example
This process was used to extract missing data from tables to enable an Item dimension table to be back filled from invoice/credit fact tables. A query was run to extract all the Item numbers that were missing from the Item table, for example:

select ItemNumber, UnitCost into MissingItems from SalesInvoices where ItemNumber not in (select ItemNumber from dimItem)
This produces a table containing all the missing item numbers together with their unit cost. A similar query was run against the 'SalesCredits' table to obtain further rows of data. The extracted data could look something like this:

777 10.10
777 11.11
777 12.12
888 13.13
888 14.14
888 15.15
999 16.16
999 17.17
999 18.18

The following code could then be run to deduplicate the data and, in addition, obtain the lowest unit price for the item number.

with GetMissingItems(Rank, RowNumber, ItemNumber, UnitCost) as (
select
RANK() over (order by ItemNumber, UnitCost) as Rank
, ROW_NUMBER() over (order by ItemNumber, UnitCost) as RowNumber
, ItemNumber
, UnitCost
from MissingItems
)
select
ItemNumber

, UnitCost
from GetMissingItems
where Rank=RowNumber

777 10.10
888 13.13
999 16.16

from here

...it's me...eRiCk

Search This Blog

Using Ranking Functions to Deduplicate Data

Labels

Comments

Popular posts from this blog

Crystal Report 2010 - Failed to load database information

SBO - Internal Error (-5002) Occured in Inventory Transfer

Could not load file or assembly error in VS 2010 and CR