Skip to main content

Using Ranking Functions to Deduplicate Data

he introduction of ranking functions in SQL Server 2005 allowed the generation of listings with generated numbers based on sort orders providing keys such as row numbers and rank. These can be used for the deduplication of data in a simple way.
Consider the following simple example of a list of characters:

A
A
B
B
C
D
D
E


Let's put this into a table. This example creates a table variable.
declare @AlphaList table (AlphaKey char);

insert into @AlphaList(AlphaKey) values ('A');
insert into @AlphaList(AlphaKey) values ('A');
insert into @AlphaList(AlphaKey) values ('B');
insert into @AlphaList(AlphaKey) values ('B');
insert into @AlphaList(AlphaKey) values ('C');
insert into @AlphaList(AlphaKey) values ('D');
insert into @AlphaList(AlphaKey) values ('D');
insert into @AlphaList(AlphaKey) values ('E');

select AlphaKey from @AlphaList order by 1;
If we now use the ROW_NUMBER function we can get a listing of @AlphaList table with a sequential list of row numbers.

select
ROW_NUMBER() over (order by AlphaKey) as RowNumber
, AlphaKey
from @AlphaList;
1 A
2 A
3 B
4 B
5 C
6 D
7 D
8 E


Now if we add RANK onto this query, we get an additional column ranking the data. For example, the values of 'A' are assigned a rank of 1 as they are equal, the values of B are assigned the next rank, C the next, etc. This is a simple way of assiging a value by class or category depending on how we order the data - the rank value of 3 simply means that all data with this rank number is the same.
select
RANK() over (order by AlphaKey) as Rank
, ROW_NUMBER() over (order by AlphaKey) as RowNumber
, AlphaKey
from @AlphaList;

1 1 A
1 2 A
3 3 B
3 4 B
5 5 C
6 6 D
6 7 D
8 8 E


From this you can see that the first rows containing the AlphaKey column values of 'A', 'B' and 'D' have the same values generated by the RANK and ROW_NUMBER functions giving us a very simple way of deduplicating the data.

What is happening is that the data is now being ordered and classified or categorised simultaneously so that equal values are adjacent to each other. It all works because

the 'order by' clauses are used to bring the data back as a specific sorted list
the ROW_NUMBER function allocates sequential numbers to each row of data
the RANK function allocates a value to each block of data matching the order criteria
with AlphaRank(Rank, RowNumber, AlphaKey) as (
select
RANK() over (order by AlphaKey) as Rank
, ROW_NUMBER() over (order by AlphaKey) as RowNumber
, AlphaKey
from @AlphaList
)
select AlphaKey
from AlphaRank
where Rank=RowNumber
A
B
C
D
E


Practical Example
This process was used to extract missing data from tables to enable an Item dimension table to be back filled from invoice/credit fact tables. A query was run to extract all the Item numbers that were missing from the Item table, for example:

select ItemNumber, UnitCost into MissingItems from SalesInvoices where ItemNumber not in (select ItemNumber from dimItem)
This produces a table containing all the missing item numbers together with their unit cost. A similar query was run against the 'SalesCredits' table to obtain further rows of data. The extracted data could look something like this:

777 10.10
777 11.11
777 12.12
888 13.13
888 14.14
888 15.15
999 16.16
999 17.17
999 18.18

The following code could then be run to deduplicate the data and, in addition, obtain the lowest unit price for the item number.
with GetMissingItems(Rank, RowNumber, ItemNumber, UnitCost) as (
select
RANK() over (order by ItemNumber, UnitCost) as Rank
, ROW_NUMBER() over (order by ItemNumber, UnitCost) as RowNumber
, ItemNumber
, UnitCost
from MissingItems
)
select
ItemNumber

, UnitCost
from GetMissingItems
where Rank=RowNumber


777 10.10
888 13.13
999 16.16

from here

Comments

Popular posts from this blog

SBO - Internal Error (-5002) Occured in Inventory Transfer

this is something that not well documented in SBO, it took me several hours to look at Google and can't find anything about it. Thank God friend of mine got answer for this one.So the Sympton is something like this, u need to make Inventory Transfer for Batch Item and Serial Item in one single Transaction so the Transaction is the combination of it. ex in detail lines: 1. Serial Item 2. Batch Item 3. Batch Item 4. Batch Item 5. Serial Item 6. Serial Item . When u reach this code oStockTransfer.Add it will return an error "Internal Error (-5002) Occured". This is because ur details is not in order, meaning u need to order the line with combination of Batch first then Serial or Serial first then Batch so be careful with this thing :D see u

Resolve Collation error in SQL Server

i often get this error when trying to JOIN some table Cannot resolve the collation conflict between "SQL_Latin1_General_CP850_CI_AS" and "SQL_Latin1_General_CP1_CI_AS" in the equal to operation. u need to use COLLATION when performing the JOIN, something like this SELECT field1 FROM table1 INNER JOIN table2 ON table1.field3 COLLATE DATABASE_DEFAULT = table2.field4 COLLATE DATABASE_DEFAULT i found this when trying to JOIN, maybe it affect other areas..until i found one see u

Crystal Report 2010 - Failed to load database information

"Failed to load database information" error when i try deploy my program with CR 2010 using TTX file as report definition this is because there is missing .dll, call "CRDB_FIELDDEF.DLL"..u need to manually copy the file to "C:\Program Files (x86)\SAP BusinessObjects\Crystal Reports for .NET Framework 4.0\Common\SAP BusinessObjects Enterprise XI 4.0\win32_x86" folder then the report is loading again it said that TTX is old tech but i'm still using it because i don't need to connect the report to any DB, but until i found new one to replace :D check this link to read it [EDIT 25 Jan 2011] i think i'll using ADO.NET Dataset next time :D until then