Thursday, September 4, 2008

Four different ways to find duplicates

-
We must have probably come across duplicate records in our dataset & tried removing those by using nodup and nodupkey options.

I
-
Use of nodup will remove exact duplicates as given below:

proc sort data = draft out = draft1 nodup;
by _all_ ;
run;
-
NAME AGE SEX
Rama _27_ M
Rama _27_ M
-
II
-
Where as the use of nodupkey looks at the by variables in the sort procedure and keeps the first records of those by variables (it removes the rest of the records when encountered again!)

proc sort data = draft out = draft1 nodupkey;
by petestcd;
run;

PETESTCD
Neck
Abdominal
Abdominal
Abdominal *these are the two records that will be removed by nodupkey;
HEENT
-
III
-
The duplicate records in the dataset may result in the display of the Note => “MERGE statement has more than one data set with repeats of BY values” in the log. To determine this one can use "if first. ne last." method.
-
data check;
set dups;
by id;
if first.id ne last.id; *brings the duplicates;
run;
-
This was quite confusing to me as my little brain perceived it this way:

"Y Y first. ne last.?????
If an ID is repeated twice in the dataset then first dot is equal to last dot....

ID
001
001

but SAS THINKS this way

ID First. Last.
001 1---- 0
001 0---- 1

002 1---- 1
003 1---- 1

Records that are not duplicates will have FIRST.ID = 1 and LAST.ID = 1, so we select records where first dot is not equal to last dot.
-
IV
-
The last way is the use of freq:

proc freq data = draft noprint ;
table ID / out = draftdup (keep = ID Count where = (Count > 1)) ;
run;
-
-

No comments:

Post a Comment