Gossamer Forum

Finding duplicated records

I have a medium-sized database. Three different people added the records directly in Excel, and everything is working perfectly.

The problem is that I have duplicated records, with different IDs but the same name.

I need a script that can find the duplicated records, matching not only by name but by telephone and city too, to make sure they are not just restaurants from the same chain.

I searched the forum but found nothing...

Thanks.
Re: Finding duplicated records
Try this thread:

Check for duplicate records - http://www.gossamer-threads.com/scripts/forum/resources/Forum5/HTML/001295.html

Hope this is what you are looking for. :)
Re: Finding duplicated records
Thanks for the reply, but my problem is different.

I already have duplicated records in the DB. I need an external script to search for them.
Re: Finding duplicated records
I was thinking of something like the code below. I'm quite sure it won't work as written; I'm still a beginner and left my Perl book at work.

When the code finished running, I would have 3 files: the original default.db, default2.db (the new file without duplicated records), and duplicated.txt (with all the dups; I could use this file later to check the records against more accurate data).

----------------------
#!/usr/bin/perl

open(DB, "default.db") || die "Open: $!";
while (<DB>) {
    @fields = split(/\t/, $_);
    open(DB2, "default2.db") || die "Open: $!";
    while (<DB2>) {
        if ((/$fields[1]/) && (/$fields[3]/) && (/$fields[4]/) && (/$fields[5]/)) {
            ## assume it's duplicated, so append to duplicated.txt
        }
        else {
            ## Not duplicated, append to default2.db
        }
    }
    close(DB2);
}
-------------------------
Thanks for the help.
Re: Finding duplicated records
You're right. That won't work. :)

Off the top of my head, it seems what you would need to do is to put all of your data into associative arrays first. Then go through each one to see if there is a match with one further down on all of the fields. If so, append the data to your "duplicates" file and delete it from the array. Then rewrite your non-duplicated entries.
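
A minimal sketch of that idea, not a tested DBMan tool. It assumes a tab-delimited file (as in the code above) and that name, telephone and city sit in columns 1, 3 and 4; those positions and all the sub names are hypothetical, so adjust them to your own default.cfg. It keeps the first record seen for each key and treats later ones as duplicates.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical column positions (0-based) for name, telephone and city.
# Adjust these to match your own database definition.
my @KEY_COLS = (1, 3, 4);

# Build a comparison key from the chosen columns: lowercase the values and
# collapse whitespace so trivial typing differences do not matter.
sub record_key {
    my ($line, $delim) = @_;
    my @fields = split /\Q$delim\E/, $line, -1;
    return join "\x00", map {
        my $v = defined $fields[$_] ? lc $fields[$_] : '';
        $v =~ s/\s+/ /g;
        $v =~ s/^ | $//g;
        $v;
    } @KEY_COLS;
}

# Return (\@unique, \@duplicates). The first record seen with a given key
# is kept; every later record with the same key counts as a duplicate.
sub split_duplicates {
    my ($lines, $delim) = @_;
    my (%seen, @unique, @dups);
    for my $line (@$lines) {
        if ($seen{ record_key($line, $delim) }++) {
            push @dups, $line;
        }
        else {
            push @unique, $line;
        }
    }
    return (\@unique, \@dups);
}

# Read $in, write non-duplicates to $out and duplicates to $dup_file.
# Returns the number of duplicates found. The input file is not modified.
sub dedupe_file {
    my ($in, $out, $dup_file, $delim) = @_;
    open my $fh, '<', $in or die "Open $in: $!";
    chomp(my @lines = <$fh>);
    close $fh;

    my ($unique, $dups) = split_duplicates(\@lines, $delim);

    open my $ofh, '>', $out or die "Open $out: $!";
    print {$ofh} "$_\n" for @$unique;
    close $ofh;

    open my $dfh, '>', $dup_file or die "Open $dup_file: $!";
    print {$dfh} "$_\n" for @$dups;
    close $dfh;

    return scalar @$dups;
}
```

Calling dedupe_file('default.db', 'default2.db', 'duplicated.txt', "\t") would produce the three files described earlier in the thread. Using a hash keyed on the compared fields means each record is looked at once, instead of re-reading the second file for every line of the first.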

Make sense?



------------------
JPD





Re: Finding duplicated records
I am very new to DBMan and would not begin to know how to do this in DBMan. I have done this kind of work on several occasions, and it is a very difficult challenge. The first step is, of course, to eliminate exact duplicates; this is easy. Then the fun starts: you need to deal with things like two records that are the same except for "Bob" and "Robert", or "Dot" and "Dorothy", or "NY" and "New York". The list is endless. Then there are two records that are the same except for the phone number.

You get the idea.
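
To illustrate the kind of clean-up involved, here is a rough sketch that normalizes names, places and phone numbers before comparing them. The lookup tables hold only the examples mentioned above and the sub names are made up for illustration; real data would need its own, much longer lists.

```perl
use strict;
use warnings;

# Illustrative lookup tables only; real data needs its own lists.
my %NICKNAME = (bob => 'robert', dot => 'dorothy');
my %PLACE    = (ny  => 'new york');

# Lowercase, replace punctuation with spaces, and expand known short
# forms one word at a time.
sub normalize_name {
    my ($name) = @_;
    my $n = lc $name;
    $n =~ s/[^a-z0-9 ]+/ /g;
    return join ' ',
        map { exists $NICKNAME{$_} ? $NICKNAME{$_} : $_ } split ' ', $n;
}

# Lowercase, tidy whitespace, then expand a known abbreviation if the
# whole value matches one.
sub normalize_place {
    my ($place) = @_;
    my $p = lc $place;
    $p =~ s/[^a-z0-9 ]+/ /g;
    $p = join ' ', split ' ', $p;
    return exists $PLACE{$p} ? $PLACE{$p} : $p;
}

# Keep digits only, so "(555) 123-4567" and "555.123.4567" compare equal.
sub normalize_phone {
    my ($phone) = @_;
    (my $digits = $phone) =~ s/\D//g;
    return $digits;
}
```

Comparing the normalized values instead of the raw fields catches the "Bob"/"Robert" and "NY"/"New York" cases, but only for the variants you have thought to list.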

I do not mean to discourage you, but if you have a large data set this task can take a long time.

I have found that it takes several passes through the data, distilling it each time. Each data set is unique, so the distillation routines need to be written for your data. In the end I almost always end up with a routine that finds two possible duplicate records, displays them side by side, and lets the user decide whether they are duplicates and, if so, whether they should be merged or which one should be deleted.
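
A sketch of what such a review routine might look like, under assumptions: each record is a hash reference with the hypothetical fields id, name, phone and city, and two records count as a "possible duplicate" when at least a chosen number of those fields match. It compares every pair, which is fine for a medium-sized file but slow for a very large one.

```perl
use strict;
use warnings;

# Hypothetical field names; each record is a hash reference.
my @COMPARE = qw(name phone city);

# Count how many of the compared fields match, ignoring case.
sub match_score {
    my ($x, $y) = @_;
    return scalar grep { lc($x->{$_}) eq lc($y->{$_}) } @COMPARE;
}

# Return [index, index] pairs for records agreeing on at least $min fields.
sub candidate_pairs {
    my ($records, $min) = @_;
    my @pairs;
    for my $i (0 .. $#$records) {
        for my $j ($i + 1 .. $#$records) {
            push @pairs, [$i, $j]
                if match_score($records->[$i], $records->[$j]) >= $min;
        }
    }
    return @pairs;
}

# Format two records in columns so a person can judge them.
sub side_by_side {
    my ($x, $y) = @_;
    return join '', map {
        sprintf "%-6s %-25s | %-25s\n", $_, $x->{$_}, $y->{$_}
    } 'id', @COMPARE;
}
```

A review loop would print side_by_side() for each pair from candidate_pairs() and ask the user whether to keep both, merge them, or delete one; that prompt is left out here because it depends on how you want to handle the answers.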

My $.02