Gossamer Forum

file: checking for duplicates

Hi,

I have a large file (20 MB) in which each line consists of 6 fields separated by the pipe symbol.

Field 2 in this file is a URL.

I am looking for a snippet of Perl code that would eliminate duplicate lines from this file, based on field 2 (the URL).

Something like: read the complete file1.db and, if field 2 has not been seen before, write the line to a new file called file2.db.

Does the file first have to be sorted on field 2 before the duplicates in that field can be eliminated?

Thanks
Re: [sanuk] file: checking for duplicates In reply to
Code:
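use strict;
use warnings;
use Fcntl qw(:flock);

my $db1 = '/path/to/db1.db';
my $db2 = '/path/to/db2.db';
my %log;    # URLs seen so far

open my $in,  '<', $db1 or die "Cannot open $db1: $!";
open my $out, '>', $db2 or die "Cannot open $db2: $!";
flock $out, LOCK_EX or die "Cannot lock $db2: $!";
while (<$in>) {
    my @row = split /\|/;    # $row[1] is field 2, the URL
    # Print the line only the first time its URL is seen
    print {$out} $_ unless $log{$row[1]}++;
}
close $out or die "Cannot close $db2: $!";
close $in;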
Re: [Paul] file: checking for duplicates In reply to
Thanks,

I have to admit that I don't understand what is happening, but I have no doubt it will work.

Before I test it on my real file (20 MB), would it be possible to include a third file, let's call it db3, that stores all the rejected (duplicate) lines, as a safeguard and so I can check the results?

Thanks,
Re: [sanuk] file: checking for duplicates In reply to
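Sure, run it a second time with unless changed to if and the output pointed at a third file such as db3. That pass prints only the lines whose URL has already been seen, i.e. the duplicates.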
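An alternative that writes the unique lines and the rejected lines in a single pass might look like the sketch below. The sub name split_duplicates and the idea of passing three paths are made up for illustration; it assumes, as in the original question, that field 2 of each pipe-separated line is the URL.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): lines whose URL is seen for the
# first time go to $keep_path, every later line with that URL goes to $dup_path.
sub split_duplicates {
    my ($in_path, $keep_path, $dup_path) = @_;
    my %seen;    # URL => number of times seen so far

    open my $in,   '<', $in_path   or die "Cannot open $in_path: $!";
    open my $keep, '>', $keep_path or die "Cannot open $keep_path: $!";
    open my $dup,  '>', $dup_path  or die "Cannot open $dup_path: $!";

    while (my $line = <$in>) {
        my @row = split /\|/, $line;
        # Field 2 is assumed to be the URL; lines without it share the key ''
        my $url = defined $row[1] ? $row[1] : '';
        # Post-increment is false on the first sighting, true afterwards
        my $out = $seen{$url}++ ? $dup : $keep;
        print {$out} $line;
    }

    close $keep or die "Cannot close $keep_path: $!";
    close $dup  or die "Cannot close $dup_path: $!";
    close $in;
    return;
}
```

It would be called as split_duplicates('/path/to/db1.db', '/path/to/db2.db', '/path/to/db3.db'), with the placeholder paths replaced by real ones. The line counts of db2 and db3 should then add up to the line count of db1, which gives a quick check of the result.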
Re: [Paul] file: checking for duplicates In reply to
Hi Paul,

Sorry to disturb you again!

I do not want to just copy and paste a snippet of code; I would also like to learn and understand it.

Could you please explain what

unless ($log{$row[1]}++);

exactly does?

I have to confess that I have been looking at this for hours without understanding the magic of it.

Thanks,
Re: [sanuk] file: checking for duplicates In reply to
It uses the URL as a key in the %log hash and counts how often that URL has been seen. The ++ is a post-increment: it returns the value the key had before and then increases it by 1. The first time a URL turns up, the key does not exist yet, so the old value is false; the "unless" condition is therefore not met, the line is printed, and the count becomes 1. When the same URL turns up again, the old value is 1 (true), so the "unless" skips the print and the duplicate line is left out.
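To watch this happen outside the file loop, here is a small stand-alone sketch. The three example URLs and the @kept array are made up for illustration; only the $log{...}++ test is taken from the code above.

```perl
use strict;
use warnings;

my %log;
my @urls = ('http://a.example/', 'http://b.example/', 'http://a.example/');
my @kept;

for my $url (@urls) {
    # Post-increment hands back the old count, then adds 1 to the stored one
    my $old = $log{$url}++;
    # The old count is false on the first sighting, so the URL is kept
    push @kept, $url unless $old;
}

print "$_\n" for @kept;
print "$_ seen $log{$_} time(s)\n" for sort keys %log;
```

Only the first occurrence of each URL ends up in @kept, while %log still records the total count for every URL.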