Gossamer Forum: General: Perl Programming: file: checking for duplicates

Jun 22, 2003, 2:49 AM

sanuk

User (103 posts)

Post #1 of 6

file: checking for duplicates

hi,

I have a large file (20 MB) consisting of 6 fields separated by the pipe symbol.

Field 2 in this file is a URL.

I am looking for a snippet of Perl code that would eliminate duplicate lines from this file, based on checking field 2 (the URL).

Something like: read the complete file1.db and, if field 2 is not a duplicate, write the line to a new file called file2.db.

Maybe the file first has to be sorted on field 2 before starting to eliminate the duplicates that occur in field 2?

Thanks

Jun 22, 2003, 3:36 AM

Paul

Veteran (19537 posts)

Post #2 of 6

Re: [sanuk] file: checking for duplicates

Code:
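use strict;
use warnings;

my $db1 = '/path/to/db1.db';
my $db2 = '/path/to/db2.db';
my %log = ();    # URL => number of times seen so far

open  my $in,  '<', $db1 or die "Cannot read $db1: $!";
open  my $out, '>', $db2 or die "Cannot write $db2: $!";
flock $out, 2;   # 2 = LOCK_EX (exclusive lock)
while (<$in>) {
    my @row = split /\|/;    # field 2 (the URL) is $row[1]
    # Post-increment returns the old count: 0 (false) the first time a URL is seen.
    print $out $_ unless ($log{$row[1]}++);
}
close $out or die "Cannot close $db2: $!";
close $in;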

Jun 22, 2003, 12:57 PM

sanuk

User (103 posts)

Post #3 of 6

Re: [Paul] file: checking for duplicates

thanks,

I have to admit that I don't understand what is happening, but I have no doubt it will work.

Before testing it on my real file (20 MB), would it be possible to include a third file, let's call it db3, that stores all the rejected (duplicate) lines, both as a safety net and for checking the results?

Thanks,

Jun 22, 2003, 1:09 PM

Paul

Veteran (19537 posts)

Post #4 of 6

Re: [sanuk] file: checking for duplicates

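Sure, just change unless to if and point $db2 at db3. That run writes only the duplicate lines, so run it once each way to get both files.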
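
For anyone who would rather get both files in a single pass, here is a minimal sketch. It is not from Paul's post: the helper name split_duplicates and the db3 path are assumptions for illustration, and a line with no second field is treated as having an empty URL.

```perl
use strict;
use warnings;

# Hypothetical helper: copy the first line seen for each URL (field 2) to
# $unique_path and every later line with the same URL to $dupes_path.
# Returns the number of lines written to each file.
sub split_duplicates {
    my ($in_path, $unique_path, $dupes_path) = @_;
    my %log;    # URL => number of times seen so far

    open my $in,     '<', $in_path     or die "Cannot read $in_path: $!";
    open my $unique, '>', $unique_path or die "Cannot write $unique_path: $!";
    open my $dupes,  '>', $dupes_path  or die "Cannot write $dupes_path: $!";

    my ($kept, $rejected) = (0, 0);
    while (my $line = <$in>) {
        my @row = split /\|/, $line;
        # Assumption: a line without a second field gets an empty URL.
        my $url = defined $row[1] ? $row[1] : '';
        if ($log{$url}++) {    # old count was 1 or more: a duplicate
            print {$dupes} $line;
            $rejected++;
        }
        else {                 # old count was 0: first occurrence
            print {$unique} $line;
            $kept++;
        }
    }
    close $in;
    close $unique or die "Cannot close $unique_path: $!";
    close $dupes  or die "Cannot close $dupes_path: $!";
    return ($kept, $rejected);
}

# Example call (paths are placeholders):
# my ($kept, $rejected) = split_duplicates(
#     '/path/to/db1.db', '/path/to/db2.db', '/path/to/db3.db');
# print "kept $kept, rejected $rejected\n";
```

As a quick sanity check, the two counts added together should equal the number of lines in the input file.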

Jun 22, 2003, 8:04 PM

sanuk

User (103 posts)

Post #5 of 6

Re: [Paul] file: checking for duplicates

Hi Paul,

Sorry to disturb you again!

I do not want to just copy and paste a snippet of code; I would also like to learn and understand.

Could you please explain what

unless ($log{$row[1]}++);

exactly does?

I have to confess that I have been looking at this for hours without understanding the magic of it.

Thanks,

Jun 27, 2003, 3:12 AM

Paul

Veteran (19537 posts)

Post #6 of 6

Re: [sanuk] file: checking for duplicates

It uses the URL as a hash key. The ++ is a post-increment: it hands back the key's old value and then increases it by 1. The first time a URL turns up the key has no value yet, so the old value is 0 (false), the "unless" test passes and the line is printed. When a duplicate URL is found the old value is already 1 or more (true), so the "unless" test fails and the line is skipped.
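
To watch the idiom work on its own, here is a small sketch. The helper name first_time and the example URLs are made up for illustration; only the %log hash and the ++ test come from the code earlier in the thread.

```perl
use strict;
use warnings;

my %log;    # key => number of times seen so far

# Hypothetical helper: true only the first time a given key is passed in.
sub first_time {
    my ($key) = @_;
    return !$log{$key}++;    # post-increment: test the old value, then add 1
}

for my $url (qw(http://a.example/ http://b.example/ http://a.example/)) {
    my $verdict = first_time($url) ? 'print' : 'skip';
    print "$url => $verdict (count is now $log{$url})\n";
}
# Output:
# http://a.example/ => print (count is now 1)
# http://b.example/ => print (count is now 1)
# http://a.example/ => skip (count is now 2)
```

Because the hash remembers every URL it has seen, the input does not need to be sorted first; the cost is one hash entry per distinct URL.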