Gossamer Forum
Huge database problem!

Hey guys!

I've got a huge database of URLs (over 100,000). Just URLs, no titles or descriptions. The file is 2.6 megabytes.

The problem is that this database contains duplicate entries, which I would like to remove.
I tried to open this database in Excel, but Excel supports only about 66,000 rows, so it couldn't load my database. :(

Is there another way to eliminate these duplicate entries? Maybe some kind of advanced search/replace tool?
Please help!


Regards,

Pasha

------------------
webmaster@find.virtualave.net
http://find.virtualave.net
Re: Huge database problem!
Pasha,

If you are on a UNIX/Linux system and have telnet access, have you considered using the sort command from the command line first?

Once you have the file sorted, all duplicate lines would then be together. It would then be a simple matter (relatively speaking) to read in each line, store it in a temp variable, read the next line and if it is the same as the stored one, skip it and get the next line. If it is not the same, write the stored line to a new file and store the new line in the temp variable, repeating this process until all lines have been checked.
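The compare-with-the-previous-line idea described above could look roughly like this in Perl. This is only a sketch: the helper name `dedupe_sorted` and the sample URLs are made up for illustration, and it assumes the lines have already been sorted.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: takes lines that are ALREADY sorted and returns
# them with adjacent duplicates dropped.
sub dedupe_sorted {
    my @sorted = @_;
    my @unique;
    my $previous;    # the "temp variable" holding the last line kept
    for my $line (@sorted) {
        # Same as the stored line: skip it and move on.
        next if defined $previous && $line eq $previous;
        push @unique, $line;
        $previous = $line;
    }
    return @unique;
}

# Made-up sample data standing in for the URL file.
my @input = ("http://b.example/\n", "http://a.example/\n", "http://b.example/\n");
my @urls  = sort @input;
print join('', dedupe_sorted(@urls));
```

On a UNIX shell, `sort -u` does the sort and the duplicate removal in one step, so the loop is only needed when doing the job in Perl.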

If you do not have telnet access, you can perform the sort right in perl before doing anything else with the file. Read the whole file into an array, sort the array, write it back out, and then read it back in line by line as I said above.
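Reading the whole file into an array and sorting it could be sketched as follows. The helper name `sort_file` and the command-line usage are assumptions for illustration, and it assumes the file fits comfortably in memory (a few megabytes should be fine).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read every line of $in_name into an array,
# sort the array, and write the result to $out_name.
# Returns the number of lines read.
sub sort_file {
    my ($in_name, $out_name) = @_;

    open(my $in, '<', $in_name) or die "Cannot read $in_name: $!";
    my @lines = <$in>;    # whole file in memory, one element per line
    close $in;

    open(my $out, '>', $out_name) or die "Cannot write $out_name: $!";
    print {$out} sort @lines;    # default sort is plain string order
    close $out or die "Cannot finish writing $out_name: $!";

    return scalar @lines;
}

# Runs only when two file names are given on the command line.
sort_file(@ARGV) if @ARGV == 2;
```

After this step the duplicates sit next to each other in the output file, ready for the line-by-line pass.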

Just some ideas.
Re: Huge database problem!
Thanks, Bobsie, but I don't have telnet access to the server, and I wouldn't do the sorting on the server anyway: I'd rather slow down my PC at home than a web server which already has over 25,000 websites slowing it down. :)

There must be a program that can do the job better than Excel ;) I don't even need sorting, just deleting the duplicate entries.
BTW: I just got a second database, 6.2 MB!

What do you use to manage your database?


Regards,

Pasha

[This message has been edited by Pasha (edited April 04, 1999).]
Re: Huge database problem!
The only database I have is Links, so I guess you can guess what I use to manage it. ;)
Re: Huge database problem!
Perfect job for Perl! You do have Perl installed on your home computer, right? ;)

Code:
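#!/usr/local/bin/perl
# ---------------------------------
# Copy nameoffile.txt to newdatabase.txt, leaving out duplicate lines.
use strict;
use warnings;

open (my $db, '<', 'nameoffile.txt') or die "Can't read nameoffile.txt: $!";
open (my $dbout, '>', 'newdatabase.txt') or die "Can't write newdatabase.txt: $!";

my %seen;
while (my $line = <$db>) {
    $seen{$line} = 1;    # each distinct line ends up as exactly one hash key
}
# Hash keys come back in no particular order, so the output is not sorted.
foreach my $url (keys %seen) {
    print $dbout $url;
}
close $db;
close $dbout or die "Can't finish writing newdatabase.txt: $!";
# ---------------------------------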
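
One thing to keep in mind with the hash approach: Perl returns hash entries in no particular order, so the new file will not keep the original line order. If the order matters, a variation like the following keeps the first occurrence of each line where it was. This is a sketch only; the helper name `unique_in_order` and the script name `dedupe.pl` are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the given lines with later repeats dropped,
# keeping the first occurrence of each line in its original position.
sub unique_in_order {
    my %seen;
    # $seen{$_}++ is 0 (false) the first time a line appears, so grep
    # keeps only first occurrences.
    return grep { !$seen{$_}++ } @_;
}

# Usage (assumed): perl dedupe.pl nameoffile.txt newdatabase.txt
if (@ARGV == 2) {
    my ($in_name, $out_name) = @ARGV;
    open(my $in,  '<', $in_name)  or die "Cannot read $in_name: $!";
    open(my $out, '>', $out_name) or die "Cannot write $out_name: $!";
    print {$out} unique_in_order(<$in>);
    close $in;
    close $out or die "Cannot finish writing $out_name: $!";
}
```

Like the script above, this holds every distinct line in memory, which should not be a problem for a file of a few megabytes.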

Hope that helps,

Alex
Re: Huge database problem!
 
Quote:
what I had was a compiler and Perl 4 source code.

No wonder you had problems! The new version comes with InstallShield and everything. No compiling required. =)

Cheers,

Alex
Re: Huge database problem!
Alex my friend,
I've tried to install Perl on my PC with Win98, but it gave me too much pain and sleepless nights, I had to give up. :)

There's more: www.virtualave.net doesn't allow any uploads bigger than 1.2 MB. I've mailed them a request to let me have big files on the server, but they haven't replied yet.

Are there any online FAQs for Perl for Win98? :)

Thanks for the reply.


Pasha

------------------
webmaster@find.virtualave.net
http://find.virtualave.net
Re: Huge database problem!
 
Quote:
I've tried to install Perl on my PC with Win98, but it gave me too much pain and sleepless nights, I had to give up

Really? Have you tried the latest Perl from ActiveState:

http://www.activestate.com/activeperl/

It comes with an installer and sets everything up for you. You don't need to set up a web server for this: just install Perl, save that script in the same directory as the database file, and double-click on it.

Cheers,

Alex
Re: Huge database problem!
I didn't know about an installer; what I had was a compiler and Perl 4 source code. The environment was similar to Linux, so I screwed everything up :)
I will try this version, and if it comes with an installer and a good FAQ, I'll be all right (hopefully) :)

Thanks again.


Pasha

------------------
webmaster@find.virtualave.net
http://find.virtualave.net