DMOZ Import Killed... Possible to Resume?

I'm trying to import data from a DMOZ slice, and the process keeps getting killed after about 65,000 links. There are about 300,000 links total, so I'm wondering if there's a way to re-start the import from the point where it was killed the last time through. I've used --rdf-update to make sure no duplicates get added, but even still, it keeps running out of steam in the 60,000-70,000 range. Is there some kind of "resume" feature? Is this possible?

If that's not an option, what would be the easiest way to delete a chunk from the import file (i.e. the chunk representing the data that has so far been successfully imported)? Maybe I'm just an ignorant tool, but I don't know how to do that except by using a text editor, and my text editor isn't too happy about working with a 65MB file...

Thanks in advance for any advice.

Fractured Atlas :: Liberate the Artist
Services: Healthcare, Fiscal Sponsorship, Marketing, Education, The Emerging Artists Fund

Last edited by hennagaijin: Sep 20, 2002, 5:52 AM
Re: [hennagaijin] DMOZ Import Killed... Possible to Resume?
Oops, just found this thread:

http://www.gossamer-threads.com/...i?post=175391#175391

Going to give it a shot with vedit...

Re: [hennagaijin] DMOZ Import Killed... Possible to Resume?
Hi, try adding --rdf-update to the command. It will go through the file and only add a link if it isn't already in the database.

Andy (mod)
andy@ultranerds.co.uk
Want to give me something back for my help? Please see my Amazon Wish List
GLinks ULTRA Package | GLinks ULTRA Package PRO
Links SQL Plugins | Website Design and SEO | UltraNerds | ULTRAGLobals Plugin | Pre-Made Template Sets | FREE GLinks Plugins!
Re: [Andy] DMOZ Import Killed... Possible to Resume?
Hi,

The problem is still time. The RDF import still chugs through all the links; it just doesn't add them again. It still processes every one of them, and because of the time that takes, problems may still occur.

This is an area that has been neglected, largely, I assume, because it's just a one-time thing for most people, and not a recurring daily nightmare :)
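
Since --rdf-update still reads every record, one workaround is to drop the link records that have already been imported before re-running. The following is only a minimal sketch. It assumes the slice uses the usual DMOZ layout, where each link is an <ExternalPage ...> ... </ExternalPage> block with the opening and closing tags on their own lines. skip_imported is a made-up helper name and the file names in the comment are placeholders. Topic blocks and all other lines are passed through untouched, on the assumption that the importer may still need them to build categories.

```shell
# Hypothetical helper: drop the first N <ExternalPage> records from an RDF
# file read on stdin, and write everything else to stdout unchanged.
skip_imported() {
  awk -v skip="$1" '
    /<ExternalPage /   { inpage = 1; seen++ }   # a link record starts here
    !(inpage && seen <= skip) { print }         # keep all but the first N records
    /<\/ExternalPage>/ { inpage = 0 }           # the link record ends here
  '
}

# Example (placeholder file names):
#   skip_imported 60000 < slice.rdf.u8 > remaining.rdf.u8
```

Skipping slightly fewer records than were actually imported is the safer choice; with --rdf-update the overlap should simply be ignored.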


PUGDOG® Enterprises, Inc.

The best way to contact me is to NOT use Email.
Please leave a PM here.
Re: [pugdog] DMOZ Import Killed... Possible to Resume?
That's exactly right. Using Vedit, though, it was a pain-free process to cut the full slice into ten smaller slices, none of which were so big as to trigger a timeout. Nice program, that Vedit... Hadn't even heard of it before yesterday.
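
For anyone without Vedit, the same kind of slicing can be sketched with awk. This is an illustration rather than a tested recipe for the importer: it assumes each link record ends with a line containing </ExternalPage>, and it only ever starts a new slice right after such a line, so no record is cut in half. split_rdf and the slice prefix are made-up names. The RDF header and footer lines are not copied into every slice, so whether the importer accepts the pieces as-is is an open assumption.

```shell
# Hypothetical helper: split INPUT into files PREFIX01, PREFIX02, ...
# with at most RECORDS_PER_SLICE link records in each.
# usage: split_rdf INPUT RECORDS_PER_SLICE PREFIX
split_rdf() {
  awk -v per="$2" -v prefix="$3" '
    BEGIN { part = 1; out = sprintf("%s%02d", prefix, part) }
    { print > out }                      # copy the current line to the open slice
    /<\/ExternalPage>/ {
      if (++count % per == 0) {          # slice is full: switch to the next file
        close(out)
        part++
        out = sprintf("%s%02d", prefix, part)
      }
    }
  ' "$1"
}

# Example (placeholder names): ten slices of a 300,000-link file
#   split_rdf slice.rdf.u8 30000 part.
```

Concatenating the slices in order gives back the original file, so nothing is lost by splitting.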

Re: [hennagaijin] DMOZ Import Killed... Possible to Resume?
Could someone direct me to some documentation on how exactly to import from DMOZ, starting from the basics?

I have no idea how to do this. I have a dedicated server, and root. I have the plugin, but it has very poor documentation; it seems to assume you know already. I cannot find anything in searching the forums that I can understand.

OK, I have the file content.rdf.u8

Now what? How do I make it extract the category:
http://dmoz.org/Society/Paranormal/

And then how, exactly, do I put it into my directory? Do I run something else, like do_dump.cgi? What parameters?
Re: [Palehorse777] DMOZ Import Killed... Possible to Resume?
>>>very poor documentation, it seems to assume you know already<<<

Not really. I have people who have no previous knowledge of even using SSH, and they seem to be able to do it on their own ;)

Are you reading the instructions given to you when you submit the form that sets up the script?

Here is the email that I just sent you, in case anyone needs any future help (not quite sure why... this is a million times easier/faster than doing it all manually ;) )

Quote:
Log into your LinksSQL admin panel
Go to 'Plugins' on the top menu
Click 'Setup New Job'

Set 'no' for cleanout database
Select 'yes' for email notification
Select 'Yes' for backing up your database before importing (VERY important if you have data in your site already).

Set to 'No' for 'Run Full Dump?'

Enter one category per line in 'categories to import' ... Replace ' ' with '_' (space with underscore), otherwise it won't be able to import them for you.

Now skip the rest of it, and go down to;

'Redownload content.rdf.u8?' ... You can check via FTP whether you already have this file in your LSQL admin folder. If you don't, make sure this is ticked. Otherwise, untick it, as that saves the script from redownloading the 200MB+ file ;-)

Now, for a first test, I would recommend running the import by hand over SSH rather than from cron, so just untick the 'setup cronjob' box.

Now just submit the form, and log into SSH.

Type;

cd /path/to/your/lsql/admin
perl dmoz_cron.cgi > log.txt &

(change the path in the first line to reflect your admin path)

Then, you should see something like;

[1] 1234

1234 is the process ID, and to stop the process at any time, simply type;

kill 1234

(the ID will obviously be different).

Cheers
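
The steps above start the import as a background job of the SSH session. If the import dies when the session closes (only one possible cause, and an assumption here), wrapping it in nohup and saving the process ID may help. It will not help against a host-side time or memory limit. run_in_background is a made-up helper name; the log.txt and dmoz_cron.cgi names in the comment come from the email above.

```shell
# Hypothetical helper: start a command detached from the terminal, send its
# output (stdout and stderr) to LOGFILE, and print the process ID.
# usage: run_in_background LOGFILE COMMAND [ARGS...]
run_in_background() {
  log=$1
  shift
  nohup "$@" > "$log" 2>&1 &
  echo $!
}

# Example, run from the LSQL admin folder:
#   pid=$(run_in_background log.txt perl dmoz_cron.cgi)
#   kill -0 "$pid" && echo "still running"   # check without stopping it
#   kill "$pid"                              # stop the import
```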

Andy (mod)