Gossamer Forum

Regional import...

Has anyone tried to import the full Regional category from the DMOZ RDF dump? I have tried many times to import the category into a clean install (new database), but it fails during the table repair. At first I thought my RDF file might be corrupt, but the same thing happens with the main gzipped file and with a cut-down extract I made. During the repair, the output reads something like:

Regional has {blank} links but is set to 19. Repairing...

(it can't seem to determine the total links)


Then, after running for an hour or two, it gives a:

Structure Error!

This is the largest DMOZ category (over 700,000 links). All other categories seem to import properly. The problem shows up again when you then rebuild the search index: it only indexes a little over 300,000 of the links.

Is there a limit as to how many links the software can handle per category?

Sean
Re: [SeanP] Regional import...
Hi Sean

I had exactly the same problem.

Even on a clean install it would fail, at the same point.

I tried to resolve it but couldn't get it to work.

I imported the other 1.4 million.

Dregs2
Re: [dregs2] Regional import...
I wonder if it has something to do with the size of the category since it is the largest. Maybe Alex has some ideas.

Sean
Re: [SeanP] Regional import...
I imported 500,000 links into someone's directory a few days ago, and he is experiencing the same errors when building.

Last edited by:

RedRum: Feb 25, 2002, 2:39 PM
Re: [SeanP] Regional import...
Hmmm. I just imported only the Ohio part of the Regional section and I'm missing a number of links: some subcategories show "0" links when I know there are links in them. This category is supposed to have approximately 37K links, and I only have about 32K, which includes about 7K from my existing database. I wonder if there is a common problem here!

I'd love to have Alex respond to this one!

mgeyman
Re: [mgeyman] Regional import...
Have you repaired your tables and reindexed?

Last edited by:

RedRum: Feb 26, 2002, 7:12 AM
Re: [RedRum] Regional import...
RedRum,

I did the process only once, via the Web admin interface, after doing the import. Would it be better to do the repair and re-index via the command line instead?

mgeyman
Re: [mgeyman] Regional import...
Well, ideally use telnet/ssh.

That is what you need to do to get the totals to look right instead of (0).
Re: [RedRum] Regional import...
Thanks a lot. I'll give that a try.

mgeyman
Re: [mgeyman] Regional import...
I've tried doing the repair from the shell, but I get the same error as when I do it in the web console.

Sean
Re: [mgeyman] Regional import...
mgeyman,

Did you get the "Structure Error!" when you ran the repair after importing just the Ohio categories?

Sean
Re: [SeanP] Regional import...
I did a full import last week (took all week).

No errors, uninterrupted run; several repairs (each took over 4 hours), no errors.

If you want me to check the actual stats, I can. I'm sure there are missing links and such: from what I saw looking at the database itself, at least 2-3 dozen links are "bad" in some way. I don't know if that happened during import or comes from the RDF dump itself (they are not "URIs" of any form).

I aborted at the "world" section, since there is no point importing that as a whole.

I didn't import the Kids. etc, since that was after that section, but I will at some point.

FWIW, my stats show Ohio has 24,599 links, and there are 735,679 links in the whole Regional category.

Unfortunately this machine is non-routable, so I can't show it off.

I'm working on a way to do selective dumps from this machine, but have been sidetracked with the other programs I've been working on.

I'm also trying to figure out a way to "update" this with the RDF files from ODP, looking for differences, without having to iterate through the whole file.

I wonder about using a set of filters: "grep" the RDF download to extract the URLs, sort them, and "diff" them against the URLs that are already in the database. Then go back and extract just those links for import.

It would be a large file, but grep and diff are fairly efficient.
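The grep-and-diff idea can be sketched roughly like this. This is only an illustration, not anything from Gossamer Links: it assumes each link in the dump appears on a line like `<ExternalPage about="URL">`, and that the URLs already in the database have been exported to a plain list some other way. The names `urls_in_dump` and `new_urls` are made up for the example.

```python
import html
import re

# Assumption: every link in the RDF dump is introduced by a line of the form
# <ExternalPage about="URL">. Adjust the pattern if your dump differs.
EXTERNAL_PAGE = re.compile(r'<ExternalPage about="([^"]*)"')


def urls_in_dump(lines):
    """The "grep" step: collect the URL of every ExternalPage entry."""
    found = set()
    for line in lines:
        match = EXTERNAL_PAGE.search(line)
        if match:
            # Attribute values are XML-escaped (e.g. &amp;), so unescape them.
            found.add(html.unescape(match.group(1)))
    return found


def new_urls(dump_lines, known_urls):
    """The "diff" step: dump URLs that are not in the database yet, sorted."""
    return sorted(urls_in_dump(dump_lines) - set(known_urls))
```

Because `urls_in_dump` takes any iterable of lines, an open file handle on the dump can be passed in directly, so the file is read once, line by line, rather than loaded whole. The set of URLs itself still has to fit in memory.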


PUGDOG Enterprises, Inc.

The best way to contact me is to NOT use Email.
Please leave a PM here.
Re: [SeanP] Regional import...
SeanP,

I don't remember seeing that error message.

I re-ran everything via SSH and I still have subcategories that have no (0) links in them versus looking at DMOZ where links exist. I'm scratching my head on this one!

mgeyman
Re: [pugdog] Regional import...
pugdog,

I'm interested in your pursuit of the ability to "pull out" select sections and being able to import just the updated links. That's pretty cool!

You are right, the Ohio section has 24K+ links. My own db has about 7K+, so I may be missing about 1K links from the import for whatever reason.

Question - do duplicate links in the RDF file get counted as "2" separate links or "1" link?

Thanks a lot.

mgeyman
Re: [pugdog] Regional import...
I used wget to fetch the RDF file a couple of weeks ago and edited it down to just the Top/Regional/North_America/United_States/Ohio section. Should I try to grab a more recent version of the content RDF file from DMOZ to see if I get the same results (some of the subcategories with (0) links)? Do you think my results will differ?

Thanks.

mgeyman
Re: [mgeyman] Regional import...
I've been thinking about this. I think the empty categories have to do with "related" links and such, where a link is in multiple locations. The DMOZ dump only places the link in its main location, so categories that are filled with links originally located elsewhere would appear empty.
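One way to test this guess is to count, straight from the dump, how many links each category holds as its own. The sketch below is only an illustration and makes assumptions about the dump layout: that categories are declared as `<Topic r:id="...">` and that each link records its main location in a `<topic>...</topic>` element. The function names are invented for the example.

```python
import re
from collections import Counter

# Assumptions about the dump layout (verify against your own file):
#   categories are declared as  <Topic r:id="Top/...">
#   each link names its main category as  <topic>Top/...</topic>
TOPIC_DECL = re.compile(r'<Topic r:id="([^"]*)"')
PAGE_TOPIC = re.compile(r'<topic>([^<]*)</topic>')


def own_link_counts(lines):
    """Count the links whose main location is each category."""
    counts = Counter()
    for line in lines:
        decl = TOPIC_DECL.search(line)
        if decl:
            counts[decl.group(1)] += 0  # register the category even if it stays empty
        page = PAGE_TOPIC.search(line)
        if page:
            counts[page.group(1)] += 1
    return counts


def empty_categories(lines):
    """Categories declared in the dump that hold no links of their own."""
    return sorted(name for name, n in own_link_counts(lines).items() if n == 0)
```

If the categories this reports as empty match the ones showing (0) after the import, that would suggest the dump itself has nothing for them, rather than the import dropping links.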


PUGDOG Enterprises, Inc.

The best way to contact me is to NOT use Email.
Please leave a PM here.
Re: [pugdog] Regional import...
So if that's the case, removing the "Related Links" from the RDF dump would most likely remedy the issue? I wonder why only the Regional category poses this problem.

Sean
Re: [pugdog] Regional import...
I think it would be very helpful if the repair created a log showing what it was doing. That way, it would be easier to track down where it errored.
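Until something like that exists in the software, a generic wrapper run from the shell can at least timestamp whatever the repair prints. This is a minimal sketch, not part of Gossamer Links: `run_with_log` is a made-up name, and the command to run is whatever you normally type for the repair (it is not specified here).

```python
import subprocess
import time


def run_with_log(command, log_path):
    """Run command, appending each output line to log_path with a timestamp.

    command is a list such as ["perl", "some_script.pl"]; the actual repair
    command is not known here and must be supplied by the caller.
    Returns the exit code of the command.
    """
    with open(log_path, "a", encoding="utf-8") as log:
        process = subprocess.Popen(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # keep error messages in the same log
            text=True,
        )
        for line in process.stdout:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log.write(f"{stamp} {line.rstrip()}\n")
            log.flush()  # so the log can be followed while the job runs
        return process.wait()
```

The timestamps show how long each stage took and what the last message was before a failure. Note that if the repair script buffers its own output, lines may arrive in bursts rather than as they are produced.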

Sean

Last edited by:

SeanP: Apr 16, 2002, 7:01 PM
Re: [SeanP] Regional import...
I had the same problem importing the Regional tree from DMOZ, but was able to fix it. I've only tried it via the shell, as a file this size is usually hard to handle from the web. In any case, I got the error:

Category Regional should have links, but is set to 764915, repairing ... Structure Error!

It seems to just hang, so I stopped it. I did this about 3 times with the same result. But then I looked at the script and noticed there is a routine that runs after this error, so it's not just hanging. If you just keep the script running for an hour longer (more or less, depending on your server), it seems to fix itself.

Then I get the following messages:

Regional reported: 764915 real:

Category World should have 626322 links, but is set to 0, repairing ... 626322 ok!
Category Kids and Teens should have 18271 links, but is set to 0, repairing ... 18271 ok!
Done (4918.00 s)

Greg
Consulting First, Inc.