Gossamer Forum

Dmoz dumps

What is the usual listing content for sites in dmoz dumps? I know they contain the site name and URL.

What else?

Site descriptions? Email contact info for site owners?

I'm considering using a dmoz chunk and I don't want a database full of holes, so I'd like to know what's there before planning what to include in or leave out of the database.

Thanks.
Re: [roman365] Dmoz dumps In reply to
It imports:

Title
Description
URL

Cheers

Andy (mod)
andy@ultranerds.co.uk
Want to give me something back for my help? Please see my Amazon Wish List
GLinks ULTRA Package | GLinks ULTRA Package PRO
Links SQL Plugins | Website Design and SEO | UltraNerds | ULTRAGLobals Plugin | Pre-Made Template Sets | FREE GLinks Plugins!
Re: [Andy] Dmoz dumps In reply to
Looks like I'll be going for a slimmed-down db then. No way I'd fill in all those contact emails by hand! Crazy.

Unless someone knows a tool to do this, which I doubt is possible.

Thanks again.
Re: [roman365] Dmoz dumps In reply to
It's possible to write a simple Perl script that would go through each link and look up the Whois record for the link in question. My 'Spider' plugin uses this kind of method to find the contact email addresses if nothing is defined in the meta tags.
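
To give a rough idea of what such a loop could look like, here is a minimal sketch. It is not the Spider plugin's code. It assumes each link is a hash with hypothetical `URL` and `Contact_Email` keys, and that the lookup itself is passed in as a code reference (for example, a wrapper around a Whois module):

```perl
use strict;
use warnings;

# Pull the host out of a URL and reduce it to its last two labels.
# Assumption: this naive rule is wrong for names such as example.co.uk,
# which would need a proper public-suffix list.
sub domain_from_url {
    my ($url) = @_;
    my ($host) = $url =~ m{^https?://([^/:?#]+)}i or return;
    my @labels = split /\./, lc $host;
    return if @labels < 2;
    return join '.', @labels[-2, -1];
}

# Fill in the (hypothetical) Contact_Email field for links that lack one.
# $lookup is any code ref that takes a domain and returns an email or undef.
sub backfill_contacts {
    my ($links, $lookup) = @_;
    my $filled = 0;
    for my $link (@$links) {
        next if defined $link->{Contact_Email} && length $link->{Contact_Email};
        my $domain = domain_from_url($link->{URL}) or next;
        my $email  = $lookup->($domain) or next;
        $link->{Contact_Email} = $email;
        $filled++;
    }
    return $filled;
}
```

Links that already have a value are skipped, so the loop can be re-run safely after a partial pass.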

Cheers

Andy (mod)
Re: [Andy] Dmoz dumps In reply to
That sounds pretty cool, way beyond my abilities.

So your spider fills in contact name and email? What about cases where the contact is, say, a webhost admin instead of the owner? There's probably no way around that, though.

How much would a script to backtrack a Links SQL db for contacts cost?

Can your spider be used for adding domains to the db?
Re: [roman365] Dmoz dumps In reply to
P.S. - does such a script exist for the current Links 2?

I need something like that right now. I've got a db of about 800 links pulled from dmoz and only about 200 have contact info at the moment. I was doing it manually :(
Re: [roman365] Dmoz dumps In reply to
Quote:
So your spider fills in contact name and email? What about cases where the contact is, say, a webhost admin instead of the owner? There's probably no way around that, though.

To an extent. You have to define which page to start from (e.g. http://www.linkssql.net/test.html), and then the spider will grab the URLs from that page.

It's *not* automated.

Quote:
How much would a script to backtrack a Links SQL db for contacts cost?

Well, if you needed it, I could do it for US$50. This would basically backtrack the Author and Email for each of the links that don't already have a value.

Please send me a PM if you're interested in me doing this work.

Cheers

Andy (mod)
Re: [roman365] Dmoz dumps In reply to
In Reply To:
P.S. - does such a script exist for the current Links 2?

I need something like that right now. I've got a db of about 800 links pulled from dmoz and only about 200 have contact info at the moment. I was doing it manually :(

Afraid not. Links 2 is pretty dated in terms of development now. Most of us programmers (bar a few) are concentrating a lot more on LinksSQL, as it's a lot more flexible.

Cheers

Andy (mod)
Re: [Andy] Dmoz dumps In reply to
I may seriously take you up on writing that script in a few weeks.

It'll depend on me going for Links SQL; I'll know more definitely soon whether I'm going to.

I'll be in touch :)
Re: [Andy] Dmoz dumps In reply to
OK Andy - you have a contract from me to do exactly that then.

Go for it and I will pay you. I didn't know you could do that!

Get to it mate...

Paul
Re: [Andy] Dmoz dumps In reply to
re: whois

Andy, I don't want to rain on anyone's parade... but most whois access is now blocked. There is no way to automatically look that data up. Manually you need to enter some graphic code, or otherwise interact with the terminal. It's really killed some of the stuff I've been doing for years to maintain my domain lists -- for my own domain names, not even quasi-ethical harvesting of other data.

One thing that might be possible is to go to the front page of a website and look for "mailto" links. If the mailto: address is on that domain and the name is "contact", "support", "info", "webmaster", etc. (a list of defined names), then insert it automatically. If the name is something unknown at that domain, like jim@domain.com, then insert that into a "potential_contact" field.

You might have to do a 1-page deep search looking for email addresses that are on a contact or other similar page.

The admin area could then present the webmaster with a list of "potential_contacts" on a single line: "Title", "URL", "Potential_Contact", with URL clickable to view the site.

You could go through and check/eyeball the potentials, and check the box in front of the line to update the record, moving "potential_contact" to "contact".

Two check boxes [update contact] and [delete potential_contact] would allow you to quickly manage the information.

Easily 50 on a page.

Sure, with 20,000 links, it's a chore, but it's better than manually visiting each site and trying to find the information -- at least this script will potentially pick up a large portion of good contact emails, leaving only the ones it could not guess, or free sites with another free-site email address, as needing intervention.
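
The sorting rule for mailto links could be sketched roughly like this. The function name, the list of role names, and the two result keys are assumptions for illustration, and fetching the page is left out:

```perl
use strict;
use warnings;

# Role names treated as safe to insert automatically (the list is an assumption).
my %ROLE = map { $_ => 1 } qw(contact support info webmaster admin sales);

# Given a page's HTML and the site's domain, return a hash ref holding
# a 'contact' and/or a 'potential_contact' address (first match of each kind).
sub classify_mailtos {
    my ($html, $domain) = @_;
    my %found;
    while ($html =~ /mailto:([\w.+-]+)\@([\w.-]+\.[a-z]{2,})/gi) {
        my ($local, $host) = (lc $1, lc $2);
        next unless $host eq lc $domain;    # ignore addresses on other domains
        if ($ROLE{$local}) {
            $found{contact} //= "$local\@$host";
        }
        else {
            $found{potential_contact} //= "$local\@$host";
        }
    }
    return \%found;
}
```

Anything that ends up under `potential_contact` would still go through the manual check described above before being promoted.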

Just some thoughts....


PUGDOG Enterprises, Inc.

The best way to contact me is to NOT use Email.
Please leave a PM here.
Re: [pugdog] Dmoz dumps In reply to
Doesn't seem to be blocked for me. I use Net::Whois::Raw to parse the data, and then look for an email address in the contact details.

Pretty simple stuff... and yeah, my script doesn't get all of the emails/titles for every link, but it works on about 95% of 100,000 links :)
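
As an illustration of the "look for an email address in the contact details" step, here is a minimal sketch that works on raw Whois text. The function name and the label pattern are assumptions, and real records vary a lot between registries:

```perl
use strict;
use warnings;

# Return the first email address found on a contact-looking line of raw
# Whois text, or undef if there is none. The label pattern is a guess.
sub email_from_whois {
    my ($text) = @_;
    for my $line (split /\r?\n/, $text) {
        next unless $line =~ /\b(?:registrant|admin|tech|contact|e-?mail)\b/i;
        return lc $1 if $line =~ /([\w.+-]+\@[\w-]+(?:\.[\w-]+)+)/;
    }
    return;
}
```

With Net::Whois::Raw installed, the text could come from its exported `whois()` function, e.g. `email_from_whois(whois('example.com'))`.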

Cheers

Andy (mod)
Re: [Andy] Dmoz dumps In reply to
What whois site are you using?

The central registry has blocks: even 5 or 10 lookups within a short period of time will shut down that IP's access to the registry, and it sometimes blocks more than just one IP.
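
One way a script could try to stay under that kind of limit is to space its lookups out. A minimal sketch, assuming the lookup is a code reference and that a fixed gap between calls is enough (the gap value is a guess, not a published limit):

```perl
use strict;
use warnings;
use Time::HiRes qw(time sleep);

# Wrap $lookup so it is called no more often than once per $min_gap seconds.
sub throttled {
    my ($lookup, $min_gap) = @_;
    my $last = 0;
    return sub {
        my $wait = $last + $min_gap - time();
        sleep($wait) if $wait > 0;    # pause until the gap has passed
        $last = time();
        return $lookup->(@_);
    };
}
```

For example, `my $slow = throttled(\&my_lookup, 10);` would allow at most one call to a hypothetical `my_lookup` every 10 seconds.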


Re: [pugdog] Dmoz dumps In reply to
I use a variety of servers to check, depending on the domain TLD:

Quote:
my %servers = qw(
COM whois.crsnic.net
NET whois.crsnic.net
EDU whois.educause.net
ORG whois.publicinterestregistry.net
ARPA whois.arin.net
RIPE whois.ripe.net
MIL whois.nic.mil
COOP whois.nic.coop
MUSEUM whois.museum
BIZ whois.neulevel.biz
INFO whois.afilias.net
NAME whois.nic.name
AD whois.ripe.net
AL whois.ripe.net
AM whois.ripe.net
AS whois.gdns.net
AT whois.nic.at
AU box2.aunic.net
AZ whois.ripe.net
BA whois.ripe.net
BE aardvark.dns.be
BG whois.ripe.net
BR whois.nic.br
BY whois.ripe.net
CA eider.cira.ca
CC whois.nic.cc
CH domex.switch.ch
CK whois.ck-nic.org.ck
CL nic.cl
CN log.cnnic.net.cn
CX whois.nic.cx
CY whois.ripe.net
CZ dc1.eunet.cz
DE whois.denic.de
DK whois.dk-hostmaster.dk
DO ns.nic.do
DZ whois.ripe.net
EE whois.ripe.net
EG whois.ripe.net
ES whois.ripe.net
FI whois.ripe.net
FO whois.ripe.net
FR winter.nic.fr
GA whois.ripe.net
GB whois.ripe.net
GE whois.ripe.net
GL whois.ripe.net
GM whois.ripe.net
GR whois.ripe.net
GS whois.adamsnames.tc
HK whois.hkdnr.net.hk
HR whois.ripe.net
HU whois.nic.hu
ID muara.idnic.net.id
IE whois.domainregistry.ie
IL whois.isoc.org.il
IN whois.ncst.ernet.in
IS horus.isnic.is
IT whois.nic.it
JO whois.ripe.net
JP whois.nic.ad.jp
KG whois.domain.kg
KH whois.nic.net.kh
KR whois.krnic.net
LA whois.nic.la
LI domex.switch.ch
LK arisen.nic.lk
LT ns.litnet.lt
LU whois.dns.lu
LV whois.ripe.net
MA whois.ripe.net
MC whois.ripe.net
MD whois.ripe.net
MM whois.nic.mm
MS whois.adamsnames.tc
MT whois.ripe.net
MX whois.nic.mx
NL gw.domain-registry.nl
NO ask.norid.no
NU whois.worldnames.net
NZ akl-iis.domainz.net.nz
PL nazgul.nask.waw.pl
PT whois.ripe.net
RO whois.rotld.ro
RU whois.ripn.net
SE ear.nic-se.se
SG qs.nic.net.sg
SH whois.nic.sh
SI whois.arnes.si
SK whois.ripe.net
SM whois.ripe.net
ST whois.nic.st
SU whois.ripn.net
TC whois.adamsnames.tc
TF whois.adamsnames.tc
TH whois.thnic.net
TJ whois.nic.tj
TN whois.ripe.net
TO whois.tonic.to
TR whois.ripe.net
TW whois.twnic.net
UA whois.net.ua
UK whois.nic.uk
US whois.publicinterestregistry.net
VA whois.ripe.net
VG whois.adamsnames.tc
WS whois.worldsite.ws
YU whois.ripe.net
ZA apies.frd.ac.za
);
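
A table like this could be used along the following lines. This is only a sketch: the `server_for` function is a hypothetical name, and just a few entries from the list above are repeated to keep it short:

```perl
use strict;
use warnings;

# A few entries copied from the list above; the rest are omitted for brevity.
my %servers = qw(
    COM whois.crsnic.net
    ORG whois.publicinterestregistry.net
    DE  whois.denic.de
    UK  whois.nic.uk
);

# Return the Whois server for a domain's TLD, or undef if it is not listed.
sub server_for {
    my ($domain) = @_;
    my ($tld) = $domain =~ /\.([A-Za-z]+)\.?$/ or return;
    return $servers{ uc $tld };
}
```

Only the last label of the name is used as the key, so `example.co.uk` is looked up under `UK`.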

Cheers

Andy (mod)
Re: [Andy] Dmoz dumps In reply to
Sorry to dig up an old thread but did this ever lead to a tool for filling out existing links in a db?
Re: [roman365] Dmoz dumps In reply to
Hi,

You mean to update link records, based on their Whois data?

Cheers

Andy (mod)
Re: [Andy] Dmoz dumps In reply to
Yeah mate.

I remember you did something like this way back for Links 2.

Essentially, I think the contacts field is one of the most important bits of data to have in a listing. Getting a directory kick-started is the hardest part, and being able to inform listed site owners is the best chance of getting some interest initially, at least in my opinion :)

I'm not sure of all the ways contacts could be fetched. Whois would be hit or miss: I'm sure it would work for some, but some TLDs have the option to protect contact info now, and some of the records might get you registrar or registrant data that might have nothing to do with the link owner.

I can appreciate how hard it would be to have a reliable method, I was just curious if this subject resulted in something :)

I want to finally make some significant use of my 2 licences and the two ULTRA packages I got from you. I find them useful in 2 ways. Firstly, I use a Joomla directory component called Mosets Tree that can import from Links, so as a tool to fetch from dmoz and then feed on to that it's great (there are no verification or spidering facilities for it and no plans to add them - they could learn a lot from your Links extensions). Secondly, though, Links combined with your extensions is far superior to anything for Joomla, so eventually I want to make 2 good sites with Links itself. I just wish there was a way to fully integrate Links with Joomla, more so even than a bridge.
Re: [roman365] Dmoz dumps In reply to
Hi,

Yeah, I have a plugin called UpdateMetaValues, which grabs the content from the page via its "Meta" tags, but that's not using Whois. I could write you a script to get the contact details via a whois, if you wanted - not sure how long it would take to do.
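
For anyone wondering what grabbing values from "Meta" tags involves, here is a minimal sketch. It is not the UpdateMetaValues code, and the function name is an assumption. The regex needs `name` to come before `content` with both quoted, so a real HTML parser would cope with far more pages:

```perl
use strict;
use warnings;

# Collect <meta name="..." content="..."> pairs from a page into a hash,
# keyed by the lower-cased name.
sub meta_values {
    my ($html) = @_;
    my %meta;
    while ($html =~ /<meta\s+name\s*=\s*(["'])(.*?)\1\s+content\s*=\s*(["'])(.*?)\3/gis) {
        $meta{ lc $2 } = $4;
    }
    return \%meta;
}
```

Fields such as `author` or `description` could then be copied into a link record only where the record has no value yet.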

Quote:
I'm not sure of all the ways contacts could be fetched. Whois would be hit or miss: I'm sure it would work for some, but some TLDs have the option to protect contact info now, and some of the records might get you registrar or registrant data that might have nothing to do with the link owner.

Yeah, some people hide their details - while others show their domain registrar's details (i.e. to hide their true info)
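
A script could at least try to weed those out before saving anything. A minimal sketch, where the pattern list and the function name are purely illustrative assumptions, and only addresses on the site's own domain are trusted:

```perl
use strict;
use warnings;

# Patterns that suggest an address belongs to a privacy service or a
# registrar rather than the site owner. The list is illustrative only.
my @NOT_OWNER = (
    qr/privacy/i, qr/proxy/i, qr/whoisguard/i, qr/protect/i,
    qr/^(?:abuse|hostmaster|domains?|registrar)\@/i,
);

# Return 1 if the email looks usable as an owner contact for $site_domain.
sub looks_like_owner {
    my ($email, $site_domain) = @_;
    return 0 if grep { $email =~ $_ } @NOT_OWNER;
    my ($host) = $email =~ /\@(.+)$/ or return 0;
    # Only trust addresses on the site's own domain (or a subdomain of it).
    return lc($host) =~ /(?:^|\.)\Q$site_domain\E$/i ? 1 : 0;
}
```

Addresses that fail the check are not necessarily wrong, so they would be better kept aside for a manual look than thrown away.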

Re Joomla - yeah, I did start trying to write a login interface (that worked with Gcomm), but I never got very far with it (due to time, and also it being pretty complex with the login routines) :(

Cheers

Andy (mod)
Re: [Andy] Dmoz dumps In reply to
I'll have a think. Finances are looking nasty for the rest of the year, as I'm planning to shift countries at the end of the year :S

I'm sorely tempted though, any rough idea on cost?

I posted elsewhere on Joomla. Have a look at a Joomla bridge/migration component called JFusion, and there's another called Rokbridge; they might be of some interest.
Re: [roman365] Dmoz dumps In reply to
Hi,

A quote for the whois lookup stuff? Can't see it being more than 2 hours' work ($100).

Cheers

Andy (mod)