Gossamer Forum

duplicate check is not helpful

The duplicate check in Links SQL is weak. For example, if you have http://www.mysite.com
and you add http://www.mysite.com/, the software will never flag these as duplicates,
not to mention www.mysite.com and mysite.com, or mysite.com and mysite.com/index.html.

Re: duplicate check is not helpful
The duplicate check really only catches exact duplicates. Yes, it would be nice to
check against just the base URL, but think about how many variations there
could be. That would be a CPU-intensive operation, potentially needing to
build temporary tables and such.

The only other way would be to build a "duplicates" index that defines what constitutes a "url" in the database. Maybe when validating a link the system
would "suggest" a base URL. You could change it, but if that URL was already
in the duplicates table, the new addition would be rejected.
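
As a rough illustration of that "duplicates" index idea, here is a minimal Python sketch using an in-memory SQLite table. This is not how Links SQL works; the table name, the function names, and the rule used to suggest a base URL are all assumptions made up for the example.

```python
import sqlite3

def make_index():
    # In-memory table standing in for the proposed "duplicates" index.
    # The table and column names are hypothetical.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE duplicates (base_url TEXT PRIMARY KEY)")
    return conn

def suggest_base_url(url):
    # Hypothetical suggestion rule: trim whitespace and a trailing slash.
    # An editor could override the suggestion before it is stored.
    return url.strip().rstrip("/")

def add_link(conn, url, base_url=None):
    # Returns True if the link is accepted, False if its base URL
    # is already present in the duplicates table.
    key = base_url if base_url is not None else suggest_base_url(url)
    try:
        with conn:
            conn.execute("INSERT INTO duplicates (base_url) VALUES (?)", (key,))
    except sqlite3.IntegrityError:
        return False
    return True
```

The primary key does the rejecting here, so the check is a single indexed lookup at insert time rather than a scan over every URL variation.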

Trying to make this work in "real time" is tricky.



PUGDOG® Enterprises, Inc.
FAQ:http://LinkSQL.com/FAQ
Forum:http://LinkSQL.com/forum
Re: duplicate check is not helpful
Another possibility would be to run a script that converts all of:

www.domain.com
www.domain.com/
www.domain.com/index.html
www.domain.com/index.php
http://www.domain.com

to

http://www.domain.com/

I think this is normally better than specifying http://www.domain.com/index.html, mainly because the owner might change .html to .htm, but they are always going to make sure that www.domain.com works.

If a script were run that updated all of these, the duplicate check would work fine.
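
A conversion like the one described could be sketched in Python as follows. It is only an illustration: the function name and the list of default index files are assumptions (the thread names only index.html and index.php), and it defaults to http:// when no scheme is given.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed set of default documents; extend as needed.
INDEX_FILES = {"index.html", "index.php"}

def to_canonical(url):
    url = url.strip()
    # Bare hosts such as "www.domain.com" get an assumed http:// scheme.
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    host = parts.netloc.lower()
    # An empty path becomes "/", which supplies the trailing slash.
    path = parts.path or "/"
    head, _, last = path.rpartition("/")
    # Drop a default index file, but only when no query string depends on it.
    if last.lower() in INDEX_FILES and not parts.query:
        path = head + "/"
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, parts.fragment))
```

Running every stored URL and every new submission through the same function would let the existing exact-match duplicate check catch these variations.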

http://www.ASciFi.com/ - The Science Fiction Portal
Re: duplicate check is not helpful
GClemmons provided this code in another post. Maybe it will be useful for you:

/(?:^|[^\w-])URL\.COM/i
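
One way a pattern like this might be used is to list existing links that mention the same domain, so a person can review them before a new link is accepted. The sketch below is in Python rather than the Perl the pattern was written for, and the function names are hypothetical; it is not taken from GClemmons' post.

```python
import re

def domain_pattern(domain):
    # Match the domain only at the start of the string or after a character
    # that is not a letter, digit, underscore, or hyphen, so "mysite.com"
    # does not match inside "notmysite.com" or "my-mysite.com".
    return re.compile(r"(?:^|[^\w-])" + re.escape(domain), re.IGNORECASE)

def find_possible_duplicates(domain, existing_urls):
    # Return the stored URLs that appear to point at the same domain.
    pattern = domain_pattern(domain)
    return [u for u in existing_urls if pattern.search(u)]
```

The end of the match is not anchored, so a longer host that merely starts with the same text would also be listed. For a review list that errs on the side of showing too much, that may be acceptable.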


Paul Wilson.
Installations:
http://www.wiredon.net/gt/
Re: duplicate check is not helpful
Where do you add the code you provided?

Re: duplicate check is not helpful
Code:
www.domain.com
www.domain.com/
www.domain.com/index.html
www.domain.com/index.php
http://www.domain.com

to

http://www.domain.com/
The problem with this is that a lot of domain owners give different first pages to different engines to monitor success. Do that, and they'll de-list you.

Also, many sites don't have both www.domain.com and domain.com working. I've seen it both ways: sites where the www. host won't answer, and sites where the non-www. host won't answer.

The only one of the above variations that is safe is the trailing '/'. Adding it never seems to hurt; removing it may or may not cause a redirect error.
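
A conservative version that applies only the trailing '/' rule might look like the Python sketch below. The function name is hypothetical. It leaves the host, the first page, and everything else exactly as submitted.

```python
from urllib.parse import urlsplit, urlunsplit

def add_root_slash(url):
    # Add "/" only when the URL has a scheme and host but an empty path.
    # www vs. non-www hosts and index pages are left as submitted.
    parts = urlsplit(url)
    if parts.scheme and parts.netloc and parts.path == "":
        parts = parts._replace(path="/")
    return urlunsplit(parts)
```

With this rule, http://www.mysite.com and http://www.mysite.com/ compare as equal, while www.mysite.com and mysite.com remain distinct entries.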


