Gossamer Forum
Home : Products : Gossamer Links : Version 1.x :

Duplicate checking REGEX to add '/'

Quote Reply
Duplicate checking REGEX to add '/'
In my continuing exploration of SQL, to fix the database in the way I want it,
I figured I needed a quick and dirty fix for the URLs that ended in / and those
that didn't. The reason is duplicate checking -- I went from no dupes to 33
dupes on a 300 record database after using it.

Code:
SELECT URL FROM Links
WHERE LOWER(URL) RLIKE '.com$|.net$|.org$|.uk$|.au$'

Will find the links and display them that end in the most common domain endings.
You can add more to it just by following the example above. It won't catch the
url's that end with a directory name, but maybe I'll work on that.

To actually update the links, use:

Code:
UPDATE Links
SET URL = CONCAT(URL,'/')
WHERE LOWER(URL) RLIKE '.com$|.net$|.org$|.uk$|.au$'


You might want to try it on a test database first, using the INSERT query
I posted awhile back.


Quote Reply
Re: Duplicate checking REGEX to add '/' In reply to
A minor modification to the WHERE clause to pick out the odd URL that has a mailto: tagged on the end:


WHERE ((LOWER(URL) RLIKE '.com$|.net$|.org$|.uk$|.au$' ) and
( LOWER(URL) NOT RLIKE '@'))
Quote Reply
Re: Duplicate checking REGEX to add '/' In reply to
Pugdog:

Works like a charm- got rid of 60+ out of 10k!

Thanks!

Dave
Quote Reply
Re: Duplicate checking REGEX to add '/' In reply to
I found a bug I can't seem to fix. The regex is matching anything that ends in those terms, interpreting the period as a "match any character" at least for the .com.

So, not only postcards.com but a subdirectory called /postcardscom will also match.

Granted, I didn't escape the '.' in the original regex, but even when I do, I get the same (or bizzare) results.

I've tried \., making sets, using the SQL monitor, using the PHP program, and still can't get it to acknowledge the '.' as a literal, not "any character"

Could this be a bug in DBI? Or MySQL? (Since I've used 2 different interface programs to get to it). Or am I again missing something that should work?

I've read through the "Mastering Regex" book, and tried everything I could find to make the period a literal.

But, I get the same results with (com$|net$|org$|uk$|au$) as I do with (.com$|.net$|.org$|.uk$|.au$) with all sorts of combinations of \. and \\. and such, even tried:

\.(com|net|org|uk|au)$

Which should be anything ending in '.' plus one of those extensions, but it didn't work.

Anyone got an answer???

Quote Reply
Re: Duplicate checking REGEX to add '/' In reply to
are you sure about that?

when i do

Code:
UPDATE Table SET URL = CONCAT(URL,'/')
WHERE URL REGEXP '\.(com|net|org|edu)\.?[^/]*$'

it works for me in a program i made..

jerry

[This message has been edited by widgetz (edited January 15, 2000).]
Quote Reply
Re: Duplicate checking REGEX to add '/' In reply to
btw..

the last bit

\.?[^/]*

searches for the .au .uk .ca .tw

i guess the problem with these are the come.to and i.am ones.. you'd have to add .to .am to (com|net|org|edu|to|am)

jerry
Quote Reply
Re: Duplicate checking REGEX to add '/' In reply to
The regex works, but it's also picking up my Amazon URLs that end in /postcardscom

It's interpreting the period as "any character" which in that case is the 's'


I didn't try REGEXP, I was using RLIKE, I wonder if there is a difference in those routines?





[This message has been edited by pugdog (edited January 15, 2000).]
Quote Reply
Re: Duplicate checking REGEX to add '/' In reply to
This is with mysql. Mysql thinks you are backslashing the next character, so what get's passed internally to mysql is:

UPDATE Links SET URL = CONCAT(URL,'/') WHERE LOWER(URL) RLIKE '.(com$|net$|org$|uk$|jpg$)'

As it just escaped the period. So by the time it gets to the reg expression, the backslash is gone.

What you want to do is double up the backslash as in:

UPDATE Links SET URL = CONCAT(URL,'/') WHERE LOWER(URL) RLIKE '\\.(com$|net$|org$|uk$|au$)'

and that works fine.

Cheers,

Alex
Quote Reply
Re: Duplicate checking REGEX to add '/' In reply to
Ok.... I'll try it again, but I'm really pretty sure I tried that!! It's one of the things that occured to me, but it didn't seem to make a difference...

When I modified the routine to use REGEXP, it seemed to work. I haven't fully checked them side by side, but it very well might be a buglet with RLIKE vs REGEXP. They may not actually be the same subroutines.

It's on my list of things to double check as I catch up with adding to the FAQ -- got hit by the non-flu-flu-bug and don't feel like doing much of anything at all for the past few days.