Gossamer Forum
Stripping duplicate Keywords

Hi, dear friends

Normally this would be a links-2 topic, but I think it fits better in the CGI forum.
Please correct me if I am wrong!

I created a field (in ADD your link) where users can enter (or modify)
the keywords for their submitted link.
Field URL | Field Title | Field description | Field Keywords | ... etc

Even though I ask users not to repeat the same word, it does not help,
and this field is picking up a lot of extra kilobytes from
repeated words such as:

a/ london news, news from london, london news online, updated london news
b/ toys, toys, toys, exporter, exporter, exporters, exporters, toy exporter
c/ travel info travel guide travel agent book a travel travel portal
d/ recipes of food, food recipes, my food recipes, the best recipes
e/ london link page links to london links from london the london top links

Editing by hand is becoming impossible!

So what I am looking for is a regex or CGI routine that could clean up
this field by removing the surplus words, using the following rules:

For every word that is in the field Keywords:
1. words shorter than 2 characters are removed
2. all duplicate words are removed
3. if the Title or Description field already contains the word,
then the word is also removed from the Keywords field.

Regards,
Sanuk
Re: [sanuk] Stripping duplicate Keywords
Phew, this is a little tricky but here goes...

Code:
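# Get rid of 1 or 2 letter words.
# (The request says shorter than 2 characters; use {1} for that.)
$in{keywords} =~ s/\b[a-z]{1,2}\b//ig;

# Strip duplicates, ignoring case. Match the words themselves:
# split /\b/ would also return the spaces and commas between them.
my %seen;
my @uniques = grep { !$seen{lc $_}++ } $in{keywords} =~ /\w+/g;

# If a word exists in the title or description then remove it.
# Filter with grep: splicing @uniques inside a foreach over it skips elements.
@uniques = grep {
    $in{title} !~ /\b\Q$_\E\b/i and $in{description} !~ /\b\Q$_\E\b/i
} @uniques;

# Put the cleaned list back into the field.
$in{keywords} = join ' ', @uniques;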
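
For anyone who wants to try the three rules outside the script, here is a minimal standalone sketch. The sub name clean_keywords, its $min_len parameter, and the sample title and description are made up for illustration; only the keyword string comes from example a/ in the question. It lower-cases every word and compares whole words only, so "toy" and "toys" still count as different words.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the cleaned keywords as one space-separated string.
# $min_len is an assumed parameter: words shorter than this are dropped (default 2).
sub clean_keywords {
    my ($keywords, $title, $description, $min_len) = @_;
    $min_len = 2 unless defined $min_len;

    # Words already used in the title or description, lower-cased.
    my %used = map { ( lc($_) => 1 ) } "$title $description" =~ /\w+/g;

    my %seen;
    my @kept = grep {
        length($_) >= $min_len    # rule 1: drop very short words
          && !$seen{$_}++         # rule 2: drop duplicates
          && !$used{$_}           # rule 3: drop words in title/description
    } map { lc $_ } $keywords =~ /\w+/g;

    return join ' ', @kept;
}

# Sample title and description are assumptions, not from the original post.
print clean_keywords(
    'london news, news from london, london news online, updated london news',
    'London News',
    'Daily news from the city',
), "\n";    # prints "online updated"
```

Using a hash lookup for the title and description words avoids running one regex per keyword, and keeping the first occurrence of each word preserves the order the user typed them in.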

Last edited by Paul: Dec 13, 2002, 1:45 AM
Re: [Paul] Stripping duplicate Keywords
Hi Paul,
Thanks for the answer. I will try it out and let you know.
I have had a little trouble getting online the last 2 days due to bad telephone lines.
Regards,
Sanuk