Gossamer Forum
Home : General : Perl Programming :

If it's for me, then it's OK else NOT

Quote Reply
If it's for me, then it's OK else NOT
Hi,

Running a directory with parts related to "service clubs",
we are getting a lot of "Non related" and "pornographic" submissions

When someone's submit a site it is immediately fetched with
LWP::UserAgent
and I Have the content of the page in a string "$pagecontent",
and already stripped from all the HTML tags.

I would like to run 2 tests on "$pagecontent" to see if the page is acceptable
But I am clueless how to start with this.

A. 1ste TEST - does the page contain occurrences of these words:
lets say we have a list of words that are related to our topics:
"charity charities donation donations service services etc..etc"
I want to find if "$pagecontent" contains at least 5 occurrences of ANY of the above words
if we have 5x ANY of the above words, then result = "OK RELATED"
if we don't have 5x ANY of the above words, then result = "NOT RELATED"

B. 2ste TEST - does the page contain occurrences of these Avoid-Words:
lets say we have the following list of Avoid-Words:
"nude naked sex porno etc..etc"
I want to find if "$pagecontent" contains 4x ANY of these words
if we have 4x ANY of the Avoid-Words, then result = "XXX TEXT"
if we have less than 4x ANY Avoid-Words, then result = "OK TEXT"

Regards,

Sanuk

Quote Reply
Re: [sanuk] If it's for me, then it's OK else NOT In reply to
Code:
my (%bad, %good);
my $temp = $pagecontent;
my @good = qw/charity charities donation donations/;
my @bad = qw/sex porn nude/;

foreach my $good (@good) {
my $total = ($temp =~ s/\b$good\b//ig);
$good{$good} = $total;
}

if (grep $_ < 5, values %good) {
# NOT RELATED
}


foreach my $bad (@bad) {
my $total = ($temp =~ s/\b$bad\b//ig);
$bad{$bad} = $total;
}

if (grep $_ > 4, values %bad) {
# XXX TEXT
}

Untested so it will probably not work but it may give you an idea ;)

Last edited by:

Paul: May 31, 2003, 3:17 AM
Quote Reply
Re: [Paul] If it's for me, then it's OK else NOT In reply to
Hi,

Thank you Paul for the code example.

I tried it out with some changing but it does not work.

I always get "YES" no matter what test-text I check

I also tried to get the number of good and bad words by:

$QT_good = "grep $_";

$QT_bad = "grep $_";

but this gives errors, so I took out again

This is the code that always give "YES"

############################

my (%bad, %good);
my $temp = $pagecontent;
my $topic_related;
my $bad_related;
my @good = qw/charities charity wellfare donation/;
my @bad = qw/sex porn nude xxx/;

foreach my $good (@good){
my $total = ($temp =~ s/\b$good\b//ig);
$good{$good} = $total;
}

if (grep $_ < 5, values %good){$topic_related = "YES";}
else {$topic_related = "NO";}

foreach my $bad (@bad) {
my $total = ($temp =~ s/\b$bad\b//ig);
$bad{$bad} = $total;
}

if (grep $_ > 4, values %bad){$bad_related = "YES";}
else {$bad_related = "NO";}

############################

Regards,

Sanuk
Quote Reply
Re: [sanuk] If it's for me, then it's OK else NOT In reply to
You have a few bits mixed up. You have YES set for content that should be set as unrelated.

Anyway try the attached...
Quote Reply
Re: [Paul] If it's for me, then it's OK else NOT In reply to
Hi Paul,

Thank You very much for trying to help me with this "Mission Impossible"

But the script does not work,

It just gives me always the same result, even I change some words inside "my $page".

And if it would work, how would I be able to export the GOOD or BAD result to a $firsttest and $secondtest string, as to further do an if or else statement such as:

if $firsttest == "YES" {&dothisthing;}

else {&dotheotherthing;}

if $secondtest == "YES" {&gothere;}

else {&gotheotherway;}

Sorry Mr. Paul, I am using your time and making problem for you with something very complicated.



#### Paul script start ###############

#!/usr/bin/perl
#======================================
use CGI::Carp qw(carpout fatalsToBrowser);
main();
#======================================sub main {
#--------------------------------------
# Test grep.
my $page = 'good gooda goodb goodc goodd good bad bad1 bad2 bad3 bad4';
my @good = qw/good gooda goodb goodc goodd goode goodd/;
my @baad = qw/bad bad1 bad2 bad3 bad4/;
my %baad = map { $_, ($page =~ s/\b$_\b//ig) } @baad;
my %good = map { $_, ($page =~ s/\b$_\b//ig) } @good;

print qq{Content-type: text/html\n\n};
print grep($_ > 5, values(%good)) ? q{RELATED} : q{NOT RELATED};
print q{<br>};
print grep($_ > 4, values(%baad)) ? q{BADNESS} : q{NOT BADNESS};
}

#### Paul script end ###############

Best Regards,

Sanuk
Quote Reply
Re: [sanuk] If it's for me, then it's OK else NOT In reply to
Sorry

Can not edit the above post

but should be:

#======================================

sub main {
#--------------------------------------
# Test grep.

Thank you

Sanuk
Quote Reply
Re: [sanuk] If it's for me, then it's OK else NOT In reply to
I'm not sure what to say...it works correctly for me Unimpressed

What results are you getting, what were you expecting and what was the content of $page for the test?

Last edited by:

Paul: Jun 1, 2003, 5:34 AM
Quote Reply
Re: [Paul] If it's for me, then it's OK else NOT In reply to
Hi Paul,

In your example, using:


$page = 'good good good good good good bad bad bad bad bad';

my @good = qw/good/;

There it works, it produces "RELATED" because you have 6 "Good" words

After I changed to:

$page = 'good gooda goodb goodc goodd goode good bad bad bad bad bad';

my @good = qw/good gooda goodb goodc goodd goode/;

We even have 7 "Good" words in the text named $page

but here it does not work anymore and produces: "NOT RELATED"

Best Regards,

Sanuk

Quote Reply
Re: [sanuk] If it's for me, then it's OK else NOT In reply to
Oops, my code checks for 5 occurances of the same word.

Try adding:

Code:
my $good_total;
$good_total += $_ for (values %good);

my $baad_total;
$baad_total += $_ for (values %baad);

Then use $good_total and $baad_total to determine if it's related or not, or porn or not.

Last edited by:

Paul: Jun 1, 2003, 1:26 PM
Quote Reply
Re: [Paul] If it's for me, then it's OK else NOT In reply to
Hi,

Thanks alot for the help Paul.

Now works OK !!!

Here is how I changed on your advice:

#==== Paul's good-bad Start ===================
use CGI::Carp qw(carpout fatalsToBrowser);
my $good_total; my $baad_total;
my $page = 'good gooda goodb goodc bas bada basb badc';
my @good = qw/good gooda goodb goodc goodd goode goodd/;
my @baad = qw/bad bada badb badc/;
my %baad = map { $_, ($page =~ s/\b$_\b//ig) } @baad;
$baad_total += $_ for (values %baad);
my %good = map { $_, ($page =~ s/\b$_\b//ig) } @good;
$good_total += $_ for (values %good);
print "Content-type: text/html\n\n";
print "<br> Total Good: $good_total\n";
print "<br> Total Bad: $baad_total\n";
#==== Paul's good-bad End =====================

Only one more question,

Do I need to add "my" every where like:

my @baad = qw/bad bada badb badc/;
or can I just put:

@baad = qw/bad bada badb badc/;

Again thanks a lot.

Regards,

Sanuk
Quote Reply
Re: [sanuk] If it's for me, then it's OK else NOT In reply to
my() is strict compatible...you don't *have* to use it, but it will only lead to better code in the future :)