Hi,
We run a directory with sections related to "service clubs",
and we are getting a lot of unrelated and pornographic submissions.
When someone submits a site, it is immediately fetched with
LWP::UserAgent,
and I have the content of the page in a string "$pagecontent",
already stripped of all HTML tags.
I would like to run 2 tests on "$pagecontent" to see if the page is acceptable,
but I am clueless about how to start.
A. 1st TEST - does the page contain occurrences of these words:
Let's say we have a list of words that are related to our topics:
"charity charities donation donations service services etc..etc"
I want to find out whether "$pagecontent" contains at least 5 occurrences of ANY of the above words.
If we have 5 or more occurrences of ANY of the above words, then result = "OK RELATED"
If we have fewer than 5 occurrences, then result = "NOT RELATED"
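
One possible way to approach test A, as a minimal sketch rather than a definitive answer: build a single case-insensitive regex from the word list and count its matches. It assumes "5 occurrences of ANY of the words" means 5 matches in total across the whole list, not 5 of one particular word. The sub name count_hits and the sample text are made up for illustration; only $pagecontent, the word list, and the threshold come from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: count how many times any word from @$words
# occurs in $text as a whole word, ignoring case.
sub count_hits {
    my ($text, $words) = @_;
    return 0 unless @$words;    # an empty list would match everywhere

    # quotemeta guards against regex metacharacters in the word list.
    my $alternation = join '|', map { quotemeta } @$words;
    my $re = qr/\b(?:$alternation)\b/i;

    # A /g match in list context returns every match; assigning that
    # list to () and then to a scalar yields the number of matches.
    my $count = () = $text =~ /$re/g;
    return $count;
}

my @related = qw(charity charities donation donations service services);

# Sample text standing in for the real, tag-stripped page content.
my $pagecontent = 'Our charity accepts donations. Every donation funds a '
                . 'service for local charities and other services.';

my $result = count_hits($pagecontent, \@related) >= 5
    ? 'OK RELATED'
    : 'NOT RELATED';
print "$result\n";
```

The \b anchors keep "service" from matching inside a longer word such as "serviceable"; drop them if substring matches should count as well.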
B. 2nd TEST - does the page contain occurrences of these Avoid-Words:
Let's say we have the following list of Avoid-Words:
"nude naked sex porno etc..etc"
I want to find out whether "$pagecontent" contains 4 or more occurrences of ANY of these words.
If we have 4 or more occurrences of ANY of the Avoid-Words, then result = "XXX TEXT"
If we have fewer than 4 occurrences of the Avoid-Words, then result = "OK TEXT"
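
For test B, one option (again only a sketch under assumptions) is to split the text into lowercase words and look each one up in a hash of Avoid-Words, which stays fast even when the list grows long. The sub name count_listed_words and the sample text are hypothetical, and the count is assumed to be the total across all Avoid-Words.

```perl
use strict;
use warnings;

# Avoid-Words from the question, stored as hash keys for fast lookup.
my %avoid = map { $_ => 1 } qw(nude naked sex porno);

# Hypothetical helper: split $text into lowercase words and count
# those that appear as keys in the %$lookup hash.
sub count_listed_words {
    my ($text, $lookup) = @_;
    my $count = 0;
    for my $word (lc($text) =~ /\w+/g) {
        $count++ if $lookup->{$word};
    }
    return $count;
}

# Sample text standing in for the real, tag-stripped page content.
my $pagecontent = 'Nude pics, naked pics, SEX videos and more sex here.';

my $hits   = count_listed_words($pagecontent, \%avoid);
my $result = $hits >= 4 ? 'XXX TEXT' : 'OK TEXT';
print "$result ($hits hits)\n";
```

Because only whole words are compared, place names such as "Middlesex" or "Essex" are not counted, which a plain substring search would get wrong.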
Regards,
Sanuk