Gossamer Forum

Re: Find most common words in database for stop words?

This is interesting...

Have you looked at how the search works, though? It's an indexed search: the only words that are searched are the ones in the index table. You could apply the COUNT function to that table to see how many times any individual word appears, or you could run a simple script that iterates over the table and inserts the words into a "common_words" table, using SELECT/INSERT statements like the ones in jump.cgi for the hit tracking. If the record exists, increment the count; if it doesn't, insert it. Then you'd have a table of the most common words, and with some modification to the "add" record and "re-index" routines, you could have this table built or rebuilt every time a new record is added.
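The "increment if it exists, insert if it doesn't" step could look roughly like the sketch below. It uses Python's sqlite3 module purely as a stand-in for the Perl/SQL setup; the column names Word and Hits and the helper names bump_word and build_common_words are assumptions for illustration, not anything taken from the Links SQL code.

```python
import sqlite3

def bump_word(conn, word):
    """Increment the count for `word`; insert it with a count of 1 if absent."""
    cur = conn.execute(
        "UPDATE common_words SET Hits = Hits + 1 WHERE Word = ?", (word,)
    )
    if cur.rowcount == 0:  # no existing row was updated, so add one
        conn.execute(
            "INSERT INTO common_words (Word, Hits) VALUES (?, 1)", (word,)
        )

def build_common_words(conn, indexed_words):
    """Walk the words from the index table and tally them in common_words."""
    # Hypothetical schema: one row per distinct word.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS common_words "
        "(Word TEXT PRIMARY KEY, Hits INTEGER NOT NULL)"
    )
    for word in indexed_words:
        bump_word(conn, word.lower())
    conn.commit()
```

Here indexed_words would be whatever comes back from iterating the index table, and hooking bump_word into the "add" and "re-index" routines is what would keep the table current.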

Then, during a "build", you could load the top x records (SELECT * FROM common_words ORDER BY Hits DESC LIMIT 100) and insert them into a hash for dynamic access, or into a table for "static" access (i.e. from search.cgi).
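As a rough sketch of the "load the top x into a hash" idea, again with Python's sqlite3 as a stand-in: the function name load_top_words and the Word/Hits columns are assumptions, and the query sorts descending so the most frequent words come first.

```python
import sqlite3

def load_top_words(conn, limit=100):
    """Return the `limit` most frequent words as a set for fast lookups."""
    rows = conn.execute(
        "SELECT Word FROM common_words ORDER BY Hits DESC LIMIT ?", (limit,)
    )
    # A set plays the role of the hash: membership tests are constant time.
    return {word for (word,) in rows}
```

During a search, a check like `term in top_words` would then decide whether to skip a term.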

You could load this list of words into the "stopwords" list, so that you'd have a list of words that are _always_ stop words, plus the top 100 terms, _or_ even any word that occurred in more than a given percentage of records....
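Both variants (fixed list plus top N, or fixed list plus anything over a percentage of records) can be sketched in a few lines. The function name build_stopwords, its parameters, and the sample numbers in the usage note are made up for illustration; record_counts is assumed to map each word to the number of records containing it.

```python
def build_stopwords(always, record_counts, total_records,
                    top_n=100, max_fraction=None):
    """Combine the fixed stop words with the most common indexed words.

    With max_fraction unset, add the top_n most common words.
    With max_fraction set (e.g. 0.3), add every word that appears in
    more than that fraction of the records instead.
    """
    stop = set(always)
    if max_fraction is None:
        ranked = sorted(record_counts, key=record_counts.get, reverse=True)
        stop.update(ranked[:top_n])
    else:
        cutoff = max_fraction * total_records
        stop.update(w for w, n in record_counts.items() if n > cutoff)
    return stop
```

For example, with counts of {"links": 90, "perl": 40, "cgi": 5} over 100 records, a max_fraction of 0.3 would add "links" and "perl" to the fixed list but leave "cgi" searchable.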

To do this, you'd need to structure a "common_word_hits" table the same way the hit tracking checks for the link and IP, except that here you'd check for the word and record ID. If the word/record combination DOESN'T already exist, you increment the common_words "hits" record, so you are counting only the number of RECORDS that contain any particular word, not the number of times that word appears in the database .... (does that make sense??)
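A minimal sketch of that per-record counting, assuming the pair being tracked is the word plus the ID of the record it came from (the role the IP plays in the hit tracking). The table layout, the RecordID column, and the function name count_word_in_record are all hypothetical, and sqlite3 is only a stand-in for the real database.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS common_words (
    Word TEXT PRIMARY KEY,
    Hits INTEGER NOT NULL
);
CREATE TABLE IF NOT EXISTS common_word_hits (
    Word     TEXT    NOT NULL,
    RecordID INTEGER NOT NULL,
    PRIMARY KEY (Word, RecordID)
);
"""

def count_word_in_record(conn, word, record_id):
    """Count `word` once per record; return True if the count changed."""
    seen = conn.execute(
        "SELECT 1 FROM common_word_hits WHERE Word = ? AND RecordID = ?",
        (word, record_id),
    ).fetchone()
    if seen:
        return False  # this record was already counted for this word
    conn.execute(
        "INSERT INTO common_word_hits (Word, RecordID) VALUES (?, ?)",
        (word, record_id),
    )
    cur = conn.execute(
        "UPDATE common_words SET Hits = Hits + 1 WHERE Word = ?", (word,)
    )
    if cur.rowcount == 0:  # first record ever seen for this word
        conn.execute(
            "INSERT INTO common_words (Word, Hits) VALUES (?, 1)", (word,)
        )
    return True
```

After conn.executescript(SCHEMA), calling this for every (word, record) pair in the index leaves Hits holding the number of records that contain each word, which is the figure a percentage cutoff would need.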

Shouldn't be too hard, but this is _not_ something to tackle until the specifics of the new search are released, since you'll have to work with it sort of intimately.

http://www.postcards.com
FAQ: http://www.postcards.com/FAQ/LinkSQL/

Thread index (subject, author, views, date):
Thread: Find most common words in database for stop words?, JerryP, 1094, Jun 12, 2000, 2:38 PM
Thread: Re: Find most common words in database for stop words?, pugdog, 1044, Jun 12, 2000, 6:46 PM
Post: Re: Find most common words in database for stop words?, JerryP, 1035, Jun 12, 2000, 10:22 PM