Gossamer Forum

Find most common words in database for stop words?

I have searched the forums for this but haven't found anything like it, unless I overlooked it. I was wondering if anyone has done anything that can be used as part of the admin process to get the top 100 words in the database. The reason I ask is that I'm putting together a sports directory, and if someone searched on, say, "football" or "nfl", they would get a huge list of categories and links, to the point that it's probably unusable. What I was hoping to do was search the database, say monthly, for the top 100 words (title, description, etc.) and then selectively (and manually) add some of those to my stop words. I'd probably add some extra code to the <%if ignored%> tag that explains what I'm doing...

To make it more complicated, what would be great would be:
- ignore the word if it's by itself
- if it's combined with other words in the search, then force a + sign in front of each word to force an AND search, because that's probably what the visitor wanted anyway. For example, if they searched on "nfl news", I doubt they would want sites that had "curling news" just because "news" was in the search. It would also be cool if the search knew that a stop word was used, so I could output an explanation of what I did with the search and not totally confuse the visitor.
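A rough sketch of that query pre-processing, in Python purely for illustration (the actual Links search code is Perl; the stop-word set and query format here are made up):

```python
# Hypothetical sketch, not Links SQL code: pre-process a search query
# against a manually curated stop-word list, per the two rules above.
STOP_WORDS = {"football", "nfl", "news"}  # example curated list

def rewrite_query(query: str):
    """Return (rewritten_query, ignored_words):
    - a query that is just a single stop word is dropped entirely;
    - otherwise every term gets a leading '+' to force an AND search."""
    terms = query.split()
    ignored = [t for t in terms if t.lower() in STOP_WORDS]
    if len(terms) == 1 and ignored:
        return "", ignored  # lone stop word: nothing left to search
    # Force AND semantics by requiring every term
    rewritten = " ".join("+" + t for t in terms)
    return rewritten, ignored
```

Returning the `ignored` list separately is what would let a template explain to the visitor which words were treated specially.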

Jerry



Re: Find most common words in database for stop words?
This is interesting...

Have you looked at how the search works, though? It's an indexed search: the only words that are searched are the ones in the indexed table. You could apply the COUNT function to that table to see how many times any individual word appears, or you could run a simple script that iterates over the table and inserts the words into a "common_words" table, using SELECT/INSERT statements like the ones jump.cgi uses for hit tracking: if the record exists, increment its count; if it doesn't, insert it. Then you'd have a table of the most common words, and with some modification to the "add record" and "re-index" routines, you could have this table rebuilt every time a new record was added.
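The insert-or-increment loop described above might look like this. This is a minimal sketch using SQLite so it's self-contained; the table and column names are assumptions, not the actual Links SQL schema:

```python
import sqlite3

# Self-contained demo of the insert-or-increment idea; "Word_Index" and
# "common_words" are assumed names, NOT the real Links SQL schema.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Word_Index (Word TEXT)")  # stand-in for the indexed table
con.execute("CREATE TABLE common_words (Word TEXT PRIMARY KEY, Hits INTEGER)")

con.executemany("INSERT INTO Word_Index VALUES (?)",
                [("nfl",), ("news",), ("nfl",), ("curling",)])

# Iterate the indexed words: increment if present, insert if not.
for (word,) in con.execute("SELECT Word FROM Word_Index").fetchall():
    cur = con.execute("UPDATE common_words SET Hits = Hits + 1 WHERE Word = ?",
                      (word,))
    if cur.rowcount == 0:  # no row was updated, so the word is new
        con.execute("INSERT INTO common_words VALUES (?, 1)", (word,))

top = con.execute("SELECT Word, Hits FROM common_words ORDER BY Hits DESC").fetchall()
print(top)  # "nfl" counted twice, the others once
```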

Then, during a "build" you could load the top x records (SELECT * FROM common_words ORDER BY Hits DESC LIMIT 100) and insert them into a hash for dynamic access, or a table for "static" access (ie: search.cgi).

You could load this list of words into the "stopwords" list, and then have a list of words that were _always_ stop words, plus the top 100 terms, _or_ even any word that occurred in more than a given percentage of records...

To do this, you'd need a "common_word_hits" table where you check for the word and record ID; if that word/record combination DOESN'T already exist, you increment the common_words "hits" record. That way you are counting only the number of RECORDS that contain a particular word, not the number of times that word appears in the database... (does that make sense??)
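That per-record de-duplication can also be expressed as a single aggregate query, counting DISTINCT record IDs per word, so a word that appears five times in one link still counts once. Again a hedged sketch with an assumed schema, not the real one:

```python
import sqlite3

# Assumed schema: one row per (word, link) occurrence in the word index.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Word_Index (Word TEXT, LinkID INTEGER)")
con.executemany("INSERT INTO Word_Index VALUES (?, ?)",
                [("news", 1), ("news", 1),   # "news" twice in link 1
                 ("news", 2), ("nfl", 1)])

# COUNT(DISTINCT LinkID) counts records containing the word, not raw
# occurrences, which is exactly the de-duplication described above.
rows = con.execute(
    "SELECT Word, COUNT(DISTINCT LinkID) AS Records "
    "FROM Word_Index GROUP BY Word ORDER BY Records DESC"
).fetchall()
print(rows)  # "news" is in 2 records, "nfl" in 1
```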

Shouldn't be too hard, but this is _not_ something to tackle until the specifics of the new search are released, since you'll have to work with it sort of intimately.

http://www.postcards.com
FAQ: http://www.postcards.com/FAQ/LinkSQL/

Re: Find most common words in database for stop words?
Actually, I think that pointed me exactly where I needed to go...

Two simple SQL statements will get me what I think I want:

SELECT * FROM Links_Word_Index ORDER BY Frequency DESC LIMIT 100;
and
SELECT * FROM Category_Word_Index ORDER BY Frequency DESC LIMIT 100;

I didn't realize these tables track the Frequency of the word, which I'm assuming is how often it appears in the category and in the title and description (the only two fields I have weighted in the Links table defs).
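One way to support the manual pick-and-choose step is to run that frequency query and list only the words that aren't already stop words. A small illustrative sketch: the table and column names mirror the queries in this thread, but the data and stop-word list are invented:

```python
import sqlite3

STOP_WORDS = {"the", "and"}  # hypothetical existing stop-word list

# Fake stand-in for the real Links_Word_Index table, for demonstration.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Links_Word_Index (Word TEXT, Frequency INTEGER)")
con.executemany("INSERT INTO Links_Word_Index VALUES (?, ?)",
                [("football", 120), ("the", 400), ("league", 35)])

# The query from the post above, then filter out current stop words so
# only new candidates are left for manual review.
top = con.execute(
    "SELECT Word, Frequency FROM Links_Word_Index "
    "ORDER BY Frequency DESC LIMIT 100"
).fetchall()
candidates = [(w, f) for (w, f) in top if w not in STOP_WORDS]
print(candidates)  # most frequent non-stop-words first
```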

Now I can pick and choose the words I want to include in my stop words. I haven't figured out that part yet, but at least now I know how to get the words...

Thanks much.

Jerry