
Splitting a string into dictionary words

I need to split a string into dictionary words, ignoring whitespace (the easy part <G>).

The simplest example might be how a domain name like bluecargadgets would be parsed into "blue", "car", and "gadgets", and perhaps "gets" and a few others.

Anagram or "jumble" logic doesn't really fit, since I'm not interested in using all the letters, just in finding words that already exist, in order, in the string.

I can see a few potential "brute force" methods, but there has to be something more elegant than a loop -> shift -> chop -> next format.

bluecargadgets
bluecargadget
bluecargadge
bluecargadg
bluecargad
bluecarga
...
blue <- hit
...
b
luecargadgets
luecargadget
...
uecargadg
..
ecarg
...
...
car <- hit
...
gadgets <- hit
gadget <- hit
gets <- hit
get <- hit

etc.
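
For reference, the loop above can be sketched in a few lines of Perl. This is only a minimal sketch under some assumptions: the %dictionary hash is a made-up stand-in for a real word list or spell checker, find_words is a hypothetical name, and the 3-letter minimum is an arbitrary cutoff to skip one- and two-letter noise.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a real dictionary lookup (a word list, ispell, aspell...).
my %dictionary = map { $_ => 1 } qw(blue car gadget gadgets get gets);

# Return every dictionary word found, in order, inside $string,
# as [word, start offset] pairs.
sub find_words {
    my ($string, $dict, $min_len) = @_;
    $min_len = 3 unless defined $min_len;    # assumed cutoff, not from the post
    my @hits;
    my $len = length $string;
    for my $start (0 .. $len - $min_len) {
        # Try the longest candidate first, then chop one character at a time.
        for (my $size = $len - $start; $size >= $min_len; $size--) {
            my $candidate = substr $string, $start, $size;
            push @hits, [$candidate, $start] if $dict->{$candidate};
        }
    }
    return @hits;
}

my @hits = find_words('bluecargadgets', \%dictionary);
print join(' ', map { $_->[0] } @hits), "\n";    # blue car gadgets gadget gets get
```

It is still brute force: every start position is paired with every length, so the number of lookups grows with the square of the string length.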

How do Google or the keyword parsers do it? Any ideas?


PUGDOG Enterprises, Inc.

The best way to contact me is to NOT use Email.
Please leave a PM here.
Re: [pugdog] Splitting a string into dictionary words
Well, after a few hours, a whole bunch of upgrades to the server, and modules, and some head banging and scratching, the brute force method works, if a bit slowly. There are still a few bugs and quirks, and there is *no* interface (you have to format the input ahead of time).

Just a moderate change to keyword lists (breaking up compound words) has made an improvement in how Google (and probably other spiders) sees the pages. Some keywords are totally irrelevant (e.g., "events" parses to "even", "event", "vent" and "vents"), so I'm looking for some sort of fuzzy logic that can help remove turkey keywords, but I'm still happy. 10,000+ items were reduced to their components rather quickly.
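
One possible first cut at weeding out turkey keywords, offered as an assumption rather than what the tool actually does: drop any hit that sits entirely inside a longer hit. The drop_contained name and the [word, start offset] pair format are hypothetical, chosen only for this sketch.

```perl
use strict;
use warnings;

# Keep only hits that are not fully covered by a longer hit.
# Each hit is a [word, start offset] pair.
sub drop_contained {
    my @hits = @_;
    my @kept;
    for my $hit (@hits) {
        my ($word, $start) = @$hit;
        my $end = $start + length $word;
        # Count the longer hits whose span encloses this one.
        my $covered = grep {
            my ($other, $other_start) = @$_;
            length($other) > length($word)
                && $other_start <= $start
                && $other_start + length($other) >= $end;
        } @hits;
        push @kept, $hit unless $covered;
    }
    return @kept;
}

my @hits = (['events', 0], ['event', 0], ['even', 0], ['vents', 1], ['vent', 1]);
print join(' ', map { $_->[0] } drop_contained(@hits)), "\n";    # events
```

The trade-off is that a genuine compound that is itself in the dictionary keeps only the whole word and loses its parts, so this is a blunt filter, not fuzzy logic.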

I might pack it up into a tool if there is interest, but whether it will be a Links plugin or not I don't know. It needs to access ispell or aspell, and requires a few other Perl modules from CPAN.

Might become part of the UltraWidgets Toolbox package, since that way you can choose to blow your server up on your own ;)

Ispell is pretty powerful. Never played with it before, and now the aspell program claims to be a significant improvement, so I'll probably be toying with that a little.

Text::Ispell makes interfacing with it really easy, in case anyone was wondering.

