
chris.were at gmail
Nov 8, 2009, 5:11 PM
Post #4 of 4
Thanks for the tips guys, got it working now.

Cheers,
Chris

On Sun, Nov 8, 2009 at 6:43 AM, Erick Erickson <erickerickson [at] gmail> wrote:

> <<<I am using the StandardAnalyzer as most of the other fields being indexed
> are free form text.>>>
>
> If you try Ahmet's suggestion, PerFieldAnalyzerWrapper is your friend. The
> snippet above makes me wonder if you've seen this class...
>
> Best
> Erick
>
> On Sun, Nov 8, 2009 at 5:54 AM, AHMET ARSLAN <iorixxx [at] yahoo> wrote:
>
> > > Hi,
> > >
> > > How do I go about indexing domain names? I currently index the domain,
> > > but it only works if I put the exact full domain in. For example:
> > >
> > > site:www.youtube.com (this works)
> > > site:youtube.com (this doesn't work)
> > >
> > > I am using the StandardAnalyzer as most of the other fields being
> > > indexed are free form text. Currently the "site" field is stored and
> > > tokenized.
> >
> > StandardTokenizer recognizes www.youtube.com and youtube.com each as a
> > single token. Therefore they do not match. You can use SimpleAnalyzer,
> > which uses LetterTokenizer. So
> >
> > www.youtube.com will be broken into three tokens: www youtube com
> > youtube.com will be broken into two tokens: youtube com
> >
> > By doing so, site:youtube.com will bring you www.youtube.com.
> >
> > But the query site:youtube.com will also match a document like
> > www.foo.com/youtube.com.
> >
> > Note that LetterTokenizer uses the Character.isLetter() method to break
> > text. If your input has numbers, like www.645cafe.com, it will cause you
> > problems.
> >
> > In your case it is better to extend CharTokenizer and override the
> > protected boolean isTokenChar(char c) method according to your needs.
> >
> > > As an additional improvement it would be even better if something like
> > > this worked:
> > >
> > > site:youtube.com/foo
> >
> > To accomplish this, you can pre-process your queries to strip everything
> > from the first '/' char to the end. You need to convert
> > youtube.com/foo/bla/bla to youtube.com.
> > You can do it in a TokenFilter along with KeywordTokenizer by writing
> > custom code.
> >
> > Hope this helps.
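
For anyone finding this thread later, here is a minimal plain-Java sketch of the two ideas Ahmet describes: splitting a domain on anything that is not a letter or digit (the rule one might put in an overridden isTokenChar), and stripping a query from the first '/' onward. It deliberately does not use Lucene classes. The class name DomainTokens, the helpers tokenize, stripPath and matches, the lowercasing, and the "contiguous run of tokens" matching rule are all assumptions made for illustration, not Lucene's API or its actual matching behaviour.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class DomainTokens {

    // Same idea as overriding isTokenChar(char c): letters AND digits stay
    // inside a token; everything else (dots, slashes, ...) ends the token.
    static boolean isTokenChar(char c) {
        return Character.isLetterOrDigit(c);
    }

    // Split text into lowercased tokens using isTokenChar (hypothetical helper).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Query pre-processing: youtube.com/foo/bla/bla -> youtube.com
    static String stripPath(String site) {
        int slash = site.indexOf('/');
        return slash < 0 ? site : site.substring(0, slash);
    }

    // Simplified stand-in for matching (an assumption): the query tokens must
    // appear as a contiguous run inside the indexed tokens.
    static boolean matches(String indexed, String query) {
        return Collections.indexOfSubList(tokenize(indexed), tokenize(stripPath(query))) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("www.youtube.com"));                      // [www, youtube, com]
        System.out.println(tokenize("www.645cafe.com"));                      // [www, 645cafe, com]
        System.out.println(stripPath("youtube.com/foo/bla/bla"));             // youtube.com
        System.out.println(matches("www.youtube.com", "youtube.com/foo"));    // true
        System.out.println(matches("www.foo.com/youtube.com", "youtube.com")); // true (the false positive noted above)
    }
}
```

Keeping digits inside tokens is what avoids the www.645cafe.com problem that a letters-only rule would have. The last line of main shows that the www.foo.com/youtube.com false positive is still there, so if that matters, the indexed value probably needs the same stripPath treatment before tokenizing.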