
Mailing List Archive: Lucene: Java-User

Indexing domain names?

 

 



chris.were at gmail

Nov 7, 2009, 10:20 PM

Post #1 of 4 (538 views)
Indexing domain names?

Hi,

How do I go about indexing domain names? I currently index the domain, but
it only works if I put the exact full domain in. For example:

site:www.youtube.com (this works)
site:youtube.com (this doesn't work)

I am using the StandardAnalyzer as most of the other fields being indexed
are free form text. Currently the "site" field is stored and tokenized.

As an additional improvement it would be even better if something like this
worked:

site:youtube.com/foo

Cheers,
Chris


iorixxx at yahoo

Nov 8, 2009, 2:54 AM

Post #2 of 4 (483 views)
Re: Indexing domain names? [In reply to]

> Hi,
>
> How do I go about indexing domain names? I currently index
> the domain, but
> it only works if I put the exact full domain in. For
> example:
>
> site:www.youtube.com (this works)
> site:youtube.com (this doesn't work)
>
> I am using the StandardAnalyzer as most of the other fields
> being indexed
> are free form text. Currently the "site" field is stored
> and tokenized.

StandardTokenizer recognizes www.youtube.com and youtube.com each as a single token. Because those two tokens differ, they do not match. You can use SimpleAnalyzer, which uses LetterTokenizer, so:

www.youtube.com will be broken into three tokens: www youtube com
youtube.com will be broken into two tokens: youtube com

That way, site:youtube.com will find www.youtube.com.

But the query site:youtube.com will also match a document like www.foo.com/youtube.com.

Note that LetterTokenizer uses the Character.isLetter() method to decide where to break text, so digits act as separators. If your input has numbers, as in www.645cafe.com, it will cause you problems (645cafe is indexed as just cafe).

In your case it is better to extend CharTokenizer and override the protected boolean isTokenChar(char c) method according to your needs.
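
To make the difference concrete, here is a minimal plain-Java sketch of the splitting rule only. It is not Lucene code: the class and method names (DomainTokenSketch, tokenize, lettersOnly, lettersAndDigits) are made up for illustration, and the predicate argument merely stands in for what an isTokenChar override would decide. It assumes lowercasing is wanted as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Plain-Java illustration of the tokenizing rule; hypothetical names, not Lucene classes.
class DomainTokenSketch {

    // Collects maximal runs of characters accepted by isTokenChar, lowercased.
    static List<String> tokenize(String text, IntPredicate isTokenChar) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar.test(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Letters only: the rule described above for LetterTokenizer.
    static List<String> lettersOnly(String text) {
        return tokenize(text, Character::isLetter);
    }

    // Letters and digits: one possible rule for a custom isTokenChar.
    static List<String> lettersAndDigits(String text) {
        return tokenize(text, Character::isLetterOrDigit);
    }

    public static void main(String[] args) {
        System.out.println(lettersOnly("www.youtube.com"));      // [www, youtube, com]
        System.out.println(lettersOnly("www.645cafe.com"));      // [www, cafe, com]
        System.out.println(lettersAndDigits("www.645cafe.com")); // [www, 645cafe, com]
    }
}
```

With the letters-and-digits rule the digits survive, but www.foo.com/youtube.com still yields the adjacent tokens youtube and com, so the false match mentioned above remains unless the path is stripped first.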

> As an additional improvement it would be even better if
> something like this
> worked:
>
> site:youtube.com/foo

To accomplish this, you can pre-process your queries to strip everything from the first '/' character to the end, so that youtube.com/foo/bla/bla becomes youtube.com.
Alternatively, you can do it with custom code in a TokenFilter used along with KeywordTokenizer.
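
The pre-processing step could look roughly like the sketch below. It is plain Java with hypothetical names (SiteQuerySketch, stripPath), and it assumes the value has no scheme prefix such as http://, since the first '/' would then be part of the scheme.

```java
// Hypothetical helper: cuts a site value at the first '/' before it is indexed or queried.
class SiteQuerySketch {

    // Returns the part before the first '/', or the input unchanged if there is none.
    static String stripPath(String site) {
        int slash = site.indexOf('/');
        return slash < 0 ? site : site.substring(0, slash);
    }

    public static void main(String[] args) {
        System.out.println(stripPath("youtube.com/foo/bla/bla")); // youtube.com
        System.out.println(stripPath("youtube.com"));             // youtube.com
    }
}
```

Applying the same function to the field value at index time and to the query term keeps both sides consistent.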

Hope this helps.






erickerickson at gmail

Nov 8, 2009, 6:43 AM

Post #3 of 4 (489 views)
Re: Indexing domain names? [In reply to]

<<<I am using the StandardAnalyzer as most of the other fields being indexed
are free form text. >>>

If you try Ahmet's suggestion, PerFieldAnalyzerWrapper is your friend. The
snippet above makes me wonder whether you've seen this class.
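
The idea behind per-field analysis can be sketched in plain Java as follows. This is only an illustration of the dispatch, not the Lucene class or its API: PerFieldSketch, addAnalyzer, and the two toy analyzers are hypothetical names, with whitespace splitting standing in for the default analyzer and a split on non-alphanumerics standing in for the "site" analyzer.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Concept sketch: pick an analysis function by field name, with a default fallback.
class PerFieldSketch {
    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> perField = new HashMap<>();

    PerFieldSketch(Function<String, List<String>> defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    void addAnalyzer(String field, Function<String, List<String>> analyzer) {
        perField.put(field, analyzer);
    }

    List<String> analyze(String field, String text) {
        return perField.getOrDefault(field, defaultAnalyzer).apply(text);
    }

    // Lowercases, splits on the given regex, and drops empty pieces.
    static List<String> split(String text, String regex) {
        return Arrays.stream(text.toLowerCase().split(regex))
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        PerFieldSketch wrapper = new PerFieldSketch(t -> split(t, "\\s+"));
        wrapper.addAnalyzer("site", t -> split(t, "[^a-z0-9]+"));

        System.out.println(wrapper.analyze("site", "www.youtube.com"));           // [www, youtube, com]
        System.out.println(wrapper.analyze("body", "Watch www.youtube.com now")); // [watch, www.youtube.com, now]
    }
}
```

Only the "site" field gets the domain-splitting rule; every other field keeps the default behaviour, which is the point of the suggestion above.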

Best
Erick



chris.were at gmail

Nov 8, 2009, 5:11 PM

Post #4 of 4 (466 views)
Re: Indexing domain names? [In reply to]

Thanks for the tips guys, got it working now.

Cheers,
Chris

