
Mailing List Archive: Lucene: Java-User

Keep URLs intact and not tokenized by the StandardTokenizer

 

 



verma.sudha at gmail

Nov 18, 2009, 9:58 PM

Post #1 of 4
Keep URLs intact and not tokenized by the StandardTokenizer

Hi,

I am using Lucene 2.9.1.

I am reading in free-text documents, which I currently index using
Lucene and the StandardAnalyzer.

The StandardAnalyzer keeps email addresses intact and does not tokenize
them. Is there something similar for URLs? This seems like a common
need, so I thought I'd check whether anything out there does it already.

I'd appreciate any help.

Thanks,
sudha


sarowe at syr

Nov 19, 2009, 11:15 AM

Post #2 of 4
RE: Keep URLs intact and not tokenized by the StandardTokenizer

Hi Sudha,

In the past, I've built regexes to recognize URLs using the information here:

http://www.foad.org/~abigail/Perl/url2.html

The above, however, is currently a dead link.

Here's the Internet Archive Wayback Machine's copy of this page from August 2007:

<http://web.archive.org/web/20070807114147/http://www.foad.org/~abigail/Perl/url2.html>

Here's the same content, of unknown vintage, as a text file (even though it has a .html extension):

http://nerxs.com/mirrorpages/urlregex.html

Also, Jeffrey Friedl's book "Mastering Regular Expressions", 2nd edition (but not the 1st edition), has a section on recognizing URLs in Chapter 5.
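To make the regex approach concrete, here is a minimal sketch using only `java.util.regex`. The pattern is a deliberately simplified assumption (scheme, `://`, then a run of characters that are not whitespace, angle brackets, or quotes); it is not the full grammar from the pages above, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UrlFinder {
    // Simplified, illustrative pattern (an assumption, not a complete URL grammar):
    // an http, https, or ftp scheme, "://", then anything up to whitespace, <, >, or ".
    private static final Pattern URL =
        Pattern.compile("\\b(?:https?|ftp)://[^\\s<>\"]+", Pattern.CASE_INSENSITIVE);

    /** Returns every URL-like substring of the text, in order of appearance. */
    static List<String> findUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            urls.add(stripTrailingPunctuation(m.group()));
        }
        return urls;
    }

    // Heuristic: punctuation directly after a URL usually belongs to the sentence.
    // This will wrongly trim URLs that really do end in one of these characters.
    private static String stripTrailingPunctuation(String url) {
        int end = url.length();
        while (end > 0 && ".,;:!?)".indexOf(url.charAt(end - 1)) >= 0) {
            end--;
        }
        return url.substring(0, end);
    }

    public static void main(String[] args) {
        System.out.println(findUrls("See http://nerxs.com/mirrorpages/urlregex.html."));
    }
}
```

A stricter pattern built from the grammar on the pages above would reject more false positives, at the cost of a much longer expression.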

Steve




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


verma.sudha at gmail

Nov 19, 2009, 1:35 PM

Post #3 of 4
Re: Keep URLs intact and not tokenized by the StandardTokenizer

Thanks.

I was hoping Lucene would already have a solution for this, since it
seems like it would be a common problem.

I am new to the Lucene API. If I were to implement something from
scratch, would my option be to extend the Tokenizer to support an HTTP
URL regex and then pass the text on to StandardTokenizer?
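One way to picture that two-stage idea, independent of the Lucene API, is a pre-pass that cuts the input into URL segments and plain-text segments. The sketch below uses only `java.util.regex`; the class names, the `Segment` type, and the simplified pattern are all hypothetical. Wiring the result into a Lucene Tokenizer (emitting each URL segment as one token and handing the plain segments to StandardTokenizer) is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UrlAwareSplitter {
    /** A run of input text: either one whole URL or the plain text between URLs. */
    static final class Segment {
        final String text;
        final int startOffset; // position in the original input, kept for token offsets
        final boolean isUrl;

        Segment(String text, int startOffset, boolean isUrl) {
            this.text = text;
            this.startOffset = startOffset;
            this.isUrl = isUrl;
        }
    }

    // Simplified, illustrative pattern (an assumption, not a complete URL grammar).
    private static final Pattern URL =
        Pattern.compile("\\bhttps?://[^\\s<>\"]+", Pattern.CASE_INSENSITIVE);

    /** Splits the input so that each URL is kept whole and in order with the text around it. */
    static List<Segment> split(String input) {
        List<Segment> segments = new ArrayList<>();
        Matcher m = URL.matcher(input);
        int last = 0;
        while (m.find()) {
            if (m.start() > last) {
                // Plain text before this URL: a candidate for ordinary tokenization.
                segments.add(new Segment(input.substring(last, m.start()), last, false));
            }
            segments.add(new Segment(m.group(), m.start(), true));
            last = m.end();
        }
        if (last < input.length()) {
            segments.add(new Segment(input.substring(last), last, false));
        }
        return segments;
    }

    public static void main(String[] args) {
        for (Segment s : split("Visit http://lucene.apache.org/java/ today")) {
            System.out.println((s.isUrl ? "URL   " : "TEXT  ") + s.startOffset + " [" + s.text + "]");
        }
    }
}
```

Keeping the start offset on each segment matters because tokens produced from the plain segments would otherwise report positions relative to the segment rather than to the original document.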

-sudha



renaud.delbru at deri

Nov 19, 2009, 5:29 PM

Post #4 of 4
RE: Keep URLs intact and not tokenized by the StandardTokenizer

Hi,

Some time ago, I had to modify and extend the Lucene StandardTokenizer grammar (the JFlex file) so that it preserves URIs (based on RFC 3986). I have extracted the files from my project and published the source code on GitHub [1] under the Apache License 2.0, in case it helps.

[1] http://github.com/rdelbru/lucene-uri-preserving-standard-tokenizer
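For readers unfamiliar with RFC 3986, the sketch below shows the component structure it defines, using the regular expression given in Appendix B of the RFC. Note the limits: that expression only splits an already-delimited URI reference into scheme, authority, path, query, and fragment. It does not find URIs inside free text, and it is not taken from the grammar in the repository above. The class name `UriParts` is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UriParts {
    // Regular expression from RFC 3986, Appendix B, for splitting a URI reference.
    private static final Pattern URI_REFERENCE = Pattern.compile(
        "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?");

    // A component is null when it is absent from the input; path may be empty.
    final String scheme;
    final String authority;
    final String path;
    final String query;
    final String fragment;

    private UriParts(String scheme, String authority, String path, String query, String fragment) {
        this.scheme = scheme;
        this.authority = authority;
        this.path = path;
        this.query = query;
        this.fragment = fragment;
    }

    /** Splits a URI reference into its five RFC 3986 components. */
    static UriParts parse(String uri) {
        Matcher m = URI_REFERENCE.matcher(uri);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a single-line URI reference: " + uri);
        }
        // Group numbers follow Appendix B: 2 = scheme, 4 = authority, 5 = path,
        // 7 = query, 9 = fragment.
        return new UriParts(m.group(2), m.group(4), m.group(5), m.group(7), m.group(9));
    }

    public static void main(String[] args) {
        UriParts p = parse("http://www.ics.uci.edu/pub/ietf/uri/#Related");
        System.out.println(p.scheme + " | " + p.authority + " | " + p.path
            + " | " + p.query + " | " + p.fragment);
    }
}
```

A tokenizer grammar has the harder job of deciding where a URI starts and ends in running text, which is why a grammar-level change is needed rather than this kind of post-hoc split.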

--
Renaud Delbru


