
Mailing List Archive: Lucene: Java-User

easy way to figure out most common tokens?

 

 



spotter at gmail

Aug 15, 2012, 10:46 AM

Post #1 of 11 (884 views)
easy way to figure out most common tokens?

Is there an easy way to figure out the most common tokens and then
remove those tokens from the documents?

use case: imagine one is indexing a mailing list (such as this
java-user) and is extracting all e-mail addresses in the messages and
adding them to a doc.

What that means is that there will be a lot of

java-user-unsubscribe [at] lucene
java-user-help [at] lucene

due to that being in the signature of each email.

While the best approach might be to not put them in the index in the
first place, I'm wondering if there's a good way to process the index
after the fact to remove these types of entries.

thanks.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


erickerickson at gmail

Aug 15, 2012, 11:29 AM

Post #2 of 11 (868 views)
Re: easy way to figure out most common tokens? [In reply to]

I don't see how you could without indexing everything first,
since you can't know what the most frequent terms are until
you've processed all your documents.

If you know these terms in advance, it seems like you could
just call them stopwords and use the common stopword
processing.

If you have to examine your corpus in the first place,
it seems like you could do something with term
frequencies to extract the most common terms from
your index, then re-index all your data with those terms
as stopwords.
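
The "extract the most common terms, then treat them as stopwords" idea
can be sketched in plain Java. This is only an illustration under stated
assumptions: FrequentTermStopwords, select, strip and the 0.9 threshold
are hypothetical names and values, not Lucene APIs, and each document is
assumed to be already tokenized into a list of strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not a Lucene class.
final class FrequentTermStopwords {

    // Returns the terms that occur in more than minFraction of the documents.
    static Set<String> select(List<List<String>> docs, double minFraction) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> doc : docs) {
            // Count each term at most once per document (document frequency).
            for (String term : new HashSet<>(doc)) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        Set<String> stopwords = new HashSet<>();
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            if (e.getValue() > minFraction * docs.size()) {
                stopwords.add(e.getKey());
            }
        }
        return stopwords;
    }

    // Drops the stopwords from one document's tokens, preserving order.
    static List<String> strip(List<String> doc, Set<String> stopwords) {
        List<String> kept = new ArrayList<>();
        for (String term : doc) {
            if (!stopwords.contains(term)) {
                kept.add(term);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up addresses standing in for extracted e-mail tokens.
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("alice@example.org", "list-help@example.org"),
                Arrays.asList("bob@example.org", "list-help@example.org"));
        Set<String> stopwords = select(docs, 0.9);
        System.out.println(stopwords);
        System.out.println(strip(docs.get(0), stopwords));
    }
}
```

In a real setup the selected set would presumably be handed to the
analyzer's stopword handling before re-indexing, as described above.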

Best
Erick




iorixxx at yahoo

Aug 15, 2012, 11:34 AM

Post #3 of 11 (870 views)
Re: easy way to figure out most common tokens? [In reply to]

> Is there an easy way to figure out
> the most common tokens and then remove those tokens from the
> documents.

Probably this : http://lucene.apache.org/core/3_6_1/api/all/org/apache/lucene/misc/HighFreqTerms.html



spotter at gmail

Aug 15, 2012, 11:42 AM

Post #4 of 11 (868 views)
Re: easy way to figure out most common tokens? [In reply to]

On 08/15/2012 02:29 PM, Erick Erickson wrote:
> I don't see how you could without indexing everything first
> since you can't know what the most frequent terms are until
> you've processed all your documents....

exactly

> If you know these terms in advance, it seems like you could
> just call them stopwords and use the common stopword
> processing.
>
> If you have to examine your corpus in the first place,
> it seems like you could do something with term
> frequencies to extract the most common terms from
> your index then re-index all your data with those terms
> as stopwords..

It's a possibility, but that would require reindexing, which would take a
long time, hence my desire to try to edit the individual documents.



spotter at gmail

Aug 15, 2012, 11:44 AM

Post #5 of 11 (868 views)
Re: easy way to figure out most common tokens? [In reply to]

On 08/15/2012 02:34 PM, Ahmet Arslan wrote:
>> Is there an easy way to figure out
>> the most common tokens and then remove those tokens from the
>> documents.
>
> Probably this : http://lucene.apache.org/core/3_6_1/api/all/org/apache/lucene/misc/HighFreqTerms.html

Ah, that's a good part 1. The question would then be how to modify the
index without reindexing all documents.

My gut feeling is that it should be possible (it seems Luke does it), but
I have never gone deep into the document object besides adding fields.



uwe at thetaphi

Aug 15, 2012, 11:47 AM

Post #6 of 11 (872 views)
RE: easy way to figure out most common tokens? [In reply to]

Once you have found the terms to remove, e.g. with HighFreqTerms, you can
use the abstract class FilterIndexReader (FilterAtomicReader in Lucene 4.0)
to code a filter for the term dictionary (just return a filtered TermEnum)
that is applied on merging. Wrap an IndexReader with this FilterIndexReader
so that it hides the terms, then call IndexWriter.addIndexes(filteredReader)
on a new, empty index. This still takes time, but may be better than
reindexing.
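
To make the "hide terms while copying into a new index" pattern concrete,
here is a toy sketch in plain Java. It deliberately does not use Lucene:
the inverted index is modelled as a sorted map from term to document ids,
and FilteredIndexCopy and copyWithout are hypothetical names. In Lucene
itself the corresponding pieces would be a FilterIndexReader subclass and
IndexWriter.addIndexes, as described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy model only: term -> list of document ids, not a real Lucene index.
final class FilteredIndexCopy {

    // Builds a new index holding every term of the source except the hidden ones.
    // The source index is left untouched, mirroring "copy to a new, empty index".
    static SortedMap<String, List<Integer>> copyWithout(
            SortedMap<String, List<Integer>> source, Set<String> hidden) {
        SortedMap<String, List<Integer>> target = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : source.entrySet()) {
            if (!hidden.contains(e.getKey())) {
                // Copy the postings so the new index shares no state with the old one.
                target.put(e.getKey(), new ArrayList<>(e.getValue()));
            }
        }
        return target;
    }

    public static void main(String[] args) {
        // Made-up addresses standing in for indexed terms.
        SortedMap<String, List<Integer>> index = new TreeMap<>();
        index.put("alice@example.org", Arrays.asList(0, 2));
        index.put("list-help@example.org", Arrays.asList(0, 1, 2));
        Set<String> hidden = Collections.singleton("list-help@example.org");
        System.out.println(copyWithout(index, hidden).keySet());
    }
}
```

The cost is still one full pass over the index, which matches the remark
above that this takes time but avoids re-analyzing the original documents.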

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe [at] thetaphi






uwe at thetaphi

Aug 15, 2012, 11:48 AM

Post #7 of 11 (871 views)
RE: easy way to figure out most common tokens? [In reply to]

You cannot modify the term dictionary of an index; see my other email. You
have to filter it by copying to a new index or by reindexing. Document
modifications are not supported in Lucene or in other inverted indexes.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe [at] thetaphi






spotter at gmail

Aug 15, 2012, 11:58 AM

Post #8 of 11 (871 views)
Re: easy way to figure out most common tokens? [In reply to]

OK, I have no problem with filtering/copying to a new index; that seems
like a good starting point. I would need to figure out how to extend that
class correctly, but at least it gives me somewhere to begin.




spotter at gmail

Aug 19, 2012, 5:07 PM

Post #9 of 11 (842 views)
Re: easy way to figure out most common tokens? [In reply to]

On 08/15/2012 02:34 PM, Ahmet Arslan wrote:
>> Is there an easy way to figure out
>> the most common tokens and then remove those tokens from the
>> documents.
>
> Probably this : http://lucene.apache.org/core/3_6_1/api/all/org/apache/lucene/misc/HighFreqTerms.html

I'm unsure how to use this.

As far as I can tell, org.apache.lucene.misc.TermStats doesn't exist in
Lucene 3.6.1 (there seems to be a class like that in 4.x, but that
doesn't help me).



spotter at gmail

Aug 19, 2012, 5:17 PM

Post #10 of 11 (839 views)
Re: easy way to figure out most common tokens? [In reply to]

On 08/19/2012 08:07 PM, Shaya Potter wrote:
> On 08/15/2012 02:34 PM, Ahmet Arslan wrote:
>>> Is there an easy way to figure out
>>> the most common tokens and then remove those tokens from the
>>> documents.
>>
>> Probably this :
>> http://lucene.apache.org/core/3_6_1/api/all/org/apache/lucene/misc/HighFreqTerms.html
>>
>
> unsure how to use this
>
> as far as I can tell org.apache.lucene.misc.TermStats doesn't exist in
> lucene 3.6.1 (there seems to be some class like that in 4.x, but that
> doesn't help me).

I'm wrong, it's there, but Eclipse isn't seeing it (I haven't tried javac
by itself), even though it sees HighFreqTerms just fine.



goksron at gmail

Aug 19, 2012, 7:04 PM

Post #11 of 11 (840 views)
Re: easy way to figure out most common tokens? [In reply to]

You don't need to index the data. Just run the analyzer and maintain
your own counters. This will be disk-bound and will run at your disk
reading speed.
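
A minimal sketch of "run the analyzer and maintain your own counters",
written in plain Java. As an assumption for illustration, a lowercase and
whitespace split stands in for a real Lucene Analyzer, and TokenCounter,
add, count and top are hypothetical names rather than Lucene APIs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Lucene: counts tokens across raw texts.
final class TokenCounter {
    private final Map<String, Integer> counts = new HashMap<>();

    // Stand-in for an Analyzer: lowercase the text and split on whitespace.
    void add(String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
    }

    // Number of times the (already lowercased) token has been seen so far.
    int count(String token) {
        return counts.getOrDefault(token, 0);
    }

    // The n most frequent tokens, most frequent first; ties are broken alphabetically.
    List<String> top(int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up addresses standing in for text read from the source files.
        TokenCounter counter = new TokenCounter();
        counter.add("alice@example.org list-help@example.org");
        counter.add("bob@example.org list-help@example.org");
        System.out.println(counter.top(1));
    }
}
```

Only the counters are held in memory here, so the source files can be
streamed through one at a time.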




--
Lance Norskog
goksron [at] gmail

