
Mailing List Archive: Lucene: Java-User

Using org.apache.lucene.analysis.compound

 

 



bbdouglas at basistech

Oct 20, 2009, 5:35 PM

Post #1 of 12 (674 views)
Using org.apache.lucene.analysis.compound

Hello,

I've found a number of posts in different places talking about how to perform decompounding, but I haven't found too many discussing how to use the results of decompounding. If anyone can answer this question or point me to an existing discussion it would be very helpful.

In the description of the org.apache.lucene.analysis.compound package, it gives the following example:

Rindfleischüberwachungsgesetz, 0, 29
Rind, 0, 4, posIncr=0
fleisch, 4, 11, posIncr=0
überwachung, 11, 22, posIncr=0
gesetz, 23, 29, posIncr=0

And I see how this allows me to find single components such as "gesetz" or "Rind". But what if I want to find combinations of components such as "Rindfleisch" or "überwachungsgesetz"? It seems that the pattern of using posIncr=0 for all components excludes the possibility of finding sub-strings that are made up of multiple components.

Any comments or thoughts would be appreciated.

Ben Douglas

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


rcmuir at gmail

Oct 20, 2009, 7:00 PM

Post #2 of 12 (641 views)
Re: Using org.apache.lucene.analysis.compound

Hi, it will work because the query "Rindfleisch" is also decompounded into Rind and
fleisch, with posIncr=0.

So if you index Rindfleischüberwachungsgesetz and then query with "Rindfleisch",
it matches, because Rindfleisch also gets decompounded into Rind and fleisch.
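
To make the matching argument concrete, here is a minimal plain-Java sketch of dictionary-based decompounding applied to both the indexed word and the query. It is not the Lucene filter itself: the class DecompoundSketch, its methods, and the brute-force substring scan are assumptions made for illustration, and offsets and position increments are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not the Lucene compound filter.
class DecompoundSketch {

    // Returns the lowercased original word, followed by every dictionary
    // entry that occurs as a substring of it (scanned left to right).
    static List<String> analyze(String word, Set<String> dictionary) {
        String lower = word.toLowerCase(Locale.ROOT);
        List<String> tokens = new ArrayList<>();
        tokens.add(lower); // the original token is kept
        for (int start = 0; start < lower.length(); start++) {
            for (int end = start + 1; end <= lower.length(); end++) {
                String candidate = lower.substring(start, end);
                if (dictionary.contains(candidate)) {
                    tokens.add(candidate);
                }
            }
        }
        return tokens;
    }

    // A query matches if at least one of its tokens was also produced
    // when the indexed word was analyzed.
    static boolean matches(String indexedWord, String queryWord, Set<String> dictionary) {
        Set<String> indexed = new HashSet<>(analyze(indexedWord, dictionary));
        for (String token : analyze(queryWord, dictionary)) {
            if (indexed.contains(token)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList(
                "rind", "fleisch", "überwachung", "gesetz"));
        System.out.println(analyze("Rindfleisch", dictionary));
        System.out.println(matches("Rindfleischüberwachungsgesetz", "Rindfleisch", dictionary));
    }
}
```

The query "Rindfleisch" never appears as a token of the indexed word, but its parts "rind" and "fleisch" do, which is why the sketch reports a match.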



--
Robert Muir
rcmuir [at] gmail


paul at activemath

Oct 21, 2009, 2:27 AM

Post #3 of 12 (628 views)
Re: Using org.apache.lucene.analysis.compound

I'm interested in this analyzer. It had escaped me, and it solves an old
problem!
Could you report on its usage:
- did you have to feed words into a dictionary?
- does anyone have user measures already?
... and a last question for the research fun: is there any approach
towards preferring Überwachungsgesetz as a concept over, say,
Fleischüberwachung? (Again, that could probably be based on a
dictionary.)

thanks in advance

paul




rcmuir at gmail

Oct 21, 2009, 5:12 AM

Post #4 of 12 (630 views)
Re: Using org.apache.lucene.analysis.compound

Paul, there are two implementations in the compound package: one is dictionary-based,
the other is hyphenation-grammar + dictionary (it restricts the
decompounding based on hyphenation rules). You could also subclass the
compound base class and implement your own.

I haven't seen any user measures (relevance, etc.); that would be a cool thing to
see, though.

I'm not sure I understand your last question; can you elaborate?
To improve some cases, you might want to use the onlyLongestMatch
parameter:
@param onlyLongestMatch Add only the longest matching subword to the stream

For scoring, I think Lucene's scoring might help too: the original
word, without decompounding, is left as a token, so if you search on an exact
match it should be ranked higher. (Not sure if this answers your
question.)



--
Robert Muir
rcmuir [at] gmail


bbdouglas at basistech

Oct 21, 2009, 10:40 AM

Post #5 of 12 (622 views)
RE: Using org.apache.lucene.analysis.compound

Thanks for all of the answers so far!

Paul's question is similar to another aspect I am curious about:

Given the way the sample word is analyzed, is there anything in the scoring mechanism that would rank "überwachungsgesetz" higher than "gesetzüberwachung" or "fleischgesetz"?



rcmuir at gmail

Oct 21, 2009, 11:48 AM

Post #6 of 12 (619 views)
Re: Using org.apache.lucene.analysis.compound

Yes, your dictionary :)

If überwachungsgesetz is a real word, add it to your dictionary.

For example, if your dictionary is { "Rind", "Fleisch", "Draht", "Schere",
"Gesetz", "Aufgabe", "Überwachung" }, and you index
Rindfleischüberwachungsgesetz, then all 3 queries will have the same score.
But if you expand the dictionary to { "Rind", "Fleisch", "Draht", "Schere",
"Gesetz", "Aufgabe", "Überwachung", "Überwachungsgesetz" }, then this makes
a big difference.

All 3 queries will still match, but überwachungsgesetz will have a higher
score. This is because things are now analyzed differently:
Rindfleischüberwachungsgesetz will be decompounded as before, but with an
additional token: Überwachungsgesetz.
So, back to your original question about these 'concatenations' of multiple
components: yes, the compound filter will produce them, if they are real
words in your dictionary. But it won't just make them up.

The explain output for the three queries, with the expanded dictionary:
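
As a rough illustration of why the extra dictionary entry matters, the sketch below counts how many query tokens also occur among the tokens of the indexed word, once with the small dictionary and once with the expanded one. It is plain Java and not Lucene code: the class DictionaryEffectSketch, the substring scan, and the idea of using the hit count as a stand-in for score are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; real scoring is done by Lucene, not by counting.
class DictionaryEffectSketch {

    // Lowercased original word first, then every dictionary entry found as a substring.
    static List<String> analyze(String word, Set<String> dictionary) {
        String lower = word.toLowerCase(Locale.ROOT);
        List<String> tokens = new ArrayList<>();
        tokens.add(lower);
        for (int start = 0; start < lower.length(); start++) {
            for (int end = start + 1; end <= lower.length(); end++) {
                String candidate = lower.substring(start, end);
                if (dictionary.contains(candidate)) {
                    tokens.add(candidate);
                }
            }
        }
        return tokens;
    }

    // Number of query tokens (duplicates included) that also occur
    // among the tokens of the indexed word.
    static int matchingClauses(String indexedWord, String queryWord, Set<String> dictionary) {
        Set<String> indexed = new HashSet<>(analyze(indexedWord, dictionary));
        int hits = 0;
        for (String token : analyze(queryWord, dictionary)) {
            if (indexed.contains(token)) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Set<String> small = new HashSet<>(Arrays.asList(
                "rind", "fleisch", "draht", "schere", "gesetz", "aufgabe", "überwachung"));
        Set<String> expanded = new HashSet<>(small);
        expanded.add("überwachungsgesetz");

        String indexed = "Rindfleischüberwachungsgesetz";
        String[] queries = {"überwachungsgesetz", "gesetzüberwachung", "fleischgesetz"};
        for (String query : queries) {
            System.out.println(query + ": "
                    + matchingClauses(indexed, query, small) + " hits with the small dictionary, "
                    + matchingClauses(indexed, query, expanded) + " with the expanded one");
        }
    }
}
```

With the small dictionary each of the three queries hits two tokens. With the expanded dictionary, überwachungsgesetz hits four (the original query token, überwachung, the dictionary match überwachungsgesetz, and gesetz), while the other two queries stay at two.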

"überwachungsgesetz"
0.23013961 = (MATCH) sum of:
  0.057534903 = (MATCH) weight(field:überwachungsgesetz in 0), product of:
    0.5 = queryWeight(field:überwachungsgesetz), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      1.6294457 = queryNorm
    0.11506981 = (MATCH) fieldWeight(field:überwachungsgesetz in 0), product of:
      1.0 = tf(termFreq(field:überwachungsgesetz)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.375 = fieldNorm(field=field, doc=0)
  0.057534903 = (MATCH) weight(field:überwachung in 0), product of:
    0.5 = queryWeight(field:überwachung), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      1.6294457 = queryNorm
    0.11506981 = (MATCH) fieldWeight(field:überwachung in 0), product of:
      1.0 = tf(termFreq(field:überwachung)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.375 = fieldNorm(field=field, doc=0)
  0.057534903 = (MATCH) weight(field:überwachungsgesetz in 0), product of:
    0.5 = queryWeight(field:überwachungsgesetz), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      1.6294457 = queryNorm
    0.11506981 = (MATCH) fieldWeight(field:überwachungsgesetz in 0), product of:
      1.0 = tf(termFreq(field:überwachungsgesetz)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.375 = fieldNorm(field=field, doc=0)
  0.057534903 = (MATCH) weight(field:gesetz in 0), product of:
    0.5 = queryWeight(field:gesetz), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      1.6294457 = queryNorm
    0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
      1.0 = tf(termFreq(field:gesetz)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.375 = fieldNorm(field=field, doc=0)

"gesetzüberwachung"
0.064782135 = (MATCH) sum of:
  0.032391068 = (MATCH) weight(field:gesetz in 0), product of:
    0.2814906 = queryWeight(field:gesetz), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.9173473 = queryNorm
    0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
      1.0 = tf(termFreq(field:gesetz)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.375 = fieldNorm(field=field, doc=0)
  0.032391068 = (MATCH) weight(field:überwachung in 0), product of:
    0.2814906 = queryWeight(field:überwachung), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.9173473 = queryNorm
    0.11506981 = (MATCH) fieldWeight(field:überwachung in 0), product of:
      1.0 = tf(termFreq(field:überwachung)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.375 = fieldNorm(field=field, doc=0)

"fleischgesetz"
0.064782135 = (MATCH) sum of:
  0.032391068 = (MATCH) weight(field:fleisch in 0), product of:
    0.2814906 = queryWeight(field:fleisch), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.9173473 = queryNorm
    0.11506981 = (MATCH) fieldWeight(field:fleisch in 0), product of:
      1.0 = tf(termFreq(field:fleisch)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.375 = fieldNorm(field=field, doc=0)
  0.032391068 = (MATCH) weight(field:gesetz in 0), product of:
    0.2814906 = queryWeight(field:gesetz), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.9173473 = queryNorm
    0.11506981 = (MATCH) fieldWeight(field:gesetz in 0), product of:
      1.0 = tf(termFreq(field:gesetz)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.375 = fieldNorm(field=field, doc=0)
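
The numbers in the explain output can be reproduced by hand. The sketch below does the arithmetic in plain Java, assuming the classic Lucene TF-IDF formulas: idf = 1 + ln(maxDocs / (docFreq + 1)), tf = sqrt(termFreq), and queryNorm = 1 / sqrt(sum of squared idf over all query clauses). The class ExplainArithmeticSketch is hypothetical, and fieldNorm is simply copied from the output (0.375) rather than derived. The lower queryNorm of the second and third queries is consistent with a third clause that matches nothing (the undecompounded query word, docFreq=0, idf=1); that is an inference from the numbers, not something stated in the thread.

```java
// Hypothetical illustration only; it recomputes the printed values, it does not call Lucene.
class ExplainArithmeticSketch {

    // idf as it appears in the explain output: 1 + ln(maxDocs / (docFreq + 1)).
    static double idf(int docFreq, int maxDocs) {
        return 1.0 + Math.log((double) maxDocs / (docFreq + 1));
    }

    // queryNorm = 1 / sqrt(sum of squared idf), over every clause of the query,
    // including clauses that match no document.
    static double queryNorm(int[] clauseDocFreqs, int maxDocs) {
        double sumOfSquares = 0.0;
        for (int docFreq : clauseDocFreqs) {
            double weight = idf(docFreq, maxDocs);
            sumOfSquares += weight * weight;
        }
        return 1.0 / Math.sqrt(sumOfSquares);
    }

    // Score of the single indexed document: sum over matching clauses of
    // queryWeight (idf * queryNorm) times fieldWeight (tf * idf * fieldNorm).
    // Every matching term is assumed to occur once, so tf = sqrt(1) = 1.
    static double score(int[] clauseDocFreqs, int maxDocs, double fieldNorm) {
        double norm = queryNorm(clauseDocFreqs, maxDocs);
        double total = 0.0;
        for (int docFreq : clauseDocFreqs) {
            if (docFreq == 0) {
                continue; // this clause matches no document
            }
            double termIdf = idf(docFreq, maxDocs);
            double queryWeight = termIdf * norm;
            double fieldWeight = Math.sqrt(1.0) * termIdf * fieldNorm;
            total += queryWeight * fieldWeight;
        }
        return total;
    }

    public static void main(String[] args) {
        // "überwachungsgesetz": four clauses, all matching (roughly 0.2301).
        System.out.println(score(new int[] {1, 1, 1, 1}, 1, 0.375));
        // "gesetzüberwachung" / "fleischgesetz": two matching clauses plus one
        // assumed non-matching clause (roughly 0.0648).
        System.out.println(score(new int[] {0, 1, 1}, 1, 0.375));
    }
}
```

Because Lucene computes in single precision, the last digits may differ slightly from the printed explain values.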





--
Robert Muir
rcmuir [at] gmail


bbdouglas at basistech

Oct 21, 2009, 12:09 PM

Post #7 of 12 (620 views)
RE: Using org.apache.lucene.analysis.compound

OK, that makes sense. So I just need to add all of the sub-compounds that are real words at posIncr=0, even if they are combinations of other sub-compounds.

Thanks!



paul at activemath

Oct 21, 2009, 12:16 PM

Post #8 of 12 (623 views)
Re: Using org.apache.lucene.analysis.compound

Can the dictionary have weights?

überwachungsgesetz alone probably needs a higher rank than überwachung
and gesetz, doesn't it?

paul




rcmuir at gmail

Oct 21, 2009, 12:17 PM

Post #9 of 12 (620 views)
Re: Using org.apache.lucene.analysis.compound

Just add them to the dictionary; the compound filter will do this
automatically.

If you want to tweak it even further, you can also tell the compound filter NOT to
emit the subwords if they form a bigger compound, using the onlyLongestMatch
parameter I mentioned earlier.
I haven't played with this option much, but I think this is what it's supposed
to do:

If the dictionary is
soft
ball
softball

then "softball" (or compounds containing it) won't emit "soft" and "ball",
because "softball" is in the dictionary and it's the longest match.
With the option off, you'd get softball, ball, soft.
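
The behaviour described above can be sketched in plain Java as follows. This follows the description in this post, which is itself flagged as unverified; it is not Lucene's implementation. The class LongestMatchSketch and its subwords method are hypothetical, and Lucene's own filter may resolve overlapping matches differently, so check the behaviour of the version you use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of the described onlyLongestMatch semantics.
class LongestMatchSketch {

    // Returns the dictionary entries found inside the word. When
    // onlyLongestMatch is true, an entry is dropped if a longer
    // dictionary match fully covers it.
    static List<String> subwords(String word, Set<String> dictionary, boolean onlyLongestMatch) {
        // Collect every dictionary match as a [start, end) range.
        List<int[]> matches = new ArrayList<>();
        for (int start = 0; start < word.length(); start++) {
            for (int end = start + 1; end <= word.length(); end++) {
                if (dictionary.contains(word.substring(start, end))) {
                    matches.add(new int[] {start, end});
                }
            }
        }

        List<String> result = new ArrayList<>();
        for (int[] match : matches) {
            boolean covered = false;
            if (onlyLongestMatch) {
                for (int[] other : matches) {
                    boolean longer = (other[1] - other[0]) > (match[1] - match[0]);
                    if (longer && other[0] <= match[0] && match[1] <= other[1]) {
                        covered = true;
                        break;
                    }
                }
            }
            if (!covered) {
                result.add(word.substring(match[0], match[1]));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("soft", "ball", "softball"));
        System.out.println(subwords("softball", dictionary, true));  // prints [softball]
        System.out.println(subwords("softball", dictionary, false)); // prints [soft, softball, ball]
    }
}
```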




--
Robert Muir
rcmuir [at] gmail


rcmuir at gmail

Oct 21, 2009, 12:23 PM

Post #10 of 12 (618 views)
Permalink
Re: Using org.apache.lucene.analysis.compound [In reply to]

Paul, I think scoring should generally take care of this too. It's all about
your dictionary, same as in the previous example.
With Überwachungsgesetz in the dictionary, the query überwachungsgesetz scores
higher because it matches 3 tokens (überwachungsgesetz, überwachung, gesetz),
while the query überwachung gesetz matches only 2.

überwachungsgesetz
0.37040412 = (MATCH) sum of:
  0.10848885 = (MATCH) weight(field:überwachungsgesetz in 0), product of:
    0.5 = queryWeight(field:überwachungsgesetz), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      1.6294457 = queryNorm
    0.2169777 = (MATCH) fieldWeight(field:überwachungsgesetz in 0), product of:
      1.4142135 = tf(termFreq(field:überwachungsgesetz)=2)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.5 = fieldNorm(field=field, doc=0)
  0.076713204 = (MATCH) weight(field:überwachung in 0), product of:
    0.5 = queryWeight(field:überwachung), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      1.6294457 = queryNorm
    0.15342641 = (MATCH) fieldWeight(field:überwachung in 0), product of:
      1.0 = tf(termFreq(field:überwachung)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.5 = fieldNorm(field=field, doc=0)
  0.10848885 = (MATCH) weight(field:überwachungsgesetz in 0), product of:
    0.5 = queryWeight(field:überwachungsgesetz), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      1.6294457 = queryNorm
    0.2169777 = (MATCH) fieldWeight(field:überwachungsgesetz in 0), product of:
      1.4142135 = tf(termFreq(field:überwachungsgesetz)=2)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.5 = fieldNorm(field=field, doc=0)
  0.076713204 = (MATCH) weight(field:gesetz in 0), product of:
    0.5 = queryWeight(field:gesetz), product of:
      0.30685282 = idf(docFreq=1, maxDocs=1)
      1.6294457 = queryNorm
    0.15342641 = (MATCH) fieldWeight(field:gesetz in 0), product of:
      1.0 = tf(termFreq(field:gesetz)=1)
      0.30685282 = idf(docFreq=1, maxDocs=1)
      0.5 = fieldNorm(field=field, doc=0)

überwachung gesetz
0.30685282 = (MATCH) sum of:
  0.15342641 = (MATCH) sum of:
    0.076713204 = (MATCH) weight(field:überwachung in 0), product of:
      0.5 = queryWeight(field:überwachung), product of:
        0.30685282 = idf(docFreq=1, maxDocs=1)
        1.6294457 = queryNorm
      0.15342641 = (MATCH) fieldWeight(field:überwachung in 0), product of:
        1.0 = tf(termFreq(field:überwachung)=1)
        0.30685282 = idf(docFreq=1, maxDocs=1)
        0.5 = fieldNorm(field=field, doc=0)
    0.076713204 = (MATCH) weight(field:überwachung in 0), product of:
      0.5 = queryWeight(field:überwachung), product of:
        0.30685282 = idf(docFreq=1, maxDocs=1)
        1.6294457 = queryNorm
      0.15342641 = (MATCH) fieldWeight(field:überwachung in 0), product of:
        1.0 = tf(termFreq(field:überwachung)=1)
        0.30685282 = idf(docFreq=1, maxDocs=1)
        0.5 = fieldNorm(field=field, doc=0)
  0.15342641 = (MATCH) sum of:
    0.076713204 = (MATCH) weight(field:gesetz in 0), product of:
      0.5 = queryWeight(field:gesetz), product of:
        0.30685282 = idf(docFreq=1, maxDocs=1)
        1.6294457 = queryNorm
      0.15342641 = (MATCH) fieldWeight(field:gesetz in 0), product of:
        1.0 = tf(termFreq(field:gesetz)=1)
        0.30685282 = idf(docFreq=1, maxDocs=1)
        0.5 = fieldNorm(field=field, doc=0)
    0.076713204 = (MATCH) weight(field:gesetz in 0), product of:
      0.5 = queryWeight(field:gesetz), product of:
        0.30685282 = idf(docFreq=1, maxDocs=1)
        1.6294457 = queryNorm
      0.15342641 = (MATCH) fieldWeight(field:gesetz in 0), product of:
        1.0 = tf(termFreq(field:gesetz)=1)
        0.30685282 = idf(docFreq=1, maxDocs=1)
        0.5 = fieldNorm(field=field, doc=0)
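
The numbers in the two explanations above can be reproduced by hand. The following is a minimal sketch, assuming the index used Lucene's DefaultSimilarity (the thread does not say so explicitly, but its formulas tf = sqrt(termFreq), idf = 1 + ln(maxDocs / (docFreq + 1)) and queryNorm = 1 / sqrt(sum of squared weights) match the printed values). The class name ExplainMath and its helper methods are made up for illustration and are not part of Lucene; fieldNorm is simply copied from the explain output rather than derived.

```java
// Hypothetical helper, not part of Lucene: re-derives the explain() numbers
// under the assumption that DefaultSimilarity was in use.
final class ExplainMath {

    // tf: square root of how often the term occurs in the field.
    static double tf(int termFreq) {
        return Math.sqrt(termFreq);
    }

    // idf: 1 + ln(maxDocs / (docFreq + 1)).
    static double idf(int docFreq, int maxDocs) {
        return Math.log((double) maxDocs / (docFreq + 1)) + 1.0;
    }

    // queryNorm: 1 / sqrt(sum of the squared query-term weights).
    static double queryNorm(double sumOfSquaredWeights) {
        return 1.0 / Math.sqrt(sumOfSquaredWeights);
    }

    // One term clause: queryWeight * fieldWeight, as in each "weight(...)" node.
    static double termScore(int termFreq, double idfValue, double norm, double fieldNorm) {
        double queryWeight = idfValue * norm;
        double fieldWeight = tf(termFreq) * idfValue * fieldNorm;
        return queryWeight * fieldWeight;
    }

    // Sum over all term clauses of a query against the single indexed document
    // (docFreq=1, maxDocs=1 for every term, as in the explain output).
    static double score(int[] termFreqs, double fieldNorm) {
        double idfValue = idf(1, 1);
        double norm = queryNorm(termFreqs.length * idfValue * idfValue);
        double sum = 0.0;
        for (int termFreq : termFreqs) {
            sum += termScore(termFreq, idfValue, norm, fieldNorm);
        }
        return sum;
    }

    public static void main(String[] args) {
        // Compound query: two clauses with termFreq=2 and two with termFreq=1.
        System.out.println(score(new int[] {2, 1, 2, 1}, 0.5));
        // Split query: four clauses, all with termFreq=1.
        System.out.println(score(new int[] {1, 1, 1, 1}, 0.5));
    }
}
```

Under these assumptions the whole difference between the two totals comes from tf: the compound term occurs twice in the field, so its clauses are multiplied by sqrt(2) instead of 1.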

On Wed, Oct 21, 2009 at 3:16 PM, Paul Libbrecht <paul [at] activemath> wrote:

> Can the dictionary have weights?
>
> überwachungsgesetz alone probably needs a higher rank than überwachung and
> gesetz, right?
>
> paul


--
Robert Muir
rcmuir [at] gmail


paul at activemath

Oct 21, 2009, 1:19 PM

Post #11 of 12 (620 views)
Permalink
Re: Using org.apache.lucene.analysis.compound [In reply to]

Great,

Now the next question: which dictionary do you guys use? How big
can it be?
Is 50000 words acceptable?

paul


Attachments: smime.p7s (1.59 KB)


rcmuir at gmail

Oct 21, 2009, 1:32 PM

Post #12 of 12 (620 views)
Permalink
Re: Using org.apache.lucene.analysis.compound [In reply to]

There is some information on this topic in the package summary:

http://lucene.apache.org/java/2_9_0/api/contrib-analyzers/org/apache/lucene/analysis/compound/package-summary.html

In short, the code places no limit on dictionary size, but for a large word
list you will want to use a hyphenation grammar as well:
HyphenationCompoundWordTokenFilter rather than the brute-force dictionary
approach (DictionaryCompoundWordTokenFilter), for better speed.

The package summary also points to some dictionaries at OpenOffice. If none of
them fits your needs, I'd also look at the word lists that ship with spell
checkers and similar tools.
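
To make the speed concern concrete, here is a minimal sketch of what a brute-force dictionary lookup amounts to: every substring of the token within a size window is tested against the dictionary. This is a simplified illustration, not Lucene's DictionaryCompoundWordTokenFilter code; the class BruteForceDecompounder, its method, and the size limits passed in main are assumptions chosen for the example. The umlauts are written as \u escapes only so the source stays ASCII.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of brute-force decompounding, not Lucene code.
final class BruteForceDecompounder {

    // Returns every dictionary word found inside `word`, ordered by start
    // offset and then by length. Matching is case-insensitive.
    static List<String> decompound(String word, Set<String> dictionary,
                                   int minSubwordSize, int maxSubwordSize) {
        Set<String> lowerDictionary = new HashSet<>();
        for (String entry : dictionary) {
            lowerDictionary.add(entry.toLowerCase(Locale.ROOT));
        }
        String lower = word.toLowerCase(Locale.ROOT);
        List<String> subwords = new ArrayList<>();
        // Try every start offset and every allowed length: this nested loop
        // is the work a brute-force approach does for each indexed token.
        for (int start = 0; start + minSubwordSize <= lower.length(); start++) {
            for (int len = minSubwordSize;
                 len <= maxSubwordSize && start + len <= lower.length();
                 len++) {
                String candidate = lower.substring(start, start + len);
                if (lowerDictionary.contains(candidate)) {
                    subwords.add(candidate);
                }
            }
        }
        return subwords;
    }

    public static void main(String[] args) {
        // \u00fc is u-umlaut, \u00dc is U-umlaut.
        String word = "Rindfleisch\u00fcberwachungsgesetz";
        Set<String> small = new HashSet<>(Arrays.asList(
                "Rind", "Fleisch", "Draht", "Schere", "Gesetz", "Aufgabe",
                "\u00dcberwachung"));
        Set<String> expanded = new HashSet<>(small);
        expanded.add("\u00dcberwachungsgesetz");

        // The size window 4..20 is an arbitrary choice for this example.
        System.out.println(decompound(word, small, 4, 20));
        System.out.println(decompound(word, expanded, 4, 20));
    }
}
```

With the smaller dictionary the sketch finds rind, fleisch, überwachung and gesetz; adding Überwachungsgesetz to the dictionary yields that compound as one extra subword, which is the additional token discussed earlier in the thread. Note that the upper size limit has to be at least 18 for the compound to be found at all.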

On Wed, Oct 21, 2009 at 4:19 PM, Paul Libbrecht <paul [at] activemath> wrote:

> Great,
>
> Now the next question: which dictionary do you guys use? How big can it
> be?
> Is 50000 words acceptable?
>
> paul


--
Robert Muir
rcmuir [at] gmail
