
Mailing List Archive: Lucene: Java-User

get original term for synonym

 

 



matthijs at impressie

Nov 12, 2007, 6:54 AM

Post #1 of 5 (1024 views)
get original term for synonym

Hi there,

Currently I am trying to get synonyms to work. I have gotten as far as
injecting them into the index as Token.type SYNONYM. Lucene then finds
the original word and synonym and points to the same document. So far so
good.

However, I am stuck at highlighting the result. I have highlighters of
my own that currently need an array of words to highlight. Extraction is
done from the query, but the query gives the synonym, instead of the
word it points to. Therefore highlighting is incorrect (or rather
non-existent). Is there a simple way to make the synonym return the word
it points to?

Thank you in advance,
Matthijs Bierman
(Netherlands)
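
The index-time injection described in this post can be pictured with a minimal sketch in plain Java. It does not use the Lucene API: the `Tok` class, the `analyze` method and the synonym map are hypothetical stand-ins, and the assumption is that each synonym is stacked on the original word with a position increment of 0, so both terms point to the same place in the document.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class SynonymInjection {
    /** Hypothetical stand-in for an index token; not the Lucene Token class. */
    static final class Tok {
        final String term;
        final String type;           // "WORD" or "SYNONYM"
        final int positionIncrement; // 0 = same position as the previous token

        Tok(String term, String type, int positionIncrement) {
            this.term = term;
            this.type = type;
            this.positionIncrement = positionIncrement;
        }
    }

    /** Emits every word, followed by its synonyms stacked at the same position. */
    static List<Tok> analyze(String text, Map<String, List<String>> synonyms) {
        List<Tok> out = new ArrayList<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // leading whitespace produces an empty first element
            }
            out.add(new Tok(word, "WORD", 1));
            for (String syn : synonyms.getOrDefault(word, Collections.<String>emptyList())) {
                out.add(new Tok(syn, "SYNONYM", 0));
            }
        }
        return out;
    }
}
```

Because the synonym shares its position with the original word, a search for either term finds the same document, which matches the behaviour reported above.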

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe[at]lucene.apache.org
For additional commands, e-mail: java-user-help[at]lucene.apache.org


markharw00d at yahoo

Nov 12, 2007, 7:49 AM

Post #2 of 5 (1002 views)
Re: get original term for synonym [In reply to]

>> Is there a simple way to make the synonym return the word it points to?

Give the highlighter the same analyzer you used to create the index, not the one you use to parse the query. This should ensure the set of words to be highlighted includes all synonyms.

Cheers
Mark
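
For a custom highlighter that takes an array of words, one reading of this advice is to run the query text through the same expansion that was used at index time, so that the word list contains the synonyms as well. The sketch below is plain Java, not the Lucene Highlighter API; `highlightTerms` and the synonym map are hypothetical names used only for illustration.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class HighlightTerms {
    /**
     * Builds the set of words to highlight by applying the index-time
     * synonym expansion (assumed here to be a simple map) to the query text.
     */
    static Set<String> highlightTerms(String queryText, Map<String, List<String>> synonyms) {
        Set<String> terms = new LinkedHashSet<>();
        for (String word : queryText.toLowerCase().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            terms.add(word);
            terms.addAll(synonyms.getOrDefault(word, Collections.<String>emptyList()));
        }
        return terms;
    }
}
```

This only helps when the expansion is symmetric, that is, when analyzing the query term produces the word that actually appears in the document.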



matthijs at impressie

Nov 14, 2007, 3:51 AM

Post #3 of 5 (989 views)
Re: get original term for synonym [In reply to]

Hi Mark,

Your solution would be correct if the synonym were a true two-way
synonym. Unfortunately this is not the case. My analyzer decomposes
specific Dutch compound words (those joined with a "-"). For example,
'zone-indeling' creates the synonyms 'zone' -> 'zone-indeling' and
'indeling' -> 'zone-indeling'. Analyzing 'zone' on its own therefore
does not point back to 'zone-indeling' (that information is simply not
available). Putting all the results from the indexing process into a
file or Lucene document (thus creating a 'lookup' index) would probably
make the lookup rather slow, or make application startup take too long
(due to HashMap generation).

Maybe you can do something with offsets?

Thanks,
Matthijs
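
The offsets idea can be sketched as follows, again in plain Java rather than the Lucene API. The assumption, which the thread does not confirm, is that each injected part ('zone', 'indeling') carries the start and end offsets of the whole compound. Re-analyzing the document text then recovers the original word for a matching part without a separate lookup index. `Tok`, `analyze` and `originalsFor` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CompoundOffsets {
    /** Hypothetical token with character offsets into the source text. */
    static final class Tok {
        final String term;
        final int start;
        final int end;

        Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    // A word is a run of letters/digits, optionally joined by single hyphens.
    private static final Pattern WORD =
        Pattern.compile("[\\p{L}\\p{N}]+(?:-[\\p{L}\\p{N}]+)*");

    /** Emits each word; for a compound, also emits its parts with the compound's offsets. */
    static List<Tok> analyze(String text) {
        List<Tok> out = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            String word = m.group().toLowerCase();
            out.add(new Tok(word, m.start(), m.end()));
            if (word.contains("-")) {
                for (String part : word.split("-")) {
                    out.add(new Tok(part, m.start(), m.end()));
                }
            }
        }
        return out;
    }

    /** Returns the original text spans in the document that a query term points to. */
    static List<String> originalsFor(String queryTerm, String text) {
        String wanted = queryTerm.toLowerCase();
        List<String> hits = new ArrayList<>();
        for (Tok t : analyze(text)) {
            if (t.term.equals(wanted)) {
                hits.add(text.substring(t.start, t.end));
            }
        }
        return hits;
    }
}
```

Here analyzing the query 'zone' alone still yields only 'zone', but `originalsFor("zone", documentText)` returns the span 'zone-indeling' because the part token inherited the compound's offsets.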




markharw00d at yahoo

Nov 14, 2007, 5:10 AM

Post #4 of 5 (986 views)
Re: get original term for synonym [In reply to]

It would be useful to have more details about the query input and the expected highlights you want.

So given your 'zone-indeling' example document and the index-time tokenisation you described, which of the following queries would you expect to match and what would you want highlighted in each case?
1) zone
2) zone-indeling
3) "zone indeling"
4) zone-somethingElse


My assumption here is that you are using the standard Lucene Query parser and that query 3 will therefore be a phrase query.

Cheers
Mark




matthijs at impressie

Nov 15, 2007, 6:13 AM

Post #5 of 5 (977 views)
Re: get original term for synonym [In reply to]

Hi Mark,

I have solved it in another way now. I've created my own implementation
of StandardAnalyzer (which I've called AdvancedAnalyzer). This analyzer
keeps the word "zone-indeling" together, so users can simply search for
this term and it will be highlighted exactly as is. I presume these
compound words occur only in Dutch.

The problem was that my highlighter works on a PDF document (through
PDFBox). It would highlight 1,2 and "indeling", resulting in unwanted
behaviour when the word was split into zone and indeling.

Thanks for your help though.

Cheers,
Matthijs
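
The source of AdvancedAnalyzer is not shown in this thread. As a rough illustration of the idea only, the plain-Java sketch below tokenizes text so that a hyphenated compound stays a single token; the class name, method name and regular expression are assumptions and are not taken from Lucene or from the poster's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeepCompoundTokenizer {
    // Letters/digits, optionally joined by single hyphens, form one token.
    private static final Pattern TOKEN =
        Pattern.compile("[\\p{L}\\p{N}]+(?:-[\\p{L}\\p{N}]+)*");

    /** Lower-cases the text and returns its tokens, leaving compounds unsplit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text.toLowerCase());
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }
}
```

With this approach the index term and the highlighted word are the same string, so no synonym lookup is needed; the trade-off is that a search for 'zone' alone no longer matches 'zone-indeling'.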
