
Mailing List Archive: Lucene: Java-User

Problem with a "." for searching Lucene 2.4.0

 

 



khmarbaise at gmx

Nov 25, 2009, 9:57 AM

Post #1 of 5
Problem with a "." for searching Lucene 2.4.0

Hi,

I'm using Lucene 2.4 and have a problem with a "." within a field.
This field contains a filename, and obviously a filename can contain a
"." (or several of them)...

So if I search for "+filename:testExcel-xaz.xls", the file is not
found... If I replace the "." with "?", it works...

So my thought was to modify a CustomQueryParser (which I already use for
ranges)...

To index the information I'm using this:

Document doc....

doc.add(new Field(fieldName.getValue(), value, Field.Store.NO,
Field.Index.NOT_ANALYZED));


The question is: is this the best solution, or is there a better one?


Many thanks in advance.

Kind regards
Karl Heinz Marbaise
--
SoftwareEntwicklung Beratung Schulung Tel.: +49 (0) 2405 / 415 893
Dipl.Ing.(FH) Karl Heinz Marbaise ICQ#: 135949029
Hauptstrasse 177 USt.IdNr: DE191347579
52146 Würselen http://www.soebes.de

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


erickerickson at gmail

Nov 25, 2009, 10:11 AM

Post #2 of 5
Re: Problem with a "." for searching Lucene 2.4.0

The first question for this is always "what analyzers do you use at index
AND query time?".

I'd do two things immediately. First, what does query.toString() show you
the query parses to? StandardAnalyzer does some "interesting" things with
periods. Also, you have a hyphen (-) in your query which is an operator....
You may be getting a query very different than you expect.

Second, if you haven't already gotten a copy of Luke, please do so. It'll
both allow you to investigate your index to see what's *actually* in there
and see what various queries turn into when fed through various
analyzers.....
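
To see why the index-time/query-time analyzer question matters, here is a toy sketch in plain Java (no Lucene classes). crudeAnalyze is a hypothetical stand-in that lowercases and splits on anything that is not a letter or digit; it is NOT StandardAnalyzer's real rule set, which is why query.toString() and Luke remain the authoritative check. keywordAnalyze mimics a field added with Field.Index.NOT_ANALYZED, where the whole value becomes a single term.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class AnalyzerMismatchSketch {

    // Hypothetical query-time analysis: lowercase, then split on every
    // character that is not a letter or digit. Real analyzers differ.
    static List<String> crudeAnalyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Index-time behaviour of a NOT_ANALYZED field: one term, unchanged.
    static List<String> keywordAnalyze(String text) {
        return Collections.singletonList(text);
    }

    // True if at least one query term is present among the indexed terms.
    static boolean anyTermMatches(List<String> indexed, List<String> queried) {
        for (String term : queried) {
            if (indexed.contains(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String filename = "testExcel-xaz.xls";
        List<String> indexed = keywordAnalyze(filename);
        List<String> queried = crudeAnalyze(filename);
        System.out.println("indexed terms: " + indexed);  // [testExcel-xaz.xls]
        System.out.println("query terms:   " + queried);  // [testexcel, xaz, xls]
        System.out.println("match: " + anyTermMatches(indexed, queried)); // false
    }
}
```

The indexed side holds one exact term while the query side produces three different ones, so nothing matches even though both started from the same string.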

Best
Erick


ian.lea at gmail

Nov 25, 2009, 12:08 PM

Post #3 of 5
Re: Problem with a "." for searching Lucene 2.4.0

In addition to Erick's advice, since you are storing filename without
analysis you could use a TermQuery to find it. You can use
BooleanQuery to combine that with other queries, including those
generated by QueryParser.
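
The idea can be pictured with a toy inverted index in plain Java. This is only a sketch under stated assumptions: TermLookupSketch and its methods are hypothetical names, not Lucene classes. term() plays the role of a TermQuery (an exact lookup with no analysis applied to the search string), and must() plays the role of a BooleanQuery whose two clauses are both required.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class TermLookupSketch {

    // "field:term" -> ids of the documents that contain that exact term.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Record that docId contains the given term in the given field.
    void index(String field, String term, int docId) {
        postings.computeIfAbsent(field + ":" + term, k -> new TreeSet<>()).add(docId);
    }

    // TermQuery analogue: the term is looked up exactly as given.
    Set<Integer> term(String field, String term) {
        Set<Integer> hits = postings.getOrDefault(field + ":" + term, Collections.<Integer>emptySet());
        return new TreeSet<>(hits);
    }

    // BooleanQuery analogue with two required clauses: set intersection.
    static Set<Integer> must(Set<Integer> a, Set<Integer> b) {
        Set<Integer> out = new TreeSet<>(a);
        out.retainAll(b);
        return out;
    }

    public static void main(String[] args) {
        TermLookupSketch idx = new TermLookupSketch();
        idx.index("filename", "testExcel-xaz.xls", 1);
        idx.index("filename", "report.doc", 2);
        idx.index("contents", "budget", 1);
        idx.index("contents", "budget", 2);

        Set<Integer> byName = idx.term("filename", "testExcel-xaz.xls");
        Set<Integer> byText = idx.term("contents", "budget");
        System.out.println(must(byName, byText)); // [1]
    }
}
```

Because the exact lookup never passes through an analyzer, the "." and "-" in the filename are just ordinary characters of the term.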


--
Ian.



khmarbaise at gmx

Nov 28, 2009, 1:39 PM

Post #4 of 5
Re: Problem with a "." for searching Lucene 2.4.0

Hi Ian,

many thanks for the hints... Based on your and Erick's hints I have taken
a deeper look into that... and the StandardAnalyzer which I'm using
removes characters like "." and "-" from my queries
(+filename:testEXCEL-formats.xls) ...

> In addition to Erick's advice, since you are storing filename without
> analysis you could use a TermQuery to find it.
Does this mean I don't need to index the filename?

> You can use
> BooleanQuery to combine that with other queries, including those
> generated by QueryParser.
>
Based on that advice I have made an implementation which modifies my
CustomerQueryParser:

// Overrides QueryParser.getFieldQuery to special-case two fields.
protected Query getFieldQuery(String field, String term)
        throws ParseException {
    LOGGER.debug("getFieldQuery(): field:" + field + " Term: " + term);

    // Pad the revision number before it is handed on to the parser.
    if (FieldNames.REVISION.getValue().equals(field)) {
        int revision = Integer.parseInt(term);
        term = NumberUtils.pad(revision);
    }

    // The filename field is indexed NOT_ANALYZED, so skip the analyzer
    // and look the lowercased term up directly.
    if (FieldNames.FILENAME.getValue().equals(field)) {
        Term t = new Term(FieldNames.FILENAME.getValue(), term.toLowerCase());
        TermQuery tq = new TermQuery(t);
        BooleanQuery bq = new BooleanQuery();
        bq.add(tq, Occur.MUST);
        return bq;
    }
    return super.getFieldQuery(field, term);
}

Based on my unit tests it works as expected...

But I'm not sure I understand things like "queryparts
-filename:*.xls" correctly...

Doesn't that mean that my implementation will change the behaviour into
the following:

"queryparts +filename:*.xls", or did I misunderstand things here?


Thanks for your help...

Kind regards
Karl Heinz Marbaise


erickerickson at gmail

Nov 29, 2009, 10:34 AM

Post #5 of 5
Re: Problem with a "." for searching Lucene 2.4.0

See below

On Sat, Nov 28, 2009 at 4:39 PM, Karl Heinz Marbaise <khmarbaise [at] gmx>wrote:

> Hi Ian,
>
> many thanks for the hints... Based on your and Erick's hints I have taken a
> deeper look into that... and the StandardAnalyzer which I'm using
> removes characters like "." and "-" from my queries
> (+filename:testEXCEL-formats.xls) ...
>
>
Here's the first issue. I wouldn't use StandardAnalyzer here at all. You're
taking an analyzer that's not intended to handle file names (actually, it's
intended to try to preserve emails, etc.) and then having to compensate for
its actions in your query parser. PerFieldAnalyzerWrapper can be used both
at index and query time to parse different fields with different analyzers.

Rather, I'd create my own analyzer from the tokenizers and token filters
Lucene provides that do what I want. Say a LowerCaseFilter and
WhitespaceAnalyzer or something. Use that analyzer for indexing and
querying...
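
As a rough picture of the per-field idea, here is a plain-Java sketch (no Lucene classes) in which each field name maps to its own analysis function and every other field falls back to a default. All names are hypothetical: PerFieldAnalysisSketch only imitates the role PerFieldAnalyzerWrapper plays, splitOnPunctuation is a crude stand-in for a general text analyzer, and lowercaseKeyword stands in for an analyzer that keeps the whole filename as one lowercased term.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

final class PerFieldAnalysisSketch {

    private final Function<String, List<String>> fallback;
    private final Map<String, Function<String, List<String>>> perField = new HashMap<>();

    PerFieldAnalysisSketch(Function<String, List<String>> fallback) {
        this.fallback = fallback;
    }

    // Register a dedicated analysis function for one field.
    void addAnalyzer(String field, Function<String, List<String>> analyzer) {
        perField.put(field, analyzer);
    }

    // Use the field's own function if registered, otherwise the fallback.
    List<String> analyze(String field, String text) {
        return perField.getOrDefault(field, fallback).apply(text);
    }

    // Stand-in for a general-purpose analyzer: lowercase and split.
    static List<String> splitOnPunctuation(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Stand-in for a filename analyzer: one lowercased term, "." and "-" kept.
    static List<String> lowercaseKeyword(String text) {
        return Collections.singletonList(text.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        PerFieldAnalysisSketch analysis =
                new PerFieldAnalysisSketch(PerFieldAnalysisSketch::splitOnPunctuation);
        analysis.addAnalyzer("filename", PerFieldAnalysisSketch::lowercaseKeyword);

        // The same object is used for "indexing" and "querying".
        List<String> indexed = analysis.analyze("filename", "testExcel-xaz.xls");
        List<String> queried = analysis.analyze("filename", "TESTEXCEL-xaz.xls");
        System.out.println(indexed);                 // [testexcel-xaz.xls]
        System.out.println(indexed.equals(queried)); // true
        System.out.println(analysis.analyze("contents", "Quarterly report, v2"));
    }
}
```

Since one configuration serves both sides, the terms produced for a filename at query time are the same ones that were indexed, and no compensation is needed in the query parser.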


>> In addition to Erick's advice, since you are storing filename without
>> analysis you could use a TermQuery to find it.
>
> Does this mean I don't need to index the filename?
>
>
Indexing and storing are orthogonal. That is, if you want to search
on something, you MUST index it. Storing it is simply putting an
un-analyzed copy in your Document so you can easily display
the original data.


>> You can use
>> BooleanQuery to combine that with other queries, including those
>> generated by QueryParser.
>
> Based on that advice I have made an implementation which modifies my
> CustomerQueryParser:
>
Rather than do this, I'd re-use a custom analyzer (see above, and assuming
that you can't use one of the standard analyzers) and just escape the
relevant characters before feeding them to the query parser. The Lucene Wiki
has a list of characters that need escaping, I'm pretty sure. But see
QueryParser.escape....
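
To show what escaping does to the filename from this thread, here is a plain-Java stand-in. escapeQuerySyntax is a hypothetical helper written for illustration; real code should call QueryParser.escape instead. The character list is an assumption based on the query parser syntax documentation, so check it against the Lucene version in use.

```java
final class QueryEscapeSketch {

    // Assumed set of characters with special meaning in the query syntax:
    // \ + - ! ( ) : ^ [ ] " { } ~ * ? | &
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&";

    // Put a backslash in front of every special character.
    static String escapeQuerySyntax(String raw) {
        StringBuilder sb = new StringBuilder(raw.length() + 8);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Only the user-supplied value is escaped, not the "+filename:" part.
        System.out.println("+filename:" + escapeQuerySyntax("testExcel-xaz.xls"));
        // prints: +filename:testExcel\-xaz.xls
    }
}
```

Note that "." is not in the list: escaping protects the hyphen from being read as query syntax, while what happens to the period is decided by the analyzer.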


> protected Query getFieldQuery(String field, String term) throws
> ParseException {
>     LOGGER.debug("getFieldQuery(): field:" + field + " Term: " + term);
>     if (FieldNames.REVISION.getValue().equals(field)) {
>         int revision = Integer.parseInt(term);
>         term = NumberUtils.pad(revision);
>     }
>
>     if (FieldNames.FILENAME.getValue().equals(field)) {
>         Term t = new Term(FieldNames.FILENAME.getValue(), term.toLowerCase());
>         TermQuery tq = new TermQuery(t);
>         BooleanQuery bq = new BooleanQuery();
>         bq.add(tq, Occur.MUST);
>         return bq;
>     }
>     return super.getFieldQuery(field, term);
> }
>
> Based on my unit tests it works as expected...
>
> But I'm not sure I understand things like "queryparts -filename:*.xls"
> correctly...
>
>
If you can use analyzers as above, you'll save yourself a lot of work by
letting Lucene do the heavy lifting <G>...

Best
Erick


