
Mailing List Archive: Lucene: General

How to structure lucene query?

 

 



ywlee522 at gmail

Jun 6, 2009, 8:59 AM

Post #1 of 8 (1530 views)
How to structure lucene query?

A document has three fields: username, date, and document text. A user can
have multiple documents.

The query is:

Of the users who have one or more documents with keyword "ABC", find users
who also have one or more documents with keyword "XYZ".

This isn't the same as finding documents with both "ABC" and "XYZ". How can
this be done in a Lucene query? Thank you.



--
View this message in context: http://www.nabble.com/How-to-structure-lucene-query--tp23902784p23902784.html
Sent from the Lucene - General mailing list archive at Nabble.com.


ted.dunning at gmail

Jun 6, 2009, 11:39 PM

Post #2 of 8 (1465 views)
Re: How to structure lucene query? [In reply to]

It is the same as finding documents with both "ABC" and "XYZ" except that
you need to run over the results yourself and collect the user names.

Lucene doesn't have a fancy query language so you can't magically do any
group-by or count(distinct) tricks.
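
Collecting the user names on the client side amounts to a de-duplicating pass
over the hits. A minimal sketch in Python, using plain dictionaries as stand-ins
for search hits; the `username` key and the sample data are assumptions for
illustration, and no Lucene API is called here:

```python
def distinct_usernames(hits):
    """Return the user names seen in `hits`, in first-seen order, without duplicates."""
    seen = set()
    ordered = []
    for hit in hits:
        name = hit["username"]
        if name not in seen:
            seen.add(name)
            ordered.append(name)
    return ordered


# Hypothetical hits, standing in for the documents a search would return.
hits = [
    {"username": "alice", "text": "ABC and XYZ"},
    {"username": "bob", "text": "XYZ with ABC"},
    {"username": "alice", "text": "ABC XYZ again"},
]
print(distinct_usernames(hits))
```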



--
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
http://www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)


ywlee522 at gmail

Jun 7, 2009, 7:28 AM

Post #3 of 8 (1456 views)
Re: How to structure lucene query? [In reply to]

Thanks for the tip. But no, it is not the same as finding documents with both
"ABC" and "XYZ", as they can appear in separate documents of the same
user.






simon.willnauer at googlemail

Jun 7, 2009, 9:40 AM

Post #4 of 8 (1459 views)
Re: How to structure lucene query? [In reply to]

Could you please give us more details of your query, or an example that
might help us understand what you are trying to do? I had the same
impression as Ted, though.

simon



ted.dunning at gmail

Jun 7, 2009, 11:32 AM

Post #5 of 8 (1460 views)
Re: How to structure lucene query? [In reply to]

In that case, you need to build a bunch of user "documents" which are the
union of the fields in question.

Then the retrieval is trivial.
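
One way to picture the per-user "documents" is as the concatenation of each
user's report text, after which a plain AND over keywords suffices. The sketch
below uses Python dictionaries rather than a real index; the field names
`username` and `text`, the helper names, and the sample reports are all
assumptions made for illustration:

```python
from collections import defaultdict


def build_user_documents(reports):
    """Merge every report of a user into one combined 'user document'."""
    texts = defaultdict(list)
    for report in reports:
        texts[report["username"]].append(report["text"])
    return {user: " ".join(parts) for user, parts in texts.items()}


def users_matching_all(user_docs, keywords):
    """Return the sorted users whose combined text contains every keyword."""
    return sorted(
        user for user, text in user_docs.items()
        if all(keyword in text.split() for keyword in keywords)
    )


# Hypothetical reports: alice has ABC and XYZ in separate reports.
reports = [
    {"username": "alice", "text": "ABC market"},
    {"username": "alice", "text": "XYZ phones"},
    {"username": "bob", "text": "ABC only"},
]
user_docs = build_user_documents(reports)
print(users_matching_all(user_docs, ["ABC", "XYZ"]))
```

Merging the text this way drops per-report fields such as dates, so it only
fits queries that do not need them.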



ywlee522 at gmail

Jun 7, 2009, 11:48 AM

Post #6 of 8 (1447 views)
Re: How to structure lucene query? [In reply to]

Thanks for the comments. Apologies for not providing details earlier.

Users in my system generate reports of some type every day. So a Lucene
document has 4 fields: user name, report create_dt, report type, and report
text. For example, an analyst writes a report on the telco market today, and
may write a report on mobile phones tomorrow.

The query is: of the users who have one or more reports containing "ABC",
find the users who also have one or more reports containing "XYZ".

A user may have "ABC" in one report and "XYZ" in another report, i.e., not
in the same report. This should still match the query.

I first tried this in two searches: one searching for "ABC" and collecting
user names (going through all results), and a second one searching for "XYZ"
among the users found in the first search. But this seems very inefficient,
and I am not sure if this is the right use of Lucene.
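
The two-search idea can also be expressed as two independent searches whose
user sets are intersected afterwards. This is a rough sketch over in-memory
records, not Lucene code; `users_with_keyword` is a hypothetical stand-in for
running one search and collecting user names, and the sample reports are
made up:

```python
def users_with_keyword(reports, keyword):
    """Stand-in for one search: the set of users with a report containing `keyword`."""
    return {r["username"] for r in reports if keyword in r["text"].split()}


def users_with_both(reports, first, second):
    """Users with at least one report per keyword, not necessarily the same report."""
    return users_with_keyword(reports, first) & users_with_keyword(reports, second)


# Hypothetical reports: only alice has both keywords, in different reports.
reports = [
    {"username": "alice", "text": "ABC telco market"},
    {"username": "alice", "text": "XYZ mobile phones"},
    {"username": "bob", "text": "ABC telco market"},
    {"username": "carol", "text": "XYZ mobile phones"},
]
print(sorted(users_with_both(reports, "ABC", "XYZ")))
```

Because neither search depends on the result of the other, the second one does
not have to be restricted to the users found by the first.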

If I put all reports of a user into a single Lucene document, then it is
equivalent to finding all documents containing both "ABC" and "XYZ". But then I
will lose the report create_dt field, which is another parameter in the query.










ted.dunning at gmail

Jun 7, 2009, 3:28 PM

Post #7 of 8 (1459 views)
Re: How to structure lucene query? [In reply to]

You can have more than one kind of document in your index.

If you have users and reports in your index then you can search users to
find those who have touched the subjects you need and then you can search
for reports authored by one of the authors qualified in the first query.
This will be pretty efficient, especially if you paginate the list of
authors so you only retrieve the documents for a few authors at a time.

To be specific, reports would have the fields author_name, author_id,
report_create_dt, report_type, and text, much as you mentioned above. I would
also add a globally unique report id called report_id.

Then authors would have a few fields: user_id, user_name,
report_types_written and report_ids_written.
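
As a rough illustration of how the author records could be derived from the
report records, here is a sketch using Python dictionaries. The field names
follow the ones listed above; the function name and the sample reports are
assumptions, and indexing the resulting records is left out:

```python
def build_author_documents(reports):
    """Derive one author record per author_id from the report records."""
    authors = {}
    for report in reports:
        author = authors.setdefault(report["author_id"], {
            "user_id": report["author_id"],
            "user_name": report["author_name"],
            "report_types_written": set(),
            "report_ids_written": [],
        })
        author["report_types_written"].add(report["report_type"])
        author["report_ids_written"].append(report["report_id"])
    return list(authors.values())


# Hypothetical report records; report_create_dt and text are omitted for brevity.
reports = [
    {"report_id": 1, "author_id": 52, "author_name": "Alice", "report_type": "ABC"},
    {"report_id": 3, "author_id": 52, "author_name": "Alice", "report_type": "XYZ"},
    {"report_id": 22, "author_id": 1327, "author_name": "Bob", "report_type": "ABC"},
]
for author in build_author_documents(reports):
    print(author["user_name"], author["report_ids_written"])
```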

Query one would be

report_types_written:(+"XYZ" +"ABC")

And you would retain user_name, report_ids_written. Note that reports don't
have this field and thus will never be retrieved here.

Suppose you find the following three authors:

user_id: 52, user_name: Alice, report_ids_written: 1, 3, 5, 7
user_id: 1327, user_name: Bob, report_ids_written: 22, 11, 55, 77, 3
user_id: 52, user_name: Alice, report_ids_written: 4, 6, 12

The second query would be

report_id:(1 3 4 5 6 7 11 12 22 55 77)

If you find thousands of reports that you want to retrieve, you should
retrieve them in batches. If you present the authors in pages of 10 or 20,
then you are unlikely to have more than dozens of reports to retrieve per
page.

Note that the second query will only retrieve reports because authors don't
have that field.
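
Assembling the second query from the author records is plain string building.
A small sketch, assuming each author record carries a `report_ids_written`
list as in the example above; the function names and the page size are
hypothetical:

```python
def report_id_query(authors):
    """Build a report_id query covering every report written by the given authors."""
    ids = sorted({rid for author in authors for rid in author["report_ids_written"]})
    return "report_id:(" + " ".join(str(rid) for rid in ids) + ")"


def pages(items, size):
    """Yield consecutive slices of `items` with at most `size` entries each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# The three id lists from the example in this thread.
authors = [
    {"report_ids_written": [1, 3, 5, 7]},
    {"report_ids_written": [22, 11, 55, 77, 3]},
    {"report_ids_written": [4, 6, 12]},
]
print(report_id_query(authors))

# One query per page of authors keeps each batch of reports small.
for page in pages(authors, 2):
    print(report_id_query(page))
```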

If you want to limit the dates for the second query, you could use this:

+report_id:(1 3 4 5 6 7 11 12 22 55 77) +report_create_dt:[20051009 TO 20090605]

Note how there is now a + on the first term. It could have been used on the
first version of the second query but would have had no effect. Once you
add additional terms, however, you need to include the plusses to make sure
you strictly apply all the conditions you need.
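
The date-restricted form can be generated the same way, with a + in front of
each clause so that both conditions are required. A sketch only: the function
name is hypothetical, and the dates are assumed to be indexed as yyyymmdd
strings, as the example query suggests:

```python
def restricted_report_query(report_ids, start, end):
    """Require both a report_id match and a report_create_dt inside [start, end]."""
    id_clause = "+report_id:(" + " ".join(str(rid) for rid in report_ids) + ")"
    date_clause = "+report_create_dt:[{} TO {}]".format(start, end)
    return id_clause + " " + date_clause


print(restricted_report_query([1, 3, 4], "20051009", "20090605"))
```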

Better?


On Sun, Jun 7, 2009 at 11:48 AM, ywlee522 <ywlee522 [at] gmail> wrote:

> The query is "of the users who has one or more reports containing "ABC",
> find users who also has one or more reports containing "XYZ".
>
> ..
>
> If I put all reports of a user into a single Lucene document, then it is
> equal to find all documents containing both "ABC" and "XYZ". But, then, i
> will lose the report_dt field, which is another parameter in the query.
>


ywlee522 at gmail

Jun 8, 2009, 6:56 AM

Post #8 of 8 (1458 views)
Re: How to structure lucene query? [In reply to]

Thanks for such detailed suggestions. I will definitely try this.


