
Mailing List Archive: Lucene: Java-User

Using Lucene to match document sets to each other

 

 



pacesysjosh at gmail

Dec 15, 2011, 1:56 PM

Post #1 of 5 (241 views)
Using Lucene to match document sets to each other

I have a use case for which I'm trying to figure out the best way to use
Lucene and could use some guidance.

I have a set of documents representing products in a catalog (name,
description, etc.). I then pull down data from different sources such as
eBay and Amazon and need to determine if the items retrieved from those
sources match any of the products in the catalog. So I'm essentially
attempting to take many items and many products and determine where I have
matches.

I'm not sure the best way to go about this, but one questionable approach
is to index the items as I pull them in (to RAM) and do one search for
every product in my catalog, looking for matching names or descriptions.
This means an almost exponential number of queries though. Is there a
better approach? Any help is appreciated.

Thanks,
Josh


gresh at us

Dec 16, 2011, 5:02 AM

Post #2 of 5 (236 views)
Re: Using Lucene to match document sets to each other

Maybe I'm misunderstanding what you're trying to do, but why not do it the
other way around; that is, index the items in your catalog, and use the
items on the web as the query into the catalog. I have an analogous process
(though completely different application area) and I index the stuff that
doesn't change much, and use the things that are constantly changing as the
query.
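
A minimal sketch of this "index the stable side, query with the changing side" idea, written in plain Java rather than against the Lucene API. The class name `CatalogIndex`, the product IDs, and the overlap-count scoring are all assumptions made for illustration; a real Lucene index would handle analysis and scoring itself.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy stand-in for a search index over the catalog (NOT Lucene):
// maps each term to the IDs of the products that contain it.
class CatalogIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Lowercase the text and split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) terms.add(t);
        }
        return terms;
    }

    // Index one catalog product (done once, since the catalog rarely changes).
    void addProduct(String productId, String text) {
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new HashSet<>()).add(productId);
        }
    }

    // Use an incoming web item as the query; returns product IDs ranked by
    // the number of distinct terms they share with the item, best first.
    List<String> match(String itemText) {
        Map<String, Integer> hits = new HashMap<>();
        for (String term : new HashSet<>(tokenize(itemText))) {
            for (String id : postings.getOrDefault(term, Collections.emptySet())) {
                hits.merge(id, 1, Integer::sum);
            }
        }
        List<String> ranked = new ArrayList<>(hits.keySet());
        ranked.sort((a, b) -> {
            int byScore = Integer.compare(hits.get(b), hits.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return ranked;
    }

    public static void main(String[] args) {
        CatalogIndex index = new CatalogIndex();
        index.addProduct("P1", "Acme Widget 3000 AW-3000");
        index.addProduct("P2", "Bolt Gadget BG-7");
        // prints [P1]
        System.out.println(index.match("Brand new Acme widget, model AW-3000, free shipping"));
    }
}
```

With this arrangement each incoming item costs one lookup against the catalog, so the work grows with the number of items rather than with items times products.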

Donna L. Gresh
Business Analytics and Mathematical Sciences
IBM T.J. Watson Research Center
(914) 945-2472
https://researcher.ibm.com/researcher/view.php?person=us-gresh
gresh [at] us






pacesysjosh at gmail

Dec 16, 2011, 9:53 AM

Post #3 of 5 (233 views)
Re: Using Lucene to match document sets to each other

Thanks for the response, Donna. That would make more sense, but the items
I'm pulling in from the web contain large bodies of text (descriptions)
whereas the products in my catalog consist of shorter fields such as
product name, manufacturer, product code, etc. So using the smaller fields
from my catalog to build queries against the larger fields in the items I
pull in seems to be the only way to do things (that I can think of).

And this brings up my exact problem. I have a document (set of fields) that
I want to use as search criteria for a search against another set of
documents. Can something like this be done?
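
One way to picture "a document as search criteria" is to turn each short catalog field into a quoted phrase clause aimed at the item's long text field. The sketch below only builds a query string in the style of Lucene's classic query syntax (`field:"phrase"` clauses joined with `OR`); the class name `ProductQueryBuilder` and the field name `description` are hypothetical, and parsing or running the query is left out.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.StringJoiner;

// Builds one query string from a product's short field values.
// "description" (the target field of the indexed web items) is an assumed name.
class ProductQueryBuilder {

    // Wrap a value in double quotes; inside a quoted phrase, backslashes
    // and double quotes are escaped with a backslash.
    static String quote(String value) {
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // One phrase clause per non-blank field value, combined with OR so that
    // an item matching any of the product's fields is returned.
    static String build(String targetField, Collection<String> fieldValues) {
        StringJoiner clauses = new StringJoiner(" OR ");
        for (String v : fieldValues) {
            if (v != null && !v.trim().isEmpty()) {
                clauses.add(targetField + ":" + quote(v.trim()));
            }
        }
        return clauses.toString();
    }

    public static void main(String[] args) {
        // prints description:"Acme Widget" OR description:"AW-3000"
        System.out.println(build("description", Arrays.asList("Acme Widget", "AW-3000")));
    }
}
```

Whether `OR` is the right combination is a judgment call: it favors recall, and the ranking then has to separate items that match several fields from those that match only one.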

Cheers,
Josh



erickerickson at gmail

Dec 16, 2011, 12:04 PM

Post #4 of 5 (235 views)
Re: Using Lucene to match document sets to each other

Have you looked at Lucene's "MoreLikeThis"? I confess I haven't
worked with this enough to recommend *how* to use it, but it seems
like it's in the general area you're talking about.

http://lucene.apache.org/java/3_5_0/api/contrib-queries/org/apache/lucene/search/similar/MoreLikeThis.html
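
For orientation, MoreLikeThis is documented as building a query from the most "interesting" terms of a source document, weighting terms by how often they occur in that document and how rare they are in the index. The sketch below imitates only that selection step in plain Java; it is not Lucene's implementation, and the class name `InterestingTerms`, the `docFreq` map, and the exact weighting formula are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java imitation of "pick the most distinctive terms of a document".
class InterestingTerms {

    // docTerms: tokens of the source document (e.g. one catalog product).
    // docFreq:  how many indexed documents contain each term (assumed input).
    // numDocs:  total number of indexed documents.
    static List<String> select(List<String> docTerms, Map<String, Integer> docFreq,
                               int numDocs, int maxTerms) {
        // Term frequency within the source document.
        Map<String, Integer> tf = new HashMap<>();
        for (String t : docTerms) tf.merge(t, 1, Integer::sum);

        // Score = tf * idf, so frequent-but-rare terms rank highest.
        Map<String, Double> score = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 0);
            if (df == 0) continue; // a term absent from the index cannot match
            double idf = Math.log((double) numDocs / df) + 1.0;
            score.put(e.getKey(), e.getValue() * idf);
        }

        List<String> terms = new ArrayList<>(score.keySet());
        terms.sort((a, b) -> {
            int c = Double.compare(score.get(b), score.get(a));
            return c != 0 ? c : a.compareTo(b);
        });
        return new ArrayList<>(terms.subList(0, Math.min(maxTerms, terms.size())));
    }

    public static void main(String[] args) {
        Map<String, Integer> docFreq = new HashMap<>();
        docFreq.put("acme", 1);
        docFreq.put("widget", 2);
        docFreq.put("the", 100);
        // prints [widget, acme]
        System.out.println(select(
            Arrays.asList("acme", "widget", "the", "the", "widget"), docFreq, 100, 2));
    }
}
```

The selected terms would then be OR'ed into a query against the other document set, which is roughly the "document as search criteria" shape asked about earlier in the thread.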

Best
Erick


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


paul at metajure

Dec 19, 2011, 9:17 AM

Post #5 of 5 (225 views)
RE: Using Lucene to match document sets to each other

I'm not sure I understand what your field arrangement would be when you say
"[T]he items I'm pulling in from the web contain large bodies of text (descriptions) whereas the products in my catalog consist of shorter fields such as product name, manufacturer, product code, etc. So using the smaller fields from my catalog to build queries against the larger fields in the items I pull in seems to be the only way to do things (that I can think of)."

I would want to take a vanilla crawl, parse, index approach: (1) find a candidate document, (2) parse the web document as best I could to locate all the fields of your existing documents ("product name, manufacturer, product code, etc."). But instead of creating a new document, I would form a very general query against my document set.

That sounds good, but if the web documents are tricky to parse, I could see why you might want to index the web documents as "text body" and search for any of your existing fields. You'd get good throughput searching for as many of your documents in as many of the web documents as possible, but of course, you'd NOT want to wait until you've crawled all of Amazon before checking for any matches. This leads me to think about a multi-phase approach where a crawler creates "useful"-size indices, closes each index, hands it off to the query-for-my-products phase, and starts another one. Note how this approach doesn't require your products to be in a Lucene index, just the web documents.
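
A rough sketch of that hand-off, in plain Java with no Lucene involved: crawled items are collected into bounded batches, and each full batch (standing in for one closed index) is passed to a matching phase while a new batch is started. The class name `BatchingCrawler`, the batch size, and the `Consumer` callback are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Collects crawled items into bounded batches and hands each closed batch
// to the "query for my products" phase. A batch stands in for one index.
class BatchingCrawler {
    private final int batchSize;
    private final Consumer<List<String>> matchPhase;
    private List<String> current = new ArrayList<>();

    BatchingCrawler(int batchSize, Consumer<List<String>> matchPhase) {
        this.batchSize = batchSize;
        this.matchPhase = matchPhase;
    }

    // Called for every item the crawler retrieves.
    void onItemCrawled(String itemText) {
        current.add(itemText);
        if (current.size() >= batchSize) flush();
    }

    // Close the current batch, start a fresh one, and hand the closed batch
    // off. Call once more when the crawl ends to process the partial batch.
    void flush() {
        if (current.isEmpty()) return;
        List<String> closed = current;
        current = new ArrayList<>();
        matchPhase.accept(closed);
    }

    public static void main(String[] args) {
        BatchingCrawler crawler = new BatchingCrawler(2,
            batch -> System.out.println("matching products against " + batch.size() + " items"));
        for (String item : Arrays.asList("item a", "item b", "item c")) {
            crawler.onItemCrawled(item);
        }
        crawler.flush(); // hand off the final, partial batch
    }
}
```

In this sketch the matching phase runs inline; handing the closed batch to a separate thread or process would let crawling continue while the product queries run.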

That sounds like a fun and interesting problem. Good luck.

-Paul

