
Mailing List Archive: Lucene: Java-User

A key value field storing

 

 



deb.lucene at gmail

Mar 21, 2012, 7:20 AM

Post #1 of 5
A key value field storing

Hi Group,

Sorry for cross posting!

We need to index a document corpus (news articles) with some metadata
features. The metadata are company names, each with a score (a
double between 0 and 1). For example, two documents could be:

document 1
(some text, say a technical article from the NY Times). It comes with
metadata like:
IBM - 0.5
Google - 0.9
Apple - 0.3

where 0.5, 0.9, 0.3 are some confidence scores for the company names.

Similarly, document 2 is some IT article, and its metadata
are:
IBM - 0.6
Google - 0.1
Apple - 0.4

Now we can easily index the documents by their contents or by the company
names. The problem is that we need a "field" in which the company names and
the scores are linked, so that we can run a search like:

query = the "company name" (a field) is "IBM" and the score for IBM
is > 0.5.
In that case only document 2 will be retrieved.

I am wondering if anyone has ideas about using the company names and scores
(linked) together as a field.

Thanks in advance,

--d


ian.lea at gmail

Mar 21, 2012, 7:41 AM

Post #2 of 5
Re: A key value field storing

Why do you want to link name and confidence in one field? Store
confidence as a NumericField and search something like

BooleanQuery bq = new BooleanQuery();
Query nameq = parser.parse(...); // or whatever
Query confq = NumericRangeQuery.newXxx(...);
bq.add(nameq, ...);
bq.add(confq, ...);

and search using bq.


--
Ian.



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


deb.lucene at gmail

Mar 21, 2012, 8:48 AM

Post #3 of 5
Re: A key value field storing

Hi Ian,

Thanks for the reply. I am not sure the bq solution will be able to solve
the problem. Let me explain with an example:

document 1 - (some text)
IBM - 0.6
Google - 0.1
Apple - 0.4

Now suppose I index the "company name" and the "confidence scores" as
separate fields and search using the bq, where the NumericField clause is
"anything below 0.5" and the text clause is "IBM". Document 1 will be
matched by mistake, because it was indexed with 0.6, 0.1 and 0.4, and two
of those are below 0.5. But it should not match, as the "IBM" score is 0.6.
In short, this problem needs some sort of link between the company name
and its score.
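
The mismatch can be reproduced without Lucene at all. The following is a minimal plain-Java sketch, not Lucene code: the class and method names are made up for illustration, and the two lists stand in for two independent multi-valued fields on the same document.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java model of the mismatch; no Lucene API is called here and
// all names are hypothetical.
class CrossMatchDemo {

    // Mimics two independent clauses: each one is checked against the
    // whole document, so any score below the limit satisfies the range.
    static boolean matchesIndependently(List<String> companies, List<Double> scores,
                                        String company, double maxExclusive) {
        boolean nameHit = companies.contains(company);
        boolean scoreHit = false;
        for (double s : scores) {
            if (s < maxExclusive) {
                scoreHit = true;
                break;
            }
        }
        return nameHit && scoreHit;
    }

    // What is actually wanted: only the score that belongs to the
    // requested company is tested.
    static boolean matchesLinked(List<String> companies, List<Double> scores,
                                 String company, double maxExclusive) {
        int i = companies.indexOf(company);
        return i >= 0 && scores.get(i) < maxExclusive;
    }

    public static void main(String[] args) {
        List<String> companies = Arrays.asList("IBM", "Google", "Apple");
        List<Double> scores = Arrays.asList(0.6, 0.1, 0.4);

        // "IBM" and "anything below 0.5"
        System.out.println(matchesIndependently(companies, scores, "IBM", 0.5)); // true
        System.out.println(matchesLinked(companies, scores, "IBM", 0.5));        // false
    }
}
```

The first call returns true only because Google's 0.1 satisfies the range clause, which is exactly the unwanted match.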

--d





uwe at thetaphi

Mar 21, 2012, 9:03 AM

Post #4 of 5
RE: A key value field storing

You can use a CustomScoreQuery wrapping your scored query to multiply the
"confidence level" (as a DocValues field in Lucene trunk, or an indexed
NumericField with precisionStep=Integer.MAX_VALUE using FieldCache) into the
score.
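
As a rough illustration of what multiplying the confidence into the score does to ranking, here is a plain-Java sketch. It does not call CustomScoreQuery or any other Lucene class; the Hit type, the document ids and the score values are hypothetical, and it assumes a single confidence value per document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Plain-Java illustration only; all names and numbers are hypothetical.
class ConfidenceBoostDemo {

    static final class Hit {
        final String docId;
        final float queryScore;  // score of the wrapped query (made-up value)
        final float confidence;  // per-document confidence, between 0 and 1

        Hit(String docId, float queryScore, float confidence) {
            this.docId = docId;
            this.queryScore = queryScore;
            this.confidence = confidence;
        }

        // The combined score: query score multiplied by the confidence.
        float combined() {
            return queryScore * confidence;
        }
    }

    // Returns a copy of the hits ordered by combined score, best first.
    static List<Hit> rank(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<Hit>(hits);
        Collections.sort(sorted, new Comparator<Hit>() {
            public int compare(Hit a, Hit b) {
                return Float.compare(b.combined(), a.combined());
            }
        });
        return sorted;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
                new Hit("doc1", 1.0f, 0.5f),
                new Hit("doc2", 0.9f, 0.6f));
        for (Hit h : rank(hits)) {
            System.out.println(h.docId + " " + h.combined());
        }
    }
}
```

With these made-up numbers doc2 ranks above doc1 even though its raw query score is lower, because its confidence is higher.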

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe [at] thetaphi




ian.lea at gmail

Mar 21, 2012, 9:03 AM

Post #5 of 5
Re: A key value field storing

Ah, I see. More complicated than I realized. How about using two
sorts of documents?

Type 1, one Lucene doc for your example:
  textid: 1234
  text: some text about something

Type 2, three Lucene docs for your example:
  First
    textid: 1234
    company: IBM
    score: 0.6
  Second
    textid: 1234
    company: Google
    score: 0.1
  Third
    textid: 1234
    company: Apple
    score: 0.4

You could then use the BooleanQuery approach to get textids, with an
additional lookup to get the actual text. Not brilliant, and it won't
work if you want to combine a text clause with the company criteria in
one query, e.g. text:aaaa company:google minconf:0.1.

There is BlockJoinQuery in recent versions that gives some sort of
parent/child relationship. Might be worth a look. Or wait for a
better idea from someone else.
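
The two-document-type layout can be modelled in plain Java to show why name and score stay linked. This is only a sketch of the idea under stated assumptions: no Lucene API is used, and CompanyDoc, findTextIds and lookupTexts are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Plain-Java model of the two document types; all names are hypothetical.
class TwoDocTypesDemo {

    // "Type 2": one small document per (text, company) pair.
    static final class CompanyDoc {
        final String textId;
        final String company;
        final double score;

        CompanyDoc(String textId, String company, double score) {
            this.textId = textId;
            this.company = company;
            this.score = score;
        }
    }

    // Step 1: name and score are tested on the same small document,
    // so a score can never be matched against the wrong company.
    static Set<String> findTextIds(List<CompanyDoc> docs, String company,
                                   double minExclusive) {
        Set<String> ids = new LinkedHashSet<String>();
        for (CompanyDoc d : docs) {
            if (d.company.equals(company) && d.score > minExclusive) {
                ids.add(d.textId);
            }
        }
        return ids;
    }

    // Step 2: the additional lookup from textid to the "Type 1" text.
    static List<String> lookupTexts(Map<String, String> textDocs, Set<String> textIds) {
        List<String> texts = new ArrayList<String>();
        for (String id : textIds) {
            String text = textDocs.get(id);
            if (text != null) {
                texts.add(text);
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        Map<String, String> textDocs = new LinkedHashMap<String, String>();
        textDocs.put("1234", "some text about something");

        List<CompanyDoc> companyDocs = Arrays.asList(
                new CompanyDoc("1234", "IBM", 0.6),
                new CompanyDoc("1234", "Google", 0.1),
                new CompanyDoc("1234", "Apple", 0.4));

        // company = IBM and score > 0.5
        Set<String> ids = findTextIds(companyDocs, "IBM", 0.5);
        System.out.println(lookupTexts(textDocs, ids));               // [some text about something]
        System.out.println(findTextIds(companyDocs, "Google", 0.5)); // []
    }
}
```

The cost of this layout is the second step: the text has to be fetched separately, which is the limitation noted above for queries that mix text and company criteria.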


--
Ian.

