
Mailing List Archive: Lucene: Java-User

The best strategy to "How store multiple fields of same document"

 

 



ksmmlist at gmail

Jul 31, 2008, 7:36 AM

Post #1 of 5
The best strategy to "How store multiple fields of same document"

The best strategy.

Hello.
I want to ask your opinion on how to
store multiple fields of the same document.

I see two possibilities:
1. Multiple fields in the document.
2. One field, for example named PROPERTIES, with multiple instances,
where each value is combined with its name, for example "name@value".
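
The two layouts can be sketched roughly as follows. This uses plain Java collections as a stand-in for a Lucene document, not the Lucene API itself; the property names `color` and `size` and the `@` separator are assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FieldLayouts {
    // Option 1: every property becomes its own named field.
    static Map<String, String> separateFields(Map<String, String> props) {
        return new LinkedHashMap<>(props);
    }

    // Option 2: a single multi-valued field (called PROPERTIES here)
    // whose values are "name@value" strings.
    static Map<String, List<String>> combinedField(Map<String, String> props) {
        List<String> values = new ArrayList<>();
        for (Map.Entry<String, String> e : props.entrySet()) {
            values.add(e.getKey() + "@" + e.getValue());
        }
        Map<String, List<String>> doc = new LinkedHashMap<>();
        doc.put("PROPERTIES", values);
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("color", "red");
        props.put("size", "large");
        System.out.println(separateFields(props));
        System.out.println(combinedField(props));
    }
}
```

In the first layout the field name carries the property name; in the second, the property name has to be recovered from the value itself.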

Which choice is best for search speed and resource usage?

Thanks.

Sergey Kabashnyuk
eXo Platform SAS

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe[at]lucene.apache.org
For additional commands, e-mail: java-user-help[at]lucene.apache.org


erickerickson at gmail

Jul 31, 2008, 7:50 AM

Post #2 of 5
Re: The best strategy to "How store multiple fields of same document"

I'd go with option 1 unless and until you could demonstrate performance
problems. Speaking of which, you'd get a more informed answer if you
provided a bit more data, like how many fields are we talking, how many
documents, etc. If you're indexing 10,000 documents, go with the simplest.
If you're indexing 1,000,000,000 documents, more thought is required <G>..
Do you expect 3 fields/doc or 30,000 fields/doc?

But the reason I'd go with option 1 is that your second option has some issues:
1> How do you tokenize? You'll probably have to write a custom tokenizer or risk
getting the tokens "name" and "value" rather than "name@value".
2> Forming queries is, I believe, equally complex in both cases, so
choose the conceptually simplest one. Let's say you have
to search on foo1:val1 and foo2:val2. In the first case this is
simply +foo1:val1 +foo2:val2. For your second case, you get
+bigfield:foo1@val1 +bigfield:foo2@val2. There's not much
difference between the two.
3> Back to my initial comment about resource usage: we don't
have enough data to answer whether it makes any difference.
But even if we did, you'd find the response a variation of
"you'll have to try it and see" since there are so many
variables.
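
To make points 1 and 2 concrete, here is a minimal sketch in plain Java. The two split methods are hypothetical stand-ins for a punctuation-splitting tokenizer and a whitespace-only tokenizer, not Lucene classes, and the `@` separator and the `bigfield` name are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

class CombinedFieldSketch {
    // Stand-in for a tokenizer that breaks on punctuation:
    // each run of letters or digits becomes one token.
    static List<String> splitOnPunctuation(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Stand-in for a tokenizer that breaks on whitespace only.
    static List<String> splitOnWhitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Option 1: one required clause per real field.
    static String separateFieldsQuery(String[][] pairs) {
        List<String> clauses = new ArrayList<>();
        for (String[] p : pairs) clauses.add("+" + p[0] + ":" + p[1]);
        return String.join(" ", clauses);
    }

    // Option 2: one required clause per "name@value" term in a single field.
    static String combinedFieldQuery(String field, String[][] pairs) {
        List<String> clauses = new ArrayList<>();
        for (String[] p : pairs) clauses.add("+" + field + ":" + p[0] + "@" + p[1]);
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        String[][] pairs = {{"foo1", "val1"}, {"foo2", "val2"}};
        System.out.println(splitOnPunctuation("foo1@val1")); // the term is split in two
        System.out.println(splitOnWhitespace("foo1@val1"));  // the term stays whole
        System.out.println(separateFieldsQuery(pairs));
        System.out.println(combinedFieldQuery("bigfield", pairs));
    }
}
```

The query strings come out about equally long either way; the real difference is that the combined form only works if the indexed terms keep the separator.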

But I'll repeat that I always go with the simplest approach unless and
until I'm certain there's a problem...

Best
Erick



ksmmlist at gmail

Jul 31, 2008, 8:29 AM

Post #3 of 5
Re: The best strategy to "How store multiple fields of same document"

Thank you Erick.

I'm talking about more than 10,000 documents, and 95% of them have fewer than 10 fields.
The maximum number of fields per document is unlimited,
but in practice it's no more than 20.


I'm interested: does Lucene have any internal optimization
that depends on the field count or field size, as databases do?
I mean something like determining the position of row X in the index:

positionX = sum(fieldsize[1]+...fieldsize[i])*(X-1)
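
As a worked example of that formula, the sketch below computes the offset of a row in a fixed-width layout, where every row has the same field sizes. The field sizes are made-up values, and this only illustrates the arithmetic in the question; it makes no claim about how Lucene lays out its index.

```java
class FixedWidthOffset {
    // Offset of row x (1-based) when every row has the same field sizes:
    // (fieldSizes[0] + ... + fieldSizes[n-1]) * (x - 1)
    static long rowOffset(int[] fieldSizes, long x) {
        if (x < 1) {
            throw new IllegalArgumentException("row numbers start at 1");
        }
        long rowSize = 0;
        for (int size : fieldSizes) {
            rowSize += size;
        }
        return rowSize * (x - 1);
    }

    public static void main(String[] args) {
        int[] fieldSizes = {4, 8, 20};                  // hypothetical sizes, 32 in total
        System.out.println(rowOffset(fieldSizes, 1));   // first row starts at 0
        System.out.println(rowOffset(fieldSizes, 3));   // 32 * 2 = 64
    }
}
```

The shortcut depends on every row having the same size, which is why it is asked here whether a variable number of fields per document would matter.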

Sergey Kabashnyuk
eXo Platform SAS




erickerickson at gmail

Jul 31, 2008, 12:43 PM

Post #4 of 5
Re: The best strategy to "How store multiple fields of same document"

Haven't a clue <G>.

Erick



anshumg at gmail

Aug 1, 2008, 11:51 AM

Post #5 of 5
Re: The best strategy to "How store multiple fields of same document"

Hey Sergey,
With those dimensions, I think you could work with multiple fields. I
have tried it with over a score of fields across more than 10 million documents. It works
fine if implemented neatly.
Will you be doing anything beyond vanilla search?
--
Anshum Gupta
Naukri Labs!



--
--
The facts expressed here belong to everybody, the opinions to me.
The distinction is yours to draw............
