
Mailing List Archive: Lucene: Java-User

Using separate index for each user

 

 



tobias.larsson.hult at findwise

Sep 16, 2008, 7:55 AM

Post #1 of 7
Using separate index for each user

Hi,

We're thinking of using Lucene to integrate search in a backup service
application. The background is that we have a bunch of users using a
backup service, and we want them to be able to search their own, and
only their own, backups.

The total amount of data being backed up is very large (on the order of
terabytes). Even though the index will probably be smaller, since we only
index the relevant fields, it is still too much to put in one
index. But since a user will only search his/her own files, we're
thinking of creating one index for each user. There will be a lot of
indexes, of course, but each index should not grow beyond a couple
of gigabytes at most.

So when a user searches or adds new content to the backup, we will open
his/her index and do the search/update in that particular index. That
way, each query/update should not be so performance-intensive.

Does this sound like a reasonable solution? Of course, this means
creating a lot of IndexReaders/Writers, but I prefer that to searching
a huge index every time when a user only wants to search a slice
of the total index.

Best Regards,
Tobias Larsson Hult


erickerickson at gmail

Sep 16, 2008, 8:17 AM

Post #2 of 7
Re: Using separate index for each user

The main arguments against using many separate indexes are
1> search warmup time. That is, each time you open an index
the first few queries take much longer than subsequent searches.
2> Managing a bazillion indexes is non-trivial.


That said, in your particular case these may not apply. I guess the
piece of information that really counts is "how often do you expect
to update/search a given index"? You could avoid the warmup issue
by keeping an index open for some period of time after the first
search on the assumption that the user is going to make multiple
searches rather than just one. I'm sure there are other tricks
you can try.
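
The "keep an index open for some period of time" idea could be sketched roughly as below. This is only an illustration under stated assumptions, not Lucene API: `IdleSearcherCache`, the `opener` function and the explicit `nowMillis` parameter are hypothetical names, and the searcher type is reduced to anything `Closeable`. It also ignores searches still in flight when a searcher is closed, which a real setup would have to handle.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.function.Function;

/** Keeps one open searcher per user and closes those idle for too long. */
class IdleSearcherCache<S extends Closeable> {

    private static final class Entry<T> {
        final T searcher;
        long lastUsedMillis;

        Entry(T searcher, long lastUsedMillis) {
            this.searcher = searcher;
            this.lastUsedMillis = lastUsedMillis;
        }
    }

    private final Map<String, Entry<S>> open = new HashMap<>();
    private final Function<String, S> opener;   // hypothetical: opens the index of one user
    private final long idleTimeoutMillis;

    IdleSearcherCache(Function<String, S> opener, long idleTimeoutMillis) {
        this.opener = opener;
        this.idleTimeoutMillis = idleTimeoutMillis;
    }

    /** Returns the user's searcher, opening it on first use, and records the access time. */
    synchronized S get(String user, long nowMillis) {
        Entry<S> entry = open.get(user);
        if (entry == null) {
            entry = new Entry<>(opener.apply(user), nowMillis);
            open.put(user, entry);
        }
        entry.lastUsedMillis = nowMillis;
        return entry.searcher;
    }

    /** Closes every searcher not used within the timeout; returns how many were closed. */
    synchronized int closeIdle(long nowMillis) {
        int closed = 0;
        Iterator<Entry<S>> it = open.values().iterator();
        while (it.hasNext()) {
            Entry<S> entry = it.next();
            if (nowMillis - entry.lastUsedMillis >= idleTimeoutMillis) {
                it.remove();
                try {
                    entry.searcher.close();
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
                closed++;
            }
        }
        return closed;
    }

    synchronized int openCount() {
        return open.size();
    }
}
```

A background task would call `closeIdle(System.currentTimeMillis())` every so often, so only the first search after a quiet period pays the warmup cost.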

So, how often do you expect
1> users to back up data
2> users to query data?
And what is an acceptable search response time? And are your
users willing to live with a significant delay on the first couple
of queries?

I'd only be comfortable with choosing an approach if I tried
it out with a single computer's content and generated a few
stats....

Best
Erick



otis_gospodnetic at yahoo

Sep 16, 2008, 9:10 AM

Post #3 of 7
Re: Using separate index for each user

Tobias,

That's the approach I took with Simpy.com and it's been working well for several years now. You'll have to keep track of searchers and close them when appropriate, of course.
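
One way to "keep track of searchers and close them when appropriate" is to cap how many stay open and close the least recently used one when the cap is exceeded. The sketch below is not how Simpy.com does it (the thread does not say); `BoundedSearcherPool` and `maxOpen` are made-up names, and the searcher is again modelled as any `Closeable`.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Holds at most maxOpen searchers; the least recently used one is closed on overflow. */
class BoundedSearcherPool<S extends Closeable> {

    private final LinkedHashMap<String, S> open;

    BoundedSearcherPool(final int maxOpen) {
        // accessOrder = true keeps entries ordered from least to most recently used.
        this.open = new LinkedHashMap<String, S>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, S> eldest) {
                if (size() <= maxOpen) {
                    return false;
                }
                close(eldest.getValue());   // close before the map drops the entry
                return true;
            }
        };
    }

    /** Registers a searcher for a user, closing any searcher it replaces. */
    synchronized void put(String user, S searcher) {
        S previous = open.put(user, searcher);
        if (previous != null && previous != searcher) {
            close(previous);
        }
    }

    /** Returns the user's open searcher, or null if none is open. */
    synchronized S get(String user) {
        return open.get(user);
    }

    synchronized int size() {
        return open.size();
    }

    private static void close(Closeable searcher) {
        try {
            searcher.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A cap like this bounds the number of open file handles no matter how many users exist, at the price of re-opening (and re-warming) an index for users who come back after being evicted.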

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe[at]lucene.apache.org
For additional commands, e-mail: java-user-help[at]lucene.apache.org


tobias.larsson.hult at findwise

Sep 17, 2008, 11:32 PM

Post #4 of 7
Re: Using separate index for each user

Thanks for the quick responses!

Good point about the warmup issues, Erick; that's something we will
consider. Good to know that this kind of setup has been proven to work
for at least one person :) I think we will do a small setup and test the
performance.

Thanks again for valuable input!

Best Regards
Tobias




erickerickson at gmail

Sep 18, 2008, 6:34 AM

Post #5 of 7
Re: Using separate index for each user

uuuhhhh, take anything Otis says as *much* more informed than anything I say
on this topic <G>.

Erick



alexander.aristov at gmail

Sep 19, 2008, 4:43 AM

Post #6 of 7
Re: Using separate index for each user

If you create a field in the index that holds the username, then you can
build search queries that reject entries which don't belong to the user.

It's much more efficient.

Alexander




--
Best Regards
Alexander Aristov


jimi.hullegard at mogul

Sep 19, 2008, 5:14 AM

Post #7 of 7
RE: Using separate index for each user

Well, if the total size of the relevant data is too big to fit in one single index, then simply adding the username as a field would not solve the original problem.

But what if you combine the two alternatives? Let's say that you can find a way to divide the users evenly based on their usernames. For example, every user with a username that starts with "A" gets one index, the ones that start with "B" get another index, and so on. Or even those that start with any of the letters "A" to "D" get one index, etc. That way you can keep each index at a reasonable size and still not end up with thousands of separate indexes. And since you know which index to search based on the username, you don't need any kind of distributed search.
And then you should of course add the username as a field, so that each search is filtered to only that user's data.
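
As a rough illustration of the letter-range idea, the routing step might look like the sketch below. `LetterRangeRouter`, `lettersPerShard` and the `index-shard-N` directory naming are invented for the example, and the extra catch-all shard for usernames that do not start with A-Z is an added assumption. Since first letters are rarely spread evenly, hashing the username would be a possible alternative if the shards turn out lopsided.

```java
/** Maps a username to a shard index based on the first letter of the name. */
class LetterRangeRouter {

    private final int lettersPerShard;

    LetterRangeRouter(int lettersPerShard) {
        if (lettersPerShard < 1 || lettersPerShard > 26) {
            throw new IllegalArgumentException("lettersPerShard must be between 1 and 26");
        }
        this.lettersPerShard = lettersPerShard;
    }

    /** Number of shards needed to cover A-Z (the catch-all shard comes on top of this). */
    int letterShardCount() {
        return (26 + lettersPerShard - 1) / lettersPerShard;
    }

    /**
     * With lettersPerShard = 4: A-D -> 0, E-H -> 1, ..., Y-Z -> 6.
     * Names that do not start with A-Z go to one extra catch-all shard.
     */
    int shardFor(String username) {
        if (username == null || username.isEmpty()) {
            throw new IllegalArgumentException("username must not be empty");
        }
        char first = Character.toUpperCase(username.charAt(0));
        if (first < 'A' || first > 'Z') {
            return letterShardCount();
        }
        return (first - 'A') / lettersPerShard;
    }

    /** Hypothetical directory naming: one index directory per shard. */
    String indexDirFor(String username) {
        return "index-shard-" + shardFor(username);
    }
}
```

Both indexing and searching would call `indexDirFor(username)` to pick the index, and each query would still be restricted by the username field so that users sharing a shard never see each other's documents.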

/Jimi

mogul | jimi hullegård | system developer | hudiksvallsgatan 4, 113 30 stockholm sweden | +46 8 506 66 172 | +46 765 27 19 55 | jimi.hullegard[at]mogul.com | www.mogul.com


