Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Lucene: Java-User

faceted search performance

 

 

Lucene java-user RSS feed   Index | Next | Previous | View Threaded


christoph.boosz at googlemail

Oct 12, 2009, 5:53 AM

Post #1 of 11 (960 views)
Permalink
faceted search performance

Hi,

I have a question related to faceted search. My index contains more than 1
million documents, and nearly 1 million terms. My aim is to get a DocIdSet
for each term occurring in the result of a query. I use the approach
described on
http://sujitpal.blogspot.com/2007/04/lucene-search-within-search-with.html<https://service.gmx.net/de/cgi/derefer?TYPE=3&DEST=http%3A%2F%2Fsujitpal.blogspot.com%2F2007%2F04%2Flucene-search-within-search-with.html>,
where a BitSet is built out of a QueryFilter for each term and intersected
with the BitSet representing the user query.
However, performance could be better. I guess it’s because the term filter
considers each document in the index, even if it’s not in the result. My
attempt to use a ChainedFilter, where the first filter (cached) is for the
user query, and the second one for the term (done for all terms), didn’t
speed things up, though.
Am I missing something? Is there a better way to get the DocIdSets for a
huge number of terms in a limited set of documents?

Thanks in advance!
Chris


paul.elschot at xs4all

Oct 12, 2009, 8:45 AM

Post #2 of 11 (897 views)
Permalink
Re: faceted search performance [In reply to]

On Monday 12 October 2009 14:53:45 Christoph Boosz wrote:
> Hi,
>
> I have a question related to faceted search. My index contains more than 1
> million documents, and nearly 1 million terms. My aim is to get a DocIdSet
> for each term occurring in the result of a query. I use the approach
> described on
> http://sujitpal.blogspot.com/2007/04/lucene-search-within-search-with.html<https://service.gmx.net/de/cgi/derefer?TYPE=3&DEST=http%3A%2F%2Fsujitpal.blogspot.com%2F2007%2F04%2Flucene-search-within-search-with.html>,
> where a BitSet is built out of a QueryFilter for each term and intersected
> with the BitSet representing the user query.
> However, performance could be better. I guess it’s because the term filter
> considers each document in the index, even if it’s not in the result. My
> attempt to use a ChainedFilter, where the first filter (cached) is for the
> user query, and the second one for the term (done for all terms), didn’t
> speed things up, though.
> Am I missing something? Is there a better way to get the DocIdSets for a
> huge number of terms in a limited set of documents?

Assuming you only need the number of documents within the original query
that contain each term, one thing that can be saved is the allocation of the
resulting BitSet for each term. To do this, use the cached BitSet (or the
OpenBitSet in current lucene) for the original Query as a filter for a TermQuery
per term, and then count the matching documents by using a counting
HitCollector on the IndexSearcher.

Regards,
Paul Elschot


john.wang at gmail

Oct 12, 2009, 9:32 AM

Post #3 of 11 (906 views)
Permalink
Re: faceted search performance [In reply to]

Given you have 1M docs and about 1M terms, do you see very few docs per
term?
If your DocSet per term is very sparse, BitSet is probably not a good
representation. Simple int array maybe better for memory, and faster for
iterating.

-John

On Mon, Oct 12, 2009 at 8:45 AM, Paul Elschot <paul.elschot [at] xs4all>wrote:

> On Monday 12 October 2009 14:53:45 Christoph Boosz wrote:
> > Hi,
> >
> > I have a question related to faceted search. My index contains more than
> 1
> > million documents, and nearly 1 million terms. My aim is to get a
> DocIdSet
> > for each term occurring in the result of a query. I use the approach
> > described on
> >
> http://sujitpal.blogspot.com/2007/04/lucene-search-within-search-with.html
> <
> https://service.gmx.net/de/cgi/derefer?TYPE=3&DEST=http%3A%2F%2Fsujitpal.blogspot.com%2F2007%2F04%2Flucene-search-within-search-with.html
> >,
> > where a BitSet is built out of a QueryFilter for each term and
> intersected
> > with the BitSet representing the user query.
> > However, performance could be better. I guess it’s because the term
> filter
> > considers each document in the index, even if it’s not in the result. My
> > attempt to use a ChainedFilter, where the first filter (cached) is for
> the
> > user query, and the second one for the term (done for all terms), didn’t
> > speed things up, though.
> > Am I missing something? Is there a better way to get the DocIdSets for a
> > huge number of terms in a limited set of documents?
>
> Assuming you only need the number of documents within the original query
> that contain each term, one thing that can be saved is the allocation of
> the
> resulting BitSet for each term. To do this, use the cached BitSet (or the
> OpenBitSet in current lucene) for the original Query as a filter for a
> TermQuery
> per term, and then count the matching documents by using a counting
> HitCollector on the IndexSearcher.
>
> Regards,
> Paul Elschot
>


christoph.boosz at googlemail

Oct 12, 2009, 10:30 AM

Post #4 of 11 (905 views)
Permalink
Re: faceted search performance [In reply to]

Thanks for your reply.
Yes, it's likely that many terms occur in few documents.

If I understand you right, I should do the following:
-Write a HitCollector that simply increments a counter
-Get the filter for the user query once: new CachingWrapperFilter(new
QueryWrapperFilter(userQuery));
-Create a TermQuery for each term
-Perform the search and read the counter of the HitCollector

I did that, but it didn't get faster. Any ideas why?

Regards,
Chris

2009/10/12 John Wang <john.wang [at] gmail>

> Given you have 1M docs and about 1M terms, do you see very few docs per
> term?
> If your DocSet per term is very sparse, BitSet is probably not a good
> representation. Simple int array maybe better for memory, and faster for
> iterating.
>
> -John
>
> On Mon, Oct 12, 2009 at 8:45 AM, Paul Elschot <paul.elschot [at] xs4all
> >wrote:
>
> > On Monday 12 October 2009 14:53:45 Christoph Boosz wrote:
> > > Hi,
> > >
> > > I have a question related to faceted search. My index contains more
> than
> > 1
> > > million documents, and nearly 1 million terms. My aim is to get a
> > DocIdSet
> > > for each term occurring in the result of a query. I use the approach
> > > described on
> > >
> >
> http://sujitpal.blogspot.com/2007/04/lucene-search-within-search-with.html
> > <
> >
> https://service.gmx.net/de/cgi/derefer?TYPE=3&DEST=http%3A%2F%2Fsujitpal.blogspot.com%2F2007%2F04%2Flucene-search-within-search-with.html
> > >,
> > > where a BitSet is built out of a QueryFilter for each term and
> > intersected
> > > with the BitSet representing the user query.
> > > However, performance could be better. I guess it’s because the term
> > filter
> > > considers each document in the index, even if it’s not in the result.
> My
> > > attempt to use a ChainedFilter, where the first filter (cached) is for
> > the
> > > user query, and the second one for the term (done for all terms),
> didn’t
> > > speed things up, though.
> > > Am I missing something? Is there a better way to get the DocIdSets for
> a
> > > huge number of terms in a limited set of documents?
> >
> > Assuming you only need the number of documents within the original query
> > that contain each term, one thing that can be saved is the allocation of
> > the
> > resulting BitSet for each term. To do this, use the cached BitSet (or the
> > OpenBitSet in current lucene) for the original Query as a filter for a
> > TermQuery
> > per term, and then count the matching documents by using a counting
> > HitCollector on the IndexSearcher.
> >
> > Regards,
> > Paul Elschot
> >
>


jake.mannix at gmail

Oct 12, 2009, 11:02 AM

Post #5 of 11 (897 views)
Permalink
Re: faceted search performance [In reply to]

Hey Chris,

On Mon, Oct 12, 2009 at 10:30 AM, Christoph Boosz <
christoph.boosz [at] googlemail> wrote:

> Thanks for your reply.
> Yes, it's likely that many terms occur in few documents.
>
> If I understand you right, I should do the following:
> -Write a HitCollector that simply increments a counter
> -Get the filter for the user query once: new CachingWrapperFilter(new
> QueryWrapperFilter(userQuery));
> -Create a TermQuery for each term
> -Perform the search and read the counter of the HitCollector
>
> I did that, but it didn't get faster. Any ideas why?
>

This killer is the "TermQuery for each term" part - this is huge. You need
to invert this process,
and use your query as is, but while walking in the HitCollector, on each doc
which matches
your query, increment counters for each of the terms in that document (which
means you need
an in-memory forward lookup for your documents, like a multivalued
FieldCache - and if you've
got roughly the same number of terms as documents, this cache is likely to
be as large as
your entire index - a pretty hefty RAM cost).

But a good thing to keep in mind is that doing this kind of faceting
(massively multivalued
on a huge term-set) requires a lot of computation, even if you have all the
proper structures
living in memory:

For each document you look at (which matches your query), you need to look
at all
of the terms in that document, and increment a counter for that term. So
however much
time it would normally take for you to do the driving query, it can take as
much as that
multiplied by the average number of terms in a document in your index. If
your documents
are big, this could be a pretty huge latency penalty.

-jake


christoph.boosz at googlemail

Oct 12, 2009, 12:30 PM

Post #6 of 11 (908 views)
Permalink
Re: faceted search performance [In reply to]

Hi Jake,

Thanks for your helpful explanation.
In fact, my initial solution was to traverse each document in the result
once and count the contained terms. As you mentioned, this process took a
lot of memory.
Trying to confine the memory usage with the facet approach, I was surprised
by the decline in performance.
Now I know it's nothing abnormal, at least.

Chris


2009/10/12 Jake Mannix <jake.mannix [at] gmail>

> Hey Chris,
>
> On Mon, Oct 12, 2009 at 10:30 AM, Christoph Boosz <
> christoph.boosz [at] googlemail> wrote:
>
> > Thanks for your reply.
> > Yes, it's likely that many terms occur in few documents.
> >
> > If I understand you right, I should do the following:
> > -Write a HitCollector that simply increments a counter
> > -Get the filter for the user query once: new CachingWrapperFilter(new
> > QueryWrapperFilter(userQuery));
> > -Create a TermQuery for each term
> > -Perform the search and read the counter of the HitCollector
> >
> > I did that, but it didn't get faster. Any ideas why?
> >
>
> This killer is the "TermQuery for each term" part - this is huge. You need
> to invert this process,
> and use your query as is, but while walking in the HitCollector, on each
> doc
> which matches
> your query, increment counters for each of the terms in that document
> (which
> means you need
> an in-memory forward lookup for your documents, like a multivalued
> FieldCache - and if you've
> got roughly the same number of terms as documents, this cache is likely to
> be as large as
> your entire index - a pretty hefty RAM cost).
>
> But a good thing to keep in mind is that doing this kind of faceting
> (massively multivalued
> on a huge term-set) requires a lot of computation, even if you have all the
> proper structures
> living in memory:
>
> For each document you look at (which matches your query), you need to look
> at all
> of the terms in that document, and increment a counter for that term. So
> however much
> time it would normally take for you to do the driving query, it can take as
> much as that
> multiplied by the average number of terms in a document in your index. If
> your documents
> are big, this could be a pretty huge latency penalty.
>
> -jake
>


paul.elschot at xs4all

Oct 12, 2009, 1:54 PM

Post #7 of 11 (897 views)
Permalink
Re: faceted search performance [In reply to]

Chris,

You could also store term vectors for all docs at indexing
time, and add the termvectors for the matching docs into a
(large) map of terms in RAM.

Regards,
Paul Elschot


On Monday 12 October 2009 21:30:48 Christoph Boosz wrote:
> Hi Jake,
>
> Thanks for your helpful explanation.
> In fact, my initial solution was to traverse each document in the result
> once and count the contained terms. As you mentioned, this process took a
> lot of memory.
> Trying to confine the memory usage with the facet approach, I was surprised
> by the decline in performance.
> Now I know it's nothing abnormal, at least.
>
> Chris
>
>
> 2009/10/12 Jake Mannix <jake.mannix [at] gmail>
>
> > Hey Chris,
> >
> > On Mon, Oct 12, 2009 at 10:30 AM, Christoph Boosz <
> > christoph.boosz [at] googlemail> wrote:
> >
> > > Thanks for your reply.
> > > Yes, it's likely that many terms occur in few documents.
> > >
> > > If I understand you right, I should do the following:
> > > -Write a HitCollector that simply increments a counter
> > > -Get the filter for the user query once: new CachingWrapperFilter(new
> > > QueryWrapperFilter(userQuery));
> > > -Create a TermQuery for each term
> > > -Perform the search and read the counter of the HitCollector
> > >
> > > I did that, but it didn't get faster. Any ideas why?
> > >
> >
> > This killer is the "TermQuery for each term" part - this is huge. You need
> > to invert this process,
> > and use your query as is, but while walking in the HitCollector, on each
> > doc
> > which matches
> > your query, increment counters for each of the terms in that document
> > (which
> > means you need
> > an in-memory forward lookup for your documents, like a multivalued
> > FieldCache - and if you've
> > got roughly the same number of terms as documents, this cache is likely to
> > be as large as
> > your entire index - a pretty hefty RAM cost).
> >
> > But a good thing to keep in mind is that doing this kind of faceting
> > (massively multivalued
> > on a huge term-set) requires a lot of computation, even if you have all the
> > proper structures
> > living in memory:
> >
> > For each document you look at (which matches your query), you need to look
> > at all
> > of the terms in that document, and increment a counter for that term. So
> > however much
> > time it would normally take for you to do the driving query, it can take as
> > much as that
> > multiplied by the average number of terms in a document in your index. If
> > your documents
> > are big, this could be a pretty huge latency penalty.
> >
> > -jake
> >
>


christoph.boosz at googlemail

Oct 12, 2009, 2:29 PM

Post #8 of 11 (904 views)
Permalink
Re: faceted search performance [In reply to]

Hi Paul,

Thanks for your suggestion. I will test it within the next few days.
However, due to memory limitations, it will only work if the number of hits
is small enough, am I right?

Chris

2009/10/12 Paul Elschot <paul.elschot [at] xs4all>

> Chris,
>
> You could also store term vectors for all docs at indexing
> time, and add the termvectors for the matching docs into a
> (large) map of terms in RAM.
>
> Regards,
> Paul Elschot
>
>
> On Monday 12 October 2009 21:30:48 Christoph Boosz wrote:
> > Hi Jake,
> >
> > Thanks for your helpful explanation.
> > In fact, my initial solution was to traverse each document in the result
> > once and count the contained terms. As you mentioned, this process took a
> > lot of memory.
> > Trying to confine the memory usage with the facet approach, I was
> surprised
> > by the decline in performance.
> > Now I know it's nothing abnormal, at least.
> >
> > Chris
> >
> >
> > 2009/10/12 Jake Mannix <jake.mannix [at] gmail>
> >
> > > Hey Chris,
> > >
> > > On Mon, Oct 12, 2009 at 10:30 AM, Christoph Boosz <
> > > christoph.boosz [at] googlemail> wrote:
> > >
> > > > Thanks for your reply.
> > > > Yes, it's likely that many terms occur in few documents.
> > > >
> > > > If I understand you right, I should do the following:
> > > > -Write a HitCollector that simply increments a counter
> > > > -Get the filter for the user query once: new CachingWrapperFilter(new
> > > > QueryWrapperFilter(userQuery));
> > > > -Create a TermQuery for each term
> > > > -Perform the search and read the counter of the HitCollector
> > > >
> > > > I did that, but it didn't get faster. Any ideas why?
> > > >
> > >
> > > This killer is the "TermQuery for each term" part - this is huge. You
> need
> > > to invert this process,
> > > and use your query as is, but while walking in the HitCollector, on
> each
> > > doc
> > > which matches
> > > your query, increment counters for each of the terms in that document
> > > (which
> > > means you need
> > > an in-memory forward lookup for your documents, like a multivalued
> > > FieldCache - and if you've
> > > got roughly the same number of terms as documents, this cache is likely
> to
> > > be as large as
> > > your entire index - a pretty hefty RAM cost).
> > >
> > > But a good thing to keep in mind is that doing this kind of faceting
> > > (massively multivalued
> > > on a huge term-set) requires a lot of computation, even if you have all
> the
> > > proper structures
> > > living in memory:
> > >
> > > For each document you look at (which matches your query), you need to
> look
> > > at all
> > > of the terms in that document, and increment a counter for that term.
> So
> > > however much
> > > time it would normally take for you to do the driving query, it can
> take as
> > > much as that
> > > multiplied by the average number of terms in a document in your index.
> If
> > > your documents
> > > are big, this could be a pretty huge latency penalty.
> > >
> > > -jake
> > >
> >
>
>


paul.elschot at xs4all

Oct 13, 2009, 12:29 AM

Post #9 of 11 (884 views)
Permalink
Re: faceted search performance [In reply to]

On Monday 12 October 2009 23:29:07 Christoph Boosz wrote:
> Hi Paul,
>
> Thanks for your suggestion. I will test it within the next few days.
> However, due to memory limitations, it will only work if the number of hits
> is small enough, am I right?

One can load a single term vector at a time, so in this case the memory
limitation is only in the possibly large map of doc counters per term.
For best performance try and load the term vectors in docId order,
after the original query has completed.

In any case it would be good to somehow limit the number of
documents considered, for example by using the ones with the best
query score.

Limiting the number of terms would also be good, but that less easy.

Regards,
Paul Elschot

>
> Chris
>
> 2009/10/12 Paul Elschot <paul.elschot [at] xs4all>
>
> > Chris,
> >
> > You could also store term vectors for all docs at indexing
> > time, and add the termvectors for the matching docs into a
> > (large) map of terms in RAM.
> >
> > Regards,
> > Paul Elschot
> >
> >
> > On Monday 12 October 2009 21:30:48 Christoph Boosz wrote:
> > > Hi Jake,
> > >
> > > Thanks for your helpful explanation.
> > > In fact, my initial solution was to traverse each document in the result
> > > once and count the contained terms. As you mentioned, this process took a
> > > lot of memory.
> > > Trying to confine the memory usage with the facet approach, I was
> > surprised
> > > by the decline in performance.
> > > Now I know it's nothing abnormal, at least.
> > >
> > > Chris
> > >
> > >
> > > 2009/10/12 Jake Mannix <jake.mannix [at] gmail>
> > >
> > > > Hey Chris,
> > > >
> > > > On Mon, Oct 12, 2009 at 10:30 AM, Christoph Boosz <
> > > > christoph.boosz [at] googlemail> wrote:
> > > >
> > > > > Thanks for your reply.
> > > > > Yes, it's likely that many terms occur in few documents.
> > > > >
> > > > > If I understand you right, I should do the following:
> > > > > -Write a HitCollector that simply increments a counter
> > > > > -Get the filter for the user query once: new CachingWrapperFilter(new
> > > > > QueryWrapperFilter(userQuery));
> > > > > -Create a TermQuery for each term
> > > > > -Perform the search and read the counter of the HitCollector
> > > > >
> > > > > I did that, but it didn't get faster. Any ideas why?
> > > > >
> > > >
> > > > This killer is the "TermQuery for each term" part - this is huge. You
> > need
> > > > to invert this process,
> > > > and use your query as is, but while walking in the HitCollector, on
> > each
> > > > doc
> > > > which matches
> > > > your query, increment counters for each of the terms in that document
> > > > (which
> > > > means you need
> > > > an in-memory forward lookup for your documents, like a multivalued
> > > > FieldCache - and if you've
> > > > got roughly the same number of terms as documents, this cache is likely
> > to
> > > > be as large as
> > > > your entire index - a pretty hefty RAM cost).
> > > >
> > > > But a good thing to keep in mind is that doing this kind of faceting
> > > > (massively multivalued
> > > > on a huge term-set) requires a lot of computation, even if you have all
> > the
> > > > proper structures
> > > > living in memory:
> > > >
> > > > For each document you look at (which matches your query), you need to
> > look
> > > > at all
> > > > of the terms in that document, and increment a counter for that term.
> > So
> > > > however much
> > > > time it would normally take for you to do the driving query, it can
> > take as
> > > > much as that
> > > > multiplied by the average number of terms in a document in your index.
> > If
> > > > your documents
> > > > are big, this could be a pretty huge latency penalty.
> > > >
> > > > -jake
> > > >
> > >
> >
> >
>


christoph.boosz at googlemail

Oct 13, 2009, 2:19 AM

Post #10 of 11 (884 views)
Permalink
Re: faceted search performance [In reply to]

Ok, I will have a shot at the ascending docId order.

Chris

2009/10/13 Paul Elschot <paul.elschot [at] xs4all>

> On Monday 12 October 2009 23:29:07 Christoph Boosz wrote:
> > Hi Paul,
> >
> > Thanks for your suggestion. I will test it within the next few days.
> > However, due to memory limitations, it will only work if the number of
> hits
> > is small enough, am I right?
>
> One can load a single term vector at a time, so in this case the memory
> limitation is only in the possibly large map of doc counters per term.
> For best performance try and load the term vectors in docId order,
> after the original query has completed.
>
> In any case it would be good to somehow limit the number of
> documents considered, for example by using the ones with the best
> query score.
>
> Limiting the number of terms would also be good, but that less easy.
>
> Regards,
> Paul Elschot
>
> >
> > Chris
> >
> > 2009/10/12 Paul Elschot <paul.elschot [at] xs4all>
> >
> > > Chris,
> > >
> > > You could also store term vectors for all docs at indexing
> > > time, and add the termvectors for the matching docs into a
> > > (large) map of terms in RAM.
> > >
> > > Regards,
> > > Paul Elschot
> > >
> > >
> > > On Monday 12 October 2009 21:30:48 Christoph Boosz wrote:
> > > > Hi Jake,
> > > >
> > > > Thanks for your helpful explanation.
> > > > In fact, my initial solution was to traverse each document in the
> result
> > > > once and count the contained terms. As you mentioned, this process
> took a
> > > > lot of memory.
> > > > Trying to confine the memory usage with the facet approach, I was
> > > surprised
> > > > by the decline in performance.
> > > > Now I know it's nothing abnormal, at least.
> > > >
> > > > Chris
> > > >
> > > >
> > > > 2009/10/12 Jake Mannix <jake.mannix [at] gmail>
> > > >
> > > > > Hey Chris,
> > > > >
> > > > > On Mon, Oct 12, 2009 at 10:30 AM, Christoph Boosz <
> > > > > christoph.boosz [at] googlemail> wrote:
> > > > >
> > > > > > Thanks for your reply.
> > > > > > Yes, it's likely that many terms occur in few documents.
> > > > > >
> > > > > > If I understand you right, I should do the following:
> > > > > > -Write a HitCollector that simply increments a counter
> > > > > > -Get the filter for the user query once: new
> CachingWrapperFilter(new
> > > > > > QueryWrapperFilter(userQuery));
> > > > > > -Create a TermQuery for each term
> > > > > > -Perform the search and read the counter of the HitCollector
> > > > > >
> > > > > > I did that, but it didn't get faster. Any ideas why?
> > > > > >
> > > > >
> > > > > This killer is the "TermQuery for each term" part - this is huge.
> You
> > > need
> > > > > to invert this process,
> > > > > and use your query as is, but while walking in the HitCollector, on
> > > each
> > > > > doc
> > > > > which matches
> > > > > your query, increment counters for each of the terms in that
> document
> > > > > (which
> > > > > means you need
> > > > > an in-memory forward lookup for your documents, like a multivalued
> > > > > FieldCache - and if you've
> > > > > got roughly the same number of terms as documents, this cache is
> likely
> > > to
> > > > > be as large as
> > > > > your entire index - a pretty hefty RAM cost).
> > > > >
> > > > > But a good thing to keep in mind is that doing this kind of
> faceting
> > > > > (massively multivalued
> > > > > on a huge term-set) requires a lot of computation, even if you have
> all
> > > the
> > > > > proper structures
> > > > > living in memory:
> > > > >
> > > > > For each document you look at (which matches your query), you need
> to
> > > look
> > > > > at all
> > > > > of the terms in that document, and increment a counter for that
> term.
> > > So
> > > > > however much
> > > > > time it would normally take for you to do the driving query, it can
> > > take as
> > > > > much as that
> > > > > multiplied by the average number of terms in a document in your
> index.
> > > If
> > > > > your documents
> > > > > are big, this could be a pretty huge latency penalty.
> > > > >
> > > > > -jake
> > > > >
> > > >
> > >
> > >
> >
>
>


te at statsbiblioteket

Oct 27, 2009, 5:29 AM

Post #11 of 11 (684 views)
Permalink
Re: faceted search performance [In reply to]

On Mon, 2009-10-12 at 20:02 +0200, Jake Mannix wrote:
> This killer is the "TermQuery for each term" part - this is huge. You need
> to invert this process, and use your query as is, but while walking in the
> HitCollector, on each doc which matches your query, increment counters for
> each of the terms in that document (which means you need an in-memory
> forward lookup for your documents, like a multivalued FieldCache - and if
> you've got roughly the same number of terms as documents, this cache is
> likely to be as large as your entire index - a pretty hefty RAM cost).

We use something a lot like this, with some tricks to reduce memory
usage, primarily by storing the terms as a file:

A docID-array (#docs * 4 bytes) has an entry for each document that
specifies where in the field/term list to start reading. The number of
field/term entries to read is inferred by looking at the offset for the
next docID.

The field/term array (#terms * 4 bytes) consists of packed pairs, where
the field is specified by 5 bits (giving a max of 32 fields) and the
term by 27 bits (giving a maximum of 134 million unique terms/field).

For each field, a file is stored with the concatenated terms. An
in-memory array (#terms * 8 bytes) of offsets in the file for the terms
provides fast lookup. It should be possible to store this array as a
file, which would significantly reduce memory usage.

Total memory usage: (#docs * 4 bytes) + (#terms * 12 bytes).

[...]

> For each document you look at (which matches your query), you need to look
> at all of the terms in that document, and increment a counter for that term.

This can be done surprisingly fast as you do not need to look at the
terms (Strings) themselves. The pseudo-code for our counting is
something like

for (int id: docIDs) {
for (int pos = docID-array[id] ; pos < docID-array[id+1] ; pos++) {
counters[field_terms >> 27][field_terms & TERM_MASK]++
)
)

In order to provide the result, we iterate the counters, extracting the
top-10 (or whatever amount of terms we want). After that we can retrieve
the actual Strings for the terms, as we can get the offset of the terms
by using the position from the counters.

Total memory-usage thus jumped to
(#docs * 4 bytes) + (#terms * 12 bytes) +
((#terms * 4 bytes) * #concurrent searches)

To recap, our faceted search is thus

- The search itself, providing a bitset with all matching docIDs
- Counting (very tight loop with no conditionals)
- Extracting top-x term-IDs for each field (heapsort or such)
- Retrieving only the relevant terms from file
- Clearing the counters for next usage

The last step is obviously somewhat critical, as requesting a facet
model with 10 fields of 20 terms each means a worst-case of 200 seeks.
In real world usage, caching helps a lot and SSDs help even more.

We did some tests on this and got the following, where #references is
the number of facet-search-relevant terms in the documents (in the first
example, this means that each document had an average of 10
facet-relevant terms).

* 10M docs, 10M unique facet-relevant terms, 100M refs, 1 machine
A few 1000 hits < 100 ms
A few 100.000 hits < 200 ms
10M hits ~3 sec

* 100M docs, 1G unique facet-relevant terms, 1G refs, 3 machines
A few 1000 hits ~1 sec (ouch)
A few 100.000 hits ~1 sec
10M hits < 3 sec
100M hits ~15 sec

Lots of room for improvement for small numbers of hits and as always,
YMMW, but faceted search with a large number of terms in a large number
of documents is feasible.

Regards,
Toke Eskildsen


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene

Lucene java-user RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.