
Mailing List Archive: Lucene: Java-User

Facets

 

 



henrik.hjalmarsson at raa

Nov 3, 2009, 1:23 AM

Post #1 of 8 (254 views)
Permalink
Facets

Hello

I am trying to develop an API for a search application that uses Lucene 2.4.1.
The search application is maintained by RAA (a Swedish government organization that keeps track of historical and cultural data).

I have been asked for an API method that returns an XML response listing all the indexes in this application and the number of unique values each index can have, filtered by a query received in the method request.

The application contains a large number of indexes, and some indexes contain a very large number of unique values. Is there some way to achieve this efficiently?

With regards Henrik


ian.lea at gmail

Nov 3, 2009, 1:43 AM

Post #2 of 8 (245 views)
Permalink
Re: Facets [In reply to]

Well, lucene is blazingly quick and sometimes things take less time
than one might expect, but your combination of large and very large is
not encouraging. It doesn't sound like the new API method would
necessarily need an exact reply - could you run something in the
background out of peak hours that built the XML response or at least
saved numbers for it to be built quickly when requested? Depending on
the volatility of your indexes, the background job could be somewhat
intelligent and only update the figures for indexes that have had
significant activity. Defining significant activity is left as an
exercise for the reader ...
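
A minimal sketch of that background-job idea in Java, under stated assumptions: the names CachedIndexStats, Counter and changeThreshold are hypothetical, "significant activity" is simply taken to mean "at least changeThreshold recorded changes since the last refresh", and the expensive counting itself is left behind the Counter hook. Saved numbers like these only help for unfiltered totals or for a fixed set of known queries, not for arbitrary incoming queries.

```java
import java.util.HashMap;
import java.util.Map;

/** Caches per-index unique-value counts and recomputes them only after enough changes. */
final class CachedIndexStats {
    /** Hypothetical hook that performs the expensive count for one index. */
    interface Counter {
        long countUniqueValues(String indexName);
    }

    private final Counter counter;
    private final int changeThreshold; // assumed definition of "significant activity"
    private final Map<String, Long> cachedCounts = new HashMap<String, Long>();
    private final Map<String, Integer> changesSinceRefresh = new HashMap<String, Integer>();

    CachedIndexStats(Counter counter, int changeThreshold) {
        this.counter = counter;
        this.changeThreshold = changeThreshold;
    }

    /** Called by the indexing code whenever a document is added, updated or deleted. */
    synchronized void recordChange(String indexName) {
        Integer n = changesSinceRefresh.get(indexName);
        changesSinceRefresh.put(indexName, n == null ? 1 : n + 1);
    }

    /** Called by the off-peak background job; returns how many indexes were recomputed. */
    synchronized int refresh(Iterable<String> indexNames) {
        int recomputed = 0;
        for (String name : indexNames) {
            Integer changes = changesSinceRefresh.get(name);
            boolean neverCounted = !cachedCounts.containsKey(name);
            if (neverCounted || (changes != null && changes >= changeThreshold)) {
                cachedCounts.put(name, counter.countUniqueValues(name));
                changesSinceRefresh.put(name, 0);
                recomputed++;
            }
        }
        return recomputed;
    }

    /** Called by the API method; it only reads the saved number. Returns -1 if unknown. */
    synchronized long cachedCount(String indexName) {
        Long c = cachedCounts.get(indexName);
        return c == null ? -1L : c;
    }
}
```

The API method would then build its XML from cachedCount() alone, so the request never triggers the expensive counting.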

Good luck.


--
Ian.



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe[at]lucene.apache.org
For additional commands, e-mail: java-user-help[at]lucene.apache.org


te at statsbiblioteket

Nov 3, 2009, 5:42 AM

Post #3 of 8 (238 views)
Permalink
Re: Facets [In reply to]

On Tue, 2009-11-03 at 10:23 +0100, Henrik Hjalmarsson wrote:
> I have gotten a demand for an API method that returns an XML response,
> listing all the indexes in this application and the number of unique
> values these indexes can have, filtered by a query that is received in
> the method request.

We've had the same request a number of times, but when we discuss the
scenarios in detail, they can always be scaled down to "The first X
values", instead of "all values", where X is < 1000.


While you can build efficient handling of the faceting on fields with
many terms, simply returning the Strings for the terms (ignoring all the
grunt work of extraction) poses problems.

5 million values of 20 characters each take up about
5M * (20 * 2 + ~40) bytes ~ 400 MB
of RAM. If you wrap that in nice XML and send it using SOAP, memory
usage goes through the roof. Streaming, as Ian suggests, seems to be the
answer here.
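
The estimate above can be reproduced with a few lines of Java. This is only a back-of-the-envelope sketch: the class and method names are made up for illustration, and the 40 bytes of per-String overhead is the rough figure from the calculation above, not a measured value (it also assumes 2 bytes per character, as on JVMs without compact strings).

```java
/** Back-of-the-envelope heap estimate for holding facet terms as Java Strings. */
final class TermMemoryEstimate {
    /**
     * @param termCount       number of unique terms
     * @param charsPerTerm    average term length in characters
     * @param overheadPerTerm assumed per-String overhead in bytes (headers, references)
     */
    static long estimateBytes(long termCount, int charsPerTerm, int overheadPerTerm) {
        long bytesPerTerm = charsPerTerm * 2L + overheadPerTerm; // 2 bytes per char assumed
        return termCount * bytesPerTerm;
    }

    public static void main(String[] args) {
        long bytes = estimateBytes(5_000_000L, 20, 40);
        // Prints: 400000000 bytes, about 400 MB
        System.out.println(bytes + " bytes, about " + bytes / 1_000_000 + " MB");
    }
}
```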

> The application contains a large amount of indexes and some indexes
> contains a very large amount of unique values. Is there some way to
> achieve this in an effective way?

It is definitely possible in the case where you limit the number of
returned values. Well, at least we've tested it for 1000M unique values
in 100M documents. But before we go there, it would help to know what
you mean by "large".

Regards,
Toke Eskildsen




jake.mannix at gmail

Nov 3, 2009, 11:07 AM

Post #4 of 8 (231 views)
Permalink
Re: Facets [In reply to]

If you need faceting on top of Lucene and you're not using Solr, Bobo-browse
( http://bobo-browse.googlecode.com ) is a high-performance open source
faceting library which may suit your needs. You're asking for "all facet
values", which in bobo isn't terribly hard to get: because of the way bobo
keeps the facet counts, it already has all of the counts in memory for all
the unique values once you've run the query with faceting turned on, and
it's just a question of returning them.

How big is your index, and how many unique values for this field?

-jake



chris.lu at gmail

Nov 3, 2009, 5:49 PM

Post #5 of 8 (223 views)
Permalink
Re: Facets [In reply to]

If the query is a very selective one, you can go through the XML
document and do the counting.

If the query is not so selective, which is usually the case, and the
number of matches is large, basically all the values need to be loaded
into memory, or onto a solid state disk, to count quickly.
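
As a rough illustration of what "loaded into memory" can look like, here is a Java sketch that counts values for the matching documents from a preloaded array. It is not tied to any Lucene API: the class name and the ordinalPerDoc layout are assumptions made for the example, and it assumes each document has at most one value in the field.

```java
import java.util.BitSet;

/** Counts facet values for the documents matching a query, using values preloaded in memory. */
final class InMemoryFacetCounter {
    /**
     * @param ordinalPerDoc ordinalPerDoc[docId] is the position of that document's value in the
     *                      list of unique values, or -1 if the document has no value
     * @param matchingDocs  bit set of document ids that matched the query
     * @param uniqueValues  number of unique values in the field
     * @return counts[ordinal] = number of matching documents with that value
     */
    static int[] count(int[] ordinalPerDoc, BitSet matchingDocs, int uniqueValues) {
        int[] counts = new int[uniqueValues];
        // Visit only the documents that matched the query.
        for (int doc = matchingDocs.nextSetBit(0); doc >= 0; doc = matchingDocs.nextSetBit(doc + 1)) {
            int ord = ordinalPerDoc[doc];
            if (ord >= 0) {
                counts[ord]++;
            }
        }
        return counts;
    }
}
```

The cost is one array lookup per matching document, regardless of how many unique values the field has, which is why keeping the per-document values in memory matters for non-selective queries.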

--
Chris Lu




henrik.hjalmarsson at raa

Nov 5, 2009, 7:37 AM

Post #6 of 8 (206 views)
Permalink
Sv: Re: Facets [In reply to]

So basically there is no efficient way of doing this?
The only solution I've come up with is

Pseudo code:

for( every index )
{
    create term(index, "*")
    create wildcard query with term
    rewrite query to primitive query
    extract terms from primitive query
    for( each term extracted )
    {
        create query with term and query string from request
    }
}

for( each query created above )
{
    search with query asking for 1 result.
    check total number of hits for query
}

print result

Something like that. Is that a totally idiotic way of doing it?




henrik.hjalmarsson at raa

Nov 5, 2009, 8:03 AM

Post #7 of 8 (206 views)
Permalink
Sv: Re: Facets [In reply to]

>>> Toke Eskildsen 09-11-03 14:43 >>>
> It is definitely possible in the case where you limit the number of
> returned values. Well, at least we've tested it for 1000M unique values
> in 100M documents. But before we go there, it would help to know what
> you mean by "large".

Ok. To be honest, I don't know what is considered "large" for Lucene. But it's significantly larger than the examples I've managed to find so far. There are roughly 100 different indexes at the moment, and every index (from what I know) can have anywhere from 2 unique values up to 50 000 or 200 000 unique values.




jake.mannix at gmail

Nov 5, 2009, 9:49 AM

Post #8 of 8 (204 views)
Permalink
Re: Re: Facets [In reply to]

Can you describe your Lucene index a little further? How many documents are
in it, and what do you mean by "listing all the indexes in this application
and the number of unique values these indexes can have, filtered by a query
that is received in the method request"?

What does "index" mean in this context? It's a pretty overloaded word. What
you describe in this pseudocode does not look like it will perform *at all*,
so to help you get this to go faster, I need to understand what exactly it is
you're trying to accomplish. You have an incoming query "q" which you want to
get facet results filtered by (it's sometimes called the "driving query" in
the faceting problem), and documents have a field, called "index", which is
what: single valued, an integer? And you effectively want the count of
(q AND index:x) for all possible unique values of x (this is the basic
faceting problem), or what?

If this is what you want to do, both Solr and bobo-browse, which are both
open source and built on top of Lucene, can do this query fairly performantly
for you.
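
To make the "count of (q AND index:x)" formulation concrete, here is a small Java sketch that runs the driving query once and then intersects its hits with one bit set per value. It is not how Solr or bobo-browse are implemented; the class name and the docsPerValue map are assumptions for illustration, and building those per-value bit sets from the index is left out.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

/** Computes count(q AND field:x) for every value x by intersecting bit sets. */
final class BitSetFacets {
    /**
     * @param queryHits    document ids matching the driving query q
     * @param docsPerValue for each unique value x, the document ids containing x
     * @return value -> number of documents matching both q and the value (zero counts omitted)
     */
    static Map<String, Integer> facetCounts(BitSet queryHits, Map<String, BitSet> docsPerValue) {
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (Map.Entry<String, BitSet> e : docsPerValue.entrySet()) {
            BitSet both = (BitSet) queryHits.clone(); // copy so the query hits are not modified
            both.and(e.getValue());                   // documents matching q AND field:value
            int n = both.cardinality();
            if (n > 0) {
                counts.put(e.getKey(), n);
            }
        }
        return counts;
    }
}
```

Compared with issuing one search per term, the driving query is evaluated only once here; the remaining work is one bit set intersection per unique value.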

-jake

