
Mailing List Archive: Lucene: Java-User

frequent keyword computation within a search ( and timeinterval )

 

 



prasen.bea at gmail

Jan 3, 2012, 9:17 PM

Post #1 of 8 (284 views)

I have a requirement where reads and writes are quite high (100-500
per second). A document has the following fields: timestamp,
unique-docid, content-text, keyword. Average content-text length is about
20 bytes, and there is only 1 keyword for a given docid.

At runtime, given a query-term (which could be null) and a
time-interval, I need to find the top-k most frequent keywords among
the documents in that time-interval whose content-text field contains
the query-term (if the query-term is null, every document in the
interval counts). I can purge the data every day, so there is no need
for me to keep more than a day's data.
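
To make the requirement concrete, here is a minimal brute-force sketch in plain Java. It is not a Lucene or database solution, only a statement of the expected result. The class `TopKeywords`, the `Doc` type, and the epoch-millisecond timestamps are assumptions made for illustration; substring matching stands in for whatever term matching the real store would use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TopKeywords {
    /** One document with the four fields named in the post. */
    static final class Doc {
        final long timestamp;      // assumed to be epoch millis; any ordered unit works
        final String docId;
        final String contentText;
        final String keyword;

        Doc(long timestamp, String docId, String contentText, String keyword) {
            this.timestamp = timestamp;
            this.docId = docId;
            this.contentText = contentText;
            this.keyword = keyword;
        }
    }

    /**
     * Counts keywords of documents inside [from, to] whose content-text
     * contains queryTerm (a null queryTerm matches every document), then
     * returns the k most frequent keywords with their counts.
     */
    static List<Map.Entry<String, Integer>> topK(
            List<Doc> docs, String queryTerm, long from, long to, int k) {
        Map<String, Integer> counts = new HashMap<>();
        for (Doc d : docs) {
            boolean inWindow = d.timestamp >= from && d.timestamp <= to;
            boolean matches = queryTerm == null || d.contentText.contains(queryTerm);
            if (inWindow && matches) {
                counts.merge(d.keyword, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Highest count first; ties broken alphabetically so the order is stable.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(entries.subList(0, Math.min(k, entries.size())));
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
                new Doc(10, "d1", "rain in pune", "weather"),
                new Doc(20, "d2", "rain delay", "cricket"),
                new Doc(30, "d3", "heavy rain", "weather"),
                new Doc(40, "d4", "sunny day", "weather"),
                new Doc(99, "d5", "rain again", "traffic"));
        System.out.println(topK(docs, "rain", 0, 50, 2));
        System.out.println(topK(docs, null, 0, 50, 1));
    }
}
```

A full scan like this is linear in the number of documents per query, which is why an index (SQL, Lucene, or otherwise) is being considered at the stated read rate.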

I have quite a few options here: MySQL, NoSQL stores (Cassandra,
Mongo, Couch, Riak, Redis), and search-engine based (Lucene/Solr),
each with its own pros and cons.

In MySQL we can achieve this with a GROUP BY/COUNT clause.
In NoSQL I can probably write a map/reduce task to compute these
numbers, although I am not sure about the query response time.
I am not sure whether we can achieve it with Lucene/Solr out of the box.

Any suggestions on what would be a good choice for this use case ?

-Thanks,
prasenjit

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


erickerickson at gmail

Jan 5, 2012, 5:40 AM

Post #2 of 8 (271 views)

The time interval is just a RangeQuery in the Lucene
world. The rest is pretty standard search stuff.

You probably want to have a look at the NRT
(near real time) stuff in trunk.

Your reads/writes are pretty high, so you'll need
some experimentation to size your site
correctly.

Best
Erick



prasen.bea at gmail

Jan 5, 2012, 9:53 AM

Post #3 of 8 (269 views)

Thanks Erick for the response.

Will Lucene/Solr provide aggregations (of field values) over documents
satisfying a query criterion? e.g. SELECT SUM(price) WHERE item=fruits

Or do I need to use a HitCollector to achieve that?

Any sample Solr/Lucene query to compute aggregates (like SUM) would be great.

-Thanks,
Prasenjit



erickerickson at gmail

Jan 5, 2012, 1:37 PM

Post #4 of 8 (271 views)

You will encounter endless grief until you stop
thinking of Solr/Lucene as a replacement for
an RDBMS. It is a *text search engine*.
Whenever you start asking "how do I implement
a SQL statement in Solr", you have to stop
and reconsider *why* you are trying to do that.
Then recast the question in terms of searching.

Short answer is that no, there isn't an aggregate
function. And you shouldn't even try.

Best
Erick



jason.rutherglen at gmail

Jan 5, 2012, 4:23 PM

Post #5 of 8 (266 views)

> Short answer is that no, there isn't an aggregate
> function. And you shouldn't even try

If that is the case why does a 'stats' component exist for Solr with
the SUM function built in?

http://wiki.apache.org/solr/StatsComponent
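
For the SUM(price) WHERE item=fruits example asked about earlier in the thread, a request to the stats component might look roughly like the URL built below. This is a sketch, not a tested Solr setup: `stats=true` and `stats.field` are the StatsComponent parameters described on the wiki page above, while the base URL, the class name `SolrStatsUrl`, and the `item` and `price` field names are assumptions taken from the example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class SolrStatsUrl {
    /** URL-encodes one query-string component as UTF-8. */
    static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /** Appends the parameters to the base URL in insertion order. */
    static String build(String base, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(base);
        char sep = '?';
        for (Map.Entry<String, String> e : params.entrySet()) {
            sb.append(sep).append(enc(e.getKey())).append('=').append(enc(e.getValue()));
            sep = '&';
        }
        return sb.toString();
    }

    /** Query for documents with item:fruits and ask for stats on price. */
    static String sumUrl(String base) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "item:fruits");   // the WHERE part, as a search
        params.put("rows", "0");          // only the stats are wanted, not the hits
        params.put("stats", "true");      // turn the stats component on
        params.put("stats.field", "price");
        return build(base, params);
    }

    public static void main(String[] args) {
        // Hypothetical local Solr instance; adjust host, port, and path.
        System.out.println(sumUrl("http://localhost:8983/solr/select"));
    }
}
```

The sum is then read from the stats section of the response for the `price` field, alongside the other statistics the component reports.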



erickerickson at gmail

Jan 5, 2012, 4:54 PM

Post #6 of 8 (264 views)

Hmmm, guess you're right, the stats component
does return that data. It's been a long day...

I still question whether this is a *good*
use of Solr, though. I'd re-examine my approach
whenever I found myself trying to translate
SQL queries into Solr....

But if, after that examination, I still required
SUM, stats would do it.

Erick



jason.rutherglen at gmail

Jan 5, 2012, 5:12 PM

Post #7 of 8 (273 views)

> Although I still question whether this is a *good* use of Solr

It's a great use of Lucene, which can be made into a superior
horizontally scalable database when compared with open source
relational database systems.

My only concern, going back to *other* conversation(s), is whether
the field cache used by the stats component operates per-segment.
If *true*, then the stats part of Solr can be checked off as
NRT / soft commit capable / efficient.

I think the answer is *FALSE*, based on these lines in StatsComponent,
which seem to operate on the top-level reader (i.e., NOT
per-segment).

si = FieldCache.DEFAULT.getTermsIndex(searcher.getIndexReader(), fieldName);

UnInvertedField uif = UnInvertedField.getUnInvertedField(f, searcher);
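
Why the per-segment question matters for NRT can be shown with a toy model in plain Java. Nothing below is Lucene or Solr code; `Segment`, `PerSegmentCache`, and `TopLevelCache` are hypothetical stand-ins. The model only counts how many field values each caching strategy has to load when a reader is reopened with one small new segment.

```java
import java.util.Arrays;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

public class SegmentCacheModel {
    /** Immutable stand-in for one index segment holding a numeric field. */
    static final class Segment {
        final long[] stored;
        Segment(long... stored) { this.stored = stored; }
    }

    /** Caches field values per segment, so unchanged segments are never reloaded. */
    static final class PerSegmentCache {
        final Map<Segment, long[]> cache = new IdentityHashMap<>();
        int valuesLoaded = 0;

        long sum(List<Segment> reader) {
            long total = 0;
            for (Segment s : reader) {
                long[] values = cache.get(s);
                if (values == null) {            // only segments not seen before are loaded
                    values = s.stored.clone();
                    valuesLoaded += values.length;
                    cache.put(s, values);
                }
                for (long v : values) total += v;
            }
            return total;
        }
    }

    /** Caches one flat array per top-level reader, so each reopen reloads everything. */
    static final class TopLevelCache {
        final Map<List<Segment>, long[]> cache = new IdentityHashMap<>();
        int valuesLoaded = 0;

        long sum(List<Segment> reader) {
            long[] values = cache.get(reader);
            if (values == null) {                // a reopened reader is a new cache key
                values = reader.stream()
                        .flatMapToLong(s -> Arrays.stream(s.stored))
                        .toArray();
                valuesLoaded += values.length;
                cache.put(reader, values);
            }
            long total = 0;
            for (long v : values) total += v;
            return total;
        }
    }

    public static void main(String[] args) {
        Segment s1 = new Segment(10, 20, 30);
        Segment s2 = new Segment(5, 5);
        Segment s3 = new Segment(7);                          // small segment added by a soft commit
        List<Segment> before = Arrays.asList(s1, s2);
        List<Segment> after = Arrays.asList(s1, s2, s3);      // reopened reader

        PerSegmentCache perSegment = new PerSegmentCache();
        TopLevelCache topLevel = new TopLevelCache();
        perSegment.sum(before);
        perSegment.sum(after);
        topLevel.sum(before);
        topLevel.sum(after);
        System.out.println("per-segment values loaded: " + perSegment.valuesLoaded);
        System.out.println("top-level values loaded:   " + topLevel.valuesLoaded);
    }
}
```

In this model the per-segment cache loads only the new segment after the reopen, while the top-level cache reloads every value, which is the cost being worried about when readers are reopened frequently.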



prasen.bea at gmail

Jan 5, 2012, 5:41 PM

Post #8 of 8 (274 views)

It seems that the field (on which stats need to be computed) must
always remain in memory. This could be a killer. Why isn't it possible
to put that stat-field information into the posting stream (using
payloads), which would allow fast computation of stats without
requiring the content to be kept in memory?


