
Mailing List Archive: Lucene: Java-User

Indexing 100Gb of readonly numeric data



psilvaferreira at gmail

Feb 15, 2012, 10:04 AM

Post #1 of 4 (254 views)
Indexing 100Gb of readonly numeric data

Hi guys,

I hope I'm sending this to the right place.

I have an idea in mind (still fuzzy, but clear enough to describe), and
I was wondering whether Lucene or Solr could help with it. I've
implemented a Lucene index on custom enterprise data before and have
it running on Azure as well, so I know the basics.

These are the premises:

- about 100 GB of data
- the data is expected to be in one gigantic table. Conceptually, it is
like a spreadsheet: rows are objects and columns are properties.
- values are mostly floating point numbers, and I expect them to be,
let's say, unique, or almost randomly distributed (1.89868776E+50,
1.434E-12)
- the data is read-only. It will never change.

Now I need to query this data, mostly with range queries on the
columns. Something like:

"SELECT * FROM Table WHERE (Col1 > 1.2E2 AND Col1 < 1.8E2) OR (Col3 = 0)"

which is basically "give me all the rows that satisfy these criteria".
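
For illustration, a predicate like the one above could be evaluated with one sorted index per column, roughly as follows. This is only a minimal sketch in plain Python, not Lucene code; the helper names (`build_column_index`, `open_range`, `equals`) and the sample rows are made up for the example.

```python
from bisect import bisect_left, bisect_right


def build_column_index(rows, col):
    """Return (sorted values, row ids in the same order) for one column."""
    pairs = sorted((row[col], rid) for rid, row in enumerate(rows))
    return [v for v, _ in pairs], [rid for _, rid in pairs]


def open_range(index, low, high):
    """Row ids whose value satisfies low < value < high."""
    values, row_ids = index
    start = bisect_right(values, low)   # first position with value > low
    stop = bisect_left(values, high)    # first position with value >= high
    return set(row_ids[start:stop])


def equals(index, target):
    """Row ids whose value is exactly `target`."""
    values, row_ids = index
    start = bisect_left(values, target)
    stop = bisect_right(values, target)
    return set(row_ids[start:stop])


# Hypothetical sample table: each dict is one row.
rows = [
    {"Col1": 1.5e2, "Col3": 7.0},
    {"Col1": 9.9e5, "Col3": 0.0},
    {"Col1": 1.2e2, "Col3": 3.1},
    {"Col1": 1.79e2, "Col3": 0.0},
]
col1 = build_column_index(rows, "Col1")
col3 = build_column_index(rows, "Col3")

# (Col1 > 1.2E2 AND Col1 < 1.8E2) OR (Col3 = 0)
matches = sorted(open_range(col1, 1.2e2, 1.8e2) | equals(col3, 0.0))
print(matches)
```

Each range test costs two binary searches on the sorted column, and the OR becomes a set union of row ids. Row 2 is excluded because its Col1 value sits exactly on the exclusive lower bound.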

I believe this could be easily done with a standard RDBMS, but I would
like to avoid that route.

So, is this something doable with Lucene or Solr? And if so, how much
can be done with a stock, out-of-the-box Lucene implementation?

While thinking about this, and assuming this could work well with
Lucene, I had 2 major questions:

- Won't I get an index that is pretty much the same size as the
data source? I would have to index all columns from all rows, and as
there is not much "repetition" in the data source, wouldn't the index
almost mirror the data source?

- If the data source is read-only, should I create the index once,
offline, and then replicate it to the search servers?

Or am I just being crazy and making a monster of a small problem? :)

Thanks
--
Pedro Ferreira

mobile: 00 44 7712 557303
skype: pedrosilvaferreira
email: psilvaferreira [at] gmail
linkedin: http://uk.linkedin.com/in/pedrosilvaferreira

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


erickerickson at gmail

Feb 15, 2012, 1:48 PM

Post #2 of 4 (247 views)
Re: Indexing 100Gb of readonly numeric data

Actually, you might well have your index be larger than your source, assuming
you're going to be both storing and indexing everything.
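
To put rough numbers on this: Lucene's numeric fields index each value at several precisions, and the NumericRangeQuery documentation gives the number of indexed terms per value as ceil(bitsPerValue / precisionStep), with a default precision step of 4. The Python sketch below only does that arithmetic. It assumes the source is 100 * 10^9 bytes of raw 8-byte doubles, which the thread does not state, and the function names are made up for the example. It counts terms, not bytes on disk.

```python
import math


def indexed_terms_per_value(bits_per_value=64, precision_step=4):
    """Terms indexed for one numeric value: ceil(bitsPerValue / precisionStep)."""
    return math.ceil(bits_per_value / precision_step)


def estimate(source_bytes, bytes_per_value=8, precision_step=4, stored=True):
    """Return (number of values, indexed term occurrences, stored bytes)."""
    values = source_bytes // bytes_per_value
    terms = values * indexed_terms_per_value(8 * bytes_per_value, precision_step)
    stored_bytes = source_bytes if stored else 0  # stored fields keep a copy of the data
    return values, terms, stored_bytes


SOURCE_BYTES = 100 * 10**9  # assumption: "100Gb" read as 100 * 10^9 bytes

for step in (4, 8):
    values, terms, stored_bytes = estimate(SOURCE_BYTES, precision_step=step)
    print(f"precisionStep={step}: {values:,} values, "
          f"{terms:,} indexed terms, {stored_bytes:,} stored bytes")
```

Under these assumptions there are 12.5 billion values. With the default precision step each one contributes 16 indexed terms, in addition to the stored copy of the data. A larger precision step reduces the term count at the cost of slower range queries.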

There's also the "deep paging" issue, see:
https://issues.apache.org/jira/browse/SOLR-1726
which comes into play if you expect to return a lot of rows.
Solr really doesn't have the "cursor" concept as RDBMSs do.
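
The cost difference can be sketched in a few lines of Python. This is not how Solr is implemented, only an illustration under simple assumptions: hits are `(sort_value, doc_id)` tuples with unique sort order, and the function names `page_by_offset` and `page_after` are hypothetical.

```python
import heapq


def page_by_offset(hits, start, rows):
    """Offset paging: rank the best (start + rows) hits, then drop the first `start`."""
    top = heapq.nsmallest(start + rows, hits)  # candidates kept grow with page depth
    return top[start:]


def page_after(hits, last_key, rows):
    """Cursor-style paging: only hits sorting after the last one already returned."""
    remaining = (h for h in hits if last_key is None or h > last_key)
    return heapq.nsmallest(rows, remaining)    # never ranks more than `rows` hits


# Hypothetical result set: 101 hits whose sort values are a permutation of 0..100.
hits = [((i * 37) % 101, i) for i in range(101)]

# Hits 90..94 by offset: the top 95 hits must be ranked first.
deep_page = page_by_offset(hits, start=90, rows=5)

# The same hits via a cursor: walk forward page by page, remembering the last key.
key = None
for _ in range(19):
    page = page_after(hits, key, rows=5)
    key = page[-1]

print(deep_page == page)
```

With offset paging the number of ranked candidates grows with the page depth, which is what hurts when a query returns a lot of rows. The cursor variant keeps the ranking work per page constant, but it only moves forward and needs a stable, unique sort order.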

My gut feeling is that Solr is primarily a *text* search engine and
this feels like something more suited to an RDBMS. That said,
I'm quite sure you can make Solr/Lucene do the tricks you want
if you're really RDBMS-averse <G>...

And at that size, you may well have to deal with sharding the
index (you'd have to test).

I guess my "bottom line" is that you could get Solr up and running,
index the data, and just see how it behaves; even with data that size
that should only take a few days.

Best
Erick



psilvaferreira at gmail

Feb 15, 2012, 2:18 PM

Post #3 of 4 (243 views)
Re: Indexing 100Gb of readonly numeric data

Thanks Erick,

Yes, the limitations you pointed out confirm my first impression. Even
if it is doable with Solr or Lucene, I would have to dig deep into the
internals to get the most out of it.

About my RDBMS issues... there are two reasons:

First, I'm interested in this whole cloud craziness. I love working
with Azure and would like to try a different approach. In this case, I
was thinking of storing the data in Data Tables and having several
indexers.

Second, while 100 GB is fine for a SQL server, if it grows to 200 or
300 GB it becomes too expensive for a small open source project. On the
other hand, Data Tables in Azure are much more affordable. Still
expensive, but on another scale.



ralf.heyde at gmx

Feb 20, 2012, 9:22 AM

Post #4 of 4 (233 views)
RE: Indexing 100Gb of readonly numeric data

Hi Pedro,

Maybe have a look at Hadoop / JAQL / HBase?
For this "simple" setup it could be a scalable and simple solution (with
additional aggregation functions).

Best
Ralf

