Mailing List Archive: Lucene: Java-User


gsingers at apache

Oct 22, 2011, 2:11 AM

Post #1 of 18 (1779 views)
Bet you didn't know Lucene can...

Hi All,

I'm giving a talk at ApacheCon titled "Bet you didn't know Lucene can..." (http://na11.apachecon.com/talks/18396). It's based on my observation that, over the years, a number of us in the community have done some pretty cool things with Lucene that don't fit under the core premise of full-text search. I've got a fair number of ideas for the talk (easily enough for 1 hour), but I wanted to reach out and hear your stories of ways you've (ab)used Lucene and Solr, both to extend the conversation beyond the conference and to see if I can add ideas beyond the ones I have. I don't need deep technical details, just the high-level use case and the basic insight that led you to believe Lucene could solve the problem.

Thanks in advance,
Grant

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com


paul at hoplahup

Oct 22, 2011, 5:58 AM

Post #2 of 18 (1730 views)
Re: Bet you didn't know Lucene can...

Grant,

for years the ActiveMath learning environment has been using Lucene as its storage engine.
At the time (~2004), it was by far the best storage engine available in a pure-Java world.
It still performs perfectly well today.
We had an issue with the early versions (~1.x-2.0), where stored fields were not lazily loaded, so we still do not store the big fragments there. For small fragments, however, it is very efficient (~5000 queries a second).

The objects stored are fragments of XML documents (the format is called OMDoc; they're mostly hand-written).

Tell me if you need more details. I am sure the pure storage option is something very common.

paul


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


sujit.pal at comcast

Oct 22, 2011, 9:03 AM

Post #3 of 18 (1726 views)
Re: Bet you didn't know Lucene can...

Hi Grant,

Not sure if this qualifies as a "bet you didn't know", but one could use
Lucene term vectors to construct document vectors for similarity,
clustering and classification tasks. I found this out recently (although
I am probably not the first one), and I think this could be quite
useful.

-sujit
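
A minimal sketch of the idea above, assuming plain Java and hand-built term-frequency maps rather than Lucene's own term vector API: each document becomes a sparse vector of term counts, and cosine similarity compares two such vectors. The class name `TermVectorSimilarity` and its tokenization rule are illustrative assumptions, not anything from the thread; with Lucene, the counts would come from the index's stored term vectors instead.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative only: hand-rolled term-frequency vectors, not Lucene's term vector API. */
final class TermVectorSimilarity {

    /** Builds a term -> frequency map by lowercasing and splitting on non-letter, non-digit runs. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    /** Cosine similarity between two sparse term-frequency vectors; 0.0 if either is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other.doubleValue();
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    /** Euclidean length of a term-frequency vector. */
    private static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int f : v.values()) {
            sum += (double) f * f;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        Map<String, Integer> d1 = termFrequencies("Lucene stores term vectors per document");
        Map<String, Integer> d2 = termFrequencies("Term vectors describe each document");
        Map<String, Integer> d3 = termFrequencies("Completely unrelated words here");
        System.out.println("d1 vs d2: " + cosine(d1, d2));
        System.out.println("d1 vs d3: " + cosine(d1, d3));
    }
}
```

The same vectors could feed a clustering or classification step; raw counts are used here for brevity, and a real setup would more likely weight terms (for example by inverse document frequency).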



wheijke at xs4all

Oct 22, 2011, 10:27 AM

Post #4 of 18 (1728 views)
Re: Bet you didn't know Lucene can...

Hi Grant,

Here are two cases from work I've done that I can think of:

- Use Lucene to match products in a database with eBay auctions; the title
of the auction is used as the query to Lucene.

- Use a servlet filter and Lucene to map well-formed URLs into a website
to its individual (product) pages. A deeper URL results in a Lucene
BooleanQuery with more clauses.

Hope this is enough (ab)use...

Wouter
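
A rough sketch of the URL-mapping idea, assuming plain Java collections in place of a Lucene index: each path segment becomes one required field=value clause, so a deeper URL yields more clauses and a narrower result. The field names (`category`, `brand`, `product`), the class `UrlToClauses`, and the choice to make every clause required are all assumptions for illustration; the thread does not say how the real clauses were built.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: mimics "one required clause per URL segment" without using Lucene. */
final class UrlToClauses {

    // Hypothetical mapping from path depth to index field; not taken from the thread.
    private static final String[] FIELDS_BY_DEPTH = {"category", "brand", "product"};

    /** Turns "/cameras/canon/eos-600d" into field=value clauses, one per path segment. */
    static Map<String, String> clausesFor(String path) {
        Map<String, String> clauses = new LinkedHashMap<>();
        int depth = 0;
        for (String segment : path.split("/")) {
            if (segment.isEmpty()) {
                continue; // skip the empty piece before the leading slash
            }
            if (depth >= FIELDS_BY_DEPTH.length) {
                break; // ignore segments deeper than the known fields
            }
            clauses.put(FIELDS_BY_DEPTH[depth++], segment);
        }
        return clauses;
    }

    /** Keeps only the documents that satisfy every clause. */
    static List<Map<String, String>> match(List<Map<String, String>> docs, Map<String, String> clauses) {
        List<Map<String, String>> hits = new ArrayList<>();
        for (Map<String, String> doc : docs) {
            boolean all = true;
            for (Map.Entry<String, String> clause : clauses.entrySet()) {
                if (!clause.getValue().equals(doc.get(clause.getKey()))) {
                    all = false;
                    break;
                }
            }
            if (all) {
                hits.add(doc);
            }
        }
        return hits;
    }

    /** Builds a toy "document" with the three hypothetical fields. */
    static Map<String, String> doc(String category, String brand, String product) {
        Map<String, String> d = new LinkedHashMap<>();
        d.put("category", category);
        d.put("brand", brand);
        d.put("product", product);
        return d;
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = new ArrayList<>();
        docs.add(doc("cameras", "canon", "eos-600d"));
        docs.add(doc("cameras", "nikon", "d5100"));
        docs.add(doc("phones", "nokia", "n9"));
        for (String path : new String[] {"/cameras", "/cameras/canon", "/cameras/canon/eos-600d"}) {
            Map<String, String> clauses = clausesFor(path);
            System.out.println(path + " -> " + clauses + " -> " + match(docs, clauses).size() + " hit(s)");
        }
    }
}
```

In a Lucene-backed version, each entry of the clause map would presumably become one term clause in the BooleanQuery mentioned above.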




gsingers at apache

Oct 22, 2011, 3:33 PM

Post #5 of 18 (1726 views)
Re: Bet you didn't know Lucene can...

On Oct 22, 2011, at 6:03 PM, Sujit Pal wrote:

> Hi Grant,
>
> Not sure if this qualifies as a "bet you didn't know", but one could use
> Lucene term vectors to construct document vectors for similarity,
> clustering and classification tasks. I found this out recently (although
> I am probably not the first one), and I think this could be quite
> useful.

Yep, had these on my list!



skant at sloan

Oct 22, 2011, 5:33 PM

Post #6 of 18 (1727 views)
Re: Bet you didn't know Lucene can...

Using Lucene as a recommendation engine.



dawid.weiss at gmail

Oct 23, 2011, 12:29 AM

Post #7 of 18 (1718 views)
Re: Bet you didn't know Lucene can...

Hi Grant,

In Carrot2 (and Carrot Search's commercial products) we're not using
Lucene as an indexing/search service directly, but we are re-using a
lot of internal infrastructure (like analyzers, ported Snowball
stemmers and other segmentation stuff). We also plan on using the new
language identifiers, automata, test framework...

I guess this shows that Lucene is a lot _more_ than just a document
retrieval library. There are nuggets in the codebase that one can
utilize on their own, without the rest of Lucene.

If you need details, let me know privately; I'll scan the sources and
provide concrete examples.

Dawid



markharw00d at yahoo

Oct 25, 2011, 8:26 AM

Post #8 of 18 (1710 views)
Re: Bet you didn't know Lucene can...

>>using Lucene that don't fit under the core premise of full text search

I've had several use cases over the years that use features peculiar to Lucene, but here's a very simple one I came across today that illustrates its raw index lookup capability:

I needed a fast, scalable and persistent "Set" implementation to maintain a large cold-list (millions of string-based keys).
I benchmarked various implementations using a set of ~6 million keys with 10,000 random key lookups.
When it comes to RAM use, retrieval times and start-up costs Lucene stands up very well against equivalent embedded databases for this task:

* Benchmarks for times to initially open the set when stored on disk: http://goo.gl/dJL3g
* Benchmarks for Avg key lookup time once opened: http://goo.gl/SG79N
* Stats for RAM use after 10,000 lookups: http://goo.gl/MyJDn

I don't doubt all of these implementations could be tweaked (e.g. optimizing the Lucene index, various DB-specific settings) but I tried to use sensible defaults to make the tests fair e.g. use of prepared statements, indexes, minimal data retrieved.
Speeds varied with each run of the random lookup test due to OS-level caching effects so the best times were recorded in each case.
The HashSet tests are loaded entirely from file (hence the long start-up time) and are not a scalable solution because of RAM costs.
MySQL requires an inter-process call as it was not embedded, but even using a remote Lucene call I get significantly better performance (avg 0.5ms lookup vs MySQL 10ms).


Cheers
Mark





erik.hatcher at gmail

Oct 25, 2011, 8:50 AM

Post #9 of 18 (1713 views)
Re: Bet you didn't know Lucene can...

At the group where I worked at UVa once upon a time, a coworker built Juxta, a very cool tool that visually diffs multiple versions of a document with heat maps and "difference"-o-meters. It leverages Lucene analyzers to extract words, positions and such.

You can find it here: http://www.juxtasoftware.org/

Erik





gsingers at apache

Oct 25, 2011, 12:57 PM

Post #10 of 18 (1704 views)
Re: Bet you didn't know Lucene can...

On Oct 25, 2011, at 11:26 AM, mark harwood wrote:

>>> using Lucene that don't fit under the core premise of full text search
>
> I've had several use cases over the years that use features peculiar to Lucene but here's a very simple one I came across today that illustrates its raw index lookup capability:
>
> I needed a fast, scalable and persistent "Set" implementation to maintain a large cold-list (millions of string-based keys).
> I benchmarked various implementations using a set of ~6 million keys with 10,000 random key lookups.
> When it comes to RAM use, retrieval times and start-up costs Lucene stands up very well against equivalent embedded databases for this task:
>
> * Benchmarks for times to initially open the set when stored on disk: http://goo.gl/dJL3g
> * Benchmarks for Avg key lookup time once opened: http://goo.gl/SG79N
> * Stats for RAM use after 10,000 lookups: http://goo.gl/MyJDn

Those charts are beautiful. I have Lucene/Solr down as an excellent key-value store (I've seen this done many times) and these charts further cement it.


--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com


dawid.weiss at gmail

Oct 25, 2011, 2:47 PM

Post #11 of 18 (1706 views)
Re: Bet you didn't know Lucene can...

Avg lookup time slightly less than a HashSet? Interesting. Is the code
to these benchmarks available somewhere?

Dawid



markharw00d at yahoo

Oct 25, 2011, 3:08 PM

Post #12 of 18 (1704 views)
Re: Bet you didn't know Lucene can...

> Avg lookup time slightly less than a HashSet? Interesting.

Yep, HashSet comparison was a surprise to me too. I threw it in as a datapoint for what I thought would be the fastest option on the example dataset but clearly not a long-term answer to my problem as it costs so much in RAM.
Lucene started out at an avg 3ms but subsequent runs took it down dramatically due to OS file caching. The all-in-memory hashset implementation clearly did not demonstrate the same speed ups between runs.

> Is the code
> to these benchmarks available somewhere?


I can make the code available, but not the data.
The English Wikipedia page titles are probably an equivalent size and shape so I could try and package something up around that as a benchmarking tool for others to play with.

Cheers
Mark



dawid.weiss at gmail

Oct 25, 2011, 3:17 PM

Post #13 of 18 (1704 views)
Re: Bet you didn't know Lucene can...

> Lucene started out at an avg 3ms but subsequent runs took it down dramatically due to OS file caching. The all-in-memory hashset implementation clearly did not demonstrate the same speed ups between runs.

I'm not saying the benchmark was wrong or anything, but this is
surprising. I mean, the default HashSet impl. is a bucketed
linked-list implementation. It made me wonder how the data was
distributed. Even with OS file caching the in-memory data structure
shouldn't fall short, at least intuitively.

> I can make the code available but the data wouldn't be possible.
> The English Wikipedia page titles are probably an equivalent size and shape so I could try and package something up around that as a benchmarking tool for others to play with.

If you find a spare cycle, it'd be great, thanks!

Dawid



markharw00d at yahoo

Oct 26, 2011, 10:02 AM

Post #14 of 18 (1684 views)
Re: Bet you didn't know Lucene can...

> Avg lookup time slightly less than a HashSet? Interesting.

Scratch that. A new dataset and revised code show HashSets out in front (but still not a realistic option for very large sets): http://goo.gl/Lb4J1

In this benchmark I removed code that was common to all the previous tests: it first retrieved a random key from a test-query Lucene index and then looked it up in the target Set (a choice of database, HashSet or a different Lucene index).

I assumed that, being common to all tests, this initial Lucene-based fetch would not bias the results, but it did. Now the tests first load a random sample of 100k keys from a flat file and *then* start the timer on the look-ups.
I'm also using public domain Wikipedia data so can release the code and data somewhere if that's of interest.

Cheers
Mark
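
A minimal sketch of the revised timing approach described above, assuming plain Java: the sample keys are chosen before the clock starts, so only the look-ups are timed. `LookupBenchmark`, its method names, and the `Predicate<String>` abstraction are hypothetical; this is not the benchmark code from the thread, which had not been published at this point.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.function.Predicate;

/** Illustrative only: times set-membership look-ups with key sampling kept outside the timed region. */
final class LookupBenchmark {

    /** Hit count and average nanoseconds per look-up for one timed run. */
    static final class Result {
        final int hits;
        final double avgNanos;

        Result(int hits, double avgNanos) {
            this.hits = hits;
            this.avgNanos = avgNanos;
        }
    }

    /** Picks sampleSize keys at random (with replacement) before any timing starts. */
    static List<String> sampleKeys(List<String> allKeys, int sampleSize, long seed) {
        Random random = new Random(seed);
        List<String> sample = new ArrayList<>(sampleSize);
        for (int i = 0; i < sampleSize; i++) {
            sample.add(allKeys.get(random.nextInt(allKeys.size())));
        }
        return sample;
    }

    /** Starts the timer only around the look-ups themselves. */
    static Result averageLookupNanos(Predicate<String> contains, List<String> sample) {
        int hits = 0;
        long start = System.nanoTime();
        for (String key : sample) {
            if (contains.test(key)) {
                hits++; // counting hits also keeps the look-up from being optimized away
            }
        }
        long elapsed = System.nanoTime() - start;
        return new Result(hits, (double) elapsed / sample.size());
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) {
            keys.add("key-" + i);
        }
        Set<String> set = new HashSet<>(keys);
        List<String> sample = sampleKeys(keys, 10_000, 42L);
        Result result = averageLookupNanos(set::contains, sample);
        System.out.println(result.hits + " hits, avg " + result.avgNanos + " ns per look-up");
    }
}
```

Any backing store can be plugged in through the predicate (an in-memory set here; a database or index look-up would go in the same slot), which keeps the timed region identical across implementations. A single pass like this is still sensitive to JIT warm-up and OS caching, as the earlier runs in the thread showed.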





dawid.weiss at gmail

Oct 26, 2011, 10:33 AM

Post #15 of 18 (1685 views)
Re: Bet you didn't know Lucene can...

Yes, sure it is interesting. GitHub would probably be a good spot?

Dawid

On Wed, Oct 26, 2011 at 7:02 PM, mark harwood <markharw00d [at] yahoo> wrote:
>>>  > Avg lookup time slightly less than a HashSet? Interesting.
>
> Scratch that. A new dataset and revised code shows HashSets out in front (but still not a realistic option for very large sets) : http://goo.gl/Lb4J1
>
> In this benchmark I removed the code common to all previous tests which was first retrieving a random key from a test query Lucene index to then look up in the target Set ( a choice of database, hashset or a different Lucene index).
>
> I assumed that being common code to all tests, this initial Lucene-based fetch would not bias results but it was. Now the tests first load a random sample of 100k keys from a flat file *then* start the timer on the look-ups.
> I'm also using public domain Wikipedia data so can release the code and data somewhere if that's of interest.
>
> Cheers
> Mark
>
>
>
> ----- Original Message -----
> From: Dawid Weiss <dawid.weiss [at] gmail>
> To: java-user [at] lucene
> Cc:
> Sent: Tuesday, 25 October 2011, 23:17
> Subject: Re: Bet you didn't know Lucene can...
>
>> Lucene started out at an avg 3ms but subsequent runs took it down dramatically due to OS file caching. The all-in-memory hashset implementation clearly did not demonstrate the same speed ups between runs.
>
> I don't say the benchmark was wrong or anything, but this is
> surprising. I mean, the default HashSet impl. is a bucketed
> linked-list implementation. It made me wonder how the data was
> distributed. Even with OS file caching the in-memory data structure
> shouldn't fall short, at least intuitively.
>
>> I can make the code available but the data wouldn't be possible.
>> The English Wikipedia page titles are probably an equivalent size and shape so I could try and package something up around that as a benchmarking tool for others to play with.
>
> If you find a spare cycle, it'd be great, thanks!
>
> Dawid


ab at getopt

Oct 31, 2011, 1:32 PM

Post #16 of 18 (1661 views)
Re: Bet you didn't know Lucene can... [In reply to]

On 22/10/2011 11:11, Grant Ingersoll wrote:
> Hi All,
>
> I'm giving a talk at ApacheCon titled "Bet you didn't know Lucene can..." (http://na11.apachecon.com/talks/18396). It's based on my observation, that over the years, a number of us in the community have done some pretty cool things using Lucene that don't fit under the core premise of full text search. I've got a fair number of ideas for the talk (easily enough for 1 hour), but I wanted to reach out to hear your stories of ways you've (ab)used Lucene and Solr to see if we couldn't extend the conversation to a bit more than the conference and also see if I can't inject more ideas beyond the ones I have. I don't need deep technical details, but just high level use case and the basic insight that led you to believe Lucene could solve the problem.

Better late than never ... :) I briefly mentioned this use case to you
at Eurocon, but here it is for the record.

I used Lucene in a duplicate-detection scenario where, instead of
documents, individual sentences were indexed (with a fuzz). A
similarity-preserving hash function was calculated on each sentence, and
the hash was added as a field. The property of the hash was that similar
documents (sentences) would produce similar hashes, differing only by
some bit-level perturbation. The challenge was to find a ranked list of
possible duplicates with similar (not exactly the same) hashes, which in
this case meant finding a ranked list of documents whose hashes have the
smallest bit-level (Hamming) distance from the query hash.

The solution is described in SOLR-1918 - Bit-wise scoring field type.
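
For readers who want to see the ranking idea in isolation, here is a
minimal brute-force sketch in plain Java. It is not the SOLR-1918 field
type and does not touch Lucene at all; the class and method names
(HammingRank, hammingDistance, nearest) are hypothetical, and the 64-bit
hash width is an assumption. It only shows what "smallest bit-level
distance from the query hash" means when hashes are compared directly.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustration only: a linear scan over stored hashes, not an index-backed query.
final class HammingRank {

    // Number of bit positions in which two 64-bit hashes differ.
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    // Returns the positions of the k stored hashes closest to the query hash,
    // nearest first. List.sort is stable, so ties keep their original order.
    static List<Integer> nearest(long query, long[] hashes, int k) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < hashes.length; i++) {
            ids.add(i);
        }
        ids.sort(Comparator.comparingInt(i -> hammingDistance(query, hashes[i])));
        return new ArrayList<>(ids.subList(0, Math.min(k, ids.size())));
    }

    public static void main(String[] args) {
        // Made-up sample hashes, chosen so the distances are easy to read.
        long[] hashes = {0b1111_0000L, 0b1111_0001L, 0b0000_1111L, 0b1011_0000L};
        System.out.println(nearest(0b1111_0000L, hashes, 3));
    }
}
```

A scan like this costs one XOR and one bit count per stored hash, so it
is only practical for small sets; the point of doing it inside the index
is to avoid visiting every document.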

--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com




petite_abeille at me

Oct 31, 2011, 1:42 PM

Post #17 of 18 (1672 views)
Re: Bet you didn't know Lucene can... [In reply to]

On Oct 31, 2011, at 9:32 PM, Andrzej Bialecki wrote:

> similarity-preserving hash function was calculated on each sentence, and the hash was added as a field. The property of the hash was that similar documents (sentences) would produce a similar hash, with only some bit-level perturbation. The challenge was to find a ranked list of possible duplicates with similar (not exact same) hashes, which in this case meant to find a ranked list of documents that have the smallest bit-level distance in their hashes from the query hash.
>
> The solution is described in SOLR-1918 - Bit-wise scoring field type.

In other words, a simhash, no?

Similarity Estimation Techniques from Rounding Algorithms
http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CharikarEstim.pdf

http://www.matpalm.com/resemblance/simhash/
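
As a rough sketch of the commonly described bit-voting form of simhash
(and not the application-specific hash used in the project discussed in
this thread), the following Java computes a 64-bit fingerprint for a
sentence. The whitespace tokenizer, the equal weight per token, the
choice of FNV-1a as the per-token hash, and the names SimHashSketch,
fnv1a64 and simhash are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Illustration only: every token votes +1 or -1 on each of the 64 bit positions.
final class SimHashSketch {

    // 64-bit FNV-1a; an arbitrary choice of per-token hash for this sketch.
    static long fnv1a64(String token) {
        long h = 0xcbf29ce484222325L;            // FNV offset basis
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;                 // FNV prime
        }
        return h;
    }

    // Builds the fingerprint: a bit is set when more tokens voted 1 than 0 there.
    static long simhash(String sentence) {
        int[] votes = new int[64];
        for (String token : sentence.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty()) {
                continue;                        // leading whitespace yields an empty token
            }
            long h = fnv1a64(token);
            for (int bit = 0; bit < 64; bit++) {
                votes[bit] += ((h >>> bit) & 1L) == 1L ? 1 : -1;
            }
        }
        long fingerprint = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (votes[bit] > 0) {
                fingerprint |= 1L << bit;
            }
        }
        return fingerprint;
    }

    public static void main(String[] args) {
        long a = simhash("the quick brown fox jumps over the lazy dog");
        long b = simhash("the quick brown fox jumped over the lazy dog");
        // Bit-level distance between the two fingerprints.
        System.out.println(Long.bitCount(a ^ b));
    }
}
```

Because each token contributes independently, the fingerprint does not
depend on word order, and changing one token can only shift the votes
that token took part in.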




ab at getopt

Oct 31, 2011, 5:32 PM

Post #18 of 18 (1654 views)
Re: Bet you didn't know Lucene can... [In reply to]

On 31/10/2011 21:42, Petite Abeille wrote:
>
> On Oct 31, 2011, at 9:32 PM, Andrzej Bialecki wrote:
>
>> similarity-preserving hash function was calculated on each sentence, and the hash was added as a field. The property of the hash was that similar documents (sentences) would produce a similar hash, with only some bit-level perturbation. The challenge was to find a ranked list of possible duplicates with similar (not exact same) hashes, which in this case meant to find a ranked list of documents that have the smallest bit-level distance in their hashes from the query hash.
>>
>> The solution is described in SOLR-1918 - Bit-wise scoring field type.
>
> In other words, a simhash, no?
>
> Similarity Estimation Techniques from Rounding Algorithms
> http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CharikarEstim.pdf
>
> http://www.matpalm.com/resemblance/simhash/

Yes, you could use this. In that project we used a different,
application-specific hash.


--
Best regards,
Andrzej Bialecki <><


