
Mailing List Archive: Lucene: Java-Dev

[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads

 

 



jira at apache

Nov 16, 2009, 3:27 PM

Post #1 of 72 (1451 views)
Permalink
[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads

[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778645#action_12778645 ]

Jason Rutherglen commented on LUCENE-2075:
------------------------------------------

Solr used a ConcurrentHashMap (CHM) as an LRU, but it turned out to be
somewhat less than truly LRU? I'd have expected Google Collections to
offer a concurrent linked hash map, but no dice?
http://code.google.com/p/google-collections/

Maybe there's a way to build a concurrent LRU using their CHM?

> Share the Term -> TermInfo cache across threads
> -----------------------------------------------
>
> Key: LUCENE-2075
> URL: https://issues.apache.org/jira/browse/LUCENE-2075
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Michael McCandless
> Priority: Minor
> Fix For: 3.1
>
>
> Right now each thread creates its own (thread private) SimpleLRUCache,
> holding up to 1024 terms.
> This is rather wasteful, since if there are a high number of threads
> that come through Lucene, you're multiplying the RAM usage. You're
> also cutting way back on likelihood of a cache hit (except the known
> multiple times we lookup a term within-query, which uses one thread).
> In NRT search we open new SegmentReaders (on tiny segments) often
> which each thread must then spend CPU/RAM creating & populating.
> Now that we are on 1.5 we can use java.util.concurrent.*, eg
> ConcurrentHashMap. One simple approach could be a double-barrel LRU
> cache, using 2 maps (primary, secondary). You check the cache by
> first checking primary; if that's a miss, you check secondary and if
> you get a hit you promote it to primary. Once primary is full you
> clear secondary and swap them.
> Or... any other suggested approach?
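
The double-barrel idea in the description above can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from Lucene or from any patch on this issue: the class name DoubleBarrelCache and the maxSize parameter are made up, the LRU behaviour is approximate, and up to 2 * maxSize entries can be held at once.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of a double-barrel cache; names are illustrative only. */
final class DoubleBarrelCache<K, V> {
  private final int maxSize;
  // volatile so that a swap done by one thread becomes visible to the others
  private volatile Map<K, V> primary = new ConcurrentHashMap<K, V>();
  private volatile Map<K, V> secondary = new ConcurrentHashMap<K, V>();

  DoubleBarrelCache(int maxSize) {
    this.maxSize = maxSize;
  }

  V get(K key) {
    V value = primary.get(key);
    if (value == null) {
      value = secondary.get(key);
      if (value != null) {
        put(key, value); // hit in secondary: promote it to primary
      }
    }
    return value;
  }

  void put(K key, V value) {
    primary.put(key, value);
    if (primary.size() >= maxSize) {
      swap();
    }
  }

  // Once primary is full: clear secondary and swap the two maps.
  private synchronized void swap() {
    if (primary.size() < maxSize) {
      return; // another thread already swapped
    }
    Map<K, V> oldSecondary = secondary;
    oldSecondary.clear();
    secondary = primary;
    primary = oldSecondary;
  }
}
```

Lookups and inserts go through the lock-free maps; only the swap is synchronized. A put that races with a swap may land in a map that is being cleared, so an entry can occasionally be lost, which is usually acceptable for a cache.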

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 16, 2009, 5:17 PM

Post #2 of 72 (1406 views)
Permalink
[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778675#action_12778675 ]

Earwin Burrfoot commented on LUCENE-2075:
-----------------------------------------

There's no such thing in Google Collections. However, look at this - http://code.google.com/p/concurrentlinkedhashmap/



jira at apache

Nov 17, 2009, 11:14 AM

Post #3 of 72 (1393 views)
Permalink
[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779055#action_12779055 ]

Michael McCandless commented on LUCENE-2075:
--------------------------------------------

Since Solr has already created a concurrent LRU, I think we can simply reuse that? Is there any reason not to?

I don't think we need absolutely truly LRU for the terminfo cache.



jira at apache

Nov 17, 2009, 11:28 AM

Post #4 of 72 (1392 views)
Permalink
[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779071#action_12779071 ]

Mark Miller commented on LUCENE-2075:
-------------------------------------

We should probably compare with Google's (it's Apache 2 licensed, so why not).

Solr has two synchronized LRU caches: LRUCache, which is basically just a synchronized LinkedHashMap, and FastLRUCache, which I believe tries to minimize the cost of gets. However, unless you have a high hit ratio, FastLRUCache tested slower than LRUCache.
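
For readers unfamiliar with the "synchronized LinkedHashMap" pattern mentioned above, a minimal sketch follows. It is not Solr's LRUCache; the class name SynchronizedLruCache and the create method are hypothetical. It relies only on standard JDK behaviour: a LinkedHashMap built with accessOrder=true plus an overridden removeEldestEntry.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: a size-bounded LRU map guarded by a single lock. */
final class SynchronizedLruCache {
  static <K, V> Map<K, V> create(final int maxSize) {
    // accessOrder=true: iteration order runs from least to most recently accessed
    LinkedHashMap<K, V> lru = new LinkedHashMap<K, V>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxSize; // evict the least recently used entry on overflow
      }
    };
    // Every call, including get(), takes the same lock.
    return Collections.synchronizedMap(lru);
  }
}
```

Because an access-ordered LinkedHashMap reorders its internal list on every get, reads have to take the lock too. That is the contention a concurrent design tries to avoid when many threads hit the cache.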



jira at apache

Nov 17, 2009, 5:36 PM

Post #5 of 72 (1389 views)
Permalink
[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779247#action_12779247 ]

Michael McCandless commented on LUCENE-2075:
--------------------------------------------

bq. We should probably compare with Google's (it's Apache 2 licensed, so why not)

Well, that's just hosted on code.google.com (ie it's not "Google's"), and reading its description it sounds sort of experimental (though they do state that they created a "Production Version"). It made me a bit nervous... however, it does sound like people use it in "production".

I think FastLRUCache is probably best for Lucene, because it scales up well w/ high number of threads? My guess is that its extra cost at low hit rates is negligible for Lucene, but I'll run some perf tests.

It looks like ConcurrentLRUCache (used by FastLRUCache, but the latter does other solr-specific things) is the right low-level one to use for Lucene?



jira at apache

Nov 17, 2009, 5:52 PM

Post #6 of 72 (1389 views)
Permalink
[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779253#action_12779253 ]

Mark Miller commented on LUCENE-2075:
-------------------------------------

bq. Well, that's just hosted on code.google.com (ie it's not "Google's"),

Ah - got that vibe, but it didn't really hit me.

bq. though they do state that they created a "Production Version"

Right - that's what I was thinking we might try. Though the whole "trying this from scratch to learn" thing is a bit scary too ;) But hey, I'm not recommending it, just suggesting we perhaps try it.

bq. I think FastLRUCache is probably best for Lucene, because it scales up well w/ high number of threads?

Indeed - though if we expect a low hit ratio, we might still compare it with regular old synchronized LinkedHashMap to be sure. In certain cases, puts become quite expensive I think.

bq. It looks like ConcurrentLRUCache (used by FastLRUCache, but the latter does other solr-specific things) is the right low-level one to use for Lucene?

Right.

> Attachments: ConcurrentLRUCache.java


jira at apache

Nov 17, 2009, 5:56 PM

Post #7 of 72 (1388 views)
Permalink
[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779255#action_12779255 ]

Mark Miller commented on LUCENE-2075:
-------------------------------------

When/if Yonik finally pops up here, he will have some good info to add, I think.



jira at apache

Nov 17, 2009, 11:40 PM

Post #8 of 72 (1387 views)
Permalink
[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779354#action_12779354 ]

Uwe Schindler commented on LUCENE-2075:
---------------------------------------

Shouldn't this ConcurrentLRUCache be placed in the o.a.l.util.cache package?

About the Solr implementation: The generification has a "small" problem: get(), contains(), remove() and other by-key-querying methods should use Object as the type for the key, not the generic key type, because there is nothing wrong with testing contains() against any Java type (it would just return false). The Sun generics how-to explains that. Very funny video about that: [http://www.youtube.com/watch?v=wDN_EYUvUq0] (explanation starts at 4:35)
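
A small sketch of the signature style being suggested, mirroring how java.util.Map declares get(Object), containsKey(Object) and remove(Object). The wrapper class ObjectKeyedCache is hypothetical and is not the Solr or Lucene class; it only shows that a lookup with a key of the "wrong" type compiles and simply misses.

```java
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical wrapper showing Object-typed lookup methods, as in java.util.Map. */
final class ObjectKeyedCache<K, V> {
  private final ConcurrentHashMap<K, V> map = new ConcurrentHashMap<K, V>();

  // Inserts stay strongly typed.
  void put(K key, V value) {
    map.put(key, value);
  }

  // Lookups accept any non-null Object; an unrelated type just never matches.
  V get(Object key) {
    return map.get(key);
  }

  boolean contains(Object key) {
    return map.containsKey(key);
  }

  V remove(Object key) {
    return map.remove(key);
  }
}
```

With these signatures, cache.contains(Integer.valueOf(42)) on an ObjectKeyedCache<String, Integer> compiles and returns false rather than being rejected by the compiler.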



jira at apache

Nov 18, 2009, 5:16 AM

Post #9 of 72 (1385 views)
Permalink
[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779437#action_12779437 ]

Michael McCandless commented on LUCENE-2075:
--------------------------------------------

I'll work out a simple perf test to compare the options...



jira at apache

Nov 18, 2009, 8:18 AM

Post #10 of 72 (1372 views)
Permalink
[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779512#action_12779512 ]

Earwin Burrfoot commented on LUCENE-2075:
-----------------------------------------

> Well, that's just hosted on code.google.com (ie it's not "Google's"), and reading its description it sounds sort of experimental (though they do state that they created a "Production Version"). It made me a bit nervous... however, it does sound like people use it in "production".

I have run it in production for several months (starting from the 'experimental' version) as a cache for Filters. No visible problems.



jira at apache

Nov 18, 2009, 8:30 AM

Post #11 of 72 (1368 views)
Permalink
[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779514#action_12779514 ]

Yonik Seeley commented on LUCENE-2075:
--------------------------------------

The Solr one could be simplified a lot for Lucene... no need to keep some of the statistics and things like "isLive".

Testing via something like the double barrel approach will be tricky. The behavior of ConcurrentLRUCache (i.e. the cost of puts) depends on the access pattern - in the best cases, a single linear scan would be all that's needed. In the worst case, a subset of the map needs to go into a priority queue. It's all in markAndSweep... that's my monster - let me know if the comments don't make sense.

How many entries must be removed to be considered a success also obviously affects whether a single linear scan is enough. If that's often the case, some other optimizations can be done such as not collecting the entries for further passes:
{code}
// This entry *could* be in the bottom group.
// Collect these entries to avoid another full pass... this is wasted
// effort if enough entries are normally removed in this first pass.
// An alternate impl could make a full second pass.
{code}
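
To make the "single linear scan" case above more concrete, here is a simplified, single-threaded sketch of a first-pass classification. It is not the Solr markAndSweep code; the class FirstPassSweep, the method sweep and its parameters are hypothetical, and it assumes each entry carries a last-access stamp taken from a counter that never hands out the same value twice.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical, single-threaded sketch of a one-pass eviction scan. */
final class FirstPassSweep {
  /**
   * stamps maps each key to its last-access stamp (assumed unique).
   * Entries that are certainly among the oldest are removed in place;
   * keys that cannot be decided in this pass are returned.
   */
  static <K> List<K> sweep(Map<K, Long> stamps, long oldest, long newest, int wantToKeep) {
    int wantToRemove = stamps.size() - wantToKeep;
    List<K> undecided = new ArrayList<K>();
    Iterator<Map.Entry<K, Long>> it = stamps.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<K, Long> e = it.next();
      long stamp = e.getValue();
      if (stamp > newest - wantToKeep) {
        continue; // at most wantToKeep stamps are this recent: certainly kept
      } else if (stamp < oldest + wantToRemove) {
        it.remove(); // at most wantToRemove stamps are this old: certainly evicted
      } else {
        undecided.add(e.getKey()); // could be in the bottom group: needs more work
      }
    }
    return undecided;
  }
}
```

When the stamps are densely packed, every entry falls into one of the first two branches and the single scan is enough. When they are spread out, some keys come back as undecided, and that is where a further pass or a priority queue would be needed.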



jira at apache

Nov 18, 2009, 10:04 AM

Post #12 of 72 (1369 views)
Permalink
[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779560#action_12779560 ]

Uwe Schindler commented on LUCENE-2075:
---------------------------------------

Looks good! Can this cache subclass the abstract (Map)Cache? It is in the correct package but does not subclass Cache.

> Attachments: ConcurrentLRUCache.java, LUCENE-2075.patch


jira at apache

Nov 18, 2009, 11:40 AM

Post #13 of 72 (1368 views)
Permalink
[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779631#action_12779631 ]

Uwe Schindler commented on LUCENE-2075:
---------------------------------------

Sorry, a small problem with a cast. Will upload a new patch soon.

> Attachments: ConcurrentLRUCache.java, LUCENE-2075.patch, LUCENE-2075.patch


jira at apache

Nov 18, 2009, 12:06 PM

Post #14 of 72 (1369 views)
Permalink
[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779638#action_12779638 ]

Michael Busch commented on LUCENE-2075:
---------------------------------------

{quote}
Things won't work well after 2B accesses since Integer.MAX_VALUE is used
{quote}

From ReentrantLock javadocs:
"This lock supports a maximum of 2147483647 recursive locks by the same thread."

I think you only use the lock for markAndSweep and everything else uses atomics, but ConcurrentHashMap uses ReentrantLocks internally for each segment. So overall, things will probably run longer than 2B ops, but not sure how long.
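
One way to read the quoted Javadoc sentence is that the limit applies to nested (recursive) holds by a single thread, which ReentrantLock exposes through getHoldCount(), rather than to the total number of lock/unlock cycles over time. The sketch below only illustrates that reading; the class HoldCountDemo and its methods are hypothetical and are not taken from any patch on this issue.

```java
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical demo: sequential acquisitions versus nested (recursive) holds. */
final class HoldCountDemo {
  /** Highest hold count observed while locking and unlocking in sequence. */
  static int maxHoldCountSequential(ReentrantLock lock, int iterations) {
    int max = 0;
    for (int i = 0; i < iterations; i++) {
      lock.lock();
      try {
        max = Math.max(max, lock.getHoldCount());
      } finally {
        lock.unlock(); // hold count drops back to 0 before the next round
      }
    }
    return max;
  }

  /** Hold count after acquiring the lock 'depth' times without releasing in between. */
  static int holdCountNested(ReentrantLock lock, int depth) {
    for (int i = 0; i < depth; i++) {
      lock.lock();
    }
    int held = lock.getHoldCount();
    for (int i = 0; i < depth; i++) {
      lock.unlock();
    }
    return held;
  }
}
```

In this sketch the sequential loop never sees a hold count above 1, however many iterations run, while the nested variant reports the nesting depth. Whether this matters for the cache depends on how the patch actually uses Integer.MAX_VALUE, which is not shown in this part of the thread.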

> Attachments: ConcurrentLRUCache.java, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch
>
>
> Right now each thread creates its own (thread private) SimpleLRUCache,
> holding up to 1024 terms.
> This is rather wasteful, since if there are a high number of threads
> that come through Lucene, you're multiplying the RAM usage. You're
> also cutting way back on likelihood of a cache hit (except the known
> multiple times we lookup a term within-query, which uses one thread).
> In NRT search we open new SegmentReaders (on tiny segments) often
> which each thread must then spend CPU/RAM creating & populating.
> Now that we are on 1.5 we can use java.util.concurrent.*, eg
> ConcurrentHashMap. One simple approach could be a double-barrel LRU
> cache, using 2 maps (primary, secondary). You check the cache by
> first checking primary; if that's a miss, you check secondary and if
> you get a hit you promote it to primary. Once primary is full you
> clear secondary and swap them.
> Or... any other suggested approach?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Nov 18, 2009, 12:06 PM

Post #15 of 72 (1369 views)
Permalink
[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779639#action_12779639 ]

Uwe Schindler commented on LUCENE-2075:
---------------------------------------

Hi Yonik, thanks for using my class, but I found one type-erasure problem in the PQueue: the heap is erased to Object[] by javac, so when getValues() tries to cast this array, you get a ClassCastException. This is described here: http://safalra.com/programming/java/wrong-type-erasure/
The same happens in myInsertWithOverflow().

Will fix.
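
For readers unfamiliar with this pitfall, here is a minimal sketch of the failure mode, assuming a generic heap that allocates its backing array as Object[]. The names (ErasedHeap, getValuesUnsafe, getValuesSafe, ErasureDemo) are made up for illustration; this is not the code from the attached patch.

```java
import java.util.Arrays;

// Hypothetical stand-in for a generic priority queue; not the class from the patch.
class ErasedHeap<T> {
  private final T[] heap;
  private int size;

  @SuppressWarnings("unchecked")
  ErasedHeap(int capacity) {
    // After erasure this is a plain Object[]; the unchecked cast is never verified here.
    heap = (T[]) new Object[capacity];
  }

  void add(T value) {
    heap[size++] = value;
  }

  // Unsafe: a caller assigning the result to String[] gets a compiler-inserted
  // cast to String[] that fails at runtime, because the array really is Object[].
  T[] getValuesUnsafe() {
    return heap;
  }

  // Safer: hand out the elements as Object[] and let callers cast per element.
  Object[] getValuesSafe() {
    return Arrays.copyOf(heap, size, Object[].class);
  }
}

class ErasureDemo {
  // Returns true if reading the array as String[] throws ClassCastException.
  static boolean unsafeThrows() {
    ErasedHeap<String> h = new ErasedHeap<String>(4);
    h.add("a");
    try {
      String[] values = h.getValuesUnsafe(); // the hidden cast happens on this line
      return values == null;                 // not reached
    } catch (ClassCastException e) {
      return true;
    }
  }
}
```

Another common way out is to create the array with java.lang.reflect.Array.newInstance and a Class token, at the cost of passing that token into the constructor.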



jira at apache

Nov 18, 2009, 12:12 PM

Post #16 of 72 (1368 views)
Permalink
[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779640#action_12779640 ]

Yonik Seeley commented on LUCENE-2075:
--------------------------------------

bq. "This lock supports a maximum of 2147483648 recursive locks by the same thread."

I read this as a maximum of recursive locks (which this class won't do at all)... not the total number of times one can successfully lock/unlock the lock.

This cache impl should be able to support 1B operations per second for almost 300 years (i.e. the time it would take to overflow a long).
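
As a quick sanity check of that figure, the arithmetic can be reproduced as below. The rate of 1B operations per second comes from the comment above; the class and method names are illustrative, and a year is assumed to be 365.25 days.

```java
class CounterOverflow {
  static final long OPS_PER_SECOND = 1_000_000_000L; // 1B ops/sec, as in the comment above
  static final long SECONDS_PER_YEAR = 31_557_600L;  // 365.25 days * 86,400 s (assumption)

  // Whole seconds until a counter starting at 0 reaches maxValue at the given rate.
  static long secondsToOverflow(long maxValue) {
    return maxValue / OPS_PER_SECOND;
  }

  // Whole years until a counter starting at 0 reaches maxValue at the given rate.
  static long yearsToOverflow(long maxValue) {
    return secondsToOverflow(maxValue) / SECONDS_PER_YEAR;
  }

  public static void main(String[] args) {
    System.out.println("int counter:  " + secondsToOverflow(Integer.MAX_VALUE) + " s");
    System.out.println("long counter: " + yearsToOverflow(Long.MAX_VALUE) + " years");
  }
}
```

At that rate an int counter wraps after roughly 2 seconds, while a long counter lasts about 292 years, which matches the "almost 300 years" estimate.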



jira at apache

Nov 18, 2009, 12:18 PM

Post #17 of 72 (1367 views)
Permalink
[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779641#action_12779641 ]

Paul Smith commented on LUCENE-2075:
------------------------------------

bq. This cache impl should be able to support 1B operations per second for almost 300 years (i.e. the time it would take to overflow a long).

Hopefully Sun has released Java 7 by then. :)



jira at apache

Nov 20, 2009, 2:31 PM

Post #18 of 72 (1291 views)
Permalink
[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780795#action_12780795 ]

Yonik Seeley commented on LUCENE-2075:
--------------------------------------

bq. Also, the results for ConcurrentLRUCache are invalid (its hit rate is
way too high) - I think this is because its eviction process can take
a longish amount of time, which temporarily allows the map to hold way
too many entries, and means it's using up a lot more transient RAM than
it should.

Yep - there's no hard limit. It's not an issue in practice in Solr since doing the work to generate a new entry to put in the cache is much more expensive than cache cleaning (i.e. generation will never swamp cleaning). Seems like a realistic benchmark would do some amount of work on a cache miss? Or perhaps putting it in Lucene and doing real benchmarks?

bq. Another idea: I wonder whether a simple cache-line like cache would be sufficient. Ie, we hash to a fixed slot and we evict whatever is
there.

We need to balance the overhead of the cache with the hit ratio and the cost of a miss. For the String intern cache, the cost of a miss is very low, hence lowering overhead but giving up hit ratio is the right trade-off. For this term cache, the cost of a miss seems relatively high, and warrants increasing overhead to increase the hit ratio.
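
To make the low-overhead end of that trade-off concrete, here is a rough sketch of the "hash to a fixed slot and evict whatever is there" idea quoted above. SlotCache and its members are hypothetical names, not code from any patch on this issue.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

// Hypothetical direct-mapped cache: each key hashes to exactly one slot, and a put
// simply overwrites whatever entry was stored there before.
class SlotCache<K, V> {
  private static final class Entry<K, V> {
    final K key;
    final V value;
    Entry(K key, V value) {
      this.key = key;
      this.value = value;
    }
  }

  private final AtomicReferenceArray<Entry<K, V>> slots;
  private final int mask;

  // Capacity is rounded up to a power of two so the slot index is a cheap bit mask.
  SlotCache(int minCapacity) {
    int cap = 1;
    while (cap < minCapacity) {
      cap <<= 1;
    }
    slots = new AtomicReferenceArray<Entry<K, V>>(cap);
    mask = cap - 1;
  }

  private int slot(Object key) {
    int h = key.hashCode();
    return (h ^ (h >>> 16)) & mask;
  }

  // Returns the cached value, or null if the slot is empty or holds another key.
  V get(K key) {
    Entry<K, V> e = slots.get(slot(key));
    return (e != null && e.key.equals(key)) ? e.value : null;
  }

  // No eviction bookkeeping at all: the previous occupant of the slot is dropped.
  void put(K key, V value) {
    slots.set(slot(key), new Entry<K, V>(key, value));
  }
}
```

There is no recency tracking here, so two hot keys that collide on a slot keep evicting each other. That is the hit-ratio cost being weighed against the very low per-operation overhead.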




jira at apache

Nov 20, 2009, 3:05 PM

Post #19 of 72 (1295 views)
Permalink
[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780819#action_12780819 ]

Yonik Seeley commented on LUCENE-2075:
--------------------------------------

Aside: a single numeric range query will be doing many term seeks (one at the start of each enumeration). It doesn't look like these will currently utilize the cache - can someone refresh my memory on why this is? We should keep the logic that prevents use of the cache while iterating over terms with a term enumerator, but it seems like using the cache for the initial seek would be nice.



jira at apache

Nov 20, 2009, 3:11 PM

Post #20 of 72 (1288 views)
Permalink
[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780824#action_12780824 ]

Uwe Schindler commented on LUCENE-2075:
---------------------------------------

The initial seek should really be optimized; this also affects the new AutomatonTermEnum for the future of RegEx queries, WildcardQueries and maybe FuzzyQueries with DFAs. With the automaton enum, depending on the DFA, there can be lots of seeks (LUCENE-1606).



jira at apache

Nov 21, 2009, 2:48 AM

Post #21 of 72 (1273 views)
Permalink
[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780944#action_12780944 ]

Michael McCandless commented on LUCENE-2075:
--------------------------------------------


bq. a single numeric range query will be doing many term seeks (one at the start of each enumeration). It doesn't look like these will currently utilize the cache - can someone refresh my memory on why this is?

You're right -- here's the code/comment:

{code}
/** Returns an enumeration of terms starting at or after the named term. */
public SegmentTermEnum terms(Term term) throws IOException {
  // don't use the cache in this call because we want to reposition the
  // enumeration
  get(term, false);
  return (SegmentTermEnum) getThreadResources().termEnum.clone();
}
{code}

I think this is because "useCache" (the 2nd arg to get) is overloaded
-- if you look at get(), if useCache is true and you have a cache hit,
it doesn't do its "normal" side effect of repositioning the
thread-private TermEnum. So you'd get incorrect results.

If get had a 2nd arg "repositionTermEnum", to decouple caching from
repositioning, then we could make use of the cache for NRQ (& soon
AutomatonTermEnum as well), though, this isn't so simple because the
cache entry (just a TermInfo) doesn't store the term's ord. And we
don't want to add ord to TermInfo since, eg, this sucks up a lot of
extra RAM storing the terms index. Probably we should make a new
class that's used for caching, and not reuse TermInfo.

This was also done before NumericRangeQuery, ie, all MTQs before NRQ
did a single seek.

BTW the flex branch fixes this -- TermsEnum.seek always checks the
cache.




jira at apache

Nov 21, 2009, 3:04 AM

Post #22 of 72 (1270 views)
Permalink
[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780946#action_12780946 ]

Michael McCandless commented on LUCENE-2075:
--------------------------------------------

{quote}
bq. Also, the results for ConcurrentLRUCache are invalid (its hit rate is way too high) - I think this is because its eviction process can take a longish amount of time, which temporarily allows the map to hold way too many entries, and means it's using up a lot more transient RAM than it should.

Yep - there's no hard limit. It's not an issue in practice in Solr since doing the work to generate a new entry to put in the cache is much more expensive than cache cleaning (i.e. generation will never swamp cleaning). Seems like a realistic benchmark would do some amount of work on a cache miss? Or perhaps putting it in lucene and doing real benchmarks?
{quote}

I agree the test is synthetic, so the blowup we're seeing is a
worst-case situation, but are you really sure this can never be hit in
practice?

EG as CPUs gain more and more cores... it becomes more and more
possible with time that the 1 thread that's trying to do the cleaning
will be swamped by the great many threads generating. Then if the CPU
is over-saturated (too many threads running), that 1 thread doing the
cleaning only gets slices of CPU time vs all the other threads that
may be generating...

It makes me nervous using a collection that, in the "perfect storm",
suddenly consumes way too much RAM. It's a leaky abstraction.

That said, I agree the test is obviously very synthetic. It's not
like a real Lucene installation will be pushing 2M QPS through Lucene
any time soon...

But still I'm more comfortable w/ the simplicity of the double-barrel
approach. In my tests its performance is in the same ballpark as
ConcurrentLRUCache; it's much simpler; and the .clear() calls appear
in practice to very quickly free up the entries.
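
For reference, the double-barrel scheme from the issue description (check primary, fall back to secondary and promote on a hit, clear secondary and swap once primary is full) can be sketched roughly as follows. This is an illustration under assumptions, not the attached patch: the class name, the maxSize parameter, and the synchronized swap are choices made for the sketch.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative double-barrel cache; names and locking strategy are assumptions.
class DoubleBarrelCache<K, V> {
  private final int maxSize;
  private volatile Map<K, V> primary = new ConcurrentHashMap<K, V>();
  private volatile Map<K, V> secondary = new ConcurrentHashMap<K, V>();

  DoubleBarrelCache(int maxSize) {
    this.maxSize = maxSize;
  }

  V get(K key) {
    V value = primary.get(key);
    if (value == null) {
      value = secondary.get(key);
      if (value != null) {
        put(key, value); // hit in secondary: promote to primary
      }
    }
    return value;
  }

  void put(K key, V value) {
    primary.put(key, value); // ConcurrentHashMap rejects null keys and values
    if (primary.size() >= maxSize) {
      swap();
    }
  }

  // Once primary is full: clear secondary, then exchange the two maps.
  private synchronized void swap() {
    if (primary.size() < maxSize) {
      return; // another thread already swapped
    }
    Map<K, V> old = secondary;
    old.clear();
    secondary = primary;
    primary = old;
  }
}
```

A lookup racing with a swap can at worst see a spurious miss, which is acceptable for a cache. Eviction is only approximately LRU, since everything left in secondary is dropped at once.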

{quote}
bq. Another idea: I wonder whether a simple cache-line like cache would be sufficient. Ie, we hash to a fixed slot and we evict whatever is there.

We need to balance the overhead of the cache with the hit ratio and the cost of a miss. For the String intern cache, the cost of a miss is very low, hence lowering overhead but giving up hit ratio is the right trade-off. For this term cache, the cost of a miss seems relatively high, and warrants increasing overhead to increase the hit ratio.
{quote}

OK I agree.

Yet another option... would be to create some sort of "thread-private
Query scope", ie, a store that's created & cleared per-Query where
Lucene can store things. When a Term's info is retrieved, it'd be
stored here, and then that "query-private" cache is consulted whenever
that Term is looked up again within that query. This would be the
"perfect cache" in that a single query would never see its terms
evicted due to other queries burning through the cache...

Though, net/net I suspect the overhead of creating/pulling from this
new cache would just be an overall search slowdown in practice.




jira at apache

Nov 21, 2009, 6:10 AM

Post #23 of 72 (1262 views)
Permalink
[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780977#action_12780977 ]

Yonik Seeley commented on LUCENE-2075:
--------------------------------------

bq. I agree the test is synthetic, so the blowup we're seeing is a worst-case situation, but are you really sure this can never be hit in practice?

I'm personally comfortable that Solr isn't going to hit this for its uses of the cache... it's simply the relative cost of generating a cache entry vs doing some cleaning.

bq. But still I'm more comfortable w/ the simplicity of the double-barrel approach. In my tests its performance is in the same ballpark as ConcurrentLRUCache;

But it wouldn't be the same performance in Lucene - a cache like LinkedHashMap would achieve a higher hit rate in real world scenarios.
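
For comparison, the exact-LRU behaviour of a LinkedHashMap-based cache is usually obtained with access ordering plus removeEldestEntry, as in the sketch below. LruMap is an illustrative name. The class is not thread-safe, so sharing one instance across threads would need external synchronization.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Single-threaded exact LRU built on LinkedHashMap's access-order mode.
class LruMap<K, V> extends LinkedHashMap<K, V> {
  private static final long serialVersionUID = 1L;
  private final int maxSize;

  LruMap(int maxSize) {
    super(16, 0.75f, true); // third argument: order entries by access, not insertion
    this.maxSize = maxSize;
  }

  // Called by LinkedHashMap after each insertion; returning true drops the
  // least recently accessed entry.
  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    return size() > maxSize;
  }
}
```

Because every get reorders the entry list, a recently read term survives eviction individually, rather than being dropped together with a whole generation of entries.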




jira at apache

Nov 21, 2009, 12:08 PM

Post #24 of 72 (1255 views)
Permalink
[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781040#action_12781040 ]

Uwe Schindler commented on LUCENE-2075:
---------------------------------------

bq. BTW the flex branch fixes this - TermsEnum.seek always checks the cache.

Can we fix this for trunk, too? But I think *this* issue talks about trunk.



jira at apache

Nov 21, 2009, 12:28 PM

Post #25 of 72 (1251 views)
Permalink
[jira] Commented: (LUCENE-2075) Share the Term -> TermInfo cache across threads [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781045#action_12781045 ]

Robert Muir commented on LUCENE-2075:
-------------------------------------

Hi, I applied the automaton patch and its benchmark (LUCENE-1606) against the flex branch, and kept the old TermEnum API.

I tested two scenarios: an old index created with 3.0 (trunk) and a new index created with the flex branch.
In both cases it's slower than trunk, but I assume this is due to the flex branch not being optimized yet? (Last I saw, it used a new String() placeholder for UTF conversion.)

But I think it is fair to compare the flex branch with itself, old index versus new index. I can only assume that with a new index it is using the caching.
These numbers are stable on HEAD and do not deviate much.
Feel free to look at the benchmark code over there and suggest improvements if you think there is an issue with it.

||Pattern||Iter||AvgHits||AvgMS (old idx)||AvgMS (new idx)||
|N?N?N?N|10|1000.0|86.6|70.2|
|?NNNNNN|10|10.0|3.0|2.0|
|??NNNNN|10|100.0|12.5|7.2|
|???NNNN|10|1000.0|86.9|34.8|
|????NNN|10|10000.0|721.2|530.5|
|NN??NNN|10|100.0|8.3|4.0|
|NN?N*|10|10000.0|149.1|143.2|
|?NN*|10|100000.0|1061.4|836.7|
|*N|10|1000000.0|16329.7|11480.0|
|NNNNN??|10|100.0|2.7|2.2|
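
To make the old-index versus new-index comparison easier to read, the two AvgMS columns can be turned into a ratio. The helper below is only a sketch; SpeedupTable, speedup and format are hypothetical names and are not part of the LUCENE-1606 benchmark. For example, ???NNNN goes from 86.9 ms to 34.8 ms (about 2.5x), while NN?N* goes from 149.1 ms to 143.2 ms (about 1.04x).

```java
import java.util.Locale;

/** Hypothetical helper for reading the table above; not part of the benchmark code. */
final class SpeedupTable {
    /** Old-index time divided by new-index time; above 1.0 means the new index is faster. */
    static double speedup(double oldMs, double newMs) {
        return oldMs / newMs;
    }

    /** One formatted row: the pattern, then the ratio rounded to two decimals. */
    static String format(String pattern, double oldMs, double newMs) {
        return String.format(Locale.ROOT, "%-8s %.2fx", pattern, speedup(oldMs, newMs));
    }
}
```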




