Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Lucene: Java-Dev

[jira] Issue Comment Edited: (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.

 

 

Lucene java-dev RSS feed   Index | Next | Previous | View Threaded


jira at apache

Feb 9, 2010, 1:17 PM

Post #1 of 5 (552 views)
Permalink
[jira] Issue Comment Edited: (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.

[ https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829163#action_12829163 ]

Fuad Efendi edited comment on LUCENE-2230 at 2/9/10 9:17 PM:
-------------------------------------------------------------

After long-run load-stress tests...

I used 2 boxes, one with SOLR, another one with simple multithreaded stress simulator (with randomply generated fuzzy query samples); each box is 2x AMD Opteron 2350 (8 core per box); 64-bit.

I disabled all SOLR caches except Document Cache (I want isolated tests; I want to ignore time taken by disk I/O to load document).

Performance boosted accordingly to number of load-stress threads (on "client" computer), then dropped:

9 Threads:
==========
TPS: 200 - 210
Response: 45 - 50 (ms)

10 Threads:
===========
TPS: 200 - 215
Response: 45 - 55 (ms)

12 Threads:
===========
TPS: 180 - 220
Response: 50 - 90 (ms)

16 Threads:
===========
TPS: 60 - 65
Response: 230 - 260 (ms)


It can be explained by CPU-bound processing and 8 cores available; "top" command on SOLR instance was shown 750% - 790% CPU time (8-core) on 3rd step (12 stressing threads), and 200% on 4th step (16 stressing threads) - due probably to Network I/O, Tomcat internals, etc.

It's better to have Apache HTTPD in front of SOLR in production, with proxy_ajp (persistent connections) and HTTP caching enabled; and fine-tune Tomcat threads according to use case.

BTW, my best counters for default SOLR/Lucene were:
TPS: 12
Response: 750ms

"Fuzzy" queries were tuned such a way that distance threshold was less than or equal two. I used "StrikeAMatch" distance...

Thanks,
http://www.tokenizer.ca
+1 416-993-2060(cell)

P.S.
Before performing load-stress tests, I established the baseline in my environment: 1500 TPS by pinging http://x.x.x.x:8080/apache-solr-1.4/admin/ (static JSP).
And, I reached 220TPS for fuzzy search, starting from 12-15TPS (default Lucene/SOLR)...



was (Author: funtick):
After long-run load-stress tests...

I used 2 boxes, one with SOLR, another one with simple multithreaded stress simulator (with randomply generated fuzzy query samples); each box is 2x AMD Opteron 2350 (8 core per box); 64-bit.

I disabled all SOLR caches except Document Cache (I want isolated tests; I want to ignore time taken by disk I/O to load document).

Performance boosted accordingly to number of load-stress threads (on "client" computer), then dropped:

9 Threads:
==========
TPS: 200 - 210
Response: 45 - 50 (ms)

10 Threads:
===========
TPS: 200 - 215
Response: 45 - 55 (ms)

12 Threads:
===========
TPS: 180 - 220
Response: 50 - 90 (ms)

16 Threads:
===========
TPS: 60 - 65
Response: 230 - 260 (ms)


It can be explained by CPU-bound processing and 8 cores available; "top" command on SOLR instance was shown 750% - 790% CPU time (8-core) on 3rd step (12 stressing threads), and 200% on 4th step (16 stressing threads) - due probably to Network I/O, Tomcat internals, etc.

It's better to have Apache HTTPD in front of SOLR in production, with proxy_ajp (persistent connections) and HTTP caching enabled; and fine-tune Tomcat threads according to use case.

BTW, my best counters for default SOLR/Lucene were:
TPS: 12
Response: 750ms

"Fuzzy" queries were tuned such a way that distance threshold was less than or equal two. I used "StrikeAMatch" distance...

Thanks,
http://www.tokenizer.ca
+1 416-993-2060(cell)

> Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
> ----------------------------------------------------------------
>
> Key: LUCENE-2230
> URL: https://issues.apache.org/jira/browse/LUCENE-2230
> Project: Lucene - Java
> Issue Type: Improvement
> Affects Versions: 3.0
> Environment: Lucene currently uses brute force full-terms scanner and calculates distance for each term. New BKTree structure improves performance in average 20 times when distance is 1, and 3 times when distance is 3. I tested with index size several millions docs, and 250,000 terms.
> New algo uses integer distances between objects.
> Reporter: Fuad Efendi
> Attachments: BKTree.java, Distance.java, DistanceImpl.java, FuzzyTermEnumNEW.java, FuzzyTermEnumNEW.java
>
> Original Estimate: 0.02h
> Remaining Estimate: 0.02h
>
> W. Burkhard and R. Keller. Some approaches to best-match file searching, CACM, 1973
> http://portal.acm.org/citation.cfm?doid=362003.362025
> I was inspired by http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees (Nick Johnson, Google).
> Additionally, simplified algorythm at http://www.catalysoft.com/articles/StrikeAMatch.html seems to be much more logically correct than Levenstein distance, and it is 3-5 times faster (isolated tests).
> Big list od distance implementations:
> http://www.dcs.shef.ac.uk/~sam/stringmetrics.htm

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Feb 9, 2010, 1:35 PM

Post #2 of 5 (528 views)
Permalink
[jira] Issue Comment Edited: (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times. [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829163#action_12829163 ]

Fuad Efendi edited comment on LUCENE-2230 at 2/9/10 9:35 PM:
-------------------------------------------------------------

After long-run load-stress tests...

I used 2 boxes, one with SOLR, another one with simple multithreaded stress simulator (with randomply generated fuzzy query samples); each box is 2x AMD Opteron 2350 (8 core per box); 64-bit.

I disabled all SOLR caches except Document Cache (I want isolated tests; I want to ignore time taken by disk I/O to load document).

Performance boosted accordingly to number of load-stress threads (on "client" computer), then dropped:

9 Threads:
==========
TPS: 200 - 210
Response: 45 - 50 (ms)

10 Threads:
===========
TPS: 200 - 215
Response: 45 - 55 (ms)

12 Threads:
===========
TPS: 180 - 220
Response: 50 - 90 (ms)

16 Threads:
===========
TPS: 60 - 65
Response: 230 - 260 (ms)


It can be explained by CPU-bound processing and 8 cores available; "top" command on SOLR instance was shown 750% - 790% CPU time (8-core) on 3rd step (12 stressing threads), and 200% on 4th step (16 stressing threads) - due probably to Network I/O, Tomcat internals, etc.

It's better to have Apache HTTPD in front of SOLR in production, with proxy_ajp (persistent connections) and HTTP caching enabled; and fine-tune Tomcat threads according to use case.

BTW, my best counters for default SOLR/Lucene were:
TPS: 12
Response: 750ms

"Fuzzy" queries were tuned such a way that distance threshold was less than or equal two. I used "StrikeAMatch" distance...

Thanks,
http://www.tokenizer.ca
+1 416-993-2060(cell)

P.S.
Before performing load-stress tests, I established the baseline in my environment: 1500 TPS by pinging http://x.x.x.x:8080/apache-solr-1.4/admin/ (static JSP).
And, I reached 220TPS for fuzzy search, starting from 12-15TPS (default Lucene/SOLR)...

P.P.S.
Distance function must follow 3 'axioms':
{code}
D(a,a) = 0
D(a,b) = D(b,a)
D(a,b) + D(b,c) >= D(a,c)
{code}

And, function must return Integer value.

Otherwise, BKTree will produce wrong results.


Also, it's mentioned somewhere in Levenstein Algo Java Docs (in contrib folder I believe) that instance method runs faster than static method; need to test with Java 6... most probably 'yes', depends on JVM implementation; I can guess only that CPU-internals are better optimized for instance method...

was (Author: funtick):
After long-run load-stress tests...

I used 2 boxes, one with SOLR, another one with simple multithreaded stress simulator (with randomply generated fuzzy query samples); each box is 2x AMD Opteron 2350 (8 core per box); 64-bit.

I disabled all SOLR caches except Document Cache (I want isolated tests; I want to ignore time taken by disk I/O to load document).

Performance boosted accordingly to number of load-stress threads (on "client" computer), then dropped:

9 Threads:
==========
TPS: 200 - 210
Response: 45 - 50 (ms)

10 Threads:
===========
TPS: 200 - 215
Response: 45 - 55 (ms)

12 Threads:
===========
TPS: 180 - 220
Response: 50 - 90 (ms)

16 Threads:
===========
TPS: 60 - 65
Response: 230 - 260 (ms)


It can be explained by CPU-bound processing and 8 cores available; "top" command on SOLR instance was shown 750% - 790% CPU time (8-core) on 3rd step (12 stressing threads), and 200% on 4th step (16 stressing threads) - due probably to Network I/O, Tomcat internals, etc.

It's better to have Apache HTTPD in front of SOLR in production, with proxy_ajp (persistent connections) and HTTP caching enabled; and fine-tune Tomcat threads according to use case.

BTW, my best counters for default SOLR/Lucene were:
TPS: 12
Response: 750ms

"Fuzzy" queries were tuned such a way that distance threshold was less than or equal two. I used "StrikeAMatch" distance...

Thanks,
http://www.tokenizer.ca
+1 416-993-2060(cell)

P.S.
Before performing load-stress tests, I established the baseline in my environment: 1500 TPS by pinging http://x.x.x.x:8080/apache-solr-1.4/admin/ (static JSP).
And, I reached 220TPS for fuzzy search, starting from 12-15TPS (default Lucene/SOLR)...



> Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
> ----------------------------------------------------------------
>
> Key: LUCENE-2230
> URL: https://issues.apache.org/jira/browse/LUCENE-2230
> Project: Lucene - Java
> Issue Type: Improvement
> Affects Versions: 3.0
> Environment: Lucene currently uses brute force full-terms scanner and calculates distance for each term. New BKTree structure improves performance in average 20 times when distance is 1, and 3 times when distance is 3. I tested with index size several millions docs, and 250,000 terms.
> New algo uses integer distances between objects.
> Reporter: Fuad Efendi
> Attachments: BKTree.java, Distance.java, DistanceImpl.java, FuzzyTermEnumNEW.java, FuzzyTermEnumNEW.java
>
> Original Estimate: 0.02h
> Remaining Estimate: 0.02h
>
> W. Burkhard and R. Keller. Some approaches to best-match file searching, CACM, 1973
> http://portal.acm.org/citation.cfm?doid=362003.362025
> I was inspired by http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees (Nick Johnson, Google).
> Additionally, simplified algorythm at http://www.catalysoft.com/articles/StrikeAMatch.html seems to be much more logically correct than Levenstein distance, and it is 3-5 times faster (isolated tests).
> Big list od distance implementations:
> http://www.dcs.shef.ac.uk/~sam/stringmetrics.htm

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Feb 10, 2010, 6:24 AM

Post #3 of 5 (515 views)
Permalink
[jira] Issue Comment Edited: (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times. [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832001#action_12832001 ]

Uwe Schindler edited comment on LUCENE-2230 at 2/10/10 2:22 PM:
----------------------------------------------------------------

Hi Fuad,

thanks for submitting your changed FuzzyQuery. After quickly looking through the classes I found the following problems:

- The cache is incorrectly synchronized: The cache is static but access is synchronized against the instance.
- The cache is not limited, maybe it should be a WeakHashMap. It can easily overflow the memory (as it consumes lots of memory).
- When you build the tree, you use a class from spellchecker: org.apache.lucene.search.spell.LuceneDictionary. This adds an additional memory consumption, esp. if the index has a large term dict. Why not simply iterate over the IndexReaders's TermEnum?
- The cache cannot work correctly with per segment search (since 2.9) or reopened IndexReaders, because it is only bound to the field name but not an index reader. To have a correct cache, do it like FieldCache and use a combined key from field name and IndexReader.getFieldCacheKey().

Else it looks like a good approach, but the memory consumption is immense for large term dicts. We currently develop a DFA-based FuzzyQuery, which will be provided, when the new flex branch gets out: LUCENE-2089

If you fix the problems in your classes, we can add this patch to contrib.

was (Author: thetaphi):
Hi Fuad,

thanks for submitting your changed FuzzyQuery. After quickly looking through the classes I found the following problems:

- The cache is incorrectly synchronized: The cache is static but access is synchronized against the instance.
- The cache is not limited, maybe it should be a WeakHashMap. It can easily overflow the memory (as it consumes lots of memory).
- When you build the tree, you use a class from spellchecker: org.apache.lucene.search.spell.LuceneDictionary. This adds an additional memory consumption, esp. if the index has a large term dict. Why not simply iterate over the IndexReaders's TermEnum?
- The cache cannot work correctly with per segment search (since 2.9) or reopened IndexReaders, because it is only bound to the field name but not an index reader. To have a correct cache, do it like FieldCache and use a combined key from field name and IndexReader.getFieldCacheKey().

Else it looks like a good approach, but the memory consumption is immense for large term dicts. We currently develop a DFA-based FuzzyQuery, which will be provided, when the nex flex branch gets out: LUCENE-2089

If you fix the problems in your classes, we can add this patch to contrib.

> Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
> ----------------------------------------------------------------
>
> Key: LUCENE-2230
> URL: https://issues.apache.org/jira/browse/LUCENE-2230
> Project: Lucene - Java
> Issue Type: Improvement
> Affects Versions: 3.0
> Environment: Lucene currently uses brute force full-terms scanner and calculates distance for each term. New BKTree structure improves performance in average 20 times when distance is 1, and 3 times when distance is 3. I tested with index size several millions docs, and 250,000 terms.
> New algo uses integer distances between objects.
> Reporter: Fuad Efendi
> Attachments: BKTree.java, Distance.java, DistanceImpl.java, FuzzyTermEnumNEW.java, FuzzyTermEnumNEW.java
>
> Original Estimate: 0.02h
> Remaining Estimate: 0.02h
>
> W. Burkhard and R. Keller. Some approaches to best-match file searching, CACM, 1973
> http://portal.acm.org/citation.cfm?doid=362003.362025
> I was inspired by http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees (Nick Johnson, Google).
> Additionally, simplified algorythm at http://www.catalysoft.com/articles/StrikeAMatch.html seems to be much more logically correct than Levenstein distance, and it is 3-5 times faster (isolated tests).
> Big list od distance implementations:
> http://www.dcs.shef.ac.uk/~sam/stringmetrics.htm

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Feb 10, 2010, 9:56 AM

Post #4 of 5 (513 views)
Permalink
[jira] Issue Comment Edited: (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times. [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832096#action_12832096 ]

Fuad Efendi edited comment on LUCENE-2230 at 2/10/10 5:56 PM:
--------------------------------------------------------------

Hi Uwe,


I am trying to study LUCENE-2258 right now...

bq. BKTree contains terms no longer available

BKTree contains objects, not terms; in my sample it contains Strings, new BKTree<String>(new Distance()). It is a structure for fast lookup of close objects from a set of objects, with predefined distance algorithm.

It won't hurt if String appears in BKTree structure, and corresponding Term disappeared from Index; search results will be the same. Simply, search for <DisappearedTerm> OR <AnotherTerm> is the same as search for <AnotherTerm>.
At least, we can run background thread which will create new BKTree instance, without hurting end users.

Yes, Term<->String is another thing to do... I recreate fake terms in TermEnum...



BKTree allows to iterate about 5-10% of whole structure in order to find closest matches only if distance threshold is small, 2. If it is 4, almost no any improvement. And, classic Levenshtein distance is slow...

was (Author: funtick):
Hi Uwe,


I am trying to study Lucene-2258 right now...

bq. BKTree contains terms no longer available

BKTree contains objects, not terms; in my sample it contains Strings, new BKTree<String>(new Distance()). It is a structure for fast lookup of close objects from a set of objects, with predefined distance algorithm.

It won't hurt if String appears in BKTree structure, and corresponding Term disappeared from Index; search results will be the same. Simply, search for <DisappearedTerm> OR <AnotherTerm> is the same as search for <AnotherTerm>.
At least, we can run background thread which will create new BKTree instance, without hurting end users.

Yes, Term<->String is another thing to do... I recreate fake terms in TermEnum...



BKTree allows to iterate about 5-10% of whole structure in order to find closest matches only if distance threshold is small, 2. If it is 4, almost no any improvement. And, classic Levenshtein distance is slow...

> Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
> ----------------------------------------------------------------
>
> Key: LUCENE-2230
> URL: https://issues.apache.org/jira/browse/LUCENE-2230
> Project: Lucene - Java
> Issue Type: Improvement
> Affects Versions: 3.0
> Environment: Lucene currently uses brute force full-terms scanner and calculates distance for each term. New BKTree structure improves performance in average 20 times when distance is 1, and 3 times when distance is 3. I tested with index size several millions docs, and 250,000 terms.
> New algo uses integer distances between objects.
> Reporter: Fuad Efendi
> Attachments: BKTree.java, Distance.java, DistanceImpl.java, FuzzyTermEnumNEW.java, FuzzyTermEnumNEW.java
>
> Original Estimate: 0.02h
> Remaining Estimate: 0.02h
>
> W. Burkhard and R. Keller. Some approaches to best-match file searching, CACM, 1973
> http://portal.acm.org/citation.cfm?doid=362003.362025
> I was inspired by http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees (Nick Johnson, Google).
> Additionally, simplified algorythm at http://www.catalysoft.com/articles/StrikeAMatch.html seems to be much more logically correct than Levenstein distance, and it is 3-5 times faster (isolated tests).
> Big list od distance implementations:
> http://www.dcs.shef.ac.uk/~sam/stringmetrics.htm

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


jira at apache

Feb 10, 2010, 10:22 AM

Post #5 of 5 (512 views)
Permalink
[jira] Issue Comment Edited: (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times. [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832096#action_12832096 ]

Fuad Efendi edited comment on LUCENE-2230 at 2/10/10 6:22 PM:
--------------------------------------------------------------

Hi Uwe,


I am trying to study LUCENE-2258 right now...

bq. BKTree contains terms no longer available

BKTree contains objects, not terms; in my sample it contains Strings, new BKTree<String>(new Distance()). It is a structure for fast lookup of close objects from a set of objects, with predefined distance algorithm.

It won't hurt if String appears in BKTree structure, and corresponding Term disappeared from Index; search results will be the same. Simply, search for <DisappearedTerm> OR <AnotherTerm> is the same as search for <AnotherTerm>.
At least, we can run background thread which will create new BKTree instance, without hurting end users.

Yes, Term<->String is another thing to do... I recreate fake terms in TermEnum...



BKTree allows to iterate about 5-10% of whole structure in order to find closest matches only if distance threshold is small, 2. If it is 4, almost no any improvement. And, classic Levenshtein distance is slow...

Edited: trying to study LUCENE-2089...

was (Author: funtick):
Hi Uwe,


I am trying to study LUCENE-2258 right now...

bq. BKTree contains terms no longer available

BKTree contains objects, not terms; in my sample it contains Strings, new BKTree<String>(new Distance()). It is a structure for fast lookup of close objects from a set of objects, with predefined distance algorithm.

It won't hurt if String appears in BKTree structure, and corresponding Term disappeared from Index; search results will be the same. Simply, search for <DisappearedTerm> OR <AnotherTerm> is the same as search for <AnotherTerm>.
At least, we can run background thread which will create new BKTree instance, without hurting end users.

Yes, Term<->String is another thing to do... I recreate fake terms in TermEnum...



BKTree allows to iterate about 5-10% of whole structure in order to find closest matches only if distance threshold is small, 2. If it is 4, almost no any improvement. And, classic Levenshtein distance is slow...

> Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
> ----------------------------------------------------------------
>
> Key: LUCENE-2230
> URL: https://issues.apache.org/jira/browse/LUCENE-2230
> Project: Lucene - Java
> Issue Type: Improvement
> Affects Versions: 3.0
> Environment: Lucene currently uses brute force full-terms scanner and calculates distance for each term. New BKTree structure improves performance in average 20 times when distance is 1, and 3 times when distance is 3. I tested with index size several millions docs, and 250,000 terms.
> New algo uses integer distances between objects.
> Reporter: Fuad Efendi
> Attachments: BKTree.java, Distance.java, DistanceImpl.java, FuzzyTermEnumNEW.java, FuzzyTermEnumNEW.java
>
> Original Estimate: 0.02h
> Remaining Estimate: 0.02h
>
> W. Burkhard and R. Keller. Some approaches to best-match file searching, CACM, 1973
> http://portal.acm.org/citation.cfm?doid=362003.362025
> I was inspired by http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees (Nick Johnson, Google).
> Additionally, simplified algorythm at http://www.catalysoft.com/articles/StrikeAMatch.html seems to be much more logically correct than Levenstein distance, and it is 3-5 times faster (isolated tests).
> Big list od distance implementations:
> http://www.dcs.shef.ac.uk/~sam/stringmetrics.htm

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene

Lucene java-dev RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.