
jira at apache
Jun 30, 2008, 5:48 PM
[jira] Updated: (LUCENE-1320) ShingleMatrixFilter, a three dimensional permutating shingle filter
[ https://issues.apache.org/jira/browse/LUCENE-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated LUCENE-1320:
--------------------------------

    Attachment: LUCENE-1320.txt

This works pretty well, I'll commit it soon.

* javadocs
* improved default shingle token weights (still not that great)

Also did some optimization and refactoring, which resulted in nicer-looking code in the tests and:

* PrefixAwareTokenFilter
* PrefixAndSuffixAwareTokenFilter
* SingleTokenTokenStream

{code:java}
/**
 * Joins two token streams and leaves the last token of the prefix stream available
 * to be used when updating the token values in the second stream based on that token.
 */
public class PrefixAwareTokenFilter extends TokenStream {

  /**
   * The default implementation adds the last prefix token's end offset
   * to the suffix token's start and end offsets.
   */
  public Token updateSuffixToken(Token suffixToken, Token lastPrefixToken) {
    // ...
  }
}
{code}

{code:java}
/** Links two PrefixAwareTokenFilter instances. */
public class PrefixAndSuffixAwareTokenFilter extends TokenStream {

  public Token updateInputToken(Token inputToken, Token lastPrefixToken) {
    // ...
  }

  public Token updateSuffixToken(Token suffixToken, Token lastInputToken) {
    // ...
  }
}
{code}

> ShingleMatrixFilter, a three dimensional permutating shingle filter
> -------------------------------------------------------------------
>
>                 Key: LUCENE-1320
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1320
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>   Affects Versions: 2.3.2
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>         Attachments: LUCENE-1320.txt, LUCENE-1320.txt
>
>
> Backed by a column-focused matrix that creates all permutations of shingle tokens in three dimensions, i.e. it handles multi-token synonyms.
> Could, for instance, in some cases be used to replace 0-slop phrase queries with something speedier.
> {code:java}
> Token[][][]{
>   {{hello}, {greetings, and, salutations}},
>   {{world}, {earth}, {tellus}}
> }
> {code}
> passes the following test with 2-3 grams:
> {code:java}
> assertNext(ts, "hello_world");
> assertNext(ts, "greetings_and");
> assertNext(ts, "greetings_and_salutations");
> assertNext(ts, "and_salutations");
> assertNext(ts, "and_salutations_world");
> assertNext(ts, "salutations_world");
> assertNext(ts, "hello_earth");
> assertNext(ts, "and_salutations_earth");
> assertNext(ts, "salutations_earth");
> assertNext(ts, "hello_tellus");
> assertNext(ts, "and_salutations_tellus");
> assertNext(ts, "salutations_tellus");
> {code}
> Contains tests of varying complexity that demonstrate offsets, posincr, payload boost calculation and construction of a matrix from a token stream.
> The matrix attempts to hog as little memory as possible by seeking no more than maximumShingleSize columns forward in the stream and clearing up unused resources (columns and unique token sets). Can still be optimized quite a bit though.
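For readers trying to follow the permutation order in the quoted test, here is a minimal sketch of the idea. It is not the code in the attached patch. It assumes plain strings instead of Token objects, ignores offsets, position increments and payload weights, and holds the whole matrix in memory rather than reading maximumShingleSize columns ahead in a stream. The class name ShingleMatrixSketch and the method shingles are hypothetical, made up for this illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of the LUCENE-1320 patch.
class ShingleMatrixSketch {

    /**
     * matrix[column][row][token]: each column holds alternative rows (synonyms),
     * and each row holds one or more tokens. Assumes every column has at least one row.
     * Returns the unique shingles of minSize..maxSize tokens, in first-seen order.
     */
    static List<String> shingles(String[][][] matrix, int minSize, int maxSize) {
        Set<String> seen = new LinkedHashSet<>();  // drops duplicates, keeps order
        int[] rowIndex = new int[matrix.length];   // row currently selected in each column
        boolean more = matrix.length > 0;
        while (more) {
            // Flatten the selected row of every column into one token sequence.
            List<String> tokens = new ArrayList<>();
            for (int c = 0; c < matrix.length; c++) {
                for (String token : matrix[c][rowIndex[c]]) {
                    tokens.add(token);
                }
            }
            // Emit every shingle of the allowed sizes, start position first, then size.
            for (int start = 0; start < tokens.size(); start++) {
                for (int size = minSize; size <= maxSize && start + size <= tokens.size(); size++) {
                    seen.add(String.join("_", tokens.subList(start, start + size)));
                }
            }
            // Advance to the next row permutation; the first column varies fastest.
            int c = 0;
            while (c < matrix.length && ++rowIndex[c] == matrix[c].length) {
                rowIndex[c] = 0;
                c++;
            }
            more = c < matrix.length;
        }
        return new ArrayList<>(seen);
    }
}
```

Calling shingles with the matrix from the description and sizes 2 to 3 yields the same twelve shingles, in the same order, as the assertNext calls above. Duplicates such as the second "greetings_and" are dropped by the set, which stands in for the unique token sets mentioned in the description.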