Mailing List Archive: Lucene: Java-Dev

Search agents

 

 



kalle at snigel

Jan 4, 2006, 6:03 AM

Post #1 of 4
Search agents

Hello list,

I wrote a search agent thingy for Lucene. It was built to handle huge
numbers of agents.

Rather than one query per agent to find out if the new document is
interesting or not, agent trigger queries are stored in an index that
is queried with the tokens of a new document.

Since it uses the index a bit backwards, the agent trigger queries
are somewhat limited:

At least one token in an OR or FUZZY OR per agent field must match the
new document.
Any NOT token in an agent must not match the new document.
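
To illustrate the two rules above, here is a minimal plain-Java sketch of the reversed lookup. It is not the code described in this post and does not use Lucene: the class AgentIndexSketch, its methods, and the "field:token" key format are all made up for illustration, and FUZZY OR is left out.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: agents are indexed by (field, token) and looked up with document tokens. */
final class AgentIndexSketch {
    // "field:token" -> ids of agents listing that token in an OR clause
    private final Map<String, Set<String>> orPostings = new HashMap<>();
    // "field:token" -> ids of agents listing that token in a NOT clause
    private final Map<String, Set<String>> notPostings = new HashMap<>();
    // agent id -> number of fields that need at least one OR hit
    private final Map<String, Integer> requiredFields = new HashMap<>();

    /** Builds a field -> tokens map from alternating field, token arguments. */
    static Map<String, Set<String>> fields(String... fieldTokenPairs) {
        Map<String, Set<String>> out = new HashMap<>();
        for (int i = 0; i + 1 < fieldTokenPairs.length; i += 2) {
            out.computeIfAbsent(fieldTokenPairs[i], k -> new HashSet<>()).add(fieldTokenPairs[i + 1]);
        }
        return out;
    }

    /** Registers an agent. Agents without any OR clause are never returned by this sketch. */
    void add(String id, Map<String, Set<String>> anyOf, Map<String, Set<String>> noneOf) {
        requiredFields.put(id, anyOf.size());
        post(orPostings, id, anyOf);
        post(notPostings, id, noneOf);
    }

    private static void post(Map<String, Set<String>> postings, String id,
                             Map<String, Set<String>> clauses) {
        clauses.forEach((field, tokens) -> tokens.forEach(token ->
                postings.computeIfAbsent(field + ":" + token, k -> new HashSet<>()).add(id)));
    }

    /** Returns the ids of agents triggered by the tokens of one new document. */
    Set<String> match(Map<String, Set<String>> document) {
        Map<String, Set<String>> hitFields = new HashMap<>(); // agent id -> fields with an OR hit
        Set<String> vetoed = new HashSet<>();                  // agents whose NOT token appeared
        document.forEach((field, tokens) -> {
            for (String token : tokens) {
                String key = field + ":" + token;
                for (String id : orPostings.getOrDefault(key, Collections.emptySet())) {
                    hitFields.computeIfAbsent(id, k -> new HashSet<>()).add(field);
                }
                vetoed.addAll(notPostings.getOrDefault(key, Collections.emptySet()));
            }
        });
        Set<String> triggered = new TreeSet<>();
        hitFields.forEach((id, hits) -> {
            // Every OR field needs a hit, and no NOT token may have been seen.
            if (!vetoed.contains(id) && hits.size() == requiredFields.get(id)) {
                triggered.add(id);
            }
        });
        return triggered;
    }

    public static void main(String[] args) {
        AgentIndexSketch index = new AgentIndexSketch();
        index.add("a1",
                fields("category", "media", "name", "hotel", "name", "rowanda"),
                fields("name", "paradise"));
        System.out.println(index.match(fields("category", "media", "name", "hotel"))); // prints [a1]
        System.out.println(index.match(
                fields("category", "media", "name", "hotel", "name", "paradise"))); // prints []
    }
}
```

The lookup cost depends on the number of tokens in the new document rather than on the number of agents, which is the point of storing the triggers in an index.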

It is fairly easy to add more query types, but they are limited to
single-token and non-wildcard types, since the query is created from
the new document's tokens.

Agents are clustered by the fields each agent requires, and each
cluster is stored in its own index. When a new document is sent to the
AgentManager, it creates one query per possible cluster. I'm not sure
this actually speeds things up; it's just a gut feeling.
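
A rough sketch of the clustering idea, assuming that an agent's "required fields" are the fields carrying an OR clause and that a cluster is "possible" when the new document contains every one of them. AgentClusterSketch and its method names are hypothetical; the real AgentManager may work differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical sketch: group agents by the set of fields they require. */
final class AgentClusterSketch {
    // cluster key = set of required fields; value = ids of the agents in that cluster
    private final Map<SortedSet<String>, List<String>> clusters = new HashMap<>();

    void add(String agentId, Collection<String> requiredFields) {
        clusters.computeIfAbsent(new TreeSet<>(requiredFields), k -> new ArrayList<>()).add(agentId);
    }

    /** Keys of the clusters whose required fields all occur in the document. */
    List<SortedSet<String>> possibleClusters(Set<String> documentFields) {
        List<SortedSet<String>> possible = new ArrayList<>();
        for (SortedSet<String> key : clusters.keySet()) {
            if (documentFields.containsAll(key)) {
                possible.add(key);
            }
        }
        return possible;
    }

    /** Ids of the agents in the possible clusters, the only ones still worth querying. */
    Set<String> candidateAgents(Set<String> documentFields) {
        Set<String> candidates = new TreeSet<>();
        for (SortedSet<String> key : possibleClusters(documentFields)) {
            candidates.addAll(clusters.get(key));
        }
        return candidates;
    }

    public static void main(String[] args) {
        AgentClusterSketch sketch = new AgentClusterSketch();
        sketch.add("a1", Arrays.asList("category", "name"));
        sketch.add("a2", Arrays.asList("name"));
        sketch.add("a3", Arrays.asList("price"));
        Set<String> documentFields = new TreeSet<>(Arrays.asList("category", "name"));
        System.out.println(sketch.candidateAgents(documentFields)); // prints [a1, a2]
    }
}
```

Whether skipping clusters pays off depends on how many distinct field combinations exist; with only a handful of clusters the saving is likely small.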

Example agents in pseudo trigger query language:

Possible agent:

AND (OR ("category","media"))
AND (OR ("name", "hotel") OR ("name","rowanda"))
AND (NOT("name", "paradise"))

Impossible agent:

AND (OR ("category","media"))
AND (("name", "hotel") AND ("name","rowanda"))
AND (NOT("name", "paradise"))

In effect the agents can't trigger on AND queries of the same field.

One could of course run a more complex query against the new document
as the agent trigger, or use some classifier or whatever, if speed is
not a big deal. The agent triggers could then be built from the
original query. I probably won't implement such a thing myself.

Should I post the code to the sandbox when I've tested it? Are there
any restrictions to the code if I do that?

--
karl

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


erik at ehatchersolutions

Jan 4, 2006, 6:41 AM

Post #2 of 4
Re: Search agents [In reply to]

Karl,

Have you considered the MemoryIndex for this sort of thing? I've
thought that it would make for an elegant way to handle this sort of
"agent" or notification service such that new documents get indexed
normally, but also a single document goes into a MemoryIndex and is
matched against many queries.

It would be great to see the code you've developed. You are free to
contribute it to the Lucene contrib codebase. If it is a substantial
contribution it needs further discussion and perhaps incubation for
it to be accepted. There are no restrictions other than the Apache
Software License on the code in the contrib area. The "sandbox" is a
deprecated term for what we now call "contrib".

Erik




markharw00d at yahoo

Jan 4, 2006, 6:50 AM

Post #3 of 4
Re: Search agents [In reply to]

Yes, I've found MemoryIndex to be very fast for this
kind of thing. This contribution can be used to
further optimize and shortlist the queries to be run
against the new document sitting in the MemoryIndex.
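
One way to picture this shortlist-then-verify flow in plain Java: the trigger index supplies a shortlist of agent ids, and only those agents have their full query run against the single new document. The Predicate here is only a stand-in for running a real query against a MemoryIndex, and ShortlistThenVerifySketch, allOf and verify are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical sketch: run full queries only for agents that were shortlisted. */
final class ShortlistThenVerifySketch {

    /** Stand-in for a full query: true when the field holds every listed token. */
    static Predicate<Map<String, Set<String>>> allOf(String field, String... tokens) {
        List<String> required = Arrays.asList(tokens);
        return doc -> doc.getOrDefault(field, Collections.emptySet()).containsAll(required);
    }

    /** Runs the full query of each shortlisted agent against the one new document. */
    static List<String> verify(Collection<String> shortlist,
                               Map<String, Predicate<Map<String, Set<String>>>> fullQueries,
                               Map<String, Set<String>> document) {
        List<String> triggered = new ArrayList<>();
        for (String agentId : shortlist) {
            Predicate<Map<String, Set<String>>> query = fullQueries.get(agentId);
            if (query != null && query.test(document)) {
                triggered.add(agentId);
            }
        }
        return triggered;
    }

    public static void main(String[] args) {
        Map<String, Predicate<Map<String, Set<String>>>> fullQueries = new HashMap<>();
        fullQueries.put("a1", allOf("name", "hotel", "rowanda"));  // AND within one field
        fullQueries.put("a2", allOf("name", "hotel", "paradise"));

        Map<String, Set<String>> document = new HashMap<>();
        document.put("name", new HashSet<>(Arrays.asList("hotel", "rowanda")));

        // Pretend the trigger index shortlisted both agents because each lists "hotel".
        System.out.println(verify(Arrays.asList("a1", "a2"), fullQueries, document)); // prints [a1]
    }
}
```

With this split, the second stage can evaluate query types the trigger index cannot express, such as an AND within a single field.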






wolfgang.hoschek at mac

Jan 4, 2006, 12:53 PM

Post #4 of 4
Re: Search agents [In reply to]

If you'd consider using a MemoryIndex for this, I'd also recommend
having a look at nux.xom.pool.FullTextUtil and
nux.xom.pool.FullTextPool, which add smart caching for indexes, queries
and results on top of a MemoryIndex. With some luck this (or some
variant of it) could help speed up your use cases, at least as far as
I can gather.

[It's part of the Nux download]

Wolfgang.

Snippet from the javadoc:

/**
 * Thread-safe XQuery/XPath fulltext search utilities; implemented with the
 * Lucene engine and a custom high-performance adapter for
 * on-the-fly main memory indexing with smart caching for indexes, queries and results.
 * <p>
 * Complementing the standard XPath string and regular
 * expression matching functionality, Lucene has a powerful query syntax with support
 * for word stemming, fuzzy searches, similarity searches, approximate searches,
 * boolean operators, wildcards, grouping, range searches, term boosting, etc.
 * For details see the <a target="_blank"
 * href="http://lucene.apache.org/java/docs/queryparsersyntax.html">Lucene Query
 * Syntax and Examples</a>.
 * Also see {@link org.apache.lucene.index.memory.MemoryIndex}
 * and {@link PatternAnalyzer} for detailed documentation.
 * <p>
 * Example Java usage:
 * <pre>
 * Analyzer analyzer = PatternAnalyzer.DEFAULT_ANALYZER;
 * float score = FullTextUtil.match(
 *     "Readings about Salmons and other select Alaska fishing Manuals",
 *     "+salmon~ +fish* manual~",
 *     analyzer, analyzer);
 * if (score &gt; 0.0f) {
 *     // query matches text
 * } else {
 *     // query does not match text
 * }
 * </pre>


