
Mailing List Archive: Lucene: General

RFO- Indexing 'meaningfull' xml

 

 



wgedeon at gmail

Aug 2, 2008, 4:37 AM

Post #1 of 3 (393 views)
RFO- Indexing 'meaningfull' xml

Hello!

This is a Request for Opinion targeted for the Lucene experts out there :-)

I'm trying to get to know Lucene a bit better: after playing with the
'getting started' material, I moved on to indexing XML files.

The simple (?) project would be to index chat sessions, each session stored
in a file and containing many entries of the form:

<message type="incoming_privateMessage" timestamp="200808021312"
to="someone%40domain1%2Ecom"
from="someoneelse%40domain2%2Ecom"><body>Hello</body></message>

(it's the jabber-client protocol with an added timestamp)
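
A minimal sketch of how one such entry could be turned into named fields before indexing, using only the Java standard library. The class name `MessageFields` and the choice of field names are assumptions for illustration; nothing here is Lucene API. It also assumes the addresses are percent-encoded as in the sample (note that `URLDecoder` additionally turns `+` into a space, which may or may not be wanted for real addresses).

```java
import java.io.StringReader;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper: not part of Lucene or of the original post.
class MessageFields {
    /** Parses a single <message> element into a flat field-name-to-value map. */
    static Map<String, String> parse(String xml) {
        try {
            Element msg = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            Map<String, String> fields = new LinkedHashMap<>();
            fields.put("type", msg.getAttribute("type"));
            fields.put("timestamp", msg.getAttribute("timestamp"));
            // Addresses are percent-encoded in the file (%40 is '@', %2E is '.').
            fields.put("to", URLDecoder.decode(msg.getAttribute("to"), "UTF-8"));
            fields.put("from", URLDecoder.decode(msg.getAttribute("from"), "UTF-8"));
            // The text of the nested <body> element is the searchable content.
            fields.put("body", msg.getTextContent());
            return fields;
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a well-formed message element", e);
        }
    }
}
```

Decoding the addresses up front means a query such as from:someoneelse@domain2.com could match the stored value directly, whichever indexing layout is chosen.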

In addition to the full text search, I'd like to be able to perform searches
such as:
- list sessions from:xxx timestamp:200808*
- list sessions (from:xxx OR from:yyy)
- etc

Would it be better to store each message as a separate document with its
fields, adding the 'filename' (session identifier) as an extra field? Or
is there a better way of doing it, such as making each session file a
document?

All comments appreciated, thanks! :-)

PS: Of course, the actual goal isn't to index chat history (there are many
chat searches available) but to use this to learn the API ;-)


hossman_lucene at fucit

Aug 6, 2008, 3:59 PM

Post #2 of 3 (339 views)
Re: RFO- Indexing 'meaningfull' xml

: In addition to the full text search, I'd like to be able to perform searches
: such as:
: - list sessions from:xxx timestamp:200808*
: - list sessions (from:xxx OR from:yyy)
: - etc
:
: Would it be better to store each message as a separate document with its
: fields, adding the 'filename' (session identifier) as an extra field? or
: maybe is there a better way of doing it making the session file a document?

As a general rule of thumb, you make 1 document for each result you want
to get back when you execute a search ... if you want to be able to search
for "foo" and get back a list of all sessions where the word "foo" was
used, then each session should be a document. If you also want to be able
to search for "foo" and get back a list of each message that contained the
word "foo", then each message can also be a document -- either in another
index, or even in the same index (there's no rule that says all documents
must have the same fields).
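
The two granularities can be sketched in plain Java, with a `Map` standing in for a document. This is only an illustration of the rule of thumb, not Lucene code: the class `DocumentGranularity`, the `text` field, and the use of maps are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model: each Map plays the role of one indexed document.
class DocumentGranularity {
    /** One document per message: the message's own fields plus the session identifier. */
    static List<Map<String, String>> messageDocs(String filename,
                                                 List<Map<String, String>> messages) {
        List<Map<String, String>> docs = new ArrayList<>();
        for (Map<String, String> m : messages) {
            Map<String, String> doc = new LinkedHashMap<>(m);
            doc.put("filename", filename);
            docs.add(doc);
        }
        return docs;
    }

    /** One document per session: the identifier plus all bodies joined for free-text search. */
    static Map<String, String> sessionDoc(String filename,
                                          List<Map<String, String>> messages) {
        StringBuilder text = new StringBuilder();
        for (Map<String, String> m : messages) {
            if (text.length() > 0) {
                text.append('\n');
            }
            text.append(m.getOrDefault("body", ""));
        }
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("filename", filename);
        doc.put("text", text.toString());
        return doc;
    }
}
```

Both kinds of map can sit in the same collection even though their field sets differ, which mirrors the point that documents in one index need not share the same fields.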

BTW: If you are planning on experimenting with the Java API, I would
suggest sending any specific follow-up questions to the java-user[at]lucene
list. But you may also want to consider checking out Solr, and the
solr-user list. It depends on what level of abstraction you want to deal
with (Solr provides a config-based, web-service-style front end for dealing
with Lucene indexes, but also has a Java API both for indexing and for
hooking in custom functionality when executing searches).


-Hoss


wgedeon at gmail

Aug 7, 2008, 6:39 AM

Post #3 of 3 (338 views)
Re: RFO- Indexing 'meaningfull' xml

Hello Hoss,
Thanks for your reply :-)
I believe I'm in the first case: "to be able to search for 'foo' and get
back a list of all sessions where the word 'foo' was used". However, I want
to be able to separate free text search from field-based search.

I have indexed both the sessions and the messages as documents: the session
documents for free text search and the message documents for field-based
search. The algorithm that I've ended up using since I posted the initial
message is:
o execute the search on both the message and the session documents
o from all hits, construct a list of the 'filename's that match and show
the 10 newest results.
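
The collapsing step described above could look roughly like this in plain Java. It is a sketch under assumptions: the `Hit` class is hypothetical (it is not Lucene's hit type), every hit is assumed to carry a stored 'filename' and 'timestamp', and the timestamp is assumed to be a fixed-width yyyyMMddHHmm string so that string order matches time order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: reduces raw hits to a short list of session filenames.
class SessionCollapser {
    /** A search hit reduced to the two stored fields this step needs. */
    static final class Hit {
        final String filename;
        final String timestamp; // assumed fixed-width yyyyMMddHHmm

        Hit(String filename, String timestamp) {
            this.filename = filename;
            this.timestamp = timestamp;
        }
    }

    /** Returns up to 'limit' distinct filenames, ordered by their newest hit first. */
    static List<String> newestSessions(List<Hit> hits, int limit) {
        // Remember the newest timestamp seen for each session file.
        Map<String, String> newest = new HashMap<>();
        for (Hit h : hits) {
            newest.merge(h.filename, h.timestamp, (a, b) -> a.compareTo(b) >= 0 ? a : b);
        }
        List<String> filenames = new ArrayList<>(newest.keySet());
        // Newest first; ties are broken by filename to keep the order stable.
        filenames.sort((a, b) -> {
            int byTime = newest.get(b).compareTo(newest.get(a));
            return byTime != 0 ? byTime : a.compareTo(b);
        });
        return new ArrayList<>(filenames.subList(0, Math.min(limit, filenames.size())));
    }
}
```

Because this walks every hit, its cost grows with the size of the hit set rather than with the 10 results shown.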

This works, but I'm afraid it won't perform well once I end up indexing
all sessions. There must be a way to get the right hit set directly from a
search.

In any case, I'm looking at Solr for potential answers, thanks for
mentioning it :-)

Ta.
Jo

On Thu, Aug 7, 2008 at 12:59 AM, Chris Hostetter
<hossman_lucene[at]fucit.org> wrote:

>
> : In addition to the full text search, I'd like to be able to perform
> searches
> : such as:
> : - list sessions from:xxx timestamp:200808*
> : - list sessions (from:xxx OR from:yyy)
> : - etc
> :
> : Would it be better to store each message as a separate document with its
> : fields, adding the 'filename' (session identifier) as an extra field? or
> : maybe is there a better way of doing it making the session file a
> document?
>
> As a general rule of thumb, you make 1 document for each result you want
> to get back when you execute a search ... if you want to be able to search
> for "foo" and get back a list of all sessions where the word "foo" was
> used, then each session should be a document. If you also want to be able
> to search for "foo" and get back a list of each message that contained the
> word "foo", then each message can also be a document -- either in another
> index, or even in the same index (there's no rule that says all documents
> must have the same fields).
>
> BTW: If you are planning on experimenting with the Java API, I would
> suggest sending any specific follow-up questions to the java-user[at]lucene
> list. But you may also want to consider checking out Solr, and the
> solr-user list. It depends on what level of abstraction you want to deal
> with (Solr provides a config-based, web-service-style front end for dealing
> with Lucene indexes, but also has a Java API both for indexing and for
> hooking in custom functionality when executing searches).
>
>
> -Hoss
>
>
