Mailing List Archive: kinosearch: discuss

How do you index ms office (.doc, .xls, .ppt) files with kinosearch

 

 


ben.aurel at gmail

Aug 25, 2008, 4:12 AM

Post #1 of 3 (4160 views)
How do you index ms office (.doc, .xls, .ppt) files with kinosearch

hi
I've read through most of the documentation trying to understand what
file types KS supports. There is the interesting OSCON presentation at
http://www.rectangular.com/downloads/KinoSearch_OSCON2006.pdf, where
you can find this statement on page 13:

What is KinoSearch not?
...
- Not a file parser
...

So if I get this right, KinoSearch doesn't care about your .doc, .xls,
.ppt files. As much as I personally try to avoid these formats, I think
it's realistic to assume that you have to index such files when
creating something like an intranet search.

My question is, what would you suggest for indexing Office formats?
How do you extract text without OLE and an Office installation on
the server?

thanks in advance
ben

_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch


henka at cityweb

Aug 25, 2008, 6:42 AM

Post #2 of 3 (3913 views)
Re: How do you index ms office (.doc, .xls, .ppt) files with kinosearch

On Mon, August 25, 2008 1:12 pm, Ben Aurel wrote:
> My question is, what would you suggest for indexing Office formats?
> How do you extract text without OLE and an Office installation on
> the server?

You use file conversion utilities such as pdftotext, xlhtml, wvHtml, etc.
Most of these are far from perfect and sometimes crash.
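
Because the converters can crash or hang, it may help to run each one as a
separate process with a timeout and skip the file when the conversion fails.
Here is a minimal Python sketch of that idea. The names extract_text and
DEFAULT_CONVERTERS are hypothetical, and the command-line arguments in the
table are assumptions that should be checked against each tool's man page:

```python
import subprocess
from pathlib import Path

# Hypothetical table: file extension -> converter command line.
# "{path}" is replaced with the input file name. The arguments are
# assumptions; verify them against the tools installed on your server.
DEFAULT_CONVERTERS = {
    ".pdf": ["pdftotext", "{path}", "-"],  # "-" sends the text to stdout
    ".xls": ["xlhtml", "{path}"],          # assumed to write HTML to stdout
}

def extract_text(path, converters=DEFAULT_CONVERTERS, timeout=30):
    """Return the converter's output for path, or None if the file type
    is not in the table or the converter fails, crashes, or times out."""
    template = converters.get(Path(path).suffix.lower())
    if template is None:
        return None
    cmd = [arg.replace("{path}", str(path)) for arg in template]
    try:
        result = subprocess.run(
            cmd, capture_output=True, timeout=timeout, check=True
        )
    except (OSError, subprocess.SubprocessError):
        # Missing binary, non-zero exit status, or timeout: skip this file.
        return None
    return result.stdout.decode("utf-8", errors="replace")
```

The returned string (HTML in the xlhtml case) would still need tags stripped
before it is handed to the indexer, and a file that returns None can simply
be logged and skipped.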

Regards
Henry



peter at peknet

Aug 25, 2008, 7:00 AM

Post #3 of 3 (3915 views)
Re: How do you index ms office (.doc, .xls, .ppt) files with kinosearch

On 08/25/2008 08:42 AM, Henry wrote:
> On Mon, August 25, 2008 1:12 pm, Ben Aurel wrote:
>> My question is, what would you suggest for indexing Office formats?
>> How do you extract text without OLE and an Office installation on
>> the server?
>
> You use file conversion utilities such as pdftotext, xlhtml, wvHtml, etc.
> Most of these are far from perfect and sometimes crash.
>

Also, check out SWISH::Filter on CPAN, which uses many of those tools underneath but
provides a common interface for converting documents to parseable text.
--
Peter Karman . peter [at] peknet . http://peknet.com/

