peter at peknet
Aug 25, 2008, 7:00 AM
Post #3 of 3
Re: How do you index ms office (.doc, .xls, .ppt) files with kinosearch
On 08/25/2008 08:42 AM, Henry wrote:
> On Mon, August 25, 2008 1:12 pm, Ben Aurel wrote:
>> My question is, what would you suggest for indexing Office formats?
>> How do you extract text without OLE and an Office installation on
>> the server?
> You use file conversion utilities such as pdftotext, xlhtml, wvHtml etc.
> Most of these are far from perfect, sometimes crashing, etc.
Also, check out SWISH::Filter on CPAN. It uses many of those tools underneath but
provides a common interface for converting documents to parseable text.
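
Since the converters mentioned above can crash or hang, it helps to run them defensively. Here is a minimal sketch in Python (not KinoSearch or SWISH::Filter code) of one way to wrap an external converter so a failure skips the document instead of killing the indexer. The function names extract_text and pdf_to_text and the 30-second timeout are my own assumptions for illustration. The only tool invocation shown is pdftotext, where "-" as the output file means stdout; check the man pages for xlhtml and wvHtml before adding them.

```python
import subprocess


def extract_text(cmd, timeout=30):
    """Run an external converter command and return its stdout as text.

    Returns None if the tool is missing, hangs past the timeout,
    or exits with a non-zero status (for example after a crash).
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Tool not installed, not executable, or still running at the timeout.
        return None
    if result.returncode != 0:
        # Treat any non-zero exit as a failed conversion.
        return None
    # Converters do not always emit clean UTF-8, so replace bad bytes.
    return result.stdout.decode("utf-8", errors="replace")


def pdf_to_text(path):
    # Hypothetical helper: "-" tells pdftotext to write the text to stdout.
    return extract_text(["pdftotext", path, "-"])
```

A caller would index the returned string when it is not None, and log and skip the file otherwise.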
Peter Karman . peter [at] peknet . http://peknet.com/
KinoSearch mailing list
KinoSearch [at] rectangular