
peter at peknet
Aug 25, 2008, 7:00 AM
Post #3 of 3
Re: How do you index ms office (.doc, .xls, .ppt) files with kinosearch
On 08/25/2008 08:42 AM, Henry wrote:
> On Mon, August 25, 2008 1:12 pm, Ben Aurel wrote:
>> My question is, what would you suggest for indexing office formats ?
>> How do you extract text without ole and an office installation on
>> the server?
>
> You use file conversion utilities such as pdftotext, xlhtml, wvHtml etc.
> Most of these are far from perfect, sometimes crashing, etc.
>

Also, check out SWISH::Filter on CPAN, which uses many of those tools
underneath but provides a common interface for converting documents to
parseable text.

-- 
Peter Karman . peter [at] peknet . http://peknet.com/

_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch
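Since the converters mentioned above can crash or hang, it may help to wrap each call defensively before handing the text to the indexer. What follows is only a rough sketch in Python, not anything KinoSearch or SWISH::Filter provides: the function name extract_text and the 30-second timeout are made up for illustration, and the only converter invocation assumed is "pdftotext FILE -", which writes the extracted text to stdout.

```python
import subprocess
import sys


def extract_text(cmd, timeout=30):
    """Run an external converter and return its stdout as text.

    Returns None if the converter is missing, exits with a non-zero
    status (for example after a crash), or runs longer than `timeout`
    seconds. `cmd` is an argument list, e.g. ["pdftotext", "a.pdf", "-"].
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Binary not found / not executable, or the converter hung.
        return None
    if result.returncode != 0:
        # Treat any failure exit status as "no text for this file".
        return None
    # Converters do not always emit clean UTF-8; replace bad bytes
    # rather than aborting the whole indexing run.
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python extract.py report.pdf
    text = extract_text(["pdftotext", sys.argv[1], "-"])
    print(text if text is not None else "conversion failed")
```

A file for which this returns None can simply be skipped or logged, so one bad document does not stop the whole indexing run. Other converters would need their own argument lists, which are not shown here.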