oro30 at eecs
Jul 20, 2011, 7:59 PM
Post #3 of 3
On 20/07/11 22:32, Simon Willnauer wrote:
> On Wed, Jul 20, 2011 at 3:17 PM, raphael812 <oro30 [at] eecs> wrote:
>> Hello everyone,
>> I am quite new to Lucene and I am using the book Lucene in Action to learn.
>> I need help extracting the body content of an HTML page using Tika. The
>> implementation from the book only extracts the HTML's metadata, not the main
>> body content, which is what I need. Is it possible to extract body content
>> from HTML and PDF files, and if so, how?
>> Thanks for your help.
> This seems to be a Tika / extraction-specific question. You should
> try asking it on the Tika list; I bet you will get a quick
> response there!
I tried searching an index I created, but it gives me the following
error in NetBeans 6.9:
Exception in thread "main"
org.apache.lucene.index.CorruptIndexException: Unknown format version: -11
The trouble is that I am able to search the same index from the command
line. Does anyone have an idea why this is so? It was working some weeks
ago in NetBeans, and now it throws this error.
Thanks for the help.
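
One possible cause, not confirmed anywhere in this thread, is that the NetBeans project and the command line load different Lucene jars, so that the index was written by a newer Lucene than the one NetBeans uses to read it. A minimal sketch for checking this is below. The class name ClassOrigin and the method describe are hypothetical helpers, not part of Lucene. The sketch uses only standard JDK calls and prints the jar a class was loaded from, plus the implementation version if the jar's manifest records one. Running it once from NetBeans and once from the command line would show whether the two locations differ.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper, not part of Lucene.
public class ClassOrigin {

    /** Describes where the named class is loaded from on the current class path. */
    static String describe(String className) {
        try {
            // Load without initializing, so no static code in the target class runs.
            Class<?> c = Class.forName(className, false, ClassOrigin.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            // Classes that ship with the JDK itself have no code source.
            String where = (src == null || src.getLocation() == null)
                    ? "(bootstrap class path)"
                    : src.getLocation().toString();
            // The version comes from the jar manifest and may be absent.
            Package p = c.getPackage();
            String version = (p == null) ? null : p.getImplementationVersion();
            return className + " <- " + where
                    + ", version " + (version == null ? "unknown" : version);
        } catch (ClassNotFoundException e) {
            return className + " is not on the class path";
        }
    }

    public static void main(String[] args) {
        // The class named in the stack trace's package; any Lucene class would do.
        System.out.println(describe("org.apache.lucene.index.IndexReader"));
    }
}
```

If the two runs report different jar locations or versions, pointing the NetBeans project at the same Lucene jar that the command line uses would be the first thing to try.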