
Mailing List Archive: Lucene: General

Indexing with Lucene

 

 



oro30 at eecs

Jul 20, 2011, 6:17 AM

Post #1 of 3 (237 views)
Indexing with Lucene

Hello everyone,

I am quite new to Lucene and am using the book Lucene in Action to learn.
I need help extracting the body content of an HTML page using Tika. The
implementation from the book extracts only the HTML metadata, not the main
body content, which is what I need. Is it possible to extract the body
content from HTML and PDF files, and if so, how?
Thanks for your help.

Raphael

--
View this message in context: http://lucene.472066.n3.nabble.com/Indexing-with-Lucene-tp3185409p3185409.html
Sent from the Lucene - General mailing list archive at Nabble.com.


simon.willnauer at googlemail

Jul 20, 2011, 2:32 PM

Post #2 of 3 (219 views)
Re: Indexing with Lucene [In reply to]

On Wed, Jul 20, 2011 at 3:17 PM, raphael812 <oro30 [at] eecs> wrote:
> Hello everyone,
>
> I am quite new to Lucene and am using the book Lucene in Action to learn.
> I need help extracting the body content of an HTML page using Tika. The
> implementation from the book extracts only the HTML metadata, not the main
> body content, which is what I need. Is it possible to extract the body
> content from HTML and PDF files, and if so, how?
> Thanks for your help.

Hey,
this seems to be a Tika/extraction-specific question. You should
try asking it on the Tika list; I bet you'll get a quick
response there!

simon


oro30 at eecs

Jul 20, 2011, 7:59 PM

Post #3 of 3 (244 views)
Re: Indexing with Lucene [In reply to]

Hello all,
I tried searching an index I created, but it gives me the
following error in NetBeans 6.9:
Exception in thread "main" org.apache.lucene.index.CorruptIndexException: Unknown format version: -11
    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:249)
    at org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:73)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:677)
    at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:69)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:316)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:202)
    at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:63)
    at Searcher.search(Searcher.java:66)
    at Searcher.main(Searcher.java:59)

The trouble is that I can search that same index from the command
line. Does anyone have an idea why this is so? It was working some weeks
ago in NetBeans, and now it throws this error.
Thanks for the help.
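
[Editor's note: the thread does not record a resolution. One plausible cause, offered here only as an assumption, is that the NetBeans project and the command line load different Lucene jars: an older Lucene would reject a segments file written by a newer release with an "Unknown format version" error. The sketch below is a hypothetical diagnostic, not something posted in the thread. The class name WhichJar and its method names are invented for illustration. It uses only standard JDK calls to print where a class was loaded from, so it can be run in both environments and the results compared.]

```java
import java.security.CodeSource;
import java.security.ProtectionDomain;

// Hypothetical helper (not from the thread): reports which jar or directory
// a class was loaded from, to compare the IDE classpath with the command line.
public class WhichJar {

    /** Returns the code-source location of a loaded class, or "(bootstrap)" for JDK classes. */
    public static String locationOfClass(Class<?> cls) {
        ProtectionDomain domain = cls.getProtectionDomain();
        CodeSource source = (domain == null) ? null : domain.getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "(bootstrap)";
        }
        return source.getLocation().toString();
    }

    /** Looks a class up by name without initializing it, then reports its location. */
    public static String locationOfName(String className) {
        try {
            Class<?> cls = Class.forName(className, false, WhichJar.class.getClassLoader());
            return locationOfClass(cls);
        } catch (ClassNotFoundException e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        // Default to the Lucene class that appears in the stack trace above.
        String name = (args.length > 0) ? args[0] : "org.apache.lucene.index.IndexReader";
        System.out.println(name + " -> " + locationOfName(name));
    }
}
```

[If the two runs print different Lucene jar paths, aligning the NetBeans project libraries with the jar used on the command line would be the first thing to try. If they print the same path, this assumption does not explain the error.]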
