Mailing List Archive: Lucene: Java-User

about PDF / HTML index

 

 



alvaro2882 at yahoo

Jul 15, 2003, 3:21 PM

Post #1 of 3 (1840 views)

I'm using Lucene with TXT and HTML files, and it's working.

The only problem with HTML files is that I have to index them as TXT first, before indexing them as HTML.

Has anyone tried to index PDF files?

I'm trying PDFBox. Are there any samples for indexing PDF files with any of the parsers (PDFBox, JPedal, etc.)? I haven't found any.

Thanks for helping,

Alvaro, from Lima, Peru




pbecker at dstc

Jul 15, 2003, 4:07 PM

Post #2 of 3 (1752 views)
Re: about PDF / HTML index

Hi Alvaro,

There are some examples in our code here. They work with an interface
somewhat similar to the Ant task in the Lucene contributions.


http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/toscanaj/docco/source/org/tockit/docco/indexer/documenthandler/

The actual step of turning it into a Lucene Document happens here:


http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/toscanaj/docco/source/org/tockit/docco/indexer/DocumentProcessingFactory.java?rev=1.30&content-type=text/vnd.viewcvs-markup

This code is still a work in progress, but it does work: we run it on a
few tens of thousands of documents from time to time. Both PDFBox and
Multivalent fail to read some PDF documents in the collection, but so
does Acrobat Reader. We still have to do a more formal test to see which
one does a better job. At the moment we are still coding the core bits;
then we will test properly.

HTH,
Peter
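
A rough sketch of the handler idea described above, for readers who want to see its shape. It is not the Docco code and it does not call Lucene, PDFBox, or Multivalent. Every name in it (HandlerSketch, TextExtractor, toFields, the "path" and "contents" field names) is made up for illustration, and a plain map stands in for a Lucene Document. It picks an extractor by file extension and records files that cannot be read instead of aborting the run, since some PDFs fail in every parser.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class HandlerSketch {

    /** Hypothetical handler contract: turn one file into plain text. */
    interface TextExtractor {
        String extract(Path file) throws IOException;
    }

    /** Maps lower-case file extensions to the extractor responsible for them. */
    static final Map<String, TextExtractor> EXTRACTORS = new HashMap<>();
    static {
        EXTRACTORS.put("txt", file -> readUtf8(file));
        // Crude tag stripping for illustration only; a real HTML parser does far more.
        EXTRACTORS.put("html", file -> stripTags(readUtf8(file)));
        // A "pdf" entry would delegate to a PDF library here (assumption, not shown).
    }

    static String readUtf8(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    /** Replaces every tag with a space, then collapses runs of whitespace. */
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    /** Returns the lower-case extension without the dot, or "" if there is none. */
    static String extensionOf(Path file) {
        String name = file.getFileName().toString();
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /**
     * Builds the field map for one file. Returns null and records a message in
     * failures when no extractor is registered or the file cannot be read.
     */
    static Map<String, String> toFields(Path file, List<String> failures) {
        TextExtractor extractor = EXTRACTORS.get(extensionOf(file));
        if (extractor == null) {
            failures.add(file + ": no handler for this file type");
            return null;
        }
        try {
            Map<String, String> fields = new LinkedHashMap<>();
            fields.put("path", file.toString());
            fields.put("contents", extractor.extract(file));
            return fields;
        } catch (IOException | RuntimeException e) {
            failures.add(file + ": " + e);
            return null;
        }
    }

    public static void main(String[] args) {
        List<String> failures = new ArrayList<>();
        for (String arg : args) {
            Map<String, String> fields = toFields(Paths.get(arg), failures);
            if (fields != null) {
                System.out.println(fields.get("path") + ": "
                        + fields.get("contents").length() + " characters extracted");
            }
        }
        failures.forEach(System.err::println);
    }
}
```

In a real indexer the map would be replaced by a Lucene Document and the extracted text added to it as a field. Keeping the extension lookup separate from the extraction step means a second PDF parser can be swapped in for comparison without touching the rest.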






---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe [at] jakarta
For additional commands, e-mail: lucene-user-help [at] jakarta


ben at csh

Jul 16, 2003, 3:29 AM

Post #3 of 3 (1749 views)
Re: about PDF / HTML index

PDFBox comes with the class
org.pdfbox.searchengine.lucene.LucenePDFDocument, which shows how to
parse and index a PDF document.

Ben


