
Mailing List Archive: Lucene: General

is solr the right choice for my pdf indexing purpose?

 

 



pierluca.sangiorgi at gmail

Jun 11, 2012, 11:42 AM

Post #1 of 3
is solr the right choice for my pdf indexing purpose?

Hi, I'm new to Solr and I want to know if it's the right choice
for my problem.
I need to index PDF documents stored on the filesystem and run queries over them.
So I used Solr with SolrJ and the ExtractingRequestHandler, and everything
works, but I'm not interested in indexing the PDF metadata, only in the
content text of the documents.
I saw that the content is indexed entirely in a single field
("attr_content" in my case), but what I want is to index fields that
are contained inside that content.

For example: I have a PDF document that contains an invoice. I need to
extract and index information such as the recipient, price, sold
items, item descriptions, and so on.

Is Solr the right choice for this purpose, or do I need to use another
framework in addition before posting the documents to Solr?

thanks in advance


loompa at gmail

Jun 11, 2012, 2:33 PM

Post #2 of 3
Re: is solr the right choice for my pdf indexing purpose?

Hi,

On 11 June 2012 19:42, Pierluca Sangiorgi <pierluca.sangiorgi [at] gmail> wrote:

> For example: I have a PDF document that contains an invoice. I need to
> extract and index information such as the recipient, price, sold
> items, item descriptions, and so on.
>
> Is Solr the right choice for this purpose, or do I need to use another
> framework in addition before posting the documents to Solr?

Solr is a good choice, especially if you want to start to leverage the
power of search, but you will need to do a bit of work beforehand if you
want to split the information out so that you can make the best use of
it later.

To achieve this you will first want to update the schema.xml [1] to model
your target fields, i.e. the ones you mention above.
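
For instance, you might add something roughly like this to schema.xml (just
a sketch - the field names and types here are placeholders, not taken from
your setup):

  <field name="recipient" type="string" indexed="true" stored="true"/>
  <field name="price" type="float" indexed="true" stored="true"/>
  <field name="sold_item" type="text_general" indexed="true" stored="true" multiValued="true"/>
  <field name="item_description" type="text_general" indexed="true" stored="true" multiValued="true"/>
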
You will then need to parse the PDF documents to get the contents, using
something like Apache PDFBox [2] - good if the documents are Acrobat forms,
as you can read the form field contents directly - or Apache Tika [3] - if
you want the text as a String.
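
In Java that could look roughly like this (an untested sketch - "invoice.pdf"
is a placeholder, and the PDFBox form API varies a little between versions):

import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDField;
import org.apache.tika.Tika;

public class ExtractPdf {
    public static void main(String[] args) throws Exception {
        File pdf = new File("invoice.pdf"); // placeholder path

        // Tika: extract the whole textual content of the PDF as one String.
        String content = new Tika().parseToString(pdf);
        System.out.println(content);

        // PDFBox: if the PDF is an Acrobat form, read the field values directly.
        PDDocument doc = PDDocument.load(pdf);
        PDAcroForm form = doc.getDocumentCatalog().getAcroForm();
        if (form != null) {
            for (Object f : form.getFields()) {
                PDField field = (PDField) f;
                System.out.println(field.getFullyQualifiedName() + " = " + field.getValue());
            }
        }
        doc.close();
    }
}
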
This will allow you to extract the field values from the content using
pattern matching. The fields can then be added to a document and posted to
Solr using SolrJ.
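
And then something along these lines to pull the values out and post them
(again only an illustrative sketch: the regexes, field names, id and URL are
invented, and HttpSolrServer is the SolrJ 3.6 class - older releases use
CommonsHttpSolrServer):

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexInvoice {
    public static void main(String[] args) throws Exception {
        String content = "..."; // the text you extracted with Tika or PDFBox

        // Pattern-match the interesting values out of the raw text; these
        // regexes are made up and would have to match your real invoice layout.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "invoice-0001");

        Matcher m = Pattern.compile("Recipient:\\s*(.+)").matcher(content);
        if (m.find()) {
            doc.addField("recipient", m.group(1).trim());
        }
        m = Pattern.compile("Total:\\s*([0-9]+\\.[0-9]{2})").matcher(content);
        if (m.find()) {
            doc.addField("price", Float.parseFloat(m.group(1)));
        }

        // Post the document to Solr with SolrJ.
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr");
        server.add(doc);
        server.commit();
    }
}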

Cheers,
Dave

[1] http://wiki.apache.org/solr/SchemaXml
[2] http://pdfbox.apache.org/
[3] http://tika.apache.org


pierluca.sangiorgi at gmail

Jun 12, 2012, 1:38 AM

Post #3 of 3
Re: is solr the right choice for my pdf indexing purpose?

> To achieve this you will first want to update the schema.xml [1] to model
> your target fields, i.e. the ones you mention above. You will then need to
> parse the PDF documents to get the contents, using something like Apache
> PDFBox [2] - good if the documents are Acrobat forms, as you can read the
> form field contents directly - or Apache Tika [3] - if you want the text
> as a String. This will allow you to extract the field values from the
> content using pattern matching. The fields can then be added to a document
> and posted to Solr using SolrJ.

Thanks for the answer.
I'm currently using the Solr Cell update request handler
(ExtractingRequestHandler) via a ContentStreamUpdateRequest in SolrJ, so
Tika is used "automatically", but it extracts the content directly into a
single field.
So I have to use Tika in a "standalone" way to capture the content as a
string, build my custom document (XML or JSON), and then use the
corresponding update request handler, right?
Any suggestions on a pattern matching / information retrieval /
information extraction module to create my custom document from the
string extracted by Tika?

thanks
Luca
