
Mailing List Archive: Lucene: General

Subjects DB Matching

 

 



mauro.fraboni at gmail

Sep 29, 2008, 6:12 AM

Post #1 of 3 (226 views)
Subjects DB Matching

I am studying the possibility of using Lucene to build a
matching system for a database of subjects.
The subjects are stored as database records with different fields
such as name, surname, and address, and I would like to build a proximity
matcher that finds an input subject in the DB.
The idea is to map each record to a document, so that the fields of
the record become the fields of the document.

The problem is that my matching system should be quite accurate: it
should return only one matched subject (the one nearest to
the input) and no match at all in other cases. I have not been able to
find a valid rule for the no-match case. Is it possible to find a rule
based on the score that tells me the input subject is not near enough
to any subject in the DB, so it should not be matched? Is it possible to
find a minimum score for this purpose?
Any suggestions will be appreciated.

ciao mauro


hossman_lucene at fucit

Oct 6, 2008, 4:37 PM

Post #2 of 3 (183 views)
Re: Subjects DB Matching

mauro: I assume you are working with the "Lucene-Java" package to build
your software? (as opposed to one of the other subprojects like
Solr, Mahout, or Tika, which are the other possibilities that aren't ruled
out by your problem description). If so, you will probably get more
feedback using the java-user[at]lucene mailing list in the future.

In general the problem you are describing isn't easily solvable. In order
to determine a good "minimum cut off" score you have to be able to
normalize your scores in a meaningful way -- to do that you have to be
able to define what the "best" (or baseline) possible score for any query
is. This isn't something Lucene can tell you for any arbitrary query, but
it can be determined in special cases (for example, for a simple TermQuery you can
figure it out based on the idf and the document with the highest tf; for
"document similarity" type problems like MoreLikeThis solves, you can get
a good baseline by finding the score for the document used to generate the
MoreLikeThisQuery, but that requires that it be indexed).
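
To make the baseline idea concrete, here is a minimal sketch in plain Java.
It makes no Lucene calls: it assumes you have already collected the raw
scores of the hits and a baseline score by whatever means fits your query
(for instance, the score of the source document against its own query).
The names ScoreCutoff, normalize, and bestMatch, and the threshold values,
are hypothetical and would need tuning against real data.

```java
import java.util.OptionalInt;

final class ScoreCutoff {

    /** Ratio of a raw score to the baseline score, clamped to [0, 1]. */
    static double normalize(double rawScore, double baselineScore) {
        if (baselineScore <= 0.0) {
            throw new IllegalArgumentException("baseline score must be positive");
        }
        return Math.max(0.0, Math.min(1.0, rawScore / baselineScore));
    }

    /**
     * Returns the index of the highest-scoring hit if its normalized score
     * reaches minRatio; otherwise returns empty, meaning "no match".
     */
    static OptionalInt bestMatch(double[] rawScores, double baselineScore, double minRatio) {
        int best = -1;
        for (int i = 0; i < rawScores.length; i++) {
            if (best < 0 || rawScores[i] > rawScores[best]) {
                best = i;
            }
        }
        if (best >= 0 && normalize(rawScores[best], baselineScore) >= minRatio) {
            return OptionalInt.of(best);
        }
        return OptionalInt.empty();
    }

    public static void main(String[] args) {
        double baseline = 4.0;               // assumed "best possible" score
        double[] scores = {1.0, 3.6, 2.0};   // assumed raw scores of three hits

        // Top hit is at 90% of the baseline: accepted with a 0.8 cutoff,
        // rejected with a 0.95 cutoff.
        System.out.println(bestMatch(scores, baseline, 0.8));
        System.out.println(bestMatch(scores, baseline, 0.95));
    }
}
```

The cutoff is applied to the ratio rather than to the raw score, because
raw scores from different queries are not directly comparable.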


-Hoss


gsingers at apache

Oct 7, 2008, 8:43 AM

Post #3 of 3 (176 views)
Re: Subjects DB Matching

Hi Mauro,

I'd go to one of the Lucene mail archives and search for "record
linkage"; there you will find various conversations on the topic [1].
Also, try googling for that. In particular, you might look for work
by W. Winkler at the Census Bureau, amongst others. There is also
the SecondString package by William Cohen at CMU that may help, but I
don't know if it scales or how well supported it is.

Also see http://en.wikipedia.org/wiki/Jaro-Winkler as a starting
point. In short, I think Lucene could facilitate such a system, but
it probably isn't going to be the main piece.
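
For reference, a rough plain-Java sketch of the Jaro-Winkler similarity
follows. It is not taken from Lucene or SecondString; the class name
JaroWinkler is hypothetical, and it assumes the commonly used prefix scale
of 0.1 and a maximum prefix length of 4. Something like this could be used
to compare individual fields (name, surname) of candidate records.

```java
final class JaroWinkler {

    /** Jaro similarity in [0, 1]; 1.0 means the strings are identical. */
    static double jaro(String a, String b) {
        if (a.equals(b)) {
            return 1.0;
        }
        int la = a.length();
        int lb = b.length();
        if (la == 0 || lb == 0) {
            return 0.0;
        }
        // Characters match only if they are equal and within this distance.
        int window = Math.max(0, Math.max(la, lb) / 2 - 1);
        boolean[] matchedA = new boolean[la];
        boolean[] matchedB = new boolean[lb];
        int matches = 0;
        for (int i = 0; i < la; i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(lb - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!matchedB[j] && a.charAt(i) == b.charAt(j)) {
                    matchedA[i] = true;
                    matchedB[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) {
            return 0.0;
        }
        // Count matched characters that appear in a different order.
        int k = 0;
        int outOfOrder = 0;
        for (int i = 0; i < la; i++) {
            if (!matchedA[i]) {
                continue;
            }
            while (!matchedB[k]) {
                k++;
            }
            if (a.charAt(i) != b.charAt(k)) {
                outOfOrder++;
            }
            k++;
        }
        double m = matches;
        double transpositions = outOfOrder / 2.0;
        return (m / la + m / lb + (m - transpositions) / m) / 3.0;
    }

    /** Jaro-Winkler: boosts the Jaro score for a shared prefix (up to 4 chars). */
    static double similarity(String a, String b) {
        double j = jaro(a, b);
        int maxPrefix = Math.min(4, Math.min(a.length(), b.length()));
        int prefix = 0;
        while (prefix < maxPrefix && a.charAt(prefix) == b.charAt(prefix)) {
            prefix++;
        }
        return j + prefix * 0.1 * (1.0 - j);
    }

    public static void main(String[] args) {
        System.out.println(similarity("MARTHA", "MARHTA"));
        System.out.println(similarity("mauro", "maura"));
    }
}
```

A per-field similarity like this still needs a threshold and a rule for
combining fields, so it does not remove the cutoff problem by itself.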

-Grant

[1] http://lucene.markmail.org/message/nyz7hrmzgzkwporq?q=record+linkage
