Mailing List Archive: Lucene: Java-User

Document Similarity



J.Su at cs

Dec 2, 2003, 10:12 AM

Post #1 of 7
Document Similarity

Hi,

I have read some posts in the user and developer archives about Lucene-based
document similarity comparison. In summary, two approaches are mentioned:

1 - Turn the document into a query;
2 - Represent each document as a vector, then rank documents according to
their distance (cosine).

Do they produce the same ranking results? Is there any other way to do this?
I guess the 2nd approach will be more precise but slower.

Thanks.

Jing

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe [at] jakarta
For additional commands, e-mail: lucene-user-help [at] jakarta


Karsten.Konrad at xtramind

Dec 3, 2003, 3:24 PM

Post #2 of 7
Re: Document Similarity

Hi,

>> Do they produce same ranking results?

No; Lucene's operations on query weight and length normalization are not
equivalent to a vanilla cosine in vector space.

>> I guess the 2nd approach will be more precise but slow.

Query similarity
will indeed be faster, but it may not actually be worse. A straightforward
cosine without the IDF weighting of terms that Lucene applies will almost
certainly be less precise if you have documents of different lengths: word
occurrence probabilities in texts of different lengths vary greatly,
and the cosine between unrelated longer texts will often be greater than
the cosine between short texts that actually share a topic, simply because
of randomly matching non-content words.

If, on the other hand, you choose the right TF/IDF weighting of
terms, the cosine in this warped vector space could be (a)
equivalent to the one Lucene computes, which requires some work, or
(b) might even be better on average.
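
A minimal sketch of the weighting effect described above, in plain Python rather than against the Lucene API. It is not Lucene's scoring formula: the IDF variant `log(N / df)`, the toy corpus and all function names are assumptions chosen for illustration. It compares a plain term-frequency cosine with a TF/IDF-weighted cosine on the same documents.

```python
import math
from collections import Counter

def tf_vector(tokens):
    """Raw term-frequency vector as a dict: term -> count."""
    return dict(Counter(tokens))

def idf_table(docs):
    """Assumed IDF variant: log(N / df), one of many in the literature."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}

def weight(tf, idf):
    """Scale each term frequency by its IDF (unknown terms get weight 0)."""
    return {t: f * idf.get(t, 0.0) for t, f in tf.items()}

def cosine(a, b):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy corpus: docs 0 and 2 share a topic, doc 1 only shares function words.
docs = [
    "the lucene index stores the term vectors of the documents".split(),
    "the weather of the coast is the topic of the day".split(),
    "lucene term vectors".split(),
    "the price of the bread".split(),
]
idf = idf_table(docs)
raw = [tf_vector(d) for d in docs]
weighted = [weight(v, idf) for v in raw]

for other in (1, 2):
    print(other,
          round(cosine(raw[0], raw[other]), 3),
          round(cosine(weighted[0], weighted[other]), 3))
```

On this toy data the raw cosine ranks the unrelated document 1 above the related but short document 2, because "the" and "of" dominate the vectors; with the assumed IDF weighting the order is reversed. A different TF/IDF formula or corpus may behave differently.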

However, the last time I counted, there were about 250 different
TF/IDF formulas around in IR publications, machine learning,
computational linguistics and so on. Performance depends on domain
and language.

But if I were you, I would just start playing and have fun with
the stuff...

Karsten



sg at media-style

Dec 8, 2003, 9:43 AM

Post #3 of 7
Document Similarity

Hi Jing,

are you working on the task of document similarity?
I see nobody has answered your question.

Creating a query out of a document would be very easy, but would it
give good results?
Document term vectors would offer more possibilities for using different
data mining algorithms for clustering or classification.

Stefan


--
open technology: www.media-style.com
open source: www.weta-group.net
open discussion: www.text-mining.org






klaus at vommond

Jan 20, 2006, 7:19 AM

Post #4 of 7
Re: Document similarity

>In my case, I need to filter similar documents in search results and
>therefore determine document similarity during the indexing process using
>term vectors. Obviously, I can't compare the document currently being indexed
>with all documents in my collection.

Yes you can. Right after indexing the new document, fetch the term vector for
this document from the index. Compute some kind of weight for each term,
and construct a Boolean query from all terms. You can use the term weights to
boost the term queries. The hits will be scored, and this score is a measure of
the similarity between the documents.
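
The recipe above can be sketched roughly as follows, in plain Python rather than Lucene code. The weight `tf * log(num_docs / df)`, the cut-off `max_terms`, the sample numbers and both function names are assumptions for illustration. The output string uses the `term^boost` notation of Lucene's classic query syntax; in a real application one would more likely build the boolean query from term queries programmatically.

```python
import math

def weighted_terms(term_freqs, doc_freqs, num_docs, max_terms=5):
    """Return the max_terms highest-weighted (term, boost) pairs.

    Assumed weight: tf * log(num_docs / df); any TF/IDF variant would do.
    """
    scored = []
    for term, tf in term_freqs.items():
        df = doc_freqs.get(term, 0)
        if df == 0:
            continue  # term unknown to the index, so it cannot match anything
        scored.append((term, tf * math.log(num_docs / df)))
    # Highest weight first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:max_terms]

def to_query_string(pairs):
    """Render the pairs as optional (OR) clauses, skipping zero boosts."""
    return " ".join(f"{term}^{boost:.2f}" for term, boost in pairs if boost > 0)

# Hypothetical numbers standing in for a term vector and index statistics.
term_freqs = {"lucene": 4, "similarity": 3, "the": 12, "cosine": 1}
doc_freqs = {"lucene": 50, "similarity": 20, "the": 1000, "cosine": 5}
query = to_query_string(weighted_terms(term_freqs, doc_freqs, num_docs=1000))
print(query)
```

Limiting the query to the highest-weighted terms keeps it small for long documents; a term such as "the" that occurs in every document gets weight zero here and drops out.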

peace


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


yseeley at gmail

Jan 20, 2006, 7:50 AM

Post #5 of 7
Re: Document similarity

If you didn't want to store term vectors you could also run the
document fields through the analyzer yourself and collect the Tokens
(you should still have the fields you just indexed... no need to
retrieve it again).
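
A rough sketch of that alternative: collect term frequencies directly from the field text that was just indexed. The regular-expression tokenizer and the tiny stop-word list below are simplified stand-ins for a real analyzer and are assumptions, as are the field names.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}  # illustrative only

def analyze(text):
    """Lowercase, split on non-alphanumerics, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def term_frequencies(fields):
    """Count analyzed tokens across all fields of one document."""
    counts = Counter()
    for text in fields.values():
        counts.update(analyze(text))
    return counts

doc = {
    "title": "Document Similarity",
    "body": "The similarity of a document to the index.",
}
print(term_frequencies(doc))
```

To be comparable with what is in the index, the tokens would have to be produced by the same analysis chain that was used at indexing time.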

-Yonik



aserba at gmail

Jan 20, 2006, 10:31 AM

Post #6 of 7
Re: Document similarity

Yonik, Klaus, thanks for your quick response.

Let me rephrase: I can't compare the currently processed document with all
documents in my collection using the angle between documents in
term-vector space because of performance issues. As far as I can see,
I can avoid unnecessary operations. First, I can build a query from
the document terms, fetch the top N results and compute the angle only for them.
Is that OK?
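
The two-stage idea in this paragraph can be sketched as follows, in plain Python. The in-memory inverted index and the overlap count used for the first stage are assumed stand-ins for the actual search query; the function names and sample documents are hypothetical.

```python
import math
from collections import Counter, defaultdict

def cosine(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_inverted_index(vectors):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, vec in vectors.items():
        for term in vec:
            index[term].add(doc_id)
    return index

def similar_docs(new_vec, vectors, index, top_n=2):
    # Stage 1: cheap candidate selection, standing in for the search query.
    overlap = Counter()
    for term in new_vec:
        for doc_id in index.get(term, ()):
            overlap[doc_id] += 1
    candidates = [doc_id for doc_id, _ in overlap.most_common(top_n)]
    # Stage 2: exact cosine, computed for the candidates only.
    scored = [(doc_id, cosine(new_vec, vectors[doc_id])) for doc_id in candidates]
    return sorted(scored, key=lambda pair: -pair[1])

vectors = {
    "a": Counter("lucene term vector cosine".split()),
    "b": Counter("lucene query boost".split()),
    "c": Counter("bread price today".split()),
}
index = build_inverted_index(vectors)
result = similar_docs(Counter("lucene term vector angle".split()), vectors, index)
print(result)
```

The exact angle is computed for at most `top_n` documents, so the cost no longer grows with the size of the collection. The trade-off is that a similar document missed by the first stage is never seen by the second.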

The second question is:
how can I generate some information about document similarity to store
in the Lucene index?
For example, a hash with the same value for similar documents, or
something like that.
That would make it easy to filter "supplemental" results.
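
One known technique in this direction, not mentioned anywhere in this thread, is a SimHash-style fingerprint. The sketch below is an assumption-laden illustration in plain Python: it uses MD5 only as a convenient stable hash, a fixed 64-bit width, and unweighted tokens.

```python
import hashlib

BITS = 64  # fingerprint width; fixed because 8 digest bytes are used below

def simhash(tokens):
    """Fold the 64-bit hashes of all tokens into one 64-bit fingerprint."""
    counts = [0] * BITS
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, vote in enumerate(counts):
        if vote > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

original = "lucene computes a score for every matching document".split()
near_copy = "lucene computes a score for each matching document".split()
print(hamming(simhash(original), simhash(near_copy)))
```

Such fingerprints are usually close rather than identical for similar documents, so filtering would compare them by Hamming distance against a threshold instead of testing for equality. The fingerprint itself could be stored as an ordinary field next to the document.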




elshaimaa.ali at hotmail

Jul 30, 2012, 10:26 AM

Post #7 of 7
RE: Document Similarity

Thank you so much for the prompt reply.
I need to retrieve a document from the index that is similar to an HTML document, and I need to use cosine similarity or latent semantic analysis, which means that I need to generate a term vector for the HTML document. The link you sent me doesn't contain any code.
Any help will be greatly appreciated.
Regards,
Shaimaa

> Date: Mon, 30 Jul 2012 07:32:49 -0700
> From: in.abdul [at] gmail
> To: java-user [at] lucene
> Subject: Re: Document Similarity
>
> Hi Elshaimaa,
> I could not understand what you need. Can you please explain your
> use case?
>
> If the case is "I need to use Lucene to find the most similar documents
> from the generated index",
> then go for the MoreLikeThis[1] component.
>
> Based on your use case, people can suggest some good ways.
>
>
>
> [1] http://wiki.apache.org/solr/MoreLikeThis
>
>
>
>
> Thanks and Regards,
> S SYED ABDUL KATHER
>
>
>
> On Mon, Jul 30, 2012 at 7:30 PM, Elshaimaa Ali [via Lucene] <
> ml-node+s472066n3998082h68 [at] n3> wrote:
>
> >
> > Hi all,
> > I created a Lucene index for over 3 million documents, and I used term
> > vectors when creating the index. Now, for an external document, I need to use
> > Lucene to find the most similar documents in the generated index. How can
> > I process the document to generate a term vector for it, and what
> > search technique can I use to map the document to one of the documents in
> > the index?
> > Regards,
> > Shaimaa
