
marvin at rectangular
Feb 25, 2008, 6:49 AM
Post #2 of 4
(1124 views)
Permalink
|
|
Re: KinoSearch & Similar/Duplicate Documents
[In reply to]
|
|
On Feb 25, 2008, at 3:00 AM, Vladimir Vlach wrote: > I wonder if you could suggest me a way how to retrieve > Similar documents and Duplicates. We index few web-sites and sometimes > the documents are posted with different URLs. How to solve this? Off the top of my head, I don't know of an easy or reliable approach. I'm sure that there is academic research out there on the subject. Brainstorming... This is a two-stage problem. The hard part is identifying candidates which may be similar to each other. After you have candidates, then you can roll through the seemingly matching docs and see what kind of matching content is really there. Is it boilerplate template code (e.g. nav menus) that ought to be discarded? Or is this truly meaningful content which has been duplicated in multiple locations? Say you were to build a pure vector space search engine, as described at <http://www.perl.com/pub/a/2003/02/19/engine.html>. Then you perform a search using the entire contents of one document as a query. Documents with duplicate content will appear nearly on top of each other in vector space. An uncompressed vector space search engine is not feasible for large document collections; however, I suspect that a decomposed vector engine a la LSA (latent semantic analysis) would do a good job at picking candidates. An excellent introduction to LSA is available at <http://www.knowledgesearch.org/lsi/cover_page.htm >. (I've started collecting these links on a wiki page at <http://www.rectangular.com/kinosearch/wiki/VectorSpaceModel >.) The patent on Latent Semantic Analysis expires this year. It ought to be possible to extend KinoSearch with a KSx::LSA distro, which would include KSx::LSA::LSAWriter, KSx::LSA::LSAQuery and so on. > One of the issues we also have is not related to KinoSearch. We would > like to remove some parts of the page which are similar (let's say we > want to remove navigation menu shared on all pages). Remove the > content is quite easy, but how would you detect what parts are > repeated across pages? Diff algorithm? What kind of approach would you > suggest? I haven't studied this one in depth; from what I understand it's quite a difficult problem. (I vaguely recall a discussion in some Lucene forum where Andrzej Bialecki, one of Lucene's biggest contributors, threw up his hands.) Especially annoying is template code which varies subtly, making verification of suspected boilerplate a challenging prospect. I can think of some vector-based techniques I might try, but hunting down academic research on the topic is likely to be more fruitful. Best, Marvin Humphrey Rectangular Research http://www.rectangular.com/ _______________________________________________ KinoSearch mailing list KinoSearch [at] rectangular http://www.rectangular.com/mailman/listinfo/kinosearch
|