
Mailing List Archive: Lucene: General
index per-user basis and document frequency
 



lionel.duboeuf at boozter

Jun 15, 2009, 2:06 PM



Hi,

I use Lucene to index users' documents. I potentially have 2 million or
more users, so I don't think a per-user index would be a scalable
solution. All my searches are filtered on a user UID field.
As far as I know, the default similarity calculates inverse document
frequency as follows:
Math.log(numDocs/(double)(docFreq+1)) + 1.0
where numDocs is the number of documents in the whole collection and
docFreq is the number of documents in the whole collection that contain
term t.
My problem is that this formula does not seem reliable for my system:
numDocs should be the number of documents in the user's collection, and
docFreq should be the number of documents in the user's collection that
contain term t.
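
To illustrate the concern, here is a minimal sketch in plain Java (no Lucene dependency) that evaluates the formula quoted above with collection-wide counts and with per-user counts. The class name PerUserIdf and all the counts are hypothetical and chosen only for illustration.

```java
// Sketch only: compares the IDF from collection-wide statistics with the
// IDF from one user's statistics. All numbers are made up.
final class PerUserIdf {

    /** Same shape as the quoted formula: log(numDocs / (docFreq + 1)) + 1. */
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        // Hypothetical shared index: the term is common across all users.
        int globalNumDocs = 1_000_000;
        int globalDocFreq = 50_000;

        // Hypothetical single user: the term is rare in this user's documents.
        int userNumDocs = 200;
        int userDocFreq = 2;

        System.out.println("global idf   = " + idf(globalDocFreq, globalNumDocs));
        System.out.println("per-user idf = " + idf(userDocFreq, userNumDocs));
    }
}
```

With these assumed counts the term scores as fairly common in the shared index but as rare for the individual user, which is the mismatch described above.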
Because terms are stored as single tokens, I was thinking of
concatenating each term with the UID to keep them separate, since the
term "car" for user1 is different from the term "car" for user2. My
solution would index "carUSERUID1" and "carUSERUID2".
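
The concatenation idea can be sketched in plain Java (no Lucene dependency) as follows. The class UserScopedTerms, the "#" separator, and the toy documents are assumptions for illustration; the separator is added so that a term and a UID cannot run together ambiguously, which the bare "carUSERUID1" form does not guarantee.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch only: scopes each term to its owner and counts document frequency
// per scoped term, mimicking what the index would store.
final class UserScopedTerms {

    // Hypothetical separator; any character that cannot occur in a term or UID would do.
    static final String SEPARATOR = "#";

    static String scope(String term, String userUid) {
        return term + SEPARATOR + userUid;
    }

    /** For each user-scoped term, counts how many documents contain it. */
    static Map<String, Integer> docFreq(Map<String, List<Set<String>>> docsByUser) {
        Map<String, Integer> freq = new HashMap<>();
        for (Map.Entry<String, List<Set<String>>> user : docsByUser.entrySet()) {
            for (Set<String> docTerms : user.getValue()) {
                // Each document is a set of terms, so a term counts once per document.
                for (String term : docTerms) {
                    freq.merge(scope(term, user.getKey()), 1, Integer::sum);
                }
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        Map<String, List<Set<String>>> docs = new HashMap<>();
        docs.put("USERUID1", Arrays.asList(
                new HashSet<>(Arrays.asList("car", "red")),
                new HashSet<>(Arrays.asList("car"))));
        docs.put("USERUID2", Arrays.asList(
                new HashSet<>(Arrays.asList("car"))));
        System.out.println(docFreq(docs));
    }
}
```

Note that this only separates docFreq per user. numDocs in the formula would still be the document count of the whole shared index, and the number of distinct terms in the index grows with the number of users.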

What would you suggest?

Regards,

Lionel

 
 

