Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Lucene: Java-Dev
Beyond Lucene 2.0 Index Design
 

Index | Next | Previous | View Flat


jdalton at globalspec

Jan 9, 2007, 6:25 AM


Views: 4366
Permalink
Beyond Lucene 2.0 Index Design

Hi,

I wanted to start some discussion about possible future Lucene file /
index formats. This is an extension to the discussion on Flexible
Lucene Indexing discussed on the wiki:
http://wiki.apache.org/jakarta-lucene/FlexibleIndexing

Note: Related sources are listed at the end.

I would like to have the ability to create term frequency [Persin, et
al. 1996] or "impact" sorted [Anh, Moffat 2001,2006] posting lists (freq
data) . A posting list sorted by Term frequency rather than document id
is straightforward (posting design below). An Impact sorted list is
relatively new (and perhaps unfamiliar). An Impact is a single integer
value for a term in a document that is stored in the posting list and is
computed from the combination of the term frequency, document boost,
field boost, length norms, and other arbitrary scoring features (word
position, font, etc...) -- all local information.

The driving motivation for this change is to avoid reading the entire
posting list from disk for very long posting lists (it also leads to
simplified query-time scoring because the tf, norms, and boosts are
built into the impact). This would address scalability issues with
large collections that have been seen in the past; back in December 2005
there were two threads: "Lucene Performance Bottlenecks" (Lucene User)
and "IndexOptimizer Re: Lucene Performance Bottlenecks" (Nutch Dev)
where Doug and Andrzej addressed some speed concerns by sorting the
Nutch index based on Document Boost (IndexSorter and a TopDocsCollector)
[inpsired by Long, Suel]. The goal is that an impact sorted posting list
would address these and other concerns in a generic manner.

Allowing a frequency or impact sorted posting list format would lead to
a posting list with the following structure:
(Frequency or impact could be used interchangeably in the structure.
Lettering continued from Wiki)

e. <impact, num_docs, (doc1,...docN)>
f. <impact, num_docs, ([doc1, freq ,<positions>],...[docN, freq
,<positions>])

The positions are delta encoded for compression. Similarly, the
document numbers for a given frequency/impact are delta encoded. If you
read Moffat and Persin, the papers show that this achieves compression
comparable to, or even better than, a standard delta encoded docId
sorted index. The structure lends itself well to early termination,
pruning, etc... where the entire posting list is not read from disk.

This type of Impact sorted structure (or similar concept) seems to be
becoming a "standard" solution described in a lot of new research / text
books on IR for large scale indexes. It would be great if Lucene
supported something like this someday ;-).

Thanks,

Jeff Dalton

References:
Anh, Moffat. Vector-Space Ranking with Effective Early Termination.
2001.
Anh, Moffat. Impact Transformation: Effective and Efficient Web
Retrieval. 2006.
Anh, Moffat. Pruned Query Evaluation Using Pre-Computed Impacts. 2006.
Long, Suel. Optimized Query Execution in Large Search Engine with Global
Page Ordering.
Manning, Raghavan, Schutze. Introduction to Information Retrieval,
Chapters 2,7.
http://www-csli.stanford.edu/%7Eschuetze/information-retrieval-book.html

Persin, et al. Filtered Document Retrieval with Frequency-Sorted
Indexes. 1996.






---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe[at]lucene.apache.org
For additional commands, e-mail: java-dev-help[at]lucene.apache.org

Subject User Time
Beyond Lucene 2.0 Index Design jdalton at globalspec Jan 9, 2007, 6:25 AM
    Re: Beyond Lucene 2.0 Index Design marvin at rectangular Jan 9, 2007, 11:57 AM
    Re: Beyond Lucene 2.0 Index Design DORONC at il Jan 9, 2007, 2:27 PM
    RE: Beyond Lucene 2.0 Index Design jdalton at globalspec Jan 9, 2007, 3:15 PM
    RE: Beyond Lucene 2.0 Index Design jdalton at globalspec Jan 9, 2007, 3:20 PM
    Re: Beyond Lucene 2.0 Index Design chenjian1227 at gmail Jan 10, 2007, 2:12 PM
    Re: Beyond Lucene 2.0 Index Design chenjian1227 at gmail Jan 10, 2007, 2:21 PM
    Re: Beyond Lucene 2.0 Index Design mingluen at yahoo Jan 10, 2007, 2:41 PM
    Re: Beyond Lucene 2.0 Index Design mingluen at yahoo Jan 10, 2007, 2:52 PM
    Re: Beyond Lucene 2.0 Index Design mingluen at yahoo Jan 10, 2007, 3:14 PM
    Re: Beyond Lucene 2.0 Index Design mingluen at yahoo Jan 10, 2007, 3:35 PM
    Re: Beyond Lucene 2.0 Index Design grant.ingersoll at gmail Jan 11, 2007, 5:11 AM
    Re: Beyond Lucene 2.0 Index Design marvin at rectangular Jan 11, 2007, 1:51 PM
    Re: Beyond Lucene 2.0 Index Design chenjian1227 at gmail Jan 11, 2007, 2:30 PM
    Re: Beyond Lucene 2.0 Index Design marvin at rectangular Jan 11, 2007, 4:43 PM
    Re: Beyond Lucene 2.0 Index Design mingluen at yahoo Jan 11, 2007, 8:37 PM
        Re: Beyond Lucene 2.0 Index Design marvin at rectangular Jan 11, 2007, 11:24 PM
    RE: Beyond Lucene 2.0 Index Design jdalton at globalspec Jan 12, 2007, 5:45 AM
    RE: Beyond Lucene 2.0 Index Design jdalton at globalspec Jan 12, 2007, 5:54 AM
    RE: Beyond Lucene 2.0 Index Design jdalton at globalspec Jan 12, 2007, 5:58 AM
    Re: Beyond Lucene 2.0 Index Design cutting at apache Jan 12, 2007, 11:49 AM
        Re: Beyond Lucene 2.0 Index Design chuck at manawiz Jan 12, 2007, 12:00 PM
            Re: Beyond Lucene 2.0 Index Design paul.elschot at xs4all Jan 12, 2007, 1:39 PM

  Index | Next | Previous | View Flat
 
 


Interested in having your list archived? Contact lists@gossamer-threads.com
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.