Mailing List Archive: Lucene: Java-User
Re: Small Vocabulary
 



schnober at ids-mannheim

Aug 7, 2012, 12:29 AM



On 06.08.2012 20:29, Mike Sokolov wrote:

Hi Mike,

> There was some interesting work done on optimizing queries including
> very common words (stop words) that I think overlaps with your problem.
> See this blog post
> http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2
> from the Hathi Trust.
>
> The upshot in a nutshell was that queries including terms with very
> large postings lists (ie high occurrences) were slow, and the approach
> they took to dealing with this was to index n-grams (ie pairs and
> triplets of adjacent tokens). However I'm not sure this would help much
> if your queries will typically include only a single token.

This is very interesting for our use case indeed. However, you are right
that indexing n-grams is not (per se) a solution to my problem,
because I'm working on an application that uses multiple indexes. A query
for a single isolated frequent term will presumably be rare, or at
least rare enough to tolerate slow response times, but the results will
typically be intersected with results from other indexes.

To illustrate this more practically: the index I described, which has
relatively few distinct tokens, some of them extremely frequent, holds
part-of-speech (POS) tags with positional information stored in the
payload. A parallel index holds the actual text; a typical query may look
for a certain POS tag in one index and a word X at the same position
with a matching payload in the other index. So both indexes need to be
queried completely before the intersection can be performed.
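
As a rough sketch of that intersection step (plain Java, not the Lucene
API and not our actual code): assume each index has already returned its
hits as an array of (docId, position) pairs, sorted ascending. The class
name PositionIntersection and the helpers key() and intersect() are
hypothetical, and the position is assumed to have been read from the
payload beforehand.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: intersects hits from a POS-tag index and a text index.
class PositionIntersection {

    // Packs a (docId, position) pair into one long so that natural long order
    // equals "by docId, then by position". Assumes both values are non-negative.
    static long key(int docId, int position) {
        return ((long) docId << 32) | (position & 0xFFFFFFFFL);
    }

    // Two-pointer merge over two hit arrays, each sorted ascending by key.
    // Returns the keys that occur in both arrays, i.e. the positions where
    // the POS tag and the word coincide.
    static List<Long> intersect(long[] posHits, long[] textHits) {
        List<Long> common = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < posHits.length && j < textHits.length) {
            if (posHits[i] < textHits[j]) {
                i++;                      // POS hit has no partner yet, advance
            } else if (posHits[i] > textHits[j]) {
                j++;                      // text hit has no partner yet, advance
            } else {
                common.add(posHits[i]);   // same document and same position
                i++;
                j++;
            }
        }
        return common;
    }
}
```

The merge itself is linear in the combined length of the two arrays, so
for a very frequent POS tag the cost is dominated by the size of its
hit list.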

Best,
Carsten



--
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789 | schnober [at] ids-mannheim
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform


