Mailing List Archive: Lucene: Java-User

Creating tag clouds with lucene

 

 



mathias.bank at gmail

Nov 5, 2009, 7:21 AM

Post #1 of 9 (2036 views)
Creating tag clouds with lucene

Hi,

I want to calculate a tag cloud for search results. I have seen that
it is possible to extract the top 20 words from the Lucene index. Is
it also possible to extract the top 20 words from a set of search
results (or filtered results) in Lucene?

Mathias

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


glen.newton at gmail

Nov 5, 2009, 7:27 AM

Post #2 of 9 (1991 views)
Re: Creating tag clouds with lucene

Yes. I do it here in Ungava, on a search of "cancer" and "cell" in title:
http://lab.cisti-icist.nrc-cnrc.gc.ca/ungava01/Search?tagCloud=true&collection=csu&tagField=keyword&title=cell&title=cancer&numCloudDocs=200&numCloudTags=50

and here on full-text:
http://lab.cisti-icist.nrc-cnrc.gc.ca/ungava/Search?tagCloud=true&collection=jos&tagField=keyword&contents=cell&contents=cancer&numCloudDocs=200&numCloudTags=50

-glen



mathias.bank at gmail

Nov 5, 2009, 7:35 AM

Post #3 of 9 (1990 views)
Re: Creating tag clouds with lucene

Hi Glen,

great, that is exactly what I'm looking for. How are you doing this?

Mathias



glen.newton at gmail

Nov 5, 2009, 8:03 AM

Post #4 of 9 (1990 views)
Re: Creating tag clouds with lucene

Mathias,

This is a special case: I am doing this on fields that have single
term entries, or are treated like a single term (like 'author') and
not parsed.
Here is what I am doing:
// Loop through the top N Hits (this application is on Lucene 2.2, so it
// still uses Hits).
for each Document in the top N Hits
{
    add the term in the Field to a HashMap<String,Integer> of <term, count>
}
Sort the hash entries by count, taking the top M
Sort the top M by term
Display the tag cloud.
----------
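
The steps above can be sketched in plain Java. This is only an illustration under stated assumptions: the class and method names are hypothetical, and the input list stands in for the single-term field value read from each of the top N hits (no Lucene calls are made here).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene or of the Ungava code.
final class HitTagCloud {

    /**
     * Counts one field value per hit, keeps the m most frequent values,
     * and returns them sorted by term for display.
     *
     * @param fieldValues the tag-field value of each of the top N hits
     * @param m           how many tags to keep
     */
    static List<Map.Entry<String, Integer>> topTags(List<String> fieldValues, int m) {
        Map<String, Integer> counts = new HashMap<>();
        for (String value : fieldValues) {
            counts.merge(value, 1, Integer::sum);
        }

        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Most frequent first; ties are broken by term so the cut-off is deterministic.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        List<Map.Entry<String, Integer>> top =
                new ArrayList<>(entries.subList(0, Math.min(Math.max(m, 0), entries.size())));
        // Alphabetical order is the usual display order for a tag cloud.
        top.sort((a, b) -> a.getKey().compareTo(b.getKey()));
        return top;
    }
}
```

For example, calling `HitTagCloud.topTags(Arrays.asList("b", "a", "b", "c", "a", "b"), 2)` keeps `b` (count 3) and `a` (count 2) and returns them in the order `a`, `b`.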

If you were using full-text, I think you would instead use
TermDocs, via IndexReader.termDocs(Term term),
and take the top N terms and add them to the hash.
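
One possible reading of this suggestion, sketched without any Lucene dependency: walk the postings of each candidate term and count only the documents that are in the hit set. The `Map<String, int[]>` below is a stand-in for the document ids a postings iterator such as TermDocs would enumerate, and the `BitSet` is a stand-in for the matching documents; all names are hypothetical.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; plain-Java stand-in for walking term postings.
final class PostingsCounter {

    /**
     * @param postings term -> ids of the documents that contain the term
     * @param hits     one bit set per document that matched the query or filter
     * @return term -> number of matching documents containing it (zero counts omitted)
     */
    static Map<String, Integer> countInHits(Map<String, int[]> postings, BitSet hits) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, int[]> entry : postings.entrySet()) {
            int count = 0;
            for (int docId : entry.getValue()) {
                if (hits.get(docId)) {
                    count++;
                }
            }
            if (count > 0) {
                counts.put(entry.getKey(), count);
            }
        }
        return counts;
    }
}
```

Note that the work in this sketch grows with the total length of all postings lists visited, not with the size of the hit set, so on a large index it only stays cheap if the set of candidate terms is kept small.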

An issue is likely to be whether you want multi-word phrases in
your tag cloud, like "Cell adhesion molecules" in the example above.
You would have to play with your Analyzer to get things like this.
Would n-grams of terms handle this? Anyone?
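
To make the n-gram idea concrete, here is a minimal sketch of word-level n-grams (often called shingles) over an already tokenized text. It is plain Java with hypothetical names and does not use an Analyzer; Lucene's analysis packages include a ShingleFilter that serves a similar purpose at index time.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating word n-grams; not a Lucene class.
final class WordNGrams {

    /** Returns every run of n consecutive tokens, joined with single spaces. */
    static List<String> shingles(List<String> tokens, int n) {
        List<String> out = new ArrayList<>();
        if (n <= 0) {
            return out;
        }
        for (int i = 0; i + n <= tokens.size(); i++) {
            out.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return out;
    }
}
```

With the tokens `cell`, `adhesion`, `molecules`, a size of 2 yields `cell adhesion` and `adhesion molecules`, and a size of 3 yields the full phrase `cell adhesion molecules`. Each n-gram could then be counted like a single term.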

thanks,

Glen





chris.lu at gmail

Nov 5, 2009, 7:01 PM

Post #5 of 9 (1967 views)
Re: Creating tag clouds with lucene

Isn't the tag cloud just another facet search? Only difference is the
tag is multi-valued.

Basically just go through the search results and find all unique tag values.

--
Chris Lu
-------------------------
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes: http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request) got 2.6 Million Euro funding!




jake.mannix at gmail

Nov 5, 2009, 7:15 PM

Post #6 of 9 (1963 views)
Re: Creating tag clouds with lucene

Well, you can do it as a facet search, but in addition to doing multi-valued
faceting, you can also normalize the counts by dividing by the docFreq of
each term. Instead of the most popular tags that overlap your query, you
then get the tags that are more popular among documents matching your query
relative to how popular those tags are in general, which is a way of getting
the tags "related to" your query. This can be done pretty easily within bobo
(in fact, I just whipped up the code for it over dinner; let me know if you
want to try it that way and I can walk you through the bobo code you'd need
to write), and it's probably not too hard to do in Solr either.
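
The normalization described above can be sketched as follows. This is not the bobo code mentioned in the message; it is a plain-Java illustration with hypothetical names, assuming the per-term counts over the matching documents and the index-wide docFreq values are already available as maps.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper; ranks terms by facet count divided by docFreq.
final class RelatedTags {

    /** Share of a term's documents that fall inside the result set. */
    static double score(int countInResults, int docFreq) {
        return (double) countInResults / docFreq;
    }

    /**
     * @param countInResults term -> number of matching documents containing it
     * @param docFreq        term -> number of documents in the whole index containing it
     * @return the topN terms by normalized score, highest first
     */
    static List<String> rank(Map<String, Integer> countInResults,
                             Map<String, Integer> docFreq, int topN) {
        List<String> terms = new ArrayList<>();
        for (String term : countInResults.keySet()) {
            Integer df = docFreq.get(term);
            if (df != null && df > 0) { // skip terms without a usable docFreq
                terms.add(term);
            }
        }
        terms.sort((a, b) -> {
            int byScore = Double.compare(
                    score(countInResults.get(b), docFreq.get(b)),
                    score(countInResults.get(a), docFreq.get(a)));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return new ArrayList<>(terms.subList(0, Math.min(Math.max(topN, 0), terms.size())));
    }
}
```

As a made-up example, with result counts the=90, cell=40, apoptosis=8 and docFreq values the=1000, cell=100, apoptosis=10, the raw counts would put `the` first, while the normalized scores (0.09, 0.4, 0.8) put `apoptosis` and `cell` ahead of it.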

How big your index is (and how many tags per document there are, and how
many unique tags there are) will have a big impact on how performant this
query is, of course, but in my experience as long as this is a typical tag
case (with only a handful of values per document), this can be done not much
slower than your original query.

-jake



chris.lu at gmail

Nov 5, 2009, 7:44 PM

Post #7 of 9 (1970 views)
Re: Creating tag clouds with lucene

Interesting idea! Kind of "cheating" because the word frequency in the
whole index is simply mapped to the search results, which is arguable.

But maybe in practice it could work just fine, since nobody really cares
about the counts anyway.
When users click on a tag cloud, does anyone really care about the
frequency in the search results?

DBSight uses the multi-valued facet search approach to do tag cloud.
Maybe I can "cheat" it this way also... It does save some memory.

--
Chris Lu




mathias.bank at gmail

Nov 6, 2009, 12:25 AM

Post #8 of 9 (1972 views)
Re: Creating tag clouds with lucene

Well, it could be a facet search if there were tags available, but
if you just want a "tag cloud" generated from full text, I don't
see how a facet search could help to generate this cloud.
Unfortunately, I don't have tags in my data. What I need to know is
which terms (or multi-word terms) are used most in this
data. At first I thought of using carrot2, which uses a specialized
clustering algorithm. But I wondered whether it is possible to
get the most-used terms out of Lucene directly.

Glen mentioned that he is doing this for full-text data. He
mentioned that he is using the IndexReader.termDocs(Term term) method.
So I think he iterates over all terms and checks how many documents
contain each term. But what I don't see is: how does this method work with a
filter? Do you first find all documents that pass the
filter and then iterate over all terms, counting only documents in this
filtered set? I cannot imagine that this performs well, because I have
more than 10 million documents (and growing fast).

Mathias



jake.mannix at gmail

Nov 6, 2009, 12:39 AM

Post #9 of 9 (1965 views)
Re: Creating tag clouds with lucene

On Fri, Nov 6, 2009 at 12:25 AM, Mathias Bank <mathias.bank [at] gmail> wrote:

> Well, it could be a facet search if there were tags available, but
> if you just want a "tag cloud" generated from full text, I don't
> see how a facet search could help to generate this cloud.
> Unfortunately, I don't have tags in my data. What I need to know is
> which terms (or multi-word terms) are used most in this
> data. At first I thought of using carrot2, which uses a specialized
> clustering algorithm. But I wondered whether it is possible to
> get the most-used terms out of Lucene directly.
>

It is a facet search: take the field you want the cloud for (I called it
the tags field, but it can be any field, a full-text "body" for example)
and set up a multi-valued facet on that field. This returns, for each term,
the number of documents matching your query that contain that term (one
integer count per term). Sorting by count descending and picking the top N
is what you normally do in a facet search, and then you use the counts
themselves to decide how big to make each term.
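
A minimal sketch of the two steps just described: counting a multi-valued field over the matching documents, then turning counts into display sizes. The names are hypothetical, each document is represented simply as the set of terms in its field, and the linear scaling between a minimum and maximum pixel size is an assumption of this example (the message does not say how sizes should be derived).

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper; not taken from bobo, Solr, or Lucene.
final class CloudSizer {

    /** For each term, the number of matching documents whose field contains it. */
    static Map<String, Integer> facetCounts(List<Set<String>> matchingDocs) {
        Map<String, Integer> counts = new HashMap<>();
        for (Set<String> terms : matchingDocs) {
            for (String term : terms) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Linearly maps a count in [minCount, maxCount] to a size in [minPx, maxPx]. */
    static int fontSize(int count, int minCount, int maxCount, int minPx, int maxPx) {
        if (maxCount <= minCount) {
            return minPx; // all terms equally frequent: use one size
        }
        double fraction = (double) (count - minCount) / (maxCount - minCount);
        return (int) Math.round(minPx + fraction * (maxPx - minPx));
    }
}
```

With counts ranging from 1 to 5 and sizes from 10 to 30 pixels, a term with count 3 is drawn at 20 pixels. Because each document contributes a set of terms, a term is counted at most once per document, which matches the per-document counts a facet returns.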

For a single index of 10 million documents, if your field has a lot of
unique terms and you do nothing to prune it down, this kind of query could
be expensive, yes. But you'll want to prune down full text anyway, or else
your cloud will contain whatever words are just uncommon enough not to be
stop words (if you're using a stop list), or of course the common stop words
themselves (if you aren't). This won't be very informative: you want the
terms which are most descriptive *of that query*, which is why I suggested
doing a modified facet query, where you normalize by the docFreq of the term
as you count, which effectively measures the over- or under-representation
of each term in the documents matching your query or filter.
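
The message does not say how to prune, so the following is only one possible approach, sketched with hypothetical names and thresholds: drop terms that appear in too large a fraction of the index (stop-word-like terms) and terms that appear in too few documents (where a count divided by docFreq becomes noisy). Both thresholds are assumptions of this example.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper; the docFreq thresholds are illustrative, not recommended values.
final class TermPruner {

    /**
     * @param docFreq        term -> number of documents in the index containing it
     * @param numDocs        total number of documents in the index
     * @param minDocFreq     terms in fewer documents than this are dropped
     * @param maxDocFraction terms in more than this fraction of documents are dropped
     * @return the surviving candidate terms, in alphabetical order
     */
    static Set<String> prune(Map<String, Integer> docFreq, int numDocs,
                             int minDocFreq, double maxDocFraction) {
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : docFreq.entrySet()) {
            int df = entry.getValue();
            if (df >= minDocFreq && df <= maxDocFraction * numDocs) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }
}
```

For instance, in an index of 1000 documents with docFreq values the=950, cell=120, and a misspelling seen once, a minimum of 2 and a maximum fraction of 0.5 leave only `cell` as a candidate. Pruning once up front also shrinks the set of terms that has to be counted per query.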

-jake


