
Mailing List Archive: Lucene: General

Open Relevance Project?

 

 



gsingers at apache

May 11, 2009, 9:07 AM

Post #1 of 29 (3054 views)
Permalink
Open Relevance Project?

A few of us who are interested in an Open Relevance assessment project
(à la TREC) have started to put some thoughts down on "paper" over at http://wiki.apache.org/lucene-java/OpenRelevance

Thus, if you'd like to somehow participate (TBD what that actually
means just yet) in developing a set of open collections, queries and
assessments for relevance testing, let's discuss here and on that Wiki
page.

The basic gist of it is, we'd like to crawl Creative Commons and/or
other free content, redistribute it along with queries and judgments,
thus fueling the testing capabilities to further improve Lucene's
search quality as well as, of course, providing the means for a
completely open assessment process whereby anyone can participate
without having to fork over money to license 20-year-old copyrighted
news articles that have no value other than testing.

At this point, we're open to a lot of ideas. Once we solidify a bit,
then we'd like to make it an official Lucene subproject and get our
own resources as well as figure out how to crawl and host the content
using ASF infrastructure (without making the ASF infra. team upset!)

Cheers,
Grant


ted.dunning at gmail

May 11, 2009, 9:47 AM

Post #2 of 29 (2987 views)
Permalink
Re: Open Relevance Project? [In reply to]

Sounds good to me. I would be able to help a few different ways.

On Mon, May 11, 2009 at 9:07 AM, Grant Ingersoll <gsingers [at] apache> wrote:

> A few of us who are interested in an Open Relevance assessment project (ala
> TREC) have started to put some thoughts down on "paper" over at
> http://wiki.apache.org/lucene-java/OpenRelevance
>
> Thus, if you'd like to somehow participate (TBD what that actually means
> just yet) in developing a set of open collections, queries and assessments
> for relevance testing, let's discuss here and on that Wiki page.
>
>


marvin at rectangular

May 11, 2009, 11:26 AM

Post #3 of 29 (2987 views)
Permalink
Re: Open Relevance Project? [In reply to]

On Mon, May 11, 2009 at 12:07:41PM -0400, Grant Ingersoll wrote:
> Thus, if you'd like to somehow participate (TBD what that actually
> means just yet) in developing a set of open collections, queries and
> assessments for relevance testing, let's discuss here and on that Wiki
> page.

I won't be able to contribute directly in the near to medium term, but I look
forward to participating as a user. Sounds like a great project.

Marvin Humphrey


lucene at mikemccandless

May 11, 2009, 1:01 PM

Post #4 of 29 (2985 views)
Permalink
Re: Open Relevance Project? [In reply to]

I'd love to see a resource like this (it's high time!), and I'll try
to help when/where I can, starting with some initial
comments/questions:

I think it's actually quite a challenge to do well. EG it's easy to
make a corpus that's too easy because it's highly diverse (and thus
most search engines have no trouble pulling back relevant results).
Instead, I think the content set should be well & tightly scoped to a
certain topic, and not necessarily that large (ie we don't need a huge
number of documents). It would help if that scoping is towards
content that many people find "of interest" so we get "accurate"
judgements by as wide an audience as possible.

EG how about coverage of the 2009 H1N1 outbreak (that's licensed
appropriately)? Or... the 2008 US presidential election? Or...
research on Leukemia (but I fear such content is not typically
licensed appropriately, nor will it have wide interest).

What does "using Nutch to crawl Creative Commons" actually mean? Can
I browse the content that's being crawled?

Also, to help us build up the relevance judgements, I think we should
build a basic custom app for collecting queries as well as annotating
them. I should be able to go to that page and run my own queries,
which are collected. Then, I should be able to browse previously
collected queries, click on them, and add my own judgement. The site
should try to offer up queries that are "in need" of judgements. It
should run the search and let me step through the results, marking
those that are relevant; but we would then bias the results to that
search engine; maybe under the hood we rotate through search engines
each time?

Do we have anyone involved who's built similar corpora before? Or has
anyone read papers on how prior corpora were designed/created?

Mike

On Mon, May 11, 2009 at 12:07 PM, Grant Ingersoll <gsingers [at] apache> wrote:
> A few of us who are interested in an Open Relevance assessment project (ala
> TREC) have started to put some thoughts down on "paper" over at
> http://wiki.apache.org/lucene-java/OpenRelevance
>
> Thus, if you'd like to somehow participate (TBD what that actually means
> just yet) in developing a set of open collections, queries and assessments
> for relevance testing, let's discuss here and on that Wiki page.
>
> The basic gist of it is, we'd like to crawl Creative Commons and/or other
> free content, redistribute it along with queries and judgments, thus fueling
> the testing capabilities to further improve Lucene's search quality as well
> as, of course, providing the means for a completely open assessment process
> whereby anyone can participate without having to fork up money to license 20
> year old copyrighted news articles that are of no other value whatsoever
> other than testing.
>
> At this point, we're open to a lot of ideas.  Once we solidify a bit, then
> we'd like to make it an official Lucene subproject and get our own resources
> as well as figure out how to crawl and host the content using ASF
> infrastructure (without making the ASF infra. team upset!)
>
> Cheers,
> Grant
>


ted.dunning at gmail

May 11, 2009, 1:06 PM

Post #5 of 29 (2985 views)
Permalink
Re: Open Relevance Project? [In reply to]

I was involved in TREC-1 through 5 or so as a researcher. That means that I
didn't actually create the corpus but I certainly had to deal with the
results and see how things turned out.


On Mon, May 11, 2009 at 1:01 PM, Michael McCandless <
lucene [at] mikemccandless> wrote:

>
> Do we have anyone involved who's built similar corpora before? Or has
> anyone read papers on how prior corpora were designed/created?




--
Ted Dunning, CTO
DeepDyve


ted.dunning at gmail

May 11, 2009, 1:08 PM

Post #6 of 29 (2987 views)
Permalink
Re: Open Relevance Project? [In reply to]

The standard technique for this is what is called "pooled relevance" where
all the results from all the search engines are combined into a pool for
judging.

In our case, we should probably make the pool dynamic so that tests on new
search engines will enlarge the pool.

Related to that, we should not pretend that we can measure recall for any
meaningful sized corpus.
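
A rough sketch of what a dynamic pool might look like, purely for illustration. The class name `JudgmentPool`, the `depth` parameter and the document ids are all made up here; nothing like this exists in the project yet. Each engine contributes its top results per query, and a run from a new engine only adds the documents the pool has not seen.

```python
from collections import defaultdict


class JudgmentPool:
    """Per-query pool of documents that need (or already have) judgments."""

    def __init__(self, depth=20):
        self.depth = depth            # top results each engine contributes
        self.pool = defaultdict(set)  # query -> doc ids in the pool
        self.judgments = {}           # (query, doc_id) -> True / False

    def add_run(self, query, ranked_doc_ids):
        """Merge one engine's ranked list; return the newly pooled docs."""
        top = set(ranked_doc_ids[:self.depth])
        new_docs = top - self.pool[query]
        self.pool[query] |= top
        return new_docs

    def judge(self, query, doc_id, is_relevant):
        """Record a relevance judgment for a pooled document."""
        self.judgments[(query, doc_id)] = bool(is_relevant)

    def unjudged(self, query):
        """Pooled documents for this query that still lack a judgment."""
        return {d for d in self.pool[query]
                if (query, d) not in self.judgments}
```

Because only pooled documents are ever judged, anything outside the pool is unknown rather than irrelevant, which is why recall cannot really be measured this way.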

On Mon, May 11, 2009 at 1:01 PM, Michael McCandless <
lucene [at] mikemccandless> wrote:

> It should run the search and let me step through the results, marking
> those that are relevant; but we would then bias the results to that
> search engine; maybe under the hood we rotate through search engines
> each time?
>



--
Ted Dunning, CTO
DeepDyve


ab at getopt

May 11, 2009, 1:46 PM

Post #7 of 29 (2992 views)
Permalink
Re: Open Relevance Project? [In reply to]

Michael McCandless wrote:

> I think it's actually quite a challenge to do well. EG it's easy to
> make a corpus that's too easy because it's highly diverse (and thus
> most search engines have no trouble pulling back relevant results).
> Instead, I think the content set should be well & tightly scoped to a
> certain topic, and not necessarily that large (ie we don't need a huge
> number of documents). It would help if that scoping is towards
> content that many people find "of interest" so we get "accurate"
> judgements by as wide an audience as possible.
>
> EG how about coverage of the 2009 H1N1 outbreak (that's licensed
> appropriately)? Or... the 2008 US presidential election? Or...
> research on Leukemia (but I fear such content is not typically
> licensed appropriately, nor will it have wide interest).

These are good ideas. It's difficult not only to collect a meaningful
corpus, but also later to distribute it, if it weighs a hundred GBs or more.


>
> What does "using Nutch to crawl Creative Commons" actually mean? Can
> I browse the content that's being crawled?

Yes. It's easy to collect a lot of web pages starting from a seed list
and expanding the crawling frontier to linked resources, while applying
CC license filters. Nutch provides a lot of tools out of the box that we
need anyway, such as keeping track of page status, following outlinks,
parsing, working with web graph (important for scoring web documents),
indexing, searching and content browsing.
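
To make the idea concrete, here is a toy sketch of frontier expansion with a license filter. It is not Nutch code and does not use any Nutch API; `fetch` and `is_cc_licensed` are hypothetical callables supplied by the caller, and pages are assumed to be dicts with an `outlinks` list. One assumption worth stating: links are followed from every fetched page, but only CC-licensed pages are kept.

```python
from collections import deque


def crawl_cc(seeds, fetch, is_cc_licensed, max_pages=1000):
    """Breadth-first crawl from seeds, keeping only CC-licensed pages."""
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    seen = set(seeds)
    frontier = deque(seeds)
    kept = []
    while frontier and len(kept) < max_pages:
        url = frontier.popleft()
        page = fetch(url)               # hypothetical: returns a dict or None
        if page is None:
            continue
        if is_cc_licensed(page):        # hypothetical license check
            kept.append(url)
        for link in page["outlinks"]:   # expand the crawling frontier
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return kept
```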


> Also, to help us build up the relevance judgements, I think we should
> build a basic custom app for collecting queries as well as annotating
> them. I should be able to go to that page and run my own queries,
> which are collected. Then, I should be able to browse previously
> collected queries, click on them, and add my own judgement. The site
> should try to offer up queries that are "in need" of judgements. It
> should run the search and let me step through the results, marking
> those that are relevant; but we would then bias the results to that
> search engine; maybe under the hood we rotate through search engines
> each time?

Comparing results across search engines is clearly a challenge. Among
others, this requires that the corpus that we use with the engines that
we operate (Lucene? KinoSearch? other open source engines?) contains at
least top-X (where X > N) URL-s returned from external engines for every
query - otherwise we won't be able to compare the results.
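
A small sketch of how that coverage could be checked, under the assumption that we have the external engines' ranked URL lists per query. The function name `coverage` and its arguments are invented for this example.

```python
def coverage(corpus_urls, external_results, x):
    """Fraction of each query's external top-x URLs present in our corpus.

    corpus_urls:      set of URLs we have crawled
    external_results: dict mapping query -> ranked list of URLs
    """
    report = {}
    for query, urls in external_results.items():
        top = urls[:x]
        present = sum(1 for u in top if u in corpus_urls)
        report[query] = present / len(top) if top else 0.0
    return report
```

Queries whose coverage falls below 1.0 are the ones where results from an external engine could not be compared fairly against ours.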

--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com


gsingers at apache

May 11, 2009, 3:12 PM

Post #8 of 29 (2978 views)
Permalink
Re: Open Relevance Project? [In reply to]

On May 11, 2009, at 4:01 PM, Michael McCandless wrote:

> I'd love to see a resource like this (it's high time!), and I'll try
> to help when/where I can, starting with some initial
> comments/questions:
>
> I think it's actually quite a challenge to do well. EG it's easy to
> make a corpus that's too easy because it's highly diverse (and thus
> most search engines have no trouble pulling back relevant results).
> Instead, I think the content set should be well & tightly scoped to a
> certain topic, and not necessarily that large (ie we don't need a huge
> number of documents). It would help if that scoping is towards
> content that many people find "of interest" so we get "accurate"
> judgements by as wide an audience as possible.

I think we will want a generic one, and then focused ones, but we
should start with generic at first.

>
>
> EG how about coverage of the 2009 H1N1 outbreak (that's licensed
> appropriately)? Or... the 2008 US presidential election? Or...
> research on Leukemia (but I fear such content is not typically
> licensed appropriately, nor will it have wide interest).
>
> What does "using Nutch to crawl Creative Commons" actually mean? Can
> I browse the content that's being crawled?

Nutch has a CC plugin that allows it to filter out non-CC content, AIUI.

>
>
> Also, to help us build up the relevance judgements, I think we should
> build a basic custom app for collecting queries as well as annotating
> them. I should be able to go to that page and run my own queries,
> which are collected. Then, I should be able to browse previously
> collected queries, click on them, and add my own judgement. The site
> should try to offer up queries that are "in need" of judgements. It
> should run the search and let me step through the results, marking
> those that are relevant; but we would then bias the results to that
> search engine; maybe under the hood we rotate through search engines
> each time?
>
> Do we have anyone involved who's built similar corpora before? Or has
> anyone read papers on how prior corpora were designed/created?

This is all good, but here I'm thinking simpler, at least at first. I
don't know that we need to be writing apps, although feel free, since
it is O/S after all. :-) I was wondering if we couldn't handle this
wiki style (how is still not clear) whereby we simply have pages that
contain the queries and judgments and over time the wisdom of the
crowds will work to maintain standards, fill in gaps, etc. Maybe,
in regards to judgments, we allow people to vote for them, which over
time will yield an appropriate result (but is subject to early
issues). Not sure what all that means just yet, but the wiki approach
allows us to get going with minimal resources while still delivering
value. Hmm, now it's starting to sound like an app... ;-)
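
For what it's worth, the voting idea could be as simple as the sketch below. The names `aggregate_votes` and `min_votes` are made up for illustration; the minimum-vote threshold is just one guess at how to soften the "early issues" problem, not a settled design.

```python
from collections import Counter


def aggregate_votes(votes, min_votes=3):
    """Turn individual votes into judgments by strict majority.

    votes: iterable of (query, doc_id, is_relevant) tuples.
    Pairs with fewer than min_votes votes, or a tie, are left undecided.
    """
    tallies = {}
    for query, doc_id, is_relevant in votes:
        tallies.setdefault((query, doc_id), Counter())[bool(is_relevant)] += 1
    decided = {}
    for key, tally in tallies.items():
        total = tally[True] + tally[False]
        if total >= min_votes and tally[True] != tally[False]:
            decided[key] = tally[True] > tally[False]
    return decided
```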

As opposed to TREC style stuff, I don't think we need the top 1000
(although it could work). Just the top ten or twenty. Sometimes, it
can even be useful to just rate a whole page of results at once, even
at the cost of granularity. Basically, what I'm proposing we do is
carry out a pragmatic relevance test out in the open, just as people
should do in house. I think this fits with Lucene's model of
operation quite well: be practical by focusing on real data and real
feedback as opposed to obsessing over theory. (Not that you were
suggesting otherwise, I'm just stating it)
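
As a sketch of what judging only the top ten or twenty buys us: precision at k needs nothing beyond the judged top of the list. The function below is illustrative only (the name and defaults are mine), and it treats unjudged documents as not relevant, which is an assumption.

```python
def precision_at_k(ranked_doc_ids, relevant, k=10):
    """Share of the top k results judged relevant.

    ranked_doc_ids: results in rank order
    relevant:       set of doc ids judged relevant for the query
    Unjudged documents count as not relevant; short lists are still
    divided by k, so missing results are penalized.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top = ranked_doc_ids[:k]
    return sum(1 for d in top if d in relevant) / k
```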

I need to find the reference, but I recall the last edition of SIGIR
having a discussion on crowdsourcing relevance judgments.

-Grant


gsingers at apache

May 13, 2009, 8:56 AM

Post #9 of 29 (2964 views)
Permalink
Re: Open Relevance Project? [In reply to]

So, I suppose the next steps are to formalize this project a little
more. I'll call a vote on a separate thread to add it as a Lucene
sub. I figured I would contact infrastructure to see what they
think. Was also thinking that maybe we should talk with iBiblio or
some other content repository to see if they can help overcome the
bandwidth problem.

-Grant

On May 11, 2009, at 6:12 PM, Grant Ingersoll wrote:

>
> On May 11, 2009, at 4:01 PM, Michael McCandless wrote:
>
>> I'd love to see a resource like this (it's high time!), and I'll try
>> to help when/where I can, starting with some initial
>> comments/questions:
>>
>> I think it's actually quite a challenge to do well. EG it's easy to
>> make a corpus that's too easy because it's highly diverse (and thus
>> most search engines have no trouble pulling back relevant results).
>> Instead, I think the content set should be well & tightly scoped to a
>> certain topic, and not necessarily that large (ie we don't need a
>> huge
>> number of documents). It would help if that scoping is towards
>> content that many people find "of interest" so we get "accurate"
>> judgements by as wide an audience as possible.
>
> I think we will want a generic one, and then focused ones, but we
> should start with generic at first.
>
>>
>>
>> EG how about coverage of the 2009 H1N1 outbreak (that's licensed
>> appropriately)? Or... the 2008 US presidential election? Or...
>> research on Leukemia (but I fear such content is not typically
>> licensed appropriately, nor will it have wide interest).
>>
>> What does "using Nutch to crawl Creative Commons" actually mean? Can
>> I browse the content that's being crawled?
>
> Nutch has a CC plugin that allows it to filter out non-CC content,
> AIUI.
>
>>
>>
>> Also, to help us build up the relevance judgements, I think we should
>> build a basic custom app for collecting queries as well as annotating
>> them. I should be able to go to that page and run my own queries,
>> which are collected. Then, I should be able to browse previously
>> collected queries, click on them, and add my own judgement. The site
>> should try to offer up queries that are "in need" of judgements. It
>> should run the search and let me step through the results, marking
>> those that are relevant; but we would then bias the results to that
>> search engine; maybe under the hood we rotate through search engines
>> each time?
>>
>> Do we have anyone involved who's built similar corpora before? Or
>> has
>> anyone read papers on how prior corpora were designed/created?
>
> This is all good, but here I'm thinking simpler, at least at first.
> I don't know that we need to be writing apps, although feel free,
> since it is O/S after all. :-) I was wondering if we couldn't
> handle this wiki style (how is still not clear) whereby we simply
> have pages that contain the queries and judgments and over time the
> wisdom of the crowds will work to maintain standards, fill in gaps,
> etc. Maybe, in regards to judgments, we allow people to vote for
> them, which over time will yield an appropriate result (but is
> subject to early issues). Not sure what all that means just yet,
> but the wiki approach allows us to get going with minimal resources
> while still delivering value. Hmm, now it's starting to sound like
> an app... ;-)
>
> As opposed to TREC style stuff, I don't think we need the top 1000
> (although it could work). Just the top ten or twenty. Sometimes,
> it can even be useful to just rate a whole page of results at once,
> even at the cost of granularity. Basically, what I'm proposing we
> do is carry out a pragmatic relevance test out in the open, just as
> people should do in house. I think this fits with Lucene's model of
> operation quite well: be practical by focusing on real data and real
> feedback as opposed to obsessing over theory. (Not that you were
> suggesting otherwise, I'm just stating it)
>
> I need to find the reference, but I recall the last edition of SIGIR
> having a discussion on crowdsourcing relevance judgments.
>
> -Grant

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search


ted.dunning at gmail

May 13, 2009, 10:36 AM

Post #10 of 29 (2970 views)
Permalink
Re: Open Relevance Project? [In reply to]

Even if the corpus is very large, I doubt there will be all that much
aggregate bandwidth. The audience for this is relatively small.

On Wed, May 13, 2009 at 8:56 AM, Grant Ingersoll <gsingers [at] apache> wrote:

> help overcome the bandwidth problem.
>



--
Ted Dunning, CTO
DeepDyve


gsingers at apache

May 13, 2009, 10:56 AM

Post #11 of 29 (2965 views)
Permalink
Re: Open Relevance Project? [In reply to]

Good point, although you never know. We also will have some bandwidth
reqs for crawling.

On May 13, 2009, at 1:36 PM, Ted Dunning wrote:

> Even if the corpus is very large, I doubt there will be all that much
> aggregate bandwidth. The audience for this is relatively small.
>
> On Wed, May 13, 2009 at 8:56 AM, Grant Ingersoll
> <gsingers [at] apache>wrote:
>
>> help overcome the bandwidth problem.
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search


ted.dunning at gmail

May 13, 2009, 11:48 AM

Post #12 of 29 (2967 views)
Permalink
Re: Open Relevance Project? [In reply to]

Crawling a reference dataset requires essentially one-time bandwidth.

Also, it is possible to download, say, wikipedia in a single go. Likewise
there are various web-crawls that are available for research purposes (I
think). See http://webascorpus.org/ for one example. These would be single
downloads.

I don't entirely see the point of redoing the spidering.

On Wed, May 13, 2009 at 10:56 AM, Grant Ingersoll <gsingers [at] apache> wrote:

> Good point, although you never know. We also will have some bandwidth reqs
> for crawling.
>
>


--
Ted Dunning, CTO
DeepDyve


aw at ice-sa

May 13, 2009, 11:51 AM

Post #13 of 29 (2982 views)
Permalink
Re: Open Relevance Project? [In reply to]

Ted Dunning wrote:
> Even if the corpus is very large, I doubt there will be all that much
> aggregate bandwidth. The audience for this is relatively small.

+1
(I mean, count me in as 1 for the audience)

As for the corpus, why not start with all the Apache projects'
documentation?
It is relatively homogeneous, free, it is there, it is close, and would
ensure some audience.


gsingers at apache

May 13, 2009, 12:13 PM

Post #14 of 29 (2964 views)
Permalink
Re: Open Relevance Project? [In reply to]

On May 13, 2009, at 2:48 PM, Ted Dunning wrote:

> Crawling a reference dataset requires essentially one-time bandwidth.
>

True, but we will likely evolve over time to have multiple datasets,
but no reason to get ahead of ourselves.


> Also, it is possible to download, say, wikipedia in a single go.

Wikipedia isn't always that interesting from a relevance testing
standpoint, for IR at least (for QA, machine learning, etc., it is
more so). A lot of queries simply have only one or two relevant results.
While that is useful, it is often not the whole picture of what one
needs for IR.

> Likewise
> there are various web-crawls that are available for research
> purposes (I
> think). See http://webascorpus.org/ for one example. These would
> be single
> downloads.
>
> I don't entirely see the point of redoing the spidering.

I think we have to be able to control the spidering, so that we can
say we've vetted what's in it, due to copyright, etc. But, maybe
not. I've talked with quite a few people who have corpora available,
and it always comes down to copyright for redistribution in a public
way. No one wants to assume the risk, even though they all crawl and
redistribute (for money).

For instance, the Internet Archive even goes so far as to apply
robots.txt retroactively. We probably could do the same thing, but
I'm not sure if it is necessary.
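
If we did want to do the same, a minimal sketch could re-check already collected URLs against a site's current robots.txt using Python's standard `urllib.robotparser`. The function name and the `orp-crawler` user agent are invented for this example, and it assumes all the URLs belong to the one host whose robots.txt text is passed in.

```python
from urllib.robotparser import RobotFileParser


def filter_by_robots(urls, robots_txt, user_agent="orp-crawler"):
    """Keep only the URLs that the given robots.txt still allows.

    urls:       URLs already in the corpus, all from the same host
    robots_txt: current text of that host's robots.txt
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if parser.can_fetch(user_agent, u)]
```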


simon.willnauer at googlemail

May 13, 2009, 1:38 PM

Post #15 of 29 (2954 views)
Permalink
Re: Open Relevance Project? [In reply to]

I followed the whole discussion on this thread about how to obtain a
corpus of documents. I personally think that we should first define
WHAT kind of corpus, or rather what different kinds of corpora, should
be included in this new OpenRelevance project, and not HOW a corpus is
collected / aggregated. IR is not just about having a huge corpus of
full-text documents / web pages, especially when it comes to ranking.

My understanding of OpenRelevance is that it should provide a set of
corpora and measurement procedures for various use cases, not just
compete with TREC. Please correct me if I'm wrong.
Beyond that, the project should help to improve Lucene ranking itself,
or at least help to obtain a measurement reference for more than
just web search.

Anyway, I personally feel that the discussion about how to obtain a
certain corpus is out of scope at this stage of the project.

Simon

On Wed, May 13, 2009 at 9:13 PM, Grant Ingersoll <gsingers [at] apache> wrote:
>
> On May 13, 2009, at 2:48 PM, Ted Dunning wrote:
>
>> Crawling a reference dataset requires essentially one-time bandwidth.
>>
>
> True, but we will likely evolve over time to have multiple datasets, but no
> reason to get ahead of ourselves.
>
>
>> Also, it is possible to download, say, wikipedia in a single go.
>
> Wikipedia isn't always that interesting from a relevance testing standpoint,
> for IR at least (QA, machine learning, etc. it is more so).  A lot of
> queries simply have only one or two relevant results.  While that is useful,
> it is not often the whole picture of what one needs for IR.
>
>> Likewise
>> there are various web-crawls that are available for research purposes (I
>> think).  See http://webascorpus.org/ for one example.  These would be
>> single
>> downloads.
>>
>> I don't entirely see the point of redoing the spidering.
>
> I think we have to be able to control the spidering, so that we can say
> we've vetted what's in it, due to copyright, etc.  But, maybe not.  I've
> talked with quite a few people who have corpora available, and it always
> comes down to copyright for redistribution in a public way.  No one wants to
> assume the risk, even though they all crawl and redistribute (for money).
>
> For instance, the Internet Archive even goes so far as to apply robots.txt
> retroactively.  We probably could do the same thing, but I'm not sure if it
> is necessary.
>
>


ted.dunning at gmail

May 13, 2009, 1:43 PM

Post #16 of 29 (2958 views)
Permalink
Re: Open Relevance Project? [In reply to]

Very good point.

For that matter, the suggestion of apache docs is a trenchant one because
there is some possibility of getting a sample of queries (and if feasible,
session associated clicks). Having real users provide feedback for real
queries like that would be *vastly* more useful than just dead documents.

On Wed, May 13, 2009 at 1:38 PM, Simon Willnauer <
simon.willnauer [at] googlemail> wrote:

> define WHAT kind of corpus




--
Ted Dunning, CTO
DeepDyve


otis_gospodnetic at yahoo

May 17, 2009, 7:18 PM

Post #17 of 29 (2931 views)
Permalink
Re: Open Relevance Project? [In reply to]

Not sure if this was mentioned before, but .... hm, I was going to point out http://index.isc.org/ (see http://ioiblog.wordpress.com/2008/11/07/kicking-off-the-ioi-blog/ ), but the server doesn't seem to be listening.... aha, here: http://ioiblog.wordpress.com/2009/02/

Perhaps we can get data from Dennis and Jeremie?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Ted Dunning <ted.dunning [at] gmail>
> To: general [at] lucene
> Sent: Wednesday, May 13, 2009 2:48:43 PM
> Subject: Re: Open Relevance Project?
>
> Crawling a reference dataset requires essentially one-time bandwidth.
>
> Also, it is possible to download, say, wikipedia in a single go. Likewise
> there are various web-crawls that are available for research purposes (I
> think). See http://webascorpus.org/ for one example. These would be single
> downloads.
>
> I don't entirely see the point of redoing the spidering.
>
> On Wed, May 13, 2009 at 10:56 AM, Grant Ingersoll wrote:
>
> > Good point, although you never know. We also will have some bandwidth reqs
> > for crawling.
> >
> >
>
>
> --
> Ted Dunning, CTO
> DeepDyve


aw at ice-sa

May 18, 2009, 2:25 AM

Post #18 of 29 (2923 views)
Permalink
Re: Open Relevance Project? [In reply to]

Hi.
There has been an earlier suggestion here, later endorsed by someone
else, to use the documentation of the Apache projects as a corpus.
Being far from an expert, I am just naively wondering why the experts on
this list seem to totally ignore it, without providing any argument.
Is it somehow unsuitable, impractical, inappropriate, bad, unfeasible,
useless, uninteresting or ... ?


ab at getopt

May 18, 2009, 2:59 AM

Post #19 of 29 (2925 views)
Permalink
Re: Open Relevance Project? [In reply to]

André Warnier wrote:
> Hi.
> There has been an earlier suggestion here, later endorsed by someone
> else, to use the documentation of the Apache projects as a corpus.
> Being far from an expert, I am just naively wondering why the experts on
> this list seem to totally ignore it, without providing any argument.
> Is it somehow unsuitable, impractical, inappropriate, bad, unfeasible,
> useless, uninteresting or ... ?

The documentation is mostly on a single topic - programming. The
vocabulary is, let's not deceive ourselves, limited ;) Pages contain a
lot of noise (Forrest navigation, javadoc dressing, common class names,
code snippets, etc).

For a general-purpose corpus you would want to have several topics, with
a well-balanced representation, and using a broad vocabulary and low
level of noise.

Additionally, this collection gets relatively little endorsement (links
with meaningful anchors) from within apache.org, so the typical PageRank
scoring wouldn't work too well (on the other hand, it resembles intranet
linkage, so it could be useful for studying scoring algos for enterprise
search).

So, while this collection is not useless, it's not the best fit either.

--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com


gsingers at apache

May 18, 2009, 3:59 AM

Post #20 of 29 (2926 views)
Permalink
Re: Open Relevance Project? [In reply to]

Mail archives are likely useful for a mail-based corpus. I agree with
Andrzej about the rest of the docs, though.


On May 18, 2009, at 5:25 AM, André Warnier wrote:

> Hi.
> There has been an earlier suggestion here, later endorsed by someone
> else, to use the documentation of the Apache projects as a corpus.
> Being far from an expert, I am just naively wondering why the
> experts on this list seem to totally ignore it, without providing
> any argument.
> Is it somehow unsuitable, impractical, inappropriate, bad,
> unfeasible, useless, uninteresting or ... ?
>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search


ted.dunning at gmail

May 18, 2009, 8:41 AM

Post #21 of 29 (2923 views)
Permalink
Re: Open Relevance Project? [In reply to]

On the other hand, it is likely that we could find query and click logs for
the documentation.

On Mon, May 18, 2009 at 3:59 AM, Grant Ingersoll <gsingers [at] apache> wrote:

> Mail archives are likely useful for a mail based corpus. I agree with
> Andrzej about the rest of the docs, though.
>
>
>
> On May 18, 2009, at 5:25 AM, André Warnier wrote:
>
> Hi.
>> There has been an earlier suggestion here, later endorsed by someone else,
>> to use the documentation of the Apache projects as a corpus.
>> Being far from an expert, I am just naively wondering why the experts on
>> this list seem to totally ignore it, without providing any argument.
>> Is it somehow unsuitable, impractical, inappropriate, bad, unfeasible,
>> useless, uninteresting or ... ?
>>
>>
>
--
Ted Dunning, CTO
DeepDyve


gsingers at apache

May 18, 2009, 10:57 AM

Post #22 of 29 (2916 views)
Permalink
Re: Open Relevance Project? [In reply to]

On May 18, 2009, at 11:41 AM, Ted Dunning wrote:

> On the other hand, it is likely that we could find query and click
> logs for
> the documentation.

Only if they are redacted/aggregated first. ASF Members have access,
but we'd need to get permission to distribute (after redaction/
aggregation) I suspect. Given the AOL marketing fiasco, we'd have to
go over them in pretty good detail before releasing to make sure there
is no personal information. AFAIK, I'm the only ASF Member who has so
far volunteered on this thread and I highly doubt I have the time for
what I imagine to be a pretty decent sized endeavor.

Stripping IP address is pretty straightforward, but the query terms
might be a bit more involved.
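
A rough sketch of the two steps, with loud caveats: the regex below only handles IPv4 addresses, and keeping only queries seen at least `min_count` times is just one possible aggregation rule (an assumption on my part), not a guarantee that nothing personal is left. The function names are made up for illustration.

```python
import re
from collections import Counter

# Matches dotted-quad IPv4 addresses only; IPv6 is not handled here.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def strip_ips(line):
    """Replace every IPv4 address in a log line with a placeholder."""
    return IPV4.sub("0.0.0.0", line)


def frequent_queries(queries, min_count=5):
    """Aggregate queries and keep only those issued at least min_count times.

    Rare queries are the ones most likely to identify a person, so they
    are dropped; this reduces the risk but does not eliminate it.
    """
    counts = Counter(q.strip().lower() for q in queries)
    return {q: n for q, n in counts.items() if n >= min_count}
```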

Still, can't hurt to find out what's involved.

-Grant


gsingers at apache

May 18, 2009, 7:46 PM

Post #23 of 29 (2909 views)
Permalink
Re: Open Relevance Project? [In reply to]

Some interesting discussion at http://thenoisychannel.com/2009/05/18/copying-trec-is-the-wrong-track-for-the-enterprise/

On May 18, 2009, at 1:57 PM, Grant Ingersoll wrote:

>
> On May 18, 2009, at 11:41 AM, Ted Dunning wrote:
>
>> On the other hand, it is likely that we could find query and click
>> logs for
>> the documentation.
>
> Only if they are redacted/aggregated first. ASF Members have
> access, but we'd need to get permission to distribute (after
> redaction/aggregation) I suspect. Given the AOL marketing fiasco,
> we'd have to go over them in pretty good detail before releasing to
> make sure there is no personal information. AFAIK, I'm the only ASF
> Member who has so far volunteered on this thread and I highly doubt
> I have the time for what I imagine to be a pretty decent sized
> endeavor.
>
> Stripping IP addresses is pretty straightforward, but the query terms
> might be a bit more involved.
>
> Still, can't hurt to find out what's involved.
>
> -Grant


markrmiller at gmail

May 18, 2009, 8:00 PM

Post #24 of 29 (2909 views)
Permalink
Re: Open Relevance Project? [In reply to]

Grant Ingersoll wrote:
> Some interesting discussion at
> http://thenoisychannel.com/2009/05/18/copying-trec-is-the-wrong-track-for-the-enterprise/
>
That was an interesting read. I think a lot of the argument misses the
point. It doesn't seem to me that the main benefit or intent comes from
'bake offs' with other search engines ("Selling search applications to
enterprises isn't, in my experience, about winning relevance
bake-offs.") - the main benefit is allowing us to measure changes and
improvements to Lucene's relevancy calculations and to make judgments
about how Lucene currently performs. I see it as easily as important as the
Lucene benchmark contrib. It's not going to be a secret sauce, just like
the benchmarker has been no secret sauce - but it's going to make it
easier to reliably improve Lucene in the future.
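
To make the "measure changes" idea concrete, here is a minimal Python sketch of scoring two runs against the same set of judgments using mean average precision. The function names and the dict-based data layout are assumptions for illustration only; this is not the Lucene benchmark contrib's API, and MAP is just one of several measures a project like this might choose.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked result list against a set of relevant ids.

    Assumes ranked_ids contains no duplicate document ids.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant)


def mean_average_precision(run, judgments):
    """Mean of average_precision over every judged query.

    run:       {query: [doc ids in ranked order]}
    judgments: {query: set of relevant doc ids}
    Queries with no relevant documents are skipped.
    """
    queries = [q for q, rel in judgments.items() if rel]
    if not queries:
        return 0.0
    scores = [average_precision(run.get(q, []), judgments[q]) for q in queries]
    return sum(scores) / len(scores)
```

With fixed queries and judgments, a scoring change can be evaluated by running the same queries before and after it and comparing the two numbers:

```python
judgments = {"q1": {"a", "b"}}
before = {"q1": ["x", "a", "y", "b"]}
after = {"q1": ["a", "b", "x", "y"]}
print(mean_average_precision(before, judgments))  # 0.5
print(mean_average_precision(after, judgments))   # 1.0
```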

- Mark
>
> On May 18, 2009, at 1:57 PM, Grant Ingersoll wrote:
>
>>
>> On May 18, 2009, at 11:41 AM, Ted Dunning wrote:
>>
>>> On the other hand, it is likely that we could find query and click
>>> logs for
>>> the documentation.
>>
>> Only if they are redacted/aggregated first. ASF Members have access,
>> but we'd need to get permission to distribute (after
>> redaction/aggregation) I suspect. Given the AOL marketing fiasco,
>> we'd have to go over them in pretty good detail before releasing to
>> make sure there is no personal information. AFAIK, I'm the only ASF
>> Member who has so far volunteered on this thread and I highly doubt I
>> have the time for what I imagine to be a pretty decent sized endeavor.
>>
>> Stripping IP addresses is pretty straightforward, but the query terms
>> might be a bit more involved.
>>
>> Still, can't hurt to find out what's involved.
>>
>> -Grant
>
>


--
- Mark

http://www.lucidimagination.com


ted.dunning at gmail

May 18, 2009, 8:12 PM

Post #25 of 29 (2908 views)
Permalink
Re: Open Relevance Project? [In reply to]

I completely agree with this. In practice, search engines and to a larger
extent recommendation engines shape user behavior and are, in turn, shaped
by user behavior so that static relevancy tests are of only very limited
value in the end game.

But it is still *very* nice to have them.

On Mon, May 18, 2009 at 8:00 PM, Mark Miller <markrmiller [at] gmail> wrote:

> Grant Ingersoll wrote:
>
>> Some interesting discussion at
>> http://thenoisychannel.com/2009/05/18/copying-trec-is-the-wrong-track-for-the-enterprise/
>>
> That was an interesting read. I think a lot of the argument misses the
> point. It doesn't seem to me that the main benefit or intent comes from
> 'bake offs' with other search engines ("Selling search applications to
> enterprises isn't, in my experience, about winning relevance bake-offs.") -
> the main benefit is allowing us to measure changes and improvements to
> Lucene's relevancy calculations and to make judgments about how Lucene
> currently performs. I see it as easily as important as the Lucene benchmark
> contrib. It's not going to be a secret sauce, just like the benchmarker has
> been no secret sauce - but it's going to make it easier to reliably improve
> Lucene in the future.
>
>
