
Mailing List Archive: Wikipedia: Wikitech

Log of failed searches

 

 



apoc2400 at gmail

Jan 14, 2010, 1:37 AM

Post #1 of 20 (1140 views)
Log of failed searches

Would it be possible to generate a log or statistics of searches on
Wikipedia using the "Go" button that did not immediately reach an article?
Properly anonymized of course. I think it would be useful for finding
missing articles and redirects to create. There would be a lot of crap of
course, but probably also very useful information on what people have
trouble finding.
_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


magnusmanske at googlemail

Jan 14, 2010, 7:09 AM

Post #2 of 20 (1091 views)
Re: Log of failed searches

On Thu, Jan 14, 2010 at 9:37 AM, Apoc 2400 <apoc2400 [at] gmail> wrote:
> Would it be possible to generate a log or statistics of searches on
> Wikipedia using the "Go" button that did not immediately reach an article?
> Properly anonymized of course. I think it would be useful for finding
> missing articles and redirects to create. There would be a lot of crap of
> course, but probably also very useful information on what people have
> trouble finding.

We used to have that. I don't remember why it was turned off -
probably too many results.

Magnus



rainmansr at gmail

Jan 14, 2010, 7:22 AM

Post #3 of 20 (1090 views)
Re: Log of failed searches

Magnus Manske wrote:
> On Thu, Jan 14, 2010 at 9:37 AM, Apoc 2400 <apoc2400 [at] gmail> wrote:
>
>> Would it be possible to generate a log or statistics of searches on
>> Wikipedia using the "Go" button that did not immediately reach an article?
>> Properly anonymized of course. I think it would be useful for finding
>> missing articles and redirects to create. There would be a lot of crap of
>> course, but probably also very useful information on what people have
>> trouble finding.
>>
>
> We used to have that. I don't remember why it was turned off -
> probably too many results.
>

We used to do it, and the plan was to make it public. However, there are
apparently privacy issues, and no one knows whether we can publish the
logs, or in what format. Since it was filling up the disk and was not
being used, I have disabled it until a solution and storage space are
found.

Cheers, r.





smolensk at eunet

Jan 14, 2010, 7:27 AM

Post #4 of 20 (1093 views)
Re: Log of failed searches

Robert Stojnic wrote:
> Magnus Manske wrote:
>> On Thu, Jan 14, 2010 at 9:37 AM, Apoc 2400 <apoc2400 [at] gmail> wrote:
>>> Would it be possible to generate a log or statistics of searches on
>>> Wikipedia using the "Go" button that did not immediately reach an article?

Also, searches made using either button that did not have any results.
There are smaller Wikipedias out there, you know :)

>>> Properly anonymized of course. I think it would be useful for finding
>>> missing articles and redirects to create. There would be a lot of crap of
>>> course, but probably also very useful information on what people have
>>> trouble finding.
>>>
>> We used to have that. I don't remember why it was turned off -
>> probably too many results.
>
> We used to do it, and the plan was to make it public, however, there are
> privacy issues apparently and no-one knows if we can or cannot publish

What would be privacy issues if only the statistics are displayed?

> them, and in what format etc.. So since it was filling up the disk and

I suggest HTML :)



magnusmanske at googlemail

Jan 14, 2010, 7:47 AM

Post #5 of 20 (1095 views)
Re: Log of failed searches

On Thu, Jan 14, 2010 at 3:27 PM, Nikola Smolenski <smolensk [at] eunet> wrote:
> Robert Stojnic wrote:
>> Magnus Manske wrote:
>>> On Thu, Jan 14, 2010 at 9:37 AM, Apoc 2400 <apoc2400 [at] gmail> wrote:
>>>> Would it be possible to generate a log or statistics of searches on
>>>> Wikipedia using the "Go" button that did not immediately reach an article?
>
> Also, searches made using either button that did not have any results.
> There are smaller Wikipedias out there, you know :)
>
>>>> Properly anonymized of course. I think it would be useful for finding
>>>> missing articles and redirects to create. There would be a lot of crap of
>>>> course, but probably also very useful information on what people have
>>>> trouble finding.
>>>>
>>> We used to have that. I don't remember why it was turned off -
>>> probably too many results.
>>
>> We used to do it, and the plan was to make it public, however, there are
>> privacy issues apparently and no-one knows if we can or cannot publish
>
> What would be privacy issues if only the statistics are displayed?

I guess people searching for their own name, or the like.

Suggestion :
* log search and SHA1 IP hash (anonymous!)
* search queries are logged in a standardized fashion (for grouping),
e.g. lowercase, single spaces, no leading/trailing spaces, special
chars converted to spaces, etc.
* display searches per week (?) that have been searched for at least
10 times from at least 5 different IP hashes (to avoid people
searching their own name 100 times...)

Magnus
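
The "standardized fashion" in the second bullet can be sketched in a few lines of Python. This is only an illustration of the rules listed above (lowercase, special characters to spaces, single spaces, no leading or trailing spaces); the function name `normalize_query` and the choice to treat underscores as special characters are assumptions, not anything that exists in MediaWiki.

```python
import re


def normalize_query(raw: str) -> str:
    """Fold a raw search string into a grouping key (hypothetical helper)."""
    lowered = raw.lower()
    # Every run of non-word characters (punctuation, whitespace) and
    # underscores becomes one space; letters and digits are kept.
    spaced = re.sub(r"[\W_]+", " ", lowered)
    # Drop the leading/trailing space left over from the substitution.
    return spaced.strip()


examples = ["  John  F. Kennedy ", "JOHN_F_KENNEDY", "john f kennedy"]
keys = {normalize_query(q) for q in examples}
```

All three spellings collapse to the single key `john f kennedy`, which is what lets them be counted together.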



rainmansr at gmail

Jan 14, 2010, 7:56 AM

Post #6 of 20 (1089 views)
Re: Log of failed searches

This sounds like a good idea, although we could probably argue about
the cut-offs. However, since this needs to be done in-house (and not on
the toolserver etc., because I imagine we cannot distribute raw logs), I
imagine it is going to go very slowly, as no one on the core staff is
working on it or planning to work on it...

r.

> I guess people searching for their own name, or the like.
>
> Suggestion :
> * log search and SHA1 IP hash (anonymous!)
> * search queries are logged in a standardized fashion (for grouping),
> e.g. lowercase, single spaces, no leading/trailing spaces, special
> chars converted to spaces, etc.
> * display searches per week (?) that have been searched for at least
> 10 times from at least 5 different IP hashes (to avoid people
> searching their own name 100 times...)
>
> Magnus
>




bryan.tongminh at gmail

Jan 14, 2010, 7:58 AM

Post #7 of 20 (1091 views)
Re: Log of failed searches

On Thu, Jan 14, 2010 at 4:47 PM, Magnus Manske
<magnusmanske [at] googlemail> wrote:
> * log search and SHA1 IP hash (anonymous!)

There are only 2 billion unique addresses and they can all be found in
half an hour probably.


Bryan



gmaxwell at gmail

Jan 14, 2010, 8:00 AM

Post #8 of 20 (1097 views)
Re: Log of failed searches

On Thu, Jan 14, 2010 at 10:47 AM, Magnus Manske
<magnusmanske [at] googlemail> wrote:
> Suggestion :
> * log search and SHA1 IP hash (anonymous!)

*Any* mapping of the IP is not anonymous. Please see the AOL search
results where unique IDs were connected between searches to disclose
information. (Moreover, a straight simple hash of an IP can be
reversed simply by making a table of all expected IPs.)

However: Since this is just for internal logging there is no need to
hash the IP. Just log it directly, and thus avoid the risk that
someone later will think the hash is something which can be disclosed.
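
The table attack is easy to demonstrate. The sketch below assumes the log stores an unsalted SHA-1 of the dotted-quad string (the thread does not specify the exact input format) and, to stay quick, enumerates only the 192.0.2.0/24 documentation range; the same loop over the whole IPv4 space (2^32 addresses) is just a bigger table. The helper names are made up for the example.

```python
import hashlib
import ipaddress


def sha1_hex(ip: str) -> str:
    """Unsalted SHA-1 of the textual IP (an assumed logging scheme)."""
    return hashlib.sha1(ip.encode("ascii")).hexdigest()


def build_reverse_table(network: str) -> dict:
    """Map hash -> IP for every address in the given network."""
    return {sha1_hex(str(addr)): str(addr)
            for addr in ipaddress.ip_network(network)}


# Precompute once, then any "anonymous" hash in the log is a dict lookup.
table = build_reverse_table("192.0.2.0/24")
logged_hash = sha1_hex("192.0.2.77")
recovered = table[logged_hash]
```

Here `recovered` is the original address again, which is why the hash by itself gives no anonymity.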


> * search queries are logged in a standardized fashion (for grouping),
> e.g. lowercase, single spaces, no leading/trailing spaces, special
> chars converted to spaces, etc.

Excellent.

> * display searches per week (?) that have been searched for at least
> 10 times from at least 5 different IP hashes (to avoid people
> searching their own name 100 times...)

What I've suggested elsewhere was at least 4 different IPs, 5 sounds
fine to me too. I don't know that the minimum of 10 queries matters
once the 5 IP check is in place.

Per week would be okay. No shorter though.


If someone gives me a log format, I'll gladly write a fast tool for
producing this output.
(I did something like that before where I gave Brion a tool to produce
stats from access logs)

I think I have C code for a parser for Wikimedia's squid logs... so
if it's just that, I already have a good chunk of it done.



dgerard at gmail

Jan 14, 2010, 8:01 AM

Post #9 of 20 (1090 views)
Re: Log of failed searches

2010/1/14 Bryan Tong Minh <bryan.tongminh [at] gmail>:
> On Thu, Jan 14, 2010 at 4:47 PM, Magnus Manske
> <magnusmanske [at] googlemail> wrote:

>> * log search and SHA1 IP hash (anonymous!)

> There are only 2 billion unique addresses and they can all be found in
> half an hour probably.


A count of search terms, with no IP info at all? Would be more useful
than nothing.

(modulo the issue Michael Snow raised re: searches on suppressable names)


- d.



gmaxwell at gmail

Jan 14, 2010, 8:15 AM

Post #10 of 20 (1095 views)
Re: Log of failed searches

On Thu, Jan 14, 2010 at 11:01 AM, David Gerard <dgerard [at] gmail> wrote:
> 2010/1/14 Bryan Tong Minh <bryan.tongminh [at] gmail>:
>> On Thu, Jan 14, 2010 at 4:47 PM, Magnus Manske
>> <magnusmanske [at] googlemail> wrote:
>
>>> * log search and SHA1 IP hash (anonymous!)
>
>> There are only 2 billion unique addresses and they can all be found in
>> half an hour probably.
>
>
> A count of search terms, with no IP info at all? Would be more useful
> than nothing.
>
> (modulo the issue Michael Snow raised re: searches on suppressable names)

Magnus was not suggesting disclosing the IP hash, as far as I can
tell. He was demonstrating an abundance of caution in suggesting only
logging that. (Er, well, yeah, if he was suggesting disclosing that...
we shouldn't do that. Even if we add a secret to the hash, it's risky
and allows interesting correlation attacks.)


Here is what I would suggest disclosing:

#start_datetime end_datetime hits search_string
2010-01-01-0:0:4 2010-01-13-23-59-50 39284 naked people
2010-01-01-0:0:4 2010-01-13-23-59-50 23950 hot grits
...
2010-01-01-0:0:4 2010-01-13-23-59-50 5 autoerotic quantum chromodynamics


Which has first been filtered by:
* Canonicalization of strings (at least ascii case folding)
* Excluding strings over some length
* Excluding searches which did not come from at least 5 distinct IPs
during the reporting interval
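
The three filters above could be sketched roughly as follows. The input shape (an iterable of `(ip, query)` pairs for one reporting interval), the function name, and the 80-character length limit are assumptions made for the example; only the threshold of 5 distinct IPs comes from the proposal.

```python
from collections import defaultdict


def publishable_counts(records, min_distinct_ips=5, max_length=80):
    """Return (hits, query) rows that pass the disclosure filters."""
    hits = defaultdict(int)
    ips = defaultdict(set)
    for ip, query in records:
        key = query.lower()          # minimal canonicalization: case folding
        if len(key) > max_length:    # exclude overly long strings
            continue
        hits[key] += 1
        ips[key].add(ip)             # IPs are used for the check, never output
    rows = [(hits[k], k) for k in hits if len(ips[k]) >= min_distinct_ips]
    rows.sort(key=lambda row: (-row[0], row[1]))  # most frequent first
    return rows


records = [("10.0.0.%d" % i, q) for i, q in enumerate(
    ["Hot Grits", "hot grits", "HOT GRITS", "Hot grits", "hot Grits"], start=1)]
records.append(("10.0.0.1", "hot grits"))          # repeat from a known IP
records += [("10.0.0.9", "my own name")] * 100     # many hits, one IP
report = publishable_counts(records)
```

The query repeated 100 times from a single IP never reaches the report, while the one seen from five IPs does, with all six hits counted.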



There will be useful information excluded by this process, e.g. gads
of misspellings which came from only two to four unique IPs... but the
output would still be *far* more useful than no information at all.



gmaxwell at gmail

Jan 14, 2010, 8:21 AM

Post #11 of 20 (1090 views)
Re: Log of failed searches

On Thu, Jan 14, 2010 at 11:15 AM, Gregory Maxwell <gmaxwell [at] gmail> wrote:
> Here is what I would suggest disclosing:
> #start_datetime end_datetime hits search_string
> 2010-01-01-0:0:4 2010-01-13-23-59-50 39284 naked people
> 2010-01-01-0:0:4 2010-01-13-23-59-50 23950 hot grits
> ...
> 2010-01-01-0:0:4 2010-01-13-23-59-50 5 autoerotic quantum chromodynamics

The logs are probably combined across wikis, so I'd change that to

#start_datetime end_datetime projectcode hits search_string
2010-01-01-0:0:4 2010-01-13-23-59-50 en.wikipedia 39284 naked people
2010-01-01-0:0:4 2010-01-13-23-59-50 en.wikipedia 23950 hot grits
...
2010-01-01-0:0:4 2010-01-13-23-59-50 en.wikipedia 5 autoerotic quantum chromodynamics
2010-01-01-0:0:4 2010-01-13-23-59-50 de.wikipedia 25093 Bondage & Disziplin Pokémon
...
...
2010-01-01-0:0:4 2010-01-13-23-59-50 en.wikinews 5 ethics in journalism
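
Since the search string is the last field and may itself contain spaces, a consumer of this format would split on the first four spaces only. A minimal reader, assuming one record per line and single-space separators (the `Row` type and `parse_row` name are invented for the example), might look like this. The timestamps are kept as opaque strings because the sample above does not pin down a single date format.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    start: str
    end: str
    project: str
    hits: int
    query: str


def parse_row(line: str) -> Optional[Row]:
    """Parse one report line; return None for blank and '#' header lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    # maxsplit=4 keeps everything after the hit count as the search string.
    start, end, project, hits, query = line.split(" ", 4)
    return Row(start, end, project, int(hits), query)


row = parse_row("2010-01-01-0:0:4 2010-01-13-23-59-50 de.wikipedia "
                "25093 Bondage & Disziplin Pokémon")
```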



oscar.vives at gmail

Jan 14, 2010, 8:41 AM

Post #12 of 20 (1094 views)
Re: Log of failed searches

2010/1/14 Gregory Maxwell <gmaxwell [at] gmail>:
> On Thu, Jan 14, 2010 at 11:15 AM, Gregory Maxwell <gmaxwell [at] gmail> wrote:
>> Here is what I would suggest disclosing:
>> #start_datetime end_datetime hits search_string
>> 2010-01-01-0:0:4 2010-01-13-23-59-50 39284 naked people
>> 2010-01-01-0:0:4 2010-01-13-23-59-50 23950 hot grits
>> ...
>> 2010-01-01-0:0:4 2010-01-13-23-59-50 5 autoerotic quantum chromodynamics
>
> The logs are probably combined across wikis, so I'd change that to
>
> #start_datetime end_datetime projectcode hits search_string
> 2010-01-01-0:0:4 2010-01-13-23-59-50 en.wikipedia 39284 naked people
> 2010-01-01-0:0:4 2010-01-13-23-59-50 en.wikipedia 23950 hot grits
> ...
> 2010-01-01-0:0:4 2010-01-13-23-59-50 en.wikipedia 5 autoerotic quantum chromodynamics
> 2010-01-01-0:0:4 2010-01-13-23-59-50 de.wikipedia 25093 Bondage & Disziplin Pokémon

my $0.02

I expect some fun here, since encoding errors will hit things like &, ñ, ó.

2010-01-01-0:0:4 2010-01-13-23-59-50 de.wikipedia 25093 Bondage amp; Disziplin Pokémon
2010-01-01-0:0:4 2010-01-13-23-59-50 de.wikipedia 25093 Bondage %33amp; Disziplin Pokémon
....

On the other hand, all these errors will be browser/proxy bugs, and
not MediaWiki bugs, I think.
Anyway, if the "special characters" are replaced by spaces, there will
be less weird shit and more mysterious space holes.
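
To make the concern concrete: the same query can reach a log percent-encoded, HTML-escaped, or both, and a report that does not decode them will count each spelling separately. The sketch below uses only Python's standard library; which of these variants actually occur in the real logs is not known here, and the helper name is an assumption.

```python
import html
from urllib.parse import unquote


def decode_variants(raw: str) -> str:
    """Undo percent-encoding first, then HTML entity escaping."""
    return html.unescape(unquote(raw))


variants = [
    "Bondage & Disziplin",                 # already clean
    "Bondage%20%26%20Disziplin",           # percent-encoded
    "Bondage &amp; Disziplin",             # HTML-escaped
    "Bondage%20%26amp%3B%20Disziplin",     # both layers
]
decoded = {decode_variants(v) for v in variants}
```

All four variants decode to the same string. Input that is already mangled beyond a known encoding (a lost `&`, for example) cannot be recovered this way.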


--
End of message.



conrad.irwin at googlemail

Jan 14, 2010, 9:22 AM

Post #13 of 20 (1089 views)
Re: Log of failed searches

> * search queries are logged in a standardized fashion (for grouping),
> e.g. lowercase, single spaces, no leading/trailing spaces, special
> chars converted to spaces, etc.

Wiktionary is case-sensitive and so case-folding there may not be
appropriate; I personally would be interested in seeing these logs
before even the NFC normalizers get to them (given a lack of any other
source to find out how people type fun characters in the wild) though I
can appreciate this is somewhat sadistic, and probably the logs are
taken too late for this.
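
For readers unfamiliar with the NFC point: the same visible word can arrive as different code point sequences depending on how it was typed, and NFC normalization erases exactly that difference. A small illustration with Python's `unicodedata` module (the sample word is arbitrary):

```python
import unicodedata

composed = "Pok\u00e9mon"      # 'é' as one precomposed code point
decomposed = "Poke\u0301mon"   # 'e' followed by a combining acute accent

# The two strings render identically but are different sequences.
same_before = composed == decomposed
same_after = unicodedata.normalize("NFC", decomposed) == composed
lengths = (len(composed), len(decomposed))
```

Before normalization the strings compare unequal (7 versus 8 code points); after NFC they are identical, so a log taken after that step can no longer show which form the user typed.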

It would not be too much work to publish a set of post-processing
scripts that could perform those normalisations that people are
interested in; I don't think any two people will agree exactly on what
information is useful, and removing information unnecessarily is just
draconian.

> * display searches per week (?) that have been searched for at least
> 10 times from at least 5 different IP hashes (to avoid people
> searching their own name 100 times...)

I don't think the IP addresses should come into the analysis at all,
though possibly a cut-off at 5 or 10 searches might be useful to prevent
a huge tail-end of probably useless information (it also might exclude
cases where people have typed things into the search box by accident -
maybe they got distracted while logging in)

> The logs are probably combined across wikis, so I'd change that to
>
> #start_datetime end_datetime projectcode hits search_string

If these files were to be provided regularly, it would make sense to
have the time period and the wiki defined in the file name, covering
either a month or a week at a time. This would leave the file contents
very simple: just the raw number of hits, followed by a space, followed
by what was typed into the Search box (or as close to it as is available).

$ cat enwiktionary-2010-01-failedsearches.lis

123919 MLIF
....
12873 mlif
...
103 MILF definition
...
1 what does M.I.L.F meen????
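
A writer for this layout is short. The sketch below assumes monthly files and a `{query: hits}` mapping as input; the function names are made up, and the file name pattern simply mirrors the example above.

```python
def report_filename(wiki: str, year: int, month: int) -> str:
    """Build a name like 'enwiktionary-2010-01-failedsearches.lis'."""
    return f"{wiki}-{year:04d}-{month:02d}-failedsearches.lis"


def report_lines(counts: dict) -> list:
    """Format '<hits> <query>' lines, most frequent first."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{hits} {query}" for query, hits in ordered]


name = report_filename("enwiktionary", 2010, 1)
lines = report_lines({"mlif": 12873, "MLIF": 123919, "MILF definition": 103})
```

Because the hit count comes first and contains no spaces, a reader can recover the query with a single split on the first space, whatever characters the query contains.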

Conrad

( http://en.wiktionary.org/w/index.php?oldid=4055082 for MILF explanation)



Simetrical+wikilist at gmail

Jan 14, 2010, 9:51 AM

Post #14 of 20 (1089 views)
Re: Log of failed searches

On Thu, Jan 14, 2010 at 12:22 PM, Conrad Irwin
<conrad.irwin [at] googlemail> wrote:
> Wiktionary is case-sensitive and so case-folding there may not be
> appropriate; I personally would be interested in seeing these logs
> before even the NFC normalizers get to them (given a lack of any other
> source to find out how people type fun characters in the wild) though I
> can appreciate this is somewhat sadistic, and probably the logs are
> taken too late for this.

The logs are taken from the Squids, long before MediaWiki touches
them, so they shouldn't be normalized at all.

> I don't think the IP addresses should come into the analysis at all,
> though possibly a cut-off at 5 or 10 searches might be useful to prevent
> a huge tail-end of probably useless information (it also might exclude
> cases where people have typed things into the search box by accident -
> maybe they got distracted while logging in)

Some people might search for their own name more than five times in a
week, possibly together with other embarrassing or incriminating
search terms.



conrad.irwin at googlemail

Jan 14, 2010, 10:15 AM

Post #15 of 20 (1090 views)
Re: Log of failed searches

On 01/14/2010 05:51 PM, Aryeh Gregor wrote:
> On Thu, Jan 14, 2010 at 12:22 PM, Conrad Irwin
> <conrad.irwin [at] googlemail> wrote:
>> Wiktionary is case-sensitive and so case-folding there may not be
>> appropriate; I personally would be interested in seeing these logs
>> before even the NFC normalizers get to them (given a lack of any other
>> source to find out how people type fun characters in the wild) though I
>> can appreciate this is somewhat sadistic, and probably the logs are
>> taken too late for this.
>
> The logs are taken from the Squids, long before MediaWiki touches
> them, so they shouldn't be normalized at all.
>
>> I don't think the IP addresses should come into the analysis at all,
>> though possibly a cut-off at 5 or 10 searches might be useful to prevent
>> a huge tail-end of probably useless information (it also might exclude
>> cases where people have typed things into the search box by accident -
>> maybe they got distracted while logging in)
>
> Some people might search for their own name more than five times in a
> week, possibly together with other embarrassing or incriminating
> search terms.

Such people would be able to deny searching for such terms; I don't see
this as posing any more problems than the history dumps. Thinking
further though, it would be possible to tie a search to an IP address or
user when a page is created with the search term (if there was only one
search, it is highly likely that this user made it).

It thus seems likely that a cut off point is needed, and that it can
only be chosen arbitrarily or by someone with relevant permission
scanning logs to find out this information. Looking at "prior art", it
seems that 25 is high enough, if not more than enough:
http://wikistics.falsikon.de/2008/wiktionary/fr/wanted/ Obviously,
though, the higher the number, the less complete the lists are.

Conrad



conrad.irwin at googlemail

Jan 14, 2010, 10:26 AM

Post #16 of 20 (1093 views)
Re: Log of failed searches

> scanning logs to find out this information. Looking at "prior art", it
> seems that 25 is high enough or more than:
> http://wikistics.falsikon.de/2008/wiktionary/fr/wanted/ but obviously,
> the higher the number, the less complete the lists are.
>
> Conrad

Whoops, that should have been 14, can't do maths any more; sorry for the
spam.

Conrad



gmaxwell at gmail

Jan 14, 2010, 11:30 AM

Post #17 of 20 (1095 views)
Re: Log of failed searches

On Thu, Jan 14, 2010 at 12:22 PM, Conrad Irwin
<conrad.irwin [at] googlemail> wrote:
> Wiktionary is case-sensitive and so case-folding there may not be
> appropriate; I personally would be interested in seeing these logs
> before even the NFC normalizers get to them (given a lack of any other
> source to find out how people type fun characters in the wild) though I
> can appreciate this is somewhat sadistic, and probably the logs are
> taken too late for this.
>
> It would not be too much work to publish a set of post-processing
> scripts that could perform those normalisations that people are
> interested in; I don't think any two people will agree exactly on what

You've missed the point of the normalization here. It's not to be
helpful to users: As you observe, it's easy for the recipient of the
list to perform their own. The reason to normalize is to push more
queries above the reporting threshold. For example, 5 people might
search for "john f. kinndey" (a misspelling of "John F. Kennedy"?) but
all capitalize it differently. A redirect on this misspelling would be
useful regardless of the case.

All things equal I'd rather *not* normalize the data... it's just more
stuff that may have surprising behaviour. But I think this is
something which may need to be balanced against the disclosure
threshold.

It would also be possible to do the disclosure calculation against
normalized data while releasing the raw values... but I must admit a
little bit of uneasiness that the normalization might be ignoring some
piece of information relevant to privacy.

For example, if we were to go that route we might employ some fairly
aggressive normalization... removing all whitespace and punctuation.
If we went as far as also removing all *numbers* from the check we'd
run into things like "Greg Maxwell (555)-555-1212" getting published
because enough distinct people searched for "greg maxwell". Obviously
the answer to that one is "don't remove numbers" from the check, but I
worry about the cases I haven't thought of.
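
The "check on normalized data, release raw values" variant, with digits kept in the check, can be sketched like this. Everything here is illustrative: the `(ip, raw_query)` input shape, the function names, and the exact normalization (lowercase, whitespace and punctuation removed) are assumptions.

```python
import re
from collections import defaultdict


def check_key(raw: str) -> str:
    """Aggressive key for the threshold check; digits are deliberately kept."""
    return re.sub(r"[\W_]+", "", raw.lower())


def releasable_raw_queries(records, min_distinct_ips=5):
    """Return raw strings whose normalized key was seen from enough IPs."""
    ips_by_key = defaultdict(set)
    raw_by_key = defaultdict(set)
    for ip, raw in records:
        key = check_key(raw)
        ips_by_key[key].add(ip)
        raw_by_key[key].add(raw)
    released = set()
    for key, ips in ips_by_key.items():
        if len(ips) >= min_distinct_ips:
            released |= raw_by_key[key]
    return released


spellings = ["greg maxwell", "Greg  Maxwell", "GREG MAXWELL",
             "greg-maxwell", "Greg Maxwell"]
records = [("10.0.0.%d" % i, q) for i, q in enumerate(spellings, start=1)]
records.append(("10.0.0.6", "Greg Maxwell (555)-555-1212"))
released = releasable_raw_queries(records)
```

Because the digits survive in the key, the phone-number query forms its own group with a single IP and stays out of the release, while the five spellings of the bare name are published. It does nothing for the cases nobody has thought of.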

On Thu, Jan 14, 2010 at 12:51 PM, Aryeh Gregor
<Simetrical+wikilist [at] gmail> wrote:
> Some people might search for their own name more than five times in a
> week, possibly together with other embarrassing or incriminating
> search terms.

Yes, it's possible that someone may search 5 times, from 5 IPs (which
*might* all be one machine due to proxy round-robins), for an identical
string ... "MyFullName seen on friday night with a woman other than
his wife" ... but what to do?

Any information which is disclosed has some risk of disclosing
something that someone would rather not have disclosed. This risk can
be made arbitrarily small, but it can't be eliminated.

I think the benefit to the readers of having this information
available easily outweighs some sufficiently fringe confidentiality
concern. At some point your frequently repeated search is a
statistic, which no reasonable privacy policy would frown on
disclosing.

This is important to our operations, disclosing it is in the public
interest, and failing to do work in this area puts us at a
disadvantage compared to other parties who might be far less
scrupulous. (e.g. If WMF's search performs poorly, you might feel
compelled to use Search Engine X — which happens to secretly sell your
data to the highest bidder.)

Is there some sufficiently high number which *no one* paying attention
here has a concern about? We could simply start with that.... and
possibly lower the threshold over time as the lowest hanging fruit are
solved, tracking our disclosure comfort.

I think we all have an interest and obligation to take every
reasonable means, but no one can ask for more than that.

Would anyone feel more comfortable if this ignored queries made via
the secure server? Non-HTTPS traffic can be watched by anyone on the
path between you and Wikimedia... any illusion of absolute privacy on
the insecure traffic is patently false already.



rainmansr at gmail

Jan 14, 2010, 12:03 PM

Post #18 of 20 (1089 views)
Re: Log of failed searches

> Such people would be able to deny searching for such terms, I don't see
> this as posing any more problems than the history dumps. Thinking
> further though, it would be possible to tie a search to an IP address or
> User when a page is created with the search term (as it is highly likely
> if there was only one search that it was this user who did it)

I think the biggest issue is that people expect their search queries to
be private. When word got out that search logs might become available,
we got a couple of angry remarks like "we didn't sign up for this" and
"even google doesn't do it". Some form of statistics would probably be
fine with our users, but the cut-off numbers would need to be high
enough that people no longer feel it is "their query".

r.




Platonides at gmail

Jan 14, 2010, 3:32 PM

Post #19 of 20 (1094 views)
Re: Log of failed searches

Aryeh Gregor wrote:
> The logs are taken from the Squids, long before MediaWiki touches
> them, so they shouldn't be normalized at all.

Search isn't cached, so it may be easier to just log it at the backend.


I expect many people use queries like "please tell me how many people
live in China", as revealed by such titles being created.
My conclusion is that some people (10%?) don't know how to search in an
encyclopedia. I mean, we have an article called [[China]] with a proper
Population section...

While reading this thread I have deleted a page called "Why do ghosts
manifest themselves?" with content
"fogcpijkñldjlkcmvlkmc.,vmblcjgmlkjglkjmf,.mfdgfdolfgdjk" [1].

I'm thinking of an extension fed with regexes that extract the actual
title they may be looking for.
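
As a rough idea of what one such regex could look like, here is a Python sketch for the simplest English question forms ("what is X", "who was X"). The pattern and the `guess_title` name are invented for illustration; a query like the China example above needs more than one pattern, and other languages need their own.

```python
import re

# Optional "please", a question word, a form of "to be", an optional
# article, then the candidate title, then optional trailing '?' marks.
QUESTION_RE = re.compile(
    r"^\s*(?:please\s+)?(?:what|who)\s+(?:is|are|was|were)\s+"
    r"(?:(?:a|an|the)\s+)?(?P<title>.+?)\s*\?*\s*$",
    re.IGNORECASE,
)


def guess_title(query: str):
    """Return the likely article title, or None if no pattern applies."""
    match = QUESTION_RE.match(query)
    return match.group("title") if match else None


guesses = [guess_title(q) for q in
           ["What is a quark?", "who was Ada Lovelace", "China"]]
```

A plain title such as "China" matches nothing and returns `None`, so the normal search path would stay untouched for it.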

Sampled search logs are unlikely to reveal them though, since what they
are repeating are the non-keywords, not the full query.

[1] http://es.wikipedia.org/w/index.php?title=Special:Log&page=Por_que_se_manifiestan_los_fantasmas




gmaxwell at gmail

Jan 14, 2010, 6:16 PM

Post #20 of 20 (1081 views)
Re: Log of failed searches

On Thu, Jan 14, 2010 at 6:32 PM, Platonides <Platonides [at] gmail> wrote:
> Sampled search logs are unlikely to reveal them though, since what they
> are repeating are the non-keywords, not the full query.

Sampling is fine, but aggregated logs aren't likely to… that's the
primary reason for reporting things other than the topmost queries.

