
Mailing List Archive: Wikipedia: Wikitech

State of page view stats

 

 



z at mzmcbride

Aug 11, 2011, 3:12 PM

Post #1 of 26 (3789 views)
Permalink
State of page view stats

Hi.

I've been asked a few times recently about doing reports of the most-viewed
pages per month/per day/per year/etc. A few years after Domas first started
publishing this information in raw form, the current situation seems rather
bleak. Henrik has a visualization tool with a very simple JSON API behind it
(<http://stats.grok.se>), but other than that, I don't know of any efforts
to put this data into a database.

Currently, if you want data on, for example, every article on the English
Wikipedia, you'd have to make 3.7 million individual HTTP requests to
Henrik's tool. At one per second, you're looking at over a month's worth of
continuous fetching. This is obviously not practical.
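For concreteness, a minimal sketch of what that per-article fetching looks like (the /json/<lang>/<YYYYMM>/<title> URL pattern and the "daily_views" field are assumptions about Henrik's API, and error handling is omitted):

    import json
    import time
    import urllib.parse
    import urllib.request

    def monthly_views(lang, yyyymm, title):
        # Fetch one article's daily view counts for one month from stats.grok.se.
        # The URL pattern and the "daily_views" field are assumptions.
        url = "http://stats.grok.se/json/%s/%s/%s" % (
            lang, yyyymm, urllib.parse.quote(title.replace(" ", "_")))
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)["daily_views"]

    # One HTTP request per article: at one request per second, 3.7 million
    # articles take 3.7e6 seconds, i.e. roughly 43 days of continuous fetching.
    for title in ["Main Page", "Julian Assange"]:
        print(title, sum(monthly_views("en", "201107", title).values()))
        time.sleep(1)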

A lot of people were waiting on Wikimedia's Open Web Analytics work to come
to fruition, but it seems that has been indefinitely put on hold. (Is that
right?)

Is it worth a Toolserver user's time to try to create a database of
per-project, per-page page view statistics? Is it worth a grant from the
Wikimedia Foundation to have someone work on this? Is it worth trying to
convince Wikimedia Deutschland to assign resources? And, of course, it
wouldn't be a bad idea if Domas' first-pass implementation was improved on
Wikimedia's side, regardless.

Thoughts and comments welcome on this. There's a lot of desire to have a
usable system.

MZMcBride



_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


ian at wikimedia

Aug 11, 2011, 4:26 PM

Post #2 of 26 (3750 views)
Permalink
Re: State of page view stats [In reply to]

Thanks for bringing this up! I don't have any answers, but there's a
feature I'd like to build on this dataset. I wonder if bringing this stuff
into a more readily available database could be part of that project in some
way.

Basically, I'd like to publish per-editor pageview stats. That is,
Mediawiki would keep track of the number of times an article had been viewed
since the first day you edited it, and let you know how many times your
edits had been seen (approximately, depending on the resolution of the
data). I think such personalized stats could really help to drive editor
retention. The information is available now through Henrik's tool, but
even if you know about stats.grok.se, it's hard to keep track and make the
connection between the graphs there and one's own contributions.

Clearly, pageview data of at least daily resolution would be required to
make such a thing work.
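A rough sketch of the aggregation being described, assuming daily per-article counts were already available from some store (the data structures here are hypothetical):

    from datetime import date, timedelta

    def views_since_first_edit(daily_views, first_edit_day, today):
        # daily_views: dict mapping datetime.date -> view count, i.e. exactly
        # the daily-resolution data this thread says is hard to get at today.
        total = 0
        day = first_edit_day
        while day <= today:
            total += daily_views.get(day, 0)
            day += timedelta(days=1)
        return total

    # Example with made-up numbers.
    views = {date(2011, 8, 10): 120, date(2011, 8, 11): 95}
    print(views_since_first_edit(views, date(2011, 8, 10), date(2011, 8, 11)))  # 215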

Are there other specific projects that require this data? It will be much
easier to make a case for accelerating development of the dataset if there
are some clear examples of where it's needed, and especially if it can help
to meet the current editor retention goals.

-Ian


On Thu, Aug 11, 2011 at 3:12 PM, MZMcBride <z [at] mzmcbride> wrote:

> Hi.
>
> I've been asked a few times recently about doing reports of the most-viewed
> pages per month/per day/per year/etc. A few years after Domas first started
> publishing this information in raw form, the current situation seems rather
> bleak. Henrik has a visualization tool with a very simple JSON API behind
> it
> (<http://stats.grok.se>), but other than that, I don't know of any efforts
> to put this data into a database.
>
> Currently, if you want data on, for example, every article on the English
> Wikipedia, you'd have to make 3.7 million individual HTTP requests to
> Henrik's tool. At one per second, you're looking at over a month's worth of
> continuous fetching. This is obviously not practical.
>
> A lot of people were waiting on Wikimedia's Open Web Analytics work to come
> to fruition, but it seems that has been indefinitely put on hold. (Is that
> right?)
>
> Is it worth a Toolserver user's time to try to create a database of
> per-project, per-page page view statistics? Is it worth a grant from the
> Wikimedia Foundation to have someone work on this? Is it worth trying to
> convince Wikimedia Deutschland to assign resources? And, of course, it
> wouldn't be a bad idea if Domas' first-pass implementation was improved on
> Wikimedia's side, regardless.
>
> Thoughts and comments welcome on this. There's a lot of desire to have a
> usable system.
>
> MZMcBride
>
>
>
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l [at] lists
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


derhoermi at gmx

Aug 11, 2011, 5:09 PM

Post #3 of 26 (3759 views)
Permalink
Re: State of page view stats [In reply to]

* MZMcBride wrote:
>I've been asked a few times recently about doing reports of the most-viewed
>pages per month/per day/per year/etc. A few years after Domas first started
>publishing this information in raw form, the current situation seems rather
>bleak. Henrik has a visualization tool with a very simple JSON API behind it
>(<http://stats.grok.se>), but other than that, I don't know of any efforts
>to put this data into a database.

When making http://katograph.appspot.com/, which renders the German Wikipedia
category system as an interactive "treemap" based on information
like the number of articles in each category and requests during a 3-day
period, I found that the proxy logs used for stats.grok.se are rather
unreliable, with many of the "top" pages being implausible (articles on not
very notable subjects that have existed only for a very short time show up in
the top ten, for instance). On http://stats.grok.se/en/top you can see
this as well: 40 million views for `Special:Export/Robert L. Bradley, Jr`
is rather implausible, as far as human users are concerned.

>Is it worth a Toolserver user's time to try to create a database of
>per-project, per-page page view statistics? Is it worth a grant from the
>Wikimedia Foundation to have someone work on this? Is it worth trying to
>convince Wikimedia Deutschland to assign resources? And, of course, it
>wouldn't be a bad idea if Domas' first-pass implementation was improved on
>Wikimedia's side, regardless.

The data that powers stats.grok.se is available for download; it should
be rather trivial to feed it into Toolserver databases and query it as
desired, ignoring performance problems. But short of believing that in
December 2010 "User Datagram Protocol" was more interesting to people
than Julian Assange, you would need some other data source to make good
statistics. http://stats.grok.se/de/201009/Ngai.cc would be another
example.
--
Björn Höhrmann · mailto:bjoern [at] hoehrmann · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


derhoermi at gmx

Aug 11, 2011, 5:53 PM

Post #4 of 26 (3749 views)
Permalink
Re: State of page view stats [In reply to]

* Ian Baker wrote:
>Basically, I'd like to publish per-editor pageview stats. That is,
>Mediawiki would keep track of the number of times an article had been viewed
>since the first day you edited it, and let you know how many times your
>edits had been seen (approximately, depending on the resolution of the
>data). I think such personalized stats could really help to drive editor
>retention. The information is available now through Henrik's tool, but
>even if you know about stats.grok.se, it's hard to keep track and make the
>connection between the graphs there and one's own contributions.

If the stats.grok.se data actually captures nearly all requests, then I
am not sure you realize how low the figures are. On the German Wikipedia
during 20-22 December 2009, the median number of requests for articles in
the category "Mann" (men) was 7, meaning half of the articles were
requested at most 7 times during a 3-day period (2.33 times per day). In
the same period, the "Hauptseite" (Main Page) registered 900,000 requests
(roughly 128,000 times the "Mann" median figure).
--
Björn Höhrmann · mailto:bjoern [at] hoehrmann · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


robla at wikimedia

Aug 11, 2011, 5:55 PM

Post #5 of 26 (3748 views)
Permalink
Re: State of page view stats [In reply to]

On Thu, Aug 11, 2011 at 3:12 PM, MZMcBride <z [at] mzmcbride> wrote:
> I've been asked a few times recently about doing reports of the most-viewed
> pages per month/per day/per year/etc. A few years after Domas first started
> publishing this information in raw form, the current situation seems rather
> bleak. Henrik has a visualization tool with a very simple JSON API behind it
> (<http://stats.grok.se>), but other than that, I don't know of any efforts
> to put this data into a database.
>[...]
> A lot of people were waiting on Wikimedia's Open Web Analytics work to come
> to fruition, but it seems that has been indefinitely put on hold. (Is that
> right?)

That's correct. I owe everyone a longer writeup of our change in
direction on that project which I have the raw notes for.

The short answer is that we've been having a tough time hiring the
people we'd have do this work. Here are the two job descriptions:
http://wikimediafoundation.org/wiki/Job_openings/Software_Developer_Backend
http://wikimediafoundation.org/wiki/Job_openings/Systems_Engineer_-_Data_Analytics

Please help us recruit for these roles (and apply if you believe you are a fit)!

Thanks!
Rob

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


z at mzmcbride

Aug 11, 2011, 6:48 PM

Post #6 of 26 (3747 views)
Permalink
Re: State of page view stats [In reply to]

Rob Lanphier wrote:
> On Thu, Aug 11, 2011 at 3:12 PM, MZMcBride <z [at] mzmcbride> wrote:
>> I've been asked a few times recently about doing reports of the most-viewed
>> pages per month/per day/per year/etc. A few years after Domas first started
>> publishing this information in raw form, the current situation seems rather
>> bleak. Henrik has a visualization tool with a very simple JSON API behind it
>> (<http://stats.grok.se>), but other than that, I don't know of any efforts
>> to put this data into a database.
>> [...]
>> A lot of people were waiting on Wikimedia's Open Web Analytics work to come
>> to fruition, but it seems that has been indefinitely put on hold. (Is that
>> right?)
>
> That's correct. I owe everyone a longer writeup of our change in
> direction on that project which I have the raw notes for.

Okay. Please be sure to copy this list on that write-up. :-)

> The short answer is that we've been having a tough time hiring the
> people we'd have do this work. Here are the two job descriptions:
> http://wikimediafoundation.org/wiki/Job_openings/Software_Developer_Backend
> http://wikimediafoundation.org/wiki/Job_openings/Systems_Engineer_-_Data_Analytics

As someone with most of the skills and resources (with the exception of
time, possibly) to create a page view stats database, reading something like
this makes me think it's not worth the effort on my part, iff Wikimedia
is planning on devoting actual resources to the endeavor. Is that a
reasonable conclusion to draw? Is it unreasonable?

MZMcBride



_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


z at mzmcbride

Aug 11, 2011, 6:54 PM

Post #7 of 26 (3749 views)
Permalink
Re: State of page view stats [In reply to]

Bjoern Hoehrmann wrote:
> When making http://katograph.appspot.com/, which renders the German Wikipedia
> category system as an interactive "treemap" based on information
> like the number of articles in each category and requests during a 3-day
> period, I found that the proxy logs used for stats.grok.se are rather
> unreliable, with many of the "top" pages being implausible (articles on not
> very notable subjects that have existed only for a very short time show up in
> the top ten, for instance). On http://stats.grok.se/en/top you can see
> this as well: 40 million views for `Special:Export/Robert L. Bradley, Jr`
> is rather implausible, as far as human users are concerned.

Yes, the data is susceptible to manipulation, both intentional and
unintentional. As I said, this was a first-pass implementation on Domas'
part. As far as I know, this hasn't been touched by anyone in years. You're
absolutely correct that, at the end of the day, until the data itself is
better (more reliable), the resulting tools/graphs/scripts/everything that
rely on it will be bound by its limitations.

> MZMcBride wrote:
>> Is it worth a Toolserver user's time to try to create a database of
>> per-project, per-page page view statistics? Is it worth a grant from the
>> Wikimedia Foundation to have someone work on this? Is it worth trying to
>> convince Wikimedia Deutschland to assign resources? And, of course, it
>> wouldn't be a bad idea if Domas' first-pass implementation was improved on
>> Wikimedia's side, regardless.
>
> The data that powers stats.grok.se is available for download, it should
> be rather trivial to feed it into toolserver databases and query it as
> desired, ignoring performance problems.

Not simply performance. It's a lot of data and it needs to be indexed. That
has a real cost. There are also edge cases and corner cases (different
encodings of requests, etc.) that need to be accounted for. It's not a
particularly small undertaking, if it's to be done properly.
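To illustrate the kind of edge cases meant here, a sketch of the normalization a loader might apply before counts for the "same" page can be merged (the exact rules are an assumption, not a spec):

    import urllib.parse

    def normalize_title(raw):
        # Fold differently-encoded requests for the same page onto one key.
        # Handles percent-encoding, spaces vs. underscores, and stray fragments;
        # real log data has many more quirks than this.
        title = urllib.parse.unquote(raw)
        title = title.split("#", 1)[0]            # drop fragment anchors
        title = title.replace(" ", "_").strip("_")
        if title:
            title = title[0].upper() + title[1:]  # first-letter capitalization
        return title

    # All three request forms collapse to the same key.
    for raw in ["Julian_Assange", "Julian%20Assange", "julian_Assange#Early_life"]:
        print(normalize_title(raw))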

MZMcBride



_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


z at mzmcbride

Aug 11, 2011, 8:23 PM

Post #8 of 26 (3743 views)
Permalink
Re: State of page view stats [In reply to]

Ian Baker wrote:
> Basically, I'd like to publish per-editor pageview stats. That is,
> Mediawiki would keep track of the number of times an article had been viewed
> since the first day you edited it, and let you know how many times your
> edits had been seen (approximately, depending on the resolution of the
> data). I think such personalized stats could really help to drive editor
> retention. The information is available now through Henrik's tool, but
> even if you know about stats.grok.se, it's hard to keep track and make the
> connection between the graphs there and one's own contributions.

This is a neat idea. MediaWiki has some page view count support built in,
but it's been disabled on Wikimedia wikis for pretty much forever. The
reality is that MediaWiki isn't launched for the vast majority of requests.
A user making an edit is obviously different, though. I think a database
with per-day view support would make this feature somewhat feasible, in a
JavaScript gadget or in a MediaWiki extension.

> Are there other specific projects that require this data? It will be much
> easier to make a case for accelerating development of the dataset if there
> are some clear examples of where it's needed, and especially if it can help
> to meet the current editor retention goals.

Heh. It's refreshing to hear this said aloud. Yes, if there were some way to
tie page view stats to fundraising/editor retention/usability/the gender
gap/the Global South, it'd be much simpler to get resources devoted to it.
Without a doubt.

There are countless applications for this data, particularly as a means of
measuring Wikipedia's impact. This data also provides a scale against which
other articles and projects can be measured. In a vacuum, knowing that the
English Wikipedia's article "John Doe" received 400 views per day on average
in June means very little. When you can compare that figure to the average
views per day of every other article on the English Wikipedia (or every
other article on the German Wikipedia), you can begin doing real analysis
work. Currently, this really isn't possible, and that's a Bad Thing.

MZMcBride



_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


emw.wiki at gmail

Aug 11, 2011, 10:32 PM

Post #9 of 26 (3747 views)
Permalink
Re: State of page view stats [In reply to]

I'd be willing to work on this on a volunteer basis.

I developed http://toolserver.org/~emw/wikistats/, a page view analysis tool
that incorporates lots of features that have been requested of Henrik's tool.
The main bottleneck has been that, as MZMcBride mentions, an underlying
database of page view data is unavailable. Henrik's JSON API has limitations
probably tied to the underlying data model. The fact that there aren't any
other such APIs is arguably the bigger problem.

I wrote down some initial thoughts on how the reliability of this data, and WMF's
page view data services generally, could be improved at
http://en.wikipedia.org/w/index.php?title=User_talk:Emw&oldid=442596566#wikistats_on_toolserver.
I've also drafted more specific implementation plans. These plans assume that
I would be working with the basic data in Domas's archives. There is still a
lot of untapped information in that data -- e.g. hourly views -- and potential
for mashups with categories, automated inference of trend causes, etc. If more
detailed (but still anonymized) OWA data were available, however, that would
obviously open up the potential for much richer APIs and analysis.

Getting the archived page view data into a database seems very doable. This
data seems like it would be useful even if there were OWA data available, since
that OWA data wouldn't cover 12/2007 through 2009. As I see it, the main thing
needed from WMF would be storage space on a publicly-available server. Then,
optionally, maybe some funds for the cost of cloud services to process and
compress the data, and put it into a database. Input and advice would be
invaluable, too.

Eric


_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


ian at wikimedia

Aug 11, 2011, 11:54 PM

Post #10 of 26 (3743 views)
Permalink
Re: State of page view stats [In reply to]

On Thu, Aug 11, 2011 at 8:23 PM, MZMcBride <z [at] mzmcbride> wrote:
>
> This is a neat idea. MediaWiki has some page view count support built in,
> but it's been disabled on Wikimedia wikis for pretty much forever. The
> reality is that MediaWiki isn't launched for the vast majority of requests.
> A user making an edit is obviously different, though. I think a database
> with per-day view support would make this feature somewhat feasible, in a
> JavaScript gadget or in a MediaWiki extension.


Oh, totally. The only place we can get meaningful data is from the squids,
which is where Dario's data comes from, yes?

Sadly, anything we build that works with that architecture won't be so
useful to other mediawiki installations, at least on the backend. I can
imagine an extension that displays the info to editors that can fetch its
stats from a slightly abstracted datasource, with some architecture whereby
the stats could come from a variety of log-processing applications via a
script or plugin. Then, we could write a connector for our cache clusters,
and someone else could write one for theirs, and we'd still get to share one
codebase for everything else.


> > Are there other specific projects that require this data? It will be
> much
> > easier to make a case for accelerating development of the dataset if
> there
> > are some clear examples of where it's needed, and especially if it can
> help
> > to meet the current editor retention goals.
>
> Heh. It's refreshing to hear this said aloud. Yes, if there were some way
> to
> tie page view stats to fundraising/editor retention/usability/the gender
> gap/the Global South, it'd be much simpler to get resources devoted to it.
> Without a doubt.
>
> There are countless applications for this data, particularly as a means of
> measuring Wikipedia's impact. This data also provides a scale against which
> other articles and projects can be measured. In a vacuum, knowing that the
> English Wikipedia's article "John Doe" received 400 views per day on
> average
> in June means very little. When you can compare that figure to the average
> views per day of every other article on the English Wikipedia (or every
> other article on the German Wikipedia), you can begin doing real analysis
> work. Currently, this really isn't possible, and that's a Bad Thing.


Oh, totally. I can see a lot of really effective potential applications for
this data. However, the link between view statistics and editor retention
isn't necessarily immediately clear. At WMF at least, the prevailing
point-of-view is that readership is doing okay, and at the moment we're
focused on other areas. Personally, I think readership numbers are an
important piece of the puzzle and could be a direct motivator for editing.
Furthermore, increasingly nuanced readership stats might be usable for that
other perennial goal, fundraising (though I don't have specific ideas for
this at the moment).

I wonder if maybe we could consolidate a couple of concrete proposals for
features that are dependent on this data. That would help to highlight this
as a bottleneck and clearly explain how solving this problem now will help
contribute to meeting current goals.

My thinking is, if it's possible to make a good case for it, it should
happen now. Even if WMF has a req out for a developer to build this,
there's no reason to avoid consolidating the research and ideas in one place
so that person can work more effectively. Were someone in the community to
start building it, even better! If we bring on a dev to collaborate and
maintain it long-term, they'd just end up working together closely for a
while, which would accelerate the learning process for the new developer.
As someone who's still on the steep part of that learning curve, I can
attest that any and all information we can provide will get this feature out
the door faster. :)

-Ian
_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


midom.lists at gmail

Aug 12, 2011, 1:49 AM

Post #11 of 26 (3740 views)
Permalink
Re: State of page view stats [In reply to]

Hi!

> Currently, if you want data on, for example, every article on the English
> Wikipedia, you'd have to make 3.7 million individual HTTP requests to
> Henrik's tool. At one per second, you're looking at over a month's worth of
> continuous fetching. This is obviously not practical.

Or you can download raw data.

> A lot of people were waiting on Wikimedia's Open Web Analytics work to come
> to fruition, but it seems that has been indefinitely put on hold. (Is that
> right?)

That project was pulsing with naiveness, if it ever had to be applied to wide scope of all projects ;-)

> Is it worth a Toolserver user's time to try to create a database of
> per-project, per-page page view statistics?

Creating such database is easy, making it efficient is a bit different :-)

> And, of course, it wouldn't be a bad idea if Domas' first-pass implementation was improved on Wikimedia's side, regardless.

My implementation is for obtaining raw data from our squid tier, what is wrong with it?
Generally I had ideas of making query-able data source - it isn't impossible given a decent mix of data structures ;-)

> Thoughts and comments welcome on this. There's a lot of desire to have a
> usable system.

Sure, interesting what people think could be useful with the dataset - we may facilitate it.

> But short of believing that in
> December 2010 "User Datagram Protocol" was more interesting to people
> than Julian Assange you would need some other data source to make good
> statistics.

Yeah, "lies, damn lies and statistics". We need better statistics (adjusted by wikipedian geekiness) than full page sample because you don't believe general purpose wiki articles that people can use in their work can be more popular than some random guy on the internet and trivia about him.
Dracula is also more popular than Julian Assange, so is Jenna Jameson ;-)

> http://stats.grok.se/de/201009/Ngai.cc would be another example.


Unfortunately every time you add ability to spam something, people will spam. There's also unintentional crap that ends up in HTTP requests because of broken clients. It is easy to filter that out in postprocessing, if you want, by applying article-exists bloom filter ;-)
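A sketch of that article-exists filtering, using a plain Python set of known titles as a stand-in for a Bloom filter; the "project title count bytes" line format is assumed:

    def filter_existing(pagecount_lines, existing_titles):
        # Drop counts for titles that do not correspond to an existing page.
        # existing_titles: a set of page titles for the project, e.g. built
        # from a titles dump; a Bloom filter would trade exactness for memory.
        for line in pagecount_lines:
            project, title, count, _bytes = line.split(" ", 3)
            if title in existing_titles:
                yield project, title, int(count)

    existing = {"Main_Page", "Julian_Assange"}
    lines = ["en Main_Page 123456 999999", "en Some_spammed_garbage 40000000 1"]
    print(list(filter_existing(lines, existing)))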

> If the stats.grok.se data actually captures nearly all requests, then I am not sure you realize how low the figures are.

Low they are, Wikipedia's content is all about very long tail of data, besides some heavily accessed head. Just graph top-100 or top-1000 and you will see the shape of the curve:
https://docs.google.com/spreadsheet/pub?hl=en_US&key=0AtHDNfVx0WNhdGhWVlQzRXZuU2podzR2YzdCMk04MlE&hl=en_US&gid=1
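A sketch of graphing that curve, assuming per-article totals have already been aggregated into a dict (requires matplotlib):

    import matplotlib.pyplot as plt

    def plot_long_tail(views_by_title, top_n=1000):
        # Plot views against rank for the top_n articles to show the long tail.
        counts = sorted(views_by_title.values(), reverse=True)[:top_n]
        plt.loglog(range(1, len(counts) + 1), counts)
        plt.xlabel("rank")
        plt.ylabel("views")
        plt.title("Top-%d page views (log-log)" % len(counts))
        plt.show()

    # Synthetic example: a Zipf-ish distribution.
    fake = {"page_%d" % i: int(1e6 / i) for i in range(1, 5001)}
    plot_long_tail(fake)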

> As someone with most of the skills and resources (with the exception of time, possibly) to create a page view stats database, reading something like this makes me think...

Wow.

> Yes, the data is susceptible to manipulation, both intentional and unintentional.

I wonder how someone with most of skills and resources wants to solve this problem (besides the aforementioned article-exists filter, which could reduce dataset quite a lot ;)

> ... you can begin doing real analysis work. Currently, this really isn't possible, and that's a Bad Thing.

Raw data allows you to do whatever analysis you want. Shove it into SPSS/R/.. ;-) Statistics much?

> The main bottleneck has been that, like MZMcBride mentions, an underlying
> database of page view data is unavailable.

Underlying database is available, just not in easily queryable format. There's a distinction there, unless you all imagine database as something you send SQL to and it gives you data. Sorted files are databases too ;-)
Anyway, I don't say that the project is impossible or unnecessary, but there're lots of tradeoffs to be made - what kind of real time querying workloads are to be expected, what kind of pre-filtering do people expect, etc.

Of course, we could always use OWA.

Domas
_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


westand at cis

Aug 12, 2011, 4:08 AM

Post #12 of 26 (3742 views)
Permalink
Re: State of page view stats [In reply to]

Hello everyone,

I've actually been parsing the raw data from
[http://dammit.lt/wikistats/] daily into a MySQL database for over a
year now. I also store statistics at hour-granularity, whereas
[stats.grok.se] stores them at day granularity, it seems.

I only do this for en.wiki, and it's certainly not efficient enough to
open up for public use. However, I'd be willing to chat and share code
with any interested developer. The strategy and schema are a bit
awkward, but it works, and requires on average ~2 hours of processing to
store 24 hours' worth of statistics.
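This is not Andrew's actual code or schema (which he later shared privately); just a minimal sketch of the same pipeline, using sqlite3 as a stand-in for MySQL and assuming the "project title count bytes" format of the hourly files:

    import gzip
    import sqlite3
    from collections import Counter

    def load_hour(path, totals):
        # Accumulate one gzipped hourly pagecounts file into per-title totals.
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                parts = line.split(" ")
                if len(parts) != 4 or parts[0] != "en":  # en.wiki only, as above
                    continue
                totals[parts[1]] += int(parts[2])

    def store_day(db, day, hourly_paths):
        totals = Counter()
        for path in hourly_paths:
            load_hour(path, totals)
        db.execute("CREATE TABLE IF NOT EXISTS views "
                   "(day TEXT, title TEXT, views INTEGER, PRIMARY KEY (day, title))")
        db.executemany("INSERT OR REPLACE INTO views VALUES (?, ?, ?)",
                       ((day, t, c) for t, c in totals.items()))
        db.commit()

    # e.g. store_day(sqlite3.connect("pageviews.db"), "2011-08-11",
    #                ["pagecounts-20110811-%02d0000.gz" % h for h in range(24)])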

Thanks, -AW


On 08/12/2011 04:49 AM, Domas Mituzas wrote:
> Hi!
>
>> Currently, if you want data on, for example, every article on the English
>> Wikipedia, you'd have to make 3.7 million individual HTTP requests to
>> Henrik's tool. At one per second, you're looking at over a month's worth of
>> continuous fetching. This is obviously not practical.
>
> Or you can download raw data.
>
>> A lot of people were waiting on Wikimedia's Open Web Analytics work to come
>> to fruition, but it seems that has been indefinitely put on hold. (Is that
>> right?)
>
> That project was pulsing with naiveness, if it ever had to be applied to wide scope of all projects ;-)
>
>> Is it worth a Toolserver user's time to try to create a database of
>> per-project, per-page page view statistics?
>
> Creating such database is easy, making it efficient is a bit different :-)
>
>> And, of course, it wouldn't be a bad idea if Domas' first-pass implementation was improved on Wikimedia's side, regardless.
>
> My implementation is for obtaining raw data from our squid tier, what is wrong with it?
> Generally I had ideas of making query-able data source - it isn't impossible given a decent mix of data structures ;-)
>
>> Thoughts and comments welcome on this. There's a lot of desire to have a
>> usable system.
>
> Sure, interesting what people think could be useful with the dataset - we may facilitate it.
>
>> But short of believing that in
>> December 2010 "User Datagram Protocol" was more interesting to people
>> than Julian Assange you would need some other data source to make good
>> statistics.
>
> Yeah, "lies, damn lies and statistics". We need better statistics (adjusted by wikipedian geekiness) than full page sample because you don't believe general purpose wiki articles that people can use in their work can be more popular than some random guy on the internet and trivia about him.
> Dracula is also more popular than Julian Assange, so is Jenna Jameson ;-)
>
>> http://stats.grok.se/de/201009/Ngai.cc would be another example.
>
>
> Unfortunately every time you add ability to spam something, people will spam. There's also unintentional crap that ends up in HTTP requests because of broken clients. It is easy to filter that out in postprocessing, if you want, by applying article-exists bloom filter ;-)
>
>> If the stats.grok.se data actually captures nearly all requests, then I am not sure you realize how low the figures are.
>
> Low they are, Wikipedia's content is all about very long tail of data, besides some heavily accessed head. Just graph top-100 or top-1000 and you will see the shape of the curve:
> https://docs.google.com/spreadsheet/pub?hl=en_US&key=0AtHDNfVx0WNhdGhWVlQzRXZuU2podzR2YzdCMk04MlE&hl=en_US&gid=1
>
>> As someone with most of the skills and resources (with the exception of time, possibly) to create a page view stats database, reading something like this makes me think...
>
> Wow.
>
>> Yes, the data is susceptible to manipulation, both intentional and unintentional.
>
> I wonder how someone with most of skills and resources wants to solve this problem (besides the aforementioned article-exists filter, which could reduce dataset quite a lot ;)
>
>> ... you can begin doing real analysis work. Currently, this really isn't possible, and that's a Bad Thing.
>
> Raw data allows you to do whatever analysis you want. Shove it into SPSS/R/.. ;-) Statistics much?
>
>> The main bottleneck has been that, like MZMcBride mentions, an underlying
>> database of page view data is unavailable.
>
> Underlying database is available, just not in easily queryable format. There's a distinction there, unless you all imagine database as something you send SQL to and it gives you data. Sorted files are databases too ;-)
> Anyway, I don't say that the project is impossible or unnecessary, but there're lots of tradeoffs to be made - what kind of real time querying workloads are to be expected, what kind of pre-filtering do people expect, etc.
>
> Of course, we could always use OWA.
>
> Domas
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l [at] lists
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l

--
Andrew G. West, Doctoral Student
Dept. of Computer and Information Science
University of Pennsylvania, Philadelphia PA
Email: westand [at] cis
Website: http://www.cis.upenn.edu/~westand

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


emw.wiki at gmail

Aug 12, 2011, 4:49 AM

Post #13 of 26 (3739 views)
Permalink
Re: State of page view stats [In reply to]

> Anyway, I don't say that the project is impossible or unnecessary, but
there're lots of tradeoffs to be made
> - what kind of real time querying workloads are to be expected, what kind of
pre-filtering do people expect, etc.

I could be biased here, but I think the canonical use case for someone seeking
page view information would be viewing page view counts for a set of articles --
most times a single article, but also multiple articles -- over an arbitrary
time range. Narrowing that down, I'm not sure whether the level of demand for
real-time data (say, for the previous hour) would be higher than the demand for
fast query results for more historical data. Would these two workloads imply
the kind of trade-off you were referring to? If not, could you give some
examples of what kind of expected workloads/use cases would entail such
trade-offs?

If ordering pages by page view count for a given time period would imply such a
tradeoff, then I think it'd make sense to deprioritize page ordering.

I'd be really interested to know your thoughts on an efficient schema for
organizing the raw page view data in the archives at http://dammit.lt/wikistats/.

Thanks,
Eric


_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


z at mzmcbride

Aug 12, 2011, 7:30 AM

Post #14 of 26 (3740 views)
Permalink
Re: State of page view stats [In reply to]

Andrew G. West wrote:
> I've actually been parsing the raw data from
> [http://dammit.lt/wikistats/] daily into a MySQL database for over a
> year now. I also store statistics at hour-granularity, whereas
> [stats.grok.se] stores them at day granularity, it seems.
>
> I only do this for en.wiki, and its certainly not efficient enough to
> open up for public use. However, I'd be willing to chat and share code
> with any interested developer. The strategy and schema are a bit
> awkward, but it works, and requires on average ~2 hours processing to
> store 24 hours worth of statistics.

I'd certainly be interested in seeing the code and database schema you've
written, if only as a point of reference and to learn from any
bugs/issues/etc. that you've encountered along the way. Is it possible for
you to post the code you're using somewhere?

MZMcBride



_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


z at mzmcbride

Aug 12, 2011, 7:51 AM

Post #15 of 26 (3740 views)
Permalink
Re: State of page view stats [In reply to]

Domas Mituzas wrote:
> Hi!

Hi!

>> Currently, if you want data on, for example, every article on the English
>> Wikipedia, you'd have to make 3.7 million individual HTTP requests to
>> Henrik's tool. At one per second, you're looking at over a month's worth of
>> continuous fetching. This is obviously not practical.
>
> Or you can download raw data.

Downloading gigs and gigs of raw data and then processing it is generally
more impractical for end-users.

>> Is it worth a Toolserver user's time to try to create a database of
>> per-project, per-page page view statistics?
>
> Creating such database is easy, making it efficient is a bit different :-)

Any tips? :-) My thoughts were that the schema used by the GlobalUsage
extension might be reusable here (storing wiki, page namespace ID, page
namespace name, and page title).
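A sketch of the kind of table that idea suggests -- wiki, namespace, title, plus a day and a count -- written as DDL run from Python; this is a guess at a workable layout, not the GlobalUsage schema itself:

    import sqlite3

    SCHEMA = """
    CREATE TABLE IF NOT EXISTS page_views (
        pv_wiki       TEXT    NOT NULL,  -- e.g. 'enwiki'
        pv_namespace  INTEGER NOT NULL,  -- numeric namespace ID
        pv_title      TEXT    NOT NULL,  -- title without namespace prefix
        pv_day        TEXT    NOT NULL,  -- YYYY-MM-DD
        pv_views      INTEGER NOT NULL,
        PRIMARY KEY (pv_wiki, pv_namespace, pv_title, pv_day)
    );
    CREATE INDEX IF NOT EXISTS pv_day_views ON page_views (pv_wiki, pv_day, pv_views);
    """

    conn = sqlite3.connect("pageviews.db")
    conn.executescript(SCHEMA)  # sqlite3 stands in for the Toolserver's MySQL here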

>> And, of course, it wouldn't be a bad idea if Domas' first-pass implementation
>> was improved on Wikimedia's side, regardless.
>
> My implementation is for obtaining raw data from our squid tier, what is wrong
> with it? Generally I had ideas of making query-able data source - it isn't
> impossible given a decent mix of data structures ;-)

Well, more documentation is always a good thing. I'd start there.

As I recall, the system of determining which domain a request went to is a
bit esoteric, and it might be worth the cost to store the whole domain
name in order to cover edge cases (labs wikis, wikimediafoundation.org,
*.wikimedia.org, etc.).

There's some sort of distinction between projectcounts and pagecounts (again
with documentation) that could probably stand to be eliminated or
simplified.

But the biggest improvement would be post-processing (cleaning up) the
source files. Right now if there are anomalies in the data, every re-user is
expected to find and fix these on their own. It's _incredibly_ inefficient
for everyone to adjust the data (for encoding strangeness, for bad clients,
for data manipulation, for page existence possibly, etc.) rather than having
the source files come out cleaner.

I think your first-pass was great. But I also think it could be improved.
:-)

>> As someone with most of the skills and resources (with the exception of time,
>> possibly) to create a page view stats database, reading something like this
>> makes me think...
>
> Wow.

I meant that it wouldn't be very difficult to write a script to take the raw
data and put it into a public database on the Toolserver (which probably has
enough hardware resources for this project currently). It's maintainability
and sustainability that are the bigger concerns. Once you create a public
database for something like this, people will want it to stick around
indefinitely. That's quite a load to take on.

I'm also likely being incredibly naïve, though I did note somewhere that it
wouldn't be a particularly small undertaking to do this project well.

>> Yes, the data is susceptible to manipulation, both intentional and
>> unintentional.
>
> I wonder how someone with most of skills and resources wants to solve this
> problem (besides the aforementioned article-exists filter, which could reduce
> dataset quite a lot ;)

I'd actually say that having data for non-existent pages is a feature, not a
bug. There's potential there to catch future redirects and new pages, I
imagine.

>> ... you can begin doing real analysis work. Currently, this really isn't
>> possible, and that's a Bad Thing.
>
> Raw data allows you to do whatever analysis you want. Shove it into SPSS/R/..
> ;-) Statistics much?

A user wants to analyze a category with 100 members for the page view data
of each category member. You think it's a Good Thing that the user has to
first spend countless hours processing gigabytes of raw data in order to do
that analysis? It's a Very Bad Thing. And the people who are capable of
doing analysis aren't always the ones capable of writing the scripts and the
schemas necessary to get the data into a usable form.

>> The main bottleneck has been that, like MZMcBride mentions, an underlying
>> database of page view data is unavailable.
>
> Underlying database is available, just not in easily queryable format. There's
> a distinction there, unless you all imagine database as something you send SQL
> to and it gives you data. Sorted files are databases too ;-)

The reality is that a large pile of data that's not easily queryable is
directly equivalent to no data at all, for most users. Echoing what I said
earlier, it doesn't make much sense for people to be continually forced to
reinvent the wheel (post-processing raw data and putting it into a queryable
format).

MZMcBride



_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


westand at cis

Aug 12, 2011, 8:19 AM

Post #16 of 26 (3739 views)
Permalink
Re: State of page view stats [In reply to]

Note that to avoid too much traffic here, I've responded to MZMcBride
privately with my code. I'd be happy to share my code with others, and
include others in its discussion -- just contact me/us privately.

Thanks, -AW


On 08/12/2011 10:30 AM, MZMcBride wrote:
> Andrew G. West wrote:
>> I've actually been parsing the raw data from
>> [http://dammit.lt/wikistats/] daily into a MySQL database for over a
>> year now. I also store statistics at hour-granularity, whereas
>> [stats.grok.se] stores them at day granularity, it seems.
>>
>> I only do this for en.wiki, and its certainly not efficient enough to
>> open up for public use. However, I'd be willing to chat and share code
>> with any interested developer. The strategy and schema are a bit
>> awkward, but it works, and requires on average ~2 hours processing to
>> store 24 hours worth of statistics.
>
> I'd certainly be interested in seeing the code and database schema you've
> written, if only as a point of reference and to learn from any
> bugs/issues/etc. that you've encountered along the way. Is it possible for
> you to post the code you're using somewhere?
>
> MZMcBride
>
>
>
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l [at] lists
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l

--
Andrew G. West, Doctoral Student
Dept. of Computer and Information Science
University of Pennsylvania, Philadelphia PA
Email: westand [at] cis
Website: http://www.cis.upenn.edu/~westand

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


midom.lists at gmail

Aug 12, 2011, 8:46 AM

Post #17 of 26 (3738 views)
Permalink
Re: State of page view stats [In reply to]

> Downloading gigs and gigs of raw data and then processing it is generally
> more impractical for end-users.

You were talking about 3.7M articles. :) It is way more practical than working with pointwise APIs though :-)

> Any tips? :-) My thoughts were that the schema used by the GlobalUsage
> extension might be reusable here (storing wiki, page namespace ID, page
> namespace name, and page title).

I don't know what GlobalUsage does, but probably it is all wrong ;-)

> As I recall, the system of determining which domain a request went to is a
> bit esoteric and it might be the worth the cost to store the whole domain
> name in order to cover edge cases (labs wikis, wikimediafoundation.org,
> *.wikimedia.org, etc.).

*shrug*, maybe. If I ran a second pass I'd aim for a cache-oblivious system with compressed data both on-disk and in-cache (currently it is a b-tree with standard b-tree costs).
Then we could actually store more data ;-) Do note, there're _lots_ of data items, and increasing per-item cost may quadruple resource usage ;-)

Otoh, expanding project names is straightforward, if you know how.

> There's some sort of distinction between projectcounts and pagecounts (again
> with documentation) that could probably stand to be eliminated or
> simplified.

projectcounts are aggregated by project, pagecounts are aggregated by page. If you looked at data it should be obvious ;-)
And yes, probably best documentation was in some email somewhere. I should've started a decent project with descriptions and support and whatever.
Maybe once we move data distribution back into WMF proper, there's no need for it to live nowadays somewhere in Germany.

> But the biggest improvement would be post-processing (cleaning up) the
> source files. Right now if there are anomalies in the data, every re-user is
> expected to find and fix these on their own. It's _incredibly_ inefficient
> for everyone to adjust the data (for encoding strangeness, for bad clients,
> for data manipulation, for page existence possibly, etc.) rather than having
> the source files come out cleaner.

Raw data is fascinating in that regard though - one can see what are bad clients, what are anomalies, how they encode titles, what are erroneous titles, etc.
There're zillions of ways to do post-processing, and none of these will match all needs of every user.

> I think your first-pass was great. But I also think it could be improved.
> :-)

Sure, it can be improved in many ways, including more data (some people ask (page,geography) aggregations, though with our long tail that is huuuuuge dataset growth ;-)

> I meant that it wouldn't be very difficult to write a script to take the raw
> data and put it into a public database on the Toolserver (which probably has
> enough hardware resources for this project currently).

I doubt Toolserver has enough resources to have this data thrown at it and queried more, unless you simplify needs a lot.
There's 5G raw uncompressed data per day in text form, and long tail makes caching quite painful, unless you go for cache oblivious methods.

> It's maintainability
> and sustainability that are the bigger concerns. Once you create a public
> database for something like this, people will want it to stick around
> indefinitely. That's quite a load to take on.

I'd love to see that all the data is preserved infinitely. It is one of most interesting datasets around, and its value for the future is quite incredible.

> I'm also likely being incredibly naïve, though I did note somewhere that it
> wouldn't be a particularly small undertaking to do this project well.

Well, initial work took few hours ;-) I guess by spending few more hours we could improve that, if we really knew what we want.

> I'd actually say that having data for non-existent pages is a feature, not a
> bug. There's potential there to catch future redirects and new pages, I
> imagine.

That is one of reasons we don't eliminate that data now from raw dataset. I don't see it as a bug, I just see that for long-term aggregations that data could be omitted.

> A user wants to analyze a category with 100 members for the page view data
> of each category member. You think it's a Good Thing that the user has to
> first spend countless hours processing gigabytes of raw data in order to do
> that analysis? It's a Very Bad Thing. And the people who are capable of
> doing analysis aren't always the ones capable of writing the scripts and the
> schemas necessary to get the data into a usable form.

No, I think we should have API to that data to fetch small sets of data without much pain.

> The reality is that a large pile of data that's not easily queryable is
> directly equivalent to no data at all, for most users. Echoing what I said
> earlier, it doesn't make much sense for people to be continually forced to
> reinvent the wheel (post-processing raw data and putting it into a queryable
> format).

I agree. By opening up the dataset I expected others to build upon that and create services.
Apparently that doesn't happen. As lots of people use the data, I guess there is need for it, but not enough will to build anything for others to use, so it will end up being created in WMF proper.

Building a service where data would be shown on every article is relatively different task from just analytical workload support.
For now, building query-able service has been on my todo list, but there were too many initiatives around that suggested that someone else will do that ;-)

Domas



_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


ian at wikimedia

Aug 12, 2011, 12:03 PM

Post #18 of 26 (3890 views)
Permalink
Re: State of page view stats [In reply to]

Hey, Domas! Firstly, sorry to confuse you with Dario earlier. I am so very
bad with names. :)

Secondly, thank you for putting together the data we have today. I'm not
sure if anyone's mentioned it lately, but it's clearly a really useful
thing. I think that's why we're having this conversation now: what's been
learned about potential use cases, and how can we make this excellent
resource even more valuable?

> Any tips? :-) My thoughts were that the schema used by the GlobalUsage
> > extension might be reusable here (storing wiki, page namespace ID, page
> > namespace name, and page title).
>
> I don't know what GlobalUsage does, but probably it is all wrong ;-)


Here's an excerpt from the README:

"When using a shared image repository, it is impossible to see within
MediaWiki
whether a file is used on one of the slave wikis. On Wikimedia this is
handled
by the CheckUsage tool on the toolserver, but it is merely a hack of
function
that should be built in.

"GlobalUsage creates a new table globalimagelinks, which is basically the
same
as imagelinks, but includes the usage of all images on all associated
wikis."


The database table itself is about what you'd imagine. It's approximately
the metadata we'd need to uniquely identify an article, but it seems to be
solving a rather different problem. Uniquely identifying an article is
certainly necessary, but I don't think it's the hard part.

I'm not sure that MySQL is the place to store this data -- it's big and has
few dimensions. Since we'd have to make external queries available through
an API anyway, why not back it with the right storage engine?

[...]

> projectcounts are aggregated by project, pagecounts are aggregated by page.
> If you looked at data it should be obvious ;-)
> And yes, probably best documentation was in some email somewhere. I
> should've started a decent project with descriptions and support and
> whatever.
> Maybe once we move data distribution back into WMF proper, there's no need
> for it to live nowadays somewhere in Germany.


The documentation needed here seems pretty straightforward. Like, a file
at http://dammit.lt/wikistats/README that just explains the format of the
data, what's included, and what's not. We've covered most of it in this
thread already. All that's left is a basic explanation of what each field
means in pagecounts/projectcounts. If you tell me these things, I'll even
write it. :)


> > But the biggest improvement would be post-processing (cleaning up) the
> > source files. Right now if there are anomalies in the data, every re-user
> is
> > expected to find and fix these on their own. It's _incredibly_
> inefficient
> > for everyone to adjust the data (for encoding strangeness, for bad
> clients,
> > for data manipulation, for page existence possibly, etc.) rather than
> having
> > the source files come out cleaner.
>
> Raw data is fascinating in that regard though - one can see what are bad
> clients, what are anomalies, how they encode titles, what are erroneous
> titles, etc.
> There're zillions of ways to do post-processing, and none of these will
> match all needs of every user.


Oh, totally! However, I think some uses are more common than others. I bet
this covers them:

1. View counts for a subset of existing articles over a range of dates.
2. Sorted/limited aggregate stats (top 100, bottom 50, etc) for a subset of
articles and date range.
3. Most popular non-existing (missing) articles for a project.

I feel like making those things easier would be awesome, and raw data would
still be available for anyone who wants to build something else. I think
Domas's dataset is great, and the above should be based on it.
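For use case 1, an illustrative query against the kind of per-day table sketched earlier in the thread (table and column names are hypothetical):

    import sqlite3

    def views_for_articles(db, wiki, titles, start_day, end_day):
        # Per-day view counts for a set of articles over a date range (use case 1).
        placeholders = ",".join("?" * len(titles))
        sql = ("SELECT pv_title, pv_day, pv_views FROM page_views "
               "WHERE pv_wiki = ? AND pv_title IN (%s) AND pv_day BETWEEN ? AND ? "
               "ORDER BY pv_title, pv_day" % placeholders)
        return db.execute(sql, [wiki, *titles, start_day, end_day]).fetchall()

    db = sqlite3.connect("pageviews.db")
    rows = views_for_articles(db, "enwiki", ["Julian_Assange", "Dracula"],
                              "2011-07-01", "2011-07-31")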

Sure, it can be improved in many ways, including more data (some people ask
> (page,geography) aggregations, though with our long tail that is huuuuuge
> dataset growth ;-)


Absolutely. I think it makes sense to start by making the existing data
more usable, and then potentially add more to it in the future.


> > I meant that it wouldn't be very difficult to write a script to take the
> raw
> > data and put it into a public database on the Toolserver (which probably
> has
> > enough hardware resources for this project currently).
>
> I doubt Toolserver has enough resources to have this data thrown at it and
> queried more, unless you simplify needs a lot.
> There's 5G raw uncompressed data per day in text form, and long tail makes
> caching quite painful, unless you go for cache oblivious methods.


Yeah. The folks at trendingtopics.org are processing it all on an EC2
Hadoop cluster, and throwing the results in a SQL database. They have a
very specific focus, though, so their methods might not be appropriate here.
They're an excellent example of someone using the existing dataset in an
interesting way, but the fact that they're using EC2 is telling: many people
do not have the expertise to handle that sort of thing.

I think building an efficiently queryable set of all historic data is
unrealistic without a separate cluster. We're talking 100GB/year, before
indexing, which is about 400GB if we go back to 2008. I can imagine a
workable solution that discards resolution as time passes, which is what
most web stats generation packages do anyway. Here's an example:

Daily counts (and maybe hour of day averages) going back one month (~10GB)
Weekly counts, day of week and hour of day averages going back six months (~10GB)
Monthly stats (including averages) forever (~4GB/year)

That data could be kept in RAM, hashed across two machines, if we really
wanted it to be fast. That's probably not necessary, but you get my point.
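A sketch of that kind of roll-up, collapsing daily counts into monthly totals once they age out of the detailed window (thresholds and storage are placeholders):

    from collections import defaultdict
    from datetime import date

    def roll_up_to_monthly(daily, keep_daily_after):
        # daily: dict mapping (title, datetime.date) -> views.
        # Returns (remaining_daily, monthly), where monthly maps
        # (title, 'YYYY-MM') -> views for everything older than the cutoff.
        remaining, monthly = {}, defaultdict(int)
        for (title, day), views in daily.items():
            if day >= keep_daily_after:
                remaining[(title, day)] = views
            else:
                monthly[(title, day.strftime("%Y-%m"))] += views
        return remaining, dict(monthly)

    daily = {("Dracula", date(2011, 6, 1)): 400, ("Dracula", date(2011, 6, 2)): 350,
             ("Dracula", date(2011, 8, 1)): 500}
    print(roll_up_to_monthly(daily, keep_daily_after=date(2011, 7, 12)))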


> > It's maintainability
> > and sustainability that are the bigger concerns. Once you create a public
> > database for something like this, people will want it to stick around
> > indefinitely. That's quite a load to take on.
>
> I'd love to see that all the data is preserved infinitely. It is one of
> most interesting datasets around, and its value for the future is quite
> incredible.


Agreed. 100GB a year is not a lot of data to *store* (especially if it's
compressed). It's just a lot to interactively query.


> > I'm also likely being incredibly naïve, though I did note somewhere that
> it
> > wouldn't be a particularly small undertaking to do this project well.
>
> Well, initial work took few hours ;-) I guess by spending few more hours we
> could improve that, if we really knew what we want.


I think we're in a position to decide what we want.

Honestly, the investigation I've done while participating in this thread
suggests that I can probably get what I want from the raw data. I'll just
pull each day into an in-memory hash, update a database table, and move to
the next day. It'll be slower than if the data was already hanging out in
some hashed format (like Berkeley DB), but whatever. However, I need data
for all articles, which is different from most use cases I think.

I'd like to assemble some examples of projects that need better data, so we
know what it makes sense to build--what seems nice to have and what's
actually useful is so often different.

I agree. By opening up the dataset I expected others to build upon that and
> create services.
> Apparently that doesn't happen. As lots of people use the data, I guess
> there is need for it, but not enough will to build anything for others to
> use, so it will end up being created in WMF proper.
>

Yeah. I think it's just a tough problem to solve for an outside
contributor. It's hard to get around the need for hardware (which in turn
must be managed and maintained).


> Building a service where data would be shown on every article is relatively
> different task from just analytical workload support.
>

Yep, however it depends entirely on the same data. It's really just another
post-processing step.

-Ian
_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


ian at wikimedia

Aug 12, 2011, 12:12 PM

Post #19 of 26 (3737 views)
Permalink
Re: State of page view stats [In reply to]

>
>
> I think building an efficiently queryable set of all historic data is
> unrealistic without a separate cluster. We're talking 100GB/year, before
> indexing, which is about 400GB if we go back to 2008.
>
[etc]

So, these numbers were based on my incorrect assumption that the data I was
looking at was daily, but it's actually hourly. So, I guess, multiply
everything by 24, and then disregard some of what I said there?

-Ian
_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


z at mzmcbride

Aug 12, 2011, 12:21 PM

Post #20 of 26 (3735 views)
Permalink
Re: State of page view stats [In reply to]

Domas Mituzas wrote:
>> Any tips? :-) My thoughts were that the schema used by the GlobalUsage
>> extension might be reusable here (storing wiki, page namespace ID, page
>> namespace name, and page title).
>
> I don't know what GlobalUsage does, but probably it is all wrong ;-)

GlobalUsage tracks file uses across a wiki family. Its schema is available here:
<http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/GlobalUsage/GlobalUsage.sql?view=log>.

>> But the biggest improvement would be post-processing (cleaning up) the
>> source files. Right now if there are anomalies in the data, every re-user is
>> expected to find and fix these on their own. It's _incredibly_ inefficient
>> for everyone to adjust the data (for encoding strangeness, for bad clients,
>> for data manipulation, for page existence possibly, etc.) rather than having
>> the source files come out cleaner.
>
> Raw data is fascinating in that regard though - one can see what the bad
> clients are, what the anomalies are, how they encode titles, what the
> erroneous titles are, etc.
> There are zillions of ways to do post-processing, and none of these will
> match all the needs of every user.

Yes, so providing raw data alongside cleaner data or alongside SQL table
dumps (similar to the current dumps for MediaWiki tables) might make more
sense here.

> I'd love to see all the data preserved indefinitely. It is one of the most
> interesting datasets around, and its value for the future is quite incredible.

Nemo has done some work to put the files on the Internet Archive, I think.

>> The reality is that a large pile of data that's not easily queryable is
>> directly equivalent to no data at all, for most users. Echoing what I said
>> earlier, it doesn't make much sense for people to be continually forced to
>> reinvent the wheel (post-processing raw data and putting it into a queryable
>> format).
>
> I agree. By opening up the dataset I expected others to build upon that and
> create services.
> Apparently that doesn't happen. As lots of people use the data, I guess there
> is a need for it, but not enough will to build anything for others to use, so
> it will end up being created in WMF proper.
>
> Building a service where data would be shown on every article is a relatively
> different task from just analytical workload support.
> For now, building a query-able service has been on my todo list, but there
> were too many initiatives around that suggested someone else would do that ;-)

Yes, beyond Henrik's site, there really isn't much. It would probably help
if Wikimedia stopped engaging in so much cookie-licking. That was part of
the purpose of this thread: to clarify what Wikimedia is actually planning
to invest in this endeavor.

Thank you for the detailed replies, Domas. :-)

MZMcBride
_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


erikzachte at infodisiac

Aug 12, 2011, 2:53 PM

Post #22 of 26 (3764 views)
Permalink
State of page view stats [In reply to]

[Resending as plain text]

I maintain a compacted monthly version of the dammit.lt page view stats,
starting with Jan 2010 (not an official WMF project).
This is to preserve our page view counts for future historians (compare the
Twitter archive at the Library of Congress).
It could also be used to resurrect
http://wikistics.falsikon.de/latest/wikipedia/en/ which was very popular.
Alas, the author vanished and does not reply to requests, and we don't have
the source code.

I just applied for storage on dataset1 or ..2, will publish the monthly <
2Gb files asap.

Each day I download the 24 hourly dammit.lt files and compact these into one
file per day.
Each month I compact the daily files into one monthly file.

Major space saving: a monthly file with all hourly page views is 8 GB
(compressed); with only articles that have 5+ page views per month, it is
even less than 2 GB.

This is because each page title occurs once instead of up to 24*31 times,
and the 'bytes sent' field is omitted.
All hourly counts are preserved, each prefixed by day number and hour number.

Here are the first lines of one such file which also describes the format:

Erik Zachte (on wikibreak till Sep 12)



# Wikimedia article requests (aka page views) for year 2010, month 11
#
# Each line contains four fields separated by spaces
# - wiki code (subproject.project, see below)
# - article title (encoding from original hourly files is preserved to maintain proper sort sequence)
# - monthly total (possibly extrapolated from available data when hours/days in input were missing)
# - hourly counts (only for hours where indeed article requests occurred)
#
# Subproject is language code, followed by project code
# Project is b:wikibooks, k:wiktionary, n:wikinews, q:wikiquote, s:wikisource, v:wikiversity, z:wikipedia
# Note: suffix z added by compression script: project wikipedia happens to be sorted last in dammit.lt files, so add this suffix to fix sort order
#
# To keep hourly counts compact and tidy both day and hour are coded as one character each, as follows:
# Hour 0..23 shown as A..X                            convert to number: ordinal (char) - ordinal ('A')
# Day  1..31 shown as A.._  27=[ 28=\ 29=] 30=^ 31=_  convert to number: ordinal (char) - ordinal ('A') + 1
#
# Original data source: Wikimedia full (=unsampled) squid logs
# These data have been aggregated from hourly pagecount files at http://dammit.lt/wikistats, originally produced by Domas Mituzas
# Daily and monthly aggregator script built by Erik Zachte
# Each day hourly files for previous day are downloaded and merged into one file per day
# Each month daily files are merged into one file per month
#
# This file contains only lines with monthly page request total greater/equal 5
#
# Data for all hours of each day were available in input
#
aa.b File:Broom_icon.svg 6 AV1,IQ1,OT1,QB1,YT1,^K1
aa.b File:Wikimedia.png 7 BO1,BW1,CE1,EV1,LA1,TA1,^A1
aa.b File:Wikipedia-logo-de.png 5 BO1,CE1,EV1,LA1,TA1
aa.b File:Wikiversity-logo.png 7 AB1,BO1,CE1,EV1,LA1,TA1,[.C1
aa.b File:Wiktionary-logo-de.png 5 CE1,CM1,EV1,TA1,^N1
aa.b File_talk:Commons-logo.svg 9 CE3,UO3,YE3
aa.b File_talk:Incubator-notext.svg 60 CH3,CL3,DB3,DG3,ET3,FH3,GM3,GO3,IA3,JQ3,KT3,LK3,LL3,MH3,OO3,PF3,XO3,[F3,[O3,]P3
aa.b MediaWiki:Ipb_cant_unblock 5 BO1,JL1,XX1,[F2
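
(As an aside, not part of the file itself: decoding the compact hourly field
back into numbers only takes a few lines -- here is a Python sketch, assuming
the day/hour coding described in the header above.)

def decode_hourly_counts(field):
    # Turn a compact field like 'AV1,IQ1,^K1' into (day, hour, count) tuples,
    # following the day/hour letter coding from the file header.
    # Anomalous tokens in the raw data (e.g. '[.C1' above) are skipped.
    decoded = []
    for token in field.split(','):
        day = ord(token[0]) - ord('A') + 1   # 'A' = day 1 ... '_' = day 31
        hour = ord(token[1]) - ord('A')      # 'A' = hour 0 ... 'X' = hour 23
        try:
            decoded.append((day, hour, int(token[2:])))
        except ValueError:
            continue
    return decoded

# decode_hourly_counts('AV1,IQ1,OT1,QB1,YT1,^K1')
# -> [(1, 21, 1), (9, 16, 1), (15, 19, 1), (17, 1, 1), (25, 19, 1), (30, 10, 1)]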









_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


erikzachte at infodisiac

Aug 14, 2011, 5:58 AM

Post #23 of 26 (3717 views)
Permalink
State of page view stats [In reply to]

Page view archives are now online at
http://dumps.wikimedia.org/other/pagecounts-ez/monthly/

The archives contain a description of the format (also in the previous post).

Erik Zachte




_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


robla at wikimedia

Aug 14, 2011, 10:39 PM

Post #24 of 26 (3719 views)
Permalink
Re: State of page view stats [In reply to]

On Fri, Aug 12, 2011 at 12:21 PM, MZMcBride <z [at] mzmcbride> wrote:
> Domas Mituzas wrote:
>> Building a service where data would be shown on every article is a relatively
>> different task from just analytical workload support.
>> For now, building a query-able service has been on my todo list, but there
>> were too many initiatives around that suggested someone else would do that ;-)
>
> Yes, beyond Henrik's site, there really isn't much. It would probably help
> if Wikimedia stopped engaging in so much cookie-licking. That was part of
> the purpose of this thread: to clarify what Wikimedia is actually planning
> to invest in this endeavor.

If the question is "am I wasting my time if I work on this?", the
answer is "almost certainly not", so please embark. It will almost
certainly be valuable no matter what you do.

Now, the caveat on that is this: if you ask "will I feel like I've
wasted my time?", the answer is more ambiguous, because I don't know
what you expect. Even a proof of concept is valuable, but you
probably don't want to write a mere proof of concept. So, if you want
to increase the odds that your work will be more than a proof of
concept, then there's more overhead.

Here's an extremely likely version of the future should you decide to
do something here and you're successful in building something: you'll
do something that gets a following. WMF hires a couple of engineers
and starts working on the system. The two systems are
complementary, and both end up having their own followings for
different reasons. While it's likely that some future WMF system will
eventually be capable of this, getting granular per-page statistics is
something that hasn't been at the top of the priority list. In one
"wasted time" scenario, we figure out that it wouldn't be *that* hard
to do the same thing with the data we have, and we figure out how to
provide an alternative. However, I suspect that day probably gets
postponed because there would be some other system providing that
function.

With any luck, if you build something, it will be in a state where we
can actually work together on it at some point after the people we plan
to hire are on board. The more review you get from other people who
understand the Wikimedia cluster, the more likely that case is.

Here's an extremely likely version of the future should you decide not
to do something here: we won't build something like what you have in
mind. So, the best way to guarantee that what you want will exist is to
build it.

Re: cookie licking. That's a side effect of planning in the open. If
we wait until we're sure a project is going to be successfully
completed before we talk about it, we either won't be as open as we
should be, or won't be taking the risks we should be, or both.

Rob

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


p858snake at gmail

Aug 14, 2011, 10:57 PM

Post #25 of 26 (3718 views)
Permalink
Re: State of page view stats [In reply to]

On Sat, Aug 13, 2011 at 1:19 AM, Andrew G. West <westand [at] cis> wrote:
> Note that to avoid too much traffic here, I've responded to MZMcBride
> privately with my code. I'd be happy to share my code with others, and
> include others in its discussion -- just contact me/us privately.
>
> Thanks, -AW
Depending on what license and type of release you want for your code,
you should consider putting it up on our SVN. If you don't have commit
access, you can read this page
<http://www.mediawiki.org/wiki/Commit_access> for more information if
you would like to consider that route.

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
