ian at wikimedia
Aug 12, 2011, 12:03 PM
Post #18 of 26
Hey, Domas! Firstly, sorry to confuse you with Dario earlier. I am so very
bad with names. :)
Secondly, thank you for putting together the data we have today. I'm not
sure if anyone's mentioned it lately, but it's clearly a really useful
thing. I think that's why we're having this conversation now: what's been
learned about potential use cases, and how can we make this excellent
resource even more valuable?
> Any tips? :-) My thoughts were that the schema used by the GlobalUsage
> > extension might be reusable here (storing wiki, page namespace ID, page
> > namespace name, and page title).
> I don't know what GlobalUsage does, but probably it is all wrong ;-)
Here's an excerpt from the README:
"When using a shared image repository, it is impossible to see within a wiki
whether a file is used on one of the slave wikis. On Wikimedia this is handled
by the CheckUsage tool on the toolserver, but it is merely a hack of
functionality that should be built in.
"GlobalUsage creates a new table globalimagelinks, which is basically the same
as imagelinks, but includes the usage of all images on all associated wikis."
The database table itself is about what you'd imagine. It's approximately
the metadata we'd need to uniquely identify an article, but it seems to be
solving a rather different problem. Uniquely identifying an article is
certainly necessary, but I don't think it's the hard part.
I'm not sure that MySQL is the place to store this data--it's big and has
few dimensions. Since we'd have to make external queries available through
an API anyway, why not back it with the right storage engine?
> projectcounts are aggregated by project, pagecounts are aggregated by page.
> If you looked at data it should be obvious ;-)
> And yes, probably best documentation was in some email somewhere. I
> should've started a decent project with descriptions and support and
> Maybe once we move data distribution back into WMF proper, there's no need
> for it to live nowadays somewhere in Germany.
The documentation needed here seems pretty straightforward. Like, a file
at http://dammit.lt/wikistats/README that just explains the format of the
data, what's included, and what's not. We've covered most of it in this
thread already. All that's left is a basic explanation of what each field
means in pagecounts/projectcounts. If you tell me these things, I'll even
write it. :)
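For what it's worth, here's the format as I currently understand it, so there's
something concrete to correct: each pagecounts line seems to be
"project page_title request_count bytes_transferred", space-separated, with
titles URL-encoded. A throwaway parser sketch under that assumption:

```python
from urllib.parse import unquote

def parse_pagecounts_line(line):
    """Parse one line of a pagecounts file.

    Assumed layout (space-separated, title percent-encoded):
        project page_title request_count bytes_transferred
    e.g. "en Main_Page 42 1234567"
    """
    project, title, count, size = line.rstrip("\n").split(" ")
    return project, unquote(title), int(count), int(size)
```

If the real field meanings differ (that's exactly the README question), the
sketch is trivially adjusted.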
> > But the biggest improvement would be post-processing (cleaning up) the
> > source files. Right now if there are anomalies in the data, every re-user is
> > expected to find and fix these on their own. It's _incredibly_ wasteful
> > for everyone to adjust the data (for encoding strangeness, for bad titles,
> > for data manipulation, for page existence possibly, etc.) rather than having
> > the source files come out cleaner.
> Raw data is fascinating in that regard though - one can see what are bad
> clients, what are anomalies, how they encode titles, what are erroneous
> titles, etc.
> There're zillions of ways to do post-processing, and none of these will
> match all needs of every user.
Oh, totally! However, I think some uses are more common than others. I bet
this covers them:
1. View counts for a subset of existing articles over a range of dates.
2. Sorted/limited aggregate stats (top 100, bottom 50, etc) for a subset of
articles and date range.
3. Most popular non-existing (missing) articles for a project.
I feel like making those things easier would be awesome, and raw data would
still be available for anyone who wants to build something else. I think
Domas's dataset is great, and the above should be based on it.
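To make use case 2 concrete, here's a rough sketch of the kind of aggregation I
mean, assuming we've already parsed each day's file in the date range into a
{title: views} dict (the function and parameter names are just mine):

```python
from collections import Counter
from heapq import nlargest

def top_articles(daily_counts, articles, n=100):
    """Use case 2: sorted/limited aggregate stats over a date range.

    daily_counts: iterable of per-day {title: views} dicts,
                  one per day in the chosen range.
    articles: subset of titles to include (None means all).
    Returns the n most-viewed (title, total_views) pairs.
    """
    totals = Counter()
    for day in daily_counts:
        for title, views in day.items():
            if articles is None or title in articles:
                totals[title] += views
    return nlargest(n, totals.items(), key=lambda kv: kv[1])
```

Use case 1 is the same loop without the sort; use case 3 just needs the titles
joined against the page table to find the missing ones.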
> Sure, it can be improved in many ways, including more data (some people ask
> for (page,geography) aggregations, though with our long tail that is huuuuuge
> dataset growth ;-)
Absolutely. I think it makes sense to start by making the existing data
more usable, and then potentially add more to it in the future.
> > I meant that it wouldn't be very difficult to write a script to take the
> > data and put it into a public database on the Toolserver (which probably has
> > enough hardware resources for this project currently).
> I doubt Toolserver has enough resources to have this data thrown at it and
> queried more, unless you simplify needs a lot.
> There's 5G raw uncompressed data per day in text form, and long tail makes
> caching quite painful, unless you go for cache oblivious methods.
Yeah. The folks at trendingtopics.org are processing it all on an EC2
Hadoop cluster, and throwing the results in a SQL database. They have a
very specific focus, though, so their methods might not be appropriate here.
They're an excellent example of someone using the existing dataset in an
interesting way, but the fact that they're using EC2 is telling: many people
do not have the expertise to handle that sort of thing.
I think building an efficiently queryable set of all historic data is
unrealistic without a separate cluster. We're talking 100GB/year, before
indexing, which is about 400GB if we go back to 2008. I can imagine a
workable solution that discards resolution as time passes, which is what
most web stats generation packages do anyway. Here's an example:
Daily counts (and maybe hour of day averages) going back one month (~10GB)
Weekly counts, day of week and hour of day averages going back six months
Monthly stats (including averages) forever (~4GB/year)
That data could be kept in RAM, hashed across two machines, if we really
wanted it to be fast. That's probably not necessary, but you get my point.
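To illustrate, here's a toy sketch of that resolution-decay policy; the tier
names and horizons are just the ones from my example above, not a proposal set
in stone:

```python
from datetime import date, timedelta

# Hypothetical retention tiers mirroring the example above.
RETENTION = [
    ("daily",   timedelta(days=30)),   # full daily counts for ~one month
    ("weekly",  timedelta(days=182)),  # weekly rollups for ~six months
    ("monthly", None),                 # monthly rollups kept forever
]

def tier_for(day, today):
    """Pick the aggregation tier a given day's data belongs in,
    based on how old it is."""
    age = today - day
    for name, horizon in RETENTION:
        if horizon is None or age <= horizon:
            return name
```

A nightly job would then fold anything crossing a tier boundary into the
coarser aggregate and drop the finer rows.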
> > It's maintainability
> > and sustainability that are the bigger concerns. Once you create a public
> > database for something like this, people will want it to stick around
> > indefinitely. That's quite a load to take on.
> I'd love to see that all the data is preserved infinitely. It is one of the
> most interesting datasets around, and its value for the future is quite high.
Agreed. 100GB a year is not a lot of data to *store* (especially if it's
compressed). It's just a lot to interactively query.
> > I'm also likely being incredibly naïve, though I did note somewhere that it
> > wouldn't be a particularly small undertaking to do this project well.
> Well, initial work took a few hours ;-) I guess by spending a few more hours
> we could improve that, if we really knew what we want.
> could improve that, if we really knew what we want.
I think we're in a position to decide what we want.
Honestly, the investigation I've done while participating in this thread
suggests that I can probably get what I want from the raw data. I'll just
pull each day into an in-memory hash, update a database table, and move to
the next day. It'll be slower than if the data was already hanging out in
some hashed format (like Berkeley DB), but whatever. However, I need data
for all articles, which is different from most use cases I think.
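Roughly what I have in mind, with sqlite3 standing in for whatever database
would actually back this (the table name is made up):

```python
import sqlite3
from collections import defaultdict

def ingest_day(pagecounts_lines, db):
    """One day's pass: accumulate counts in an in-memory hash,
    then flush them to a database table.

    `db` is a sqlite3 connection as a stand-in; the `views`
    table is hypothetical.
    """
    counts = defaultdict(int)
    for line in pagecounts_lines:
        project, title, views, _size = line.split(" ")
        counts[(project, title)] += int(views)

    db.execute("CREATE TABLE IF NOT EXISTS views "
               "(project TEXT, title TEXT, views INTEGER)")
    db.executemany("INSERT INTO views VALUES (?, ?, ?)",
                   ((p, t, v) for (p, t), v in counts.items()))
    db.commit()
```

Slow, but each day is independent, so it's trivially restartable and
parallelizable by date.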
I'd like to assemble some examples of projects that need better data, so we
know what it makes sense to build--what seems nice to have and what's
actually useful are so often different.
> I agree. By opening up the dataset I expected others to build upon that and
> create services.
> Apparently that doesn't happen. As lots of people use the data, I guess
> there is need for it, but not enough will to build anything for others to
> use, so it will end up being created in WMF proper.
Yeah. I think it's just a tough problem to solve for an outside
contributor. It's hard to get around the need for hardware (which in turn
must be managed and maintained).
> Building a service where data would be shown on every article is a relatively
> different task from just analytical workload support.
Yep, however it depends entirely on the same data. It's really just another
consumer of it.
Wikitech-l mailing list
Wikitech-l [at] lists