rlane32 at gmail
Apr 23, 2012, 11:06 AM
Re: Request for Comments: Cross site data access for Wikidata

Why not just use a queue? We use the job queue for this right now for
nearly the same purpose. The job queue isn't amazing, but it works.
Maybe someone should replace this with a better system while they are
at it.
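A minimal sketch of the queue-based approach described above: when an item changes, enqueue one push job per subscribing wiki and let workers deliver them asynchronously, much like MediaWiki's job queue does. All names here (`job_queue`, `on_item_changed`, the job dict fields) are illustrative assumptions, not real MediaWiki APIs.

```python
import queue

# Stand-in for a persistent job queue; MediaWiki's is database-backed.
job_queue = queue.Queue()

def on_item_changed(item_id, subscribers):
    """Queue one push job per subscriber instead of pushing over HTTP inline."""
    for wiki in subscribers:
        job_queue.put({"type": "wikidata-push", "item": item_id, "target": wiki})

# Example: one change fans out into one job per subscribing wiki.
on_item_changed("Q42", ["enwiki", "dewiki"])
```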
On Mon, Apr 23, 2012 at 5:45 AM, Daniel Kinzler <daniel [at] brightbyte> wrote:
> Hi all!
> The wikidata team has been discussing how to best make data from wikidata
> available on local wikis. Fetching the data via HTTP whenever a page is
> re-rendered doesn't seem prudent, so we (mainly Jeroen) came up with a
> push-based architecture.
> The proposal is at
> I have copied it below too.
> Please have a look and let us know if you think this is viable, and which of the
> two variants you deem better!
> -- daniel
> PS: Please keep the discussion on wikitech-l, so we have it all in one place.
> == Proposal: HTTP push to local db storage ==
> * Every time an item on Wikidata is changed, an HTTP push is issued to all
> subscribing clients (wikis)
> ** initially, "subscriptions" are just entries in an array in the configuration.
> ** Pushes can be done via the job queue.
> ** pushing is done via the mediawiki API, but other protocols such as
> PubSubHubbub / AtomPub can easily be added to support 3rd parties.
> ** pushes need to be authenticated, so we don't get malicious crap. Pushes
> should be done using a special user with a special user right.
> ** the push may contain either the full set of information for the item, or just
> a delta (diff) + hash for integrity check (in case an update was missed).
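The delta-plus-hash idea above could look roughly like this. This is a sketch under assumptions, not the proposed wire format: the payload field names, the use of SHA-1 over a canonical JSON dump, and the item structure are all made up for illustration.

```python
import hashlib
import json

def make_push_payload(item_id, old_item, new_item):
    """Build a push carrying only the changed keys (a delta) plus a hash of
    the full new state, so the receiver can detect a missed earlier update."""
    delta = {k: v for k, v in new_item.items() if old_item.get(k) != v}
    removed = [k for k in old_item if k not in new_item]
    full_hash = hashlib.sha1(
        json.dumps(new_item, sort_keys=True).encode("utf-8")
    ).hexdigest()
    return {"item": item_id, "delta": delta, "removed": removed, "hash": full_hash}

def verify_after_apply(local_item, payload):
    """Client-side integrity check: apply the delta to the local copy and
    compare the resulting hash. A mismatch means a push was missed."""
    merged = {**local_item, **payload["delta"]}
    for key in payload["removed"]:
        merged.pop(key, None)
    digest = hashlib.sha1(
        json.dumps(merged, sort_keys=True).encode("utf-8")
    ).hexdigest()
    return digest == payload["hash"]
```

If the client's base state is out of date, the hash check fails and the client knows to request a full re-fetch rather than silently keeping corrupt data.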
> * When the client receives a push, it does two things:
> *# write the fresh data into a local database table (the local wikidata cache)
> *# invalidate the (parser) cache for all pages that use the respective item (for
> now we can assume that we know this from the language links)
> *#* if we only update language links, the page doesn't even need to be
> re-parsed: we just update the languagelinks in the cached ParserOutput object.
> * when a page is rendered, interlanguage links and other info is taken from the
> local wikidata cache. No queries are made to wikidata during parsing/rendering.
> * In case an update is missed, we need a mechanism to allow requesting a full
> purge and re-fetch of all data on the client side, rather than just waiting until
> the next push, which might very well take a very long time to happen.
> ** There needs to be a manual option for when someone detects this. Maybe
> action=purge can be made to do this. Simple cache invalidation, however, shouldn't
> pull info from wikidata.
> ** A time-to-live could be added to the local copy of the data, so that it is
> refreshed by a periodic pull and does not stay stale
> indefinitely after a failed push.
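The client-side steps and the TTL fallback above can be sketched as follows, under assumptions: `local_cache` stands in for the local wikidata cache table, `page_touched` for parser-cache invalidation, and `CACHE_TTL` is an arbitrary example value. None of this is real MediaWiki code.

```python
import time

CACHE_TTL = 24 * 3600  # example: re-pull if no push has arrived for a day

local_cache = {}   # item_id -> {"data": ..., "fetched": timestamp}
page_touched = {}  # page title -> timestamp of last invalidation

def handle_push(item_id, data, pages_using_item):
    """Step 1: write the fresh data into the local cache. Step 2: invalidate
    the (parser) cache for all pages that use the item."""
    local_cache[item_id] = {"data": data, "fetched": time.time()}
    for page in pages_using_item:
        page_touched[page] = time.time()  # forces a re-render on next view

def is_stale(item_id, now=None):
    """TTL check: an entry older than CACHE_TTL (e.g. after a failed push)
    should be refreshed by a pull from wikidata."""
    entry = local_cache.get(item_id)
    if entry is None:
        return True
    now = time.time() if now is None else now
    return now - entry["fetched"] > CACHE_TTL
```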
> === Variation: shared database tables ===
> Instead of having a local wikidata cache on each wiki (which may grow big - a
> first guesstimate of Jeroen and Reedy is up to 1TB total, for all wikis), all
> client wikis could access the same central database table(s) managed by the
> wikidata wiki.
> * this is similar to the way the globalusage extension tracks the usage of
> commons images
> * whenever a page is re-rendered, the local wiki would query the table in the
> wikidata db. This means a cross-cluster db query whenever a page is rendered,
> instead of a local query.
> * the HTTP push mechanism described above would still be needed to purge the
> parser cache when needed. But the push requests would not need to contain the
> updated data, they may just be requests to purge the cache.
> * the ability for full HTTP pushes (using the mediawiki API or some other
> interface) would still be desirable for 3rd party integration.
> * This approach greatly lowers the amount of space used in the database
> * it doesn't change the number of http requests made
> ** it does however reduce the amount of data transferred via http (but not by
> much, at least not compared to pushing diffs)
> * it doesn't change the number of database requests, but it introduces
> cross-cluster requests
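The shared-table variation shifts the read to render time and strips the push down to a purge signal. A sketch under assumptions: `central_store` stands in for the table(s) on the wikidata cluster, `parser_cache` for the local wiki's parser cache, and all names are illustrative.

```python
# Central table managed by the wikidata wiki (shared by all client wikis).
central_store = {"Q42": {"enwiki": "Douglas Adams", "dewiki": "Douglas Adams"}}

# Local wiki's parser cache: page title -> rendered output.
parser_cache = {"Douglas Adams": "<cached html>"}

def get_language_links(item_id):
    """Render-time read: a cross-cluster query against the central table,
    replacing the local-cache lookup of the first variant."""
    return central_store.get(item_id, {})

def handle_purge_push(page):
    """The push carries no data in this variant; it only asks the client to
    purge its parser cache, so the next render re-queries the central table."""
    parser_cache.pop(page, None)
```

The trade-off noted above falls out directly: no per-wiki copy of the data, but every render of an affected page pays for a cross-cluster query.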
> Wikitech-l mailing list
> Wikitech-l [at] lists