benapetr at gmail
Apr 23, 2012, 7:09 AM
Re: Request for Comments: Cross site data access for Wikidata

I mean, in simple words:
Your idea: when the data on wikidata is changed, the new content is
pushed to all local wikis / somewhere.
My idea: local wikis retrieve data from the wikidata db directly, so
there is no need to push anything on change.
On Mon, Apr 23, 2012 at 4:07 PM, Petr Bena <benapetr [at] gmail> wrote:
> I think it would be much better if the local wikis that are supposed to
> access this data had some sort of client extension that allowed them to
> render the content using the wikidata db. That would be much simpler
> and faster.
> On Mon, Apr 23, 2012 at 2:45 PM, Daniel Kinzler <daniel [at] brightbyte> wrote:
>> Hi all!
>> The wikidata team has been discussing how to best make data from wikidata
>> available on local wikis. Fetching the data via HTTP whenever a page is
>> re-rendered doesn't seem prudent, so we (mainly Jeroen) came up with a
>> push-based architecture.
>> The proposal is at
>> I have copied it below too.
>> Please have a look and let us know if you think this is viable, and which of the
>> two variants you deem better!
>> -- daniel
>> PS: Please keep the discussion on wikitech-l, so we have it all in one place.
>> == Proposal: HTTP push to local db storage ==
>> * Every time an item on Wikidata is changed, an HTTP push is issued to all
>> subscribing clients (wikis)
>> ** initially, "subscriptions" are just entries in an array in the configuration.
>> ** Pushes can be done via the job queue.
>> ** pushing is done via the mediawiki API, but other protocols such as
>> PubSubHubbub / AtomPub can easily be added to support 3rd parties.
>> ** pushes need to be authenticated, so we don't get malicious crap. Pushes
>> should be done using a special user with a special user right.
>> ** the push may contain either the full set of information for the item, or just
>> a delta (diff) + hash for integrity check (in case an update was missed).
>> * When the client receives a push, it does two things (see the sketch after this list):
>> *# write the fresh data into a local database table (the local wikidata cache)
>> *# invalidate the (parser) cache for all pages that use the respective item (for
>> now we can assume that we know this from the language links)
>> *#* if we only update language links, the page doesn't even need to be
>> re-parsed: we just update the languagelinks in the cached ParserOutput object.
>> * when a page is rendered, interlanguage links and other info are taken from the
>> local wikidata cache. No queries are made to wikidata during parsing/rendering.
>> * In case an update is missed, we need a mechanism that allows requesting a full
>> purge and re-fetch of all data from wikidata on the client side, rather than just
>> waiting for the next push, which might very well take a very long time to happen.
>> ** There needs to be a manual option for when someone detects this. Maybe
>> action=purge can be made to do this. Simple cache invalidation, however, shouldn't
>> pull info from wikidata.
>> ** A time-to-live could be added to the local copy of the data so that it is
>> periodically refreshed by a pull and does not stay stale indefinitely after a
>> failed push.
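>> A rough sketch of the receiving side, in Python-style pseudocode (helper
>> names such as pages_using_item and fetch_full_item are invented here; the
>> real thing would be a MediaWiki API module):
>>
>>     import hashlib, json
>>
>>     def item_hash(data):
>>         # stable hash of the full item data, used for the integrity check
>>         blob = json.dumps(data, sort_keys=True).encode()
>>         return hashlib.sha1(blob).hexdigest()
>>
>>     def handle_push(payload, local_cache, pages_using_item,
>>                     parser_cache, fetch_full_item):
>>         # authentication (special user with a special user right) is assumed
>>         # to have happened before this point
>>         item_id = payload["id"]
>>         if "delta" in payload:
>>             # delta push: apply the diff (shown here as a shallow merge),
>>             # then verify the result against the expected hash
>>             data = dict(local_cache.get(item_id, {}), **payload["delta"])
>>             if item_hash(data) != payload["hash"]:
>>                 data = fetch_full_item(item_id)  # an earlier push was missed: re-pull
>>         else:
>>             data = payload["data"]               # full push
>>         local_cache[item_id] = data              # 1. write to the local wikidata cache
>>         for page in pages_using_item(item_id):   # 2. invalidate the parser cache of
>>             parser_cache.invalidate(page)        #    pages that use this item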
>> === Variation: shared database tables ===
>> Instead of having a local wikidata cache on each wiki (which may grow big - a
>> first guesstimate of Jeroen and Reedy is up to 1TB total, for all wikis), all
>> client wikis could access the same central database table(s) managed by the
>> wikidata wiki.
>> * this is similar to the way the globalusage extension tracks the usage of
>> commons images
>> * whenever a page is re-rendered, the local wiki would query the table in the
>> wikidata db. This means a cross-cluster db query whenever a page is rendered,
>> instead of a local query (see the sketch after this list).
>> * the HTTP push mechanism described above would still be needed to purge the
>> parser cache when needed. But the push requests would not need to contain the
>> updated data, they may just be requests to purge the cache.
>> * the ability for full HTTP pushes (using the mediawiki API or some other
>> interface) would still be desirable for 3rd party integration.
>> * This approach greatly lowers the amount of space used in the database
>> * it doesn't change the number of http requests made
>> ** it does however reduce the amount of data transferred via http (but not by
>> much, at least not compared to pushing diffs)
>> * it doesn't change the number of database requests, but it introduces
>> cross-cluster requests
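>> A similarly rough sketch of this variation (table and column names are
>> invented for illustration):
>>
>>     def get_language_links(item_id, wikidata_db):
>>         # one cross-cluster query per page render, instead of a local one
>>         rows = wikidata_db.query(
>>             "SELECT lang, title FROM wd_sitelinks WHERE item_id = %s",
>>             (item_id,))
>>         return {lang: title for lang, title in rows}
>>
>>     def handle_purge_push(item_id, pages_using_item, parser_cache):
>>         # the push no longer carries data; it only asks for a cache purge
>>         for page in pages_using_item(item_id):
>>             parser_cache.invalidate(page)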