
Mailing List Archive: Wikipedia: Wikitech

Wikidata blockers weekly update

 

 



denny.vrandecic at wikimedia

Aug 9, 2012, 6:54 AM

Post #1 of 20
Wikidata blockers weekly update

Hi all,

here is our update on last week's blocker email. The list got
considerably shorter, but some long-standing issues are still
there. No new blockers came up.

== Ongoing ==

* Merging the Wikidata branch (ContentHandler) is still open, see
<https://bugzilla.wikimedia.org/show_bug.cgi?id=38622>. There has been
no feedback in the last few weeks. Daniel is waiting for input.

* Changeset <https://gerrit.wikimedia.org/r/#/c/14295/>, bug
<https://bugzilla.wikimedia.org/show_bug.cgi?id=38705> about handling
sites. The idea is to migrate from the "interwiki" table to the new
"Sites" facility. RobLa mentioned two weeks ago that Chad seems to be
working in a similar direction, but we haven't seen comments yet. No
discussion is ongoing and no substantial feedback has been received
here either; it seems somewhat stuck.

== New in the list ==

Nothing.

== Merges ==
* https://gerrit.wikimedia.org/r/#/c/14301/ (got merged. Yay!)

== Abandoned changesets or not-blocking anymore ==
* https://gerrit.wikimedia.org/r/#/c/14084/ (abandoned)
* https://gerrit.wikimedia.org/r/#/c/8924/ (not blocking anymore but
could use some reviewing love)
* https://gerrit.wikimedia.org/r/#/c/14303/ (review in progress; not
blocking anymore if we drop the STTL extension in favour of the ULS
extension, which is currently being investigated)
* https://gerrit.wikimedia.org/r/#/c/17073/ (a change to the skin,
which we abandoned; we are resolving it differently)

I hope this helps,
Cheers,
Denny

--
Project director Wikidata
Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
Tel. +49-30-219 158 26-0 | http://wikimedia.de

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
Registered in the register of associations of the Amtsgericht
Berlin-Charlottenburg under number 23855 B. Recognized as charitable by
the Finanzamt für Körperschaften I Berlin, tax number 27/681/51985.



robla at wikimedia

Aug 9, 2012, 7:49 AM

Post #2 of 20
Re: Wikidata blockers weekly update

Hi Denny,

Thanks for the update. Comments inline:

On Thu, Aug 9, 2012 at 6:54 AM, Denny Vrandečić
<denny.vrandecic [at] wikimedia> wrote:
> * Merging the Wikidata branch (ContentHandler) is still open, see
> <https://bugzilla.wikimedia.org/show_bug.cgi?id=38622>. There has been
> no feedback in the last few weeks. Daniel is waiting for input.

Per discussion on the bug, there's an unresolved issue with the code
as stored in Gerrit. Tim tried cloning this, and wasn't able to find
some of the revisions that Daniel referred to. Last week, you
mentioned that Daniel was going to send mail to the list about the
Gerrit stuff, but I don't recall seeing that.

If you can get the code somewhere Tim can review it, he's ready to look at it.

> * Changeset <https://gerrit.wikimedia.org/r/#/c/14295/>, bug
> <https://bugzilla.wikimedia.org/show_bug.cgi?id=38705> about handling
> sites. The idea is to migrate from the "interwiki" table to the new
> "Sites" facility. RobLa mentioned two weeks ago that Chad seems to be
> working in a similar direction, but we haven't seen comments yet. No
> discussion is ongoing and no substantial feedback has been received
> here either; it seems somewhat stuck.

I'd strongly suggest starting a separate thread on this mailing list
about this (please change the subject line if you reply to this
message). In short, this is a controversial approach, and it is unclear
why you're letting it block your work.

It looks like this page needs an update as well:
http://www.mediawiki.org/wiki/Wikidata_deployment

One thing that was tacked on the wiki page without mention here or a
bug created was the "Stick to that language" extension. Is that a
hard requirement, or nice to have?

Rob



jeroendedauw at gmail

Aug 9, 2012, 8:01 AM

Post #3 of 20
Re: Wikidata blockers weekly update

Hey,

> ... is unclear why you're letting it block your work.

The current "interwiki" related code in core has many assumptions baked in
that prevent us from doing what we need to do in phase 1. For instance
"language links" can only be made to sites with an id that is a language
code. Since we're properly identifying sites across our clients, we're
using global identifiers, which will be "enwiki" rather than "en", so this
cannot work. That's only one of the many evil things in the current code.
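To make that concrete, compare (the interwiki columns below are the
real ones; the sites side is just a sketch of the direction we need,
not the actual patch - "site_global_key" is shorthand of mine):

-- Legacy: the local prefix is the de-facto site id, and language
-- links only work when it happens to be a language code.
SELECT iw_url FROM interwiki WHERE iw_prefix = 'en';

-- Direction we need: a stable global id that names one concrete
-- site on every client, independent of any local prefix.
SELECT site_url FROM sites WHERE site_global_key = 'enwiki';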

> this is a controversial approach

How so?

Is anyone suggesting building on top of the pile of crap we currently have
would be better?

Cheers

--
Jeroen De Dauw
http://www.bn2vs.com
Don't panic. Don't be evil.
--


lists at nadir-seen-fire

Aug 9, 2012, 8:26 AM

Post #4 of 20
Re: Wikidata blockers weekly update

On Thu, 09 Aug 2012 06:54:03 -0700, Denny Vrandečić
<denny.vrandecic [at] wikimedia> wrote:

> Hi all,
>
> [...]
> * Changeset <https://gerrit.wikimedia.org/r/#/c/14295/>, bug
> <https://bugzilla.wikimedia.org/show_bug.cgi?id=38705> about handling
> sites. The idea is to migrate from the "interwiki" table to the new
> "Sites" facility. RobLa mentioned two weeks ago that Chad seems to be
> working in a similar direction, but we haven't seen comments yet. No
> discussion is ongoing and no substantial feedback has been received
> here either; it seems somewhat stuck.
> [...]
>
> I hope this helps,
> Cheers,
> Denny
>

I would like some more information on this. The bug doesn't appear to even
have the correct link for a discussion on this.

Redoing our interwiki code to deal with some mistakes we made in storage
was something I was hoping to do.
So if this is something hoping to replace the interwiki system, I'd like
to look over the plan and overall idea behind it, to make sure we
don't repeat the same mistakes.

--
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]



denny.vrandecic at wikimedia

Aug 9, 2012, 8:48 AM

Post #5 of 20
Re: Wikidata blockers weekly update

Hi Rob,

thanks for the answers.

2012/8/9 Rob Lanphier <robla [at] wikimedia>:
> It looks like this page needs an update as well:
> http://www.mediawiki.org/wiki/Wikidata_deployment

Thanks, I updated the page.

> One thing that was tacked on the wiki page without mention here or a
> bug created was the "Stick to that language" extension. Is that a
> hard requirement, or nice to have?

We are currently investigating using the "Universal Language Selector"
instead of "Stick to that language", and on first glance it looks
good. If this remains like this, we will drop "Stick to that
language". That is why I didn't list the corresponding open issues
there. We'd be happy to go for ULS instead. We expect to have a
resolution on that next week.

Cheers,
Denny

--
Project director Wikidata
Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
Tel. +49-30-219 158 26-0 | http://wikimedia.de

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
Registered in the register of associations of the Amtsgericht
Berlin-Charlottenburg under number 23855 B. Recognized as charitable by
the Finanzamt für Körperschaften I Berlin, tax number 27/681/51985.



daniel at brightbyte

Aug 9, 2012, 9:06 AM

Post #6 of 20
Re: Wikidata blockers weekly update

On 09.08.2012 16:49, Rob Lanphier wrote:
> On Thu, Aug 9, 2012 at 6:54 AM, Denny Vrandečić
> <denny.vrandecic [at] wikimedia> wrote:
>> * Merging the Wikidata branch (ContentHandler) is still open, see
>> <https://bugzilla.wikimedia.org/show_bug.cgi?id=38622>. There has been
>> no feedback in the last few weeks. Daniel is waiting for input.
>
> Per discussion on the bug, there's an unresolved issue with the code
> as stored in Gerrit. Tim tried cloning this, and wasn't able to find
> some of the revisions that Daniel referred to.

That would be strange; a straight fresh clone works fine for me, and the
revisions are in the log.

Tim, please confirm that you are unable to see the changes I mentioned when you
just switch to the Wikidata branch on an up-to-date working copy of core,
ignoring Gerrit.

Also, dennyb added direct links to the respective commits on gitweb. They are
there. Gerrit just doesn't know about them. And the shortlogs on gitweb are strange.

> Last week, you
> mentioned that Daniel was going to send mail to the list about the
> Gerrit stuff, but I don't recall seeing that.

I investigated the problem and reported my findings on bugzilla. There isn't
much to say except "gerrit doesn't know about direct pushes" and "gitweb is
confusing".

> If you can get the code somewhere Tim can review it, he's ready to look at it.

Well, it's in the git repo. Everyone in the team is using that branch for
development and testing; they'd notice if important changes were missing. So I'm
confident that it really *is* there.

-- daniel





jeroendedauw at gmail

Aug 9, 2012, 9:12 AM

Post #7 of 20
Re: Wikidata blockers weekly update

Hey,

> So if this is something hoping to replace the interwiki system, I'd like
> to look over the plan and overall idea behind it, to make sure we
> don't repeat the same mistakes.

Please have a look at the patch on gerrit then. Feedback is much
appreciated :) https://gerrit.wikimedia.org/r/#/c/14295/

Cheers

--
Jeroen De Dauw
http://www.bn2vs.com
Don't panic. Don't be evil.
--


lists at nadir-seen-fire

Aug 9, 2012, 11:30 AM

Post #8 of 20
Re: Wikidata blockers weekly update

On Thu, 09 Aug 2012 09:12:16 -0700, Jeroen De Dauw
<jeroendedauw [at] gmail> wrote:

> Hey,
>
>> So if this is something hoping to replace the interwiki system, I'd like
>> to look over the plan and overall idea behind it, to make sure we
>> don't repeat the same mistakes.
>
> Please have a look at the patch on gerrit then. Feedback is much
> appreciated :) https://gerrit.wikimedia.org/r/#/c/14295/
>
> Cheers
>
> --
> Jeroen De Dauw
> http://www.bn2vs.com
> Don't panic. Don't be evil.
> --

Looking over the code, it does seem we're repeating the same issues that
exist with the current interwiki system, the ones I was planning to
eliminate when I moved includes/Interwiki.php to
includes/interwiki/Interwiki.php and put this on my endless to-do list.

The issue I was trying to deal with was storage. Currently we 100% assume
that the interwiki list is a table and there will only ever be one of
them. But this counters multiple facts about interwikis in practice:
- We have a default set of interwiki links. Because we use a database
instead of flat files, we end up inserting them on installation. As a
result, when something changes (e.g. Wikimedia now supports https:// and
all links are supposed to be protocol-relative), we have hundreds of
wikis all using outdated interwiki rules even after they upgrade
MediaWiki, because interwiki links are only inserted by the software on
installation; they are not taken directly from the software's map.
- In practice we don't want one interwiki map. In projects like Wikimedia
we actually usually want two or three. We want a global shared list of
interwikis so that [[Wikipedia:]] [[commons:]] etc... work on every
project. We want a shared list of interwikis for each project (ie:
Wikipedias, Wiktionaries, etc...), primarily so that [[en:]] [[es:]]
etc... language links are not duplicated, since these can't be global but
also there may be some interwiki links that apply to some projects but not
others. And sometimes we also want a wiki-local interwiki list because
some communities want to add links to sites that other wikis don't. Or we
may want to localize a link. And we end up writing absolutely horrible
hacks we shouldn't have to, because the implementation is ignorant of reality.

I had planned to do a few primary things to the system:
- Drop the notion of the interwiki list simply being a database table.
Multiple class implementations were going to make it possible to have
database backed interwiki lists, file backed interwiki lists (multiple
formats), etc...
- Drop the single-list handling and allow a list of multiple interwiki
sources to be configured from a $wg variable.
Together it would mean that our default list of interwiki links would no
longer be stored in the interwiki table and instead would be read directly
from our source code where cleaning up the urls would nicely update all
wikis when they upgrade. And it would mean that it would be easy to set up
multiple interwiki list sources for wikis, such as a global interwiki
database, a project one, and a local one. And it would be possible to use
simple text based file backed interwiki lists so that people don't need to
mess with sql.
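To sketch the lookup I have in mind (purely illustrative; none of
these tables exist, and a real implementation would hide this behind
the class implementations mentioned above):

-- Hypothetical layered resolution: a wiki-local entry overrides the
-- project-wide one, which overrides the software default.
SELECT iw_url FROM (
  SELECT iw_prefix, iw_url, 3 AS prio FROM interwiki_local
  UNION ALL
  SELECT iw_prefix, iw_url, 2 FROM interwiki_project
  UNION ALL
  SELECT iw_prefix, iw_url, 1 FROM interwiki_default
) AS layers
WHERE iw_prefix = 'wikipedia'
ORDER BY prio DESC
LIMIT 1;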

----
But it looks like the new sites code is also focused around a single list
of database backed sites.

((Also, while there are a number of really interesting ideas, sorry to say
it but some of the code already triggers that "Must rewrite!" mood rather
than thinking of small incremental tweaks))

Also anything in this area really needs to think of our lack of user
interface. If we rewrite this then we absolutely must include a UI to view
and edit this in core. By rewriting it we ditch every hack trying to make
it easy to control the interwiki list and only make the problem worse.
The notes on synchronizing with wikidata look interesting. But this kind
of thing absolutely has to be user-friendly and multi-wiki friendly at a
core level, not only for wikis using wikidata.
----
I think some of this stuff is a bit large to discuss in code review or
email. I'd like to do this RfC style, listing everything we need from
different perspectives so we can come up with something that doesn't need
to be redone yet again.

Originally I was focused on taking interwiki dependence out of the
database. But the talk of synchronization and other things in the code has
me thinking of other ideas, like a database table as a final index (like
pagelinks, etc...), fetching lists, siteinfo, etc... from other sites, and
more. So I have a feeling that the best thing we come up with will
probably look different than what either of us started with.

Firstly though, I probably won't be able to come up with a good idea
without a good understanding of Wikidata's role in all this:
- I would like to understand what Wikidata needs out of interwiki/sites
and what it's going to do with the data
- I'd also like to know if Wikidata plans to add any interface that will
add/remove sites


If we do this hastily I think we may also miss a very good chance to make
fixing bug 11 and bug 10237 much more sanely possible.

bug 39199 also covers a thought on linking in pages I've been thinking
about.

[bug 11] https://bugzilla.wikimedia.org/show_bug.cgi?id=11
[bug 10237] https://bugzilla.wikimedia.org/show_bug.cgi?id=10237
[bug 39199] https://bugzilla.wikimedia.org/show_bug.cgi?id=39199

--
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]



jeroendedauw at gmail

Aug 9, 2012, 12:00 PM

Post #9 of 20
Re: Wikidata blockers weekly update

Hey,

Daniel, thanks for your input.

TL;DR at the bottom :)

> The issue I was trying to deal with was storage. Currently we 100% assume
> that the interwiki list is a table and there will only ever be one of them.

Yes, we are not changing this. Having a more flexible system might or might
not be something we'd want in MediaWiki. We do not need it in Wikidata
though. The changes we're making here do not seem to affect this issue at
all, so you can just as well implement it later on.

> In practice we don't want one interwiki map. In projects like Wikimedia
> we actually usually want two or three.
> ..
> And sometimes we also want a wiki-local interwiki list because some
> communities want to add links to sites that other wikis don't.

This we are actually tackling, although in a different fashion than you
propose. Rather than having many different lists of sites to maintain, we
have split sites from their configuration. The list of sites is global and
shared by all clients. Their configuration however is local. So if wiki a
wants to use site x as interwikilink with prefix foobar, wiki b wants to
use it with prefix baz and wiki c does not want to use it as interwikilink
at all, this is perfectly possible. This split and the associated
generalization our changes bring add a lot of flexibility compared to the
current system and remove bad assumptions currently baked in.
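As a rough sketch of how that split looks in the table (a simplified
subset; site_local_key and site_link_inline are the patch's field
names, site_global_key is shorthand of mine):

CREATE TABLE sites (
  site_global_key  VARBINARY(32)  NOT NULL,  -- shared: 'enwiki' everywhere
  site_url         VARBINARY(255) NOT NULL,  -- shared: synced from the repo
  site_local_key   VARBINARY(32)  NOT NULL,  -- local: 'foobar' here, 'baz' there
  site_link_inline BOOL           NOT NULL   -- local: interwiki linking enabled?
);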

> Also anything in this area really needs to think of our lack of user
> interface. If we rewrite this then we absolutely must include a UI to view
> and edit this in core.

Again, this is not something we're touching at all, or want to touch, as we
don't need it. Personally I think it'd be great to have such facilities, and
it makes sense to add these after the backend has been fixed. I'd be happy
to work with you on this (or leave it entirely up to you) once we've got the
relevant rewrite work done.

> By rewriting it we ditch every hack trying to make it easy to control the
> interwiki list and only make the problem worse.

Our change will not drop any existing functionality. I will make sure there
are tools/facilities at least as good (and probably better) than the
current ones.

> I would like to understand what Wikidata needs out of interwiki/sites and
> what it's going to do with the data

We need this for our "equivalent links", which consist of a global site
id and a page. Right now we do not have consistent global ids, in fact we
don't have global ids. We just have local ids that happen to be similar
everywhere (one might not want this, but is pretty much forced to
right now), which must be language codes in order to be "languagelinks" or
(better named) "equivalent links". Also, right now, all languagelinks are
interwikilinks (wtf) - we want to be able to have "equivalent links"
without them also being interwiki links!
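A sketch of what an "equivalent link" then looks like (table and
column names invented for illustration, not taken from the patch):

CREATE TABLE equivalent_link (
  el_item INT UNSIGNED   NOT NULL,  -- the item the links belong to
  el_site VARBINARY(32)  NOT NULL,  -- global site id, e.g. 'enwiki'
  el_page VARBINARY(255) NOT NULL,  -- page title on that site
  PRIMARY KEY (el_item, el_site)    -- one page per site per item
);
-- 'enwiki' names one concrete site on every client, while a local
-- prefix like 'en' can mean different things on different wikis.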

> I'd also like to know if Wikidata plans to add any interface that will
> add/remove sites

The backend will have an interface to do this, but we're not planning on
any API modules or UIs. The backend will be written keeping in mind people
will want those though, so it ought to be easy to add them later on.

So to wrap up: I don't think there is any conflict between what we want to
do (if you disagree, please provide some pointers). You can make your
changes later on, and will have a much more solid base to work on than now.

Cheers

--
Jeroen De Dauw
http://www.bn2vs.com
Don't panic. Don't be evil.
--


alolita.sharma at gmail

Aug 9, 2012, 12:49 PM

Post #10 of 20
Re: Wikidata blockers weekly update

Denny,

> We are currently investigating using the "Universal Language Selector"
> instead of "Stick to that language", and on first glance it looks
> good. If this remains like this, we will drop "Stick to that
> language". That is why I didn't list the corresponding open issues
> there. We'd be happy to go for ULS instead. We expect to have a
> resolution on that next week.
>

Look forward to discussing ULS in more detail.

Best,
Alolita

On Thu, Aug 9, 2012 at 8:48 AM, Denny Vrandečić
<denny.vrandecic [at] wikimedia> wrote:
> Hi Rob,
>
> thanks for the answers.
>
> 2012/8/9 Rob Lanphier <robla [at] wikimedia>:
>> It looks like this page needs an update as well:
>> http://www.mediawiki.org/wiki/Wikidata_deployment
>
> Thanks, I updated the page.
>
>> One thing that was tacked on the wiki page without mention here or a
>> bug created was the "Stick to that language" extension. Is that a
>> hard requirement, or nice to have?
>
> We are currently investigating using the "Universal Language Selector"
> instead of "Stick to that language", and on first glance it looks
> good. If this remains like this, we will drop "Stick to that
> language". That is why I didn't list the corresponding open issues
> there. We'd be happy to go for ULS instead. We expect to have a
> resolution on that next week.
>
> Cheers,
> Denny



lists at nadir-seen-fire

Aug 9, 2012, 1:24 PM

Post #11 of 20
Re: Wikidata blockers weekly update

~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]

On 12-08-09 12:00 PM, Jeroen De Dauw wrote:
> Hey,
>
> Daniel, thanks for your input.
>
> TL;DR at the bottom :)
>
> > The issue I was trying to deal with was storage. Currently we 100%
> assume that the interwiki list is a table and there will only ever be
> one of them.
>
> Yes, we are not changing this. Having a more flexible system might or
> might not be something we'd want in MediaWiki. We do not need it in
> Wikidata though. The changes we're making here do not seem to affect
> this issue at all, so you can just as well implement it later on.
>
> > In practice we don't want one interwiki map. In projects like
> Wikimedia we actually usually want two or three.
> > ..
> > And sometimes we also want a wiki-local interwiki list because some
> communities want to add links to sites that other wikis don't.
>
> This we are actually tackling, although in a different fashion than you
> propose. Rather than having many different lists of sites to maintain,
> we have split sites from their configuration. The list of sites is
> global and shared by all clients. Their configuration however is
> local. So if wiki a wants to use site x as interwikilink with prefix
> foobar, wiki b wants to use it with prefix baz and wiki c does not
> want to use it as interwikilink at all, this is perfectly possible.
> This split and the associated generalization our changes bring add a lot
> of flexibility compared to the current system and remove bad
> assumptions currently baked in.
I think we're going to need to have some of this and the synchronization
stuff in core.
Right now the code has nothing but the one sites table. No repo code so
presumably the only implementation of that for awhile will be wikidata.
And if parts of this table are supposed to be editable in some cases
where there is no repo, but non-editable otherwise, then I don't see any
way for an
edit ui to tell the difference.

I'm also not sure how this synchronization which sounds like one-way
will play with individual wikis wanting to add new interwiki links.

> > Also anything in this area really needs to think of our lack of user
> interface. If we rewrite this then we absolutely must include a UI to
> view and edit this in core.
>
> Again, this is not something we're touching at all, or want to touch,
> as we don't need it. Personally I think it'd be great to have such
> facilities, and it makes sense to add these after the backend has been
> fixed. I'd be happy to work with you on this (or leave it entirely up
> to you) once we've got the relevant rewrite work done.
>
> > By rewriting it we ditch every hack trying to make it easy to
> control the interwiki list and only make the problem worse.
>
> Our change will not drop any existing functionality. I will make sure
> there are tools/facilities at least as good (and probably better) than
> the current ones.
I'm talking about things like the interwiki extensions and scripts that
turn wiki tables into interwiki lists. All these things are written
against the interwiki table. So by rewriting and using a new table we
implicitly break all the working tricks and throw the user back into SQL.

> > I would like to understand what Wikidata needs out of
> interwiki/sites and what it's going to do with the data
>
> We need this for our "equivalent links", which consist of a global
> site id and a page. Right now we do not have consistent global ids, in
> fact we don't have global ids. We just have local ids that happen to
> be similar everywhere (one might not want this, but is pretty
> much forced to right now), which must be language codes in order to be
> "languagelinks" or (better named) "equivalent links". Also, right now,
> all languagelinks are interwikilinks (wtf) - we want to be able to
> have "equivalent links" without then also being interwiki links!
I like the idea of table entries without actual interwikis. The idea of
some interface listing user selectable sites came to mind and perhaps
sites being added trivially even automatically.
Though if you plan to support this I think you'll need to drop the NOT
NULL from site_local_key.

Actually, another thought makes me think the schema should be a little
different.
site_local_key probably shouldn't be a column, it should probably be
another table.
Something like site_local_key (slc_key, slc_site) which would map things
like en:, Wikipedia:, etc... to a specific site.
I can see wikis wanting to use multiple interwiki names for the same
site. In fact I'm pretty sure this already happens with the existing
interwiki table. We just create duplicate rows.
But you want global ids so I really don't think you want data
duplication like that to happen.
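Concretely, something like this (slc_* names as above, the rest assumed):

CREATE TABLE site_local_key (
  slc_key  VARBINARY(32) NOT NULL,  -- local prefix: 'en', 'Wikipedia', ...
  slc_site INT UNSIGNED  NOT NULL   -- the single site row they point to
);
-- A wiki could then hold ('en', X) and ('Wikipedia', X) for the same
-- site X instead of duplicating the whole site row under two ids.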

> > I'd also like to know if Wikidata plans to add any interface that
> will add/remove sites
>
> The backend will have an interface to do this, but we're not planning
> on any API modules or UIs. The backend will be written keeping in mind
> people will want those though, so it ought to be easy to add them
> later on.
>
> So to wrap up: I don't think there is any conflict between what we
> want to do (if you disagree, please provide some pointers). You can
> make your changes later on, and will have a much more solid base to
> work on than now.
I think I need to understand the plans you have for synchronization a
bit more.
- Where does Wikidata get the sites
- What synchronizes the data
- What is the repo like? Also, what is it based off of? Is this wikis
syncing from another wiki's sites table, or does Wikidata have a real set
of data the sites table gets based off of?
- Is this one-way synchronization or multiway.

Synchronization, treatment of the table (whether it's an index of
something else or first-class data), and UIs for editing are a
set of things where you can get in the way of the ability to do the
others later if you don't think of them all up front.

Our old interwiki table was treated as first-class data and was simple
data that was easy to create an edit interface for. As a result it's
hard to do any synchronization for it, since we didn't plan for that.
Likewise, if we design a sites table focused on synchronizing data while
treating the table simultaneously as first-class data with some of it
treated like an index, we can easily come up with something that is
going to get in the way of the consistency needed for a UI.

One of our options might be to treat sites like an index of data built
from other sources just like pagelinks. Wikidata can act as a repo, the
sites code can build from multiple sources with Wikidata being the
first, and when a UI comes into play the UI can create its own list of
sites and that can be used as a source for the building of the sites table.
----
Heh, it probably doesn't help that this is making my abstract revision
idea come up and making me want to have the UI depend off of that.
> Cheers
>
> --
> Jeroen De Dauw
> http://www.bn2vs.com
> Don't panic. Don't be evil.
> --
Btw if you really want to make this an abstract list of sites dropping
site_url and the other two related columns might be an idea.
At first glance the url looks like something standard that every site
would have. But once you throw something like MediaWiki into the mix
with short urls, long urls, and an API the url really becomes type
specific data that should probably go in the blob. Especially when you
start thinking about other custom types.


jeroendedauw at gmail

Aug 9, 2012, 3:55 PM

Post #12 of 20
Re: Wikidata blockers weekly update

Hey,

You bring up some good points.

> I think we're going to need to have some of this and the synchronization
> stuff in core.
> Right now the code has nothing but the one sites table. No repo code so
> presumably the only implementation of that for awhile will be wikidata. And
> if parts of this table are supposed to be editable in some cases where there
> is no repo, but non-editable otherwise, then I don't see any way for an edit
> ui to tell the difference.
>

We indeed need some configuration setting(s) for wikis to distinguish
between the two cases. That seems to be all "synchronisation code" we'll
need in core. It might or might not be useful to have more logic in core,
or in some dedicated extension. Personally I think having the actual
synchronization code in a separate extension would be nice, as a lot of it
won't be Wikidata specific. This is however not a requirement for Wikidata,
so the current plan is to just have it in the extension, always keeping in
mind that it should be easy to split it off later on. I'd love to discuss
this point further, but it should be clear this is not much of a blocker
for the current code, as it seems unlikely to affect it much, if at all.

On that note consider we're initially creating the new system in parallel
with the old one, which enables us to just try out changes, and alter them
later on if it turns out there is a better way to do them. Then once we're
confident the new system is what we want to stick to, and know it works
because of its usage by Wikidata, we can replace the current code with the
new system. This ought to allow us to work a lot faster by not blocking on
discussions and details for too long.

> I'm also not sure how this synchronization which sounds like one-way will
> play with individual wikis wanting to add new interwiki links.

For our case we only need it to work one way, from the Wikidata repo to
its clients. More discussion would need to happen to decide on an
alternate approach. I already indicated I think this is not a blocker for
the current set of changes, so I'd prefer this to happen after the current
code got merged.

> I'm talking about things like the interwiki extensions and scripts that
> turn wiki tables into interwiki lists. All these things are written against
> the interwiki table. So by rewriting and using a new table we implicitly
> break all the working tricks and throw the user back into SQL.
>

I am aware of this. As noted already, the current new code does not yet
replace the old code, so this is not a blocker yet, but it will be for
replacing the old code with the new system. Having looked at the existing
code using the old system, I think migration should not be too hard, since
the new system can do everything the old one can do and the code currently
using it is not that much. The new system also has clear interfaces, preventing
the script from needing to know of the database table at all. That ought to
facilitate the "do not depend on a single db table" a lot, obviously :)

> I like the idea of table entries without actual interwikis. The idea of
> some interface listing user selectable sites came to mind and perhaps sites
> being added trivially even automatically.
> Though if you plan to support this I think you'll need to drop the NOT
> NULL from site_local_key.
>

I don't think the field needs to allow for null - right now the local keys
on the repo will be by default the same as the global keys, so none of them
will be null. On your client wiki you will then have these values by
default as well. If you don't want a particular site to be usable as
"languagelink" or "interwikilink", then simply set this in your local
configuration. No need to set the local id to null. Depending on how
we actually end up handling the defaulting process, having null might or
might not turn out to be useful. This is a detail though, so I'd suggest
sticking with not null for now, and then if it turns out it'd be more
convenient to allow for null when writing the sync code, just change it
then.

> Actually, another thought makes me think the schema should be a little
> different.
> site_local_key probably shouldn't be a column, it should probably be
> another table.
> Something like site_local_key (slc_key, slc_site) which would map things
> like en:, Wikipedia:, etc... to a specific site.
>

Denny and I discussed this at some length, now already more than a month
ago (man, this is taking long...). Our conclusions were that we do not
need it, or would not benefit from it much in Wikidata. In fact, it'd introduce
additional complexity, which is a good argument for not including it in our
already huge project. I do agree that conceptually it's nicer to not
duplicate such info, but if you consider the extra complexity you'd need to
get rid of it, and the little gain you have (removal of some minor
duplication which we've had since forever and is not bothering anyone), I'm
sceptical we ought to go with this approach, even outside of Wikidata.

> I think I need to understand the plans you have for synchronization a bit
> more.
> - Where does Wikidata get the sites
>

The repository wiki holds the canonical copy of the sites, which gets sent
to all clients. Modification of the site data can only happen on the
repository. All wikis (repo and clients) have their own local config so they can
choose to enable all sites for all functionality, completely hide them, or
anything in between.

> - What synchronizes the data
>

The repo. As already mentioned, it might be nicer to split this off into its
own extension at some point. But before we get to that, we first need to
have the current changes merged.

> Btw if you really want to make this an abstract list of sites dropping site_url
> and the other two related columns might be an idea.
> At first glance the url looks like something standard that every site
> would have. But once you throw something like MediaWiki into the mix with
> short urls, long urls, and an API the url really becomes type specific data
> that should probably go in the blob. Especially when you start thinking
> about other custom types.
>

The patch sitting on gerrit already includes this. (Did you really look at
it already? The fields are documented quite well I'd think.) Every site has
a url (that's not specific to the type of site), but we have a type system,
currently with the default (general) site type and a MediaWikiSite type.
The type system works with two blob fields, one for type specific data and
one for type specific configuration.
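Schematically, the relevant fields boil down to something like this
(simplified; site_data and site_config are the actual blob fields,
site_type is my shorthand for the type discriminator):

site_type   VARBINARY(32) NOT NULL,  -- e.g. the general type or 'mediawiki'
site_data   BLOB NOT NULL,           -- type-specific data (serialized)
site_config BLOB NOT NULL            -- type-specific configuration (serialized)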

Cheers

--
Jeroen De Dauw
http://www.bn2vs.com
Don't panic. Don't be evil.
--


lists at nadir-seen-fire

Aug 9, 2012, 6:03 PM

Post #13 of 20
Re: Wikidata blockers weekly update

On 12-08-09 3:55 PM, Jeroen De Dauw wrote:
> Hey,
>
> You bring up some good points.
>
> I think we're going to need to have some of this and the
> synchronization stuff in core.
> Right now the code has nothing but the one sites table. No repo
> code so presumably the only implementation of that for awhile will
> be wikidata. And if parts of this table are supposed to be editable
> in some cases where there is no repo, but non-editable otherwise, then I don't
> see any way for an edit ui to tell the difference.
>
>
> We indeed need some configuration setting(s) for wikis to distinguish
> between the two cases. That seems to be all "synchronisation code"
> we'll need in core. It might or might not be useful to have more logic
> in core, or in some dedicated extension. Personally I think having the
> actual synchronization code in a separate extension would be nice, as
> a lot of it won't be Wikidata specific. This is however not a
> requirement for Wikidata, so the current plan is to just have it in
> the extension, always keeping in mind that it should be easy to split
> it off later on. I'd love to discuss this point further, but it should
> be clear this is not much of a blocker for the current code, as it
> seems unlikely to affect it much, if at all.
>
> On that note consider we're initially creating the new system in
> parallel with the old one, which enables us to just try out changes,
> and alter them later on if it turns out there is a better way to do
> them. Then once we're confident the new system is what we want to
> stick to, and know it works because of its usage by Wikidata, we can
> replace the current code with the new system. This ought to allow us
> to work a lot faster by not blocking on discussions and details for too
> long.
>
> > I'm also not sure how this synchronization which sounds like one-way
> will play with individual wikis wanting to add new interwiki links.
>
> For our case we only need it to work one way, from the Wikidata repo
> to its clients. More discussion would need to happen to decide on an
> alternate approach. I already indicated I think this is not a blocker
> for the current set of changes, so I'd prefer this to happen after the
> current code got merged.
>
> I'm talking about things like the interwiki extensions and scripts
> that turn wiki tables into interwiki lists. All these things are
> written against the interwiki table. So by rewriting and using a
> new table we implicitly break all the working tricks and throw the
> user back into SQL.
>
>
> I am aware of this. As noted already, the current new code does not
> yet replace the old code, so this is not a blocker yet, but it will be
> for replacing the old code with the new system. Having looked at the
> existing code using the old system, I think migration should not be too
> hard, since the new system can do everything the old one can do and
> the code currently using it is not that much. The new system also has clear
> interfaces, preventing the script from needing to know of the database
> table at all. That ought to facilitate the "do not depend on a single
> db table" a lot, obviously :)
>
> I like the idea of table entries without actual interwikis. The
> idea of some interface listing user selectable sites came to mind
> and perhaps sites being added trivially even automatically.
> Though if you plan to support this I think you'll need to drop the
> NOT NULL from site_local_key.
>
>
> I don't think the field needs to allow for null - right now the local
> keys on the repo will be by default the same as the global keys, so
> none of them will be null. On your client wiki you will then have
> these values by default as well. If you don't want a particular site
> to be usable as "languagelink" or "interwikilink", then simply set
> this in your local configuration. No need to set the local id to null.
> Depending on how we actually end up handling the defaulting process,
> having null might or might not turn out to be useful. This is a detail
> though, so I'd suggest sticking with not null for now, and then if it
> turns out it'd be more convenient to allow for null when writing the
> sync code, just change it then.
You mean site_config?
You're suggesting the interwiki system should look for a site by
site_local_key, then when it finds one parse out the site_config, check if
it's disabled, and if so ignore the fact that it found a site with that
local key? Instead of just not having a site_local_key for that row in the
first place?

> Actually, another thought makes me think the schema should be a
> little different.
> site_local_key probably shouldn't be a column, it should probably
> be another table.
> Something like site_local_key (slc_key, slc_site) which would map
> things like en:, Wikipedia:, etc... to a specific site.
>
>
> Denny and I discussed this at some length, now already more than a
> month ago (man, this is taking long...). Our conclusions were that we
> do not need it, or would not benefit from it much in Wikidata. In fact,
> it'd introduce additional complexity, which is a good argument for not
> including it in our already huge project. I do agree that conceptually
> it's nicer to not duplicate such info, but if you consider the extra
> complexity you'd need to get rid of it, and the little gain you have
> (removal of some minor duplication which we've had since forever and
> is not bothering anyone), I'm sceptical we ought to go with this
> approach, even outside of Wikidata.
You've added global ids into this mix. So data duplication simply
because one wiki needs a second local name will mean that one url now
has two different global ids. This sounds precisely like something that
is going to get in the way of the whole reason you wanted this rewrite.
It will also start to create issues with the sync code.
Additionally the number of duplicates needed is going to vary wiki by
wiki. en.wikisource is going to need one Wikipedia: to link to en.wp
while fr.wp is going to need two, Wikipedia: and en: to point to en.wp.
I can only see data duplication creating more problems than we need.

As for the supposed complexity of this extra table: site_data and
site_config are blobs of presumably serialized data. You've already
eliminated the simplicity needed for this to be human-editable from SQL,
so there is no reason to hold back on making the database schema the
best it can be. As for deletions, if you're worried about making them
simple, just add a foreign key with cascading deletion. Then the rows in
site_local_key will automatically be deleted when you delete the row in
sites, without any extra complexity.
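Something along these lines (assuming a sites.site_id primary key):

ALTER TABLE site_local_key
  ADD FOREIGN KEY (slc_site) REFERENCES sites (site_id)
  ON DELETE CASCADE;  -- local keys go away with their site row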

> I think I need to understand the plans you have for
> synchronization a bit more.
> - Where does Wikidata get the sites
>
>
> The repository wiki holds the canonical copy of the sites, which gets
> sent to all clients. Modification of the site data can only happen on
> the repository. All wikis (repo and clients) have their own local
> config so they can choose to enable all sites for all functionality,
> completely hide them, or anything in between.
Ok, I'm leaning more and more towards the idea that we should make the
full sites table a second-class index of sites pulled from any number of
data sources that you can carelessly truncate and have rebuilt (ie: it
has no more value than pagelinks).
Wikidata's data syncing would be served by creating a secondary table
with the local link_{key,inline,navigation}, forward, and config
columns. When you sync, the data from the Wikidata repo and the
site-local table would be combined to create what goes into the index table
with the full list of sites.
Doing it this way frees us from creating any restrictions on whatever
source we get sites from that we shouldn't be placing on them.
Wikidata gets site local stuff and global data and doesn't have to worry
about whether parts of the row are supposed to be editable or not. There
is nothing stopping us from making our first non-wikidata site source a
plaintext file so we have time to write a really good UI. And the UI is
free from restrictions placed by using this one table, so it's free to
do it in whatever way fits a UI best. Whether that means it's an
editable wikitext page or better yet a nice ui using that abstract
revision system I wanted to build.

> - What synchronizes the data
>
>
> The repo. As already mentioned, it might be nicer to split this off
> into its own extension at some point. But before we get to that, we first
> need to have the current changes merged.
>
> Btw if you really want to make this an abstract list of sites
> dropping site_url and the other two related columns might be an idea.
> At first glance the url looks like something standard that every
> site would have. But once you throw something like MediaWiki into
> the mix with short urls, long urls, and an API the url really
> becomes type specific data that should probably go in the blob.
> Especially when you start thinking about other custom types.
>
>
> The patch sitting on gerrit already includes this. (Did you really
> look at it already? The fields are documented quite well I'd think.)
> Every site has a url (that's not specific to the type of site), but we
> have a type system, currently with the default (general) site type and
> a MediaWikiSite type. The type system works with two blob fields, one
> for type specific data and one for type specific configuration.
Yeah, I looked at the schema; I know there is a data blob, and that's what
I'm talking about. I mean, while you'd think that a url is something
every site would have one of, it's actually more of a type-specific piece
of data, because some site types can actually have multiple urls, etc.,
which depend on what the page input is. So you might as well drop the 3
url related columns and just use the data blob that you already have.
The $1 pattern may not even work for some sites. For example something
like a gerrit type may want to know a specific root path for gerrit
without any $1 funny business and then handle what actual url gets
output in special ways, i.e. so that [[gerrit:14295]] links to
https://gerrit.wikimedia.org/r/#/c/14295 while [[gerrit:
I0a96e58556026d8c923551b07af838ca426a2ab3]] links to
https://gerrit.wikimedia.org/r/#q,I0a96e58556026d8c923551b07af838ca426a2ab3,n,z
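In other words, the branching would be something like this
(hypothetical; a real site type would do this in code, but SQL shows
the shape of it):

SET @input = '14295';  -- or 'I0a96e58556026d8c923551b07af838ca426a2ab3'
SELECT CASE
  WHEN @input REGEXP '^[0-9]+$'  -- a change number
    THEN CONCAT('https://gerrit.wikimedia.org/r/#/c/', @input)
  ELSE                           -- a Change-Id hash
    CONCAT('https://gerrit.wikimedia.org/r/#q,', @input, ',n,z')
END AS url;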

> Cheers
>
> --
> Jeroen De Dauw
> http://www.bn2vs.com
> Don't panic. Don't be evil.
> --

~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]



jeroendedauw at gmail

Aug 10, 2012, 4:51 AM

Post #14 of 20
Re: Wikidata blockers weekly update

Hey,

> You mean site_config?
> You're suggesting the interwiki system should look for a site by
> site_local_key, then when it finds one parse out the site_config, check if
> it's disabled, and if so ignore the fact that it found a site with that local key?
> Instead of just not having a site_local_key for that row in the first place?
>

No. Since the interwiki system is not specific to any type of site, this
approach would make it needlessly hard. The site_link_inline field
determines if the site should be usable as an interwiki link, as you can see
in the patchset:

-- If the site should be linkable inline as an "interwiki link" using
-- [[site_local_key:pageTitle]].
site_link_inline bool NOT NULL,

So queries would be _very_ simple.
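For example, resolving an inline interwiki prefix would be along the
lines of (a sketch, not the literal query from the patch):

SELECT * FROM sites
WHERE site_local_key = 'en'  -- the prefix used on this wiki
  AND site_link_inline = 1;  -- and inline linking is enabled for it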

> So data duplication simply because one wiki needs a second local name
> will mean that one url now has two different global ids. This sounds
> precisely like something that is going to get in the way of the whole
> reason you wanted this rewrite.

* It does not get in our way at all, and is completely disjoint from why we
want the rewrite
* It's currently done like this
* The changes we do need and are proposing to make will make such a rewrite
at a later point easier than it is now

> Doing it this way frees us from creating any restrictions on whatever
> source we get sites from that we shouldn't be placing on them.

* We don't need this for Wikidata
* It's a new feature that might or might not be nice to have that currently
does not exist
* The changes we do need and are proposing to make will make such a rewrite
at a later point easier than it is now

> So you might as well drop the 3 url related columns and just use the data
> blob that you already have.

I don't see what this would gain us at all. It would just make things more
complicated.

> The $1 pattern may not even work for some sites.

* We don't need this for Wikidata
* It's a new feature that might or might not be nice to have that currently
does not exist
* The changes we do need and are proposing to make will make such a rewrite
at a later point easier than it is now

And in fact we are making this more flexible by having the type system. The
MediaWiki site type could for instance be able to form both "nice" urls and
index.php ones. Or a gerrit type could have the logic to distinguish
between the gerrit commit number and a sha1 hash.

Cheers

--
Jeroen De Dauw
http://www.bn2vs.com
Don't panic. Don't be evil.
--


denny.vrandecic at wikimedia

Aug 10, 2012, 5:03 AM

Post #15 of 20
Re: Wikidata blockers weekly update

Hi Daniel,

thanks for your comments. Some of the suggestions you make would
extend the functionality beyond what we need right now. They certainly
look useful, and I don't think that we would make implementing
them any harder than it is right now -- rather the opposite.

As usual, the perfect and the next step are great enemies. I
understand that the patch does not directly lead to a perfect world
that would cover all of your use cases -- but it nicely covers ours.

My questions would be:

* do you think that we are going in the wrong direction, or do you
think we are not going far enough yet?

* do you think that we are making some use cases harder to implement
in the future than they would be now, and if so which ones?

* do you see other issues with the patch that should block it from
being deployed, and which ones would that be?

Cheers,
Denny



2012/8/10 Daniel Friesen <lists [at] nadir-seen-fire>:
> On 12-08-09 3:55 PM, Jeroen De Dauw wrote:
>> Hey,
>>
>> You bring up some good points.
>>
>> I think we're going to need to have some of this and the
>> synchronization stuff in core.
>> Right now the code has nothing but the one sites table. No repo
>> code so presumably the only implementation of that for awhile will
>> be wikidata. And if parts of this table is supposed to be editable
>> in some cases where there is no repo but non-editable then I don't
>> see any way for an edit ui to tell the difference.
>>
>>
>> We indeed need some configuration setting(s) for wikis to distinguish
>> between the two cases. That seems to be all "synchronisation code"
>> we'll need in core. It might or might not be useful to have more logic
>> in core, or in some dedicated extension. Personally I think having the
>> actual synchronization code in a separate extension would be nice, as
>> a lot of it won't be Wikidata specific. This is however not a
>> requirement for Wikidata, so the current plan is to just have it in
>> the extension, always keeping in mind that it should be easy to split
>> it off later on. I'd love to discuss this point further, but it should
>> be clear this is not much of a blocker for the current code, as it
>> seems unlikely to affect it much, if at all.
>>
>> On that note consider we're initially creating the new system in
>> parallel with the old one, which enabled us to just try out changes,
>> and alter them later on if it turns out there is a better way to do
>> them. Then once we're confident the new system is what we want to
>> stick to, and know it works because of it's usage by Wikidata, we can
>> replace the current code with the new system. This ought to allow us
>> to work a lot faster by not blocking on discussions and details for to
>> long.
>>
>> > I'm also not sure how this synchronization which sounds like one-way
>> will play with individual wikis wanting to add new interwiki links.
>>
>> For our case we only need it to work one way, from the Wikidata repo
>> to it's clients. More discussion would need to happen to decide on an
>> alternate approach. I already indicated I think this is not a blocker
>> for the current set of changes, so I'd prefer this to happen after the
>> current code got merged.
>>
>> I'm talking about things like the interwiki extensions and scripts
>> that turn wiki tables into interwiki lists. All these things are
>> written against the interwiki table. So by rewriting and using a
>> new table we implicitly break all the working tricks and throw the
>> user back into SQL.
>>
>>
>> I am aware of this. Like noted already, the current new code does not
>> yet replace the old code, so this is not a blocker yet, but it will be
>> for replacing the old code with the new system. Having looked at the
>> existing code using the old system, I think migration should not be to
>> hard, since the new system can do everything the old one can do and
>> the current using code is not that much. The new system also has clear
>> interfaces, preventing the script from needing to know of the database
>> table at all. That ought to facilitate the "do not depend on a single
>> db table" a lot, obviously :)
>>
>> I like the idea of table entries without actual interwikis. The
>> idea of some interface listing user selectable sites came to mind
>> and perhaps sites being added trivially even automatically.
>> Though if you plan to support this I think you'll need to drop the
>> NOT NULL from site_local_key.
>>
>>
>> I don't think the field needs to allow for null - right now the local
>> keys on the repo will be by default the same as the global keys, so
>> none of them will be null. On your client wiki you will then have
>> these values by default as well. If you don't want a particular site
>> to be usable as "languagelink" or "interwikilink", then simply set
>> this in your local configuration. No need to set the local id to null.
>> Depending on how actually we end up handling the defaulting process,
>> having null might or might not turn out to be useful. This is a detail
>> though, so I'd suggest sticking with not null for now, and then if it
>> turns out I'd be more convenient to allow for null when writing the
>> sync code, just change it then.
> You mean site_config?
> You're suggesting the interwiki system should look for a site by
> site_local_key, when it finds one parse out the site_config, check if
> it's disabled and if so ignore the fact it found a site with that local
> key? Instead of just not having a site_local_key for that row in the
> first place?
>
>> Actually, another thought makes me think the schema should be a
>> little different.
>> site_local_key probably shouldn't be a column, it should probably
>> be another table.
>> Something like site_local_key (slc_key, slc_site) which would map
>> things like en:, Wikipedia:, etc... to a specific site.
>>
>>
>> Denny and I discussed this at some length, now already more than a
>> month ago (man, this is taking long...). Our conclusion was that we
>> do not need it and would not benefit from it much in Wikidata. In
>> fact, it'd introduce additional complexity, which is a good argument
>> for not including it in our already huge project. I do agree that
>> conceptually it's nicer to not duplicate such info, but if you
>> consider the extra complexity you'd need to get rid of it, and the
>> little gain you have (removal of some minor duplication which we've
>> had since forever and which is not bothering anyone), I'm sceptical
>> we ought to go with this approach, even outside of Wikidata.
> You've added global ids into this mix. So data duplication simply
> because one wiki needs a second local name will mean that one url now
> has two different global ids. This sounds precisely like something
> that is going to get in the way of the whole reason you wanted this
> rewrite. It will also start to create issues with the sync code.
> Additionally, the number of duplicates needed is going to vary wiki
> by wiki. en.wikisource is going to need one prefix, Wikipedia:, to
> link to en.wp, while fr.wp is going to need two, Wikipedia: and en:,
> to point to en.wp. I can only see data duplication creating more
> problems than we need.
>
> As for the supposed complexity of this extra table: site_data and
> site_config are blobs of presumably serialized data. You've already
> eliminated the simplicity needed for this to be human-editable from
> SQL, so there is no reason to hold back on making the database schema
> the best it can be. As for deletions, if you're worried about keeping
> them simple, just add a foreign key with cascading deletion. Then the
> rows in site_local_key will automatically be deleted when you delete
> the row in sites, without any extra complexity.
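To make that concrete, a rough sketch of the two-table layout with
cascading deletion follows. The slc_key/slc_site names come from the
proposal above; the sites columns shown here are simplified stand-ins
for illustration, not the actual patchset schema.

  CREATE TABLE sites (
    site_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    site_global_key VARCHAR(32) NOT NULL,  -- unique global id
    site_data BLOB NOT NULL                -- type-specific data
  ) ENGINE=InnoDB;

  CREATE TABLE site_local_key (
    slc_key VARCHAR(32) NOT NULL PRIMARY KEY,  -- 'en', 'Wikipedia', ...
    slc_site INT UNSIGNED NOT NULL,            -- the site it maps to
    FOREIGN KEY (slc_site) REFERENCES sites (site_id)
      ON DELETE CASCADE
  ) ENGINE=InnoDB;

  INSERT INTO sites (site_global_key, site_data) VALUES ('enwiki', '');
  -- fr.wp could then point both prefixes at that one row, with no
  -- duplicated global id:
  INSERT INTO site_local_key VALUES ('en', 1), ('Wikipedia', 1);

Note that on MySQL the cascade only works with an engine that actually
enforces foreign keys, hence InnoDB.
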
>
>> I think I need to understand the plans you have for
>> synchronization a bit more.
>> - Where does Wikidata get the sites
>>
>>
>> The repository wiki holds the canonical copy of the sites, which
>> gets sent to all clients. Modification of the site data can only
>> happen on the repository. All wikis (repo and clients) have their
>> own local config, so they can choose to enable all sites for all
>> functionality, completely hide them, or anything in between.
> Ok, I'm leaning more and more towards the idea that we should make
> the full sites table a second-class index of sites, pulled from any
> number of data sources, that you can carelessly truncate and have
> rebuilt (ie: it has no more value than pagelinks).
> Wikidata's data syncing would be served by creating a secondary table
> with the local link_{key,inline,navigation}, forward, and config
> columns. When you sync, the data from the Wikidata repo and the
> site-local table would be combined to create what goes into the index
> table with the full list of sites.
> Doing it this way frees us from placing restrictions on whatever
> source we get sites from that we shouldn't be placing on them.
> Wikidata gets site-local stuff and global data, and doesn't have to
> worry about whether parts of the row are supposed to be editable or
> not. There is nothing stopping us from making our first non-wikidata
> site source a plaintext file, so we have time to write a really good
> UI. And the UI is free from restrictions placed by using this one
> table, so it's free to do it in whatever way fits a UI best. Whether
> that means it's an editable wikitext page or, better yet, a nice UI
> using that abstract revision system I wanted to build.
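A hypothetical shape for that split, for illustration only: the column
names below are guesses built from the columns Daniel lists
(link_{key,inline,navigation}, forward, config), not code from the
actual patchset.

  -- Per-wiki overrides, one row per locally configured site.
  CREATE TABLE site_local (
    sl_site VARCHAR(32) NOT NULL,  -- global key of the overridden site
    sl_link_key VARCHAR(32),       -- local prefix, e.g. 'en'
    sl_link_inline BOOL NOT NULL DEFAULT 1,
    sl_link_navigation BOOL NOT NULL DEFAULT 1,
    sl_forward BOOL NOT NULL DEFAULT 0,
    sl_config BLOB
  );

  -- The full sites table is then just a cache: safe to TRUNCATE and
  -- rebuild by combining synced repo data with site_local rows.
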
>
>> - What synchronizes the data
>>
>>
>> The repo. As already mentioned, it might be nicer to split this off
>> into its own extension at some point. But before we get to that, we
>> first need to have the current changes merged.
>>
>> Btw, if you really want to make this an abstract list of sites,
>> dropping site_url and the other two related columns might be an
>> idea. At first glance the url looks like something standard that
>> every site would have. But once you throw something like MediaWiki
>> into the mix, with short urls, long urls, and an API, the url really
>> becomes type-specific data that should probably go in the blob.
>> Especially when you start thinking about other custom types.
>>
>>
>> The patch sitting on gerrit already includes this. (Did you really
>> look at it already? The fields are documented quite well, I'd
>> think.) Every site has a url (that's not specific to the type of
>> site), but we have a type system, currently with the default
>> (general) site type and a MediaWikiSite type. The type system works
>> with two blob fields, one for type-specific data and one for
>> type-specific configuration.
> Yeah, I looked at the schema; I know there is a data blob, and that's
> what I'm talking about. While you'd think that a url is something
> every site would have, it's actually more of a type-specific piece of
> data, because some site types can actually have multiple urls, etc.,
> depending on what the page input is. So you might as well drop the 3
> url-related columns and just use the data blob that you already have.
> The $1 pattern may not even work for some sites. For example, a
> gerrit type may want to know a specific root path for gerrit without
> any $1 funny business, and then handle what actual url gets output in
> special ways. ie: so that [[gerrit:14295]] links to
> https://gerrit.wikimedia.org/r/#/c/14295 while
> [[gerrit:I0a96e58556026d8c923551b07af838ca426a2ab3]] links to
> https://gerrit.wikimedia.org/r/#q,I0a96e58556026d8c923551b07af838ca426a2ab3,n,z
>
>> Cheers
>>
>> --
>> Jeroen De Dauw
>> http://www.bn2vs.com
>> Don't panic. Don't be evil.
>> --
>
> ~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]
>
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l [at] lists
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l



--
Project director Wikidata
Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
Tel. +49-30-219 158 26-0 | http://wikimedia.de

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg
unter der Nummer 23855 B. Als gemeinnützig anerkannt durch das
Finanzamt für Körperschaften I Berlin, Steuernummer 27/681/51985.

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


bawolff+wn at gmail

Aug 10, 2012, 6:33 AM

Post #16 of 20 (1259 views)
Permalink
Wikidata blockers weekly update [In reply to]

> Hey,
>
> > You mean site_config?
> > You're suggesting the interwiki system should look for a site by
> > site_local_key, and when it finds one, parse out the site_config,
> > check whether it's disabled, and if so ignore the fact that it
> > found a site with that local key? Instead of just not having a
> > site_local_key for that row in the first place?
> >
>
> No. Since the interwiki system is not specific to any type of site,
> this approach would make it needlessly hard. The site_link_inline
> field determines whether the site should be usable as an interwiki
> link, as you can see in the patchset:
>
> -- If the site should be linkable inline as an "interwiki link" using
> -- [[site_local_key:pageTitle]].
> site_link_inline bool NOT NULL,
>
> So queries would be _very_ simple.
>
> > So data duplication simply because one wiki needs a second local
> > name will mean that one url now has two different global ids. This
> > sounds precisely like something that is going to get in the way of
> > the whole reason you wanted this rewrite.
>
> * It does not get in our way at all, and is completely separate from
> why we want the rewrite
> * It's currently done like this
> * The changes we do need and are proposing to make will make such a
> rewrite at a later point easier than it is now
>
> > Doing it this way frees us from placing restrictions on whatever
> > source we get sites from that we shouldn't be placing on them.
>
> * We don't need this for Wikidata
> * It's a new feature, which might or might not be nice to have, that
> currently does not exist
> * The changes we do need and are proposing to make will make such a
> rewrite at a later point easier than it is now
>
> > So you might as well drop the 3 url-related columns and just use
> > the data blob that you already have.
>
> I don't see what this would gain us at all. It'd just make things
> more complicated.
>
> > The $1 pattern may not even work for some sites.
>
> * We don't need this for Wikidata
> * It's a new feature, which might or might not be nice to have, that
> currently does not exist
> * The changes we do need and are proposing to make will make such a
> rewrite at a later point easier than it is now
>
And in fact we are making this more flexible by having the type
system. The MediaWiki site type could, for instance, form both "nice"
urls and index.php ones. Or a gerrit type could have the logic to
distinguish between the gerrit commit number and a sha1 hash.
>
> Cheers

[Just to clarify, I'm doing inline replies to things various people
said, not just Jeroen]

First and foremost, I'm a little confused as to what the actual use
cases here are. Could we get a short summary, for those who aren't
entirely following how wikidata will work, of why the current
interwiki situation is insufficient? I've read I0a96e585 and
http://lists.wikimedia.org/pipermail/wikitech-l/2012-June/060992.html,
but everything seems very vague -- "It doesn't work for our
situation" -- without any detailed explanation of what that situation
is. At most the messages kind of hint at wanting to be able to access
the list of interwiki types of the wikidata "server" from a wikidata
"client" (and keep them in sync, or at least have them replicated from
server->client). But there's no explanation given as to why one needs
to do that (are we doing some form of interwiki transclusion and need
to render foreign interwiki links correctly? Do we want to be able to
do global whatlinkshere and need unique global ids for various wikis?
Something else?)

>* Site definitions can exist that are not used as "interlanguage link" and
>not used as "interwiki link"

And if we put one of those on a talk page, what would happen? Or if
foo were one such site, what would [[:foo:some page]] do? (Current
behaviour is that it becomes an interwiki link.)

Although to be fair, I do see how the current way we distinguish
between interwiki and interlang links is a bit hacky.
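
Under the proposed schema that distinction would presumably hang off
explicit per-site flags rather than naming conventions. A minimal
sketch of resolving a prefix, using the site_local_key and
site_link_inline fields quoted above (the query shape itself is an
assumption, not code from the patchset):

  -- Is 'foo' a site that may be linked inline as [[foo:Page]]?
  SELECT *
  FROM sites
  WHERE site_local_key = 'foo'
    AND site_link_inline = 1;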

>And in fact we are making this more flexible by having the type
>system. The MediaWiki site type could, for instance, form both "nice"
>urls and index.php ones. Or a gerrit type could have the logic to
>distinguish between the gerrit commit number and a sha1 hash.

I must admit I do like this idea. In particular, the current
situation, where we treat the value of an interwiki link as a title
(aka spaces -> underscores etc) even for sites that do not use such
conventions, has always bothered me. Having interwikis that support
url re-writing based on the value does sound cool, but I certainly
wouldn't want said code in a db blob (and just using an integer
site_type identifier is quite far away from giving us that, but it's
still a step in a positive direction), which raises the question of
where such rewriting code would go.


> The issue I was trying to deal with was storage. Currently we 100% assume
>that the interwiki list is a table and there will only ever be one of them.

Do we really assume that? Certainly that's the default config, but I
don't think that is the config used on WMF. As far as I'm aware,
Wikimedia uses a cdb database file (via $wgInterwikiCache), which
contains all the interwikis for all sites. From what I understand, it
supports doing various "scope" levels of interwikis, including per db,
per site (Wikipedia, Wiktionary, etc), or global interwikis that act
on all sites.

The feature is a bit WMF-specific, but it does seem to support
different levels of interwiki lists.

Furthermore, I imagine (but don't know, so let's see how fast I get
corrected ;) that the cdb database was introduced not just as a
convenience measure for easier administration of the interwiki
tables, but also for better performance. If so, one should also take
into account any performance hit that may come with switching to the
proposed "sites" facility.

Cheers,
-bawolff

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


jeroendedauw at gmail

Aug 10, 2012, 6:40 AM

Post #17 of 20 (1260 views)
Permalink
Re: Wikidata blockers weekly update [In reply to]

Hey,

Having interwikis that support
> url re-writing based on the value does sound cool, but I certainly
> wouldn't want said code in a db blob
>

We do not have code in blobs in the db - that seems like a rather mad
thing to do! :)

The blobs we have are for holding data or config specific to some
site type. Having url re-writing based on the page value does not
require any such type-specific data or config. It requires
type-specific logic, which would just go in the relevant Site-derived
class, for instance MediaWikiSite.

Cheers

--
Jeroen De Dauw
http://www.bn2vs.com
Don't panic. Don't be evil.
--
_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


lists at nadir-seen-fire

Aug 10, 2012, 9:02 AM

Post #18 of 20 (1259 views)
Permalink
Re: Wikidata blockers weekly update [In reply to]

On Fri, 10 Aug 2012 05:03:58 -0700, Denny Vrandečić
<denny.vrandecic [at] wikimedia> wrote:

> Hi Daniel,
>
> thanks for your comments. Some of the suggestions you make would
> extend the functionality beyond what we need right now. They look
> certainly useful, and I don't think that we would make implementing
> them any harder than it is right now -- rather the opposite.
>
> As usual, the perfect and the next-step are great enemies. I
> understand that the patch does not lead to a perfect world directly,
> that would cover all of your use cases -- but it nicely covers ours.
>
> My questions would be:
>
> * do you think that we are going in the wrong direction, or do you
> think we are not going far enough yet?
>
> * do you think that we are making some use cases harder to implement
> in the future than they would be now, and if so which ones?

I don't think you're going far enough.
You're rewriting a core feature in core, but key issues with the old
system that should be fixed in any rewrite of it are explicitly being
repeated just because you don't need them fixed for Wikidata.
I'm not expecting any of you to spend a pile of time writing a UI
because it's missing. But I do expect that if we have a good idea of
what the optimal database schema and usage of the feature is, you'd
make a tiny effort to include the fixes that Wikidata doesn't
explicitly need, instead of rewriting it using a non-optimal format
and forcing someone else to rewrite stuff again.

Take site_local_key as an example. Clearly site_local_key as a single
column does not work. We know from our interwiki experience that we
really want multiple keys. And there is absolutely no value at all in
forcing data duplication.

If we use a proper site_local_key right now, before submitting the
code, it should be a minimal change to the code you have. (Unless the
ORM mapper makes it hard to use joins, in which case you'd be making a
bad choice from the start, since when someone does fix site_local_key
they will need to break interface compatibility.)

If someone tries to do this later, they are going to have to do schema
changes and a full data migration in the updater, AND they are going
to have to find some way to do de-duplication of data.
These are things that wouldn't need to be bothered with at all if the
initial rewrite just took a few extra steps.

> * do you see other issues with the patch that should block it from
> being deployed, and which ones would that be?

I covered a few of them in inline comments on the commit. Things like
the unclear role of group, using ints for site types being bad for
extensibility, etc.

> Cheers,
> Denny

--
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]


_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


jeroendedauw at gmail

Aug 10, 2012, 9:37 AM

Post #19 of 20 (1256 views)
Permalink
Re: Wikidata blockers weekly update [In reply to]

Hey,

> But I do expect that if we have a good idea of what the optimal
> database schema and usage of the feature is, you'd make a tiny
> effort to include the fixes that Wikidata doesn't explicitly need.

This is entirely reasonable to ask. However, this particular change is
not tiny, and it would cost us both effort to implement and make the
change even bigger, while we're trying to keep it small. We actually
did go for the low-hanging fruit we did not need ourselves here, so
implying we don't care about concerns outside of our project would be
short-sighted. After all, strictly speaking we do not _need_ this
rewrite. We could just continue pouring crap onto the current pile and
hope it does not collapse, rather than fixing all of the issues our
change is tackling.

> Instead of rewriting it using a non-optimal format and forcing
> someone else to rewrite stuff again.

We are not touching this, so you would still need to make the change if you
want to fix this issue, but you would not need to do it _again_. To be
honest I don't understand why you have a problem here. We're making it
easier for you to make this change. If you think it's that important, then
let's get our changes through so you can start making yours without us
getting in each other's way.

> Unless the ORM mapper makes it hard to use joins

It basically does not affect joins - it has no facilities for them,
but it does not make them harder either.

Cheers

--
Jeroen De Dauw
http://www.bn2vs.com
Don't panic. Don't be evil.
--
_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


innocentkiller at gmail

Aug 10, 2012, 11:02 AM

Post #20 of 20 (1257 views)
Permalink
Re: Wikidata blockers weekly update [In reply to]

On Fri, Aug 10, 2012 at 9:02 AM, Daniel Friesen
<lists [at] nadir-seen-fire> wrote:
> I don't think you're going far enough.
> You're rewriting a core feature in core, but key issues with the old
> system that should be fixed in any rewrite of it are explicitly being
> repeated just because you don't need them fixed for Wikidata.
> I'm not expecting any of you to spend a pile of time writing a UI
> because it's missing. But I do expect that if we have a good idea of
> what the optimal database schema and usage of the feature is, you'd
> make a tiny effort to include the fixes that Wikidata doesn't
> explicitly need, instead of rewriting it using a non-optimal format
> and forcing someone else to rewrite stuff again.
>

I agree one billion percent with everything you've said here, and it's
the *exact* point I've been trying to make all along.

I have no qualms with people trying to fix this--it's something that
needs to be fixed and has been on my todo list for far longer than
it should've been. But if it's going to be fixed/rewritten, time should
be taken so it is done properly.

-Chad

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
