Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Wikipedia: Wikitech

Architectural revisions to improve category sorting

 

 

First page Previous page 1 2 Next page Last page  View All Wikipedia wikitech RSS feed   Index | Next | Previous | View Threaded


Simetrical+wikilist at gmail

Jul 21, 2010, 2:00 PM

Post #1 of 36 (3867 views)
Permalink
Architectural revisions to improve category sorting

I'm going to begin working on the following bugs:

* "Support collation by a certain locale (sorting order of
characters)", https://bugzilla.wikimedia.org/show_bug.cgi?id=164 (only
parts related to category sorting)
* "Subcategory paging is not separate from article or image paging",
https://bugzilla.wikimedia.org/show_bug.cgi?id=1211
* "CategoryTree is inefficient",
https://bugzilla.wikimedia.org/show_bug.cgi?id=23682

As well as possibly:

* "Categories need to be structured by namespace",
https://bugzilla.wikimedia.org/show_bug.cgi?id=450
* "Natural number sorting in category listings",
https://bugzilla.wikimedia.org/show_bug.cgi?id=6948

There are essentially two problems here:

1) We currently sort articles on category pages by the Unicode code
point of their sort key. This is terrible for anything other than
English, and dodgy sometimes even for English. (This is bugs 164 and
6948.)

2) We have no way to efficiently get all items that are in a category
and also in a particular namespace. Particularly, we can't retrieve
all subcategories without scanning all items in the category, which is
inefficient when we have a few (or no) subcategories and tons of
items. (This is bugs 1211, 23682, and 450.)

One part of (2) needs to be clarified. The primary use-case is
obviously that we want to be able to count subcategories efficiently,
or display all of them when we only display some of the items in the
category: this is bugs 1211 and 23682. Secondarily, we have a request
at bug 450 to organize category pages by namespace, so main, Talk:,
User:, etc. are all paginated separately.

I think the goal for (2) should be to allow efficient separate
retrieval of subcategories, files, and other pages, but not to
distinguish between namespaces otherwise. The major motivation is
that to do this efficiently, we'll need to add namespace info to the
categorylinks table, and we want this to stay consistent with the info
in the page table. Categories, files, and other types of pages cannot
be moved to one another, as far as I know (it would hardly make
sense), so it automatically stays consistent this way. This is a big
plus, because there are inevitably bugs that cause denormalized data
to fall out of sync (look at cat_pages).

Furthermore, I don't think it's obvious that we want separate
namespaces to display separately at all on category pages. What's a
case where that would be desired? It would break up the display a
lot, with a bunch of separate headers for different namespaces, when
each namespace might only have a few items. Most categories whose
sort appearance you'd care about (i.e., excepting maintenance
categories) will have nearly everything in one namespace anyway. You
could always split the category into separate ones per namespace if
you want them separate.

So I propose that we keep the current category/normal page/file split,
and paginate those three parts of the page separately. So you'd have
up to 200 subcategories, then below that up to 200 normal pages, then
below that up to 200 files. (The numbers could be adjusted.
Currently they're hardcoded, which is stupid.) Paginating
subcategories separately is obviously needed. Paginating files
separately is not really needed, but it would be much more consistent.

The overall solution, then, would be:

1) Change the way category sortkeys are generated. Start them with a
letter depending on namespace, like 'C' for category, 'P' for regular
page, 'F' for file. After that first letter, append a sortkey
generated by ICU or whatever. I think Tim has opinions on what would
be a good choice to convert the article title into sort key -- if not,
I'll have to research it and hopefully not come up with a completely
incorrect answer.

2) On category pages, maintain three offsets and do three queries (or
maybe UNION them together, doesn't matter), one for each of
categories/regular pages/files. Because of (1), this will be
efficient and will also sort less unreasonably for non-English
languages.

One problem that was pointed out somewhere in the massive useless
discussion on bug 164 is that we'd have to do something to display the
first letter for each section. Currently it's just the first letter
of the sortkey, but if that's some binary string, that becomes a
problem. I'm not seeing an obvious solution, since the
sortkey-generation algorithm will be opaque to us. If it sorts Á the
same as A, then how do we figure out that the "canonical" first letter
for the section should be "A" and not "Á"? How do we even figure out
where the sections begin or end? Would that even make sense in all
cases? At a first pass, I'd say we should just skip the first letter
and display all the items straight from beginning to end without
section divisions. I don't think that's a big problem.

This is just my initial thoughts. Feedback appreciated. If people
agree with the general approach, I can start coding this up tomorrow.

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


roan.kattouw at gmail

Jul 21, 2010, 2:49 PM

Post #2 of 36 (3734 views)
Permalink
Re: Architectural revisions to improve category sorting [In reply to]

2010/7/21 Aryeh Gregor <Simetrical+wikilist [at] gmail>:
> Categories, files, and other types of pages cannot
> be moved to one another, as far as I know (it would hardly make
> sense), so it automatically stays consistent this way.
This is true for categories but not for files:
http://www.mediawiki.org/w/index.php?title=Special:Log&dir=prev&offset=20091202100459&limit=2&type=move&user=Catrope

> So I propose that we keep the current category/normal page/file split,
> and paginate those three parts of the page separately. So you'd have
> up to 200 subcategories, then below that up to 200 normal pages, then
> below that up to 200 files. (The numbers could be adjusted.
> Currently they're hardcoded, which is stupid.) Paginating
> subcategories separately is obviously needed. Paginating files
> separately is not really needed, but it would be much more consistent.
>
Sounds good to me. I do think we will want to page all three of these
things separately, it'd be stupidly inconsistent not to do that.

> The overall solution, then, would be:
>
> 1) Change the way category sortkeys are generated. Start them with a
> letter depending on namespace, like 'C' for category, 'P' for regular
> page, 'F' for file. After that first letter, append a sortkey
> generated by ICU or whatever. I think Tim has opinions on what would
> be a good choice to convert the article title into sort key -- if not,
> I'll have to research it and hopefully not come up with a completely
> incorrect answer.
>
Note that different languages will want different orders. For
instance, German generally sorts as ae, as oe and as ue, whereas
the Swedish sort , and at the end of the alphabet (so they
actually say A, B, C, ... Z, , , and use the phrase "from A to
"). These collation schemes obviously conflict in their handling of
and , and I'm sure there's crazier stuff out there.

This could be solved by having a different collation scheme for each
content language (these have to be standardized *somewhere*, right?)
and using {{DEFAULTSORT:}} for those rare cases where you have an
article about a German person on a non-German wiki and want it to sort
the German way.

> 2) On category pages, maintain three offsets and do three queries (or
> maybe UNION them together, doesn't matter),
In my personal opinion, UNION makes zero sense because you'd have to
pull the data apart again after querying it, as you're displaying it
separately as well. Separate queries are much cleaner in this case.

> One problem that was pointed out somewhere in the massive useless
> discussion on bug 164 is that we'd have to do something to display the
> first letter for each section. Currently it's just the first letter
> of the sortkey, but if that's some binary string, that becomes a
> problem. I'm not seeing an obvious solution, since the
> sortkey-generation algorithm will be opaque to us. If it sorts the
> same as A, then how do we figure out that the "canonical" first letter
> for the section should be "A" and not ""? How do we even figure out
> where the sections begin or end? Would that even make sense in all
> cases? At a first pass, I'd say we should just skip the first letter
> and display all the items straight from beginning to end without
> section divisions. I don't think that's a big problem.
>
I agree that the first-letter thing is a nice-to-have, but I'm more
worried about the general problem that sortkeys won't be
human-readable strings anymore (the API currently displays them and,
obviously, uses them for paging) nor possible to decode into
human-readable strings (because the encoding essentially loses
information when e.g. a and are folded). It would be nice if we
could store the original, unmunged sortkey in the categorylinks table,
although I realize that would eat space for display and debugging
purposes only.

Roan Kattouw (Catrope)

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


conrad.irwin at gmail

Jul 21, 2010, 3:04 PM

Post #3 of 36 (3735 views)
Permalink
Re: Architectural revisions to improve category sorting [In reply to]

On 21 July 2010 14:49, Roan Kattouw <roan.kattouw [at] gmail> wrote:
> 2010/7/21 Aryeh Gregor <Simetrical+wikilist [at] gmail>:

> Note that different languages will want different orders. For
> instance, German generally sorts as ae, as oe and as ue, whereas
> the Swedish sort , and at the end of the alphabet (so they
> actually say A, B, C, ... Z, , , and use the phrase "from A to
> "). These collation schemes obviously conflict in their handling of
> and , and I'm sure there's crazier stuff out there.
>
> This could be solved by having a different collation scheme for each
> content language (these have to be standardized *somewhere*, right?)
> and using {{DEFAULTSORT:}} for those rare cases where you have an
> article about a German person on a non-German wiki and want it to sort
> the German way.

For Wiktionary, every language is included in one wiki (and even on
one page) - it would be phenomenal to be able to select the collation
per category. As per-page or per-wiki will not help very much at all.

>
>> 2) On category pages, maintain three offsets and do three queries (or
>> maybe UNION them together, doesn't matter),
> In my personal opinion, UNION makes zero sense because you'd have to
> pull the data apart again after querying it, as you're displaying it
> separately as well. Separate queries are much cleaner in this case.
>
>> One problem that was pointed out somewhere in the massive useless
>> discussion on bug 164 is that we'd have to do something to display the
>> first letter for each section. Currently it's just the first letter
>> of the sortkey, but if that's some binary string, that becomes a
>> problem. I'm not seeing an obvious solution, since the
>> sortkey-generation algorithm will be opaque to us. If it sorts the
>> same as A, then how do we figure out that the "canonical" first letter
>> for the section should be "A" and not ""? How do we even figure out
>> where the sections begin or end? Would that even make sense in all
>> cases? At a first pass, I'd say we should just skip the first letter
>> and display all the items straight from beginning to end without
>> section divisions. I don't think that's a big problem.
>>
> I agree that the first-letter thing is a nice-to-have, but I'm more
> worried about the general problem that sortkeys won't be
> human-readable strings anymore (the API currently displays them and,
> obviously, uses them for paging) nor possible to decode into
> human-readable strings (because the encoding essentially loses
> information when e.g. a and are folded). It would be nice if we
> could store the original, unmunged sortkey in the categorylinks table,
> although I realize that would eat space for display and debugging
> purposes only.

There is no way to go from the sort-key to the first letter and
indeed, you can't even put the first letter at the start of the sort
key, as you need to sort the sections differently per language. The
solution I use for generating the indices on Wiktionary is to store
the first letter explicitly (either of the page or the user-provided
sort key before they are fed into ICU). This would (in the future)
allow "topical" categories, but that's juts a distraction for now.

Conrad

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Simetrical+wikilist at gmail

Jul 21, 2010, 3:15 PM

Post #4 of 36 (3740 views)
Permalink
Re: Architectural revisions to improve category sorting [In reply to]

On Wed, Jul 21, 2010 at 5:49 PM, Roan Kattouw <roan.kattouw [at] gmail> wrote:
> This is true for categories but not for files:
> http://www.mediawiki.org/w/index.php?title=Special:Log&dir=prev&offset=20091202100459&limit=2&type=move&user=Catrope

Blech. Does this make any sense? Can we change it? It would simply
this considerably.

> Note that different languages will want different orders. For
> instance, German generally sorts ä as ae, ö as oe and ü as ue, whereas
> the Swedish sort å, ä and ö at the end of the alphabet (so they
> actually say A, B, C, ... Z, Å, Ä, Ö and use the phrase "from A to
> Ö"). These collation schemes obviously conflict in their handling of ä
> and ö, and I'm sure there's crazier stuff out there.
>
> This could be solved by having a different collation scheme for each
> content language (these have to be standardized *somewhere*, right?)
> and using {{DEFAULTSORT:}} for those rare cases where you have an
> article about a German person on a non-German wiki and want it to sort
> the German way.

Yes, of course. I'm assuming that the magical sortkey-generator I'm
plugging into here is locale-specific.

> In my personal opinion, UNION makes zero sense because you'd have to
> pull the data apart again after querying it, as you're displaying it
> separately as well. Separate queries are much cleaner in this case.

It's pretty simple to do either way. Makes no big difference.

> I agree that the first-letter thing is a nice-to-have, but I'm more
> worried about the general problem that sortkeys won't be
> human-readable strings anymore (the API currently displays them and,
> obviously, uses them for paging) nor possible to decode into
> human-readable strings (because the encoding essentially loses
> information when e.g. a and á are folded). It would be nice if we
> could store the original, unmunged sortkey in the categorylinks table,
> although I realize that would eat space for display and debugging
> purposes only.

This would also require altering the table. Why is it necessary? For
paging, we can just use cl_from to stick in the URL, and retrieve
cl_sortkey based on that and cl_to. That will make it be short and
not look horribly ugly. When do we ever need a human-readable form of
the sortkey, as opposed to a human-readable form of the title? API
users should keep working when this happens with no special code
changes on server or client, just they'll have horribly long and ugly
URLs with encoded binary. Sortkeys are often weird and not suitable
for display to humans anyway, like when "*" is used.

I'm not seeing this as worth adding a fourth field to categorylinks,
which is a huge table already.

On Wed, Jul 21, 2010 at 6:04 PM, Conrad Irwin <conrad.irwin [at] gmail> wrote:
> For Wiktionary, every language is included in one wiki (and even on
> one page) - it would be phenomenal to be able to select the collation
> per category. As per-page or per-wiki will not help very much at all.

Why won't per-page help? I'm not understanding clearly here. I don't
think it would be too much trouble to add per-page and per-category
parser functions to set the language used for sort keys, though.

> There is no way to go from the sort-key to the first letter  and
> indeed, you can't even put the first letter at the start of the sort
> key, as you need to sort the sections differently per language. The
> solution I use for generating the indices on Wiktionary is to store
> the first letter explicitly (either of the page or the user-provided
> sort key before they are fed into ICU). This would (in the future)
> allow "topical" categories, but that's juts a distraction for now.

But different articles that are sorted as though they started with the
same letter might not actually start with the same letter, so how do
we figure out which first letter is the correct one? This is a
problem even if you're just dealing with accented letters -- I have no
idea how this stuff works (or doesn't work) for CJK or whatnot.
(Judging by these:

http://ja.wikipedia.org/wiki/Category:%E5%AD%98%E5%91%BD%E4%BA%BA%E7%89%A9
http://zh.wikipedia.org/wiki/Category:%E5%9C%A8%E4%B8%96%E4%BA%BA%E7%89%A9
http://zh-yue.wikipedia.org/wiki/Category:%E5%9C%A8%E4%B8%96%E4%BA%BA%E7%89%A9

the strategy is just to manually force sortkeys to begin with
something like "A" or "あ". Cantonese doesn't do this, and it ends up
with one article per "letter" in many cases.)

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


daniel at brightbyte

Jul 21, 2010, 3:18 PM

Post #5 of 36 (3748 views)
Permalink
Re: Architectural revisions to improve category sorting [In reply to]

Aryeh Gregor schrieb:
> * "Categories need to be structured by namespace",
> https://bugzilla.wikimedia.org/show_bug.cgi?id=450
> * "Natural number sorting in category listings",
> https://bugzilla.wikimedia.org/show_bug.cgi?id=6948

While we definitly need efficient retrieval by namespace, the default sort key
should *not* include the namespace prefix. it's very annoying that all files get
sorted under "F" currently, or that pages from the Wikipedia namespace all end
up under "W".

-- daniel

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


conrad.irwin at gmail

Jul 21, 2010, 3:22 PM

Post #6 of 36 (3732 views)
Permalink
Re: Architectural revisions to improve category sorting [In reply to]

On 21 July 2010 15:15, Aryeh Gregor <Simetrical+wikilist [at] gmail> wrote:
>
> Why won't per-page help? I'm not understanding clearly here. I don't
> think it would be too much trouble to add per-page and per-category
> parser functions to set the language used for sort keys, though.
>

Because there are multiple languages on each page - so you need lots
of different sort keys.

>
> But different articles that are sorted as though they started with the
> same letter might not actually start with the same letter, so how do
> we figure out which first letter is the correct one? This is a
> problem even if you're just dealing with accented letters -- I have no
> idea how this stuff works (or doesn't work) for CJK or whatnot.

If it's sorted as starting with "a" it should appear under "a". The
alternative would be to have different explicit sorting for the
sections in the category than for the words in the section, which I
think is unnecessary.

Conrad

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


ariel at wikimedia

Jul 21, 2010, 3:27 PM

Post #7 of 36 (3733 views)
Permalink
Re: Architectural revisions to improve category sorting [In reply to]

Στις 21-07-2010, ημέρα Τετ, και ώρα 23:49 +0200, ο/η Roan Kattouw
έγραψε:
> 2010/7/21 Aryeh Gregor <Simetrical+wikilist [at] gmail>:
>
> > The overall solution, then, would be:
> >
> > 1) Change the way category sortkeys are generated. Start them with a
> > letter depending on namespace, like 'C' for category, 'P' for regular
> > page, 'F' for file. After that first letter, append a sortkey
> > generated by ICU or whatever. I think Tim has opinions on what would
> > be a good choice to convert the article title into sort key -- if not,
> > I'll have to research it and hopefully not come up with a completely
> > incorrect answer.
> >
> Note that different languages will want different orders. For
> instance, German generally sorts ä as ae, ö as oe and ü as ue, whereas
> the Swedish sort å, ä and ö at the end of the alphabet (so they
> actually say A, B, C, ... Z, Å, Ä, Ö and use the phrase "from A to
> Ö"). These collation schemes obviously conflict in their handling of ä
> and ö, and I'm sure there's crazier stuff out there.
>
> This could be solved by having a different collation scheme for each
> content language (these have to be standardized *somewhere*, right?)
> and using {{DEFAULTSORT:}} for those rare cases where you have an
> article about a German person on a non-German wiki and want it to sort
> the German way.
>

It's much worse than that. On the Wiktionary projects, which have the
modest goal of documenting all words in all languages *on each language
edition*, you can expect to find german, french, arabic, etc. lemmas on
say the russian project. Even worse, you can expect to find multiple
lemmas on one page if the lemma has the same orthography in more than
one language. If the sort order differs between languages it's loads of
fun.

Ariel



_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Simetrical+wikilist at gmail

Jul 21, 2010, 3:28 PM

Post #8 of 36 (3732 views)
Permalink
Re: Architectural revisions to improve category sorting [In reply to]

On Wed, Jul 21, 2010 at 6:18 PM, Daniel Kinzler <daniel [at] brightbyte> wrote:
> While we definitly need efficient retrieval by namespace, the default sort key
> should *not* include the namespace prefix. it's very annoying that all files get
> sorted under "F" currently, or that pages from the Wikipedia namespace all end
> up under "W".

That's totally orthogonal and is like a one-line change. Probably you
just have to change getPrefixedDBkey() to getDBkey() somewhere.

On Wed, Jul 21, 2010 at 6:22 PM, Conrad Irwin <conrad.irwin [at] gmail> wrote:
> Because there are multiple languages on each page - so you need lots
> of different sort keys.

Could you point me to an example of some pages and categories where
this is an issue? I'm not clear on how categories/pages/sort keys are
being used here.

> If it's sorted as starting with "a" it should appear under "a". The
> alternative would be to have different explicit sorting for the
> sections in the category than for the words in the section, which I
> think is unnecessary.

So if we have three pages "Áa", "Ab", "Ác" and they're sorted in the
category in that order, should they be in one section? I don't see
how you'd put them in two or three sections. If they're in one
section, what letter do you use for it, "Á" or "A"? We can figure out
"A" is correct here, but how do you do that in general automatically?

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


conrad.irwin at gmail

Jul 21, 2010, 3:45 PM

Post #9 of 36 (3731 views)
Permalink
Re: Architectural revisions to improve category sorting [In reply to]

On 21 July 2010 15:28, Aryeh Gregor <Simetrical+wikilist [at] gmail> wrote:
> On Wed, Jul 21, 2010 at 6:18 PM, Daniel Kinzler <daniel [at] brightbyte> wrote:
>> While we definitly need efficient retrieval by namespace, the default sort key
>> should *not* include the namespace prefix. it's very annoying that all files get
>> sorted under "F" currently, or that pages from the Wikipedia namespace all end
>> up under "W".
>
> That's totally orthogonal and is like a one-line change.  Probably you
> just have to change getPrefixedDBkey() to getDBkey() somewhere.
>
> On Wed, Jul 21, 2010 at 6:22 PM, Conrad Irwin <conrad.irwin [at] gmail> wrote:
>> Because there are multiple languages on each page - so you need lots
>> of different sort keys.
>
> Could you point me to an example of some pages and categories where
> this is an issue?  I'm not clear on how categories/pages/sort keys are
> being used here.

I don't have an example to hand (as the page is not yet complete on Wiktionary)
The Hungarian letter "cs" sorts after "c", so while in English "cs"
(for centi-seconds) should come before "CV", in Hungarian the entry
for the letter (which is missing) should come afterwards. Both English
and Hungarian would be on the same Wiktionary page.

>
>> If it's sorted as starting with "a" it should appear under "a". The
>> alternative would be to have different explicit sorting for the
>> sections in the category than for the words in the section, which I
>> think is unnecessary.
>
> So if we have three pages "Áa", "Ab", "Ác" and they're sorted in the
> category in that order, should they be in one section?  I don't see
> how you'd put them in two or three sections.  If they're in one
> section, what letter do you use for it, "Á" or "A"?  We can figure out
> "A" is correct here, but how do you do that in general automatically?
>

Some languages treat accented letters as the same primary letter, and
use it only in the secondary or tertiary sort key (Which the current
category table's keys of 80 bytes are in danger of truncating), others
have variations on a theme, again Hungarian makes a good example, ö
and ő are the one letter with two stresses, o and ó is a different
letter. It should be automatically possible to extract the first
letter from the words to be sorted (I don't know if ICU covers that,
if not, just ask some people who speak the language, or Wikipedia) -
but it's not possible to get that information from the sort keys
directly, so either we store the user provided sort key, and our
derived sort key, so we can use the former to find the first letter at
render time, or we just store the first letter.

Conrad

> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l [at] lists
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


roan.kattouw at gmail

Jul 21, 2010, 4:03 PM

Post #10 of 36 (3735 views)
Permalink
Re: Architectural revisions to improve category sorting [In reply to]

2010/7/22 Aryeh Gregor <Simetrical+wikilist [at] gmail>:
> On Wed, Jul 21, 2010 at 5:49 PM, Roan Kattouw <roan.kattouw [at] gmail> wrote:
>> This is true for categories but not for files:
>> http://www.mediawiki.org/w/index.php?title=Special:Log&dir=prev&offset=20091202100459&limit=2&type=move&user=Catrope
>
> Blech. Does this make any sense? Can we change it? It would simply
> this considerably.
>
It doesn't make a great deal of sense and can be changed fairly easily
in Title::isValidMoveTarget().
> This would also require altering the table. Why is it necessary? For
> paging, we can just use cl_from to stick in the URL, and retrieve
> cl_sortkey based on that and cl_to. That will make it be short and
> not look horribly ugly. When do we ever need a human-readable form of
> the sortkey, as opposed to a human-readable form of the title? API
> users should keep working when this happens with no special code
> changes on server or client, just they'll have horribly long and ugly
> URLs with encoded binary. Sortkeys are often weird and not suitable
> for display to humans anyway, like when "*" is used.
>
> I'm not seeing this as worth adding a fourth field to categorylinks,
> which is a huge table already.
>
Hm, you're right that even in the API there is very little value in
displaying the sortkeys, so let's just drop it.

> But different articles that are sorted as though they started with the
> same letter might not actually start with the same letter, so how do
> we figure out which first letter is the correct one?
Yeah, this is pretty much impossible: when you fold A and , there's
no guarantee that the first entry (or even the majority of entries)
for A will really be A's rather than 's.

Roan Kattouw (Catrope)

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


roan.kattouw at gmail

Jul 21, 2010, 4:05 PM

Post #11 of 36 (3729 views)
Permalink
Re: Architectural revisions to improve category sorting [In reply to]

2010/7/22 Aryeh Gregor <Simetrical+wikilist [at] gmail>:
> On Wed, Jul 21, 2010 at 6:18 PM, Daniel Kinzler <daniel [at] brightbyte> wrote:
>> While we definitly need efficient retrieval by namespace, the default sort key
>> should *not* include the namespace prefix. it's very annoying that all files get
>> sorted under "F" currently, or that pages from the Wikipedia namespace all end
>> up under "W".
>
> That's totally orthogonal and is like a one-line change. Probably you
> just have to change getPrefixedDBkey() to getDBkey() somewhere.
>
$wgCategoryPrefixedDefaultSortkey currently defaults to true, we could
make that default to false instead.

Roan Kattouw (Catrope)

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


tstarling at wikimedia

Jul 22, 2010, 12:01 AM

Post #12 of 36 (3728 views)
Permalink
Re: Architectural revisions to improve category sorting [In reply to]

On 22/07/10 07:00, Aryeh Gregor wrote:
> Categories, files, and other types of pages cannot
> be moved to one another, as far as I know (it would hardly make
> sense), so it automatically stays consistent this way.

This restriction is enforced by Title::isValidMoveOperation().

> 1) Change the way category sortkeys are generated. Start them with a
> letter depending on namespace, like 'C' for category, 'P' for regular
> page, 'F' for file. After that first letter, append a sortkey
> generated by ICU or whatever.

An alternative would be to add a column to the categorylinks table,
say cl_type. It could be an ENUM or some short text type. Then the
index could be altered to include this field at the start of it.

Presumably the rationale for combining these two things into
cl_sortkey is to avoid a schema change, and to make the paging code
slightly simpler. But I worry that future generations of MediaWiki
developers will curse us for laziness and obfuscation.

> I think Tim has opinions on what would
> be a good choice to convert the article title into sort key -- if not,
> I'll have to research it and hopefully not come up with a completely
> incorrect answer.

Well, I've said ICU, possibly with a PHP simulation of some Western
European sort key algorithm for the benefit of users without access to
ICU. But I formed that opinion years ago, and I never properly
surveyed all the possible solutions in the first place. It probably
makes sense to do a little of your own research.

Note that I specifically excluded the actual implementation of
language-dependent sort keys from the requirements list when I wrote
up this project. It could easily eat up a lot of time, and it's not
necessary for a proof-of-principle implementation.

> 2) On category pages, maintain three offsets and do three queries (or
> maybe UNION them together, doesn't matter), one for each of
> categories/regular pages/files. Because of (1), this will be
> efficient and will also sort less unreasonably for non-English
> languages.
>
> One problem that was pointed out somewhere in the massive useless
> discussion on bug 164 is that we'd have to do something to display the
> first letter for each section. Currently it's just the first letter
> of the sortkey, but if that's some binary string, that becomes a
> problem. I'm not seeing an obvious solution, since the
> sortkey-generation algorithm will be opaque to us. If it sorts Á the
> same as A, then how do we figure out that the "canonical" first letter
> for the section should be "A" and not "Á"? How do we even figure out
> where the sections begin or end? Would that even make sense in all
> cases? At a first pass, I'd say we should just skip the first letter
> and display all the items straight from beginning to end without
> section divisions. I don't think that's a big problem.

Roan is also asking for a store of the plain text form in this thread.

Work out how much space we would need to additionally store the
category keys in plain text. Then we will know what sort of tradeoff
we are looking at. Have you got a toolserver account you can use to do
the sums?

Since we won't be sorting on the plain text form anymore, we could use
some tricks to save space. For instance, if the sort key is the same
as the article title, we could store NULL instead of another copy of
the article title. That should save 95% or so.

-- Tim Starling


_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


roan.kattouw at gmail

Jul 22, 2010, 12:37 AM

Post #13 of 36 (3735 views)
Permalink
Re: Architectural revisions to improve category sorting [In reply to]

2010/7/22 Tim Starling <tstarling [at] wikimedia>:
> An alternative would be to add a column to the categorylinks table,
> say cl_type. It could be an ENUM or some short text type. Then the
> index could be altered to include this field at the start of it.
>
> Presumably the rationale for combining these two things into
> cl_sortkey is to avoid a schema change, and to make the paging code
> slightly simpler. But I worry that future generations of MediaWiki
> developers will curse us for laziness and obfuscation.
>
There is another reason to prefer this schema, which is that the
orginially proposed one is susceptible to weird transition bugs. After
this feature is deployed, there will be old-format (i.e. plain)
sortkeys sticking around in the database for quite some time after
(they won't go away until LinksUpdate fixes them), which means that
pages whose sortkey starts with a C or F will be recognized as
categories and files respectively, even if they're normal pages.

The best way to mitigate that is to populate the namespace information
prior to deployment. In Tim's schema, that means filling the cl_type
field based on page_namespace. In the sortkey prefix schema, that
means prefixing the sortkey with the relevant sortkey, but that also
requires the sortkey updating code has already been updated at that
time (so it doesn't overwrite new-style sortkeys with old-style ones),
which means you'd have to partially deploy the code while running the
population script. Yuck.

> Roan is also asking for a store of the plain text form in this thread.
>
I have since conceded that it takes space for almost zero gain (as it
doesn't solve the first letter problem and doesn't have much gain for
the API either), see my previous post.

Roan Kattouw (Catrope)

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


dgerard at gmail

Jul 22, 2010, 2:34 AM

Post #14 of 36 (3732 views)
Permalink
Re: Architectural revisions to improve category sorting [In reply to]

The only thing I would ask as a user (and filer of bug 1211, all those
years ago) in terms of the interface:

Please don't remove the feature where the first letter of the sort key
is displayed in the rendered category page, and if necessary add what
it takes to keep it.

There are scripts where this will be a hard problem, but it's still
much-used and much-loved in those where it isn't.


- d.

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Simetrical+wikilist at gmail

Jul 22, 2010, 9:34 AM

Post #15 of 36 (3716 views)
Permalink
Re: Architectural revisions to improve category sorting [In reply to]

On Wed, Jul 21, 2010 at 6:45 PM, Conrad Irwin <conrad.irwin [at] gmail> wrote:
> I don't have an example to hand (as the page is not yet complete on Wiktionary)
> The Hungarian letter "cs" sorts after "c", so while in English "cs"
> (for centi-seconds) should come before "CV", in Hungarian the entry
> for the letter (which is missing) should come afterwards. Both English
> and Hungarian would be on the same Wiktionary page.

Okay, I see. I don't think this would be terribly hard, although I
don't think it's needed for an initial implementation. The major
problem I see is that if the sort collation is per-category, then
changing it on a preexisting large category will require reparsing all
the pages, probably. (Unless we store the raw sortkeys as well.)

> Some languages treat accented letters as the same primary letter, and
> use it only in the secondary or tertiary sort key (Which the current
> category table's keys of 80 bytes are in danger of truncating), others
> have variations on a theme, again Hungarian makes a good example, ö
> and ő are the one letter with two stresses, o and ó is a different
> letter. It should be automatically possible to extract the first
> letter from the words to be sorted (I don't know if ICU covers that,
> if not, just ask some people who speak the language, or Wikipedia) -
> but it's not possible to get that information from the sort keys
> directly, so either we store the user provided sort key, and our
> derived sort key, so we can use the former to find the first letter at
> render time, or we just store the first letter.

I don't see an answer to my question here. Given a sorted list of
sortkeys, possibly including the raw sortkey as well as the one that's
been put through ICU/CLDR/whatever, what algorithm do you propose to
break it up into sections labeled with first letters? In particular,
any such algorithm should not conflict with the sort order, in the
sense that you should not have three words A, B, C sorted as A < B < C
where firstLetter(A) == firstLetter(C) != firstLetter(B). Is this
reasonably possible to guarantee in all alphabetic languages'
conventional sort orders?

If we do store the raw sort key, we could have some Language method to
retrieve the section name, and just write our own implementations for
various languages. However, I'm not sure this is worth the effort.

On Wed, Jul 21, 2010 at 7:03 PM, Roan Kattouw <roan.kattouw [at] gmail> wrote:
> It doesn't make a great deal of sense and can be changed fairly easily
> in Title::isValidMoveTarget().

On Thu, Jul 22, 2010 at 3:01 AM, Tim Starling <tstarling [at] wikimedia> wrote:
> This restriction is enforced by Title::isValidMoveOperation().

Any objections to changing this so files can't be moved over non-files
or vice versa?

> An alternative would be to add a column to the categorylinks table,
> say cl_type. It could be an ENUM or some short text type. Then the
> index could be altered to include this field at the start of it.
>
> Presumably the rationale for combining these two things into
> cl_sortkey is to avoid a schema change, and to make the paging code
> slightly simpler. But I worry that future generations of MediaWiki
> developers will curse us for laziness and obfuscation.

I'm okay with this.

> Well, I've said ICU, possibly with a PHP simulation of some Western
> European sort key algorithm for the benefit of users without access to
> ICU. But I formed that opinion years ago, and I never properly
> surveyed all the possible solutions in the first place. It probably
> makes sense to do a little of your own research.

Gerard Meijssen felt strongly that we should use something based on
CLDR. Apparently we have connections there and work with them a lot,
and I guess he feels it's higher-quality or such.

> Note that I specifically excluded the actual implementation of
> language-dependent sort keys from the requirements list when I wrote
> up this project. It could easily eat up a lot of time, and it's not
> necessary for a proof-of-principle implementation.

All right. Then I'll just do whatever's readily available and fits in
the database column.

> Work out how much space we would need to additionally store the
> category keys in plain text. Then we will know what sort of tradeoff
> we are looking at. Have you got a toolserver account you can use to do
> the sums?

Yes, I'm a toolserver root.

> Since we won't be sorting on the plain text form anymore, we could use
> some tricks to save space. For instance, if the sort key is the same
> as the article title, we could store NULL instead of another copy of
> the article title. That should save 95% or so.

It doesn't seem like it would save nearly that much. On the Welsh
Wikipedia (small enough database to be manageable), I get the
following:

cl_from=page_id WHERE REPLACE(cl_sortkey, ' ', '_') != page_title;
+-------------------------+
| SUM(LENGTH(cl_sortkey)) |
+-------------------------+
| 551851 |
+-------------------------+
1 row in set (1.94 sec)

cl_from=page_id;
+-------------------------+
| SUM(LENGTH(cl_sortkey)) |
+-------------------------+
| 1619747 |
+-------------------------+
1 row in set (0.44 sec)

cl_from=page_id WHERE REPLACE(cl_sortkey, ' ', '_') != page_title AND
page_namespace = 0;
+-------------------------+
| SUM(LENGTH(cl_sortkey)) |
+-------------------------+
| 347539 |
+-------------------------+
1 row in set (0.20 sec)

cl_from=page_id WHERE page_namespace = 0;
+-------------------------+
| SUM(LENGTH(cl_sortkey)) |
+-------------------------+
| 1067588 |
+-------------------------+
1 row in set (0.19 sec)

I filtered out the main namespace in the last two to avoid false
positives from namespace prefixes. This suggests savings of maybe
50-75%. The story may be different on larger wikis. It's worth
remembering, though, that a lot of these sortkeys might be set to work
around deficiencies in the current default sortkey generation, so
maybe it would be higher savings in the long term.

It's still not at all clear to me that saving a raw copy in the
database is worth it. If we really need sectioning by first letter on
category pages, we could save the first letter instead, and leave that
NULL when it's the same as the first letter of the page title (all of
this for some locale-specific definition of "first letter"). But I
don't know if we need that.

On Thu, Jul 22, 2010 at 3:37 AM, Roan Kattouw <roan.kattouw [at] gmail> wrote:
> There is another reason to prefer this schema, which is that the
> orginially proposed one is susceptible to weird transition bugs. After
> this feature is deployed, there will be old-format (i.e. plain)
> sortkeys sticking around in the database for quite some time after
> (they won't go away until LinksUpdate fixes them), which means that
> pages whose sortkey starts with a C or F will be recognized as
> categories and files respectively, even if they're normal pages.
>
> The best way to mitigate that is to populate the namespace information
> prior to deployment. In Tim's schema, that means filling the cl_type
> field based on page_namespace. In the sortkey prefix schema, that
> means prefixing the sortkey with the relevant sortkey, but that also
> requires the sortkey updating code has already been updated at that
> time (so it doesn't overwrite new-style sortkeys with old-style ones),
> which means you'd have to partially deploy the code while running the
> population script. Yuck.

This whole problems arises for sortkey changes generally. It will be
just as much of a problem when going to a new sortkey type (based on
CLDR or whatever). The only way to avoid it is to create a new
column, populate it while maintaining both columns at once, start
using the new column once it's fully populated, and then drop the old
column. That seems excessive. Remember that we can convert the
current raw sortkey into ICU/CLDR/whatever without reparsing pages, as
long as we can reliably tell old from new sortkeys (should be pretty
easy to do heuristically). So it shouldn't take forever -- surely no
more than a day or two even for enwiki.

On Thu, Jul 22, 2010 at 5:34 AM, David Gerard <dgerard [at] gmail> wrote:
> Please don't remove the feature where the first letter of the sort key
> is displayed in the rendered category page, and if necessary add what
> it takes to keep it.
>
> There are scripts where this will be a hard problem, but it's still
> much-used and much-loved in those where it isn't.

Is it? What use does it serve? We don't have it for any other type
of list. We have zillions of types of page lists, and category pages
are the only ones that have the first letter displayed. It makes the
columns uneven, and is completely crazy for some scripts (like CJK,
AFAICT).

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


dgerard at gmail

Jul 22, 2010, 9:38 AM

Post #16 of 36 (3718 views)
Permalink
Re: Architectural revisions to improve category sorting [In reply to]

On 22 July 2010 17:34, Aryeh Gregor <Simetrical+wikilist [at] gmail> wrote:
> On Thu, Jul 22, 2010 at 5:34 AM, David Gerard <dgerard [at] gmail> wrote:

>> Please don't remove the feature where the first letter of the sort key
>> is displayed in the rendered category page, and if necessary add what
>> it takes to keep it.
>> There are scripts where this will be a hard problem, but it's still
>> much-used and much-loved in those where it isn't.

> Is it?  What use does it serve?  We don't have it for any other type
> of list.  We have zillions of types of page lists, and category pages
> are the only ones that have the first letter displayed.  It makes the
> columns uneven, and is completely crazy for some scripts (like CJK,
> AFAICT).


I'm basing this claim on the remarkable effort people go to to get
things to sort on the category page *just right*. People clearly use
this feature, and work hard to use it.

Take it to the VP (technical) on en:wp and I would be amazed if people
don't protest hugely at the prospect of it disappearing. Maybe I'm
wrong, but asking the users would be a good idea first.


- d.

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


b-jorsch at northwestern

Jul 22, 2010, 10:25 AM

Post #17 of 36 (3721 views)
Permalink
Re: Architectural revisions to improve category sorting [In reply to]

On Wed, Jul 21, 2010 at 05:00:43PM -0400, Aryeh Gregor wrote:
>
> 2) We have no way to efficiently get all items that are in a category
> and also in a particular namespace. Particularly, we can't retrieve
> all subcategories without scanning all items in the category, which is
> inefficient when we have a few (or no) subcategories and tons of
> items. (This is bugs 1211, 23682, and 450.)

"Categorymembers namespace filtering is inefficient, uses ugly hack in
miser mode", https://bugzilla.wikimedia.org/show_bug.cgi?id=19640 , is
also very related.

> I think the goal for (2) should be to allow efficient separate
> retrieval of subcategories, files, and other pages, but not to
> distinguish between namespaces otherwise.

That wouldn't work if you did want to also fix bug 19640. It is quite
plausible that a bot might want to query all ns0 pages in a category, or
all talk pages, or all non-talk pages, or all templates, etc.

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Simetrical+wikilist at gmail

Jul 22, 2010, 11:10 AM

Post #18 of 36 (3717 views)
Permalink
Re: Architectural revisions to improve category sorting [In reply to]

On Thu, Jul 22, 2010 at 12:38 PM, David Gerard <dgerard [at] gmail> wrote:
> I'm basing this claim on the remarkable effort people go to to get
> things to sort on the category page *just right*. People clearly use
> this feature, and work hard to use it.
>
> Take it to the VP (technical) on en:wp and I would be amazed if people
> don't protest hugely at the prospect of it disappearing. Maybe I'm
> wrong, but asking the users would be a good idea first.

Feel free to do so if you think it's an issue. It wouldn't be
terribly hard to keep the current behavior, I don't think.

On Thu, Jul 22, 2010 at 1:25 PM, Brad Jorsch <b-jorsch [at] northwestern> wrote:
> "Categorymembers namespace filtering is inefficient, uses ugly hack in
> miser mode", https://bugzilla.wikimedia.org/show_bug.cgi?id=19640 , is
> also very related.

I personally don't think it's worth fixing. The likelihood of error
in keeping the categorylinks table up-to-date with page moves is just
too great to be worth bothering over.

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


roan.kattouw at gmail

Jul 22, 2010, 11:32 AM

Post #19 of 36 (3722 views)
Permalink
Re: Architectural revisions to improve category sorting [In reply to]

2010/7/22 Aryeh Gregor <Simetrical+wikilist [at] gmail>:
> On Thu, Jul 22, 2010 at 1:25 PM, Brad Jorsch <b-jorsch [at] northwestern> wrote:
>> "Categorymembers namespace filtering is inefficient, uses ugly hack in
>> miser mode", https://bugzilla.wikimedia.org/show_bug.cgi?id=19640 , is
>> also very related.
>
> I personally don't think it's worth fixing. The likelihood of error
> in keeping the categorylinks table up-to-date with page moves is just
> too great to be worth bothering over.
We could at least partially fix it by allowing to filter by type
(file|category|page) instead of namespace, and provide a mapping from
clnamespace to cltype for backwards compatibility (6 -> file, 14 ->
category, any other number -> page).

Roan Kattouw (Catrope)

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Platonides at gmail

Jul 22, 2010, 12:06 PM

Post #20 of 36 (3715 views)
Permalink
Re: Architectural revisions to improve category sorting [In reply to]

Aryeh Gregor wrote:
> On Wed, Jul 21, 2010 at 7:03 PM, Roan Kattouw <roan.kattouw [at] gmail> wrote:
>> It doesn't make a great deal of sense and can be changed fairly easily
>> in Title::isValidMoveTarget().
>
> On Thu, Jul 22, 2010 at 3:01 AM, Tim Starling <tstarling [at] wikimedia> wrote:
>> This restriction is enforced by Title::isValidMoveOperation().
>
> Any objections to changing this so files can't be moved over non-files
> or vice versa?

It could even make sense to move the text revisions to/from the file
namespace (with an appropiate warning). I once wanted to moving a file
talk page to the file namespace, to archive (delete) them together. Or
you might want to move an overdescription to NS_MAIN.

...

>> Since we won't be sorting on the plain text form anymore, we could use
>> some tricks to save space. For instance, if the sort key is the same
>> as the article title, we could store NULL instead of another copy of
>> the article title. That should save 95% or so.
>
> It doesn't seem like it would save nearly that much. On the Welsh
> Wikipedia (small enough database to be manageable), I get the
> following:
>
(...)
>
> I filtered out the main namespace in the last two to avoid false
> positives from namespace prefixes. This suggests savings of maybe
> 50-75%. The story may be different on larger wikis. It's worth
> remembering, though, that a lot of these sortkeys might be set to work
> around deficiencies in the current default sortkey generation, so
> maybe it would be higher savings in the long term.
>
> It's still not at all clear to me that saving a raw copy in the
> database is worth it. If we really need sectioning by first letter on
> category pages, we could save the first letter instead, and leave that
> NULL when it's the same as the first letter of the page title (all of
> this for some locale-specific definition of "first letter"). But I
> don't know if we need that.

Note that even if you're not storing it, the database is likely to be
reserving the space for that.
It can be useful to discern explicit sortkeys when the rules for
language parsing change, though.


_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Simetrical+wikilist at gmail

Jul 22, 2010, 12:21 PM

Post #21 of 36 (3716 views)
Permalink
Re: Architectural revisions to improve category sorting [In reply to]

On Thu, Jul 22, 2010 at 3:06 PM, Platonides <Platonides [at] gmail> wrote:
> It could even make sense to move the text revisions to/from the file
> namespace (with an appropiate warning). I once wanted to moving a file
> talk page to the file namespace, to archive (delete) them together. Or
> you might want to move an overdescription to NS_MAIN.
>
> ...

Is this worth the architectural complications in this case? It could
be done, but . . .

Or maybe this would be no extra overhead? Do we normally do links
updates on page move? We must already be updating categorylinks with
new sortkeys if applicable, no? We could update the namespace at that
point too. Maybe I'm being too averse to this denormalization -- I
got rather disillusioned when the counts in the category table started
constantly falling out of whack.

> Note that even if you're not storing it, the database is likely to be
> reserving the space for that.

The entire point of varchar/varbinary is that it uses a variable
amount of space. It will use one byte per row or such if it's empty.

> It can be useful to discern explicit sortkeys when the rules for
> language parsing change, though.

Yeah, I thought of that too. If we kept no raw copy in the database,
we'd have to reparse all pages to update the collation. Plus we could
use it for the initial letter. Any other uses that it has?

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


tstarling at wikimedia

Jul 22, 2010, 8:02 PM

Post #22 of 36 (3706 views)
Permalink
Re: Architectural revisions to improve category sorting [In reply to]

On 23/07/10 02:34, Aryeh Gregor wrote:
> On Thu, Jul 22, 2010 at 3:01 AM, Tim Starling <tstarling [at] wikimedia> wrote:
>> This restriction is enforced by Title::isValidMoveOperation().
>
> Any objections to changing this so files can't be moved over non-files
> or vice versa?

No objection. That's mostly how it is already, except when the file
doesn't exist but the description page does.

>> Since we won't be sorting on the plain text form anymore, we could use
>> some tricks to save space. For instance, if the sort key is the same
>> as the article title, we could store NULL instead of another copy of
>> the article title. That should save 95% or so.
>
> It doesn't seem like it would save nearly that much. On the Welsh
> Wikipedia (small enough database to be manageable), I get the
> following:

Welsh is not really what I was thinking when I said get statistics. On
the English Wikipedia (db38):

*************************** 1. row ***************************
Name: categorylinks
Engine: InnoDB
Version: 10
Row_format: Compact
Rows: 38875439
Avg_row_length: 161
Data_length: 6271123456
Max_data_length: 0
Index_length: 7946960896
Data_free: 7340032
Auto_increment: NULL
Create_time: 2010-05-24 11:29:52
Update_time: NULL
Check_time: NULL
Collation: binary
Checksum: NULL
Create_options:
Comment:
1 row in set (0.15 sec)

SELECT
count(*),
sum(length(cl_sortkey)) as raw_length,
sum( if(REPLACE(cl_sortkey, ' ', '_') = page_title,
0, length(cl_sortkey) ) ) as compact_length
FROM categorylinks,page
WHERE
cl_from=page_id and
page_namespace=0 and
page_id % 10 = 0

*************************** 1. row ***************************
count(*): 1957629
raw_length: 34177525
compact_length: 14857665
1 row in set (19 min 26.05 sec)

So we're looking at 17 bytes per row for raw text, and 8 bytes per row
for compacted text, plus 1 byte per row for the length byte. Overall,
assuming the lengths are the same across all namespaces, it would be
approximately 680 MB in the raw form for the English Wikipedia, and
presumably several times that for all wikis. Our English Wikipedia
core DB servers have between 700 GB and 2 TB of storage space, with
~450 GB currently in use. So the impact of adding an extra 1 GB or so
would be minimal.

No doubt Domas will complain anyway, but without developers adding new
features, I figure his volunteer DBA work would get very boring.

> It's still not at all clear to me that saving a raw copy in the
> database is worth it. If we really need sectioning by first letter on
> category pages, we could save the first letter instead, and leave that
> NULL when it's the same as the first letter of the page title (all of
> this for some locale-specific definition of "first letter"). But I
> don't know if we need that.

Truncating after the first letter would only save about 260MB for the
entire English Wikipedia. And it would limit the applications. For
instance, it would prevent fast updates of the collation algorithm.
Instead we would have to reparse the pages. That could take weeks,
even with a dozen servers dedicated to the task.

> This whole problems arises for sortkey changes generally. It will be
> just as much of a problem when going to a new sortkey type (based on
> CLDR or whatever). The only way to avoid it is to create a new
> column, populate it while maintaining both columns at once, start
> using the new column once it's fully populated, and then drop the old
> column. That seems excessive.

If we're going to have multiple locale-specific collation algorithms
(and that seems likely), then it may make sense to add a collation ID
foreign key to the categorylinks table, to track updates. Sensible
sorting behaviour mid-way through an update is probably not feasible,
but we can at least make it possible to track the problem.

> On Thu, Jul 22, 2010 at 5:34 AM, David Gerard <dgerard [at] gmail> wrote:
>> Please don't remove the feature where the first letter of the sort key
>> is displayed in the rendered category page, and if necessary add what
>> it takes to keep it.
>>
>> There are scripts where this will be a hard problem, but it's still
>> much-used and much-loved in those where it isn't.
>
> Is it? What use does it serve? We don't have it for any other type
> of list. We have zillions of types of page lists, and category pages
> are the only ones that have the first letter displayed. It makes the
> columns uneven, and is completely crazy for some scripts (like CJK,
> AFAICT).

We have zillions of lists, but category pages are by far the most
visible and heavily-used, that's why so much work has been done on
making them look nice, and why so many people are complaining about
category sorting instead of [[Special:DeadendPages]] sorting.

The CJK issue could be fixed by making the feature optional. The
uneven column issue is fixable using the multi-column layout feature
in CSS 3 and more recent versions of the major browsers.

-- Tim Starling


_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


krinklemail at gmail

Jul 22, 2010, 9:40 PM

Post #23 of 36 (3708 views)
Permalink
Re: Architectural revisions to improve category sorting [In reply to]

After reading the entire thread I've got several responses:

On Wed Jul 21 22:27:44 UTC 2010, Ariel T. Glenn ariel at wikimedia.org wrote:
> It's much worse than that. On the Wiktionary projects, which have the
> modest goal of documenting all words in all languages *on each language
> edition*, you can expect to find german, french, arabic, etc. lemmas on
> say the russian project. Even worse, you can expect to find multiple
> lemmas on one page if the lemma has the same orthography in more than
> one language. If the sort order differs between languages it's loads of
> fun.
>
> Ariel

Something that comes to mind is Wikimedia Commons. Would the same category with
a different
uselang be sorted differently ?

So far in this thread I've read that the current idea is to have different
sorting depending on language.
The language would be set to a certain default per wiki. And it would be
possible to overwrite that per
page or per category with a certain parser function (I think that would mean a
certain piece of Technical
metadata in a "Variable" type of Magic word (as defined here:
http://www.mediawiki.org/wiki/Help:Magic_words#Variables ).

Perhaps an idea is to not set a default per wiki, but a default per language.
That way it doesn't have to
be set for every wiki that is in the same language. And is easily overridable
for multilingual wikis by the
uselang-parameter and/or the user's preference. Ofcourse the magic word would
override that.

Though I don't know much about how a store-order is stored. It may be possible
to maintain it in an
interface message. That way a Germanic category on Commons or Wiktionary could
use something like:

{{SORTORDER:{{MediaWiki:Categorysortorder/de}}}}

On Wed Jul 21 22:18:02 UTC 2010, Daniel Kinzler daniel at brightbyte.de wrote:
> Aryeh Gregor schrieb:
> > * "Categories need to be structured by namespace",
> > https://bugzilla.wikimedia.org/show_bug.cgi?id=450
> > * "Natural number sorting in category listings",
> > https://bugzilla.wikimedia.org/show_bug.cgi?id=6948
>
> While we definitly need efficient retrieval by namespace, the default sort
key
> should *not* include the namespace prefix. it's very annoying that all files
get
> sorted under "F" currently, or that pages from the Wikipedia namespace all
end
> up under "W".
>
> -- daniel
A good example of this are Templates.
The vast majority of templates (as far as I can see) are sorted like this, with
PAGENAME:
[[Category:Special templates|{{PAGENAME}}]]

Setting to false by default may be good. But that would case all
types of things to go mixed up. (except ofcourse for Files, Subcategories those
would be seperated
from 'Pages'). But templates, articles and users would be sorted by pagename.
I'm not sure if that behaviour is good. I dont think it's that bad actually, if
at all.

Being able to set per namespace may be a solution. Not sure yet.

On Fri Jul 23 03:02:49 UTC 2010, Tim Starling tstarling at wikimedia.org
wrote:
> On 23/07/10 02:34, Aryeh Gregor wrote:
> > On Thu, Jul 22, 2010 at 3:01 AM, Tim Starling <tstarling at wikimedia.org>
wrote:
> >> This restriction is enforced by Title::isValidMoveOperation().
> >
> > Any objections to changing this so files can't be moved over non-files
> > or vice versa?
>
> No objection. That's mostly how it is already, except when the file
> doesn't exist but the description page does.

No objection either. However, make sure that the other way around is blocked
too. Else one might
accidently move a page to the File-namespace without being able to get it
back.

-- Krinkle


_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


roan.kattouw at gmail

Jul 22, 2010, 10:55 PM

Post #24 of 36 (3706 views)
Permalink
Re: Architectural revisions to improve category sorting [In reply to]

2010/7/23 Krinkle <krinklemail [at] gmail>
> Something that comes to mind is Wikimedia Commons. Would the same category with
> a different
> uselang be sorted differently ?
>
No. That would require storing a separate ordering (set of munged
sortkeys) in the database for each of the ~200 languages we support,
which would take way too much space. For this reason, every category
has exactly one ordering. It's fine if that ordering doesn't match the
content language of the wiki, or if a page is sorted differently in
different categories because they have different orderings, or if a
category's sorting order is changeable, as long as each category is
only sorted in one order at any one time.

> No objection either. However, make sure that the other way around is blocked
> too. Else one might
> accidently move a page to the File-namespace without being able to get it
> back.
>
It's also needed to ensure consistency: the goal is to ensure that if
a page is of type X (category, file or page), it cannot ever change
its type. For this reason, we need to disallow moves to and from the
File and Category namespaces.

Roan Kattouw (Catrope)

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


midom.lists at gmail

Jul 23, 2010, 12:04 AM

Post #25 of 36 (3709 views)
Permalink
Re: Architectural revisions to improve category sorting [In reply to]

Hi!

> No doubt Domas will complain anyway, but without developers adding new
> features, I figure his volunteer DBA work would get very boring.

I don't complain about well designed features, especially if they don't scan millions of rows to return 0 ;-)

Domas
_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

First page Previous page 1 2 Next page Last page  View All Wikipedia wikitech RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.