
Mailing List Archive: Wikipedia: Wikitech

So... status of category intersections?

 

 



dan_the_man at telus

Apr 27, 2008, 5:33 PM

Post #51 of 59
Re: So... status of category intersections?

What would you think if I added a runMaintenance.php script to the
/maintenance folder that could be used to call maintenance scripts?
I'll describe this from my perspective: I run maintenance scripts from
inside the path a wiki is installed into, because that directory is the
real location of the wiki; the actual maintenance and similar
directories are all symlinks to a central location:
php ./maintenance/runMaintenance.php --root=$PWD ./maintenance/scriptname.php args...
And similarly for extensions:
php ./maintenance/runMaintenance.php --root=$PWD ./extensions/ExtName/maintenance/scriptname.php args...
The point is:
* The script checks for the cli sapi type, aborting if the check fails.
* The script defines MEDIAWIKI_CLI.
* The script sets $IP either to a --root= param you give it (which must
be specified before the maintenance script name), or to a default of
"realpath( dirname( __FILE__ ) . '/..' );", which is compatible with
default installations while allowing non-defaults to work.
* The script treats the first non-param argument you give it as the
maintenance script name. It then strips out the script name and all
arguments before it (which are considered arguments to the runscript).
* The script then includes the maintenance script to be run.
* The maintenance script itself uses the $IP path only if MEDIAWIKI_CLI
is defined. Otherwise, for backwards compatibility but more security, it
uses "require_once( dirname( __FILE__ ) . '/commandLine.inc' );".
Extension scripts would use the ../.. trick they normally use.

So basically the form goes:
php [path to maintenance]runMaintenance.php [args to runscript]
<maintenance script name> [args to maintenance script]
And it maintains a fair bit of security while still being lenient on
those with non-standard paths who don't want to duplicate things everywhere.
You can even run extension maintenance scripts from their extensions
folder, or symlink them to the maintenance folder and they'll still work.
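
The argument handling described above can be sketched roughly as
follows. This is only an illustration in Python, not the proposed PHP
script; the function names split_runner_args and resolve_root are
hypothetical, and the sketch assumes runner options always take the
--name=value form.

```python
import os


def split_runner_args(argv):
    """Split argv (without the runner's own name) into three parts:
    options for the runner, the maintenance script name, and the
    arguments that belong to the maintenance script."""
    runner_opts = {}
    for i, arg in enumerate(argv):
        if arg.startswith("--"):
            # Everything before the script name is a runner option.
            key, _, value = arg[2:].partition("=")
            runner_opts[key] = value
        else:
            # The first non-option argument is the script name;
            # everything after it is passed through untouched.
            return runner_opts, arg, argv[i + 1:]
    raise ValueError("no maintenance script name given")


def resolve_root(runner_opts, runner_file):
    """Use --root= when given, else the parent of the runner's folder."""
    if "root" in runner_opts:
        return os.path.realpath(runner_opts["root"])
    return os.path.realpath(os.path.join(os.path.dirname(runner_file), ".."))
```

For instance, split_runner_args(["--root=/srv/wiki",
"./maintenance/update.php", "--quick"]) yields the runner options
{"root": "/srv/wiki"}, the script name, and ["--quick"] for the script.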

~Daniel Friesen(Dantman) of:
-The Gaiapedia (http://gaia.wikia.com)
-Wikia ACG on Wikia.com (http://wikia.com/wiki/Wikia_ACG)
-and Wiki-Tools.com (http://wiki-tools.com)

Mark Clements wrote:
> "Simetrical" <Simetrical+wikilist [at] gmail>
> wrote in message
> news:7c2a12e20804221716xaa6f5cflf216ff28c6324015 [at] mail
>
>> On Tue, Apr 22, 2008 at 5:59 PM, Mark Clements wrote:
>>
>>> And, of course, it doesn't help when that's not the case, which is the
>>> situation for us. For technical reasons, all extensions are outside
>>> the MW source folder entirely.
>>>
>> Symlinks work perfectly in that case (as is true for my localhost, for
>> instance, since it's running a checked-out version of
>> mediawiki/trunk/). I agree it's not great practice, though: maybe you
>> could try to use the current working directory? That seems even less
>> reliable.
>>
>
> I imagine an 'updateExtension' script in the 'maintenance' folder that
> include()s the appropriate command line/site settings/etc. files then looks
> for a script with the appropriate name (based on the extension name which is
> supplied as first arg on command line - 'ExtName' in this example) in the
> following places.
>
> */extensions/ExtName/maintenance.php
> */ExtName/maintenance.php
>
> Where * means anywhere in the include path. If the file exists, run it with
> the remaining arguments passed through, for which there should be a
> standardised subset that most extensions use (e.g. 'install' and 'upgrade')
> though extension-specific items are allowed. If no arg (or an unexpected
> arg) is provided then the extension file is expected to print out the
> details about available items (i.e. equivalent to 'help').
>
> - Mark Clements (HappyDog)
_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


dan_the_man at telus

May 1, 2008, 5:40 PM

Post #52 of 59
Re: So... status of category intersections?

OK, TimStarling's solution is to use an environment variable and check
for it using getenv.

So, for example, this is what you would do to an extension script to make
it work right:
http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/CheckUser/install.php?r1=34107&r2=34106&pathrev=34107

I recommend that the fallback used when the environment variable is not
set be whatever form is currently in the maintenance script you are
modifying when you make this kind of update. That will be best for
backwards compatibility.
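
As a rough illustration of the pattern (in Python rather than PHP), the
lookup could be sketched like this. The variable name MW_INSTALL_PATH
and the function name resolve_install_path are assumptions made for the
example, since the message above does not name them, and the ../..
fallback assumes a script living in extensions/ExtName/.

```python
import os


def resolve_install_path(script_file, environ=os.environ):
    """Prefer the environment variable; otherwise fall back to the path
    the script already used (here: two directories above the script)."""
    env_path = environ.get("MW_INSTALL_PATH")  # assumed variable name
    if env_path:
        # An unset or empty variable falls through to the old behaviour.
        return env_path
    script_dir = os.path.dirname(script_file)
    return os.path.normpath(os.path.join(script_dir, "..", ".."))
```

Keeping the old relative path as the fallback means existing setups
behave exactly as before unless the variable is set.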

~Daniel Friesen(Dantman) of:
-The Gaiapedia (http://gaia.wikia.com)
-Wikia ACG on Wikia.com (http://wikia.com/wiki/Wikia_ACG)
-and Wiki-Tools.com (http://wiki-tools.com)



roan.kattouw at home

May 22, 2008, 2:53 PM

Post #53 of 59
Re: So... status of category intersections?

Robert Stojnic wrote:
> Let me briefly repeat what I said earlier about my experience with this
> category intersection thingy. Adding categories to the lucene index is
> easy *IF* they are inside the article, e.g. try this:
>
> http://en.wikipedia.org/w/index.php?title=Special%3ASearch&search=%2Bincategory%3A%22Living+people%22+%2Bincategory%3A%22English+comedy+writers%22&ns0=1&fulltext=Search
>
> This will give you the category intersection of "Living people" and
> "English comedy writers" in a fraction of a second.
>
That's the dirty way. I've gone ahead and written an alternative way of
implementing category intersections using a fulltext search, which means
you can run the craziest intersections; in fact, you can search in an
article's categories as if they were the page's contents. It's part of
the AdvancedSearch extension, which I'm paid to write, but it'll be easy
to split off just the intersection functionality into another extension.
The upside is that I also have a special page front end ready to go.
I'll commit AdvancedSearch into SVN once I've worked out the bugs
(provided there are any; it's close to midnight now, so I don't really
feel like testing stuff any more) and worked things out with my
'employer', which shouldn't take more than a few days.

On a technical level, the extension adds the categorysearch table (you
need to run update.php to actually create the table), which is basically
a rip-off from the searchindex table. It has a cs_page field referencing
page_id, and keeps itself updated using the LinksUpdate and
ArticleDeleteComplete hooks. There's also a maintenance script to
populate the table from scratch.
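
The update flow described here might look roughly like the following
sketch. It uses Python with an in-memory SQLite table purely for
illustration, not the extension's actual code: only cs_page comes from
the description above, while the cs_categories column, the function
names, and the plain-text storage (standing in for a real fulltext
index) are assumptions.

```python
import sqlite3


def create_table(db):
    # cs_page references page_id; cs_categories is a hypothetical column
    # holding the page's category names as searchable text.
    db.execute(
        "CREATE TABLE categorysearch ("
        "cs_page INTEGER PRIMARY KEY, cs_categories TEXT NOT NULL)"
    )


def on_links_update(db, page_id, categories):
    """Stand-in for the LinksUpdate hook: rewrite the page's row."""
    text = "\n".join(sorted(categories))
    db.execute(
        "REPLACE INTO categorysearch (cs_page, cs_categories) VALUES (?, ?)",
        (page_id, text),
    )


def on_article_delete(db, page_id):
    """Stand-in for the ArticleDeleteComplete hook: drop the page's row."""
    db.execute("DELETE FROM categorysearch WHERE cs_page = ?", (page_id,))


def populate(db, categorylinks):
    """Rebuild the table from (cl_from, cl_to) pairs, as a populate
    script reading the categorylinks table would."""
    db.execute("DELETE FROM categorysearch")
    by_page = {}
    for page_id, category in categorylinks:
        by_page.setdefault(page_id, set()).add(category)
    for page_id, categories in by_page.items():
        on_links_update(db, page_id, categories)
```
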
> What I found is that the hard part is keeping the index updated. If we
> want the fancy category intersection system discussed here before, we
> need to have an index that is frequently updated, that is integrated
> with the job queue, that understands templates, etc.
>
Understanding templates is no problem here, since the updater uses the
parser's notion of which categories the page is in, and the populate
script uses the categorylinks table.
> Lucene is not that good with very frequent updates. The usual setup is
> to have an indexer, make snapshots of the index at regular intervals and
> then rsync them onto the searchers. The whole process takes time,
> although for a category-only index it will probably be fast. I assume
> there would be at least a few tens of minutes of lag anyhow. Our current
> lucene framework could easily be used for index distribution and such.
>
I really don't have the faintest idea how Lucene works or how MediaWiki
interfaces with it, but I do know that Lucene can handle the stuff we
put into the searchindex table. Since the categorysearch table is no
different, I think Lucene *should* be able to handle it pretty easily as
well. Could someone who actually has a clue about all this reply?

Roan Kattouw (Catrope)



aerik at thesylvans

May 22, 2008, 9:12 PM

Post #54 of 59
Re: So... status of category intersections?

On Thu, May 22, 2008, Roan Kattouw <roan.kattouw [at] home> wrote:

>
> I've gone ahead and written an alternative way of
> implementing category intersections using a fulltext search, which means
> you can run the most crazy intersections; in fact, you can search in an
> article's categories as if they were the page's contents. It's part of
> the AdvancedSearch extension which I'm paid to write, but it'll be easy
> to split off just the intersection functionality into another extension.
> The upside is that I also have a special page front end ready to go.
> I'll commit AdvancedSearch into SVN once I've worked out the bugs
> (provided there are any; it's close to midnight now so I don't really
> feel like testing stuff any more) and worked out stuff with my
> 'employer', which shouldn't take more than a few days.


Wow, awesome! You (and your employer) beat the heck out of all my good
intentions to acquaint myself with the current version of MediaWiki and
write code good enough for production! I can't wait to see it.

> On a technical level, the extension adds the categorysearch table (you
> need to run update.php to actually create the table), which is basically
> a rip-off from the searchindex table. It has a cs_page field referencing
> page_id, and keeps itself updated using the LinksUpdate and
> ArticleDeleteComplete hooks. There's also a maintenance script to
> populate the table from scratch.
> > What I found that the hard part is keeping the index updated. If we want
> > a fancy category
> > intersection system discussed here before we need to have an index that
> > is frequently updated,
> > that will be integrated with the job queue, that will understand
> > templates etc..
> >
> Understanding templates is no problem here, since the updater uses the
> parser's notion of which categories the page is in, and the populate
> script uses the categorylinks table.


Perfect - yes exactly the way to go.

> > Lucene is not that good with very frequent updates. The usual setting is
> > to have an indexer,
> > make snapshots of the index at regular intervals and then rsync it onto
> > searchers. The whole
> > process takes time, although for a category-only index it will probably
> > be fast. I assume there
> > would be at least few tens of minutes lag anyhow. Our current lucene
> > framework could
> > easily be used for index distribution and such.
> >
> I really don't have the faintest idea how Lucene works or how MediaWiki
> interfaces with it, but I do know that Lucene can handle the stuff we
> put into the searchindex table. Since the categorysearch table is no
> different, I think Lucene *should* be able to handle it pretty easily as
> well. Could someone who actually has a clue about all this reply?
>
>
Lucene doesn't allow edits; it only allows adds and deletes. Presumably too
many deletes make the index inefficient or something. But I think all that
is moot - once you've got the categories into their own table, it *should*
be simple to set up another index on the same type of schedule etc. as the
base search index, and point it at that table. Then, change the interface to
point to Lucene instead of MySQL. I'm not familiar with Wikipedia's Lucene
backend, but it seems reasonable to assume that this is not a major
endeavor.

What does your UI for the intersections look like? That was the killer for
me; I'm a weak UI guy. I'd imagine something (and I implemented a rough
prototype years ago) that lets you "browse" intersections - i.e., given
category A, it would show you the set of all categories B that have
documents that are also in category A. Ideally the most frequently used
categories appear at the top :-) But I never did any performance testing for
this setup, and additionally, I'm not sure how to do it in Lucene... Anyway,
what's your interface like?
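
To make the "browse" idea concrete, here is a minimal sketch in Python,
assuming the page-to-categories mapping fits in memory. The function
name co_occurring_categories and the data layout are hypothetical; this
is not the old prototype and says nothing about how it would be done in
Lucene.

```python
from collections import Counter


def co_occurring_categories(page_categories, selected):
    """Given {page_id: set of category names} and the set of categories
    already selected, count every other category that shares at least
    one page with all of the selected ones."""
    counts = Counter()
    for categories in page_categories.values():
        if selected <= categories:
            counts.update(categories - selected)
    # Most frequently used categories first; ties are broken by name.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

Selecting one of the returned categories and calling the function again
with the larger set narrows the intersection one step at a time.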

Best Regards,
Aerik

--
http://www.wikidweb.com - the Wiki Directory of the Web
http://tagthis.info - Hosted Tagging for your website!


roan.kattouw at home

May 23, 2008, 2:04 AM

Post #55 of 59
Re: So... status of category intersections?

Aerik Sylvan wrote:
> Wow, awesome! - you (and your employer) beat the heck out of all my good
> intentions to acquaint myself with the current version of Mediawiki and
> write code good enough for production! I can't wait to see it.
>
I guess money helps. As does having more free time to develop it.
> Lucene doesn't allow edits, it only allows add and delete. Presumably too
> many deletes make the index inefficient or something. But I think all that
> is moot - once you've got the categories into their own table, it *should*
> be simple to set up another index on the same type schedule/etc. as the base
> search index, and point it to that table. Then, change the interface to
> point to Lucene instead of MySQL. I'm not familiar with Wikipedia's Lucene
> backend, but... It seems reasonable to assume that this is not a major
> endeaver.
>
That's kind of what I was saying: if we have some kind of searchindex
<--> Lucene interface, a categorysearch <--> Lucene interface should be
easy.
> What's your UI for the intersections look like? That was the killer for me;
> I'm a weak UI guy. I'd imagine (and implemented a rough prototype years
> ago) that let you "browse" intersections - ie, given intersection a it would
> show you the set of all categories B that have documents that have category
> a. Ideally the most frequently used categories appear at the top :-) But I
> never did any performance testing for this set up, and additionally, I'm not
> sure how to do it in Lucene... Anyway, what's your interface like?
I'm also not much of a UI guy, but the UI for this extension was mostly
imposed on me by my 'employer', and after some discussion we settled on
a format where the category intersection part (it does more) is
basically a text box where you can enter "Living people AND American
people OR Presidents of the United States". AND takes precedence over
OR, so the example would get all living Americans plus all deceased
ex-Presidents. Expressions with parentheses like "Living people AND
(American people OR Canadian people)" aren't supported yet, but can be
emulated with "Living people AND American people OR Living people AND
Canadian people". More complex expressions will probably be impossible
to emulate that way, and of course the extension should really support
parentheses; I'm working on that.
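
The precedence rule can be illustrated with a small Python sketch that
evaluates such a query against sets of page IDs. It is not the
extension's code: the function name intersect_query and the
category-to-pages mapping are assumptions, and it naively assumes no
category name contains " AND " or " OR ".

```python
def intersect_query(expression, pages_by_category):
    """Evaluate 'A AND B OR C' style queries without parentheses.
    Splitting on OR first and AND second makes AND bind more tightly."""
    result = set()
    for term in expression.split(" OR "):
        names = [name.strip() for name in term.split(" AND ")]
        # Start from the first category's pages, then intersect the rest.
        pages = set(pages_by_category.get(names[0], set()))
        for name in names[1:]:
            pages &= pages_by_category.get(name, set())
        result |= pages
    return result
```

With the example query, the pages in both "Living people" and "American
people" are collected first, and the pages in "Presidents of the United
States" are then added to that result.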

Anyway, you'll be able to play around with it around the beginning of
next week, probably.

Roan Kattouw (Catrope)



dgerard at gmail

May 23, 2008, 2:17 AM

Post #56 of 59
Re: So... status of category intersections?

2008/5/23 Roan Kattouw <roan.kattouw [at] home>:

> I'm also not much of a UI guy, but the UI for this extension was mostly
> imposed on me by my 'employer', and after some discussion we settled on
> a format where the category intersection part (it does more) is
> basically a text box where you can enter "Living people AND American
> people OR Presidents of the United States". AND takes precedence over
> OR, so the example would get all living Americans plus all deceased
> ex-Presidents. Expressions with parentheses like "Living people AND
> (American people OR Canadian people)" aren't supported yet, but can be
> emulated with "Living people AND American people OR Living people AND
> Canadian people" (more complex expressions will probably be impossible
> to emulate that way, and of course the extension should really support
> parentheses, I'm working on that).
> Anyway, you'll be able to play around with it around the beginning of
> next week, probably.


\o/ \o/ \o/


- d.



Simetrical+wikilist at gmail

May 23, 2008, 6:27 AM

Post #57 of 59
Re: So... status of category intersections?

On Fri, May 23, 2008 at 5:04 AM, Roan Kattouw <roan.kattouw [at] home> wrote:
> Expressions with parentheses like "Living people AND
> (American people OR Canadian people)" aren't supported yet, but can be
> emulated with "Living people AND American people OR Living people AND
> Canadian people" (more complex expressions will probably be impossible
> to emulate that way, and of course the extension should really support
> parentheses, I'm working on that).

Any expression involving only AND and OR should be possible to express
without parentheses, if AND binds more tightly than OR or vice versa.

http://en.wikipedia.org/wiki/Canonical_form_(Boolean_algebra)
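
A short Python sketch of that rewriting, distributing AND over OR to get
a parenthesis-free form, could look like this. The nested-tuple input
format and the names to_dnf and render are made up for the example; it
handles only AND and OR, as in the claim above.

```python
from itertools import product


def to_dnf(expr):
    """Return a list of AND-terms (tuples of category names); the terms
    are implicitly OR-ed together. expr is either a category name or a
    tuple like ("and", left, right) / ("or", left, right)."""
    if isinstance(expr, str):
        return [(expr,)]
    op, *operands = expr
    parts = [to_dnf(operand) for operand in operands]
    if op == "or":
        # OR just concatenates the alternatives.
        return [term for part in parts for term in part]
    if op == "and":
        # AND pairs every alternative of one side with every alternative
        # of the other side (distributivity).
        return [sum(combo, ()) for combo in product(*parts)]
    raise ValueError("unknown operator: " + op)


def render(terms):
    return " OR ".join(" AND ".join(term) for term in terms)
```

For example, ("and", "Living people", ("or", "American people",
"Canadian people")) renders as "Living people AND American people OR
Living people AND Canadian people". The number of terms can grow quickly
for larger expressions.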



dgerard at gmail

May 23, 2008, 6:31 AM

Post #58 of 59
Re: So... status of category intersections?

2008/5/23 Simetrical <Simetrical+wikilist [at] gmail>:
> On Fri, May 23, 2008 at 5:04 AM, Roan Kattouw <roan.kattouw [at] home> wrote:

>> Expressions with parentheses like "Living people AND
>> (American people OR Canadian people)" aren't supported yet, but can be
>> emulated with "Living people AND American people OR Living people AND
>> Canadian people" (more complex expressions will probably be impossible
>> to emulate that way, and of course the extension should really support
>> parentheses, I'm working on that).

> Any expression involving only AND and OR should be possible to express
> without parentheses, if AND binds more tightly than OR or vice versa.
> http://en.wikipedia.org/wiki/Canonical_form_(Boolean_algebra)


Yes, but you're a geek, and casual users are very unlikely to be ;-)


- d.



roan.kattouw at home

May 23, 2008, 9:49 AM

Post #59 of 59
Re: So... status of category intersections?

Simetrical wrote:
>
> Any expression involving only AND and OR should be possible to express
> without parentheses, if AND binds more tightly than OR or vice versa.
>
True. Whether it's a good thing to let users write "a AND c OR a AND d
OR b AND c OR b AND d" rather than "(a OR b) AND (c OR d)" is another issue.

Roan Kattouw (Catrope)

