Mailing List Archive: Wikipedia: Wikitech

Bug/change request for rebuildtextindex.php

 

 



alexp at exscien

Aug 22, 2008, 8:18 AM

Post #1 of 7
Bug/change request for rebuildtextindex.php

Hi,

With every release of MW since 1.10 I've been tweaking
rebuildtextindex.php, replacing line 60 with the following code:

// Outside the main namespace, prefix the title with its namespace name.
if ( $s->page_namespace != NS_MAIN ) {
    global $wgContLang;
    $title = $wgContLang->getNsText( $s->page_namespace ) . ':' . $s->page_title;
} else {
    $title = $s->page_title;
}
$u = new SearchUpdate( $s->page_id, $title, $revtext );

I have no idea what the best way to get this into the main code base is, but
I'm pretty sure that as it stands it's wrong in everyone's eyes, since the
namespace information is currently lost. Ideally SearchUpdate would be
refactored to take a namespace parameter, but this at least allows the
namespace to be retrieved intact. I found this problem when adding an
extension to index the other namespaces. Or should I be using something else
to rebuild the text search?

Kind regards,

Alex

--
Alex Powell

Exscien Training Ltd
Tel: +44 (0) 1865 876562
Mob: +44 (0) 7717 765210

skype: alexp700
mailto:alexp [at] exscien
http://www.exscien.com

Registered in England and Wales 05927635, Unit 10 Wheatley Business Centre,
Old London Road, Wheatley, OX33 1XW, England
_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


alexp at exscien

Aug 22, 2008, 8:22 AM

Post #2 of 7
Re: Bug/change request for rebuildtextindex.php [In reply to]

Oh, and while I'm at it, is there any reason why line 161 of
\includes\SearchEngine.php cannot be updated to:

return "\\x22A-Za-z_'0-9\\x80-\\xFF\\-";

This allows for quoted searches to be passed into the MySQL query engine. Or
am I introducing a security hole?

Alex



brion at wikimedia

Aug 22, 2008, 9:29 AM

Post #3 of 7
Re: Bug/change request for rebuildtextindex.php [In reply to]

Alex Powell wrote:
> Oh and while I'm at it is there any reason why line 161 of
> \includes\SearchEngine.php cannot be updated to :
>
> return "\\x22A-Za-z_'0-9\\x80-\\xFF\\-";
>
> This allows for quoted searches to be passed into the MySQL query engine. Or
> am I introducing a security hole?

Are you looking at a really old version? We've supported quoted searches
for a while.

-- brion


brion at wikimedia

Aug 22, 2008, 9:35 AM

Post #4 of 7
Re: Bug/change request for rebuildtextindex.php [In reply to]

Alex Powell wrote:
> Every release of MW since 1.10 I've been making a tweak to the
> rebuildtextindex.php to replace line 60 with the following code:
[snip]
> I have no idea how the best way to get this into the main code base is, but
> I'm pretty sure as it stands its wrong in everyone's eyes - since the
> namespace information is currently lost. Ideally SearchUpdate would be
> refactored to take a namespace parameter, but this at least allows it to be
> retrieved intact.

Every call to SearchUpdate from Article and Title passes in the title
text portion without the namespace, and even if you included the
namespace on the title text, SearchUpdate itself discards it in its
constructor!

The namespace information is kept in the page table, in the
page_namespace field.
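
Since the namespace lives in the page table, a search can recover it at query time by joining the index to that table. A minimal sketch, assuming the stock MediaWiki table and column names (page, searchindex, page_id, si_page, si_text); the function names are hypothetical and this is not the actual SearchMySQL code:

```php
<?php
/**
 * Build a WHERE fragment restricting results to the given namespace IDs.
 * An empty list means "all namespaces" and yields an empty string.
 * (Hypothetical helper, for illustration only.)
 */
function namespaceCondition( array $namespaces ): string {
    if ( $namespaces === [] ) {
        return '';
    }
    // Cast to int so nothing but numbers reaches the SQL text.
    $ids = implode( ',', array_map( 'intval', $namespaces ) );
    return "AND page_namespace IN ({$ids})";
}

/** Join searchindex to page so page_namespace is available at query time. */
function searchQuery( string $match, array $namespaces ): string {
    return trim(
        "SELECT page_id, page_namespace, page_title " .
        "FROM page, searchindex " .
        "WHERE page_id = si_page AND {$match} " .
        namespaceCondition( $namespaces )
    );
}

// Example: restrict a search to the main (0) and Help (12) namespaces.
echo searchQuery( "MATCH(si_text) AGAINST('fish' IN BOOLEAN MODE)", [ 0, 12 ] ), "\n";
```

With this shape the index itself only needs the page ID and the bare title text; the namespace comes from the joined row.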


At some point we'll want to refactor SearchUpdate along with the search
backend classes to ensure everything does its index updates in a
cleaner way.

-- brion


alexp at exscien

Aug 23, 2008, 11:46 AM

Post #5 of 7
Re: Bug/change request for rebuildtextindex.php [In reply to]

Sorry, yes. I have extensions that create the index, and they seem to be
getting the full title text. It does look like it was an oversight.

I guess I'll just keep hacking the file!

Cheers,

Alex



alexp at exscien

Aug 23, 2008, 12:01 PM

Post #6 of 7
Re: Bug/change request for rebuildtextindex.php [In reply to]

Ah. I branched a version of the search functions at 1.11 and directed them
towards a centralized text store. This was done by overriding the
Special::SearchPage. I noticed in my code that the legalSearchChars stuff
was filtering the double quotes out of the query, which meant that a search for:

"fish pie"

would also return articles that only contain the string "fish and pie", and
not just exact matches for "fish pie". I fixed this by adding \x22 to the
legal chars in the core class, which in my 1.11 version was called regardless
of any derived class's overridden legalSearchChars(). Possibly this issue is
fixed in 1.13, but the line remains the same.
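
The effect of the character class can be shown in isolation. This is a minimal sketch, not the MediaWiki code: filterQuery() is a hypothetical stand-in for whatever applies legalSearchChars() to the query, and the stock class is assumed to be the proposed one minus the leading \x22.

```php
<?php
// Assumed stock character class: the proposed line without the leading \x22.
function legalSearchChars(): string {
    return "A-Za-z_'0-9\\x80-\\xFF\\-";
}

// The proposed variant, which additionally allows the double quote (\x22).
function legalSearchCharsWithQuotes(): string {
    return "\\x22A-Za-z_'0-9\\x80-\\xFF\\-";
}

// Hypothetical filter: replace every character outside the class with a space.
function filterQuery( string $query, string $legalChars ): string {
    return trim( preg_replace( "/[^{$legalChars}]/", ' ', $query ) );
}

echo filterQuery( '"fish pie"', legalSearchChars() ), "\n";           // fish pie
echo filterQuery( '"fish pie"', legalSearchCharsWithQuotes() ), "\n"; // "fish pie"
```

With the quotes stripped, the backend only ever sees two separate words, so it cannot treat them as a phrase.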

BTW the relevance metric for the MySQL search is also quite borked in the
1.13 codebase. It needs to be more like this, at line 191 of SearchMySQL.php:

$match = $this->parseQuery( $filteredTerm, $fulltext );

// Same MATCH clause without the boolean-mode modifier, used as the relevance score.
$m2 = str_replace( " IN BOOLEAN MODE", "", $match );

return "SELECT page_id, page_namespace, page_title, {$m2} as relevance " .
    "FROM masterwiki.$page, masterwiki.$searchindex " .
    "WHERE pid=si_masterid AND $match";

That will give properly ranked results. I got this from the MySQL manual page
on full-text search:

http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html

First comment.
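
The same idea as a standalone sketch, for anyone who wants to try it outside MediaWiki. It assumes the stock table and column names (page, searchindex, page_id, si_page, si_text) rather than my masterwiki ones, rankedSearchQuery() is a hypothetical name, and the ORDER BY is an addition that is not in the snippet above. It relies on the fact that MySQL does not sort boolean-mode results by relevance on its own.

```php
<?php
/**
 * Build a ranked full-text query from a boolean-mode MATCH clause.
 * $match is assumed to look like:
 *   MATCH(si_text) AGAINST('+fish +pie' IN BOOLEAN MODE)
 */
function rankedSearchQuery( string $match ): string {
    // Same MATCH expression without the modifier, so MySQL evaluates it in
    // natural-language mode and returns a relevance score for each row.
    $relevance = str_replace( ' IN BOOLEAN MODE', '', $match );

    return "SELECT page_id, page_namespace, page_title, {$relevance} AS relevance " .
        "FROM page, searchindex " .
        "WHERE page_id = si_page AND {$match} " .
        "ORDER BY relevance DESC"; // explicit sort: an assumption, not in the thread
}

$match = "MATCH(si_text) AGAINST('+fish +pie' IN BOOLEAN MODE)";
echo rankedSearchQuery( $match ), "\n";
```

The boolean-mode clause still decides which rows match; the second copy of the expression only supplies the number to sort by.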

Hope that helps!

Kind regards,

Alex



alexp at exscien

Aug 24, 2008, 8:54 AM

Post #7 of 7
Re: Bug/change request for rebuildtextindex.php [In reply to]

Sorry, I meant the second comment :-s
