Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Wikipedia: Wikitech

Regular expressions searching

 

 

Wikipedia wikitech RSS feed   Index | Next | Previous | View Threaded


amir.aharoni at gmail

Jul 6, 2009, 4:05 AM

Post #1 of 9 (669 views)
Permalink
Regular expressions searching

(Originally asked at [[Wikipedia talk:Searching]] and [[WP:VPT]].)

Is there any existing way to search Wikipedia or MediaWiki in general
using full-fledged regular expressions? Such as those found in Perl,
PCRE, Python, JavaScript?

I started writing a Perl program that uses Parse::MediaWikiDump, goes
over a dump and searches for regexes, but there are two problems with
this:

1. Such a program probably already exists, although i don't know
where. Can anyone point me to an existing tool? It can be in any other
language, not necessarily Perl, but it should be portable - not
Windows-only/Mac-only/Linux-only.

2. The info won't be up-to-date. Would it be too much to ask to search
the database directly using regexes?

If problem number 2 is too hard to solve and nobody knows the answer
to problem number one, then i guess that i'll publish my Perl dump
searching program for the common good. (Why not Python? Because i know
Perl better and Parse::MediaWikiDump works well enough for me.)

--
Amir Elisha Aharoni

http://aharoni.wordpress.com

"We're living in pieces,
I want to live in peace." - T. Moore

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


agarrett at wikimedia

Jul 6, 2009, 4:43 AM

Post #2 of 9 (645 views)
Permalink
Re: Regular expressions searching [In reply to]

On 06/07/2009, at 12:05 PM, Amir E. Aharoni wrote:

> 2. The info won't be up-to-date. Would it be too much to ask to search
> the database directly using regexes?

Yes.

We wouldn't allow direct searching from the web interface with regexes
for two related reasons:

1/ A single search with the most basic of regexes would take several
minutes, if not hours. It isn't computationally trivial to search for
a small string in a very large string of over 10 GB, let alone a
regex. Words can be indexed, regexes cannot.

2/ Even if we could find a way to make the former performant, a
malicious regex could significantly expand this time taken, leading to
a denial of service.

--
Andrew Garrett
Contract Developer, Wikimedia Foundation
agarrett [at] wikimedia
http://werdn.us




_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


amir.aharoni at gmail

Jul 6, 2009, 4:50 AM

Post #3 of 9 (645 views)
Permalink
Re: Regular expressions searching [In reply to]

On Mon, Jul 6, 2009 at 14:43, Andrew Garrett <agarrett [at] wikimedia> wrote:
>
> On 06/07/2009, at 12:05 PM, Amir E. Aharoni wrote:
>
> > 2. The info won't be up-to-date. Would it be too much to ask to search
> > the database directly using regexes?
>
> Yes.
>
> We wouldn't allow direct searching from the web interface with regexes
> for two related reasons:
>
> 1/ A single search with the most basic of regexes would take several
> minutes, if not hours. [...]
>
> 2/ Even if we could find a way to make the former performant, a
> malicious regex could significantly expand this time taken, leading to
> a denial of service.

I imagined that that would be the answer. Maybe it is possible to set
up a service that searches some back-up database, which is almost
up-to-date, or to have someone approve the regexes to prevent DOS.

Just throwing a bunch of ideas to the air.

Mostly i was interested to find out whether there is an existing
script that searches dumps, instead of writing one myself.

--
אמיר אלישע אהרוני
Amir Elisha Aharoni

http://aharoni.wordpress.com

"We're living in pieces,
I want to live in peace." - T. Moore

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Simetrical+wikilist at gmail

Jul 6, 2009, 11:37 AM

Post #4 of 9 (629 views)
Permalink
Re: Regular expressions searching [In reply to]

On Mon, Jul 6, 2009 at 7:43 AM, Andrew Garrett<agarrett [at] wikimedia> wrote:
> Yes.
>
> We wouldn't allow direct searching from the web interface with regexes
> for two related reasons:
>
> 1/ A single search with the most basic of regexes would take several
> minutes, if not hours. It isn't computationally trivial to search for
> a small string in a very large string of over 10 GB, let alone a
> regex. Words can be indexed, regexes cannot.
>
> 2/ Even if we could find a way to make the former performant, a
> malicious regex could significantly expand this time taken, leading to
> a denial of service.

I seem to recall Gregory Maxwell describing a setup that made this
feasible, given the appropriate amount of dedicated hardware. It was
run with the entire database in memory; it only permitted "real"
regular expressions (compilable to finite-state machines, no
backreferences etc.); and it limited the length of the finite-state
machine generated. Running a regex took several minutes, but he'd run
a lot of them in parallel, since it was mostly memory-bound, so he got
fairly good throughout. Something like that.

But probably not practical without an undue amount of effort and
hardware, yeah. :)

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


mickewiki at gmail

Jul 6, 2009, 11:37 AM

Post #5 of 9 (630 views)
Permalink
Re: Regular expressions searching [In reply to]

Amir E. Aharoni <amir.aharoni <at> gmail.com> writes:

>
> Mostly i was interested to find out whether there is an existing
> script that searches dumps, instead of writing one myself.
>

I wrote a c++-program that can search through a databasedump (with regexes),
it's one of the first apps I wrote, so the source is not pretty to read. You can
find it here:

http://meta.wikimedia.org/wiki/User:Micke/WikiFind

I would not implement it that way (and I would not write it in c++ :), if I was
to write it to day, but if you are helped by it please go ahead and use it.

Use the second version of the program if you want to pipe the dump in from bzcat.

/Micke


_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


gmaxwell at gmail

Jul 6, 2009, 12:05 PM

Post #6 of 9 (640 views)
Permalink
Re: Regular expressions searching [In reply to]

On Mon, Jul 6, 2009 at 2:37 PM, Aryeh
Gregor<Simetrical+wikilist [at] gmail> wrote:
> On Mon, Jul 6, 2009 at 7:43 AM, Andrew Garrett<agarrett [at] wikimedia> wrote:
>> Yes.
>>
>> We wouldn't allow direct searching from the web interface with regexes
>> for two related reasons:
>>
>> 1/ A single search with the most basic of regexes would take several
>> minutes, if not hours. It isn't computationally trivial to search for
>> a small string in a very large string of over 10 GB, let alone a
>> regex. Words can be indexed, regexes cannot.
>>
>> 2/ Even if we could find a way to make the former performant, a
>> malicious regex could significantly expand this time taken, leading to
>> a denial of service.
>
> I seem to recall Gregory Maxwell describing a setup that made this
> feasible, given the appropriate amount of dedicated hardware.  It was
> run with the entire database in memory; it only permitted "real"
> regular expressions (compilable to finite-state machines, no
> backreferences etc.); and it limited the length of the finite-state
> machine generated.  Running a regex took several minutes, but he'd run
> a lot of them in parallel, since it was mostly memory-bound, so he got
> fairly good throughout.  Something like that.
>
> But probably not practical without an undue amount of effort and
> hardware, yeah.  :)

Yes, I didn't comment on the initial comment because "full PCRE" is
simply far too much to ask for.

Basic regexps of the sort that can be complied into a deterministic
finite state machine (i.e. no backtracking) can be merged together
into a single larger state machine.

So long as the state machine fits in cache, the entire DB can be
scanned in not much more time than it takes to read it in from memory,
even if there are hundreds of parallel regexpes.

So you batch up user requests then run them in parallel groups. Good
throughput, poor latency.

Insufficiently selective queries are problematic. I never came up with
a really good solution to people feeding in patterns like '.' and
stalling the whole process by wasting a lot of memory bandwidth
updating the result set. (an obvious solution might just be to limit
the number of results)

The latency can be reduced by partitioning the database across
multiple machines (more aggregate memory bandwidth). By doing this
you could achieve arbitrarily low latency and enormous throughput.
Dunno if it's actually worthwhile, however.

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


stevagewp at gmail

Jul 6, 2009, 6:49 PM

Post #7 of 9 (632 views)
Permalink
Re: Regular expressions searching [In reply to]

On Mon, Jul 6, 2009 at 9:05 PM, Amir E. Aharoni<amir.aharoni [at] gmail> wrote:
> 2. The info won't be up-to-date. Would it be too much to ask to search
> the database directly using regexes?

What's your use case? Obviously all the points below are valid and
rule out directly regex searching on the entire Wikipedia database,
for instance, but I wonder if you could have hybrid cases like "return
pages that contain X and regex Y". Since X can be indexed, you're
immediately working on a (much) smaller subset.

(Then again, AWB already supports this I think)

Not sure what cases you need full regex for though. Obviously regexes
like "foo(.*)bar" have wide applicability...but the more esoteric
forms?

Steve

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


amir.aharoni at gmail

Jul 6, 2009, 11:10 PM

Post #8 of 9 (636 views)
Permalink
Re: Regular expressions searching [In reply to]

On Tue, Jul 7, 2009 at 04:49, Steve Bennett<stevagewp [at] gmail> wrote:
> On Mon, Jul 6, 2009 at 9:05 PM, Amir E. Aharoni<amir.aharoni [at] gmail> wrote:
>> 2. The info won't be up-to-date. Would it be too much to ask to search
>> the database directly using regexes?
>
> What's your use case? Obviously all the points below are valid and
> rule out directly regex searching on the entire Wikipedia database,
> for instance, but I wonder if you could have hybrid cases like "return
> pages that contain X and regex Y". Since X can be indexed, you're
> immediately working on a (much) smaller subset.

It is not really that important for me to search the live Wikipedia.

Currently i mostly want to satisfy my linguistic curiosity and find
out statistics about usage of different spellings in the Hebrew
Wikipedia (modern [[Hebrew spelling]] is wildly inconsistent). The
regular search engine, which is mostly tailored for English, is almost
useless for this task. But searching a dump would be enough.

(AWB is ruled out, because i frequently need to run it on GNU/Linux.)

--
אמיר אלישע אהרוני
Amir Elisha Aharoni

http://aharoni.wordpress.com

"We're living in pieces,
I want to live in peace." - T. Moore

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


wiki at marcusbuck

Jul 8, 2009, 5:33 AM

Post #9 of 9 (606 views)
Permalink
Re: Regular expressions searching [In reply to]

Micke Nordin hett schreven:
> Amir E. Aharoni <amir.aharoni <at> gmail.com> writes:
>
>
>> Mostly i was interested to find out whether there is an existing
>> script that searches dumps, instead of writing one myself.
>>
>>
>
> I wrote a c++-program that can search through a databasedump (with regexes),
> it's one of the first apps I wrote, so the source is not pretty to read. You can
> find it here:
>
> http://meta.wikimedia.org/wiki/User:Micke/WikiFind
>
> I would not implement it that way (and I would not write it in c++ :), if I was
> to write it to day, but if you are helped by it please go ahead and use it.
>
> Use the second version of the program if you want to pipe the dump in from bzcat.
>
> /Micke
>
I tried to install it, but I couldn't manage. If somebody has managed to
compile a Vista executable I would appreciate, if he could send me the file.

Thanks
Marcus Buck
User:Slomox

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Wikipedia wikitech RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.