
Mailing List Archive: Wikipedia: Wikitech

flexbisonparse

 

 


cyril.buttay at free

Aug 27, 2006, 3:14 PM

Post #1 of 15
flexbisonparse

Hi everyone,

I'm currently working on a (python) wiki to pdf converter, based on
wiki2pdf, which is no longer actively maintained.

One of the -big- modifications to this script is to replace the
all-python parser by a combination of wiki->xml, using flexbisonparse,
and xml-> pdf, using parts of wiki2pdf. This was proposed by one of the
developers of wiki2pdf to make it a bit... more maintainable.

At the moment, things work fine, except for 2 things with flexbisonparse:
1- the external urls are not handled, and are included verbatim in the
output.
2- some people use <ref name="XX"> </ref> and <ref name="XX"/> to make
references, but flexbisonparse escapes these into &lt;ref
name="XX"&gt;, which is not very convenient.

I understand that the first problem is somewhat difficult to manage, as
many cases have to be analysed, but I have the impression that the second
is simpler. However, I'm not (at all) an expert in flex and bison, and
have the biggest difficulties understanding the code of flexbisonparse.

Is there a developer of this parser around? Do you think the
modifications are feasible?

Also, I proposed a patch for minor modifications of the code on
bugzilla, but I don't know if it is the right place to do that.
http://bugzilla.wikimedia.org/show_bug.cgi?id=7001

Regards

Cyril

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] wikimedia
http://mail.wikipedia.org/mailman/listinfo/wikitech-l


jra at baylink

Aug 27, 2006, 3:29 PM

Post #2 of 15
Re: flexbisonparse

On Sun, Aug 27, 2006 at 11:14:51PM +0100, Buttay cyril wrote:
> I'm currently working on a (python) wiki to pdf converter, based on
> wiki2pdf, which is no longer actively maintained.

My hobby horse is "Get The Glue Right", which leads me to ask:

would it not be {easier,more useful} to direct effort towards
wikitext2docbook? Doesn't docbook already know how to get to PDF?

/derail

Cheers,
-- jra
--
Jay R. Ashworth jra [at] baylink
Designer Baylink RFC 2100
Ashworth & Associates The Things I Think '87 e24
St Petersburg FL USA http://baylink.pitas.com +1 727 647 1274

The Internet: We paved paradise, and put up a snarking lot.


cyril.buttay at free

Aug 27, 2006, 4:01 PM

Post #3 of 15
Re: flexbisonparse

Jay R. Ashworth wrote:
> On Sun, Aug 27, 2006 at 11:14:51PM +0100, Buttay cyril wrote:
>
>> I'm currently working on a (python) wiki to pdf converter, based on
>> wiki2pdf, which is no longer actively maintained.
>>
>
> My hobby horse is "Get The Glue Right", which leads me to ask:
>
> would it not be {easier,more useful} to direct effort towards
> wikitext2docbook? Doesn't docbook already know how to get to PDF?
>
Well, I picked one of the existing projects (wiki2pdf) for the following
reasons:
1- it uses python, which is one of the only languages I am nearly
comfortable with (I'm not a programmer)
2- the objective for me is to create a wikibook-to-LaTeX converter. I
plan to keep this piece of code client-side, because I know that even a
really good parser will need some tweaking of the LaTeX source to
produce a good pdf on something as long as a wikibook. Among the
features of the program is the automatic download of images and wiki
pages, something I can easily do with python
3- the list of alternative parsers (
http://meta.wikimedia.org/wiki/Alternative_parsers ) does not mention
wikitext2docbook, and says that flexbisonparse is "Intended as an
eventual replacement to the parsing code inside MediaWiki itself", which
is rather promising!

On the other hand, the tests I made with docbook were not very good from
a typographic point of view (I think the docbook to pdf conversion uses
LaTeX, but the stylesheets are oriented towards automation rather than
quality). I will have a look at wikitext2docbook, though.

Regards

Cyril
> /derail
>
> Cheers,
> -- jra
>



jra at baylink

Aug 27, 2006, 4:08 PM

Post #4 of 15
Re: flexbisonparse

On Mon, Aug 28, 2006 at 12:01:00AM +0100, Buttay cyril wrote:
> Jay R. Ashworth wrote:
> > On Sun, Aug 27, 2006 at 11:14:51PM +0100, Buttay cyril wrote:
> >
> >> I'm currently working on a (python) wiki to pdf converter, based on
> >> wiki2pdf, which is no longer actively maintained.
> >>
> >
> > My hobby horse is "Get The Glue Right", which leads me to ask:
> >
> > would it not be {easier,more useful} to direct effort towards
> > wikitext2docbook? Doesn't docbook already know how to get to PDF?
>
> Well, I picked one of the existing projects (wiki2pdf) for the following
> reasons:
> 1- it uses python, which is one of the only languages I am nearly
> comfortable with (I'm not a programmer)
> 2- the objective for me is to create a wikibook-to-LaTeX converter. I
> plan to keep this piece of code client-side, because I know that even a
> really good parser will need some tweaking of the LaTeX source to
> produce a good pdf on something as long as a wikibook. Among the
> features of the program is the automatic download of images and wiki
> pages, something I can easily do with python

Indeed, and I vote for client side, as well.

> 3- the list of alternative parsers (
> http://meta.wikimedia.org/wiki/Alternative_parsers ) does not mention
> wikitext2docbook, and says that flexbisonparse is "Intended as an
> eventual replacement to the parsing code inside MediaWiki itself", which
> is rather promising!

I don't know that that is what Magnus is calling it, but that's what it
does. I forget what language he's doing it in. Check the list
archives; he's mentioned it here in the last couple of months (and may
well chime in here).

> On the other hand, the tests I made with docbook were not very good from
> a typographic point of view (I think the docbook to pdf conversion uses
> LaTeX, but the stylesheets are oriented towards automation rather than
> quality). I will have a look at wikitext2docbook, though.

Worth a couple minutes, at least, I would think. My perception of it
is that if someone's going to put work into fixing the typography, the
more people who can benefit from that, the better. Hence, I kibitz.

Cheers,
-- jra


eastor1 at swarthmore

Aug 27, 2006, 7:15 PM

Post #5 of 15
Re: flexbisonparse

Jay R. Ashworth wrote:
> On Mon, Aug 28, 2006 at 12:01:00AM +0100, Buttay cyril wrote:
> > 3- the list of alternative parsers (
> > http://meta.wikimedia.org/wiki/Alternative_parsers ) does not mention
> > wikitext2docbook, and says that flexbisonparse is "Intended as an
> > eventual replacement to the parsing code inside MediaWiki itself",
> > which is rather promising!
>
> I don't know that that is what Magnus is calling it, but
> that's what it does. I forget what language he's doing it
> in. Check the list archives; he's mentioned it here in the
> last couple of months (and may well chime in here).

As someone who's been playing with alternative parsers (though not Magnus),
I'm pretty sure the flexbisonparse project is currently dead. Magnus moved
to his wiki2xml project (also available in the MediaWiki repository), which
is actually coded in PHP. As far as I know, though, it's the single most
feature-complete alternative parser we have. Not claiming it's perfect, but
it's good... I haven't worked with flexbisonparse, though, so maybe it's
better than I know.

I've actually been working on a Python-based wikitext parser, using some
techniques that should make the system a bit faster and cleaner... With a
lot of luck, I should start making progress on that again in the next month
or so.

For anyone who cares, I'll probably be trying to implement a PEG-based
parser using mxTextTools, since I think that should be able to parse all of
MediaWiki's wikitext, and should be about twice as fast as the current
Parser.php (which is about as fast as wiki2xml)... Or I might just end up
using ANTLR, if I can bully my current semi-grammar into working in that
framework... If anyone knows of a decent PEG parser with a Python API (a
packrat parser might be ideal), that'd be great too. *shrugs*

- Eric Astor



timwi at gmx

Aug 28, 2006, 5:10 AM

Post #6 of 15
Re: flexbisonparse

Jay R. Ashworth wrote:
>
>> 3- the list of alternative parsers (
>> http://meta.wikimedia.org/wiki/Alternative_parsers ) does not
>> mention wikitext2docbook, and says that flexbisonparse is "Intended
>> as an eventual replacement to the parsing code inside MediaWiki
>> itself", which is rather promising!
>
> I don't know that that is what Magnus is calling it, but

Where and why did this misconception arise that flexbisonparse was
written by Magnus? Quite honestly, it is driving me nuts...

Timwi



timwi at gmx

Aug 28, 2006, 5:17 AM

Post #7 of 15
Re: flexbisonparse

Hi,

> At the moment, things work fine, except for 2 things with flexbisonparse:

You really think these two problems you have listed are the only ones?
:-) I assure you there are heaps of other problems with it.

> However, I'm not (at all) an expert in flex and bison, and
> have the biggest difficulties understanding the code of flexbisonparse.

I find this quite amazing. I am not, and have never been, an "expert"
in flex and bison either, and I do agree that the code is not easy, but
is it really that much harder than MediaWiki itself?

> Is there a developer of this parser around? Do you think the
> modifications are feasible?

Yes, the modifications are feasible, but in order to implement them, you
need to find someone who has the motivation to do that. I am currently
not very motivated myself.

Timwi



magnus.manske at web

Aug 28, 2006, 5:21 AM

Post #8 of 15
Re: flexbisonparse

Timwi wrote:
> Jay R. Ashworth wrote:
>
>>> 3- the list of alternative parsers (
>>> http://meta.wikimedia.org/wiki/Alternative_parsers ) does not
>>> mention wikitext2docbook, and says that flexbisonparse is "Intended
>>> as an eventual replacement to the parsing code inside MediaWiki
>>> itself", which is rather promising!
>>>
>> I don't know that that is what Magnus is calling it, but
>>
>
> Where and why did this misconception arise that flexbisonparse was
> written by Magnus? Quite honestly, it is driving me nuts...
>
Official clarification: flexbisonparse was written by Timwi, and Timwi
alone :-)

I had a look at it once, and didn't find my way through the flex jungle,
so I gave up quickly. I did, however, base the XML of wiki2xml on the
flexbisonparse output; they're not identical, however.

Magnus


cyril.buttay at free

Aug 28, 2006, 8:21 AM

Post #9 of 15
Re: flexbisonparse

Magnus Manske wrote:
> Official clarification: flexbisonparse was written by Timwi, and Timwi
> alone :-)
>
> I had a look at it once, and didn't find my way through the flex jungle,
> so I gave up quickly. I did, however, base the XML of wiki2xml on the
> flexbisonparse output; they're not identical, however.
>
>
I've just given wiki2xml a go on a big page (
http://en.wikipedia.org/wiki/The_Adventures_of_Tintin ) that is full of
references, and I noticed that there is a problem with some of them:
the following wikicode:
<ref name="Farr">{{cite journal | last =Farr | first =Michael |
authorlink =Michael Farr | coauthors = | year =2004 | month =March |
title =Thundering Typhoons | journal =History Today | volume =54 | issue
=3 | pages =62 | id = | url = | format = | accessdate = }}</ref>

is translated as:
<extension extension_name="ref" name="Farr">
<xhtml:cite style="font-style:normal">.</xhtml:cite>
</extension>

As you can see, there is quite a bit of missing information.

At the moment, I think I'll carry on with flexbisonparse, adding some
python patches to correct the output. Maybe later I'll switch to
wiki2xml instead (although it is a bit slower than flexbisonparse, to say
the least). This shouldn't be too difficult as they both use some dialect
of XML.
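
A Python patch of the kind mentioned here could look roughly like the sketch below. It assumes the converter's output contains the ref markers escaped with the standard XML entities (&lt; and &gt;, possibly &quot; around the name); the real flexbisonparse output may differ, and restore_ref_tags is a hypothetical helper name, not part of any existing tool.

```python
import re

# Assumption: escaped markers look like &lt;ref name="XX"&gt;, &lt;/ref&gt;
# or &lt;ref name="XX"/&gt; in the converter's XML output.
ESCAPED_REF = re.compile(
    r'&lt;(?P<close>/?)ref'
    r'(?P<attrs>(?:\s+name=(?:"|&quot;)[^"&<>]*(?:"|&quot;))?)'
    r'\s*(?P<empty>/?)&gt;'
)


def restore_ref_tags(text):
    """Turn escaped <ref>, </ref> and <ref/> markers back into real tags."""
    def fix(match):
        # Un-escape the quotes around the name attribute, if any.
        attrs = match.group("attrs").replace("&quot;", '"')
        return "<{}ref{}{}>".format(
            match.group("close"), attrs, match.group("empty"))
    return ESCAPED_REF.sub(fix, text)


sample = ('Text&lt;ref name="Farr"&gt;History Today&lt;/ref&gt; '
          'and again&lt;ref name="Farr"/&gt;')
print(restore_ref_tags(sample))
```

Other escaped markup (for example a literal &lt;reference&gt; in the text) is left untouched, since the pattern only accepts ref with an optional name attribute.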

Concerning docbook, I'll also have to give it a try, but one of my
concerns is that (as far as I know) there is no way to give specific
formatting instructions, which, IMO is mandatory for a nice print
output. I have nothing against semantic description, but sometimes you
have to fine-tune some specific part (figure position, alignment
tolerance...). I'm sure those who use LaTeX intensively will understand...

Cyril


jra at baylink

Aug 28, 2006, 8:30 AM

Post #10 of 15
Re: flexbisonparse

On Mon, Aug 28, 2006 at 02:21:46PM +0200, Magnus Manske wrote:
> Timwi wrote:
> > Jay R. Ashworth wrote:
> >
> >>> 3- the list of alternative parsers (
> >>> http://meta.wikimedia.org/wiki/Alternative_parsers ) does not
> >>> mention wikitext2docbook, and says that flexbisonparse is "Intended
> >>> as an eventual replacement to the parsing code inside MediaWiki
> >>> itself", which is rather promising!
> >>>
> >> I don't know that that is what Magnus is calling it, but
> >>
> >
> > Where and why did this misconception arise that flexbisonparse was
> > written by Magnus? Quite honestly, it is driving me nuts...
> >
> Official clarification: flexbisonparse was written by Timwi, and Timwi
> alone :-)

And I wasn't suggesting otherwise.

> I had a look at it once, and didn't find my way through the flex jungle,
> so I gave up quickly. I did, however, base the XML of wiki2xml on the
> flexbisonparse output; they're not identical, however.

wiki2xml: that's what you call it.

Yeah: that.

Cheers,
-- jra


jra at baylink

Aug 28, 2006, 8:33 AM

Post #11 of 15
Re: flexbisonparse

On Mon, Aug 28, 2006 at 04:21:41PM +0100, Buttay cyril wrote:
> At the moment, I think I'll carry on with flexbisonparse, adding some
> python patches to correct the output. Maybe later I'll switch to
> wiki2xml instead (although it is a bit slower than flexbisonparse, to say
> the least). This shouldn't be too difficult as they both use some dialect
> of XML.

Well, if we end up with a standalone parser that gives output that can
be transliterated to DocBook, I'll be happy, no matter who wrote it. :-)

> Concerning docbook, I'll also have to give it a try, but one of my
> concerns is that (as far as I know) there is no way to give specific
> formatting instructions, which, IMO is mandatory for a nice print
> output. I have nothing against semantic description, but sometimes you
> have to fine-tune some specific part (figure position, alignment
> tolerance...). I'm sure those who use LaTeX intensively will understand...

Yeah, it's called a stylesheet, and it's the responsibility of the
person who needs a specific kind of final output. Wiring it into the
parser/converter would be A Bad Design. Been dealing with them since
Ventura Publisher 3.1...

Cheers,
-- jra


neil at tonal

Aug 28, 2006, 8:38 AM

Post #12 of 15
Re: flexbisonparse

Eric Astor wrote:
>
> I've actually been working on a Python-based wikitext parser, using some
> techniques that should make the system a bit faster and cleaner... With a
> lot of luck, I should start making progress on that again in the next month
> or so.
>
> For anyone who cares, I'll probably be trying to implement a PEG-based
> parser using mxTextTools, since I think that should be able to parse all of
> MediaWiki's wikitext, and should be about twice as fast as the current
> Parser.php (which is about as fast as wiki2xml)... Or I might just end up
> using ANTLR, if I can bully my current semi-grammar into working in that
> framework... If anyone knows of a decent PEG parser with a Python API (a
> packrat parser might be ideal), that'd be great too. *shrugs*
>
> - Eric Astor
>
>
Yes! I also believe that PEGs and [[packrat parser]]s are the way to go
with parsing wikitext, because of the very ad-hoc definition of wikitext.

A basic packrat parser is pretty easy to implement; it's simply a
brute-force recursive-descent parser with memoization of (offset, term)
-> production mappings. Scheme is a pretty good language to write a
packrat parser in, since the grammar itself can be written as an
S-expression, and is easy to use for program transformation (see below).

A simple implementation just interprets the grammar tree, matching as it
goes.

You can achieve considerable speedups by:

1 using the grammar to generate code, and compiling and executing that
instead of interpreting the grammar by hand

2 allowing the grammar to contain both PEG expressions and regexps for
low-level lexical matching: regexps will be at least an order of
magnitude faster than even compiled PEGs for matching low-level lexical
tokens like numbers and names, without removing the ability of PEGs to
blur the distinction between lexical and syntactic analysis, which is
important for parsing strange things like wikitext.
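
To make the above concrete, here is a minimal sketch in Python (the language discussed elsewhere in this thread) of an interpreting packrat parser: the grammar is a tree of nested tuples, results are memoized per (offset, rule name), and leaves may be regexps. The node vocabulary ("re", "lit", "seq", "alt", "star", "rule") and the toy arithmetic grammar are illustrative assumptions, not taken from any existing wikitext grammar.

```python
import re

# Grammar nodes are nested tuples, loosely mirroring an S-expression:
#   ("re", pattern)       regular-expression terminal
#   ("lit", text)         literal terminal
#   ("seq", e1, e2, ...)  sequence
#   ("alt", e1, e2, ...)  ordered choice (first match wins)
#   ("star", e)           zero or more repetitions
#   ("rule", name)        reference to a named rule
GRAMMAR = {
    "expr": ("seq", ("rule", "term"),
             ("star", ("seq", ("re", r"[+-]"), ("rule", "term")))),
    "term": ("alt", ("rule", "number"),
             ("seq", ("lit", "("), ("rule", "expr"), ("lit", ")"))),
    "number": ("re", r"[0-9]+"),
}

FAIL = None  # a successful match is always a (tree, next_offset) tuple


class Packrat:
    def __init__(self, grammar, text):
        self.grammar = grammar
        self.text = text
        self.memo = {}  # (offset, rule name) -> (tree, next_offset) or FAIL

    def apply_rule(self, name, pos):
        key = (pos, name)
        if key not in self.memo:  # each rule is tried once per offset
            result = self.match(self.grammar[name], pos)
            if result is not FAIL:
                result = ((name, result[0]), result[1])
            self.memo[key] = result
        return self.memo[key]

    def match(self, node, pos):
        kind = node[0]
        if kind == "lit":
            if self.text.startswith(node[1], pos):
                return node[1], pos + len(node[1])
            return FAIL
        if kind == "re":
            m = re.compile(node[1]).match(self.text, pos)
            return (m.group(0), m.end()) if m else FAIL
        if kind == "seq":
            parts = []
            for child in node[1:]:
                result = self.match(child, pos)
                if result is FAIL:
                    return FAIL
                tree, pos = result
                parts.append(tree)
            return parts, pos
        if kind == "alt":
            for child in node[1:]:
                result = self.match(child, pos)
                if result is not FAIL:
                    return result
            return FAIL
        if kind == "star":
            parts = []
            while True:
                result = self.match(node[1], pos)
                # Stop on failure or on an empty match (avoids looping).
                if result is FAIL or result[1] == pos:
                    return parts, pos
                tree, pos = result
                parts.append(tree)
        if kind == "rule":
            return self.apply_rule(node[1], pos)
        raise ValueError("unknown grammar node: %r" % (kind,))


def parse(grammar, start, text):
    """Return the parse tree, or None unless the whole text matches."""
    result = Packrat(grammar, text).apply_rule(start, 0)
    if result is FAIL or result[1] != len(text):
        return None
    return result[0]


print(parse(GRAMMAR, "expr", "1+(2-3)"))
```

This is the slow, interpreted variant; generating code from the same tuples would be the first speedup listed above.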

I've implemented packrat parsing in both Python and Scheme: Scheme was
faster, and ultimately more natural.

The one awkward bit is left-recursion removal, which breaks packrat
parsers unless you alter the grammar to an equivalent form without left
recursion. I did it by hand on my input grammars, but it could easily be
done programmatically at grammar-generation time.
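
As a rough illustration of doing that rewrite in code, the sketch below handles only direct left recursion, turning A <- A a1 / A a2 / b into A <- b (a1 / a2)*. The rule representation (a list of alternatives, each a list of symbol names) and the function name are assumptions made for the example.

```python
def remove_direct_left_recursion(name, alternatives):
    """Rewrite  A <- A a1 / A a2 / b1 / b2  as  A <- (b1 / b2) (a1 / a2)*.

    `alternatives` is a list of alternatives, each a list of symbol names.
    Indirect left recursion (A <- B ..., B <- A ...) is not handled.
    """
    # Tails of the alternatives that start with the rule itself.
    recursive = [alt[1:] for alt in alternatives if alt[:1] == [name]]
    # Alternatives that do not start with the rule itself.
    bases = [alt for alt in alternatives if alt[:1] != [name]]
    if not recursive:
        return ("alt", bases)  # nothing to rewrite
    if not bases or any(len(tail) == 0 for tail in recursive):
        raise ValueError("rule %r can never terminate" % name)
    return ("seq", ("alt", bases), ("star", ("alt", recursive)))


rule = [["expr", "+", "term"], ["expr", "-", "term"], ["term"]]
print(remove_direct_left_recursion("expr", rule))
# ('seq', ('alt', [['term']]), ('star', ('alt', [['+', 'term'], ['-', 'term']])))
```

The rewritten rule matches the same strings but yields a flat list of repetitions, so a left-nested tree has to be rebuilt afterwards if associativity matters.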

I'm not sure about the best way to implement an API. Have you considered
just using the parser to convert from wikitext to something like PYX?
It is a very simple-to-parse, Python-friendly representation of an XML
data structure, and it can then be used to build an in-core parse tree,
to drive something like a SAX API, or for whatever other form of
post-processing you like (for example, direct procedural text-to-text
generation, which could be very simple indeed, since the output of a
successful parse is guaranteed by definition to _exactly_ conform to the
grammar specification).
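
For illustration, emitting PYX from a parse tree might look like the sketch below. PYX is line-oriented: "(" opens an element, "A" gives an attribute, "-" carries character data and ")" closes an element. The (tag, attrs, children) tuple layout and the "paragraph" element name are assumptions for the example, and escaping backslashes as well as newlines follows the ESIS convention rather than anything stated in this thread.

```python
def _escape(text):
    # Newlines must not appear inside a PYX line; backslash escaping
    # is an assumption borrowed from ESIS.
    return text.replace("\\", "\\\\").replace("\n", "\\n")


def to_pyx(node):
    """Yield PYX lines for a (tag, attrs, children) tuple tree.

    Children are either plain strings (character data) or nested tuples.
    """
    if isinstance(node, str):
        yield "-" + _escape(node)
        return
    tag, attrs, children = node
    yield "(" + tag
    for name in sorted(attrs):
        yield "A%s %s" % (name, _escape(attrs[name]))
    for child in children:
        for line in to_pyx(child):
            yield line
    yield ")" + tag


tree = ("paragraph", {}, [
    "Some text",
    ("extension", {"extension_name": "ref", "name": "Farr"},
     ["Thundering Typhoons"]),
])
print("\n".join(to_pyx(tree)))
```

Because every line starts with a one-character type code, a consumer can dispatch on line[0] and never needs a full XML parser.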

-- Neil

[PYX: http://www.xml.com/pub/a/2000/03/15/feature/index.html]




lars at aronsson

Aug 28, 2006, 2:51 PM

Post #13 of 15
Re: flexbisonparse

Magnus Manske wrote:

> Official clarification: flexbisonparse was written by Timwi, and
> Timwi alone :-)

The important thing is that it is "a piece of German engineering".
You might not realize how strong this brand is in the
English-speaking world. Perhaps we should add that line just under
"the free encyclopedia" in the logotype?

Number of Google hits reported for:

976,000 "German engineering"
Just look at the distance to No. 2.
273,000 "American engineering"
259,000 "Texas engineering"
147,000 "Michigan engineering"
119,000 "Swiss engineering"
110,000 "Canadian engineering"
104,000 "British engineering"
76,000 "European engineering"
74,700 "French engineering"
71,000 "Florida engineering"
58,600 "California engineering"
56,700 "Indian engineering"
40,200 "Swedish engineering"
37,500 "Chinese engineering"
35,900 "Utah engineering"
33,600 "Italian engineering"
33,400 "English engineering"
33,200 "Japanese engineering"
31,300 "Minnesota engineering"
31,100 "New York engineering"
29,800 "Washington engineering"
27,200 "Thai engineering"
26,200 "Russian engineering"
25,200 "Scottish engineering"
24,000 "Pennsylvania engineering"
21,700 "Oklahoma engineering"
19,600 "Arizona engineering"
19,500 "Ohio engineering"
18,500 "Alabama engineering"
17,600 "Norwegian engineering"
16,400 "Oregon engineering"
16,200 "New Jersey engineering"
13,700 "Dutch engineering"
13,200 "Connecticut engineering"
12,300 "Nevada engineering"
11,000 "Danish engineering"
10,600 "Egyptian engineering"
9,780 "Spanish engineering"
9,670 "Finnish engineering"
891 "Slovak engineering"
806 "Massachusetts engineering"
723 "Austrian engineering"
681 "Czech engineering"
650 "Israeli engineering"
544 "Ukrainian engineering"
529 "Iranian engineering"
485 "Greek engineering"
478 "Mexican engineering"
451 "Hungarian engineering"
432 "Polish engineering"
388 "Portuguese engineering"
325 "Scandinavian engineering"
269 "Welsh engineering"
223 "Belgian engineering"


--
Lars Aronsson (lars [at] aronsson)
Aronsson Datateknik - http://aronsson.se


robchur at gmail

Aug 28, 2006, 3:07 PM

Post #14 of 15
Re: flexbisonparse

On 28/08/06, Lars Aronsson <lars [at] aronsson> wrote:
> Magnus Manske wrote:
>
> > Official clarification: flexbisonparse was written by Timwi, and
> > Timwi alone :-)
>
> The important thing is that it is "a piece of German engineering".
> You might not realize how strong this brand is in the
> English-speaking world. Perhaps we should add that line just under
> "the free encyclopedia" in the logotype?

Flexbisonparse isn't used on Wikipedia, though. However, MediaWiki
started life as a script written by a German biology student...so...
;)


Rob Church


jra at baylink

Aug 28, 2006, 3:38 PM

Post #15 of 15
Re: flexbisonparse

On Mon, Aug 28, 2006 at 11:51:14PM +0200, Lars Aronsson wrote:
> Magnus Manske wrote:
> > Official clarification: flexbisonparse was written by Timwi, and
> > Timwi alone :-)
>
> The important thing is that it is "a piece of German engineering".

Works for me.

Cheers,
-- jr "'87 e24" a
--
Jay R. Ashworth jra [at] baylink
Designer Baylink RFC 2100
Ashworth & Associates The Things I Think '87 e24
St Petersburg FL USA http://baylink.pitas.com +1 727 647 1274

The Internet: We paved paradise, and put up a snarking lot.
