
Mailing List Archive: Python: Dev

Fixing the XML batteries

 

 



stefan_ml at behnel

Dec 9, 2011, 12:02 AM

Post #1 of 47 (606 views)
Fixing the XML batteries

Hi everyone,

I think Py3.3 would be a good milestone for cleaning up the stdlib support
for XML. Note upfront: you may or may not know me as the maintainer of
lxml, the de-facto non-stdlib standard Python XML tool. This (lengthy) post
was triggered by the following kind of conversation that I keep having with
new XML users in Python (mostly on c.l.py), which hints at some serious
flaw in the stdlib.

User: I'm trying to do XML stuff XYZ in Python and have problem ABC.
Me: What library are you using? Could you show us some code?
User: My code looks like this snippet: ...
Me: You are using minidom which is known to be hard to use, slow and uses
lots of memory. Use the xml.etree.ElementTree package instead, or rather
its C implementation cElementTree, also in the stdlib.
User (coming back after a while): thanks, that was exactly what [I didn't
know] I was looking for.

What does this tell us?

1) MiniDOM is what new users find first. It's highly visible because there
are still lots of ancient "Python and XML" web pages out there that date
back to the time before Python 2.5 (or rather something like 2.2), when
it was the only XML tree library in the stdlib. It's also the first hit
from the top when you search for "XML" on the stdlib docs page and contains
the (to some people) familiar word "DOM", which lets users stop their
search and start writing code, not expecting to find a separate alternative
in the same stdlib, way further down. And the description as "mini",
"simple" and "lightweight" suggests to users that it's going to be easy to
use and efficient.

2) MiniDOM is not what users want. It leads to complicated, unpythonic code
and lots of problems. It is neither easy to use nor efficient, and it is
not "lightweight", "simple" or "mini" either, at least not in absolute numbers (see
http://bugs.python.org/issue11379#msg148584 and following for a recent
discussion). It's also badly maintained in the sense that its performance
characteristics could likely be improved, but no-one is seriously
interested in doing that, because it would not lead to something that
actually *is* fast or memory friendly compared to any of the 'real'
alternatives that are available right now.

3) ElementTree is what users should use; MiniDOM is not. ElementTree was
added to the stdlib in Py2.5 by popular demand, precisely because it is very
easy to use, very fast, and very memory friendly, and because users did not
want to use MiniDOM any more. Today, ElementTree has a fairly straightforward
upgrade path towards lxml.etree if more XML features like validation or
XSLT are needed. MiniDOM has nothing like that to offer. It's a dead end.

4) In the stdlib, cElementTree is independent of ElementTree, but totally
hidden in the documentation. In conversations like the above, it's
unnecessarily complex to explain to users that there is ElementTree (which
is documented in the stdlib), but that what they want to use is really
cElementTree, which has the same API but does not have a stdlib
documentation page that I can send them to. Note that the other Python
implementations simply provide cElementTree as an alias for ElementTree.
That leaves CPython as the only Python implementation that really has these
two separate modules.
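
The workaround that point 4 alludes to is usually an import fallback. A minimal sketch, assuming only the two stdlib module names mentioned above; the sample document is made up for illustration:

```python
# Prefer the C implementation where it exists as a separate module,
# and fall back to the pure-Python package otherwise.
try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET

# Illustrative input, not taken from any real project.
doc = ET.fromstring(
    "<root><item id='1'>first</item><item id='2'>second</item></root>"
)

# The API is the same whichever module was imported.
ids = [item.get("id") for item in doc.findall("item")]
texts = [item.text for item in doc.findall("item")]
print(ids, texts)
```

The rest of the code only ever refers to `ET`, which is why the two modules having the same API matters so much here.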

So, there are many problems here. And I think they make it unnecessarily
complicated for users to process XML in Python and that the current
situation helps in turning away new users from Python as a language for XML
processing. Python does have impressively great tools for working with XML.
It's just that the stdlib and its documentation do not reflect or even
appreciate that.

What should change?

a) The stdlib documentation should help users to choose the right tool
right from the start. Instead of using the totally misleading wording that
it uses now, it should be honest about the performance characteristics of
MiniDOM and should actively suggest that those who don't know what to
choose (or even *that* they can choose) should not use MiniDOM in the first
place. I created a ticket (issue11379) for a minor step in this direction,
but given the responses, I'm rather convinced that there's a lot more that
can be done and should be done, and that it should be done now, right for
the next release.

b) cElementTree should finally lose its "special" status as a separate
library and disappear as an accelerator module behind ElementTree. This has
been suggested a couple of times already, and AFAIR, there was some
opposition because 1) ET was maintained outside of the stdlib and 2) the
APIs of both were not identical. However, getting ET 1.3 into Py2.7 and 3.2
was a U-turn. Today, ET is *only* being maintained in the stdlib by Florent
Xicluna (who is doing a good job with it), and ET 1.3 has basically made
the APIs of both implementations compatible again. So, 3.3 would be the
right milestone for fixing the "two libs for one" quirk.
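
To make "disappear as an accelerator module" concrete, here is a rough sketch of the general pattern only. The helper `pick_implementation`, the class `PyElement` and the module name `_fast_tree` are all hypothetical and say nothing about how the stdlib would actually wire this up:

```python
import importlib


class PyElement:
    """Stand-in for a pure-Python implementation (hypothetical)."""

    def __init__(self, tag):
        self.tag = tag


def pick_implementation(accelerator_module, name, fallback):
    """Return `name` from the accelerator module if it can be imported,
    otherwise return the pure-Python fallback."""
    try:
        module = importlib.import_module(accelerator_module)
    except ImportError:
        return fallback
    return getattr(module, name, fallback)


# "_fast_tree" is a made-up module name, so this falls back to PyElement.
Element = pick_implementation("_fast_tree", "Element", PyElement)
root = Element("root")
print(type(root).__name__, root.tag)
```

Users would import a single public name and get the fast version whenever it is available, without having to know that two implementations exist.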

Given that this is the third time in the last couple of years that I have
suggested finally fixing the stdlib and its documentation, I won't provide
any further patches until it has been accepted that a) this is a
problem and b) it should be fixed, so that the patches can actually
serve a purpose. If we can agree on that, I'll happily help in making this
change happen.

Stefan

_______________________________________________
Python-Dev mailing list
Python-Dev [at] python
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/list-python-dev%40lists.gossamer-threads.com


martin at v

Dec 9, 2011, 12:41 AM

Post #2 of 47 (598 views)
Re: Fixing the XML batteries [In reply to]

> a) The stdlib documentation should help users to choose the right tool
> right from the start. Instead of using the totally misleading wording
> that it uses now, it should be honest about the performance
> characteristics of MiniDOM and should actively suggest that those who
> don't know what to choose (or even *that* they can choose) should not
> use MiniDOM in the first place.

I disagree. The right approach is not to document performance problems,
but to fix them.

> b) cElementTree should finally lose its "special" status as a separate
> library and disappear as an accelerator module behind ElementTree. This
> has been suggested a couple of times already, and AFAIR, there was some
> opposition because 1) ET was maintained outside of the stdlib and 2) the
> APIs of both were not identical. However, getting ET 1.3 into Py2.7 and
> 3.2 was a U-turn.

Unfortunately (?), there is a near-contract-like agreement with Fredrik
Lundh that any significant changes to ElementTree in the standard
library have to be agreed by him. So whatever change you plan: make sure
Fredrik gives his explicit support.

Regards,
Martin


stefan_ml at behnel

Dec 9, 2011, 12:59 AM

Post #3 of 47 (598 views)
Re: Fixing the XML batteries [In reply to]

"Martin v. Löwis", 09.12.2011 09:41:
>> a) The stdlib documentation should help users to choose the right tool
>> right from the start. Instead of using the totally misleading wording
>> that it uses now, it should be honest about the performance
>> characteristics of MiniDOM and should actively suggest that those who
>> don't know what to choose (or even *that* they can choose) should not
>> use MiniDOM in the first place.
>
> I disagree. The right approach is not to document performance problems,
> but to fix them.

Here's the relevant part of my mail that you stripped:

>> It's also badly maintained in the sense that its performance
>> characteristics could likely be improved, but no-one is seriously
>> interested in doing that, because it would not lead to something that
>> actually *is* fast or memory friendly compared to any of the 'real'
>> alternatives that are available right now.

I can't recall anyone working on any substantial improvements during the
last six years or so, and the reason for that seems obvious to me.


>> b) cElementTree should finally lose its "special" status as a separate
>> library and disappear as an accelerator module behind ElementTree. This
>> has been suggested a couple of times already, and AFAIR, there was some
>> opposition because 1) ET was maintained outside of the stdlib and 2) the
>> APIs of both were not identical. However, getting ET 1.3 into Py2.7 and
>> 3.2 was a U-turn.
>
> Unfortunately (?), there is a near-contract-like agreement with Fredrik
> Lundh that any significant changes to ElementTree in the standard
> library have to be agreed by him. So whatever change you plan: make sure
> Fredrik gives his explicit support.

Ok, I'll try to contact him.

Stefan



python-dev at masklinn

Dec 9, 2011, 1:09 AM

Post #4 of 47 (589 views)
Re: Fixing the XML batteries [In reply to]

On 2011-12-09, at 09:41 , Martin v. Löwis wrote:
>> a) The stdlib documentation should help users to choose the right tool
>> right from the start. Instead of using the totally misleading wording
>> that it uses now, it should be honest about the performance
>> characteristics of MiniDOM and should actively suggest that those who
>> don't know what to choose (or even *that* they can choose) should not
>> use MiniDOM in the first place.
>
> I disagree. The right approach is not to document performance problems,
> but to fix them.
Even if performance problems "should not be documented", I think Stefan's point that users should be steered away from minidom and towards ET and cET is completely valid and worthy of support. The *only* advantage minidom has over ET is that it uses an interface familiar to Java users[0]. They are about the only people using the actual W3C DOM: while the DOM exists in JavaScript, I'd say most code out there actively tries not to touch it with anything less than a 10-foot library pole like jQuery. That interface is also, of course, absolutely dreadful.

Minidom is inferior in interface flow and pythonicity, in terseness, in speed, in memory consumption (even more so using cElementTree, and that's not something which can be fixed unless minidom gets a C accelerator), etc… Even after fixing minidom (if anybody has the time and drive to commit to it), ET/cET should be preferred over it.

And that's not even considering the ease of switching to lxml (if only for validators), which Stefan outlined.

[0] Not 100% true now that I think about it: handling mixed content is simpler in minidom, as there is no .text/.tail duality and text nodes are nodes like every other. But I really can't think of another reason to prefer minidom.
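
For readers who have not run into the .text/.tail duality mentioned in [0], a small sketch of how the same mixed content looks in both libraries. The sample markup is made up; only stdlib calls are used:

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

source = "<p>Hello <b>big</b> world</p>"

# ElementTree: text before the first child is .text; text that follows
# a child element is stored on that child as .tail.
p = ET.fromstring(source)
b = p.find("b")
et_parts = [p.text, b.text, b.tail]

# minidom: every run of text is a child node, in document order.
dom_p = minidom.parseString(source).documentElement
dom_parts = [
    node.data if node.nodeType == node.TEXT_NODE else node.firstChild.data
    for node in dom_p.childNodes
]

print(et_parts)
print(dom_parts)
```

Both lists hold the same three strings, but in ElementTree the trailing " world" hangs off the `<b>` element rather than off `<p>`, which is the part people tend to trip over.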


ncoghlan at gmail

Dec 9, 2011, 1:10 AM

Post #5 of 47 (592 views)
Re: Fixing the XML batteries [In reply to]

On Fri, Dec 9, 2011 at 6:41 PM, "Martin v. Löwis" <martin [at] v> wrote:
>> a) The stdlib documentation should help users to choose the right tool
>> right from the start. Instead of using the totally misleading wording
>> that it uses now, it should be honest about the performance
>> characteristics of MiniDOM and should actively suggest that those who
>> don't know what to choose (or even *that* they can choose) should not
>> use MiniDOM in the first place.
>
> I disagree. The right approach is not to document performance problems,
> but to fix them.

When we offer a better way to do something that new users want to
do, we generally redirect them to the more recent alternative. I
believe the redirection from the getopt module to the argparse module
strikes the right tone for that kind of thing:
http://docs.python.org/library/getopt

For the various XML libraries, I'd suggest a message along the lines of
"Note: The <whatever> module is a <yada, yada, DOM based, whatever>. If all
you are trying to do is read and write XML files, consider using the
xml.etree.ElementTree module instead".

I'd also be +1 on adjusting the order of the XML pages in the main
index such that xml.etree.ElementTree appeared before xml.parsers.expat
and all the others slid down one entry.

These are simple changes that don't harm current users of the modules
in the least, while being up front and very helpful for beginners.
Again, I think argparse vs getopt is a good comparison: argparse
appears first in the main index, and there's a redirection from getopt
to argparse that says "if you don't have a specific reason to be using
getopt, you probably want argparse instead".

--
Nick Coghlan   |   ncoghlan [at] gmail   |   Brisbane, Australia


solipsis at pitrou

Dec 9, 2011, 1:15 AM

Post #6 of 47 (589 views)
Re: Fixing the XML batteries [In reply to]

Mostly uninformed +1 to Stefan's suggestions from me.

Regards

Antoine.


On Fri, 09 Dec 2011 09:02:35 +0100
Stefan Behnel <stefan_ml [at] behnel> wrote:
> Hi everyone,
>
> I think Py3.3 would be a good milestone for cleaning up the stdlib support
> for XML. Note upfront: you may or may not know me as the maintainer of
> lxml, the de-facto non-stdlib standard Python XML tool. This (lengthy) post
> was triggered by the following kind of conversation that I keep having with
> new XML users in Python (mostly on c.l.py), which hints at some serious
> flaw in the stdlib.
[etc.]




dirkjan at ochtman

Dec 9, 2011, 7:09 AM

Post #7 of 47 (581 views)
Re: Fixing the XML batteries [In reply to]

On Fri, Dec 9, 2011 at 09:02, Stefan Behnel <stefan_ml [at] behnel> wrote:
> a) The stdlib documentation should help users to choose the right tool right
> from the start.
> b) cElementTree should finally lose its "special" status as a separate
> library and disappear as an accelerator module behind ElementTree.

An at least somewhat informed +1 from me. The ElementTree API is a
very good way to deal with XML from Python, and it deserves to be
promoted over the included alternatives.

Let's deprecate the NiCad batteries and try to guide users toward the
Li-Ion ones.

Cheers,

Dirkjan


anacrolix at gmail

Dec 9, 2011, 9:02 AM

Post #8 of 47 (581 views)
Re: Fixing the XML batteries [In reply to]

+1

On Sat, Dec 10, 2011 at 2:09 AM, Dirkjan Ochtman <dirkjan [at] ochtman> wrote:
> On Fri, Dec 9, 2011 at 09:02, Stefan Behnel <stefan_ml [at] behnel> wrote:
>> a) The stdlib documentation should help users to choose the right tool right
>> from the start.
>> b) cElementTree should finally lose its "special" status as a separate
>> library and disappear as an accelerator module behind ElementTree.
>
> An at least somewhat informed +1 from me. The ElementTree API is a
> very good way to deal with XML from Python, and it deserves to be
> promoted over the included alternatives.
>
> Let's deprecate the NiCad batteries and try to guide users toward the
> Li-Ion ones.
>
> Cheers,
>
> Dirkjan



--
ಠ_ಠ


mwm at mired

Dec 9, 2011, 10:07 AM

Post #9 of 47 (580 views)
Re: Fixing the XML batteries [In reply to]

On Fri, 09 Dec 2011 09:02:35 +0100
Stefan Behnel <stefan_ml [at] behnel> wrote:

> a) The stdlib documentation should help users to choose the right
> tool right from the start.
> b) cElementTree should finally lose its "special" status as a
> separate library and disappear as an accelerator module behind
> ElementTree.

+1 and +1.

I've done a lot of xml work in Python, and unless you've got a
particular reason for wanting to use the dom, ElementTree is the only
sane way to go.

I recently converted a middling-sized app from using the dom to using
ElementTree, and wrote up some guidelines for the process for the
client. I can try to shake it out of my client's lawyers if it would
help with this or if others are interested.
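
To give a flavour of what such a conversion involves, here is the same small lookup written against both APIs. This is only an illustrative sketch with a made-up document, not an excerpt from the guidelines mentioned above:

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

source = (
    "<library>"
    "<book year='2001'><title>A</title></book>"
    "<book year='2011'><title>B</title></book>"
    "</library>"
)

# minidom: walk nodes explicitly and dig the text out of a text node.
dom = minidom.parseString(source)
dom_titles = []
for book in dom.getElementsByTagName("book"):
    if book.getAttribute("year") == "2011":
        title = book.getElementsByTagName("title")[0]
        dom_titles.append(title.firstChild.data)

# ElementTree: elements behave like containers with simple accessors.
root = ET.fromstring(source)
et_titles = [
    book.findtext("title")
    for book in root.findall("book")
    if book.get("year") == "2011"
]

print(dom_titles, et_titles)
```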

<mike


janssen at parc

Dec 9, 2011, 10:15 AM

Post #10 of 47 (582 views)
Re: Fixing the XML batteries [In reply to]

Mike Meyer <mwm [at] mired> wrote:

> On Fri, 09 Dec 2011 09:02:35 +0100
> Stefan Behnel <stefan_ml [at] behnel> wrote:
>
> > a) The stdlib documentation should help users to choose the right
> > tool right from the start.
> > b) cElementTree should finally lose its "special" status as a
> > separate library and disappear as an accelerator module behind
> > ElementTree.
>
> +1 and +1.
>
> I've done a lot of xml work in Python, and unless you've got a
> particular reason for wanting to use the dom, ElementTree is the only
> sane way to go.

I use ElementTree for parsing valid XML, but minidom for producing it.
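
For comparison, producing XML with ElementTree alone can be quite short. A minimal sketch with made-up element names, assuming Python 3's `encoding="unicode"` option to get a `str` back:

```python
import xml.etree.ElementTree as ET

# Build a small tree; keyword arguments to SubElement become attributes.
root = ET.Element("root")
item = ET.SubElement(root, "item", id="1")
item.text = "hello & goodbye"

# Serialize to a string; special characters in text are escaped.
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```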

I think another thing that might go into "refreshing the batteries" is a
feature comparison of BeautifulSoup and HTML5lib against the stdlib
competition, to see what needs to be added/revised. Having to switch to
an outside package for parsing possibly invalid HTML is a pain.

Bill


p.f.moore at gmail

Dec 9, 2011, 10:24 AM

Post #11 of 47 (576 views)
Re: Fixing the XML batteries [In reply to]

On 9 December 2011 18:15, Bill Janssen <janssen [at] parc> wrote:
> I use ElementTree for parsing valid XML, but minidom for producing it.
>
> I think another thing that might go into "refreshing the batteries" is a
> feature comparison of BeautifulSoup and HTML5lib against the stdlib
> competition, to see what needs to be added/revised.  Having to switch to
> an outside package for parsing possibly invalid HTML is a pain.

For what little use I make of XML/HTML parsing, I use lxml, simply
because it has a parser that covers the sort of HTML I have to deal
with in real life. As I have lxml installed, I use it for any XML
parsing tasks, just because I'm used to it.

Paul


python-dev at masklinn

Dec 9, 2011, 10:39 AM

Post #12 of 47 (578 views)
Re: Fixing the XML batteries [In reply to]

On 2011-12-09, at 19:15 , Bill Janssen wrote:
> I use ElementTree for parsing valid XML, but minidom for producing it.
Could you expand on your reasons to use minidom for producing XML?


janssen at parc

Dec 9, 2011, 11:33 AM

Post #13 of 47 (576 views)
Re: Fixing the XML batteries [In reply to]

Xavier Morel <python-dev [at] masklinn> wrote:

> On 2011-12-09, at 19:15 , Bill Janssen wrote:
> > I use ElementTree for parsing valid XML, but minidom for producing it.
> Could you expand on your reasons to use minidom for producing XML?

Inertia, I guess. I tried that first, and it seems to work.

I tend to use html5lib and/or BeautifulSoup instead of ElementTree, and
that's mainly because I find the documentation for ElementTree is
confusing and partial and inconsistent. Having various undated but
obsolete tutorials and documentation still up on effbot.org doesn't
help.


Bill


anacrolix at gmail

Dec 9, 2011, 2:43 PM

Post #14 of 47 (576 views)
Re: Fixing the XML batteries [In reply to]

I second this. The doco is very bad.
On Dec 10, 2011 6:34 AM, "Bill Janssen" <janssen [at] parc> wrote:

> Xavier Morel <python-dev [at] masklinn> wrote:
>
> > On 2011-12-09, at 19:15 , Bill Janssen wrote:
> > > I use ElementTree for parsing valid XML, but minidom for producing it.
> > Could you expand on your reasons to use minidom for producing XML?
>
> Inertia, I guess. I tried that first, and it seems to work.
>
> I tend to use html5lib and/or BeautifulSoup instead of ElementTree, and
> that's mainly because I find the documentation for ElementTree is
> confusing and partial and inconsistent. Having various undated but
> obsolete tutorials and documentation still up on effbot.org doesn't
> help.
>
>
> Bill


eliben at gmail

Dec 9, 2011, 7:28 PM

Post #15 of 47 (572 views)
Re: Fixing the XML batteries [In reply to]

On Sat, Dec 10, 2011 at 00:43, Matt Joiner <anacrolix [at] gmail> wrote:

> I second this. The doco is very bad.
>

It would be constructive to open issues for specific problems in the
documentation. I'm sure this won't be hard to fix. Documentation should not
be the roadblock for using a library.
Eli


stefan_ml at behnel

Dec 9, 2011, 11:38 PM

Post #16 of 47 (571 views)
Re: Fixing the XML batteries [In reply to]

Bill Janssen, 09.12.2011 19:15:
> I think another thing that might go into "refreshing the batteries" is a
> feature comparison of BeautifulSoup and HTML5lib against the stdlib
> competition, to see what needs to be added/revised. Having to switch to
> an outside package for parsing possibly invalid HTML is a pain.

Such a feature request should be worth a separate thread.

Note, however, that html5lib is likely way too big to add it to the stdlib,
and that BeautifulSoup lacks a parser for non-conforming HTML in Python 3,
which would be the target release series for better HTML support. So,
whatever library or API you would want to use for HTML processing is
currently only the second question as long as Py3 lacks a real-world HTML
parser in the stdlib, as well as a robust character detection mechanism. I
don't think that can be fixed all that easily.

Stefan



timwintle at gmail

Dec 10, 2011, 12:28 AM

Post #17 of 47 (572 views)
Re: Fixing the XML batteries [In reply to]

On Fri, 2011-12-09 at 19:39 +0100, Xavier Morel wrote:
> On 2011-12-09, at 19:15 , Bill Janssen wrote:
> > I use ElementTree for parsing valid XML, but minidom for producing it.
> Could you expand on your reasons to use minidom for producing XML?

To throw my 2c in here:

I personally normally use minidom for manipulating (x)html data (through
html5lib), and for writing XML.

I think it's primarily because DOM:

a) matches the way I think about XML documents.

b) Provides the same API as I use in other languages. (FWIW, I do a lot
of DOM manipulation in javascript)

c) "Feels" (to me) more similar to other formats I work with.


All three may be because I haven't spent enough time with ElementTree -
again I've found the documentation lacking.

Tim



janssen at parc

Dec 10, 2011, 12:54 PM

Post #18 of 47 (566 views)
Re: Fixing the XML batteries [In reply to]

Stefan Behnel <stefan_ml [at] behnel> wrote:

> Bill Janssen, 09.12.2011 19:15:
> > I think another thing that might go into "refreshing the batteries" is a
> > feature comparison of BeautifulSoup and HTML5lib against the stdlib
> > competition, to see what needs to be added/revised. Having to switch to
> > an outside package for parsing possibly invalid HTML is a pain.
>
> Such a feature request should be worth a separate thread.
>
> Note, however, that html5lib is likely way too big to add it to the
> stdlib, and that BeautifulSoup lacks a parser for non-conforming HTML
> in Python 3, which would be the target release series for better HTML
> support. So, whatever library or API you would want to use for HTML
> processing is currently only the second question as long as Py3 lacks
> a real-world HTML parser in the stdlib, as well as a robust character
> detection mechanism. I don't think that can be fixed all that easily.

Sounds like it needs a PEP.

I'm only advocating spending some thought on what needs to be done --
whether outside libraries need to be adopted into the stdlib would be a
step after that. But understanding *why* those libraries exist and are
widely used should be a prerequisite to "refreshing" the stdlib's support.

Bill


glyph at twistedmatrix

Dec 10, 2011, 1:32 PM

Post #19 of 47 (567 views)
Re: Fixing the XML batteries [In reply to]

On Dec 10, 2011, at 2:38 AM, Stefan Behnel wrote:

> Note, however, that html5lib is likely way too big to add it to the stdlib, and that BeautifulSoup lacks a parser for non-conforming HTML in Python 3, which would be the target release series for better HTML support. So, whatever library or API you would want to use for HTML processing is currently only the second question as long as Py3 lacks a real-world HTML parser in the stdlib, as well as a robust character detection mechanism. I don't think that can be fixed all that easily.


Here's the problem in a nutshell, I think:

1. Everybody wants an HTML parser in the stdlib, because it's inconvenient to pull in a dependency for such a "simple" task.
2. Everybody wants the stdlib to remain small, stable, and simple and not get "overcomplicated".
3. Parsing arbitrary HTML5 is a monstrously complex problem, for which there exist rapidly-evolving standards and libraries to deal with it. Parsing 'the web' (which is rapidly growing to include stuff like SVG, MathML etc) is even harder.

My personal opinion is that HTML5Lib gets this problem almost completely right, and so it should be absorbed by the stdlib. Trying to re-invent this from scratch, or even use something like BeautifulSoup which uses a bunch of heuristics and hacks rather than reference to the laboriously-crafted standard that says exactly how parsing malformed stuff has to go to be "like a browser", seems like it will just give the stdlib solution a reputation for working on the test input but not working in the real world.

(No disrespect to BeautifulSoup: it was a great attempt in the pre-HTML5 world which it was born into, and I've used it numerous times to implement useful things. But much more effort has been poured into this problem since then, and the problems are better understood now.)
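
As a small illustration of why parsing "like a browser" is more than tokenizing, consider the stdlib's event-based `html.parser`. It reports tags as they appear in the input and does not build a tree, so the end tags a browser would infer never show up. A sketch only; the class name `TagCollector` and the sample markup are made up:

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record start and end tags exactly as they appear in the input."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


collector = TagCollector()
# The first <p> and the <b> are never closed explicitly.
collector.feed("<p>one<p>two <b>bold</p>")
collector.close()
print(collector.events)
```

Deciding where the first `<p>` and the `<b>` implicitly end is exactly the part that the HTML5 parsing algorithm spells out and that a tokenizer alone leaves to the caller.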

-glyph


tjreedy at udel

Dec 10, 2011, 3:30 PM

Post #20 of 47 (561 views)
Re: Fixing the XML batteries [In reply to]

On 12/10/2011 4:32 PM, Glyph Lefkowitz wrote:
> On Dec 10, 2011, at 2:38 AM, Stefan Behnel wrote:
>
>> Note, however, that html5lib is likely way too big to add it to the
>> stdlib, and that BeautifulSoup lacks a parser for non-conforming HTML
>> in Python 3, which would be the target release series for better HTML
>> support. So, whatever library or API you would want to use for HTML
>> processing is currently only the second question as long as Py3 lacks
>> a real-world HTML parser in the stdlib, as well as a robust character
>> detection mechanism. I don't think that can be fixed all that easily.
>
> Here's the problem in a nutshell, I think:
>
> 1. Everybody wants an HTML parser in the stdlib, because it's
> inconvenient to pull in a dependency for such a "simple" task.
> 2. Everybody wants the stdlib to remain small, stable, and simple and
> not get "overcomplicated".
> 3. Parsing arbitrary HTML5 is a monstrously complex problem, for which
> there exist rapidly-evolving standards and libraries to deal with
> it. Parsing 'the web' (which is rapidly growing to include stuff
> like SVG, MathML etc) is even harder.
>
>
> My personal opinion is that HTML5Lib gets this problem almost completely
> right, and so it should be absorbed by the stdlib.

A little data: the HTML5lib project lives at
https://code.google.com/p/html5lib/
It has 4 owners and 22 other committers.

The most recent release, html5lib 0.90 for Python, is nearly 2 years
old. Since there is a separate Python3 repository, and there is no
mention of Python3 compatibility elsewhere that I saw, including the
pypi listing, I assume that is for Python2 only.

A comment on a recent (July 11) Python3 issue
https://code.google.com/p/html5lib/issues/detail?id=187&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20Port
suggests that the Python3 version still has problems: "Merged in now,
though still lots of errors and failures in the testsuite."

--
Terry Jan Reedy



glyph at twistedmatrix

Dec 10, 2011, 6:25 PM

Post #21 of 47 (558 views)
Re: Fixing the XML batteries [In reply to]

On Dec 10, 2011, at 6:30 PM, Terry Reedy wrote:

> A little data: the HTML5lib project lives at
> https://code.google.com/p/html5lib/
> It has 4 owners and 22 other committers.
>
> The most recent release, html5lib 0.90 for Python, is nearly 2 years old. Since there is a separate Python3 repository, and there is no mention of Python3 compatibility elsewhere that I saw, including the pypi listing, I assume that release is for Python2 only.

I believe that you are correct.

> A comment on a recent (July 11) Python3 issue
> https://code.google.com/p/html5lib/issues/detail?id=187&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20Port
> suggests that the Python3 version still has problems. "Merged in now, though still lots of errors and failures in the testsuite."


I don't see what bearing this has on the discussion. There are three possible ways I can imagine to interpret this information.

First, you could believe that porting a codebase from Python 2 to Python 3 is much easier than solving a difficult domain-specific problem. In that case, html5lib has done the hard part and someone interested in html-in-the-stdlib should do the rest.

Second, you could believe that porting a codebase from Python 2 to Python 3 is harder than solving a difficult domain-specific problem, in which case something is seriously wrong with Python 3 or its attendant migration tools and that needs to be fixed, so someone should fix that rather than worrying about parsing HTML right now. (I doubt that many subscribers to this list would share this opinion, though.)

Third, you could believe that parsing HTML is not a difficult domain-specific problem. But only a crazy person would believe that, so you're left with one of the previous options :).

-glyph


tjreedy at udel

Dec 10, 2011, 9:55 PM

Post #22 of 47 (559 views)
Permalink
Re: Fixing the XML batteries [In reply to]

On 12/10/2011 9:25 PM, Glyph Lefkowitz wrote:
> On Dec 10, 2011, at 6:30 PM, Terry Reedy wrote:

>> A little data: the HTML5lib project lives at
>> https://code.google.com/p/html5lib/
>> It has 4 owners and 22 other committers.

If there really are 4 'owners' rather than 4 people with admin access to
the site, then there are 4 people to negotiate with.

>> The most recent release, html5lib 0.90 for Python, is nearly 2 years
>> old. Since there is a separate Python3 repository, and there is no
>> mention of Python3 compatibility elsewhere that I saw, including the
>> pypi listing, I assume that is for Python2 only.
>
> I believe that you are correct.

There are issues pointing to a 1.0 release, but I could not find any
current timetable. The project looks a bit stagnant. That does not bode
well for a commitment to future active maintenance.

>> A comment on a recent (July 11) Python3 issue
>> https://code.google.com/p/html5lib/issues/detail?id=187&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary%20Port
>> suggests that the Python3 version still has problems. "Merged in now,
>> though still lots of errors and failures in the testsuite."
>
> I don't see what bearing this has on the discussion.

I think both points above show that 'absorbing HTML5Lib in the stdlib'
will involve more sociological and technical problems than doing so with
an active one-person module that already runs on 3.2. One is that the
multiple-version Python 2.x codebase is the reference version, and that
will not be incorporated. A serious plan will have to address the real
situation.

---
Terry Jan Reedy



martin at v

Dec 11, 2011, 2:03 PM

Post #23 of 47 (525 views)
Permalink
Re: Fixing the XML batteries [In reply to]

On 09.12.2011 10:09, Xavier Morel wrote:
> On 2011-12-09, at 09:41 , Martin v. Löwis wrote:
>>> a) The stdlib documentation should help users to choose the right
>>> tool right from the start. Instead of using the totally
>>> misleading wording that it uses now, it should be honest about
>>> the performance characteristics of MiniDOM and should actively
>>> suggest that those who don't know what to choose (or even *that*
>>> they can choose) should not use MiniDOM in the first place.
>>
[...]
>
> Minidom is inferior in interface flow and pythonicity, in terseness,
> in speed, in memory consumption (even more so using cElementTree, and
> that's not something which can be fixed unless minidom gets a C
> accelerator), etc… Even after fixing minidom (if anybody has the time
> and drive to commit to it), ET/cET should be preferred over it.

I don't mind pointing people to ElementTree, although I disagree that
the ET interface is "superior" to DOM. What I object to is Stefan's
reasoning as to *why* people should be pointed to ET, and the words he
wants to use to do that. IOW, I detest bashing some part of the standard
library just to urge users to use some other part of the standard library.

People are still using PyXML, even though it is not maintained anymore.
Telling them to replace 4DOM with minidom is much more appropriate than
telling them to rewrite in ET.
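
For readers who have not used both interfaces, here is a minimal sketch of
the same lookup written against minidom and against ElementTree. The sample
document and the function names are made up for illustration; only the
stdlib calls themselves are real.

```python
import xml.dom.minidom
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this comparison.
SAMPLE = (
    "<library>"
    "<book id='1'><title>Dive In</title></book>"
    "<book id='2'><title>Learning XML</title></book>"
    "</library>"
)


def titles_minidom(text):
    """Collect the book titles through the DOM interface."""
    doc = xml.dom.minidom.parseString(text)
    titles = []
    for book in doc.getElementsByTagName("book"):
        title = book.getElementsByTagName("title")[0]
        # In DOM, text lives in child text nodes, not on the element.
        titles.append("".join(
            node.data for node in title.childNodes
            if node.nodeType == node.TEXT_NODE))
    return titles


def titles_etree(text):
    """Collect the same titles through the ElementTree interface."""
    root = ET.fromstring(text)
    return [book.findtext("title") for book in root.findall("book")]


print(titles_minidom(SAMPLE))
print(titles_etree(SAMPLE))
```

Both functions return the same list; which one reads better is exactly the
matter of taste being argued about in this thread.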

Regards,
Martin


martin at v

Dec 11, 2011, 2:07 PM

Post #24 of 47 (528 views)
Permalink
Re: Fixing the XML batteries [In reply to]

> For the various XML libraries, a message along the lines of "Note: The
> <whatever> module is a <yada, yada, DOM based, whatever>. If all you
> are trying to do is read and write XML files, consider using the
> xml.etree.ElementTree module instead".

I wouldn't mind such a wording. I still would mind the changes that
Stefan proposed (which are actually different from yours).
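
As a rough illustration of the "read and write XML files" case that such a
note would point at, here is a small sketch using xml.etree.ElementTree.
The element names, the attribute and the helper functions are assumptions
made up for the example, and an in-memory buffer stands in for a real file.

```python
import io
import xml.etree.ElementTree as ET


def build_config():
    """Build a small tree in memory (element names are hypothetical)."""
    root = ET.Element("config")
    server = ET.SubElement(root, "server", name="example")
    ET.SubElement(server, "port").text = "8080"
    return ET.ElementTree(root)


def roundtrip(tree):
    """Serialize the tree to bytes and parse it back again."""
    buffer = io.BytesIO()  # stands in for a file opened in binary mode
    tree.write(buffer, encoding="utf-8", xml_declaration=True)
    buffer.seek(0)
    return ET.parse(buffer)


reloaded = roundtrip(build_config())
server = reloaded.getroot().find("server")
print(server.get("name"), server.findtext("port"))
```

With a real file, the same write() and parse() calls take a file name
instead of the buffer.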

Regards,
Martin


martin at v

Dec 11, 2011, 2:39 PM

Post #25 of 47 (528 views)
Permalink
Re: Fixing the XML batteries [In reply to]

> I can't recall anyone working on any substantial improvements during the
> last six years or so, and the reason for that seems obvious to me.

What do you think is the reason? It's not at all obvious to me.

Regards,
Martin
