
Mailing List Archive: Python: Dev

Changes in html.parser may cause breakage in client code

 

 



vinay_sajip at yahoo

Apr 26, 2012, 12:10 PM

Post #1 of 9
Changes in html.parser may cause breakage in client code

Following recent changes in html.parser, the Python 3 port of Django I'm working
on has started failing while parsing HTML.

The reason appears to be that Django uses some module-level data in html.parser,
for example tagfind, which is a regular expression pattern. This has changed
recently (Ezio changed it in ba4baaddac8d).

Now tagfind (and other such patterns) are not marked as private (though not
documented), but should they be? The following script (tagfind.py):

import html.parser as Parser

data = '<select name="stuff">'

m = Parser.tagfind.match(data, 1)
print('%r -> %r' % (Parser.tagfind.pattern, data[1:m.end()]))

gives different results on 3.2 and 3.3:

$ python3.2 tagfind.py
'[a-zA-Z][-.a-zA-Z0-9:_]*' -> 'select'
$ python3.3 tagfind.py
'([a-zA-Z][-.a-zA-Z0-9:_]*)(?:\\s|/(?!>))*' -> 'select '

The trailing space later causes a mismatch with the end tag, and leads to the
errors. Django's use of the tagfind pattern is in a subclass of HTMLParser, in
an overridden parse_starttag method.
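
To make the difference concrete, here is a minimal sketch that compiles the two patterns exactly as printed above and matches them against the same input. The names OLD_TAGFIND and NEW_TAGFIND are made up for this illustration and are not part of html.parser. It suggests that code which wants only the tag name from the newer pattern would need group(1) rather than a slice up to m.end():

```python
import re

# Patterns copied from the 3.2 and 3.3 output quoted above.
# OLD_TAGFIND / NEW_TAGFIND are illustrative names, not html.parser names.
OLD_TAGFIND = re.compile(r'[a-zA-Z][-.a-zA-Z0-9:_]*')
NEW_TAGFIND = re.compile(r'([a-zA-Z][-.a-zA-Z0-9:_]*)(?:\s|/(?!>))*')

data = '<select name="stuff">'

old = OLD_TAGFIND.match(data, 1)
new = NEW_TAGFIND.match(data, 1)

# Slicing up to the end of the match, as the script above does.
old_slice = data[1:old.end()]
new_slice = data[1:new.end()]   # also swallows the whitespace after the name

# The newer pattern captures the bare tag name in group 1.
new_name = new.group(1)

print(repr(old_slice), repr(new_slice), repr(new_name))
```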

Do we need to indicate more strongly that data like tagfind are private? Or has
the change introduced inadvertent breakage, requiring a fix in Python?

Regards,

Vinay Sajip

_______________________________________________
Python-Dev mailing list
Python-Dev [at] python
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/list-python-dev%40lists.gossamer-threads.com


guido at python

Apr 26, 2012, 12:21 PM

Post #2 of 9
Re: Changes in html.parser may cause breakage in client code

On Thu, Apr 26, 2012 at 12:10 PM, Vinay Sajip <vinay_sajip [at] yahoo> wrote:
> Following recent changes in html.parser, the Python 3 port of Django I'm working
> on has started failing while parsing HTML.
>
> The reason appears to be that Django uses some module-level data in html.parser,
> for example tagfind, which is a regular expression pattern. This has changed
> recently (Ezio changed it in ba4baaddac8d).
>
> Now tagfind (and other such patterns) are not marked as private (though not
> documented), but should they be? The following script (tagfind.py):
>
>    import html.parser as Parser
>
>    data = '<select name="stuff">'
>
>    m = Parser.tagfind.match(data, 1)
>    print('%r -> %r' % (Parser.tagfind.pattern, data[1:m.end()]))
>
> gives different results on 3.2 and 3.3:
>
>    $ python3.2 tagfind.py
>    '[a-zA-Z][-.a-zA-Z0-9:_]*' -> 'select'
>    $ python3.3 tagfind.py
>    '([a-zA-Z][-.a-zA-Z0-9:_]*)(?:\\s|/(?!>))*' -> 'select '
>
> The trailing space later causes a mismatch with the end tag, and leads to the
> errors. Django's use of the tagfind pattern is in a subclass of HTMLParser, in
> an overridden parse_starttag method.
>
> Do we need to indicate more strongly that data like tagfind are private? Or has
> the change introduced inadvertent breakage, requiring a fix in Python?

I think both. Looks like it wasn't meant to be exported. But it should
have been marked as such. And I think it would behoove us to reduce
random failures in important 3rd party libraries by keeping the old
version around (but mark it as deprecated with an explaining comment,
and submit a Django fix to stop using it).

Also the module should be updated to use _tagfind internally (and
likewise for other accidental exports).

Traditionally we've been really lax about this stuff. We should strive
to improve and clarify the exact boundaries of our APIs better.

--
--Guido van Rossum (python.org/~guido)


g.brandl at gmx

Apr 26, 2012, 12:26 PM

Post #3 of 9
Re: Changes in html.parser may cause breakage in client code

On 26.04.2012 21:10, Vinay Sajip wrote:
> Following recent changes in html.parser, the Python 3 port of Django I'm working
> on has started failing while parsing HTML.
>
> The reason appears to be that Django uses some module-level data in html.parser,
> for example tagfind, which is a regular expression pattern. This has changed
> recently (Ezio changed it in ba4baaddac8d).
>
> Now tagfind (and other such patterns) are not marked as private (though not
> documented), but should they be? The following script (tagfind.py):
>
> import html.parser as Parser
>
> data = '<select name="stuff">'
>
> m = Parser.tagfind.match(data, 1)
> print('%r -> %r' % (Parser.tagfind.pattern, data[1:m.end()]))
>
> gives different results on 3.2 and 3.3:
>
> $ python3.2 tagfind.py
> '[a-zA-Z][-.a-zA-Z0-9:_]*' -> 'select'
> $ python3.3 tagfind.py
> '([a-zA-Z][-.a-zA-Z0-9:_]*)(?:\\s|/(?!>))*' -> 'select '
>
> The trailing space later causes a mismatch with the end tag, and leads to the
> errors. Django's use of the tagfind pattern is in a subclass of HTMLParser, in
> an overridden parse_starttag method.
>
> Do we need to indicate more strongly that data like tagfind are private? Or has
> the change introduced inadvertent breakage, requiring a fix in Python?

Since it's a module level constant without a leading underscore, IMO it was
okay for Django to use it, even if not documented.

In this case, especially since we actually have evidence of someone using the
constant, I would keep it as-is and use a new (underscored, this time) name for
the new pattern.

And yes, I think that we do need to indicate private-ness of module-level data.

Georg



ncoghlan at gmail

Apr 26, 2012, 5:33 PM

Post #4 of 9
Re: Changes in html.parser may cause breakage in client code

On Fri, Apr 27, 2012 at 5:21 AM, Guido van Rossum <guido [at] python> wrote:
> Traditionally we've been really lax about this stuff. We should strive
> to improve and clarify the exact boundaries of our APIs better.

Yeah, I must admit in my own projects these days I habitually mark all
module level and class level names with a leading underscore until I
make a conscious decision to make them part of the relevant public
API. I also do this for any new helper attributes and
functions/methods I add to the stdlib.

One key catalyst for this was when PJE pointed out a bug years ago in
the behaviour of the -m switch that meant I had to introduce a *new*
helper function to runpy, because runpy.run_module was public, and I
needed to change the signature in a backwards incompatible way to fix
the bug (and thus the current runpy._run_module_as_main hook was
born).

When I use dir() and help() as much as I do to explore unfamiliar
APIs, I feel obliged to make sure that introspecting my own code
accurately reflects which names are part of the public API and which
are just implementation details.
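
As a small illustration of that point, the sketch below filters dir() output by the leading-underscore convention. The function public_names and the throwaway demo module are invented for this example:

```python
import types


def public_names(module):
    """Return the names someone exploring with dir() would read as public."""
    return sorted(name for name in dir(module) if not name.startswith('_'))


# A throwaway module object standing in for a real library module.
demo = types.ModuleType('demo')
demo._tagfind = 'implementation detail'   # underscore: hidden by the filter
demo.parse = lambda text: text            # no underscore: reads as public API

print(public_names(demo))
```

Without the underscore, _tagfind would show up in that list and look just as public as parse.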

Cheers,
Nick.

--
Nick Coghlan   |   ncoghlan [at] gmail   |   Brisbane, Australia


ezio.melotti at gmail

Apr 26, 2012, 10:23 PM

Post #5 of 9
Re: Changes in html.parser may cause breakage in client code

Hi,

On 26/04/2012 22.10, Vinay Sajip wrote:
> Following recent changes in html.parser, the Python 3 port of Django I'm working
> on has started failing while parsing HTML.
>
> The reason appears to be that Django uses some module-level data in html.parser,
> for example tagfind, which is a regular expression pattern. This has changed
> recently (Ezio changed it in ba4baaddac8d).

html.parser doesn't use any private _name, so I was considering part of
the public API only the documented names. Several methods are marked
with an "# internal" comment, but that's not visible unless you go read
the source code.

> Now tagfind (and other such patterns) are not marked as private (though not
> documented), but should they be? The following script (tagfind.py):
>
> import html.parser as Parser
>
> data = '<select name="stuff">'
>
> m = Parser.tagfind.match(data, 1)
> print('%r -> %r' % (Parser.tagfind.pattern, data[1:m.end()]))
>
> gives different results on 3.2 and 3.3:
>
> $ python3.2 tagfind.py
> '[a-zA-Z][-.a-zA-Z0-9:_]*' -> 'select'
> $ python3.3 tagfind.py
> '([a-zA-Z][-.a-zA-Z0-9:_]*)(?:\\s|/(?!>))*' -> 'select '
>
> The trailing space later causes a mismatch with the end tag, and leads to the
> errors. Django's use of the tagfind pattern is in a subclass of HTMLParser, in
> an overridden parse_starttag method.

Django shouldn't override parse_starttag (internal and undocumented),
but just use handle_starttag (public and documented).
I see two possible reasons why it's overriding parse_starttag:
1) Django is working around an HTMLParser bug. In this case the bug
may since have been fixed (breaking the now-useless
workaround), and you might now be able to use the original
parse_starttag and get the correct result. If it is indeed working
around a bug and the bug is still present, you should report it upstream.
2) Django is implementing an additional feature. Depending on what
exactly the code is doing you might want to open a new feature request
on the bug tracker. For example the original parse_starttag sets a
self.lasttag attribute with the correct name of the last tag parsed.
Note however that both parse_starttag and self.lasttag are internal and
shouldn't be used directly (but lasttag could be exposed and documented
if people really think that it's useful).
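
For reference, a minimal sketch of the documented route: a subclass that overrides only handle_starttag and leaves the internal parsing methods alone. The class name TagCollector is made up here, and this is not Django's code:

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record start tags using only the documented handle_starttag hook."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # The parser passes the tag name already lowercased, plus a list of
        # (name, value) attribute pairs.
        self.tags.append((tag, attrs))


parser = TagCollector()
parser.feed('<select name="stuff"></select>')
parser.close()
print(parser.tags)
```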

> Do we need to indicate more strongly that data like tagfind are private? Or has
> the change introduced inadvertent breakage, requiring a fix in Python?

I'm not sure that reverting the regex, deprecating all the exposed
internal names, and adding/using internal _names instead is a good idea at
this point. This will cause more breakage, and it would require an
extensive renaming. I can add notes to the documentation/docstrings and
specify what's private and what's not though.
OTOH, if this specific fix is not released yet I can still do something
to limit/avoid the breakage.

Best Regards,
Ezio Melotti

> Regards,
>
> Vinay Sajip
>



guido at python

Apr 27, 2012, 7:36 AM

Post #6 of 9
Re: Changes in html.parser may cause breakage in client code

Someone should contact the Django folks. Alex Gaynor?

On Thursday, April 26, 2012, Ezio Melotti wrote:

> Hi,
>
> On 26/04/2012 22.10, Vinay Sajip wrote:
>
>> Following recent changes in html.parser, the Python 3 port of Django I'm
>> working
>> on has started failing while parsing HTML.
>>
>> The reason appears to be that Django uses some module-level data in
>> html.parser,
>> for example tagfind, which is a regular expression pattern. This has
>> changed
>> recently (Ezio changed it in ba4baaddac8d).
>>
>
> html.parser doesn't use any private _name, so I was considering part of
> the public API only the documented names. Several methods are marked with
> an "# internal" comment, but that's not visible unless you go read the
> source code.
>
> Now tagfind (and other such patterns) are not marked as private (though
>> not
>> documented), but should they be? The following script (tagfind.py):
>>
>> import html.parser as Parser
>>
>> data = '<select name="stuff">'
>>
>> m = Parser.tagfind.match(data, 1)
>> print('%r -> %r' % (Parser.tagfind.pattern, data[1:m.end()]))
>>
>> gives different results on 3.2 and 3.3:
>>
>> $ python3.2 tagfind.py
>> '[a-zA-Z][-.a-zA-Z0-9:_]*' -> 'select'
>> $ python3.3 tagfind.py
>> '([a-zA-Z][-.a-zA-Z0-9:_]*)(?:\\s|/(?!>))*' -> 'select '
>>
>> The trailing space later causes a mismatch with the end tag, and leads to
>> the
>> errors. Django's use of the tagfind pattern is in a subclass of
>> HTMLParser, in
>> an overridden parse_starttag method.
>>
>
> Django shouldn't override parse_starttag (internal and undocumented), but
> just use handle_starttag (public and documented).
> I see two possible reasons why it's overriding parse_starttag:
> 1) Django is working around an HTMLParser bug. In this case the bug
> may since have been fixed (breaking the now-useless
> workaround), and you might now be able to use the original parse_starttag
> and get the correct result. If it is indeed working around a bug and the
> bug is still present, you should report it upstream.
> 2) Django is implementing an additional feature. Depending on what
> exactly the code is doing you might want to open a new feature request on
> the bug tracker. For example the original parse_starttag sets a
> self.lasttag attribute with the correct name of the last tag parsed. Note
> however that both parse_starttag and self.lasttag are internal and
> shouldn't be used directly (but lasttag could be exposed and documented if
> people really think that it's useful).
>
> Do we need to indicate more strongly that data like tagfind are private?
>> Or has
>> the change introduced inadvertent breakage, requiring a fix in Python?
>>
>
> I'm not sure that reverting the regex, deprecating all the exposed internal
> names, and adding/using internal _names instead is a good idea at this point.
> This will cause more breakage, and it would require an extensive renaming.
> I can add notes to the documentation/docstrings and specify what's private
> and what's not though.
> OTOH, if this specific fix is not released yet I can still do something to
> limit/avoid the breakage.
>
> Best Regards,
> Ezio Melotti
>
> Regards,
>>
>> Vinay Sajip
>>
>>


--
--Guido van Rossum (python.org/~guido)


tjreedy at udel

Apr 27, 2012, 10:23 AM

Post #7 of 9
Re: Changes in html.parser may cause breakage in client code

On 4/27/2012 1:23 AM, Ezio Melotti wrote:

> html.parser doesn't use any private _name, so I was considering part of
> the public API only the documented names. Several methods are marked
> with an "# internal" comment, but that's not visible unless you go read
> the source code.

I could not find __all__ defined. Perhaps defining that would help.
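
A small sketch of what defining __all__ does and does not change, using a throwaway module built in memory. The module name demo_parser and its contents are invented for this example. __all__ limits what "from module import *" copies, but the other names remain reachable as attributes and still appear in dir():

```python
import sys
import types

# Source of a stand-in module: one intended public class, one internal name.
source = '''
__all__ = ['HTMLParser']

tagfind = 'internal pattern'

class HTMLParser:
    pass
'''

demo = types.ModuleType('demo_parser')
exec(source, demo.__dict__)
sys.modules['demo_parser'] = demo   # make it importable by name

# Run a star import in a fresh namespace and see what arrives.
namespace = {}
exec('from demo_parser import *', namespace)

print('HTMLParser' in namespace, 'tagfind' in namespace)
print('tagfind' in dir(demo))       # still visible through introspection
```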

--
Terry Jan Reedy



carl at meyerloewen

Apr 27, 2012, 3:07 PM

Post #8 of 9
Re: Changes in html.parser may cause breakage in client code

On 04/27/2012 08:36 AM, Guido van Rossum wrote:
> Someone should contact the Django folks. Alex Gaynor?

I committed the relevant code to Django (though I didn't write the
patch), and I've been following this thread. I have it on my todo list
to review this code again with Ezio's suggestions in mind. So you can
consider "the Django folks" contacted.

Carl


guido at python

Apr 27, 2012, 4:29 PM

Post #9 of 9
Re: Changes in html.parser may cause breakage in client code

Awesome!

On Fri, Apr 27, 2012 at 3:07 PM, Carl Meyer <carl [at] meyerloewen> wrote:
> On 04/27/2012 08:36 AM, Guido van Rossum wrote:
>>
>> Someone should contact the Django folks. Alex Gaynor?
>
>
> I committed the relevant code to Django (though I didn't write the patch),
> and I've been following this thread. I have it on my todo list to review
> this code again with Ezio's suggestions in mind. So you can consider "the
> Django folks" contacted.
>
> Carl
>



--
--Guido van Rossum (python.org/~guido)
