
Mailing List Archive: Python: Dev

Is this a bug of the HTMLParser?

 

 



pluskid at gmail

Nov 11, 2009, 8:16 AM

Post #1 of 3

Hi all,

I'm using BeautifulSoup to parse an HTML page, and it refuses to
parse the page. Looking at the backtrace, I found that the problem is
in Python's built-in HTMLParser.py. The page I'm parsing contains
some Chinese characters, in a tag like <img src=/foo/bar.png alt=中文>.
Note that this is a legacy HTML page where the attribute values are
not quoted. However, the regexp defined in HTMLParser.py is:

attrfind = re.compile(
r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?')

Note that the character class for unquoted values does not include
Chinese characters (or any other non-ASCII characters), so the parser
raises an error on this tag. I'm not sure whether the HTML standard
allows unquoted non-ASCII characters in attribute values. If it does,
this seems to be a bug, and the unquoted-value part of the regexp had
better be [^>\s] IMHO.
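
To make the mismatch concrete, here is a minimal sketch that runs the regexp quoted above against the attribute text of the example tag, next to a relaxed variant. The names attrfind_relaxed and scan_attrs are hypothetical helpers made up for this illustration, and the relaxed pattern is only one reading of the [^>\s] suggestion (with a trailing * added), not an actual patch to HTMLParser.py.

```python
import re

# Pattern as quoted in the post (from the HTMLParser.py of that era).
attrfind = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?')

# Hypothetical relaxed variant: an unquoted value is any run of
# characters other than '>' and whitespace.
attrfind_relaxed = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[^>\s]*))?')


def scan_attrs(pattern, text):
    """Match `pattern` repeatedly from the start of `text`.

    Returns the (name, value) pairs found and whatever text was left
    unconsumed when the pattern stopped matching.
    """
    attrs = []
    pos = 0
    while pos < len(text):
        m = pattern.match(text, pos)
        if not m or m.end() == pos:
            break
        attrs.append((m.group(1), m.group(3)))
        pos = m.end()
    return attrs, text[pos:]


# Attribute portion of the example tag <img src=/foo/bar.png alt=中文>.
attr_text = ' src=/foo/bar.png alt=中文'

print(scan_attrs(attrfind, attr_text))
print(scan_attrs(attrfind_relaxed, attr_text))
```

With the original pattern the alt value comes back empty and 中文 is left over as unconsumed text, which is presumably what the parser then rejects. The relaxed variant consumes the whole attribute string.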

BTW: It also seems that something like:

<script>
var st = "<a></";
</script>

cannot be parsed. :-/
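
A small way to check how a given interpreter handles this snippet is to feed it to the parser and record what comes back. The sketch below assumes a Python 3 interpreter, where the module is html.parser rather than the Python 2 HTMLParser.py discussed here, so it does not reproduce the failure by itself. EventRecorder is a hypothetical helper name, not part of the standard library.

```python
from html.parser import HTMLParser  # Python 3 location of the module


class EventRecorder(HTMLParser):
    """Hypothetical helper: records the tags and text the parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(('start', tag))

    def handle_endtag(self, tag):
        self.tags.append(('end', tag))

    def handle_data(self, data):
        self.text.append(data)


# The snippet from the post: a string literal containing "</" inside <script>.
snippet = '<script>\nvar st = "<a></";\n</script>'

recorder = EventRecorder()
recorder.feed(snippet)
recorder.close()

print(recorder.tags)
print(repr(''.join(recorder.text)))
```

If the script body is treated as opaque text, the only tags reported should be the opening and closing script tags, with the JavaScript line arriving as data.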

--
pluskid
http://blog.pluskid.org
_______________________________________________
Python-Dev mailing list
Python-Dev [at] python
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/list-python-dev%40lists.gossamer-threads.com


fuzzyman at voidspace

Nov 11, 2009, 8:24 AM

Post #2 of 3
Re: Is this a bug of the HTMLParser?

Hello Zhang Chiyuan,

Can you file a bug on the Python issue tracker please:

http://bugs.python.org

Thanks

Michael Foord



--
http://www.ironpythoninaction.com/



pluskid at gmail

Nov 12, 2009, 8:27 AM

Post #3 of 3
Re: Is this a bug of the HTMLParser?

filed: http://bugs.python.org/issue7311



--
pluskid
http://blog.pluskid.org
