Py 2.5: Bug in sgmllib


mbutscher at gmx

Oct 22, 2006, 4:20 AM

Py 2.5: Bug in sgmllib

Hi,

if I execute the following two lines in Python 2.5 (to feed in a
*unicode* string):

import sgmllib
sgmllib.SGMLParser().feed(u'<a title="te&#223;t"></a>')



I get the exception:

Traceback (most recent call last):
  File "<pyshell#10>", line 1, in <module>
    sgmllib.SGMLParser().feed(u'<a title="te&#223;t"></a>')
  File "C:\Programme\Python25\Lib\sgmllib.py", line 99, in feed
    self.goahead(0)
  File "C:\Programme\Python25\Lib\sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "C:\Programme\Python25\Lib\sgmllib.py", line 285, in parse_starttag
    self._convert_ref, attrvalue)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xdf in position 0: ordinal not in range(128)



The reason is that the character reference &#223; is converted to the
*byte* string "\xdf" by SGMLParser.convert_codepoint. Adding this byte
string to the remaining unicode string fails, because Python first tries
to decode it as ASCII.
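
A minimal sketch of that failure, for illustration only. Python 2 performs the ASCII decoding implicitly when a byte string is combined with a unicode string; the sketch does the same decoding explicitly so that it also runs on Python 3, where sgmllib no longer exists. The helper name implicit_coercion is hypothetical.

```python
# The byte string that, according to the post, convert_codepoint
# produces for the character reference &#223;.
raw = b"\xdf"


def implicit_coercion(byte_string):
    """Hypothetical helper: decode with ASCII, as Python 2 does implicitly
    when a byte string is combined with a unicode string."""
    return byte_string.decode("ascii")


try:
    implicit_coercion(raw)
    error_message = None
except UnicodeDecodeError as exc:
    # Byte 0xdf is outside the 7-bit ASCII range, so decoding fails.
    error_message = str(exc)

print(error_message)
```

The message printed here should match the last line of the traceback above.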


Workaround (not thoroughly tested): Override convert_codepoint in a
derived class with:

def convert_codepoint(self, codepoint):
    # Return a unicode character instead of a byte string.
    return unichr(codepoint)



Is this a bug or is SGMLParser not meant to be used for unicode strings
(it should be documented then)?



Michael
--
http://mail.python.org/mailman/listinfo/python-list


fredrik at pythonware

Oct 22, 2006, 4:47 AM

Re: Py 2.5: Bug in sgmllib

Michael Butscher wrote:


> if I execute the following two lines in Python 2.5 (to feed in a
> *unicode* string):
>
> import sgmllib
> sgmllib.SGMLParser().feed(u'<a title="te&#223;t"></a>')

source documents are encoded byte streams, not decoded Unicode
sequences. I suggest reading up on how Python's Unicode string
type works, and what a Unicode string represents. it's not the same
thing as a byte string.
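
A short sketch of the distinction, as an illustration rather than anything taken from sgmllib itself: the same decoded text yields different byte streams depending on the encoding chosen. The encodings used here (Latin-1 and UTF-8) are assumptions picked for the example.

```python
# Decoded Unicode text: the word from the original post, with U+00DF.
text = u"te\xdft"

# Encoding turns the text into a byte stream; the bytes depend on the codec.
latin1_bytes = text.encode("latin-1")
utf8_bytes = text.encode("utf-8")

print(repr(latin1_bytes))
print(repr(utf8_bytes))

# Decoding with the matching codec recovers the original text.
assert latin1_bytes.decode("latin-1") == text
assert utf8_bytes.decode("utf-8") == text
```

Only a byte stream like latin1_bytes or utf8_bytes is a "source document" in the sense used above; the variable text is the decoded form.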

</F>



martin at v

Oct 22, 2006, 5:54 AM

Re: Py 2.5: Bug in sgmllib

Michael Butscher wrote:
> Is this a bug or is SGMLParser not meant to be used for unicode strings
> (it should be documented then)?

In a sense, SGML itself is not meant to be used for Unicode. In SGML,
the document character set is subject to the SGML application. So what
specific character a character reference refers to is also subject to
the SGML application.

This entire issue is already documented; see the discussion of
convert_charref and convert_codepoint in

http://docs.python.org/lib/module-sgmllib.html
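
To illustrate why the meaning of a numeric character reference depends on the document character set, here is a rough sketch in Python 3. The function resolve_charref and its charset parameter are hypothetical and are not part of sgmllib; they only show that one number can name different characters under different assumptions.

```python
def resolve_charref(codepoint, charset=None):
    """Hypothetical helper: map a numeric character reference to a character.

    With charset=None the number is read as a Unicode code point.
    Otherwise it is read as a single byte in the given 8-bit charset.
    """
    if charset is None:
        return chr(codepoint)
    return bytes([codepoint]).decode(charset)


# &#223; names the same character in Unicode and in Latin-1.
print(repr(resolve_charref(223)))
print(repr(resolve_charref(223, "latin-1")))

# &#128; is a control character in Unicode but the euro sign in cp1252.
print(repr(resolve_charref(128)))
print(repr(resolve_charref(128, "cp1252")))
```

Which of these interpretations is correct is a decision for the application, which is the role the overridable convert_charref and convert_codepoint methods play.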

Regards,
Martin
