
Mailing List Archive: Python: Bugs

[issue5166] ElementTree and minidom don't prevent creation of not well-formed XML

 

 



report at bugs

Nov 24, 2009, 8:09 AM

[issue5166] ElementTree and minidom don't prevent creation of not well-formed XML

Andy <strangefeatures [at] users> added the comment:

I'm also of the opinion that this would be a valuable feature to have. I
think it's a reasonable expectation that an XML library produces
well-formed XML. It's particularly strange that ET will output XML that
it can't itself read. Surely the job of making the output well-formed
falls on whatever creates the XML. That's the point of using libraries in
the first place: to abstract away details such as most control characters
below 0x20 (everything except tab, LF and CR) not being allowed, in the
same way that ampersands etc. are auto-escaped.

Granted, it's not as clear-cut here, since the low-range ASCII characters
are likely to be less frequent and the strategy for handling them is less
obvious. I think the sanest behaviour would be to raise an exception by
default, although a user-configurable option to replace or omit the
characters would also make sense. If performance is a concern, maybe the
check should be off by default, but I would have thought that the single
regex needed to perform it would have relatively minimal impact. It seems
to be an acceptable overhead on the parsing side, so why not on
generation?
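To make the proposal above concrete, here is a minimal Python 3 sketch of the three behaviours mentioned (raise by default, or optionally replace or omit). It is not part of ElementTree or minidom; the function name `check_xml_text` and the `on_invalid` parameter are hypothetical, chosen only for illustration, and the character class follows the XML 1.0 `Char` production.

```python
import re

# Anything outside the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    r'[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]'
)


def check_xml_text(text, on_invalid='raise', replacement='\uFFFD'):
    """Hypothetical helper: validate or clean text before serialization.

    on_invalid='raise'   -> raise ValueError on the first bad character
    on_invalid='replace' -> substitute `replacement` for each bad character
    on_invalid='omit'    -> drop bad characters
    """
    if on_invalid == 'raise':
        match = _INVALID_XML_CHARS.search(text)
        if match:
            raise ValueError(
                'character %r at position %d is not allowed in XML 1.0'
                % (match.group(), match.start())
            )
        return text
    if on_invalid == 'replace':
        return _INVALID_XML_CHARS.sub(replacement, text)
    if on_invalid == 'omit':
        return _INVALID_XML_CHARS.sub('', text)
    raise ValueError('unknown on_invalid mode: %r' % (on_invalid,))
```

In the default mode the cost for clean text is one regex search per string, which is the "single regex" overhead referred to above.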

----------
nosy: +strangefeatures

_______________________________________
Python tracker <report [at] bugs>
<http://bugs.python.org/issue5166>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com


report at bugs

Nov 24, 2009, 9:26 AM

[issue5166] ElementTree and minidom don't prevent creation of not well-formed XML [In reply to]

Denis S. Otkidach <denis.otkidach [at] gmail> added the comment:

Here is a regexp I use to clean up text. Note that I don't touch the
"compatibility characters", which are also discouraged in XML; some
other developers remove them too:

import re
import sys

# http://www.w3.org/TR/REC-xml/#NT-Char
# Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
#          [#x10000-#x10FFFF]
# (any Unicode character, excluding the surrogate blocks, FFFE, and FFFF)
_char_tail = ''
if sys.maxunicode > 0x10000:
    # Wide build: also allow the supplementary planes.
    _char_tail = u'%s-%s' % (unichr(0x10000),
                             unichr(min(sys.maxunicode, 0x10FFFF)))
_nontext_sub = re.compile(
    ur'[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD%s]' % _char_tail,
    re.U).sub

def replace_nontext(text, replacement=u'\uFFFD'):
    """Replace every character that XML 1.0 does not allow."""
    return _nontext_sub(replacement, text)
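The snippet above is Python 2 (`unichr`, `ur''` literals). For readers on Python 3, a rough equivalent might look like the sketch below. The narrow/wide build check is dropped on the assumption that `str` always covers the full code point range, and the `roundtrips` helper is a hypothetical addition, included only to illustrate the complaint in this issue that ElementTree may serialize text it cannot parse back.

```python
import re
import xml.etree.ElementTree as ET

# Python 3 port of the regexp: match anything outside the XML 1.0 Char set.
_nontext_sub = re.compile(
    r'[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]'
).sub


def replace_nontext(text, replacement='\uFFFD'):
    """Replace every character that XML 1.0 does not allow."""
    return _nontext_sub(replacement, text)


def roundtrips(text):
    """Return True if ElementTree can parse what it serialized for `text`."""
    elem = ET.Element('a')
    elem.text = text
    try:
        ET.fromstring(ET.tostring(elem))
    except ET.ParseError:
        return False
    return True


# Raw control character: serialization succeeds, parsing is expected to fail.
print(roundtrips('a\x01b'))
# After cleaning, the same text should survive a serialize/parse round trip.
print(roundtrips(replace_nontext('a\x01b')))
```

Replacing with U+FFFD keeps the string length and makes the loss visible; passing `replacement=''` drops the characters instead.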

----------

