
Mailing List Archive: Python: Bugs

[issue2746] ElementTree ProcessingInstruction uses character entities in content

 

 



report at bugs

May 3, 2008, 8:12 AM


New submission from Dave Hughes <dave[at]waveform.plus.com>:

In the ElementTree and cElementTree implementations in Python 2.5 (and
possibly Python 2.6 as I also found this issue when testing an SVN
checkout of ElementTree 1.3), the conversion of a ProcessingInstruction
to a string converts XML reserved characters (<, >, &) to character
entities:

>>> from xml.etree.ElementTree import *
>>> tostring(ProcessingInstruction('test', '<testing&>'))
'<?test &lt;testing&amp;&gt;?>'

>>> from xml.etree.cElementTree import *
>>> tostring(ProcessingInstruction('test', '<testing&>'))
'<?test &lt;testing&amp;&gt;?>'

The XML 1.0 spec is rather vague on whether character entities are
permitted in PIs (it explicitly states parameter entities are not
parsed in PIs, but says nothing about parsing character entities).
However, it does have this to say in section 2.4 "Character Data and
Markup":

"The ampersand character (&) and the left angle bracket (<) MUST NOT
appear in their literal form, except when used as markup delimiters, or
within a comment, a processing instruction, or a CDATA section."

So XML reserved characters don't need converting in PIs (according to
section 2.6 of the spec, the only string not permitted in a PI's content
is '?>'), which implies that they shouldn't be converted. As for
practical reasons why they shouldn't be:

Breaks generated PHP:

>>> from xml.etree.cElementTree import *
>>> doc = Element('html')
>>> SubElement(doc, 'head')
<Element 'head' at 0x2af4e3b8a9f0>
>>> SubElement(doc, 'body')
<Element 'body' at 0x2af4e3b922a0>
>>> doc[1].append(ProcessingInstruction('php', 'if (2 < 1) print "<p>Something has gone horribly wrong!</p>";'))
>>> tostring(doc)
'<html><head /><body><?php if (2 &lt; 1) print "&lt;p&gt;Something has gone horribly wrong!&lt;/p&gt;";?></body></html>'

Different from xml.dom:

>>> from xml.dom.minidom import *
>>> i = getDOMImplementation()
>>> doc = i.createDocument(None, 'html', None)
>>> doc.documentElement.appendChild(doc.createElement('head'))
<DOM Element: head at 0x8c6170>
>>> doc.documentElement.appendChild(doc.createElement('body'))
<DOM Element: body at 0x8c6290>
>>> doc.documentElement.lastChild.appendChild(doc.createProcessingInstruction('test', '<testing&>'))
<xml.dom.minidom.ProcessingInstruction instance at 0x8c63b0>
>>> doc.toxml()
'<?xml version="1.0" ?>\n<html><head/><body><?test <testing&>?></body></html>'

Different from lxml:

>>> from lxml.etree import *
>>> tostring(ProcessingInstruction('test', '<testing&>'))
'<?test <testing&>?>'
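
The escaping also fails to round-trip, which a parser check makes concrete. The sketch below is not part of the original report; it assumes the expat-based xml.dom.minidom parser and feeds it a document containing the escaped PI shown earlier. Entity references are not recognized inside a PI, so the content should come back with the entities intact rather than as the original '<testing&>'.

```python
from xml.dom.minidom import parseString

# A document containing the PI exactly as ElementTree serialized it.
doc = parseString('<html><?test &lt;testing&amp;&gt;?></html>')
pi = doc.documentElement.firstChild

# The parser hands the PI content back verbatim: "&lt;" is not
# decoded to "<", so the escaped form is what a consumer would see.
print(pi.target)
print(pi.data)
```

In other words, a consumer of the PI (such as a PHP interpreter) sees the entity text itself, not the characters it was meant to stand for.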

I suspect the only change necessary to fix this is to replace the
_escape_cdata() call for ProcessingInstruction (and possibly Comment
too given the spec quote above) in ElementTree._write() with an
_encode() call, as shown in this patch (which includes the Comment
change as well):

Index: elementtree/ElementTree.py
===================================================================
--- elementtree/ElementTree.py (revision 511)
+++ elementtree/ElementTree.py (working copy)
@@ -663,9 +663,9 @@
         # write XML to file
         tag = node.tag
         if tag is Comment:
-            file.write("<!-- %s -->" % _escape_cdata(node.text, encoding))
+            file.write("<!-- %s -->" % _encode(node.text, encoding))
         elif tag is ProcessingInstruction:
-            file.write("<?%s?>" % _escape_cdata(node.text, encoding))
+            file.write("<?%s?>" % _encode(node.text, encoding))
         else:
             items = node.items()
             xmlns_items = [] # new namespaces in this scope
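
The intended behaviour can be sketched outside ElementTree as a standalone function. This is only an illustration under stated assumptions: serialize_pi is a hypothetical name, it ignores the output encoding that _encode() handles, and the '?>' check is an addition based on section 2.6 of the spec rather than something the patch does.

```python
def serialize_pi(target, text):
    """Hypothetical helper: build PI markup without entity escaping."""
    content = "%s %s" % (target, text) if text else target
    # '?>' would terminate the PI early, so it is the one string to reject.
    if "?>" in content:
        raise ValueError("'?>' is not allowed inside a processing instruction")
    # '<', '>' and '&' are written literally, as the spec permits in a PI.
    return "<?%s?>" % content


print(serialize_pi("test", "<testing&>"))
```

With this approach the reserved characters pass through untouched, matching the xml.dom.minidom and lxml output shown above.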

Sorry I haven't got a similar patch for cElementTree. I've had a quick
look through the source, but haven't yet figured out where the change
should be made (unless it's not required - does cElementTree reuse that
bit of ElementTree?).

----------
components: XML
messages: 66154
nosy: waveform
severity: normal
status: open
title: ElementTree ProcessingInstruction uses character entities in content
type: behavior
versions: Python 2.5

__________________________________
Tracker <report[at]bugs.python.org>
<http://bugs.python.org/issue2746>
__________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com


report at bugs

May 10, 2008, 5:57 AM


Simon Cross <hodgestar[at]gmail.com> added the comment:

cElementTree.ElementTree is a copy of ElementTree.ElementTree with the
.parse(...) method replaced, so the original patch for ElementTree
should fix cElementTree too.

The copying of the ElementTree class into cElementTree happens in the
call to bootstrap in the init_elementtree() function at the bottom of
_elementtree.c.
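
Why a fix to the writing code would carry over can be illustrated with a small sketch. The class and method names here are hypothetical stand-ins, not the actual bootstrap code: one class supplies both parsing and writing, and the accelerated variant overrides only parsing.

```python
class PureTree(object):
    """Stand-in for ElementTree.ElementTree (hypothetical)."""

    def parse(self, source):
        return "pure-Python parse of %s" % source

    def write_pi(self, text):
        # The serialization logic lives only here.
        return "<?%s?>" % text


class AcceleratedTree(PureTree):
    """Stand-in for cElementTree.ElementTree: only parse() is replaced."""

    def parse(self, source):
        return "accelerated parse of %s" % source


# Writing is inherited, so both classes serialize identically.
print(AcceleratedTree().write_pi("test <testing&>"))
```

Any change to the shared writing method is picked up by the variant automatically, which is the situation described above.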

----------
nosy: +hodgestar
