elementtree XML() unicode


kee at kagi

Nov 3, 2009, 4:01 PM

Post #1 of 11 (470 views)
elementtree XML() unicode

Having an issue with elementtree XML() in python 2.6.4.

This code works fine:

from xml.etree import ElementTree as et
getResponse = u'''<?xml version="1.0" encoding="UTF-8"?>
<customer><shipping><state>bobble</state><city>head</city><street>city</street></shipping></customer>'''
theResponseXml = et.XML(getResponse)

This code errors out when it tries to do the et.XML()

from xml.etree import ElementTree as et
getResponse = u'''<?xml version="1.0" encoding="UTF-8"?>
<customer><shipping><state>\ue58d83\ue89189\ue79c8C</state><city>\ue69f8f\ue5b882</city><street>\ue9ab98\ue58d97\ue58fb03</street></shipping></customer>'''
theResponseXml = et.XML(getResponse)

In my real code, I'm pulling the getResponse data from a web page that
returns as XML and when I display it in the browser you can see the
Japanese characters in the data. I've removed all the stuff in my code
and tried to distill it down to just what is failing. Hopefully I have
not removed something essential.

Why is this not working and what do I need to do to use Elementtree
with unicode?

Thanks, Kee Nethery
--
http://mail.python.org/mailman/listinfo/python-list


gagsl-py2 at yahoo

Nov 3, 2009, 4:44 PM

Post #2 of 11 (453 views)
Re: elementtree XML() unicode [In reply to]

On Tue, 03 Nov 2009 21:01:46 -0300, Kee Nethery <kee [at] kagi> wrote:

> Having an issue with elementtree XML() in python 2.6.4.
>
> This code works fine:
>
> from xml.etree import ElementTree as et
> getResponse = u'''<?xml version="1.0" encoding="UTF-8"?>
> <customer><shipping><state>bobble</state><city>head</
> city><street>city</street></shipping></customer>'''
> theResponseXml = et.XML(getResponse)
>
> This code errors out when it tries to do the et.XML()
>
> from xml.etree import ElementTree as et
> getResponse = u'''<?xml version="1.0" encoding="UTF-8"?>
> <customer><shipping><state>\ue58d83\ue89189\ue79c8C</state><city>
> \ue69f8f\ue5b882</city><street>\ue9ab98\ue58d97\ue58fb03</street></
> shipping></customer>'''
> theResponseXml = et.XML(getResponse)
>
> In my real code, I'm pulling the getResponse data from a web page that
> returns as XML and when I display it in the browser you can see the
> Japanese characters in the data. I've removed all the stuff in my code
> and tried to distill it down to just what is failing. Hopefully I have
> not removed something essential.
>
> Why is this not working and what do I need to do to use Elementtree with
> unicode?

et expects bytes as input, not unicode. You're decoding too early
(decoding early is good, but not in this case, because et does the work
for you). Either feed et.XML with the bytes before decoding, or reencode
the received xml text in UTF-8 (since this is the declared encoding).
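
A minimal sketch of the two options described above. It is written to run on Python 3 as well, which is an assumption, since the thread is about Python 2.6. The sample document and the names `raw_bytes` and `decoded_text` are made up for illustration.

```python
from xml.etree import ElementTree as ET

# Hypothetical stand-in for the bytes a web server would send.
raw_bytes = (
    u'<?xml version="1.0" encoding="UTF-8"?>'
    u'<customer><shipping><city>\u67cf\u5e02</city></shipping></customer>'
).encode('utf-8')

# Option 1: hand the undecoded bytes straight to the parser.
root_from_bytes = ET.XML(raw_bytes)

# Option 2: if the text was already decoded, re-encode it in the
# encoding the XML declaration names (UTF-8 here) before parsing.
decoded_text = raw_bytes.decode('utf-8')
root_from_reencoded = ET.XML(decoded_text.encode('utf-8'))

city_1 = root_from_bytes.find('shipping/city').text
city_2 = root_from_reencoded.find('shipping/city').text
```

Both routes should yield the same element text; the first simply avoids a decode and re-encode round trip.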

--
Gabriel Genellina



kee at kagi

Nov 3, 2009, 5:14 PM

Post #3 of 11 (452 views)
Re: elementtree XML() unicode [In reply to]

On Nov 3, 2009, at 4:44 PM, Gabriel Genellina wrote:

> On Tue, 03 Nov 2009 21:01:46 -0300, Kee Nethery <kee [at] kagi>
> wrote:
>
>> I've removed all the stuff in my code and tried to distill it down
>> to just what is failing. Hopefully I have not removed something
>> essential.

Sounds like I did remove something essential.

>
> et expects bytes as input, not unicode. You're decoding too early
> (decoding early is good, but not in this case, because et does the
> work for you). Either feed et.XML with the bytes before decoding, or
> reencode the received xml text in UTF-8 (since this is the declared
> encoding).

Here is the code that hits the URL:
getResponse1 = urllib2.urlopen(theUrl)
getResponse2 = getResponse1.read()
getResponse3 = unicode(getResponse2,'UTF-8')
theResponseXml = et.XML(getResponse3)

So are you saying I want to do:
getResponse1 = urllib2.urlopen(theUrl)
getResponse4 = getResponse1.read()
theResponseXml = et.XML(getResponse4)

The reason I am confused is that getResponse2 is classified as a
"str" in the Komodo IDE. I want to make sure I don't lose the
non-ASCII characters coming from the URL. If I do the second set of code,
does elementtree auto-convert the str into unicode? How do I deal with
the XML as unicode when I put it into elementtree as a string?

Very confusing. Thanks for the help.

Kee


sjmachin at lexicon

Nov 3, 2009, 5:27 PM

Post #4 of 11 (448 views)
Re: elementtree XML() unicode [In reply to]

On Nov 4, 11:01 am, Kee Nethery <k...@kagi.com> wrote:
> Having an issue with elementtree XML() in python 2.6.4.
>
> This code works fine:
>
>       from xml.etree import ElementTree as et
>       getResponse = u'''<?xml version="1.0" encoding="UTF-8"?>  
> <customer><shipping><state>bobble</state><city>head</
> city><street>city</street></shipping></customer>'''
>       theResponseXml = et.XML(getResponse)
>
> This code errors out when it tries to do the et.XML()
>
>       from xml.etree import ElementTree as et
>       getResponse = u'''<?xml version="1.0" encoding="UTF-8"?>  
> <customer><shipping><state>\ue58d83\ue89189\ue79c8C</state><city>
> \ue69f8f\ue5b882</city><street>\ue9ab98\ue58d97\ue58fb03</street></
> shipping></customer>'''
>       theResponseXml = et.XML(getResponse)
>
> In my real code, I'm pulling the getResponse data from a web page that  
> returns as XML and when I display it in the browser you can see the  
> Japanese characters in the data. I've removed all the stuff in my code  
> and tried to distill it down to just what is failing. Hopefully I have  
> not removed something essential.
>
> Why is this not working and what do I need to do to use Elementtree  
> with unicode?

What you need to do is NOT feed it unicode. You feed it a str object
and it gets decoded according to the encoding declaration found in the
first line. So take the str object that you get from the web (should
be UTF8-encoded already unless the header is lying), and throw that at
ET ... like this:

| Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19) [MSC v.1500 32 bit (Intel)] on win32
| Type "help", "copyright", "credits" or "license" for more information.
| >>> from xml.etree import ElementTree as et
| >>> ucode = u'''<?xml version="1.0" encoding="UTF-8"?>
| ... <customer><shipping>
| ... <state>\ue58d83\ue89189\ue79c8C</state>
| ... <city>\ue69f8f\ue5b882</city>
| ... <street>\ue9ab98\ue58d97\ue58fb03</street>
| ... </shipping></customer>'''
| >>> xml = et.XML(ucode)
| Traceback (most recent call last):
|   File "<stdin>", line 1, in <module>
|   File "C:\python26\lib\xml\etree\ElementTree.py", line 963, in XML
|     parser.feed(text)
|   File "C:\python26\lib\xml\etree\ElementTree.py", line 1245, in feed
|     self._parser.Parse(data, 0)
| UnicodeEncodeError: 'ascii' codec can't encode character u'\ue58d' in position 69: ordinal not in range(128)
| # as expected
| >>> strg = ucode.encode('utf8')
| # encoding as utf8 is for DEMO purposes.
| # i.e. use the original web str object, don't convert it to unicode
| # and back to utf8.
| >>> xml2 = et.XML(strg)
| >>> xml2.tag
| 'customer'
| >>> for c in xml2.getchildren():
| ...     print c.tag, repr(c.text)
| ...
| shipping '\n'
| >>> for c in xml2[0].getchildren():
| ...     print c.tag, repr(c.text)
| ...
| state u'\ue58d83\ue89189\ue79c8C'
| city u'\ue69f8f\ue5b882'
| street u'\ue9ab98\ue58d97\ue58fb03'
| >>>
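
For readers trying the session above on a current interpreter: `getchildren()` was later deprecated and is no longer available as of Python 3.9, so the loops need a small change there. A rough Python 3 equivalent iterates over the element directly. The sample data is illustrative; the state and city characters are an editorial guess at what the poster's byte-like escapes were meant to encode.

```python
from xml.etree import ElementTree as ET

# Assumed sample document, encoded to UTF-8 bytes before parsing.
strg = (
    u'<?xml version="1.0" encoding="UTF-8"?>'
    u'<customer><shipping>'
    u'<state>\u5343\u8449\u770c</state>'
    u'<city>\u67cf\u5e02</city>'
    u'</shipping></customer>'
).encode('utf-8')

xml2 = ET.XML(strg)

# An Element is iterable over its direct children, so getchildren()
# is not needed.
children = [(c.tag, c.text) for c in xml2[0]]
for tag, text in children:
    print(tag, repr(text))
```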

By the way: (1) it usually helps to be more explicit than "errors
out", preferably with the exact copied/pasted output as shown above
(this is one of the rare cases where the error message is
predictable); (2) PLEASE don't start a new topic in a reply in
somebody else's thread.



sjmachin at lexicon

Nov 3, 2009, 5:56 PM

Post #5 of 11 (449 views)
Re: elementtree XML() unicode [In reply to]

On Nov 4, 12:14 pm, Kee Nethery <k...@kagi.com> wrote:
> On Nov 3, 2009, at 4:44 PM, Gabriel Genellina wrote:
>
> > On Tue, 03 Nov 2009 21:01:46 -0300, Kee Nethery <k...@kagi.com>
> > wrote:
>
> >> I've removed all the stuff in my code and tried to distill it down  
> >> to just what is failing. Hopefully I have not removed something  
> >> essential.
>
> Sounds like I did remove something essential.

No, you added something that was not only inessential but caused
trouble.

> > et expects bytes as input, not unicode. You're decoding too early  
> > (decoding early is good, but not in this case, because et does the  
> > work for you). Either feed et.XML with the bytes before decoding, or  
> > reencode the received xml text in UTF-8 (since this is the declared  
> > encoding).
>
> Here is the code that hits the URL:
>          getResponse1 = urllib2.urlopen(theUrl)
>          getResponse2 = getResponse1.read()
>          getResponse3 = unicode(getResponse2,'UTF-8')
>         theResponseXml = et.XML(getResponse3)
>
> So are you saying I want to do:
>          getResponse1 = urllib2.urlopen(theUrl)
>          getResponse4 = getResponse1.read()
>         theResponseXml = et.XML(getResponse4)

You got the essence. Note: that in no way implies any approval of your
naming convention :-)
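
A sketch of that second variant, arranged so it can run offline on Python 3 (an assumption; the thread uses Python 2's urllib2). The helper `parse_response` and the sample payload are hypothetical.

```python
import io
from xml.etree import ElementTree as ET


def parse_response(response):
    """Read the raw bytes from a file-like object and parse them as-is."""
    raw = response.read()  # bytes; deliberately no unicode()/decode() step
    return ET.XML(raw)


# In real code `response` would be whatever urlopen() returns
# (urllib2.urlopen in Python 2, urllib.request.urlopen in Python 3).
# A BytesIO holding made-up content stands in for it here.
sample = (
    u'<?xml version="1.0" encoding="UTF-8"?>'
    u'<customer><shipping><city>\u67cf\u5e02</city></shipping></customer>'
).encode('utf-8')
fake_response = io.BytesIO(sample)

the_response_xml = parse_response(fake_response)
city = the_response_xml.find('shipping/city').text
```

The only point of the helper is that nothing touches the bytes between `read()` and the parser.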

> The reason I am confused is that getResponse2 is classified as an  
> "str" in the Komodo IDE. I want to make sure I don't lose the non-
> ASCII characters coming from the URL.

str is all about 8-bit bytes. Your data comes from the web in 8-bit
bytes. No problem. Just don't palpate it unnecessarily.

> If I do the second set of code,  
> does elementtree auto convert the str into unicode?

Yes. See the example I gave in my earlier posting:

| ... print c.tag, repr(c.text)
| state u'\ue58d83\ue89189\ue79c8C'

That first u means the type is unicode.

> How do I deal with  
> the XML as unicode when I put it into elementtree as a string?

That's unfortunately rather ambiguous: (1) put past/present? (2)
string unicode/str? (3) what is referent of "it"?

All text in what et returns is unicode [*], so you read it out as
unicode (see the example above) and write it back as unicode if you
want to change it:

your_element.text = u'a unicode object'

[*] As an "optimisation", et stores strings as str objects if they
contain only ASCII bytes (and are thus losslessly convertible to
unicode). In preparation for running your code under Python 3.X, it's
best to ignore this and use unicode constants u'foo' (if you need text
constants at all) even if et would let you get away with 'foo'.
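
A small sketch of that read and write cycle, assuming Python 3 (where the unicode type is simply `str`). The element names and the replacement value are invented for illustration.

```python
from xml.etree import ElementTree as ET

# No XML declaration here, so the parser assumes UTF-8.
root = ET.XML(u'<customer><city>\u67cf\u5e02</city></customer>'.encode('utf-8'))
city = root.find('city')

original_text = city.text          # text comes back as unicode text
city.text = u'\u5343\u8449\u5e02'  # assign unicode text, not encoded bytes

# Encoding happens once, at the output boundary.
serialized = ET.tostring(root, encoding='utf-8')
```

Inside the tree everything stays as text; bytes only appear again when the tree is serialised.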

HTH,
John


kee at kagi

Nov 3, 2009, 6:06 PM

Post #6 of 11 (451 views)
Re: elementtree XML() unicode [In reply to]

On Nov 3, 2009, at 5:27 PM, John Machin wrote:

> On Nov 4, 11:01 am, Kee Nethery <k...@kagi.com> wrote:
>> Having an issue with elementtree XML() in python 2.6.4.
>>
>> This code works fine:
>>
>> from xml.etree import ElementTree as et
>> getResponse = u'''<?xml version="1.0" encoding="UTF-8"?>
>> <customer><shipping><state>bobble</state><city>head</
>> city><street>city</street></shipping></customer>'''
>> theResponseXml = et.XML(getResponse)
>>
>> This code errors out when it tries to do the et.XML()
>>
>> from xml.etree import ElementTree as et
>> getResponse = u'''<?xml version="1.0" encoding="UTF-8"?>
>> <customer><shipping><state>\ue58d83\ue89189\ue79c8C</state><city>
>> \ue69f8f\ue5b882</city><street>\ue9ab98\ue58d97\ue58fb03</street></
>> shipping></customer>'''
>> theResponseXml = et.XML(getResponse)
>>
>> In my real code, I'm pulling the getResponse data from a web page
>> that
>> returns as XML and when I display it in the browser you can see the
>> Japanese characters in the data. I've removed all the stuff in my
>> code
>> and tried to distill it down to just what is failing. Hopefully I
>> have
>> not removed something essential.
>>
>> Why is this not working and what do I need to do to use Elementtree
>> with unicode?
>
> What you need to do is NOT feed it unicode. You feed it a str object
> and it gets decoded according to the encoding declaration found in the
> first line.

That it uses "the encoding declaration found in the first line" is the
nugget of data that is not in the documentation that has stymied me
for days. Thank you!

The other thing that has been confusing is that I've been using "dump"
to view what is in the elementtree instance, and the non-ASCII
characters have been displayed as "numbered entities"
(<city>&#26575;&#24066;</city>), which I know is not the
representation I want the data to be in. A co-worker suggested that,
instead of "dump", I use "et.tostring(theResponseXml,
encoding='utf-8')" and then print that to see the characters. That
process causes the non-ASCII characters to display as the glyphs I
know them to be.
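
The difference described above can be reproduced with `tostring()` alone. This is a sketch assuming Python 3; the one-element document is a made-up stand-in for the real response.

```python
from xml.etree import ElementTree as ET

city = ET.XML(u'<city>\u67cf\u5e02</city>'.encode('utf-8'))

# The default output encoding is US-ASCII, so non-ASCII characters are
# written as numeric character references.
ascii_xml = ET.tostring(city)

# Asking for UTF-8 writes the characters themselves, as UTF-8 bytes.
utf8_xml = ET.tostring(city, encoding='utf-8')

# Python 3 only: encoding='unicode' returns text (str) instead of bytes.
text_xml = ET.tostring(city, encoding='unicode')
```

All three results describe the same element; only the byte-level representation of the two characters differs.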

If there was a place in the official docs for me to append these
nuggets of information to the sections for
"xml.etree.ElementTree.XML(text)" and
"xml.etree.ElementTree.dump(elem)" I would absolutely do so.

Thank you!
Kee Nethery


> So take the str object that you get from the web (should
> be UTF8-encoded already unless the header is lying), and throw that at
> ET ... like this:
>
> | Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19) [MSC v.1500 32 bit
> (Intel)] on win32
> | Type "help", "copyright", "credits" or "license" for more
> information.
> | >>> from xml.etree import ElementTree as et
> | >>> ucode = u'''<?xml version="1.0" encoding="UTF-8"?>
> | ... <customer><shipping>
> | ... <state>\ue58d83\ue89189\ue79c8C</state>
> | ... <city>\ue69f8f\ue5b882</city>
> | ... <street>\ue9ab98\ue58d97\ue58fb03</street>
> | ... </shipping></customer>'''
> | >>> xml= et.XML(ucode)
> | Traceback (most recent call last):
> | File "<stdin>", line 1, in <module>
> | File "C:\python26\lib\xml\etree\ElementTree.py", line 963, in XML
> | parser.feed(text)
> | File "C:\python26\lib\xml\etree\ElementTree.py", line 1245, in
> feed
> | self._parser.Parse(data, 0)
> | UnicodeEncodeError: 'ascii' codec can't encode character u'\ue58d'
> in position 69: ordinal not in range(128)
> | # as expected
> | >>> strg = ucode.encode('utf8')
> | # encoding as utf8 is for DEMO purposes.
> | # i.e. use the original web str object, don't convert it to unicode
> | # and back to utf8.
> | >>> xml2 = et.XML(strg)
> | >>> xml2.tag
> | 'customer'
> | >>> for c in xml2.getchildren():
> | ... print c.tag, repr(c.text)
> | ...
> | shipping '\n'
> | >>> for c in xml2[0].getchildren():
> | ... print c.tag, repr(c.text)
> | ...
> | state u'\ue58d83\ue89189\ue79c8C'
> | city u'\ue69f8f\ue5b882'
> | street u'\ue9ab98\ue58d97\ue58fb03'
> | >>>
>
> By the way: (1) it usually helps to be more explicit than "errors
> out", preferably the exact copied/pasted output as shown above; this
> is one of the rare cases where the error message is predictable (2)
> PLEASE don't start a new topic in a reply in somebody else's thread.


sjmachin at lexicon

Nov 3, 2009, 7:02 PM

Post #7 of 11 (454 views)
Re: elementtree XML() unicode [In reply to]

On Nov 4, 1:06 pm, Kee Nethery <k...@kagi.com> wrote:
> On Nov 3, 2009, at 5:27 PM, John Machin wrote:
>
>
>
> > On Nov 4, 11:01 am, Kee Nethery <k...@kagi.com> wrote:

> >> Why is this not working and what do I need to do to use Elementtree
> >> with unicode?
>
> > What you need to do is NOT feed it unicode. You feed it a str object
> > and it gets decoded according to the encoding declaration found in the
> > first line.
>
> That it uses "the encoding declaration found in the first line" is the  
> nugget of data that is not in the documentation that has stymied me  
> for days. Thank you!

And under the "don't repeat" principle, it shouldn't be in the
Elementtree docs; it's nothing special about ET -- it's part of the
definition of an XML document (which for universal loss-free
transportability naturally must be encoded somehow, and the document
must state what its own encoding is (if it's not the default
(UTF-8))).
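
To illustrate the point about the declaration, here is a sketch in which the same characters are stored in two encodings and the parser recovers identical text from each. It assumes Python 3 and uses an invented `<name>` element; ISO-8859-1 is chosen only because it is a common single-byte encoding.

```python
from xml.etree import ElementTree as ET

text = u'caf\xe9'

# The same characters in two different encodings, each with a
# declaration that truthfully names the encoding used.
latin1_doc = (
    u'<?xml version="1.0" encoding="ISO-8859-1"?><name>%s</name>' % text
).encode('iso-8859-1')
utf8_doc = (
    u'<?xml version="1.0" encoding="UTF-8"?><name>%s</name>' % text
).encode('utf-8')

from_latin1 = ET.XML(latin1_doc).text
from_utf8 = ET.XML(utf8_doc).text

# With no declaration and no byte order mark, the parser assumes UTF-8.
from_default = ET.XML((u'<name>%s</name>' % text).encode('utf-8')).text
```

The byte strings differ, but the parsed text is the same because the parser decodes according to what each document declares.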

> The other thing that has been confusing is that I've been using "dump"  
> to view what is in the elementtree instance and the non-ASCII  
> characters have been displayed as "numbered  
> entities" (<city>&#26575;&#24066;</city>) and I know that is not the  
> representation I want the data to be in. A co-worker suggested that  
> instead of "dump" that I use "et.tostring(theResponseXml,  
> encoding='utf-8')" and then print that to see the characters. That  
> process causes the non-ASCII characters to display as the glyphs I  
> know them to be.
>
> If there was a place in the official docs for me to append these  
> nuggets of information to the sections for  
> "xml.etree.ElementTree.XML(text)" and  
> "xml.etree.ElementTree.dump(elem)" I would absolutely do so.

I don't understand ... tostring() is in the same section as dump(),
about two screen-heights away. You want to include the tostring() docs
in the dump() docs? The usual idea is not to get bogged down in the
first function that looks at first glance like it might do what you
want ("look at the glyphs") but doesn't (it writes a (transportable)
XML stream) but press on to the next plausible candidate.


gagsl-py2 at yahoo

Nov 3, 2009, 7:06 PM

Post #8 of 11 (449 views)
Re: elementtree XML() unicode [In reply to]

On Tue, 03 Nov 2009 23:06:58 -0300, Kee Nethery <kee [at] kagi> wrote:

> If there was a place in the official docs for me to append these nuggets
> of information to the sections for "xml.etree.ElementTree.XML(text)" and
> "xml.etree.ElementTree.dump(elem)" I would absolutely do so.

http://bugs.python.org/ applies to documentation too.

--
Gabriel Genellina



kee at kagi

Nov 3, 2009, 8:42 PM

Post #9 of 11 (447 views)
Re: elementtree XML() unicode [In reply to]

On Nov 3, 2009, at 7:06 PM, Gabriel Genellina wrote:

> On Tue, 03 Nov 2009 23:06:58 -0300, Kee Nethery <kee [at] kagi>
> wrote:
>
>> If there was a place in the official docs for me to append these
>> nuggets of information to the sections for
>> "xml.etree.ElementTree.XML(text)" and
>> "xml.etree.ElementTree.dump(elem)" I would absolutely do so.
>
> http://bugs.python.org/ applies to documentation too.

I've submitted documentation bugs in the past and no action was taken
on them, the bugs were closed. I'm guessing that information "that
everyone knows" not being in the documentation is not a bug. It's my
fault I'm a newbie and I accept that. Thanks to you two for helping me
get past this block.

Kee


stefan_ml at behnel

Nov 4, 2009, 5:35 AM

Post #10 of 11 (433 views)
Re: elementtree XML() unicode [In reply to]

John Machin, 04.11.2009 02:56:
> On Nov 4, 12:14 pm, Kee Nethery wrote:
>> The reason I am confused is that getResponse2 is classified as an
>> "str" in the Komodo IDE. I want to make sure I don't lose the non-
>> ASCII characters coming from the URL.
>
> str is all about 8-bit bytes.

True in Py2.x, false in Py3.

What you mean is the "bytes" type, which, sadly, was named "str" in Python 2.x.

The problem the OP ran into was due to the fact that Python 2.x handled
"ASCII characters in a unicode string" <-> "ASCII encoded byte string"
conversion behind the scenes, which led to all sorts of trouble for loads
of people, and was finally discarded in Python 3.0.

Stefan


sjmachin at lexicon

Nov 5, 2009, 1:18 PM

Post #11 of 11 (407 views)
Re: elementtree XML() unicode [In reply to]

On Nov 5, 12:35 am, Stefan Behnel <stefan...@behnel.de> wrote:
> John Machin, 04.11.2009 02:56:
>
> > On Nov 4, 12:14 pm, Kee Nethery wrote:
> >> The reason I am confused is that getResponse2 is classified as an  
> >> "str" in the Komodo IDE. I want to make sure I don't lose the non-
> >> ASCII characters coming from the URL.
>
> > str is all about 8-bit bytes.
>
> True in Py2.x, false in Py3.

And the context was 2.x.

> What you mean is the "bytes" type, which, sadly, was named "str" in Python 2.x.

What you mean is the "bytes" concept.

> The problem the OP ran into was due to the fact that Python 2.x handled
> "ASCII characters in a unicode string" <-> "ASCII encoded byte string"
> conversion behind the scenes, which lead to all sorts of trouble to loads
> of people, and was finally discarded in Python 3.0.

What you describe is the symptom. The problems are (1) 2.X ET expects
a str object but the OP supplied a unicode object, and (2) 2.X ET
didn't check that, so it accidentally "worked" provided the contents
were ASCII-only, and otherwise gave a novice-mystifying error message.
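
The "accidentally worked" behaviour can be imitated explicitly. The sketch below assumes Python 3, where no implicit conversion exists, so the hypothetical helper `py2_style_implicit_encode` performs by hand the ASCII encode that Python 2 applied behind the scenes.

```python
def py2_style_implicit_encode(text):
    """Explicitly do the ASCII encode that Python 2 performed implicitly."""
    return text.encode('ascii')


ascii_only = u'<city>head</city>'
non_ascii = u'<city>\u67cf\u5e02</city>'

# ASCII-only text survives the conversion, so the mistake goes unnoticed.
ok = py2_style_implicit_encode(ascii_only)

# Anything outside ASCII fails with the error quoted earlier in the thread.
try:
    py2_style_implicit_encode(non_ascii)
    failed = False
    message = ''
except UnicodeEncodeError as exc:
    failed = True
    message = str(exc)
```

This is only an imitation of the symptom; it does not call ElementTree, which on Python 3 handles text input differently.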