
problem parsing utf-8 encoded xml - minidom

 

 



ashmir.d at gmail

Jul 3, 2008, 10:50 PM


Hi,
I am trying to parse an xml file using the minidom parser.

<code>
from xml.dom import minidom
xmlfilename = "sample.xml"
xmldoc = minidom.parse(xmlfilename)
</code>

The parser is failing on this line:

<mrcb245-c>Heinrich Kčufner, Norbert Nedopil, Heinz Schčoch (Hrsg.).</
mrcb245-c>

This is the error message I get:

Traceback (most recent call last):
  File "readXML.py", line 11, in <module>
    xmldoc = minidom.parse(xmlfilename)
  File "C:\Python25\lib\xml\dom\minidom.py", line 1913, in parse
    return expatbuilder.parse(file)
  File "C:\Python25\lib\xml\dom\expatbuilder.py", line 924, in parse
    result = builder.parseFile(fp)
  File "C:\Python25\lib\xml\dom\expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 2254, column 21

It seems to me that it is having an issue with the 'č' character. I
have even tried the following to make sure it recognises the file as
a utf-8 file:

<code>
from xml.dom import minidom
import codecs

xmlfilename = "sample.xml"
xmlfile = codecs.open(xmlfilename,"r","utf-8")
xmlstring = xmlfile.read()
xmldoc = minidom.parse(xmlfilename)
</code>

However, this doesn't work either and I get the following error
message:

Traceback (most recent call last):
  File "readXML.py", line 9, in <module>
    xmlstring = xmlfile.read()
  File "C:\Python25\lib\codecs.py", line 618, in read
    return self.reader.read(size)
  File "C:\Python25\lib\codecs.py", line 424, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 69343-69345: invalid data

I'm assuming here that it is failing at the same place...

Can someone please point me in the right direction?
Thanks,
Ashmir
--
http://mail.python.org/mailman/listinfo/python-list


martin at v

Jul 3, 2008, 11:36 PM

Re: problem parsing utf-8 encoded xml - minidom

> The parser is failing on this line:
>
> <mrcb245-c>Heinrich Kčufner, Norbert Nedopil, Heinz Schčoch (Hrsg.).</
> mrcb245-c>

If it is literally this line, it's no surprise: there must not be a line
break between the slash and the closing element name.

However, since you are getting the error in a different column, it's
indeed more likely that there is a problem with the encoding.

Given that the Python UTF-8 codec refuses the data, most likely, the
data is *not* encoded in UTF-8 (but perhaps in Latin-1). If so, you
need to prefix the XML document with a proper XML declaration, such
as

<?xml version="1.0" encoding="iso-8859-1"?>

Alternatively, make sure that the file is really encoded in UTF-8.
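
One way to check whether a file really is UTF-8 is to decode its raw bytes and see where the codec gives up. The following is only a rough sketch in Python 3 syntax (the thread itself uses Python 2.5); the helper name `find_invalid_utf8` and the sample bytes are made up for illustration and are not taken from the original file.

```python
def find_invalid_utf8(data):
    """Return None if data is valid UTF-8, else (offset, line, bad_bytes)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start and exc.end are byte offsets into data
        line = data.count(b"\n", 0, exc.start) + 1
        return exc.start, line, data[exc.start:exc.end]
    return None


# Hypothetical sample: a name encoded as Latin-1, so the umlaut is the
# single byte 0xF6, which is not a valid UTF-8 sequence.
sample = b"<root>\n<name>Sch\xf6ch</name>\n</root>"
print(find_invalid_utf8(sample))
```

For a real file, the bytes would come from something like `open("sample.xml", "rb").read()`. The reported line number can then be compared with the line in the expat error message.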

Regards,
Martin


ashmir.d at gmail

Jul 4, 2008, 12:28 AM

Re: problem parsing utf-8 encoded xml - minidom

On Jul 4, 2:36 pm, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
> > The parser is failing on this line:
>
> > <mrcb245-c>Heinrich Kčufner, Norbert Nedopil, Heinz Schčoch (Hrsg.).</
> > mrcb245-c>
>
> If it is literally this line, it's no surprise: there must not be a line
> break between the slash and the closing element name.
>
> However, since you are getting the error in a different column, it's
> indeed more likely that there is a problem with the encoding.
>
> Given that the Python UTF-8 codec refuses the data, most likely, the
> data is *not* encoded in UTF-8 (but perhaps in Latin-1). If so, you
> need to prefix the XML document with a proper XML declaration, such
> as
>
> <?xml version="1.0" encoding="iso-8859-1"?>
>
> Alternatively, make sure that the file is really encoded in UTF-8.
>
> Regards,
> Martin


There is no line break in the xml file. It was just a formatting issue
on this forum.

However, you were right about the encoding not being utf-8. The xml file
is autogenerated by a different script, so that's probably where it is
going wrong. The parser works fine if I change the first line to
<?xml version="1.0" encoding="iso-8859-1"?>

Thank you very much
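
For anyone hitting the same error, the effect of the declaration can be reproduced on a small made-up document. This is a sketch in Python 3 syntax, not the original script: the element names and the text are invented, and only `minidom.parseString`, `toxml` and `ExpatError` from the standard library are used.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical document body, encoded as Latin-1 (the umlaut is byte 0xF6).
body = "<names><name>Sch\u00f6ch</name></names>".encode("iso-8859-1")

# With the declaration, the parser decodes the bytes as Latin-1.
declared = b'<?xml version="1.0" encoding="iso-8859-1"?>\n' + body
doc = minidom.parseString(declared)
name = doc.getElementsByTagName("name")[0].firstChild.data

# Without a declaration the parser assumes UTF-8 and rejects the byte.
try:
    minidom.parseString(body)
    undeclared_ok = True
except ExpatError:
    undeclared_ok = False

# Writing the document back out as genuine UTF-8 is the other option.
utf8_bytes = doc.toxml(encoding="utf-8")
```

Adding the declaration only tells the parser how to read the bytes that are already there. If the generating script is supposed to produce UTF-8, fixing its output encoding is probably the cleaner long-term fix.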
