Mailing List Archive: Python: Python

catch UnicodeDecodeError

jaroslav.dobrek at gmail

Jul 25, 2012, 4:05 AM

Post #1 of 19 (2542 views)
Permalink
catch UnicodeDecodeError

Hello,

very often I have the following problem: I write a program that processes many files which it assumes to be encoded in utf-8. Then, some day, there is a non-utf-8 character in one of several hundred or thousand (new) files. The program exits with an error message like this:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xe4 in position 60: invalid continuation byte

I usually solve the problem by moving files around and by recoding them.

What I really want to do is use something like

try:
    # open file, read line, or do something else, I don't care
except UnicodeDecodeError:
    sys.exit("Found a bad char in file " + file + " line " + str(line_number))

Yet, no matter where I put this try-except, it doesn't work.

How should I use try-except with UnicodeDecodeError?

Jaroslav
--
http://mail.python.org/mailman/listinfo/python-list


bahamutzero8825 at gmail

Jul 25, 2012, 4:34 AM

Post #2 of 19 (2508 views)
Permalink
Re: catch UnicodeDecodeError [In reply to]

On 7/25/2012 6:05 AM, jaroslav.dobrek [at] gmail wrote:
> What I really want to do is use something like
>
> try:
>     # open file, read line, or do something else, I don't care
> except UnicodeDecodeError:
>     sys.exit("Found a bad char in file " + file + " line " + str(line_number))
>
> Yet, no matter where I put this try-except, it doesn't work.
>
> How should I use try-except with UnicodeDecodeError?
The same way you handle any other exception. The traceback will tell you
the exact line that raised the exception. It helps us help you if you
include the full traceback and give more detail than "it doesn't work".

--
CPython 3.3.0b1 | Windows NT 6.1.7601.17803
--
http://mail.python.org/mailman/listinfo/python-list


phihag at phihag

Jul 25, 2012, 4:35 AM

Post #3 of 19 (2507 views)
Permalink
Re: catch UnicodeDecodeError [In reply to]

Hi Jaroslav,

you can catch a UnicodeDecodeError just like any other exception. Can
you provide a full example program that shows your problem?

This works fine on my system:


import sys
open('tmp', 'wb').write(b'\xff\xff')
try:
    buf = open('tmp', 'rb').read()
    buf.decode('utf-8')
except UnicodeDecodeError as ude:
    sys.exit("Found a bad char in file " + "tmp")


Note that you cannot possibly determine the line number if you don't
know what encoding the file is in (and what EOL it uses).

What you can do is count the number of bytes with the value 10 before
ude.start, like this:

lineGuess = buf[:ude.start].count(b'\n') + 1
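
Tying those two pieces together, a minimal sketch (using the same throwaway 'tmp' file as above) that reports both the byte offset and the guessed line:

import sys

buf = open('tmp', 'rb').read()
try:
    buf.decode('utf-8')
except UnicodeDecodeError as ude:
    # ude.start is the byte offset of the first undecodable byte
    lineGuess = buf[:ude.start].count(b'\n') + 1
    sys.exit("Found a bad char in file tmp at byte offset %d (around line %d)"
             % (ude.start, lineGuess))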

- Philipp

On 07/25/2012 01:05 PM, jaroslav.dobrek [at] gmail wrote:
> it doesn't work


jaroslav.dobrek at gmail

Jul 25, 2012, 5:09 AM

Post #4 of 19 (2505 views)
Permalink
Re: catch UnicodeDecodeError [In reply to]

On Wednesday, July 25, 2012 1:35:09 PM UTC+2, Philipp Hagemeister wrote:
> Hi Jaroslav,
>
> you can catch a UnicodeDecodeError just like any other exception. Can
> you provide a full example program that shows your problem?
>
> This works fine on my system:
>
>
> import sys
> open('tmp', 'wb').write(b'\xff\xff')
> try:
>     buf = open('tmp', 'rb').read()
>     buf.decode('utf-8')
> except UnicodeDecodeError as ude:
>     sys.exit("Found a bad char in file " + "tmp")
>

Thank you. I got it. What I need to do is explicitly decode text.

But I think trial and error with moving files around will in most cases be faster. Usually, such a problem occurs with some (usually complex) program that I wrote quite a long time ago. I don't like editing old and complex programs that work under all normal circumstances.

What I am missing (especially for Python3) is something like:

try:
    for line in sys.stdin:
        ...
except UnicodeDecodeError:
    sys.exit("Encoding problem in line " + str(line_number))

I got the point that there is no such thing as encoding-independent lines. But if no line ending can be found, then the file simply has one single line.
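
A minimal sketch of one way to approximate this in Python 3: read raw bytes from sys.stdin.buffer and decode each line explicitly while counting lines.

import sys

line_number = 0
try:
    for raw_line in sys.stdin.buffer:      # bytes: no implicit decoding
        line_number += 1
        line = raw_line.decode('utf-8')    # decode explicitly
        # ... process line ...
except UnicodeDecodeError:
    sys.exit("Encoding problem in line " + str(line_number))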
--
http://mail.python.org/mailman/listinfo/python-list


d at davea

Jul 25, 2012, 11:50 AM

Post #6 of 19 (2511 views)
Permalink
Re: catch UnicodeDecodeError [In reply to]

On 07/25/2012 08:09 AM, jaroslav.dobrek [at] gmail wrote:
> On Wednesday, July 25, 2012 1:35:09 PM UTC+2, Philipp Hagemeister wrote:
>> Hi Jaroslav,
>>
>> you can catch a UnicodeDecodeError just like any other exception. Can
>> you provide a full example program that shows your problem?
>>
>> This works fine on my system:
>>
>>
>> import sys
>> open('tmp', 'wb').write(b'\xff\xff')
>> try:
>>     buf = open('tmp', 'rb').read()
>>     buf.decode('utf-8')
>> except UnicodeDecodeError as ude:
>>     sys.exit("Found a bad char in file " + "tmp")
>>
> Thank you. I got it. What I need to do is explicitly decode text.
>
> But I think trial and error with moving files around will in most cases be faster. Usually, such a problem occurs with some (usually complex) program that I wrote quite a long time ago. I don't like editing old and complex programs that work under all normal circumstances.
>
> What I am missing (especially for Python3) is something like:
>
> try:
>     for line in sys.stdin:
>         ...
> except UnicodeDecodeError:
>     sys.exit("Encoding problem in line " + str(line_number))
>
> I got the point that there is no such thing as encoding-independent lines. But if no line ending can be found, then the file simply has one single line.

I can't understand your question. If the problem is that the system
doesn't magically produce a variable called line_number, then generate
it yourself, by counting in the loop.

Don't forget that you can tell the unicode decoder to ignore bad
characters, or to convert them to a specified placeholder.
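
For instance, with the standard error handlers (a small illustrative sketch, not code from the thread):

data = b'good text \xff bad byte'

# Drop undecodable bytes entirely.
print(data.decode('utf-8', errors='ignore'))     # 'good text  bad byte'

# Replace them with U+FFFD, the Unicode replacement character.
print(data.decode('utf-8', errors='replace'))    # 'good text \ufffd bad byte'

# The same errors= keyword works with the Python 3 open() and io.open():
#     open('somefile.txt', encoding='utf-8', errors='replace')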



--

DaveA

--
http://mail.python.org/mailman/listinfo/python-list


jaroslav.dobrek at gmail

Jul 26, 2012, 12:46 AM

Post #7 of 19 (2503 views)
Permalink
Re: catch UnicodeDecodeError [In reply to]

On Jul 25, 8:50pm, Dave Angel <d...@davea.name> wrote:
> On 07/25/2012 08:09 AM, jaroslav.dob...@gmail.com wrote:
>
> > On Wednesday, July 25, 2012 1:35:09 PM UTC+2, Philipp Hagemeister wrote:
> >> Hi Jaroslav,
> >
> >> you can catch a UnicodeDecodeError just like any other exception. Can
> >> you provide a full example program that shows your problem?
> >
> >> This works fine on my system:
> >
> >> import sys
> >> open('tmp', 'wb').write(b'\xff\xff')
> >> try:
> >>     buf = open('tmp', 'rb').read()
> >>     buf.decode('utf-8')
> >> except UnicodeDecodeError as ude:
> >>     sys.exit("Found a bad char in file " + "tmp")
> >
> > Thank you. I got it. What I need to do is explicitly decode text.
> >
> > But I think trial and error with moving files around will in most cases be faster. Usually, such a problem occurs with some (usually complex) program that I wrote quite a long time ago. I don't like editing old and complex programs that work under all normal circumstances.
> >
> > What I am missing (especially for Python3) is something like:
> >
> > try:
> >     for line in sys.stdin:
> >         ...
> > except UnicodeDecodeError:
> >     sys.exit("Encoding problem in line " + str(line_number))
> >
> > I got the point that there is no such thing as encoding-independent lines. But if no line ending can be found, then the file simply has one single line.
> >
> I can't understand your question. If the problem is that the system
> doesn't magically produce a variable called line_number, then generate
> it yourself, by counting in the loop.


That was just a very incomplete and general example.

My problem is solved. What I need to do is explicitly decode text when
reading it. Then I can catch exceptions. I might do this in future
programs.

What I dislike about this solution is that it complicates most programs
unnecessarily. In programs that open, read and process many files I
don't want to explicitly decode and encode characters all the time. I
just want to write:

for line in f:

or something like that. Yet, writing this means to *implicitly* decode
text. And, because the decoding is implicit, you cannot say

try:
    for line in f: # here text is decoded implicitly
        do_something()
except UnicodeDecodeError():
    do_something_different()

This isn't possible for syntactic reasons.

The problem is that the vast majority of the thousands of files that I
process are correctly encoded. But then, suddenly, there is a bad
character in a new file. (This is so because most files today are
generated by people who don't know that there is such a thing as
encodings.) And then I need to rewrite my very complex program just
because of one single character in one single file.
--
http://mail.python.org/mailman/listinfo/python-list


stefan_ml at behnel

Jul 26, 2012, 1:28 AM

Post #8 of 19 (2502 views)
Permalink
Re: catch UnicodeDecodeError [In reply to]

Jaroslav Dobrek, 26.07.2012 09:46:
> My problem is solved. What I need to do is explicitly decode text when
> reading it. Then I can catch exceptions. I might do this in future
> programs.

Yes, that's the standard procedure. Decode on the way in, encode on the way
out, use Unicode everywhere in between.


> What I dislike about this solution is that it complicates most programs
> unnecessarily. In programs that open, read and process many files I
> don't want to explicitly decode and encode characters all the time. I
> just want to write:
>
> for line in f:

And the cool thing is: you can! :)

In Python 2.6 and later, the new Py3 open() function is a bit more hidden,
but it's still available:

from io import open

filename = "somefile.txt"
try:
    with open(filename, encoding="utf-8") as f:
        for line in f:
            process_line(line)  # actually, I'd use "process_file(f)"
except IOError, e:
    print("Reading file %s failed: %s" % (filename, e))
except UnicodeDecodeError, e:
    print("Some error occurred decoding file %s: %s" % (filename, e))


Ok, maybe with a better way to handle the errors than "print" ...

For older Python versions, you'd use "codecs.open()" instead. That's a bit
messy, but only because it was finally cleaned up for Python 3.
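
For the record, a rough sketch of that codecs.open() variant, reusing the illustrative filename and process_line from the example above (Python 2 style, to match it):

import codecs

filename = "somefile.txt"
f = codecs.open(filename, "r", encoding="utf-8")   # yields decoded lines
try:
    for line in f:
        process_line(line)
except UnicodeDecodeError, e:
    print("Some error occurred decoding file %s: %s" % (filename, e))
finally:
    f.close()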


> or something like that. Yet, writing this means to *implicitly* decode
> text. And, because the decoding is implicit, you cannot say
>
> try:
>     for line in f: # here text is decoded implicitly
>         do_something()
> except UnicodeDecodeError():
>     do_something_different()
>
> This isn't possible for syntactic reasons.

Well, you'd normally want to leave out the parentheses after the exception
type, but otherwise, that's perfectly valid Python code. That's how these
things work.


> The problem is that the vast majority of the thousands of files that I
> process are correctly encoded. But then, suddenly, there is a bad
> character in a new file. (This is so because most files today are
> generated by people who don't know that there is such a thing as
> encodings.) And then I need to rewrite my very complex program just
> because of one single character in one single file.

Why would that be the case? The places to change should be very local in
your code.

Stefan


--
http://mail.python.org/mailman/listinfo/python-list


rosuav at gmail

Jul 26, 2012, 2:46 AM

Post #9 of 19 (2501 views)
Permalink
Re: catch UnicodeDecodeError [In reply to]

On Thu, Jul 26, 2012 at 5:46 PM, Jaroslav Dobrek
<jaroslav.dobrek [at] gmail> wrote:
> My problem is solved. What I need to do is explicitly decode text when
> reading it. Then I can catch exceptions. I might do this in future
> programs.

Apologies if it's already been said (I'm only skimming this thread),
but ISTM that you want to open the file in binary mode. You'll then
get back a bytes() instead of a str(), and you can attempt to decode
it separately. You may then need to do your own division into lines
that way, though.
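
A minimal sketch of that binary-mode approach (Python 3; the file name is illustrative):

import sys

path = "somefile.txt"
with open(path, "rb") as f:            # binary mode: bytes, not str
    raw = f.read()

# Do the division into lines ourselves, then decode line by line.
for lineno, raw_line in enumerate(raw.split(b"\n"), 1):
    try:
        line = raw_line.decode("utf-8")
    except UnicodeDecodeError as e:
        sys.exit("Bad byte in %s, line %d: %s" % (path, lineno, e))
    # ... process line ...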

ChrisA
--
http://mail.python.org/mailman/listinfo/python-list


wxjmfauth at gmail

Jul 26, 2012, 3:19 AM

Post #10 of 19 (2507 views)
Permalink
Re: catch UnicodeDecodeError [In reply to]

On Thursday, July 26, 2012 9:46:27 AM UTC+2, Jaroslav Dobrek wrote:
> On Jul 25, 8:50pm, Dave Angel <d...@davea.name> wrote:
> > On 07/25/2012 08:09 AM, jaroslav.dob...@gmail.com wrote:
> >
> > > On Wednesday, July 25, 2012 1:35:09 PM UTC+2, Philipp Hagemeister wrote:
> > >> Hi Jaroslav,
> >
> > >> you can catch a UnicodeDecodeError just like any other exception. Can
> > >> you provide a full example program that shows your problem?
> >
> > >> This works fine on my system:
> >
> > >> import sys
> > >> open('tmp', 'wb').write(b'\xff\xff')
> > >> try:
> > >>     buf = open('tmp', 'rb').read()
> > >>     buf.decode('utf-8')
> > >> except UnicodeDecodeError as ude:
> > >>     sys.exit("Found a bad char in file " + "tmp")
> >
> > > Thank you. I got it. What I need to do is explicitly decode text.
> >
> > > But I think trial and error with moving files around will in most cases be faster. Usually, such a problem occurs with some (usually complex) program that I wrote quite a long time ago. I don't like editing old and complex programs that work under all normal circumstances.
> >
> > > What I am missing (especially for Python3) is something like:
> >
> > > try:
> > >     for line in sys.stdin:
> > >         ...
> > > except UnicodeDecodeError:
> > >     sys.exit("Encoding problem in line " + str(line_number))
> >
> > > I got the point that there is no such thing as encoding-independent lines. But if no line ending can be found, then the file simply has one single line.
> >
> > I can't understand your question. If the problem is that the system
> > doesn't magically produce a variable called line_number, then generate
> > it yourself, by counting in the loop.
>
> That was just a very incomplete and general example.
>
> My problem is solved. What I need to do is explicitly decode text when
> reading it. Then I can catch exceptions. I might do this in future
> programs.
>
> What I dislike about this solution is that it complicates most programs
> unnecessarily. In programs that open, read and process many files I
> don't want to explicitly decode and encode characters all the time. I
> just want to write:
>
> for line in f:
>
> or something like that. Yet, writing this means to *implicitly* decode
> text. And, because the decoding is implicit, you cannot say
>
> try:
>     for line in f: # here text is decoded implicitly
>         do_something()
> except UnicodeDecodeError():
>     do_something_different()
>
> This isn't possible for syntactic reasons.
>
> The problem is that the vast majority of the thousands of files that I
> process are correctly encoded. But then, suddenly, there is a bad
> character in a new file. (This is so because most files today are
> generated by people who don't know that there is such a thing as
> encodings.) And then I need to rewrite my very complex program just
> because of one single character in one single file.

In my mind you are taking the problem the wrong way.

Basically there is no "real UnicodeDecodeError": you are
just attempting to read a file with the wrong
codec. Catching a UnicodeDecodeError will not correct
the basic problem, it will "only" show that you are using
the wrong codec.
There is still the possibility that you have to deal with
ill-formed utf-8 coding, but I doubt that is the case.

Do not forget: a "bit of text" only has a meaning if you
know its encoding.

In short, all your files are most probably ok; you are just not reading
them correctly.

>>> b'abc\xeadef'.decode('utf-8')
Traceback (most recent call last):
  File "<eta last command>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xea in position 3: invalid continuation byte
>>> # but
>>> b'abc\xeadef'.decode('cp1252')
'abcêdef'
>>> b'abc\xeadef'.decode('mac-roman')
'abcÍdef'
>>> b'abc\xeadef'.decode('iso-8859-1')
'abcêdef'

jmf
--
http://mail.python.org/mailman/listinfo/python-list


jaroslav.dobrek at gmail

Jul 26, 2012, 3:51 AM

Post #11 of 19 (2505 views)
Permalink
Re: catch UnicodeDecodeError [In reply to]

> And the cool thing is: you can! :)
>
> In Python 2.6 and later, the new Py3 open() function is a bit more hidden,
> but it's still available:
>
> from io import open
>
> filename = "somefile.txt"
> try:
>     with open(filename, encoding="utf-8") as f:
>         for line in f:
>             process_line(line)  # actually, I'd use "process_file(f)"
> except IOError, e:
>     print("Reading file %s failed: %s" % (filename, e))
> except UnicodeDecodeError, e:
>     print("Some error occurred decoding file %s: %s" % (filename, e))

Thanks. I might use this in the future.

> > try:
> >     for line in f: # here text is decoded implicitly
> >         do_something()
> > except UnicodeDecodeError():
> >     do_something_different()
>
> > This isn't possible for syntactic reasons.
>
> Well, you'd normally want to leave out the parentheses after the exception
> type, but otherwise, that's perfectly valid Python code. That's how these
> things work.

You are right. Of course this is syntactically possible. I was too
rash, sorry. I confused it with some other construction I once tried.
I can't remember it right now.

But the code above (without the brackets) is semantically bad: The
exception is not caught.


> > The problem is that vast majority of the thousands of files that I
> > process are correctly encoded. But then, suddenly, there is a bad
> > character in a new file. (This is so because most files today are
> > generated by people who don't know that there is such a thing as
> > encodings.) And then I need to rewrite my very complex program just
> > because of one single character in one single file.
>
> Why would that be the case? The places to change should be very local in
> your code.

This is the case in a program that has many different functions which
open and parse different types of files. When I read and parse a
directory with such different types of files, a program that uses

for line in f:

will not exit with any hint as to where the error occurred. It just
exits with a UnicodeDecodeError. That means I have to look at all
functions that have some variant of

for line in f:

in them. And it is not sufficient to replace the "for line in f" part.
I would have to transform many functions that work in terms of lines
into functions that work in terms of decoded bytes.

That is why I usually solve the problem by moving files around until I
find the bad file. Then I recode or repair the bad file manually.
--
http://mail.python.org/mailman/listinfo/python-list


jaroslav.dobrek at gmail

Jul 26, 2012, 4:04 AM

Post #12 of 19 (2503 views)
Permalink
Re: catch UnicodeDecodeError [In reply to]

On Jul 26, 12:19pm, wxjmfa...@gmail.com wrote:
> On Thursday, July 26, 2012 9:46:27 AM UTC+2, Jaroslav Dobrek wrote:
> > On Jul 25, 8:50pm, Dave Angel <d...@davea.name> wrote:
> > > On 07/25/2012 08:09 AM, jaroslav.dob...@gmail.com wrote:
> > >
> > > > On Wednesday, July 25, 2012 1:35:09 PM UTC+2, Philipp Hagemeister wrote:
> > > >> Hi Jaroslav,
> > >
> > > >> you can catch a UnicodeDecodeError just like any other exception. Can
> > > >> you provide a full example program that shows your problem?
> > >
> > > >> This works fine on my system:
> > >
> > > >> import sys
> > > >> open('tmp', 'wb').write(b'\xff\xff')
> > > >> try:
> > > >>     buf = open('tmp', 'rb').read()
> > > >>     buf.decode('utf-8')
> > > >> except UnicodeDecodeError as ude:
> > > >>     sys.exit("Found a bad char in file " + "tmp")
> > >
> > > > Thank you. I got it. What I need to do is explicitly decode text.
> > >
> > > > But I think trial and error with moving files around will in most cases be faster. Usually, such a problem occurs with some (usually complex) program that I wrote quite a long time ago. I don't like editing old and complex programs that work under all normal circumstances.
> > >
> > > > What I am missing (especially for Python3) is something like:
> > >
> > > > try:
> > > >     for line in sys.stdin:
> > > >         ...
> > > > except UnicodeDecodeError:
> > > >     sys.exit("Encoding problem in line " + str(line_number))
> > >
> > > > I got the point that there is no such thing as encoding-independent lines. But if no line ending can be found, then the file simply has one single line.
> > >
> > > I can't understand your question. If the problem is that the system
> > > doesn't magically produce a variable called line_number, then generate
> > > it yourself, by counting in the loop.
>
> > That was just a very incomplete and general example.
>
> > My problem is solved. What I need to do is explicitly decode text when
> > reading it. Then I can catch exceptions. I might do this in future
> > programs.
>
> > What I dislike about this solution is that it complicates most programs
> > unnecessarily. In programs that open, read and process many files I
> > don't want to explicitly decode and encode characters all the time. I
> > just want to write:
>
> > for line in f:
>
> > or something like that. Yet, writing this means to *implicitly* decode
> > text. And, because the decoding is implicit, you cannot say
>
> > try:
> >     for line in f: # here text is decoded implicitly
> >         do_something()
> > except UnicodeDecodeError():
> >     do_something_different()
>
> > This isn't possible for syntactic reasons.
>
> > The problem is that the vast majority of the thousands of files that I
> > process are correctly encoded. But then, suddenly, there is a bad
> > character in a new file. (This is so because most files today are
> > generated by people who don't know that there is such a thing as
> > encodings.) And then I need to rewrite my very complex program just
> > because of one single character in one single file.
>
> In my mind you are taking the problem the wrong way.
>
> Basically there is no "real UnicodeDecodeError": you are
> just attempting to read a file with the wrong
> codec. Catching a UnicodeDecodeError will not correct
> the basic problem, it will "only" show that you are using
> the wrong codec.
> There is still the possibility that you have to deal with
> ill-formed utf-8 coding, but I doubt that is the case.


I participate in projects in which all files (raw text files, xml
files, html files, ...) are supposed to be encoded in utf-8. I get many
different files from many different people. They are almost always
encoded in utf-8. But sometimes a whole file or, more frequently, parts
of a file are not encoded in utf-8. The reason is that most of the
files stem from the internet. Files or strings are downloaded and, if
possible, recoded. And they are often simply concatenated into larger
strings or files.

I think the most straightforward thing to do is to assume that I get
utf-8 and raise an error if some file or character proves to be
something different.
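
A minimal sketch of that fail-fast approach (Python 3; paths are taken from the command line for illustration, and the assert_utf8 helper name is made up):

import sys

def assert_utf8(path):
    # Exit with the file name and position if 'path' is not valid UTF-8.
    with open(path, 'rb') as f:
        data = f.read()
    try:
        data.decode('utf-8')
    except UnicodeDecodeError as e:
        line_guess = data[:e.start].count(b'\n') + 1
        sys.exit("%s: invalid UTF-8 at byte offset %d (around line %d)"
                 % (path, e.start, line_guess))

for path in sys.argv[1:]:
    assert_utf8(path)
# ... the real processing can now assume clean UTF-8 input ...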


--
http://mail.python.org/mailman/listinfo/python-list


stefan_ml at behnel

Jul 26, 2012, 4:15 AM

Post #13 of 19 (2502 views)
Permalink
Re: catch UnicodeDecodeError [In reply to]

Jaroslav Dobrek, 26.07.2012 12:51:
>>> try:
>>>     for line in f: # here text is decoded implicitly
>>>         do_something()
>>> except UnicodeDecodeError():
>>>     do_something_different()
>
> the code above (without the brackets) is semantically bad: The
> exception is not caught.

Sure it is. Just to repeat myself: if the above doesn't catch the
exception, then the exception did not originate from the place where you
think it did. Again: look at the traceback.


>>> The problem is that the vast majority of the thousands of files that I
>>> process are correctly encoded. But then, suddenly, there is a bad
>>> character in a new file. (This is so because most files today are
>>> generated by people who don't know that there is such a thing as
>>> encodings.) And then I need to rewrite my very complex program just
>>> because of one single character in one single file.
>>
>> Why would that be the case? The places to change should be very local in
>> your code.
>
> This is the case in a program that has many different functions which
> open and parse different types of files. When I read and parse a
> directory with such different types of files, a program that uses
>
> for line in f:
>
> will not exit with any hint as to where the error occurred. It just
> exits with a UnicodeDecodeError.

... that tells you the exact code line where the error occurred. No need to
look around.

Stefan


--
http://mail.python.org/mailman/listinfo/python-list


jaroslav.dobrek at gmail

Jul 26, 2012, 4:58 AM

Post #14 of 19 (2501 views)
Permalink
Re: catch UnicodeDecodeError [In reply to]

> that tells you the exact code line where the error occurred. No need to
> look around.


You are right:

try:
    for line in f:
        do_something()
except UnicodeDecodeError:
    do_something_different()

does exactly what one would expect it to do.

Thank you very much for pointing this out and sorry for all the posts. This is one of those days when nothing seems to work and when I don't seem to be able to read the simplest error message.

--
http://mail.python.org/mailman/listinfo/python-list


phihag at phihag

Jul 26, 2012, 5:17 AM

Post #16 of 19 (2502 views)
Permalink
Re: catch UnicodeDecodeError [In reply to]

On 07/26/2012 01:15 PM, Stefan Behnel wrote:
>> exits with a UnicodeDecodeError.
> ... that tells you the exact code line where the error occurred.

Which property of a UnicodeDecodeError does include that information?

On cPython 2.7 and 3.2, I see only start and end, both of which refer to
the number of bytes read so far.

I used the following test script:

e = None
try:
    b'a\xc3\xa4\nb\xff0'.decode('utf-8')
except UnicodeDecodeError as ude:
    e = ude
print(e.start)  # 5 for this input, 3 for the input b'a\nb\xff0'
print(dir(e))

But even if you could somehow determine a line number, this would only
work if the actual encoding uses 0xa for newline. Most encodings (101
out of 108 applicable ones in cPython 3.2) do include 0x0a in their
representation of '\n', but multi-byte encodings routinely include 0x0a
bytes in their representation of non-newline characters. Therefore, the
most you can do is calculate an upper bound for the line number.
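
For example, in UTF-16-LE the non-newline character U+010A is encoded with a 0x0a byte, so counting 0x0a bytes overestimates the line count (a small Python 3 illustration):

data = ('x\u010a\ny').encode('utf-16-le')   # U+010A is not a newline
print(data)                                  # b'x\x00\n\x01\n\x00y\x00'
print(data.count(b'\n'))                     # 2, although there is only one '\n'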

- Philipp


stefan_ml at behnel

Jul 26, 2012, 5:24 AM

Post #17 of 19 (2500 views)
Permalink
Re: catch UnicodeDecodeError [In reply to]

Philipp Hagemeister, 26.07.2012 14:17:
> On 07/26/2012 01:15 PM, Stefan Behnel wrote:
>>> exits with a UnicodeDecodeError.
>> ... that tells you the exact code line where the error occurred.
>
> Which property of a UnicodeDecodeError does include that information?
>
> On cPython 2.7 and 3.2, I see only start and end, both of which refer to
> the number of bytes read so far.

Read again: "*code* line". The OP was apparently failing to see that the
error did not originate in the source code lines that he had wrapped with a
try-except statement but somewhere else, thus leading to the misguided
impression that the exception was not properly caught by the except clause.

Stefan


--
http://mail.python.org/mailman/listinfo/python-list


phihag at phihag

Jul 26, 2012, 5:43 AM

Post #18 of 19 (2502 views)
Permalink
Re: catch UnicodeDecodeError [In reply to]

On 07/26/2012 02:24 PM, Stefan Behnel wrote:
> Read again: "*code* line". The OP was apparently failing to see that
> the error did not originate in the source code lines that he had
> wrapped with a try-except statement but somewhere else, thus leading
> to the misguided impression that the exception was not properly caught
> by the except clause.

Oops, over a dozen posts and I still haven't grasped the OP's problem.
Sorry! And thanks for noting that.

- Philipp


robertmiles at teranews

Aug 29, 2012, 10:50 PM

Post #19 of 19 (2387 views)
Permalink
Re: catch UnicodeDecodeError [In reply to]

On 7/26/2012 5:51 AM, Jaroslav Dobrek wrote:
>> And the cool thing is: you can! :)
>>
>> In Python 2.6 and later, the new Py3 open() function is a bit more hidden,
>> but it's still available:
>>
>> from io import open
>>
>> filename = "somefile.txt"
>> try:
>>     with open(filename, encoding="utf-8") as f:
>>         for line in f:
>>             process_line(line)  # actually, I'd use "process_file(f)"
>> except IOError, e:
>>     print("Reading file %s failed: %s" % (filename, e))
>> except UnicodeDecodeError, e:
>>     print("Some error occurred decoding file %s: %s" % (filename, e))
>
> Thanks. I might use this in the future.
>
>>> try:
>>>     for line in f: # here text is decoded implicitly
>>>         do_something()
>>> except UnicodeDecodeError():
>>>     do_something_different()
>>
>>> This isn't possible for syntactic reasons.
>>
>> Well, you'd normally want to leave out the parentheses after the exception
>> type, but otherwise, that's perfectly valid Python code. That's how these
>> things work.
>
> You are right. Of course this is syntactically possible. I was too
> rash, sorry. I confused it with some other construction I once tried.
> I can't remember it right now.
>
> But the code above (without the brackets) is semantically bad: The
> exception is not caught.
>
>
>>> The problem is that the vast majority of the thousands of files that I
>>> process are correctly encoded. But then, suddenly, there is a bad
>>> character in a new file. (This is so because most files today are
>>> generated by people who don't know that there is such a thing as
>>> encodings.) And then I need to rewrite my very complex program just
>>> because of one single character in one single file.
>>
>> Why would that be the case? The places to change should be very local in
>> your code.
>
> This is the case in a program that has many different functions which
> open and parse different types of files. When I read and parse a
> directory with such different types of files, a program that uses
>
> for line in f:
>
> will not exit with any hint as to where the error occurred. It just
> exits with a UnicodeDecodeError. That means I have to look at all
> functions that have some variant of
>
> for line in f:
>
> in them. And it is not sufficient to replace the "for line in f" part.
> I would have to transform many functions that work in terms of lines
> into functions that work in terms of decoded bytes.
>
> That is why I usually solve the problem by moving files around until I
> find the bad file. Then I recode or repair the bad file manually.


Would it be reasonable to use pieces of the old program to write a
new program that prints the name of an input file, then searches
that input file for bad characters? If it doesn't find any, it can
then go on to the next input file, or show a message saying that no
bad characters were found.
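
A minimal sketch of such a checker (Python 3; input files are given on the command line, the scan helper name is made up, and only the first bad byte per file is reported):

import sys

def scan(path):
    print(path)
    with open(path, 'rb') as f:
        data = f.read()
    try:
        data.decode('utf-8')
    except UnicodeDecodeError as e:
        print("  bad byte %#x at offset %d" % (data[e.start], e.start))
    else:
        print("  no bad characters found")

for path in sys.argv[1:]:
    scan(path)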

--
http://mail.python.org/mailman/listinfo/python-list
