UnicodeDecodeError? Argh! Nothing works! I'm tired and hurting and...

 

 



alfps at start

Nov 23, 2009, 1:06 PM

Post #1 of 17 (638 views)
Permalink
UnicodeDecodeError? Argh! Nothing works! I'm tired and hurting and...

This is the tragic story of this evening:
1. Aspirins to lessen the pain somewhat.
2. Over in [comp.programming] someone mentions paper on Quicksort.
3. I recall that X once sent me link to paper about how to foil
Quicksort, written by was it Doug McIlroy, anyway some Bell Labs guy.
Want to post that link in response to [comp.programming] article.
4. Checking in Thunderbird, no mails from X or about QS there.
5. But his mail address in address list so something funny going on!
6. Googling, yes, it seems Thunderbird has a habit of "forgetting" mails. But
they're really there after all. It's just the index that's screwed up.
7. OK, opening Thunderbird mailbox file (it's just text) in nearest editor.
8. Machine hangs, Windows says it must increase virtual memory, blah blah.
9. Making little Python script to extract individual mails from file.
10. It says UnicodeDecodeError on mail nr. something something.
11. I switch mode to binary. Didn't know if that would work with std input.
12. It's now apparently ten times faster but *still* UnicodeDecodeError!
13. I ask here!

Of course could have googled that paper, but at each step above it seemed just a
half minute more to find the link in mails, and now I decided it must be found.

And I'm hesitant to just delete index file, hoping that it'll rebuild.

Thunderbird does funny things, so best would be if Python script worked.


<code>
import os
import fileinput

def write( s ): print( s, end = "" )

msg_id = 0
f = open( "nul", "w" )
for line in fileinput.input( mode = "rb" ):
    if line.startswith( "From - " ):
        msg_id += 1;
        f.close()
        print( msg_id )
        f = open( "msg_{0:0>6}.txt".format( msg_id ), "w+" )
    else:
        f.write( line )
f.close()
</code>


<last few lines of output>
955
956
957
958
Traceback (most recent call last):
  File "C:\test\tbfix\splitmails.py", line 11, in <module>
    for line in fileinput.input( mode = "rb" ):
  File "C:\Program Files\cpython\python31\lib\fileinput.py", line 254, in __next__
    line = self.readline()
  File "C:\Program Files\cpython\python31\lib\fileinput.py", line 349, in readline
    self._buffer = self._file.readlines(self._bufsize)
  File "C:\Program Files\cpython\python31\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 2188: character maps to <undefined>
</last few lines of output>


Cheers,

- Alf
--
http://mail.python.org/mailman/listinfo/python-list


alfps at start

Nov 23, 2009, 2:37 PM

Post #2 of 17 (617 views)
Permalink
Re: UnicodeDecodeError? Argh! Nothing works! I'm tired and hurting and... [In reply to]

* Alf P. Steinbach:
>
> <code>
> import os
> import fileinput
>
> def write( s ): print( s, end = "" )
>
> msg_id = 0
> f = open( "nul", "w" )
> for line in fileinput.input( mode = "rb" ):
>     if line.startswith( "From - " ):
>         msg_id += 1;
>         f.close()
>         print( msg_id )
>         f = open( "msg_{0:0>6}.txt".format( msg_id ), "w+" )
>     else:
>         f.write( line )
> f.close()
> </code>
>
>
> <last few lines of output>
> 955
> 956
> 957
> 958
> Traceback (most recent call last):
>   File "C:\test\tbfix\splitmails.py", line 11, in <module>
>     for line in fileinput.input( mode = "rb" ):
>   File "C:\Program Files\cpython\python31\lib\fileinput.py", line 254, in __next__
>     line = self.readline()
>   File "C:\Program Files\cpython\python31\lib\fileinput.py", line 349, in readline
>     self._buffer = self._file.readlines(self._bufsize)
>   File "C:\Program Files\cpython\python31\lib\encodings\cp1252.py", line 23, in decode
>     return codecs.charmap_decode(input,self.errors,decoding_table)[0]
> UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 2188: character maps to <undefined>
> </last few lines of output>

The following worked:


<code>
import sys
import fileinput

def write( s ): print( s, end = "" )

msg_id = 0
f = open( "nul", "w" )
input = sys.stdin.detach()  # binary
while True:
    line = input.readline()
    if len( line ) == 0:
        break
    elif line.decode( "ascii", "ignore" ).startswith( "From - " ):
        msg_id += 1;
        f.close()
        print( msg_id )
        f = open( "msg_{0:0>6}.txt".format( msg_id ), "wb+" )
    else:
        f.write( line )
f.close()
</code>
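
For readers on a current Python 3, roughly the same splitting can be sketched by iterating over a binary stream directly, so that no decoding ever takes place. This is only an illustration, not the script from the post: the names `split_mbox`, `out_dir` and `marker` are made up here, and like the original it drops the "From - " separator line itself.

```python
import os


def split_mbox(stream, out_dir, marker=b"From - "):
    """Write each message of a binary mbox stream to its own file.

    `stream` must yield bytes lines (a file opened with "rb", or
    sys.stdin.buffer). Returns the number of messages found.
    """
    msg_id = 0
    out = None
    try:
        for line in stream:                      # bytes, never decoded
            if line.startswith(marker):          # compare bytes with bytes
                if out is not None:
                    out.close()
                msg_id += 1
                name = "msg_{0:0>6}.txt".format(msg_id)
                out = open(os.path.join(out_dir, name), "wb")
            elif out is not None:                # skip anything before the first marker
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return msg_id
```

Called as `split_mbox(sys.stdin.buffer, ".")` it would read the mailbox from standard input, assuming `sys` has been imported.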


Cheers,

- Alf


tjreedy at udel

Nov 23, 2009, 5:48 PM

Post #3 of 17 (616 views)
Permalink
Re: UnicodeDecodeError? Argh! Nothing works! I'm tired and hurting and... [In reply to]

Alf P. Steinbach wrote:

> import os
> import fileinput
>
> def write( s ): print( s, end = "" )

I believe this is the same as
write = sys.stdout.write
though you never use it that I see.
>
> msg_id = 0
> f = open( "nul", "w" )
> for line in fileinput.input( mode = "rb" ):

I presume you are expecting the line to be undecoded bytes, as with
open(f,'rb'). To be sure, add write(type(line)).

>     if line.startswith( "From - " ):
>         msg_id += 1;
>         f.close()
>         print( msg_id )
>         f = open( "msg_{0:0>6}.txt".format( msg_id ), "w+" )

I do not understand why you are writing since you just wanted to look.
In any case, you open in text mode.


>     else:
>         f.write( line )
> f.close()
> </code>
>
>
> <last few lines of output>
> 955
> 956
> 957
> 958
> Traceback (most recent call last):
>   File "C:\test\tbfix\splitmails.py", line 11, in <module>
>     for line in fileinput.input( mode = "rb" ):
>   File "C:\Program Files\cpython\python31\lib\fileinput.py", line 254, in __next__
>     line = self.readline()
>   File "C:\Program Files\cpython\python31\lib\fileinput.py", line 349, in readline
>     self._buffer = self._file.readlines(self._bufsize)
>   File "C:\Program Files\cpython\python31\lib\encodings\cp1252.py", line 23, in decode
>     return codecs.charmap_decode(input,self.errors,decoding_table)[0]
> UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 2188: character maps to <undefined>

It goes ahead and tries to decode to str anyway. Maybe there is a bug,
though maybe the text-mode open in the loop somehow changes fileinput,
especially if you write to something it has open. So I would not report
a bug until I tried reading without writing.
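
A read-only check along these lines might look like the sketch below. It assumes the mailbox is passed as a named file rather than on standard input, and the helper name `line_types` is invented for the example. It never writes anything; it only records which types fileinput hands back in binary mode.

```python
import fileinput


def line_types(path):
    """Return the set of type names fileinput yields for `path` in binary mode."""
    seen = set()
    # The with-statement form closes the input again when the loop is done.
    with fileinput.input(files=[path], mode="rb") as lines:
        for line in lines:
            seen.add(type(line).__name__)
    return seen
```

If the result is `{"bytes"}`, the lines were not decoded for that file; input arriving on standard input is a separate case that this sketch does not cover.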

tjr



lie.1296 at gmail

Nov 23, 2009, 8:02 PM

Post #4 of 17 (613 views)
Permalink
Re: UnicodeDecodeError? Argh! Nothing works! I'm tired and hurting and... [In reply to]

Alf P. Steinbach wrote:
>
> And I'm hesitant to just delete index file, hoping that it'll rebuild.

It'll be rebuilt the next time you start Thunderbird:
(MozillaZine: http://kb.mozillazine.org/Disappearing_mail)
* It's possible that the ".msf" files (index files) are corrupted. To
rebuild the index of a folder, right-click it, select Properties, and
choose "Rebuild Index" from the General Information tab. You can also
close Thunderbird and manually delete them from your profile folder;
they will be rebuilt when Thunderbird starts.



nobody at nowhere

Nov 23, 2009, 10:15 PM

Post #5 of 17 (609 views)
Permalink
Re: UnicodeDecodeError? Argh! Nothing works! I'm tired and hurting and... [In reply to]

On Mon, 23 Nov 2009 22:06:29 +0100, Alf P. Steinbach wrote:

> 10. It says UnicodeDecodeError on mail nr. something something.

That's what you get for using Python 3.x ;)

If you must use 3.x, don't use the standard descriptors. If you must use
the standard descriptors in 3.x, call detach() on them to get the
underlying binary stream, i.e.

stdin = sys.stdin.detach()
stdout = sys.stdout.detach()

and use those instead.

Or set LC_ALL or LC_CTYPE to an ISO-8859-* locale (any stream of bytes can
be decoded, and any string resulting from decoding can be encoded).
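
The round-trip claim can be checked with a small sketch. It tests only ISO-8859-1 (Latin-1); other members of the ISO-8859 family are not tested here and may leave some byte values undefined. The helper name `roundtrips` is made up for the illustration.

```python
def roundtrips(data, encoding):
    """True if `data` decodes with `encoding` and re-encodes to the same bytes."""
    try:
        return data.decode(encoding).encode(encoding) == data
    except UnicodeDecodeError:
        return False


every_byte = bytes(range(256))
print(roundtrips(every_byte, "iso-8859-1"))  # True: all 256 byte values are mapped
print(roundtrips(b"\x8f", "cp1252"))         # False: 0x8f is undefined, as in the traceback
```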



steve at REMOVE-THIS-cybersource

Nov 24, 2009, 5:02 AM

Post #6 of 17 (596 views)
Permalink
Re: UnicodeDecodeError? Argh! Nothing works! I'm tired and hurting and... [In reply to]

On Mon, 23 Nov 2009 22:06:29 +0100, Alf P. Steinbach wrote:


> 6. Googling, yes, it seems Thunderbird has a habit of "forgetting"
> mails. But they're really there after all. It's just the index that's
> screwed up.
[...]
> And I'm hesitant to just delete index file, hoping that it'll rebuild.

Right-click on the mailbox and choose "Rebuild Index".

If you're particularly paranoid, and you probably should be, make a
backup copy of the entire mail folder first.

http://kb.mozillazine.org/Compacting_folders
http://kb.mozillazine.org/Recover_messages_from_a_corrupt_folder
http://kb.mozillazine.org/Disappearing_mail


Good grief, it's about six weeks away from 2010 and Thunderbird still
uses mbox as its default mail box format. Hello, the nineties called,
they want their mail formats back! Are the tbird developers on crack or
something? I can't believe that they're still using that crappy format.

No, I tell a lie. I can believe it far too well.



--
Steven


cjns1989 at gmail

Nov 24, 2009, 10:19 AM

Post #7 of 17 (594 views)
Permalink
Re: UnicodeDecodeError? Argh! Nothing works! I'm tired and hurting and... [In reply to]

On Tue, Nov 24, 2009 at 08:02:09AM EST, Steven D'Aprano wrote:

> Good grief, it's about six weeks away from 2010 and Thunderbird still
> uses mbox as its default mail box format. Hello, the nineties called,
> they want their mail formats back! Are the tbird developers on crack or
> something? I can't believe that they're still using that crappy format.
>
> No, I tell a lie. I can believe it far too well.

:-)

I realize that's somewhat OT, but what mail box format do you recommend,
and why?

Thanks,

CJ


steve at REMOVE-THIS-cybersource

Nov 24, 2009, 2:43 PM

Post #8 of 17 (587 views)
Permalink
Re: UnicodeDecodeError? Argh! Nothing works! I'm tired and hurting and... [In reply to]

On Tue, 24 Nov 2009 13:19:10 -0500, Chris Jones wrote:

> On Tue, Nov 24, 2009 at 08:02:09AM EST, Steven D'Aprano wrote:
>
>> Good grief, it's about six weeks away from 2010 and Thunderbird still
>> uses mbox as its default mail box format. Hello, the nineties called,
>> they want their mail formats back! Are the tbird developers on crack or
>> something? I can't believe that they're still using that crappy format.
>>
>> No, I tell a lie. I can believe it far too well.
>
> :-)
>
> I realize that's somewhat OT, but what mail box format do you recommend,
> and why?

maildir++

http://en.wikipedia.org/wiki/Maildir

Corruption is less likely, and if there is corruption you'll only lose a
single message rather than potentially everything in the mail folder[*].
At a pinch you can read the emails using a text editor or easily grep
through them. Compacting the mail folder is lightning fast, there's
no wasted space in the mail folder, and there's no need to mangle lines
starting with "From " in the body of the email.
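
For readers who have not met the "From " mangling: because mbox marks the start of each message with a line beginning "From ", a body line that starts the same way has to be altered before it is stored. The sketch below shows the simplest form of that rule (prefix such lines with ">"). The function name `escape_from_lines` is invented for the example and this is not any particular mail client's code.

```python
def escape_from_lines(body):
    """Prefix '>' to every body line that starts with 'From ' (simple mbox quoting)."""
    out = []
    for line in body.splitlines(keepends=True):
        if line.startswith(b"From "):
            line = b">" + line   # the stored text now differs from what was sent
        out.append(line)
    return b"".join(out)
```

With one file per message, as in maildir, no separator line exists, so the body can be stored unchanged.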

The only major downside is that because you're dealing with potentially
thousands of smallish files, it *may* have reduced performance on some
older file systems that don't deal well with lots of files. These days,
that's not a real issue.

Oh yes, and people using Windows can't use maildir because (1) it doesn't
allow colons in names, and (2) it doesn't have atomic renames. Neither of
these are insurmountable problems: an implementation could substitute
another character for the colon, and while that would be a technical
violation of the standard, it would still work. And the lack of atomic
renames would simply mean that implementations have to be more careful
about not having two threads writing to the one mailbox at the same time.




[*] I'm assuming normal "oops there's a bug in the mail client code"
corruption rather than "I got drunk and started deleting random files and
directories" corruption.
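
As an illustration of the colon substitution mentioned above, Python's standard `mailbox` module documents a `colon` attribute on `Maildir` for exactly this purpose. The sketch below copies an mbox file into a maildir using it. The function name `mbox_to_maildir` and its parameters are made up for the example, and it does nothing about the atomic-rename concern.

```python
import mailbox


def mbox_to_maildir(mbox_path, maildir_path, colon=":"):
    """Copy every message from an mbox file into a maildir; return the count.

    Pass colon="!" (or another character) on systems that do not allow
    colons in file names.
    """
    src = mailbox.mbox(mbox_path, create=False)
    dst = mailbox.Maildir(maildir_path, create=True)
    dst.colon = colon            # separator used in maildir file names
    count = 0
    try:
        for msg in src:          # iterating a mailbox yields message objects
            dst.add(msg)
            count += 1
    finally:
        dst.close()
        src.close()
    return count
```

A call such as `mbox_to_maildir("Inbox", "Inbox.maildir", colon="!")` would be one way to try it on Windows; both paths here are placeholders.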



--
Steven


samwyse at gmail

Nov 24, 2009, 3:09 PM

Post #9 of 17 (586 views)
Permalink
Re: UnicodeDecodeError? Argh! Nothing works! I'm tired and hurting and... [In reply to]

On Nov 24, 4:43 pm, Steven D'Aprano <st...@REMOVE-THIS-
cybersource.com.au> wrote:

> Oh yes, and people using Windows can't use maildir because (1) it doesn't
> allow colons in names, and (2) it doesn't have atomic renames. Neither of
> these are insurmountable problems: an implementation could substitute
> another character for the colon, and while that would be a technical
> violation of the standard, it would still work. And the lack of atomic
> renames would simply mean that implementations have to be more careful
> about not having two threads writing to the one mailbox at the same time.

A common workaround for the former is to URL-encode the names, which
lets you stick in all sorts of odd characters.

I'm afraid I can't help with the latter, though.


cjns1989 at gmail

Nov 24, 2009, 9:11 PM

Post #10 of 17 (582 views)
Permalink
Re: UnicodeDecodeError? Argh! Nothing works! I'm tired and hurting and... [In reply to]

On Tue, Nov 24, 2009 at 05:43:32PM EST, Steven D'Aprano wrote:
> On Tue, 24 Nov 2009 13:19:10 -0500, Chris Jones wrote:
>
> > On Tue, Nov 24, 2009 at 08:02:09AM EST, Steven D'Aprano wrote:
> >
> >> Good grief, it's about six weeks away from 2010 and Thunderbird still
> >> uses mbox as its default mail box format. Hello, the nineties called,
> >> they want their mail formats back! Are the tbird developers on crack or
> >> something? I can't believe that they're still using that crappy format.
> >>
> >> No, I tell a lie. I can believe it far too well.
> >
> > :-)
> >
> > I realize that's somewhat OT, but what mail box format do you recommend,
> > and why?
>
> maildir++
>
> http://en.wikipedia.org/wiki/Maildir

Outside the two pluses, maildir also goes back to the 90s - 1995, Daniel
Bernstein's original specification.

> Corruption is less likely, if there is corruption you'll only lose a
> single message rather than potentially everything in the mail folder[*],
> at a pinch you can read the emails using a text editor or easily grep
> through them, and compacting the mail folder is lightning fast, there's
> no wasted space in the mail folder, and there's no need to mangle lines
> starting with "From " in the body of the email.

This last aspect is very welcome.

> The only major downside is that because you're dealing with potentially
> thousands of smallish files, it *may* have reduced performance on some
> older file systems that don't deal well with lots of files. These days,
> that's not a real issue.
>
> Oh yes, and people using Windows can't use maildir because (1) it doesn't
> allow colons in names, and (2) it doesn't have atomic renames. Neither of
> these are insurmountable problems: an implementation could substitute
> another character for the colon, and while that would be a technical
> violation of the standard, it would still work. And the lack of atomic
> renames would simply mean that implementations have to be more careful
> about not having two threads writing to the one mailbox at the same time.
>
>
> [*] I'm assuming normal "oops there's a bug in the mail client code"
> corruption rather than "I got drunk and started deleting random files and
> directories" corruption.

I'm not concerned with the other aspects, but I'm reaching a point where
mutt is becoming rather sluggish with the mbox format, especially with
mail boxes that have more than about 3000 messages, and it looks like
maildir, especially with some form of header caching, might help.

Looks like running a local IMAP server would probably be more effective,
though.

Thank you for your comments.

CJ


aahz at pythoncraft

Dec 1, 2009, 8:45 AM

Post #11 of 17 (506 views)
Permalink
Re: UnicodeDecodeError? Argh! Nothing works! I'm tired and hurting and... [In reply to]

In article <031bc732$0$1336$c3e8da3 [at] news>,
Steven D'Aprano <steve [at] REMOVE-THIS-cybersource> wrote:
>
>Good grief, it's about six weeks away from 2010 and Thunderbird still
>uses mbox as its default mail box format. Hello, the nineties called,
>they want their mail formats back! Are the tbird developers on crack or
>something? I can't believe that they're still using that crappy format.

Just to be contrary, I *like* mbox.
--
Aahz (aahz [at] pythoncraft) <*> http://www.pythoncraft.com/

The best way to get information on Usenet is not to ask a question, but
to post the wrong information.


michael at stroeder

Dec 3, 2009, 3:52 PM

Post #12 of 17 (490 views)
Permalink
Re: UnicodeDecodeError? Argh! Nothing works! I'm tired and hurting and... [In reply to]

Aahz wrote:
> In article <031bc732$0$1336$c3e8da3 [at] news>,
> Steven D'Aprano <steve [at] REMOVE-THIS-cybersource> wrote:
>> Good grief, it's about six weeks away from 2010 and Thunderbird still
>> uses mbox as its default mail box format. Hello, the nineties called,
>> they want their mail formats back! Are the tbird developers on crack or
>> something? I can't believe that they're still using that crappy format.
>
> Just to be contrary, I *like* mbox.

Me too. :-)

Ciao, Michael.


steve at REMOVE-THIS-cybersource

Dec 3, 2009, 4:33 PM

Post #13 of 17 (489 views)
Permalink
Re: UnicodeDecodeError? Argh! Nothing works! I'm tired and hurting and... [In reply to]

On Fri, 04 Dec 2009 00:52:35 +0100, Michael Ströder wrote:

> Aahz wrote:
>> In article <031bc732$0$1336$c3e8da3 [at] news>, Steven D'Aprano
>> <steve [at] REMOVE-THIS-cybersource> wrote:
>>> Good grief, it's about six weeks away from 2010 and Thunderbird still
>>> uses mbox as its default mail box format. Hello, the nineties called,
>>> they want their mail formats back! Are the tbird developers on crack
>>> or something? I can't believe that they're still using that crappy
>>> format.
>>
>> Just to be contrary, I *like* mbox.
>
> Me too. :-)


Why? What features or benefits of mbox do you see that make up for its
disadvantages?



--
Steven


drobinow at gmail

Dec 3, 2009, 4:59 PM

Post #14 of 17 (490 views)
Permalink
Re: UnicodeDecodeError? Argh! Nothing works! I'm tired and hurting and... [In reply to]

On Thu, Dec 3, 2009 at 7:33 PM, Steven D'Aprano
<steve [at] remove-this-cybersource> wrote:
> On Fri, 04 Dec 2009 00:52:35 +0100, Michael Ströder wrote:
>
>> Aahz wrote:
>>> Just to be contrary, I *like* mbox.
>>
>> Me too. :-)
>
>
> Why? What features or benefits of mbox do you see that make up for its
> disadvantages?

I've never heard of mbox. Is it written in Python?


steve at REMOVE-THIS-cybersource

Dec 3, 2009, 5:27 PM

Post #15 of 17 (489 views)
Permalink
Re: UnicodeDecodeError? Argh! Nothing works! I'm tired and hurting and... [In reply to]

On Thu, 03 Dec 2009 19:59:30 -0500, David Robinow wrote:


> I've never heard of mbox. Is it written in Python?

It is a file format used for storing email. Wikipedia is your friend:

http://en.wikipedia.org/wiki/Mbox


--
Steven


cjns1989 at gmail

Dec 3, 2009, 5:47 PM

Post #16 of 17 (490 views)
Permalink
Re: UnicodeDecodeError? Argh! Nothing works! I'm tired and hurting and... [In reply to]

On Thu, Dec 03, 2009 at 07:59:30PM EST, David Robinow wrote:
> On Thu, Dec 3, 2009 at 7:33 PM, Steven D'Aprano
> <steve [at] remove-this-cybersource> wrote:
> > On Fri, 04 Dec 2009 00:52:35 +0100, Michael Ströder wrote:
> >
> >> Aahz wrote:
> >>> Just to be contrary, I *like* mbox.
> >>
> >> Me too. :-)
> >
> >
> > Why? What features or benefits of mbox do you see that make up for its
> > disadvantages?
>
> I've never heard of mbox. Is it written in Python?

English, actually.. short for mail box, I gather.

CJ


nobody at nowhere

Dec 5, 2009, 8:20 AM

Post #17 of 17 (462 views)
Permalink
Re: UnicodeDecodeError? Argh! Nothing works! I'm tired and hurting and... [In reply to]

On Fri, 04 Dec 2009 00:33:57 +0000, Steven D'Aprano wrote:

>>> Just to be contrary, I *like* mbox.
>>
>> Me too. :-)

Me too.

> Why? What features or benefits of mbox do you see that make up for its
> disadvantages?

Simplicity and performance.

Maildir isn't simple when you add in the filesystem or archive format
(leaving aside the fact that maildir cannot be processed using nothing but
ANSI C).

Nor is it particularly quick if you want to grep for a message in a
decade's worth of archives (even on Linux; and NTFS is *much* worse for
dealing with many small files).

