
Mailing List Archive: DBMail: users

importing email with utf encoded subjects

 

 



aleksander.kamenik at krediidiinfo

Oct 8, 2010, 5:55 AM

Post #1 of 6 (779 views)
importing email with utf encoded subjects

Hi,

I have mbox files I want to import with the mailbox2dbmail tool. One mbox file equals 1 imap folder and can be over a gigabyte in size.

The mbox files contain Subjects with various encodings: some are ISO-8859, some specify the encoding (Subject: =?windows-1257?Q?...), and some are mixed (several subjects with different encodings). However, many subjects are in UTF only, utf7 I assume.

These, when imported, don't show correctly in Outlook 2007 IMAP clients; they are rendered as ISO. The problem shows with umlauts and other characters above 127.

Using dbmail 2.2.17. Would it be possible for dbmail-smtp to recognize these UTF subjects and convert them to something acceptable for dbmail on the fly?

The database is configured as UTF8 and the mbox files were generated from Outlook pst files using libpst.

Regards,

Aleksander Kamenik
System Administrator
Krediidiinfo AS
an Experian Company
Phone: +372 665 9649
Email: aleksander [at] krediidiinfo

_______________________________________________
DBmail mailing list
DBmail [at] dbmail
http://mailman.fastxs.nl/cgi-bin/mailman/listinfo/dbmail


tokul at users

Oct 8, 2010, 6:10 AM

Post #2 of 6 (764 views)
Re: importing email with utf encoded subjects

2010.10.08 15:55 Kamenik, Aleksander wrote:
> Hi,
>
> I have mbox files I want to import with the mailbox2dbmail tool. One mbox
> file equals 1 imap folder and can be over a gigabyte in size.
>
> The mbox files contain Subjects with various encodings, some are ISO-8859,
> some specify the encoding (Subject: =?windows-1257?Q?...) and some are mixed
> (several subjects with different encodings). However many subjects are in
> UTF only, utf7 I assume.

You don't have to assume anything. The character set name is written in the
first section of a B or Q encoded word. If no character set name is given and
the subject is not encoded, it must be in US-ASCII.

utf7 is rarely used for Subject. You have utf-8 (=?utf-8?b? or =?utf-8?q?)
or some unicode variant (unicode-#-# or unicode-#-#-some-text), or you are
confusing Unicode with broken 8bit headers.
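
To illustrate the point about the character set name, here is a minimal sketch in Python (my choice of language; nobody in this thread used it). It relies only on the standard library's email.header.decode_header; the function name describe_subject and the sample subjects are made up for illustration.

```python
from email.header import decode_header


def describe_subject(raw_value):
    """Return (text, charset) pairs for each part of a Subject value.

    charset is None for parts that were not RFC 2047 encoded.
    """
    parts = []
    for payload, charset in decode_header(raw_value):
        if isinstance(payload, bytes):
            try:
                text = payload.decode(charset or "ascii", errors="replace")
            except LookupError:
                # Charset label that Python has no codec for.
                text = payload.decode("ascii", errors="replace")
        else:
            text = payload
        parts.append((text, charset))
    return parts


# Hypothetical sample subjects, one per charset mentioned in the thread.
print(describe_subject("=?utf-8?q?palun_juurdep=C3=A4=C3=A4su?="))
print(describe_subject("=?windows-1257?Q?juurdep=E4=E4su?="))
print(describe_subject("plain ascii subject"))
```

The charset comes straight from the first section of each encoded word, so nothing has to be guessed for subjects that are encoded properly.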

--
Tomas




aleksander.kamenik at krediidiinfo

Oct 8, 2010, 7:17 AM

Post #3 of 6 (759 views)
Re: importing email with utf encoded subjects

> You don't have to assume anything. Character set name is written in first
> section of B|Q encoding. If character set name is not written and subject
> is not encoded, it must be in us ascii.
>
> utf7 is rarely used for Subject. You have utf-8 (=?utf-8?b? or =?utf-8?q?)
> or some unicode variant (unicode-#-# or unicode-#-#-some-text) or you
> confuse Unicode with broken 8bit headers.

These are from Outlook 2007 as far as I can tell. Not everybody follows standards. For example:

# grep 'Subject: palun juur' mbox
Subject: palun juurdepääsu
# grep 'Subject: palun juur' mbox | hexdump -C
00000000 53 75 62 6a 65 63 74 3a 20 70 61 6c 75 6e 20 6a |Subject: palun j|
00000010 75 75 72 64 65 70 c3 a4 c3 a4 73 75 0a |uurdep....su.|
0000001d
#

There are loads of these.

Regards,

Aleksander Kamenik
System Administrator
Krediidiinfo AS
an Experian Company
Phone: +372 665 9649
Email: aleksander [at] krediidiinfo


tokul at users

Oct 8, 2010, 7:40 AM

Post #4 of 6 (759 views)
Re: importing email with utf encoded subjects

2010.10.08 17:17 Kamenik, Aleksander wrote:
>> You don't have to assume anything. Character set name is written in first
>> section of B|Q encoding. If character set name is not written and subject
>> is not encoded, it must be in us ascii.
>>
>> utf7 is rarely used for Subject. You have utf-8 (=?utf-8?b? or =?utf-8?q?)
>> or some unicode variant (unicode-#-# or unicode-#-#-some-text) or you
>> confuse Unicode with broken 8bit headers.
>
> These are from Outlook 2007 as far as I can tell. Not everybody follows
> standards. For example:
>
> # grep 'Subject: palun juur' mbox
> Subject: palun juurdepääsu
> # grep 'Subject: palun juur' mbox | hexdump -C
> 00000000 53 75 62 6a 65 63 74 3a 20 70 61 6c 75 6e 20 6a |Subject: palun j|
> 00000010 75 75 72 64 65 70 c3 a4 c3 a4 73 75 0a |uurdep....su.|
> 0000001d
> #

c3 a4 c3 a4

It is UTF-8, but it is also a violation of rfc822/rfc2047: 8-bit text in
headers must be encoded. A computer program can't reliably detect the
character set if the sender does not specify which one is used. It is highly
unlikely that all your malformed emails are in utf-8. You can have a mix of
utf-8, iso-8859-1, iso-8859-13, iso-8859-15, windows-1252, windows-1257 and
other character sets. Older Estonian emails are probably not in utf-8. If you
try to fix all 8bit subjects, you will break malformed iso-8859-x Estonian
texts that look ok in Outlook now.
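
As a rough illustration of why c3 a4 c3 a4 reads as UTF-8, and of how far such a check can go, here is a small Python sketch (Python is an assumption of mine, as is the function name looks_like_utf8). Valid UTF-8 is only a heuristic: it says the bytes could be UTF-8, not what the sender meant.

```python
def looks_like_utf8(raw):
    """True if raw has non-ASCII bytes that form a valid UTF-8 sequence.

    This is a heuristic, not proof of the sender's character set.
    """
    if raw.isascii():
        return False
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Bytes from the hexdump above, minus the "Subject: " prefix and newline.
utf8_subject = b"palun juurdep\xc3\xa4\xc3\xa4su"
# The same word as a single-byte charset (iso-8859-1, windows-1257) stores it.
single_byte_subject = b"palun juurdep\xe4\xe4su"

print(looks_like_utf8(utf8_subject))
print(looks_like_utf8(single_byte_subject))
# What a client shows when it reads the UTF-8 bytes as iso-8859-1.
print(utf8_subject.decode("iso-8859-1"))
```

The last line shows the kind of garbling a client produces when it renders raw UTF-8 bytes as ISO-8859-1, and the second test case is the sort of header that a blanket "fix everything as UTF-8" pass must leave alone.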

If those utf-8 emails looked OK in Outlook, maybe problem is in libpst.

--
Tomas



aleksander.kamenik at krediidiinfo

Oct 8, 2010, 8:36 AM

Post #5 of 6 (769 views)
Re: importing email with utf encoded subjects

> It is in UTF-8, but it is also violation of rfc822/rfc2047. Headers must
> be encoded. Computer program can't detect used character set, if sender
> does not specify which character set is used.

I was aware of that. However these emails exist and I can't get Outlook to change anyway.

> It is highly unlikely that all your malformed emails are in utf-8. You can
> have a mix of utf-8, iso-8859-1, iso-8859-13, iso-8859-15, windows-1252,
> windows-1257 and other character sets. Older Estonian emails are probably
> not in utf-8. If you try to fix all 8bit subjects, you will break malformed
> iso-8859-x Estonian texts that look ok in Outlook now.

The problem is only with the Subject header on a subset of emails. The bodies look OK before and after conversion. UTF8 subjects break after conversion from pst to dbmail via mbox.

My plan if all else fails is to generate the mbox files and then go through them searching for the UTF8 subjects and recode these only. Or modify libpst to do that which would be more efficient.

> If those utf-8 emails looked OK in Outlook, maybe problem is in libpst.

Maybe, however I remember having come across this problem with malformed Outlook subjects before. I'll investigate further next week.

Thanks for your comments Tomas!

Regards,

Aleksander Kamenik
System Administrator
Krediidiinfo AS
an Experian Company
Phone: +372 665 9649
Email: aleksander [at] krediidiinfo


aleksander.kamenik at krediidiinfo

Oct 14, 2010, 12:16 AM

Post #6 of 6 (758 views)
Re: importing email with utf encoded subjects

Replying to myself. FYI.

> -----Original Message-----
> From: dbmail-bounces [at] dbmail [mailto:dbmail-bounces [at] dbmail] On
> Behalf Of Kamenik, Aleksander
> Sent: Friday, October 08, 2010 6:36 PM
> To: DBMail mailinglist
> Subject: Re: [Dbmail] importing email with utf envoded subjects
>
> > It is in UTF-8, but it is also violation of rfc822/rfc2047. Headers must
> > be encoded. Computer program can't detect used character set, if sender
> > does not specify which character set is used.
>
> I was aware of that. However these emails exist and I can't get Outlook
> to change anyway.

These emails are created when Outlook generates them internally, for example a message from Office Communicator which states "Missed conversation with somebody, somebody else, etc". If one of the names contains characters with umlauts, it is simply encoded in UTF8 without specifying so.

There are also emails sent from other Outlook clients via the same Exchange server which have UTF8 subjects when they contain, for example, umlauts.

> > It is highly unlikely that all your malformed emails are in utf-8. You
> > can have a mix of utf-8, iso-8859-1, iso-8859-13, iso-8859-15,
> > windows-1252, windows-1257 and other character sets. Older Estonian
> > emails are probably not in utf-8. If you try to fix all 8bit subjects,
> > you will break malformed iso-8859-x Estonian texts that look ok in
> > Outlook now.
>
> The problem is only with the Subject header on a subset of emails. The
> bodies look OK before and after conversion. UTF8 subjects break after
> conversion from pst to dbmail via mbox.
>
> My plan if all else fails is to generate the mbox files and then go
> through them searching for the UTF8 subjects and recode these only. Or
> modify libpst to do that which would be more efficient.

And so I did: an awk script with a pipe to a shell script that does the charset check and converts to quoted-printable format using mmencode. I have a solution.
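
The awk and shell scripts are not reproduced here. For anyone without mmencode, the check-and-convert step could be sketched in Python roughly as follows. This is an assumption-laden stand-in, not the script described above: the function name encode_utf8_subject is made up, and it uses the standard library's email.header and email.charset modules instead of mmencode.

```python
from email.charset import QP, Charset
from email.header import Header, decode_header


def encode_utf8_subject(raw):
    """Return an RFC 2047 Q-encoded value for raw UTF-8 subject bytes.

    ASCII-only input is returned unchanged. Input that is not valid
    UTF-8 returns None so the caller can leave that header alone.
    """
    if raw.isascii():
        return raw.decode("ascii")
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return None
    charset = Charset("utf-8")
    charset.header_encoding = QP  # force Q encoding instead of the default choice
    return Header(text, charset=charset).encode()


encoded = encode_utf8_subject(b"palun juurdep\xc3\xa4\xc3\xa4su")
print(encoded)

# Round trip: decode the encoded word again to check nothing was lost.
payload, charset_name = decode_header(encoded)[0]
print(payload.decode(charset_name))
```

Returning None for 8-bit input that is not valid UTF-8 keeps the iso-8859-x and windows-125x subjects untouched, which is the risk Tomas pointed out earlier in the thread.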

If you ever do something similar, make sure to convert only the Subject: header in the header block of the main email message. You'll find loads of UTF8 subjects in the message bodies themselves, as Outlook usually quotes inline starting with the "-----Original Message-----" line and then quotes the To, From, Sent and Subject headers, which are in UTF8, but that's OK as they are in the body of the message.
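
A sketch of that header-only rule, again in Python rather than the awk I actually used, and under stated assumptions: each message starts with a "From " separator line, body lines that begin with "From " have been escaped (for example as ">From "), and Subject headers are not folded across several lines. The function name top_level_subject_lines and the sample data are hypothetical.

```python
def top_level_subject_lines(lines):
    """Yield indexes of 8-bit Subject: lines in message header blocks only.

    lines is a sequence of raw mbox lines as bytes, newline included.
    """
    in_headers = False
    for index, line in enumerate(lines):
        if line.startswith(b"From "):
            in_headers = True  # mbox separator: a new message begins
        elif in_headers and line in (b"\n", b"\r\n"):
            in_headers = False  # first blank line ends the header block
        elif (in_headers
              and line.lower().startswith(b"subject:")
              and not line.isascii()):
            yield index


# Made-up two-message mbox; the quoted Subject sits in the first body.
sample = [
    b"From a@example.invalid Fri Oct  8 18:36:00 2010\n",
    b"Subject: palun juurdep\xc3\xa4\xc3\xa4su\n",
    b"\n",
    b"-----Original Message-----\n",
    b"Subject: palun juurdep\xc3\xa4\xc3\xa4su\n",
    b"\n",
    b"From b@example.invalid Fri Oct  8 18:40:00 2010\n",
    b"Subject: plain ascii\n",
    b"\n",
    b"body text\n",
]

print(list(top_level_subject_lines(sample)))
```

Only the Subject in the first message's header block is reported; the identical line quoted after "-----Original Message-----" is in the body and is skipped.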

libpst should do this IMHO, but I'm not versed enough in C to actually write a patch.

Regards,

Aleksander Kamenik
System Administrator
Krediidiinfo AS
an Experian Company
Phone: +372 665 9649
Email: aleksander [at] krediidiinfo