aleksander.kamenik at krediidiinfo
Oct 14, 2010, 12:16 AM
Post #6 of 6
Replying to myself. FYI.
Re: importing email with utf encoded subjects
> -----Original Message-----
> From: dbmail-bounces [at] dbmail [mailto:dbmail-bounces [at] dbmail] On
> Behalf Of Kamenik, Aleksander
> Sent: Friday, October 08, 2010 6:36 PM
> To: DBMail mailinglist
> Subject: Re: [Dbmail] importing email with utf envoded subjects
> > It is in UTF-8, but it is also violation of rfc822/rfc2047. Headers
> > must
> > be encoded. Computer program can't detect used character set, if
> > does not specify which character set is used.
> I was aware of that. However these emails exist and I can't get Outlook
> to change anyway.
These emails are created when Outlook generates them internally, for example a message from Office Communicator which states "Missed conversation with somebody, somebody else, etc". If one of the names contains characters with umlauts, it is simply encoded in UTF-8 without saying so.
There are also emails sent from other Outlook clients via the same Exchange server which have UTF-8 subjects when they contain, for example, umlauts.
> > It is highly unlikely
> > that
> > all your malformed emails are in utf-8. You can have a mix of utf-8,
> > iso-8859-1, iso-8859-13, iso-8859-15, windows-1252, windows-1257 and
> > other
> > character sets. Older Estonian emails are probably not in utf-8. If
> > try to fix all 8bit subjects, you will break malformed iso-8859-x
> > Estonian
> > texts that look ok in Outlook now.
> The problem is only with the Subject header on a subset of emails. The
> bodies look OK before and after conversion. UTF8 subjects break after
> conversion from pst to dbmail via mbox.
> My plan if all else fails is to generate the mbox files and then go
> through them searching for the UTF8 subjects and recode these only. Or
> modify libpst to do that which would be more efficient.
And so I did: an awk script pipes to a shell script that does the charset check and converts to quoted-printable format using mmencode. I have a solution.
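For anyone who would rather not chain awk, shell and mmencode, the charset check and the conversion can be sketched in Python. This is not my actual script, just a minimal sketch under assumptions: the function name encode_subject_if_raw_utf8 is made up, and "decodes cleanly as UTF-8" is only a heuristic for "is UTF-8". Values that are 8-bit but not valid UTF-8 are left untouched, since their charset is unknown.

```python
from email.charset import QP, Charset
from email.header import Header

# Force the RFC 2047 "Q" (quoted-printable style) encoding for UTF-8 headers;
# by default Python picks whichever of Q and base64 comes out shorter.
_UTF8_Q = Charset("utf-8")
_UTF8_Q.header_encoding = QP


def encode_subject_if_raw_utf8(raw: bytes) -> bytes:
    """Return an RFC 2047 encoded Subject value for raw 8-bit UTF-8 input.

    Hypothetical helper: pure ASCII values (including values that are
    already encoded-words) and 8-bit values that are not valid UTF-8
    are returned unchanged.
    """
    if raw.isascii():
        return raw
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # 8-bit but not UTF-8 (iso-8859-x, windows-125x, ...): do not guess.
        return raw
    # Header.encode() also folds long values into several encoded-words.
    return Header(text, charset=_UTF8_Q).encode().encode("ascii")
```

The function takes only the value part of the header, without the "Subject: " prefix, so the caller has to split that off and put it back.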
If you ever do something similar, make sure to convert only the Subject: header in the header block of the main email message. You'll find loads of UTF-8 subjects in the message bodies themselves, because Outlook usually quotes inline, starting with the "-----Original Message-----" line and then quoting the To, From, Sent and Subject headers, which are in UTF-8. That's OK, as they are in the body of the message.
libpst should do this IMHO, but I'm not versed enough in C to actually write a patch.
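To illustrate the "main header only" rule, here is a rough Python sketch of a pass over an mbox file that touches Subject: lines only between a "From " separator line and the first blank line after it. The names rewrite_top_level_subjects and fix_value are invented for this example. It assumes that body lines starting with "From " are escaped (as mbox writers normally do) and that the Subject fits on one line; folded continuation lines are passed through unchanged.

```python
from typing import Callable, Iterable, Iterator


def rewrite_top_level_subjects(
    lines: Iterable[bytes],
    fix_value: Callable[[bytes], bytes],
) -> Iterator[bytes]:
    """Yield mbox lines, applying fix_value to top-level Subject values only.

    Hypothetical helper: fix_value is any callable that takes the raw
    header value and returns the re-encoded value.
    """
    in_headers = False
    prev_blank = True  # the start of the file counts as a message boundary
    for line in lines:
        stripped = line.rstrip(b"\r\n")
        if prev_blank and line.startswith(b"From "):
            in_headers = True  # mbox separator: a new message starts here
        elif in_headers and stripped == b"":
            in_headers = False  # first blank line ends the main header block
        elif in_headers and line[:8].lower() == b"subject:":
            eol = line[len(stripped):]  # keep the original line ending
            value = stripped[8:].strip()
            line = b"Subject: " + fix_value(value) + eol
        prev_blank = stripped == b""
        yield line
```

Subject lines quoted after "-----Original Message-----" are never touched, because by then the state machine has already left the header block. To use it, open the mbox in binary mode, feed the file object to the generator, and write each yielded line to a new file.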
an Experian Company
Phone: +372 665 9649
Email: aleksander [at] krediidiinfo
DBmail mailing list
DBmail [at] dbmail