Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: exim: dev

Bad utf-8 in pgsql lookup and mainlog

 

 

exim dev RSS feed   Index | Next | Previous | View Threaded


Axel.Rau at chaos1

Jul 20, 2013, 5:17 AM

Post #1 of 9 (110 views)
Permalink
Bad utf-8 in pgsql lookup and mainlog

Recording of utf-8 characters from headers in mainlog and PostgreSQL DB via lookup usually works flawlessly.

Occasionally PostgreSQL complains during INSERT of header items or main log events (our log host uses PostgreSQL as bakend) about invalid byte sequence, like here:

[1\3] 1V085d-00067H-9X H=mail03.noris.net [62.128.1.223] Warning: ACL "warn" statement skipped: condition test deferred: PGSQL: query failed: ERROR: invalid byte sequence for encoding "UTF8": 0xfc

2013-07-19T10:39:10.005396+00:00 db1 rsyslogd: db error (22021): invalid byte sequence for encoding "UTF8": 0xfc
2013-07-19T10:39:10.005415+00:00 db1 rsyslogd: db error (event): |2013-07-19t10:39:09.991124+00:00|6|2|mx4|exim| [2\3] (PGRES_FATAL_ERROR) (SELECT * FROM record_Reception( '1525916', '1V085d-00067H-9X', 'Staatstheater Nürnberg <info [at] staatstheater>', 'Newsletter Staatstheater Nürnberg', 'none', 'N/A'))

Does this come from bad encoding of original mail headers?
Is there an easy solution to skip bad characters before sending them to the DB?

In lokkups/pgsql.c:258 I see:
PQsetClientEncoding(pg_conn, "SQL_ASCII");

but I think it's not related.

Thanks, Axel
---
PGP-Key:29E99DD6 ☀ +49 151 2300 9283 ☀ computing @ chaos claudius


--
## List details at https://lists.exim.org/mailman/listinfo/exim-dev Exim details at http://www.exim.org/ ##


jgh at wizmail

Jul 20, 2013, 6:45 AM

Post #2 of 9 (106 views)
Permalink
Re: Bad utf-8 in pgsql lookup and mainlog [In reply to]

On 07/20/2013 01:17 PM, Axel Rau wrote:
> Recording of utf-8 characters from headers in mainlog and PostgreSQL DB via lookup usually works flawlessly.
>
> Occasionally PostgreSQL complains during INSERT of header items or main log events (our log host uses PostgreSQL as bakend) about invalid byte sequence, like here:
>
> [1\3] 1V085d-00067H-9X H=mail03.noris.net [62.128.1.223] Warning: ACL "warn" statement skipped: condition test deferred: PGSQL: query failed: ERROR: invalid byte sequence for encoding "UTF8": 0xfc
>
> 2013-07-19T10:39:10.005396+00:00 db1 rsyslogd: db error (22021): invalid byte sequence for encoding "UTF8": 0xfc
> 2013-07-19T10:39:10.005415+00:00 db1 rsyslogd: db error (event): |2013-07-19t10:39:09.991124+00:00|6|2|mx4|exim| [2\3] (PGRES_FATAL_ERROR) (SELECT * FROM record_Reception( '1525916', '1V085d-00067H-9X', 'Staatstheater Nürnberg <info [at] staatstheater>', 'Newsletter Staatstheater Nürnberg', 'none', 'N/A'))
>
> Does this come from bad encoding of original mail headers?

It seems likely from the complaint coming from pgsql,
but you've not shown us any relevant bit of your exim config.

Can you duplicate the situation manually?
Can you identify the lines in your exim config that are responsible?
--
Cheers,
Jeremy

--
## List details at https://lists.exim.org/mailman/listinfo/exim-dev Exim details at http://www.exim.org/ ##


Axel.Rau at chaos1

Jul 20, 2013, 7:41 AM

Post #3 of 9 (106 views)
Permalink
Re: Bad utf-8 in pgsql lookup and mainlog [In reply to]

Am 20.07.2013 um 15:45 schrieb Jeremy Harris <jgh [at] wizmail>:

> On 07/20/2013 01:17 PM, Axel Rau wrote:
>> Recording of utf-8 characters from headers in mainlog and PostgreSQL DB via lookup usually works flawlessly.
>>
>> Occasionally PostgreSQL complains during INSERT of header items or main log events (our log host uses PostgreSQL as bakend) about invalid byte sequence, like here:
>>
>> [1\3] 1V085d-00067H-9X H=mail03.noris.net [62.128.1.223] Warning: ACL "warn" statement skipped: condition test deferred: PGSQL: query failed: ERROR: invalid byte sequence for encoding "UTF8": 0xfc
>>
>> 2013-07-19T10:39:10.005396+00:00 db1 rsyslogd: db error (22021): invalid byte sequence for encoding "UTF8": 0xfc
>> 2013-07-19T10:39:10.005415+00:00 db1 rsyslogd: db error (event): |2013-07-19t10:39:09.991124+00:00|6|2|mx4|exim| [2\3] (PGRES_FATAL_ERROR) (SELECT * FROM record_Reception( '1525916', '1V085d-00067H-9X', 'Staatstheater Nürnberg <info [at] staatstheater>', 'Newsletter Staatstheater Nürnberg', 'none', 'N/A'))
>>
>> Does this come from bad encoding of original mail headers?
>
> It seems likely from the complaint coming from pgsql,
> but you've not shown us any relevant bit of your exim config.
---
warn condition = ${if eq {${lookup pgsql {SELECT * FROM record_Reception( \
'${quote_pgsql:$acl_m_mail_id_list}', \
'${quote_pgsql:$message_exim_id}', \
'${quote_pgsql:${if def:h_from:{$h_from:}{$h_sender:}}}', \
'${quote_pgsql:$h_Subject:}', \
'${quote_pgsql:$dkim_verify_status}', \
'${quote_pgsql:N/A}')}}}{t}}
---
record_Reception(…) is a backend function:
---
CREATE OR REPLACE FUNCTION record_Reception(mail_id_list text, msgID text, from_hdr text, subject_hdr text, DKIMresult text, SA_result text) RETURNS BOOLEAN AS $$
DECLARE
retval BOOLEAN := True;
affected_rows INT;
BEGIN
--RAISE NOTICE '[record_Reception 1: mail_id_list_text = %; Subject = "%s"]', mail_id_list, convert_from(from_hdr, 'LATIN1');
UPDATE mail
SET messageID = msgID,
from_header = from_hdr,
subject_header = subject_hdr,
deliverytime = NOW()
WHERE id = ANY( CAST(string_to_array( mail_id_list, ',' ) AS BIGINT[]));
GET DIAGNOSTICS affected_rows := ROW_COUNT;
IF affected_rows = 0 THEN
RAISE NOTICE '[record_Reception 1: mail_id_list_text = %]', mail_id_list;
retval = False;
END IF;
RETURN retval;
END;
$$ LANGUAGE 'plpgsql';
---

>
> Can you duplicate the situation manually?
No. I would have to create a mail with an invalid byte sequence in a subject header.
Any suggestions?
> Can you identify the lines in your exim config that are responsible?
Yes, see above.

Axel
---
PGP-Key:29E99DD6 ☀ +49 151 2300 9283 ☀ computing @ chaos claudius


--
## List details at https://lists.exim.org/mailman/listinfo/exim-dev Exim details at http://www.exim.org/ ##


marcin at mejor

Jul 20, 2013, 8:25 AM

Post #4 of 9 (104 views)
Permalink
Re: Bad utf-8 in pgsql lookup and mainlog [In reply to]

W dniu 2013-07-20 14:17, Axel Rau pisze:
> Recording of utf-8 characters from headers in mainlog and PostgreSQL DB via lookup usually works flawlessly.
>
> Occasionally PostgreSQL complains during INSERT of header items or main log events (our log host uses PostgreSQL as bakend) about invalid byte sequence, like here:
>
> [1\3] 1V085d-00067H-9X H=mail03.noris.net [62.128.1.223] Warning: ACL "warn" statement skipped: condition test deferred: PGSQL: query failed: ERROR: invalid byte sequence for encoding "UTF8": 0xfc
>
> 2013-07-19T10:39:10.005396+00:00 db1 rsyslogd: db error (22021): invalid byte sequence for encoding "UTF8": 0xfc
> 2013-07-19T10:39:10.005415+00:00 db1 rsyslogd: db error (event): |2013-07-19t10:39:09.991124+00:00|6|2|mx4|exim| [2\3] (PGRES_FATAL_ERROR) (SELECT * FROM record_Reception( '1525916', '1V085d-00067H-9X', 'Staatstheater Nürnberg <info [at] staatstheater>', 'Newsletter Staatstheater Nürnberg', 'none', 'N/A'))
>
> Does this come from bad encoding of original mail headers?
> Is there an easy solution to skip bad characters before sending them to the DB?
>
> In lokkups/pgsql.c:258 I see:
> PQsetClientEncoding(pg_conn, "SQL_ASCII");
>
> but I think it's not related.

Hi Axel!
I suspect it is related. If you try to insert text into postgresql you
should know which encoding is used in this text. If you know the
inserted text is in utf-8 you should use set "clientencoding" to utf-8.
But in emails you never know what encoding will be used. In theory it
should be used only basic ASCII characters.
You can:
a) rejects mail with non ASCII chars in Subject
b) encode Subject using e.g. base64 then inserts to database
c) guess which encoding was used in Subject, then set properly
"clientencoding" parameter
d) use "C" collation for given database/table in postgresql - it allows
you to insert any characters into table. But you will lost possibility
to get tuple in your preffered charset. (E.g. you can keep text in utf-8
in database but when you set "clientencoding" to e.g. 8859-2 you will
get text in 8859-2. With "C" collation pgsql doesn't do encoding to e.g
iso8859-2)

Regards


--
## List details at https://lists.exim.org/mailman/listinfo/exim-dev Exim details at http://www.exim.org/ ##


Axel.Rau at chaos1

Jul 20, 2013, 10:05 AM

Post #5 of 9 (105 views)
Permalink
Re: Bad utf-8 in pgsql lookup and mainlog [In reply to]

Am 20.07.2013 um 17:25 schrieb Marcin Mirosław <marcin [at] mejor>:

> W dniu 2013-07-20 14:17, Axel Rau pisze:
>> Recording of utf-8 characters from headers in mainlog and PostgreSQL DB via lookup usually works flawlessly.
>>
>> Occasionally PostgreSQL complains during INSERT of header items or main log events (our log host uses PostgreSQL as bakend) about invalid byte sequence, like here:
>>
>> [1\3] 1V085d-00067H-9X H=mail03.noris.net [62.128.1.223] Warning: ACL "warn" statement skipped: condition test deferred: PGSQL: query failed: ERROR: invalid byte sequence for encoding "UTF8": 0xfc
>>
>> 2013-07-19T10:39:10.005396+00:00 db1 rsyslogd: db error (22021): invalid byte sequence for encoding "UTF8": 0xfc
>> 2013-07-19T10:39:10.005415+00:00 db1 rsyslogd: db error (event): |2013-07-19t10:39:09.991124+00:00|6|2|mx4|exim| [2\3] (PGRES_FATAL_ERROR) (SELECT * FROM record_Reception( '1525916', '1V085d-00067H-9X', 'Staatstheater Nürnberg <info [at] staatstheater>', 'Newsletter Staatstheater Nürnberg', 'none', 'N/A'))
>>
>> Does this come from bad encoding of original mail headers?
>> Is there an easy solution to skip bad characters before sending them to the DB?
>>
>> In lokkups/pgsql.c:258 I see:
>> PQsetClientEncoding(pg_conn, "SQL_ASCII");
>>
>> but I think it's not related.
>
> Hi Axel!
> I suspect it is related. If you try to insert text into postgresql you
> should know which encoding is used in this text. If you know the
> inserted text is in utf-8 you should use set "clientencoding" to utf-8.
> But in emails you never know what encoding will be used. In theory it
> should be used only basic ASCII characters.
> You can:
> a) rejects mail with non ASCII chars in Subject.
> b) encode Subject using e.g. base64 then inserts to database
> c) guess which encoding was used in Subject, then set properly
> "clientencoding" parameter
> d) use "C" collation for given database/table in postgresql - it allows
> you to insert any characters into table. But you will lost possibility
> to get tuple in your preffered charset. (E.g. you can keep text in utf-8
> in database but when you set "clientencoding" to e.g. 8859-2 you will
> get text in 8859-2. With "C" collation pgsql doesn't do encoding to e.g
> iso8859-2)
As exim works with utf-8 strings, my naive assumption was, that a header like
Subject: Neue =?ISO-8859-1?q?Gl=E4ser?=
(RFC 2047) will be converted to utf-8 by exim before I access it via $h_Subject: .
Looking at the complexity of expand.c, this seems to be proved.
Can anybody confirm this?

If the header contains none-ASCII 8-bit-characters (=illegal), I would like exim to replace them by "?".
Can this be done in the exim config or do we need a new expansion function for that?

I must ensure valid utf-8 at the DB interface.

Axel
---
PGP-Key:29E99DD6 ☀ +49 151 2300 9283 ☀ computing @ chaos claudius


--
## List details at https://lists.exim.org/mailman/listinfo/exim-dev Exim details at http://www.exim.org/ ##


pdp at exim

Jul 20, 2013, 8:35 PM

Post #6 of 9 (102 views)
Permalink
Re: Bad utf-8 in pgsql lookup and mainlog [In reply to]

On 2013-07-20 at 19:05 +0200, Axel Rau wrote:
> As exim works with utf-8 strings, my naive assumption was, that a header like
> Subject: Neue =?ISO-8859-1?q?Gl=E4ser?=
> (RFC 2047) will be converted to utf-8 by exim before I access it via $h_Subject: .
> Looking at the complexity of expand.c, this seems to be proved.
> Can anybody confirm this?

Exim's behaviour depends upon what value was defined for HEADERS_CHARSET
in Local/Makefile when Exim was built. You also need HAVE_ICONV=yes but
that's supplied by default on some OSes.

The sample configuration supplied in src/EDITME sets
HEADERS_CHARSET="ISO-8859-1".

For myself, I always set HEADERS_CHARSET="UTF-8".

> If the header contains none-ASCII 8-bit-characters (=illegal), I would like exim to replace them by "?".
> Can this be done in the exim config or do we need a new expansion function for that?

I *suspect* that a new expansion function would be needed, but I could
be proven wrong by a particularly clever hack. I also suspect that, if
we were to implement this, we'd default the replacement character to be
codepoint 0xFFFD, the Unicode REPLACEMENT CHARACTER.

--
## List details at https://lists.exim.org/mailman/listinfo/exim-dev Exim details at http://www.exim.org/ ##


Axel.Rau at chaos1

Jul 23, 2013, 3:24 PM

Post #7 of 9 (103 views)
Permalink
Re: Bad utf-8 in pgsql lookup and mainlog [In reply to]

Thanks, Phil, for this very helpful response.

Am 21.07.2013 um 05:35 schrieb Phil Pennock <pdp [at] exim>:

> On 2013-07-20 at 19:05 +0200, Axel Rau wrote:
>> As exim works with utf-8 strings, my naive assumption was, that a header like
>> Subject: Neue =?ISO-8859-1?q?Gl=E4ser?=
>> (RFC 2047) will be converted to utf-8 by exim before I access it via $h_Subject: .
>> Looking at the complexity of expand.c, this seems to be proved.
>> Can anybody confirm this?
>
> Exim's behaviour depends upon what value was defined for HEADERS_CHARSET
> in Local/Makefile when Exim was built. You also need HAVE_ICONV=yes but
> that's supplied by default on some OSes.
>
> The sample configuration supplied in src/EDITME sets
> HEADERS_CHARSET="ISO-8859-1".
>
> For myself, I always set HEADERS_CHARSET="UTF-8".

Indeed FreeBSD ports system defaults to ISO-8859-1.
I reinstalled with UTF-8. Unfortunately exim -bV does not list the HEADERS_CHARSET.
From my simple tests, it seems to be work:
Subject: TEST =?ISO-8859-1?q?Gl=FCckliche_m=F6gliche_=C4chtung?=
was recorded correctly in UTF-8 in the DB.
>
>> If the header contains none-ASCII 8-bit-characters (=illegal), I would like exim to replace them by "?".
>> Can this be done in the exim config or do we need a new expansion function for that?
>
> I *suspect* that a new expansion function would be needed, but I could
> be proven wrong by a particularly clever hack. I also suspect that, if
> we were to implement this, we'd default the replacement character to be
> codepoint 0xFFFD, the Unicode REPLACEMENT CHARACTER.


Wouldn't this be reasonable enhancement of the existing conversion functionality anyway?

Axel
---
PGP-Key:29E99DD6 ☀ +49 151 2300 9283 ☀ computing @ chaos claudius


--
## List details at https://lists.exim.org/mailman/listinfo/exim-dev Exim details at http://www.exim.org/ ##


Axel.Rau at chaos1

Jul 30, 2013, 4:26 AM

Post #8 of 9 (79 views)
Permalink
Re: Bad utf-8 in pgsql lookup and mainlog [In reply to]

Am 24.07.2013 um 00:24 schrieb Axel Rau <Axel.Rau [at] chaos1>:

> Am 21.07.2013 um 05:35 schrieb Phil Pennock <pdp [at] exim>:
>
>> On 2013-07-20 at 19:05 +0200, Axel Rau wrote:
>>> As exim works with utf-8 strings, my naive assumption was, that a header like
>>> Subject: Neue =?ISO-8859-1?q?Gl=E4ser?=
>>> (RFC 2047) will be converted to utf-8 by exim before I access it via $h_Subject: .
>>> Looking at the complexity of expand.c, this seems to be proved.
>>> Can anybody confirm this?
>>
>> Exim's behaviour depends upon what value was defined for HEADERS_CHARSET
>> in Local/Makefile when Exim was built. You also need HAVE_ICONV=yes but
>> that's supplied by default on some OSes.
>>
>> The sample configuration supplied in src/EDITME sets
>> HEADERS_CHARSET="ISO-8859-1".
>>
>> For myself, I always set HEADERS_CHARSET="UTF-8".
>
> Indeed FreeBSD ports system defaults to ISO-8859-1.
> I reinstalled with UTF-8. Unfortunately exim -bV does not list the HEADERS_CHARSET.
> From my simple tests, it seems to be work:
> Subject: TEST =?ISO-8859-1?q?Gl=FCckliche_m=F6gliche_=C4chtung?=
> was recorded correctly in UTF-8 in the DB.
>>
>>> If the header contains none-ASCII 8-bit-characters (=illegal), I would like exim to replace them by "?".
>>> Can this be done in the exim config or do we need a new expansion function for that?
>>
>> I *suspect* that a new expansion function would be needed, but I could
>> be proven wrong by a particularly clever hack. I also suspect that, if
>> we were to implement this, we'd default the replacement character to be
>> codepoint 0xFFFD, the Unicode REPLACEMENT CHARACTER.
>
>
> Wouldn't this be reasonable enhancement of the existing conversion functionality anyway?

After 10 days running with HEADERS_CHARSET="UTF-8" in Local/Makefile and
PQsetClientEncoding(pg_conn, "UTF8"); in lokkups/pgsql.c, I still get tons of
'invalid byte sequence for encoding "UTF8"'(as expected by malformed mails).
I would like to prepare a patch for a bug report to replace the illegal sequence
by this Unicode REPLACEMENT CHARACTER.

Is there any place in the exim code base, where this information is available?
(I looked at rfc2047.c, expand.c…)

I need a solution in order to log header contents in a pgsql backend.

Thanks, Axel

---
PGP-Key:29E99DD6 ☀ +49 151 2300 9283 ☀ computing @ chaos claudius


--
## List details at https://lists.exim.org/mailman/listinfo/exim-dev Exim details at http://www.exim.org/ ##


pdp at exim

Jul 30, 2013, 9:46 AM

Post #9 of 9 (79 views)
Permalink
Re: Bad utf-8 in pgsql lookup and mainlog [In reply to]

On 2013-07-30 at 13:26 +0200, Axel Rau wrote:
> After 10 days running with HEADERS_CHARSET="UTF-8" in Local/Makefile and
> PQsetClientEncoding(pg_conn, "UTF8"); in lokkups/pgsql.c, I still get tons of
> 'invalid byte sequence for encoding "UTF8"'(as expected by malformed mails).
> I would like to prepare a patch for a bug report to replace the illegal sequence
> by this Unicode REPLACEMENT CHARACTER.
>
> Is there any place in the exim code base, where this information is available?
> (I looked at rfc2047.c, expand.c…)
>
> I need a solution in order to log header contents in a pgsql backend.

As far as Exim is concerned, it's a string of bytes, which happen to
have been prepared by normalising through iconv from another stream of
bytes.

I suspect that the cleanest approach that fits with Exim would be to add
a new expansion operator ${utf8clean:...} in expand.c, with the core
routine living either in that file or in string.c.

I'm not aware of any standard library code which can check the string,
although UTF-8 is so clean and well-defined that it should be around 10
lines of code with a switch inside a loop. Looking around I see that
the Perl Unicode::CheckUTF8 module describes itself as a wrapper around
some Unicode Consortium code for accomplishing this. I suspect that
digging that out, checking the license, and using the same base code in
Exim is the way to go.

-Phil

--
## List details at https://lists.exim.org/mailman/listinfo/exim-dev Exim details at http://www.exim.org/ ##

exim dev RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact Gossamer Threads
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.