
Mailing List Archive: Zope: Dev

Content Type Meta tag stripping in zope.pagetemplate

 

 



mianonjoka at gmail

Feb 22, 2012, 7:28 AM

Post #1 of 14 (832 views)
Content Type Meta tag stripping in zope.pagetemplate

Hello all,

I'm a fairly new Zope developer and came across a "bug" in my application:
<meta http-equiv="content-type" content="text/html;charset=UTF-8" />
tags are being stripped out of ZPT templates. Is there a reason
for this? The stripping is done in the _prepare_html method of
zope.pagetemplate.pagetemplatefile.PageTemplateFile. My application
produces XHTML containing non-ASCII characters that is then consumed by
other applications, so the content type needs to be set in the
document itself in addition to the HTTP headers.

Secondly, the meta tag is found and stripped using a regular
expression, so simply changing the order of the attributes on the
<meta> tag makes the regex fail to match.
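To illustrate the attribute-order problem, here is a minimal sketch. The
pattern below is hypothetical and written only for this example; it is not
the expression actually used in zope.pagetemplate, but any regex that
assumes http-equiv comes before content would behave the same way.

```python
import re

# Hypothetical pattern for illustration only; NOT the regex from
# zope.pagetemplate. It assumes http-equiv appears before content.
META_RE = re.compile(
    r'<meta\s+http-equiv=["\']?content-type["\']?\s+'
    r'content=["\']?([^;]+);\s*charset=([^"\'>\s]+)["\']?\s*/?>',
    re.IGNORECASE,
)

canonical = '<meta http-equiv="content-type" content="text/html;charset=UTF-8" />'
reordered = '<meta content="text/html;charset=UTF-8" http-equiv="content-type" />'

# The canonical attribute order is recognised ...
print(META_RE.search(canonical) is not None)
# ... but the same tag with the attributes swapped is not.
print(META_RE.search(reordered) is not None)
```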

Attached is a patch that uses HTMLParser to find the content type meta
tag instead of a regex. It stops parsing the html as soon as it
encounters the required meta tag.
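
The general idea can be sketched as follows. This is not the attached
patch: it is a minimal illustration written against Python 3's
html.parser module, and the names MetaContentTypeFinder, _Found and
find_content_type are made up for this example.

```python
from html.parser import HTMLParser


class _Found(Exception):
    """Raised to abort parsing once the meta tag has been seen."""


class MetaContentTypeFinder(HTMLParser):
    """Record the content attribute of the first content-type meta tag."""

    def __init__(self):
        super().__init__()
        self.content = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so attribute
        # order and case in the template no longer matter.
        if tag != "meta":
            return
        attrs = dict(attrs)
        if (attrs.get("http-equiv") or "").lower() == "content-type":
            self.content = attrs.get("content")
            raise _Found  # stop parsing; nothing else is needed


def find_content_type(html):
    """Return the meta content-type value, or None if there is none."""
    parser = MetaContentTypeFinder()
    try:
        parser.feed(html)
        parser.close()
    except _Found:
        pass
    return parser.content
```

Self-closing tags such as <meta ... /> are also covered, because the
default handle_startendtag calls handle_starttag.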

Miano
Attachments: meta_content_type_tag.patch (3.72 KB)


fred at fdrake

Feb 22, 2012, 9:08 AM

Post #2 of 14 (810 views)
Re: Content Type Meta tag stripping in zope.pagetemplate

On Wed, Feb 22, 2012 at 10:28 AM, Miano Njoka <mianonjoka [at] gmail> wrote:
> <meta http-equiv="content-type" content="text/html;charset=UTF-8"
> /> tags were being stripped out from ZPT templates. Is there a reason
> for this?

As I recall, the rationale goes like this:

1. We're sniffing the input encoding from the charset setting.

2. We're storing the content-type on the instance (I hope this
is still true).

3. The template/application/publisher is responsible for
delivering the output with an appropriate content-type
header.


--
Fred L. Drake, Jr.    <fred at fdrake.net>
"A storm broke loose in my mind."  --Albert Einstein
_______________________________________________
Zope-Dev maillist - Zope-Dev [at] zope
https://mail.zope.org/mailman/listinfo/zope-dev
** No cross posts or HTML encoding! **
(Related lists -
https://mail.zope.org/mailman/listinfo/zope-announce
https://mail.zope.org/mailman/listinfo/zope )


mianonjoka at gmail

Feb 22, 2012, 11:54 PM

Post #3 of 14 (800 views)
Re: Content Type Meta tag stripping in zope.pagetemplate

On Wed, Feb 22, 2012 at 8:08 PM, Fred Drake <fred [at] fdrake> wrote:
> On Wed, Feb 22, 2012 at 10:28 AM, Miano Njoka <mianonjoka [at] gmail> wrote:
>> <meta http-equiv="content-type" content="text/html;charset=UTF-8"
>> /> tags were being stripped out from ZPT templates. Is there a reason
>> for this?
>
> As I recall, the rationale goes like this:
>
> 1. We're sniffing the input encoding from the charset setting.
>
> 2. We're storing the content-type on the instance (I hope this
>   is still true).
>
> 3. The template/application/publisher is responsible for
>   delivering the output with an appropriate content-type
>   header.


Yes, this is true, but why strip out the meta tag from the resulting HTML?


fred at fdrake

Feb 23, 2012, 3:44 AM

Post #4 of 14 (808 views)
Re: Content Type Meta tag stripping in zope.pagetemplate

On Thu, Feb 23, 2012 at 2:54 AM, Miano Njoka <mianonjoka [at] gmail> wrote:
> Yes, this is true, but why strip out the meta tag from the resulting HTML?

Two reasons:

1. It may be incorrect.

2. If multiple templates are used to construct a response, different
values may be included from each template, which may be inconsistent.

Since the meta element is unnecessary, it seemed better to leave it out
of the result, and rely on other components to render the correct values
without requiring them to insert correct values into the rendered template.
(The publisher, for instance, shouldn't need to know how to edit that into
the finished HTML.)


-Fred



mianonjoka at gmail

Feb 24, 2012, 12:47 AM

Post #5 of 14 (799 views)
Content Type Meta tag stripping in zope.pagetemplate

On Thu, Feb 23, 2012 at 2:44 PM, Fred Drake <fred [at] fdrake> wrote:
> On Thu, Feb 23, 2012 at 2:54 AM, Miano Njoka <mianonjoka [at] gmail> wrote:
>> Yes, this is true, but why strip out the meta tag from the resulting HTML?
>
> Two reasons:
>
> 1. It may be incorrect.
>
> 2. If multiple templates are used to construct a response, different
>   values may be included from each template, which may be inconsistent.
>

The code as it is now does not take this into account. It parses the
meta content type tag from all the templates passed to it and the
content type header sent in the response will be that of the last
template processed.


> Since the meta element is unnecessary, it seemed better to leave it out
> of the result,

While it is not essential, it is necessary in some cases where the
finished document will be read from disk or is used by other
applications, e.g. Deliverance [http://packages.python.org/Deliverance/].
In fact, the W3C's HTML validator issues a warning that one should declare
the character encoding in the document itself if it is missing.

> and rely on other components to render the correct values
> without requiring them to insert correct values into the rendered template.

Rather than removing the meta tag, I think it's less complicated to
leave it in the finished HTML and let the developer fix any
inconsistencies that may arise.


charlie.clark at clark-consulting

Feb 24, 2012, 12:57 PM

Post #6 of 14 (802 views)
Re: Content Type Meta tag stripping in zope.pagetemplate

Am 24.02.2012, 09:47 Uhr, schrieb Miano Njoka <mianonjoka [at] gmail>:

> While it is not essential, it is necessary in some cases where the
> finished document will be read from disk or is used by other
> applications, e.g. Deliverance [http://packages.python.org/Deliverance/].
> In fact, the W3C's HTML validator issues a warning that one should declare
> the character encoding in the document itself if it is missing.

This is actually what the validator says:

"""
No character encoding information was found within the document, either in
an HTML meta element or an XML declaration. It is often recommended to
declare the character encoding in the document itself, especially if there
is a chance that the document will be read from or saved to disk, CD, etc.
"""

As ZPT produces XHTML, the proper place for any encoding declaration is
the XML declaration, defaulting to UTF-8, which should throw a validation
error if incorrect. Like much of the HTML standard, the meta tags were
never really thought through and, because they are invisible to the user,
are all too often copied mindlessly from one project to another: I have
customers today with completely invalid and misleading meta tags of which
they and the rest of the world are blissfully unaware. As a result,
browsers (the main consumers of the format) were made fault tolerant; after
all, the user often had no idea what was causing the problem or how to
rectify it. I have seen many examples of the server saying one thing and
the meta something else entirely. I think nearly all browsers believe what
the server says over what's in the meta tag.

According to MAMA, which was instrumental in developing HTML 5 based on
what has actually been written, the charset was set in the
HTTP headers over 99% of the time. Unfortunately, it doesn't contain any
stats on discrepancies between the HTTP header and the meta.

http://dev.opera.com/articles/view/mama

While there is apparently a possible security risk when not declaring the
charset I think the Pythonic principle of "there should be preferably one
obvious way to do something" should apply when within Zope trying to
decide the charset of a file and that should be well documented. I'd
suggest keeping the stripping but implementing a more rigorous approach
such as you suggest.

Charlie
--
Charlie Clark
Managing Director
Clark Consulting & Research
German Office
Kronenstr. 27a
Düsseldorf
D- 40217
Tel: +49-211-600-3657
Mobile: +49-178-782-6226


marius at gedmin

Feb 24, 2012, 3:18 PM

Post #7 of 14 (802 views)
Re: Content Type Meta tag stripping in zope.pagetemplate

On Fri, Feb 24, 2012 at 09:57:57PM +0100, Charlie Clark wrote:
> Am 24.02.2012, 09:47 Uhr, schrieb Miano Njoka <mianonjoka [at] gmail>:
>
> >While it is not essential, it is necessary in some cases where the
> >finished document will be read from disk or is used by other
> >applications, e.g. Deliverance [http://packages.python.org/Deliverance/].
> >In fact, the W3C's HTML validator issues a warning that one should declare
> >the character encoding in the document itself if it is missing.
>
> This is actually what the validator says:
>
> """
> No character encoding information was found within the document,
> either in an HTML meta element or an XML declaration. It is often
> recommended to declare the character encoding in the document
> itself, especially if there is a chance that the document will be
> read from or saved to disk, CD, etc.
> """
>
> As ZPT produces XHTML the proper place for any encoding declaration
> is in the XML declaration, defaulting to UTF-8, which should throw a
> validation error if incorrect.

A strong -1 for zope.pagetemplate adding <?xml ... ?> declarations
automatically.

> Like much of the HTML standard, the meta tags were never really
> thought through and, because they are invisible to the user, are all
> too often copied mindlessly from one project to another: I have
> customers today with completely invalid and misleading meta tags of
> which they and the rest of the world are blissfully unaware. As a
> result, browsers (the main consumers of the format) were made fault
> tolerant; after all, the user often had no idea what was causing the
> problem or how to rectify it. I have seen many examples of the server
> saying one thing and the meta something else entirely. I think nearly
> all browsers believe what the server says over what's in the meta tag.

The HTML spec requires that:

"To sum up, conforming user agents must observe the following
priorities when determining a document's character encoding (from
highest priority to lowest):

1. An HTTP "charset" parameter in a "Content-Type" field.
2. A META declaration with "http-equiv" set to "Content-Type" and a
value set for "charset".
3. The charset attribute set on an element that designates an
external resource."

-- http://www.w3.org/TR/html4/charset.html#h-5.2.2

(Aside: The rationale for this ordering, IIRC, is that it allows HTTP
servers to do on-the-fly charset conversion from one 8-bit charset to a
different one, without having to parse HTML and modify the charset name
in the <meta> declaration.)
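
The priority order quoted above can be sketched in a few lines. This is
only an illustration of the HTML 4 rule, not code from any browser or from
Zope; the function names are made up, and the stdlib email.message.Message
class is used here simply as a convenient Content-Type parameter parser.

```python
from email.message import Message


def charset_from_content_type(value):
    """Extract the lowercased charset parameter from a Content-Type value."""
    if not value:
        return None
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


def effective_charset(http_content_type=None, meta_content=None,
                      element_charset=None):
    """Apply the HTML 4 priority: HTTP header, then META, then element."""
    candidates = (
        charset_from_content_type(http_content_type),
        charset_from_content_type(meta_content),
        element_charset.lower() if element_charset else None,
    )
    for candidate in candidates:
        if candidate:
            return candidate
    return None
```

With this ordering a <meta> declaration only matters when the HTTP header
carries no charset parameter, for example when a page is loaded from disk.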

> According to MAMA, which was instrumental in developing HTML 5 based
> on what has actually been written, the charset was set in the
> HTTP headers over 99% of the time. Unfortunately, it doesn't contain
> any stats on discrepancies between the HTTP header and the meta.
>
> http://dev.opera.com/articles/view/mama
>
> While there is apparently a possible security risk when not
> declaring the charset I think the Pythonic principle of "there
> should be preferably one obvious way to do something" should apply
> when within Zope trying to decide the charset of a file and that
> should be well documented. I'd suggest keeping the stripping but
> implementing a more rigorous approach such as you suggest.

I'm not a big fan of the stripping.

Consider people using wget to mirror websites (or some equivalent way --
hitting Save As in a browser and selecting "Web Page (original)" instead
of "Web Page (complete)"). The Content-Type header is not going to be
saved on disk.

Why should zope.pagetemplate forbid programmers from duplicating the
charset information in the <meta> element, at least as long as that
information is correct (i.e. matches the content type)?
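
A check of that kind could look roughly like the sketch below. It is an
assumption about how "matches the content type" might be tested, not
anything zope.pagetemplate does today; meta_matches_header is a
hypothetical helper, and email.message.Message is used only to parse the
two values.

```python
from email.message import Message


def _parse(value):
    """Return (media type, lowercased charset) for a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_type(), msg.get_content_charset()


def meta_matches_header(header_value, meta_content):
    """True if the meta declaration agrees with the HTTP header."""
    return _parse(header_value) == _parse(meta_content)
```

A template engine could then keep the <meta> element when this returns
True and only strip or rewrite it when the two values disagree.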

Marius Gedminas
--
http://pov.lt/ -- Zope 3/BlueBream consulting and development
Attachments: signature.asc (0.19 KB)


charlie.clark at clark-consulting

Mar 27, 2012, 1:54 AM

Post #8 of 14 (689 views)
Re: Content Type Meta tag stripping in zope.pagetemplate

Am 25.02.2012, 00:18 Uhr, schrieb Marius Gedminas <marius [at] gedmin>:

> The HTML spec requires that:
> "To sum up, conforming user agents must observe the following
> priorities when determining a document's character encoding (from
> highest priority to lowest):
> 1. An HTTP "charset" parameter in a "Content-Type" field.
> 2. A META declaration with "http-equiv" set to "Content-Type" and a
> value set for "charset".
> 3. The charset attribute set on an element that designates an
> external resource."
> -- http://www.w3.org/TR/html4/charset.html#h-5.2.2

> (Aside: The rationale for this ordering, IIRC, is that it allows HTTP
> servers to do on-the-fly charset conversion from one 8-bit charset to a
> different one, without having to parse HTML and modify the charset name
> in the <meta> declaration.)

As a follow-up to this, it's worth noting that as of Opera 12 the
practice will be:

* BOM sniffing
* HTTP header
* meta declaration

In that order, and in line with WebKit and IE:

"""
It is better to encode your Web pages in UTF-8, and serve them as such. In
HTTP, the HTTP header has priority, then the meta name contained in HTML.
Some Web pages have specific encoding. It happens often on the Web that
the Web page encoding is different from the one specified in the file
and/or the one specified in HTTP headers. It creates issues for users who
receive unreadable characters on their screens. So the browsers have to
fix the encoding on the fly. We had bug reports about Web sites sending
BOM different from the HTTP header. We decided to make the BOM
authoritative like webkit and IE, because there are more chances for it to
be exact than the HTTP headers.
"""

http://my.opera.com/ODIN/blog/2012/03/26/whats-new-in-opera-development-snapshots-march-26-2012
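
As a rough sketch of that order (BOM first, then the HTTP header, then the
meta declaration), assuming the header and meta charsets have already been
extracted by the caller. The function names are made up for this example
and UTF-32 BOMs are deliberately left out to keep it short.

```python
import codecs

# Byte order marks checked by this sketch. UTF-32 is omitted on purpose:
# its little-endian BOM starts with the same bytes as the UTF-16 LE BOM.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def sniff_bom(data):
    """Return the encoding implied by a leading BOM, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def pick_charset(data, header_charset=None, meta_charset=None):
    """BOM wins, then the HTTP header, then the meta declaration."""
    return sniff_bom(data) or header_charset or meta_charset
```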

Charlie


fred at fdrake

Mar 27, 2012, 3:16 AM

Post #9 of 14 (691 views)
Re: Content Type Meta tag stripping in zope.pagetemplate

On Tue, Mar 27, 2012 at 4:54 AM, Charlie Clark
<charlie.clark [at] clark-consulting> wrote:
> """
> We had bug reports about Web sites sending BOM different from the HTTP
> header.
> """

In other words... "the web" will continue to thrive on hacks and sniffing
data to support users' expectations in spite of the data on "the web".

I appreciate the motivation (it's not the users' fault the content provider
can't get it right), but it saddens me that there will be no end of
quirks-mode-like data interpretation. And that after this many years, we
still can't get content-type and encodings straightened out.


-Fred



mianonjoka at gmail

Mar 27, 2012, 3:35 AM

Post #10 of 14 (694 views)
Re: Content Type Meta tag stripping in zope.pagetemplate

> """
> It is better to encode your Web pages in UTF-8, and serve them as such. In
> HTTP, the HTTP header has priority, then the meta name contained in HTML.
> Some Web pages have specific encoding. It happens often on the Web that the
> Web page encoding is different from the one specified in the file and/or
> the one specified in HTTP headers. It creates issues for users who receive
> unreadable characters on their screens. So the browsers have to fix the
> encoding on the fly. We had bug reports about Web sites sending BOM
> different from the HTTP header. We decided to make the BOM authoritative
> like webkit and IE, because there are more chances for it to be exact than
> the HTTP headers.
> """
>
> http://my.opera.com/ODIN/blog/2012/03/26/whats-new-in-opera-development-snapshots-march-26-2012
>
>
In our case if a meta content type tag exists in a template, the HTTP
header charset parameter will always be set to the same value. Always.
There is no chance of a conflict. zope.pagetemplate should therefore not
opaquely strip out the meta tag.


charlie.clark at clark-consulting

Mar 27, 2012, 3:36 AM

Post #11 of 14 (690 views)
Re: Content Type Meta tag stripping in zope.pagetemplate

Am 27.03.2012, 12:16 Uhr, schrieb Fred Drake <fred [at] fdrake>:

> In other words... "the web" will continue to thrive on hacks and sniffing
> data to support users' expectations in spite of the data on "the web".
> I appreciate the motivation (it's not the users' fault the content
> provider can't get it right), but it saddens me that there will be no end
> of quirks-mode-like data interpretation. And that after this many years,
> we still can't get content-type and encodings straightened out.

True, but I think the problem was largely of our own making in not
coming up with "one, preferably only one" way of handling this. Re-reading
Marius' post I was struck by the whole idea of the HTTP server transcoding
the content on the fly. Now, I've never looked at this in detail, but have
any of the major webservers ever done that? I have struggled in the past
with "weird" encoding errors limited to Safari and IE only, probably
caused by me not handling the encode/decode chain properly in my code, but
I was still left staring unbelievingly at a long and confusing traceback
and yearning for an easy way "to do the right thing", which in my view
should be the webserver trying to serve up UTF-8.

I guess, that years ago we had to worry much more about encodings
(latin-1, windows-1252, mac-roman, IBM code pages, and whatever unix was
doing).

I've been reading about HTTP 2.0 coming up. Is this going to improve
matters?

Charlie


fred at fdrake

Mar 27, 2012, 7:04 AM

Post #12 of 14 (689 views)
Re: Content Type Meta tag stripping in zope.pagetemplate

On Tue, Mar 27, 2012 at 6:36 AM, Charlie Clark
<charlie.clark [at] clark-consulting> wrote:
> True but I think that the problem was largely of our own making in not
> coming up with "one, preferably only one" way of handling this. Re-reading
> Marius' post I was struck by the whole idea of the http-server transcoding
> the content on the fly.

Transcoding on the fly?

The page template generates Unicode; that's then encoded.

Are you suggesting we shouldn't be using Unicode as the internal representation?
Failure to do so would make it easy to get things wrong.
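
The render-then-encode step can be sketched like this. It is a generic
illustration, not the Zope publisher's code; encode_response and its
parameters are hypothetical names, and the point is only that the body
bytes and the charset in the header come from the same value.

```python
def encode_response(rendered, charset="utf-8", content_type="text/html"):
    """Encode rendered Unicode text and build matching response headers."""
    # The template output is text (Unicode); bytes exist only from here on.
    body = rendered.encode(charset)
    headers = {
        # The same charset value is used for encoding and for the header,
        # so the two cannot disagree.
        "Content-Type": f"{content_type}; charset={charset}",
        "Content-Length": str(len(body)),
    }
    return headers, body
```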


-Fred



charlie.clark at clark-consulting

Mar 27, 2012, 7:42 AM

Post #13 of 14 (689 views)
Re: Content Type Meta tag stripping in zope.pagetemplate

Am 27.03.2012, 16:04 Uhr, schrieb Fred Drake <fred [at] fdrake>:

> Transcoding on the fly?
> The page template generates Unicode; that's then encoded.
> Are you suggesting we shouldn't be using Unicode as the internal
> representation?

Not at all, just harking back to the time when we didn't use unicode
internally. In the CMF we still have to deal with that on occasion.

> Failure to do so would make it easy to get things wrong.

Indeed.

Charlie


marius at gedmin

Mar 27, 2012, 3:09 PM

Post #14 of 14 (683 views)
Re: Content Type Meta tag stripping in zope.pagetemplate

On Tue, Mar 27, 2012 at 12:36:19PM +0200, Charlie Clark wrote:
> Am 27.03.2012, 12:16 Uhr, schrieb Fred Drake <fred [at] fdrake>:
>
> >In other words... "the web" will continue to thrive on hacks and
> >sniffing data to support users' expectations in spite of the data on
> >"the web". I appreciate the motivation (it's not the users' fault
> >the content provider can't get it right), but it saddens me that
> >there will be no end of quirks-mode-like data interpretation. And that
> >after this many years, we still can't get content-type and encodings
> >straightened out.
>
> True but I think that the problem was largely of our own making in
> not coming up with "one, preferably only one" way of handling this.
> Re-reading Marius' post I was struck by the whole idea of the
> http-server transcoding the content on the fly. Now, I've never
> looked at this in detail but have any of the major webservers ever
> done that?

No idea.

I wish I remembered where I read about that. There used to be a dozen
charsets for Russian (koi8-r, windows-1251, cp866, iso-8859-5,
x-mac-cyrillic; basically at least one for every OS ;) and some websites
even went as far as letting the visitor choose the charset they wanted
to see.

*google google*

"Having words like "please, choose an appropriate encoding" on your
pages is really a BAD idea, drives people crazy. [...]

Here is my advice. Get the latest version of Apache and a FLY plug-in
module written by Igor Sereda (sereda [at] spb). The module
allows on-the-fly recoding from one character set to another on the
basis of either HTTP_ACCEPT_CHARSET or, if it is not set, it scans
"User-Agent" field from which it tries to figure out what platform
and OS you are on. I am archiving it on sunsite as well."
-- http://www.ibiblio.org/sergei/Software/http.html

That document even mentions *proxy servers doing charset conversions on
the fly*. (O.o)

This is all, of course, completely irrelevant to the modern web.

I mention this merely as a historical curiosity, because I find the "why
are the rules that way?" type of questions fascinating.

Marius Gedminas
--
http://pov.lt/ -- Zope 3/BlueBream consulting and development
Attachments: signature.asc (0.19 KB)
