
marius at gedmin
Feb 24, 2012, 3:18 PM
Post #7 of 14
Re: Content Type Meta tag stripping in zope.pagetemplate
On Fri, Feb 24, 2012 at 09:57:57PM +0100, Charlie Clark wrote:
> On 24.02.2012 at 09:47, Miano Njoka <mianonjoka [at] gmail> wrote:
>
> >While it is not essential, it is necessary in some cases where the
> >finished document will be read from disk or is used by other
> >applications e.g. Deliverance[http://packages.python.org/Deliverance/].
> >In fact w3c's HTML validator throws a warning that one should declare
> >the character encoding in the document itself if it is missing.
>
> This is actually what the validator says:
>
> """
> No character encoding information was found within the document,
> either in an HTML meta element or an XML declaration. It is often
> recommended to declare the character encoding in the document
> itself, especially if there is a chance that the document will be
> read from or saved to disk, CD, etc.
> """
>
> As ZPT produces XHTML the proper place for any encoding declaration
> is in the XML declaration, defaulting to UTF-8, which should throw a
> validation error if incorrect.

A strong -1 for zope.pagetemplate adding <?xml ... ?> declarations
automatically.

> Like much of the HTML standard the
> meta tags were never really thought through and, because invisible
> to the user, all too often copied mindlessly from one project to
> another: I have customers today with completely invalid and
> misleading meta tags of which they and the rest of the world are
> blissfully unaware. And as a result browsers - the main consumers of
> the format - were made fault tolerant; after all, the user often had
> no idea what was causing the problem or how to rectify it. I have
> seen many examples of the server saying one thing and the meta
> something else entirely. I think nearly all browsers believe what
> the server says over what's in the meta tag.

The HTML spec requires that:

  "To sum up, conforming user agents must observe the following
  priorities when determining a document's character encoding (from
  highest priority to lowest):

   1. An HTTP "charset" parameter in a "Content-Type" field.
   2. A META declaration with "http-equiv" set to "Content-Type" and a
      value set for "charset".
   3. The charset attribute set on an element that designates an
      external resource."

  -- http://www.w3.org/TR/html4/charset.html#h-5.2.2

(Aside: The rationale for this ordering, IIRC, is that it allows HTTP
servers to do on-the-fly charset conversion from one 8-bit charset to a
different one, without having to parse HTML and modify the charset name
in the <meta> declaration.)

> According to MAMA, which was instrumental in developing HTML 5 based
> on what has actually been written, the charset was set in the
> http-headers over 99% of the time. Unfortunately, it doesn't contain
> any stats on discrepancies between the http-header and the meta.
>
> http://dev.opera.com/articles/view/mama
>
> While there is apparently a possible security risk when not
> declaring the charset I think the Pythonic principle of "there
> should be preferably one obvious way to do something" should apply
> when within Zope trying to decide the charset of a file and that
> should be well documented. I'd suggest keeping the stripping but
> implementing a more rigorous approach such as you suggest.

I'm not a big fan of the stripping.

Consider people using wget to mirror websites (or some equivalent way --
hitting Save As in a browser and selecting "Web Page (original)" instead
of "Web Page (complete)"). The Content-Type header is not going to be
saved on disk.

Why should zope.pagetemplate forbid programmers from duplicating the
charset information in the <meta> element, at least as long as that
information is correct (i.e. matches the content type)?

Marius Gedminas
-- 
http://pov.lt/ -- Zope 3/BlueBream consulting and development
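
To make the priority order quoted from the HTML 4.01 spec concrete, here is a rough Python sketch using only the standard library. It is an illustration, not anything zope.pagetemplate does: the function names (charset_from_content_type, meta_charset, effective_charset, meta_matches_header) are made up for this example, and it covers only the first two priorities (HTTP header, then <meta>), skipping the charset attribute on linking elements.

```python
from email.message import Message
from html.parser import HTMLParser


def charset_from_content_type(value):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    if not value:
        return None
    msg = Message()
    msg["Content-Type"] = value
    charset = msg.get_param("charset")
    # get_param may return a tuple for RFC 2231 encoded values; ignore those here.
    return charset.lower() if isinstance(charset, str) else None


class _MetaCharsetFinder(HTMLParser):
    """Remember the charset of the first <meta http-equiv="Content-Type">."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)
        if (attrs.get("http-equiv") or "").lower() == "content-type":
            self.charset = charset_from_content_type(attrs.get("content"))


def meta_charset(html_text):
    """Return the charset declared in the document's <meta> element, or None."""
    finder = _MetaCharsetFinder()
    finder.feed(html_text)
    return finder.charset


def effective_charset(http_content_type, html_text, default=None):
    """Priority 1: HTTP Content-Type header. Priority 2: <meta> declaration."""
    return (charset_from_content_type(http_content_type)
            or meta_charset(html_text)
            or default)


def meta_matches_header(http_content_type, html_text):
    """True unless both sources declare a charset and the two disagree."""
    header = charset_from_content_type(http_content_type)
    meta = meta_charset(html_text)
    return header is None or meta is None or header == meta


page = ('<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=ISO-8859-1"></head></html>')

# The HTTP header wins when present.
print(effective_charset("text/html; charset=UTF-8", page))
# A page saved to disk has no header, so the <meta> is all that is left.
print(effective_charset(None, page))
# A duplicated declaration is only harmless when it agrees with the header.
print(meta_matches_header("text/html; charset=UTF-8", page))
```

The second call is the wget / "Save As" case: once the Content-Type header is gone, a correct <meta> declaration is the only charset information that survives. A check along the lines of meta_matches_header is one conceivable way to allow the <meta> through only when it matches the content type.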