
sidney at sidney
Nov 3, 2009, 5:42 PM
Re: Non-Roman characters in TLDs and domain names
[This is a repost excerpted from two messages I sent to the list. I just discovered that my email settings were left incorrect after I recovered from a hard disk crash. I apologize for the redundancy if the other two messages are just stuck instead of lost and you end up seeing them.]

I'm bringing this up on the dev list to get some discussion of the technical issues involved before opening a Bugzilla issue for it.

News of an ICANN decision to allow international character sets in domain names was reported last week, for example in this article:

http://www.voanews.com/english/2009-10-30-voa14.cfm

The article doesn't have much technical detail, but it does say that there will be new TLDs "by the end of the year", which is less than two months away. I'm concerned that this might have a big impact on SpamAssassin's parsing of headers and URLs.

Further digging found this:

http://idn.icann.org/E-mail_test

which seems to imply that email addresses will use the A-label encoding of IDN. That encoding converts the non-ASCII characters of each label into an ASCII string made up of the letters a through z, digits, and the hyphen character, with a prefix of "xn--". As far as I can tell from the examples, there will be new TLDs that will have to be A-label encoded.

I think this means that there will not need to be a major change to SpamAssassin regarding parsing of headers in which A-label encoding is required. Where we now have routines that check for valid TLDs by looking for .com, .org, .us, .kr, etc., we will simply have to add some new TLDs to the list. They will still be specific fixed ASCII strings; there will just be new TLDs that look like ".xn--deba0ad".

However, what does this mean for detecting URLs in plain text messages, in which a URL string can be in a non-ASCII charset and MUAs might (eventually) parse it as a URL?

-- sidney
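
As a rough illustration of the fixed-list check described above, here is a minimal Python sketch. It is not SpamAssassin code: the names VALID_TLDS, to_a_label, and has_known_tld are hypothetical, the TLD set is only the handful of examples mentioned in this message, and it assumes Python's built-in "idna" codec (which implements the IDNA 2003 rules) is an acceptable stand-in for whatever conversion the final specification requires.

```python
# Hypothetical sketch: normalize a domain to A-label form, then check its TLD
# against a fixed ASCII list. Names and list contents are illustrative only.

VALID_TLDS = {"com", "org", "us", "kr", "xn--deba0ad"}


def to_a_label(domain: str) -> str:
    """Return the ASCII (A-label) form of a domain name.

    Labels that are already ASCII pass through unchanged; labels with
    non-ASCII characters become "xn--..." strings. Raises UnicodeError
    for names the codec rejects (for example, an empty inner label).
    """
    return domain.encode("idna").decode("ascii").lower()


def has_known_tld(domain: str) -> bool:
    """Check the last label of the A-label form against the fixed list."""
    ascii_domain = to_a_label(domain).rstrip(".")
    tld = ascii_domain.rsplit(".", 1)[-1]
    return tld in VALID_TLDS


if __name__ == "__main__":
    print(to_a_label("bücher.com"))
    print(has_known_tld("example.xn--deba0ad"))
    print(has_known_tld("example.invalid"))
```

Once a name is in A-label form, the TLD test stays a plain string comparison, which is why headers look like the easy case. The sketch does nothing to locate a non-ASCII domain inside free text, so the plain-text URL question remains open.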