2008-12-03 (updated later)
Comments to: markdavis@google.com
Internationalized Domain Names (IDN) have been available for about 5 years now, defined by a set of IETF RFCs called IDNA 2003. These permit non-ASCII domain names, like "http://ÖBB.at" for the Austrian railway system (Österreichische Bundesbahnen - ÖBB). There is a new draft set of RFCs called IDNA 2008 {tables, defs, rationale, protocol, bidi} nearing completion. This document describes some security issues with the changes that would be introduced by IDNA 2008.
This document is the result of a request by members of the Unicode Technical Committee to produce a summary of the potential security problems in the current draft IDNA 2008, for circulation to security teams within their organizations. The goal here is to describe the security issues as if the current draft IDNA 2008 were approved as is. That draft is a moving target, and the text here may be revised as the draft IDNA 2008 changes.
Comments and questions are welcome. However, if readers have any concerns about the draft IDNA 2008 itself, the appropriate forum to voice them is at idna-update (joining the email list and thereby the working group), not in response to this document.
Main Differences
| Category | Description | Comments |
|---|---|---|
| Strict | No mapping | Thus rejecting http://ÖBB.at (but permitting http://öbb.at) |
| Hybrid | Map as in IDNA 2003 & disallow symbols | Using the Unicode comparison format. Thus it will allow http://ÖBB.at, mapping it to http://öbb.at. |
| Compatible | Map as in IDNA 2003 & allow symbols | Same as Hybrid, except that it also allows IDNs like http://√.com. (See above under Subtractions.) |
| Custom | Non-standard mapping | Arbitrary other mappings are allowed in the current draft of IDNA 2008. Thus a custom implementation could allow http://ÖBB.at, mapping it to http://øbb.at, or to http://oebb.at, or to http://obb.at, or to anything else, even http://phishing.com. One IDNA 2008-Custom implementation could map http://TÜRKIYE.com to http://türkiye.com while another could map it to http://türkıye.com (note the dotless i) -- and go to a different location. |
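The Strict, Hybrid, and Compatible rows can be sketched in Python. This is a rough illustration under stated assumptions, not the draft's algorithm: `map_like_2003` approximates the IDNA 2003 mapping with the standard `stringprep` tables B.1 and B.2 plus NFKC, it leaves the four Special Cases (covered in the next section) unmapped, and "symbol" is assumed to mean any character whose Unicode general category starts with S. The names `map_like_2003`, `has_symbol`, and `process_label` are hypothetical. Custom is omitted because its mapping is arbitrary by definition.

```python
import stringprep
import unicodedata
from typing import Optional

# The four Special Cases: eszett, final sigma, ZWNJ, ZWJ.
SPECIAL_CASES = {"\u00df", "\u03c2", "\u200c", "\u200d"}

def map_like_2003(label: str) -> str:
    """Approximate the IDNA 2003 mapping, leaving the Special Cases untouched."""
    out = []
    for ch in label:
        if ch in SPECIAL_CASES:
            out.append(ch)                            # preserved, not mapped
        elif not stringprep.in_table_b1(ch):          # B.1: mapped to nothing
            out.append(stringprep.map_table_b2(ch))   # B.2: case folding
    return unicodedata.normalize("NFKC", "".join(out))

def has_symbol(label: str) -> bool:
    """Assumption: a 'symbol' is any character in a general category S*."""
    return any(unicodedata.category(ch).startswith("S") for ch in label)

def process_label(label: str, category: str) -> Optional[str]:
    """Return the label that would be looked up, or None if it is rejected."""
    mapped = map_like_2003(label)
    if category == "strict":      # no mapping: input must already be mapped
        return label if label == mapped and not has_symbol(label) else None
    if category == "hybrid":      # map, then disallow symbols
        return None if has_symbol(mapped) else mapped
    if category == "compatible":  # map, and allow symbols
        return mapped
    raise ValueError(f"unknown category: {category}")

# process_label("ÖBB", "strict")     -> None (would require mapping)
# process_label("ÖBB", "hybrid")     -> "öbb"
# process_label("√", "hybrid")       -> None (symbol)
# process_label("√", "compatible")   -> "√"
```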
Special Cases (Deviations)
There are a few situations (luckily only a few) where IDNA 2008-Strict will result in the resolution of IDNs to different IP addresses than in IDNA 2003. This affects only a small number of characters, but they are relatively common in particular languages, so a significant number of strings in those languages are affected. (For more information on why IDNA 2003 does this, see the FAQ.) These four "Special Cases" are listed in the table below:
| Code | Character | IDNA 2008 | IDNA 2003 | Example: IDNA 2008 | Example: IDNA 2003 |
|---|---|---|---|---|---|
| U+00DF | ß | ß | ss | | |
| U+03C2 | ς | ς | σ | | |
| U+200D | ZWJ | ZWJ | deleted | [TBD] | |
| U+200C | ZWNJ | ZWNJ | deleted | [TBD] | |
These differences allow for security exploits. Consider the following URL, where "IDNBANK.xx" represents an IDN for a bank.
- Alice's browser supports IDNA 2003. Under those rules, "IDNBANK.xx" is mapped to "xn--blahblah", which is registered by the "xx" registry and resolves to the IP address 127.0.63.245.
- Bob's browser supports IDNA 2008. Under those rules, "IDNBANK.xx" is also valid, but converts to a different Punycode form, "xn--gorpblah", which it turns out is also registered by the "xx" registry and resolves to a different IP address: 136.17.22.221.
The site at
http://136.17.22.221/index.html turns out to be a deliberate spoof page
(put up by a scammer) of the legitimate page
http://127.0.63.245/index.html, a banking site. Alice gets to the
correct page she is seeking. Bob gets to the phishing site instead,
supplies his bank password, and is robbed.
Note that this exploit can be carried out no matter which of the IDNA 2008 implementation categories Bob's browser uses.
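To make the divergence concrete, the following Python sketch computes both lookup forms for a label containing ß. It assumes Python's built-in `idna` codec, which implements the IDNA 2003 rules, and it stands in for an IDNA 2008 lookup by Punycode-encoding the label unchanged. Real IDNA 2008 processing also validates the label, which is skipped here. The helper names `ace_idna2003` and `ace_unmapped` are hypothetical.

```python
def ace_idna2003(label: str) -> str:
    """ASCII form under IDNA 2003: Nameprep mapping, then Punycode if needed."""
    return label.encode("idna").decode("ascii")

def ace_unmapped(label: str) -> str:
    """ASCII form when the label is Punycode-encoded without any mapping."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")

label = "fa\u00df"             # "faß"
alice = ace_idna2003(label)    # ß is mapped to "ss" before encoding
bob = ace_unmapped(label)      # ß survives into the encoded label
print(alice, bob, alice != bob)  # prints: fass xn--fa-hia True
```

Two different ASCII names means two independent registrations, which is exactly the opening the scenario above relies on.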
The ZWJ and ZWNJ characters are of particular concern, because they are normally invisible. That is, the sequence "a<ZWJ>b" looks just like "ab". IDNA 2008 does provide a special category for characters like this (called CONTEXT), and only permits them in certain contexts (certain sequences of Arabic or Indic characters, for example). However, lookup applications are not required to check for these contexts, so overall security is dependent on registries' correct implementations.
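Because context checking is optional at lookup time, an application may want its own guard. The Python sketch below is deliberately partial: it accepts ZWJ or ZWNJ only when the preceding character is a virama (canonical combining class 9), modeled on the virama condition in the IDNA 2008 contextual rules, and it omits the joining-type rule needed for Arabic-like scripts, so it would reject some legitimate labels. The function name `joiners_in_valid_context` is hypothetical.

```python
import unicodedata

ZWNJ, ZWJ = "\u200c", "\u200d"
VIRAMA_CCC = 9  # canonical combining class shared by virama characters

def joiners_in_valid_context(label: str) -> bool:
    """True if every ZWJ/ZWNJ in the label directly follows a virama."""
    for i, ch in enumerate(label):
        if ch in (ZWNJ, ZWJ):
            if i == 0 or unicodedata.combining(label[i - 1]) != VIRAMA_CCC:
                return False
    return True

spoof = "a" + ZWJ + "b"
print(spoof == "ab")                     # False, although both render as "ab"
print(joiners_in_valid_context(spoof))   # False: no virama before the ZWJ
# Devanagari KA + VIRAMA + ZWJ + SSA: the joiner follows a virama.
print(joiners_in_valid_context("\u0915\u094d" + ZWJ + "\u0937"))  # True
```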
The existence of these special cases means that the Unicode comparison format used in Hybrid and Compatible implementations needs to be modified to exclude these characters.
Tactics
For compatibility in the foreseeable future, special steps would need to be taken with Special Cases. When doing DNS lookup, a Compatible/Hybrid application would need to first try a lookup with the modified Unicode comparison format, preserving Special Cases, then try a lookup with the Special Cases mapped according to IDNA 2003. If both of the lookups work, but resolve to different IP addresses, then the lookup should fail. If exactly one succeeds, then it would be used. That allows all sites based on either IDNA 2008 or IDNA 2003 to work, and prevents the above problem. Luckily the number of IDNs with the Special Cases will be a small fraction of the total, so this should not impact performance.
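The two-step lookup described above might look like the following Python sketch. It assumes the name is already in the modified comparison format (Special Cases preserved), and that `lookup` is a caller-supplied function returning a single address string, or `None` when the name does not resolve. Real resolvers return sets of records, so the comparison would need adapting. `resolve_with_fallback` and `AmbiguousNameError` are hypothetical names.

```python
from typing import Callable, Optional

# IDNA 2003 treatment of the four Special Cases.
IDNA2003_SPECIAL = str.maketrans({
    "\u00df": "ss",       # eszett -> ss
    "\u03c2": "\u03c3",   # final sigma -> sigma
    "\u200c": None,       # ZWNJ deleted
    "\u200d": None,       # ZWJ deleted
})

class AmbiguousNameError(Exception):
    """Both forms resolve, but to different addresses."""

def resolve_with_fallback(
    name: str, lookup: Callable[[str], Optional[str]]
) -> Optional[str]:
    mapped = name.translate(IDNA2003_SPECIAL)
    first = lookup(name)             # Special Cases preserved
    if mapped == name:               # no Special Cases: one lookup suffices
        return first
    second = lookup(mapped)          # Special Cases mapped as in IDNA 2003
    if first is not None and second is not None:
        if first != second:
            raise AmbiguousNameError(f"{name!r} and {mapped!r} disagree")
        return first
    return first if first is not None else second

# Usage with a stand-in resolver backed by a dict:
fake_dns = {"fass.de": "192.0.2.1"}
print(resolve_with_fallback("fa\u00df.de", fake_dns.get))  # 192.0.2.1
```

Names without any Special Case take the single-lookup path, which is why the extra cost is confined to a small fraction of IDNs.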
While some steps could be taken by registries to mitigate the above problems, we must remember that we are not only talking about top level domains, or second level domains, but also lower level domains that are under the control of thousands of different organizations. For example, the domain names under "blogspot.com", such as http://café.blogspot.com, are controlled by the company that has registered "blogspot". Ideally no registries would allow two IDNs that correspond according to the Special Cases table to resolve to different IP addresses. So blogspot would need to disallow registering both http://gefäss.blogspot.com and http://gefäß.blogspot.com, to prevent problems (and likewise for other cases like the normally-invisible ZWJ and ZWNJ). However, applications cannot depend on all such registries behaving correctly, because the odds are high that at least some (and perhaps many) of the many thousands of registries will not check for this. Thus the burden is primarily on applications handling IDNs to prevent the situation.
The worst of all possible cases is an IDNA 2008-Custom implementation. Unfortunately, there appears to be no good way to prevent security problems with IDNA 2008 Custom implementations, because it is impossible to anticipate what such implementations would do. Such an implementation is not limited to just the above four special cases for exploits -- it could remap even characters like "A" or "B" to an arbitrary other character (or sequence). Because there is no way to predict what it will do, there are no effective countermeasures.
Note: Some of us have particular concerns about allowing arbitrary mappings -- and think that if there is a mapping, it must be consistent with IDNA 2003 -- excepting the Special Cases. However, that does not appear to be the current consensus in the idna-update working group.
Clients such as search engines have another practical issue facing
them. They will probably opt for IDNA 2008-Compatible, allowing all
valid IDNA 2003 characters so that they can access all of the web.
Normally they also need to canonicalize URLs, so that they can
determine when two URLs are actually the same. For IDNA 2003 this was
straightforward. For IDNA 2008-Hybrid/Compatible, the canonicalization
can result in two different possibilities (with or without Special
Cases). It may then require two DNS lookups to determine which of the
two possibilities is to be used.
Whatever approach is taken, IDNA 2008 does not make any appreciable
difference in reducing problems with visually-confusable characters
(so-called homographs). Thus programmers still need to be aware of
those issues as detailed in Unicode Security Considerations,
including the list of potentially visually-confusable characters that
can be used in programmatic tests found in that Unicode Technical
Report.
Conclusions
As implementations update to IDNA2008, we will for some
considerable length of time have a situation where there are both IDNA
2003 and IDNA 2008 implementations in use, with the possible categories
of IDNA 2008 given above: Strict, Hybrid, Compatible, or Custom.
To reduce security concerns, we strongly hope that no implementations choose a Custom variant, to avoid indeterminacies which can cause security problems. (Even better would be if this option were removed from the IDNA 2008 specs!) To maintain compatibility, we anticipate that few implementations will opt for the Strict variant.
That is, most would implement either IDNA 2008-Hybrid or IDNA 2008-Compatible in the near term. Once sufficiently many high-level registries disallow symbols, the IDNA 2008-Compatible implementations could probably move towards IDNA 2008-Hybrid. It is unclear when, if ever, it would be reasonable for those implementations to move to being Strict.
FAQ
Q. What are examples of where the different categories of IDNA implementation behave differently?
Q. What are the main advantages of IDNA2008?
Q. What is "bidi label hopping"?
Q. What are the main disadvantages of IDNA2008?
Q. Are the "local" mappings just a UI issue?
Q. Do the Custom exploits require unscrupulous registries?
Q. What is the motivation for allowing arbitrary (Custom) mappings?
Q. What are examples of where the different categories of IDNA implementation behave differently?
A. Here is a table that illustrates the differences, where 2003 is the current behavior.
| URL | 2003 | 2008-Compatible | 2008-Hybrid | 2008-Strict | 2008-Custom | Comments |
|---|---|---|---|---|---|---|
| http://öbb.at | Yes | Yes | Yes | Yes | Yes | Simple characters |
| http://ÖBB.at | Yes | Yes | Yes | No | ? | Case mapping |
| http://√.com | Yes | Yes | No | No | ? | Symbol |
| http://faß.de | Yes | Yes* | Yes* | Yes* | Yes* | Special (different IP address) |
| http://ԛәлп.com | No | Yes | Yes | Yes | Yes | New Unicode (version 5.1) U+051B (ԛ) cyrillic qa |
Q. How much of a problem is this actually if support for symbols like √.com were just dropped immediately?
A. Browsers and other user agents can't and won't change from 2003 overnight. And who knows how long it would take for registries to change, notify people that their registrations are no longer valid, and handle whatever legal issues there are. The number of symbol registrations is rather small, and probably not of great concern. While the exclusion of symbols doesn't buy much, it doesn't hurt much either. The larger concern is cases like http://Brüder.com that work now (under IDNA 2003, being equivalent to http://brüder.com), but fail under a Strict implementation.
Q. What are the main advantages of IDNA2008?
A. The main advantages are:
- It is no longer tied to an old version of Unicode (version 3.2)
- It updates automatically to each new version of Unicode
- It allows some characters that were forbidden, but which large communities feel they need (the Special Cases)
- It improves the bidi handling, allowing for some sequences that 2003 should not have restricted (eg, trailing combining marks, needed for Thaana), and restricting sequences that lead to "bidi label hopping". (While these new bidi rules go a long way towards reducing this problem, they do not eliminate it because they do not check for inter-label situations.)
Q. What is "bidi label hopping"?
A. It is where bidi reordering causes characters from one label to appear to be part of another label. For example, with "B1.d" in a right-to-left paragraph (where B stands for an Arabic or Hebrew letter), the display would be "1.dB".
Q. What are the main disadvantages of IDNA2008?
A. The main disadvantages are:
- The IDNA 2008-Custom implementations appear to offer the opportunity for significant interoperability and security problems, with no effective means of handling them.
- The Special Cases offer the opportunity for interoperability and security problems if not handled correctly. (However, it appears that there are ways to handle them.)
- There are new contextual rules that are fairly complicated to implement, and are not in a machine-readable format. Without a comprehensive test suite and/or reference implementations to test against, it is fairly likely that there will be incompatibilities. Of particular interest are the invisible ZWJ/ZWNJ characters, which offer opportunities for spoofing if not properly restricted.
- The removal of about 2,900 symbols from the current definition of IDNA is an incompatible change. (Luckily these symbols are relatively rarely used, so this appears to be a minor issue.)
Q. Are the "local" mappings just a UI issue?
A. No, not if what is meant is that they are only involved in interactions with the address bar. Examples:
- Alice sees that a URL works in her browser (say http://faß.de or http://TÜRKIYE.com). She sends it to Bob in an email, and he clicks on the link in the message. He goes to the bad site, because his browser maps to http://fass.de or http://türkiye.com while Alice's maps to http://faß.de or http://türkıye.com.
- Alice creates a web page, using <a href="http://faß.de"> (or http://TÜRKIYE.com). Bob clicks on the link, and goes to a bad site.
- It is generally understood at the W3C that all attributes that take URLs should take full IRIs, not punycoded-URIs, so for example SVG, MathML, XLink, XML, etc, all take IRIs now, as does HTML5.
- Alice is in an IM chat with Bob. She copies in http://faß.de (or http://TÜRKIYE.com) and hits return. Bob clicks on the link he sees in his chat window, and goes to a bad site.
- Alice sends a Word document to Bob with a link in it...
- Alice creates a PDF document...
- ...
Q. Do the Custom exploits require unscrupulous registries?
A. No. The exploits don't require unscrupulous registries -- they only require that registries don't police every URL that they register for possible spoofing behavior. And we have clear evidence that the registries are simply unable to do that. After all, if we could depend on registries to always do the "right thing", we wouldn't need any restrictions on characters at the protocol level!
Q. Why does IDNA 2003 map eszett (ß) to "ss", and map final sigma (ς) to sigma (σ), and delete ZWJ/ZWNJ?
A. This is to provide full case insensitivity, following the Unicode Standard. These characters are anomalous: the standard uppercase of ß is "SS", the same as the uppercase of "ss", and the uppercase of ς is Σ, the same as the uppercase of σ. For full case insensitivity (with transitivity), {ss, ß, SS} and {σ, ς, Σ} need to be treated as equivalent, with one of each set chosen as the representative in the mapping. That is what is done in the Unicode Standard, which was followed by IDNA 2003.
Note that the text "ΒόλοΣ.com", which appears on http://Βόλος.com, illustrates this: the normal case mapping of Σ is to σ. If σ and ς are not treated as case variants, there wouldn't be a match between ΒόλοΣ and Βόλος.
In German, the situation is even more complicated:
- In Switzerland, "ss" is uniformly used instead of ß.
- The recent spelling reform in Germany and Austria changed whether ß or ss is used in many words. For example, http://Schloß.de was the spelling before 1996, and http://Schloss.de is correct after.
- Recently, in Unicode 5.1, an uppercase version of ß was added (ẞ), since it is attested in some cases. It is unknown, however, whether it will ever become the preferred uppercase. Unicode now treats all of these as a single equivalence class for case-insensitive matching: {ss, ß, SS, ẞ}. See also the Unicode FAQ.
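These equivalence classes can be observed with Unicode full case folding, which Python exposes as `str.casefold()`. The snippet only illustrates the case-insensitive matching described above, not the IDNA mapping itself, and the helper name `same_class` is hypothetical.

```python
german = ["ss", "ß", "SS", "ẞ"]   # U+00DF and U+1E9E fold to "ss"
greek = ["σ", "ς", "Σ"]           # final sigma and capital sigma fold to σ

def same_class(strings):
    """True if all strings share one case-folded form."""
    return len({s.casefold() for s in strings}) == 1

print(same_class(german))   # True
print(same_class(greek))    # True
# The ΒόλοΣ / Βόλος example: the two only match if σ and ς are case variants.
print("ΒόλοΣ".casefold() == "Βόλος".casefold())   # True
```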
IDNA 2003 deletes ZWJ, ZWNJ and other characters that are
themselves invisible but may affect rendering. IDNA 2008 allows them,
but only in limited contexts.
Q. Why allow ZWJ/ZWNJ at all?
Q. What is the motivation for allowing arbitrary (Custom) mappings?
A. It is unclear, because as with many instances in the rationale document, there are no concrete examples that would help justify that choice. However, it appears that the motivation was the Turkic dotted/dotless I issue. Unfortunately, instead of limiting it to that one case (which would be painful enough), the IDNA 2008 specification currently allows for any mapping. For example, some people disagree on the case mappings for Greek, and even French. The spec makes no distinction between useful mappings (like those for compatibility with 2003) and potentially bizarre or pernicious ones.
Q. Why doesn't IDNA 2008 (or for that matter IDNA 2003) restrict allowed domains on the basis of language?
A. It is extremely difficult to restrict on the basis of language, because the letters used in a particular language are not well defined. The "core" letters typically are, but many others are typically accepted in loan words, and have perfectly legitimate commercial and social use.
It is a bit easier to maintain a bright line based on script differences between characters: every Unicode character has a defined script (or is Common/Inherited). Even there it is problematic to have that as a restriction. Some languages (Japanese) require multiple scripts. And in most cases, mixtures of scripts are harmless. One can have SONY日本.com with no problems at all -- while there are many cases of "homographs" (visually confusable characters) within the same script that a restriction based on script doesn't deal with.
It would have been of some aid to remove historic scripts (like cuneiform) from the protocol, but the IDNA working group didn't agree to that. See Unicode Specific Character Adjustments, Table 4.
The rough consensus among the working group is that script/language mixing restrictions are not appropriate for the lowest-level protocol. So in this respect, IDNA 2008 is no different than IDNA 2003. IDNA doesn't try to attack the homograph problem, because it is too difficult to have a bright line. Effective solutions depend on information or capabilities outside of the protocol's control, such as language restrictions appropriate for a particular registry, the language of the user looking at this URL, the ability of a UI to display suspicious URLs with special highlighting, and so on.
Responsible registries will have their own rules, since they can apply such restrictions. For example, DENIC can decide on a restricted set of characters appropriate for German. Apps also take certain precautions -- MSIE, Safari, and Chrome all display domain names in Unicode only if the user's language(s) typically use the scripts in those domain names. Firefox is the odd man out, expecting TLD registries to publish rules to Firefox management's liking. There is more on the kinds of techniques that implementations can use on the Unicode web site, at [Unicode Security Considerations].
Q. Why care about the changes to ß and ς?
If we were starting with a blank slate, it would be feasible to have ß/ẞ kept separate from ss/SS (although note that the German national standards body doesn't: the uppercase of ß is SS, which is why they got connected in the first place in Unicode). Similarly, it would have been possible to separate σ/ς (again, joined because they have a common uppercase Σ).
But we are not starting with a clean slate: we are faced with changes to an existing, widely deployed standard that is over 6 years old, which is a long time in terms of the web. We have two effective options:
A. Maintain compatibility with IDNA2003
| URLs | Result | When |
|---|---|---|
| http://www.γιατρός.gr, http://www.γιατρόσ.gr, http://www.ΓΙΑΤΡΌΣ.gr | http://www.xn--mxads7ake1d.gr | always |
| http://www.weltfußball.at, http://www.WELTFUẞBALL.at, http://www.weltfussball.at, http://www.WELTFUSSBALL.at | http://www.weltfussball.at | always |
B. Deviate from IDNA2003
Try to separate them, leaving users with a de facto indeterminate mapping. That results in the following:
| URLs | Result | When |
|---|---|---|
| http://www.γιατρός.gr | http://www.xn--mxads7afk1d.gr | sometimes |
| | http://www.xn--mxads7ake1d.gr | sometimes |
| http://www.γιατρόσ.gr, http://www.ΓΙΑΤΡΌΣ.gr | http://www.xn--mxads7ake1d.gr | always |
| http://www.weltfußball.at, http://www.WELTFUẞBALL.at | http://www.xn--weltfuball-b4a.at | sometimes |
| | http://www.weltfussball.at | sometimes |
| http://www.weltfussball.at, http://www.WELTFUSSBALL.at | http://www.weltfussball.at | always |
Now perhaps the NICs for de, at, and ch will address ß/ẞ/ss/SS (and the NIC for gr will address σ/ς/Σ) by bundling or blocking. But bundling or blocking defeats the purpose of separating them, and there is little reason to think that .com, .biz, .whatever will all do the same -- not to speak of the many, many more registries below the top level.