Security Degradations with IDNA2008

2008-12-03 (updated later)

Comments to: markdavis@google.com

International Domain Names (IDN) have been available for about 5 years now, defined by a set of IETF RFCs called IDNA 2003. These permit non-ASCII domain names, like "http://ÖBB.at" for the Austrian railway system (Österreichische Bundesbahnen - ÖBB). There is a new draft set of RFCs called IDNA 2008 {tables, defs, rationale, protocol, bidi} nearing completion. This document describes some security issues with the changes that would be introduced by IDNA 2008.

This document is the result of a request by members of the Unicode Technical Committee to produce a summary of the potential security problems in the current draft IDNA 2008, for circulation to security teams within their organizations.

The goal here is to describe the security issues as if the current draft IDNA 2008 were approved as is. That draft is a moving target, and the text here may undergo progressive revisions if there are changes in the draft IDNA 2008.

Comments and questions are welcome. However, if readers have any concerns about the draft IDNA 2008 itself, the appropriate forum to voice them is at idna-update (joining the email list and thereby the working group), not in response to this document.

Main Differences

For the most common cases, IDNA 2003 and IDNA 2008 behave identically. Both map a user-visible Unicode form of a URL (like http://öbb.at) to a transformed version with only ASCII characters that is actually sent over the wire, the punycode version (like http://xn--bb-eka.at). The punycode version can later be transformed back into Unicode form for display. (For a demo of this process, see the ICU idnbrowser.) In this document, we'll focus on the user-visible Unicode forms.
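
This round trip is easy to demonstrate in code. A minimal sketch in Python, whose built-in "idna" codec implements IDNA 2003:

    # Convert between the user-visible Unicode form and the punycode
    # ("xn--") form that is actually sent over the wire, using Python's
    # built-in "idna" codec (an IDNA 2003 implementation).
    wire = 'öbb.at'.encode('idna')      # b'xn--bb-eka.at'
    shown = wire.decode('idna')         # 'öbb.at'
    print(wire, shown)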

IDNA 2008 does not maintain backwards compatibility with IDNA 2003. The main differences between the two are:

    • Additions. Some IDNs are invalid in IDNA 2003, but valid in IDNA 2008.

    • Subtractions. Some IDNs are valid in IDNA 2003, but invalid in IDNA 2008.

    • Deviations. Some IDNs are valid in both, but resolve to different destinations.

    • Unpredictable cases. Some IDNs do not have predictable behavior in IDNA 2008, due to "Local Mappings". They may fail, or may have any of the above characteristics.

Additions. The additions are expected: IDNA 2003 is built on an old version of Unicode (version 3.2). There are almost 5,500 more characters in the newest version of Unicode (version 5.1), of which about 3,800 are suitable for international domain names. IDNA 2008 not only allows many more characters; it is also not tied to a specific version of Unicode. Thus, as characters are added to Unicode, they will become valid in IDNA 2008. (There are also some combinations of characters that are valid in IDNA 2003 but not in 2008, and vice versa.)

Subtractions. IDNA 2008 removes many characters that were valid under IDNA 2003, because it makes most symbols and punctuation illegal. So http://√.com is valid (and goes to a site) in an IDNA 2003 implementation; it would fail on an IDNA 2008 implementation. This affects about 2,900 characters, mostly rarely used ones. (An extremely small percentage of those 2,900 cases are security risks because of confusability. The majority are unproblematic: having http://I♥NY.com won't cause problems.)
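
The contrast is easy to observe. A sketch, using Python's built-in "idna" codec for IDNA 2003 and the third-party idna package for IDNA 2008:

    import idna  # third-party package implementing IDNA 2008

    # IDNA 2003: the symbol is accepted and converted to punycode.
    print('√.com'.encode('idna'))       # b'xn--19g.com'

    # IDNA 2008: the symbol is DISALLOWED, so the conversion fails.
    try:
        idna.encode('√.com')
    except idna.IDNAError as e:
        print('rejected:', e)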

Deviations and Unpredictable Cases. Neither the additions nor the subtractions are expected to cause security problems (although if anyone can think of potential problems, please send them in). The deviations and unpredictable cases are a different matter. They result from the ways in which IDNA 2003 and IDNA 2008 allow for mappings: conversion of one IDN string into another, plus four special cases in IDNA 2008.

Mappings

IDNA 2003 requires a mapping phase, which maps http://ÖBB.at to http://öbb.at (for example). Mapping typically involves mapping uppercase characters to their lowercase counterparts, but it also involves other types of mappings between equivalent characters, such as mapping half-width katakana characters to normal (full-width) katakana characters in Japanese. The IDNA mapping is based on data specified by Unicode and results in what is called the comparison format in this discussion. Formally, this format is based on the Unicode NFKC normalization format, plus Unicode case foldings, plus deletion of default-ignorable characters (normally invisible). This affects roughly 4,500 characters, including some quite common ones: all uppercase characters, for example.
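
A rough approximation of this comparison format, sketched in Python (the real nameprep mapping has additional details, so treat this as illustrative only):

    import unicodedata

    def comparison_format(label: str) -> str:
        """Approximate the IDNA 2003 comparison format: case folding,
        NFKC normalization, and deletion of default-ignorable characters."""
        folded = unicodedata.normalize('NFKC', label.casefold())
        # Delete ZWNJ (U+200C) and ZWJ (U+200D), two representative
        # default-ignorable (normally invisible) characters.
        return folded.replace('\u200c', '').replace('\u200d', '')

    print(comparison_format('ÖBB'))     # 'öbb'
    print(comparison_format('ＡＢＣ'))   # 'abc' (full-width mapped to ASCII)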

The reason for this mapping phase in IDNA 2003 was to match the case-insensitivity of ASCII domain names. Users are accustomed to having both http://CNN.com and http://cnn.com work identically. They would not expect the addition of an accent to make a difference: they expect that if http://Bruder.com is the same as http://bruder.com, then of course http://Brüder.com is the same as http://brüder.com. Case equivalences and similar equivalences between characters from other alphabets are handled by the Unicode comparison format.

IDNA 2008 does not require a mapping phase, but does permit one (called "Local Mapping") with no limitations on what the mapping can do to disallowed characters (including even ASCII uppercase characters, if they occur in an IDN). For more information on the permitted mappings, see Protocol, Section 4.3 and Protocol, Section 5.3.

We can thus categorize IDNA 2008 implementations into four main general categories, given below. (Note that these categories are not found in IDNA 2008: they are instead categories based on anticipated implementations of IDNA 2008.)

    • IDNA 2008-Strict. No Local Mapping: only strings valid under the IDNA 2008 protocol are looked up.

    • IDNA 2008-Hybrid. A Local Mapping equivalent to the IDNA 2003 mapping, restricted to characters valid under IDNA 2008 (so symbols and punctuation fail).

    • IDNA 2008-Compatible. A Local Mapping equivalent to the full IDNA 2003 mapping, so that everything valid under IDNA 2003 continues to work.

    • IDNA 2008-Custom. Some other Local Mapping, which the specification leaves unconstrained.

Special Cases (Deviations)

There are a few situations (luckily only a few) where IDNA 2008-Strict will result in the resolution of IDNs to different IP addresses than in IDNA 2003. This affects a small number of characters, but ones that are relatively common in particular languages, so a significant number of strings in those languages will be affected. (For more information on why IDNA 2003 does this, see the FAQ.) These four "Special Cases" are all listed below:

    • Eszett (ß, U+00DF): mapped to "ss" by IDNA 2003; a valid, distinct character in IDNA 2008.

    • Final sigma (ς, U+03C2): mapped to σ by IDNA 2003; a valid, distinct character in IDNA 2008.

    • Zero Width Joiner (ZWJ, U+200D): deleted by IDNA 2003; allowed in certain contexts in IDNA 2008.

    • Zero Width Non-Joiner (ZWNJ, U+200C): deleted by IDNA 2003; allowed in certain contexts in IDNA 2008.

These differences allow for security exploits. Consider the following scenario, where "IDNBANK.xx" represents an IDN for a bank.

    1. Alice's browser supports IDNA 2003. Under those rules, "IDNBANK.xx" is mapped to "xn--blahblah", which is registered by the "xx" registry and resolves to the IP address 127.0.63.245.

    2. Bob's browser supports IDNA 2008. Under those rules, "IDNBANK.xx" is also valid, but converts to a different punycode "xn--gorpblah", which it turns out is also registered by the "xx" registry and resolves to a different IP address: 136.17.22.221.

The site at http://136.17.22.221/index.html turns out to be a deliberate spoof page (put up by a scammer) of the legitimate page http://127.0.63.245/index.html, a banking site. Alice gets to the correct page she is seeking. Bob gets to the phishing site instead, supplies his bank password, and is robbed.
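
The Special Case characters make this scenario concrete. For example, faß.de (discussed further in the FAQ) converts to two different punycode forms under the two standards; a sketch using Python's built-in "idna" codec (IDNA 2003) and the third-party idna package (IDNA 2008):

    import idna  # third-party package implementing IDNA 2008

    # IDNA 2003 maps ß to "ss" before conversion...
    print('faß.de'.encode('idna'))    # b'fass.de'

    # ...while IDNA 2008 treats ß as a valid, distinct character.
    print(idna.encode('faß.de'))      # b'xn--fa-hia.de'

Two different wire forms means two independently registrable names, and thus two potentially different destinations.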

Note that this exploit can be carried out no matter which of the IDNA 2008 implementation categories Bob's browser uses.

The ZWJ and ZWNJ characters are of particular concern, because they are normally invisible. That is, the sequence "a<ZWJ>b" looks just like "ab". IDNA 2008 does provide a special category for characters like this (called CONTEXT), and only permits them in certain contexts (certain sequences of Arabic or Indic characters, for example). However, lookup applications are not required to check for these contexts, so overall security is dependent on registries' correct implementations.

The existence of these special cases means that the Unicode comparison format used in Hybrid and Compatible implementations needs to be modified to exclude these characters.

Tactics

For compatibility in the foreseeable future, special steps would need to be taken with Special Cases. When doing DNS lookup, a Compatible/Hybrid application would need to first try a lookup with the modified Unicode comparison format, preserving Special Cases, then try a lookup with the Special Cases mapped according to IDNA 2003. If both of the lookups work, but resolve to different IP addresses, then the lookup should fail. If exactly one succeeds, then it would be used. That allows all sites based on either IDNA 2008 or IDNA 2003 to work, and prevents the above problem. Luckily the number of IDNs with the Special Cases will be a small fraction of the total, so this should not impact performance.
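
A minimal sketch of that dual-lookup tactic, again assuming Python's built-in codec for the IDNA 2003 form and the third-party idna package for the IDNA 2008 form (resolve_idn is a hypothetical helper, not part of either library):

    import socket
    import idna  # third-party package implementing IDNA 2008

    def resolve_idn(domain: str) -> str:
        """Look up both the IDNA 2008 and IDNA 2003 forms of a domain,
        failing if both resolve but to different addresses."""
        forms = []
        try:
            forms.append(idna.encode(domain).decode('ascii'))   # 2008 form
        except idna.IDNAError:
            pass
        try:
            form2003 = domain.encode('idna').decode('ascii')    # 2003 form
            if form2003 not in forms:
                forms.append(form2003)
        except UnicodeError:
            pass
        addresses = []
        for form in forms:
            try:
                addresses.append(socket.gethostbyname(form))
            except socket.gaierror:
                pass  # that form is simply not registered
        if len(set(addresses)) > 1:
            raise ValueError('unsafe: forms resolve to different addresses')
        if not addresses:
            raise ValueError('no form of the domain resolves')
        return addresses[0]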

While some steps could be taken by registries to mitigate the above problems, we must remember that we are not only talking about top level domains, or second level domains, but also lower level domains that are under the control of thousands of different organizations. For example, the domain names under "blogspot.com", such as http://café.blogspot.com, are controlled by the company that has registered "blogspot". Ideally no registries would allow two IDNs that correspond according to the Special Cases list to resolve to different IP addresses. So blogspot would need to disallow registration of both http://gefäss.blogspot.com and http://gefäß.blogspot.com, to prevent problems (and of other cases like the normally-invisible ZWJ and ZWNJ). However, applications cannot depend on all such registries behaving correctly, because the odds are high that at least some (and perhaps many) of the many thousands of registries will not check for this. Thus the burden is primarily on applications handling IDNs to prevent the situation.

The worst of all possible cases is an IDNA 2008-Custom implementation. Unfortunately, there appears to be no good way to prevent security problems with IDNA 2008-Custom implementations, because it is impossible to anticipate what such implementations would do. Such an implementation is not limited to just the above four Special Cases for exploits -- it could remap even characters like "A" or "B" to an arbitrary other character (or sequence). Because there is no way to predict what it will do, there are no effective countermeasures.

Note: Some of us have particular concerns about allowing arbitrary mappings -- and think that if there is a mapping, it must be consistent with IDNA 2003 -- excepting the Special Cases. However, that does not appear to be the current consensus in the idna-update working group.

Clients such as search engines have another practical issue facing them. They will probably opt for IDNA 2008-Compatible, allowing all valid IDNA 2003 characters so that they can access all of the web. Normally they also need to canonicalize URLs, so that they can determine when two URLs are actually the same. For IDNA 2003 this was straightforward. For IDNA 2008-Hybrid/Compatible, the canonicalization can result in two different possibilities (with or without Special Cases). It may then require two DNS lookups to determine which of the two possibilities is to be used.

Whatever approach is taken, IDNA 2008 does not make any appreciable difference in reducing problems with visually-confusable characters (so-called homographs). Thus programmers still need to be aware of those issues as detailed in Unicode Security Considerations, including the list of potentially visually-confusable characters that can be used in programmatic tests found in that Unicode Technical Report.

Conclusions

As implementations update to IDNA 2008, we will for a considerable length of time have both IDNA 2003 and IDNA 2008 implementations in use, with the possible categories of IDNA 2008 given above: Strict, Hybrid, Compatible, or Custom.

To reduce security concerns, we strongly hope that no implementations choose a Custom variant, to avoid indeterminacies which can cause security problems. (Even better would be if this option were removed from the IDNA 2008 specs!) To maintain compatibility, we anticipate that few implementations will opt for the Strict variant.

That is, most would implement either IDNA 2008-Hybrid or IDNA 2008-Compatible in the near term. Once sufficiently many high-level registries disallow symbols, the IDNA 2008-Compatible implementations could probably move towards IDNA 2008-Hybrid. It is unclear when, if ever, it would be reasonable for those implementations to move to being Strict.

FAQ

Q. What are examples of where the different categories of IDNA implementation behave differently?

Q. How much of a problem is this actually if support for symbols like √.com were just dropped immediately?

Q. What are the main advantages of IDNA2008?

Q. What is "bidi label hopping"?

Q. What are the main disadvantages of IDNA2008?

Q. Are the "local" mappings just a UI issue?

Q. Do the Custom exploits require unscrupulous registries?

Q. Why does IDNA 2003 map eszett (ß) to "ss", and map final sigma (ς) to sigma (σ), and delete ZWJ/ZWNJ?

Q. Why allow ZWJ/ZWNJ at all?

Q. What is the motivation for allowing arbitrary (Custom) mappings?

Q. Why doesn't IDNA 2008 (or for that matter IDNA 2003) restrict allowed domains on the basis of language?

Q. What are examples of where the different categories of IDNA implementation behave differently?

A. Here is a summary that illustrates the differences, where 2003 is the current behavior.

    • http://Öbb.at — 2003: works (mapped to öbb.at); Strict: fails (no mapping, so the uppercase Ö is disallowed); Hybrid: works; Compatible: works; Custom: unpredictable.

    • http://√.com — 2003: works; Strict: fails; Hybrid: fails (symbols disallowed); Compatible: works; Custom: unpredictable.

    • http://faß.de — 2003: mapped to fass.de; Strict: resolved as faß.de, a different name (a Special Case deviation); Hybrid/Compatible: both forms must be checked (see Tactics); Custom: unpredictable.

Q. How much of a problem is this actually if support for symbols like √.com were just dropped immediately?

A. Browsers and other user agents can't and won't change from 2003 overnight. And who knows how long it would take for registries to change, notify people that their registrations are no longer valid, and handle whatever legal issues there are. The number of symbol registrations is rather small, and probably not of great concern. While the exclusion of symbols doesn't buy much, it doesn't hurt much either.

The larger concern is cases like http://Brüder.com that work now (under IDNA 2003, being equivalent to http://brüder.com) but would fail under a Strict implementation.

Q. What are the main advantages of IDNA2008?

A. The main advantages are:

    • It is no longer tied to an old version of Unicode (version 3.2)

    • It updates automatically to each new version of Unicode

    • It allows some characters that were forbidden, but which large communities feel they need (the Special Cases)

    • It improves the bidi handling, allowing for some sequences that 2003 should not have restricted (e.g., trailing combining marks, needed for Thaana), and restricting sequences that lead to "bidi label hopping". (While these new bidi rules go a long way towards reducing this problem, they do not eliminate it, because they do not check for inter-label situations.)

Q. What is "bidi label hopping"?

A. It is where bidi reordering causes characters from one label to appear to be part of another label. For example, with "B1.d" in a right-to-left paragraph (where B stands for an Arabic or Hebrew letter), the display would be "1.dB".

Q. What are the main disadvantages of IDNA2008?

A. The main disadvantages are:

    • The IDNA 2008-Custom implementations appear to offer the opportunity for significant interoperability and security problems, with no effective means of handling them.

    • The Special Cases offer the opportunity for interoperability and security problems if not handled correctly. (However, it appears that there are ways to handle them.)

    • There are new contextual rules that are fairly complicated to implement, and are not in a machine-readable format. Without a comprehensive test suite and/or reference implementations to test against, it is fairly likely that there will be incompatibilities. Of particular interest are the invisible ZWJ/ZWNJ characters, which offer opportunities for spoofing if not properly restricted.

    • The removal of about 2,900 symbols from the current definition of IDNA is an incompatible change. (Luckily these symbols are relatively rarely used, so this appears to be a minor issue.)

Q. Are the "local" mappings just a UI issue?

A. No, not if what is meant is that they are only involved in interactions with the address bar. Examples:

    • Alice sees that a URL works in her browser (say http://faß.de or http://TÜRKIYE.com). She sends it to Bob in an email, who clicks on the email representation. He goes to the bad site, because his browser maps to http://fass.de or http://türkiye.com while Alice's maps to http://faß.de or http://türkıye.com.

    • Alice creates a web page, using <a href="http://faß.de"> (or http://TÜRKIYE.com). Bob clicks on the link, and goes to a bad site.

      • It is generally understood at the W3C that all attributes that take URLs should take full IRIs, not punycoded-URIs, so for example SVG, MathML, XLink, XML, etc, all take IRIs now, as does HTML5.

    • Alice is in an IM chat with Bob. She copies in http://faß.de (or http://TÜRKIYE.com) and hits return. Bob clicks on the link he sees in his chat window, and goes to a bad site.

    • Alice sends a Word document to Bob with a link in it...

    • Alice creates a PDF document...

    • ...

Q. Do the Custom exploits require unscrupulous registries?

A. No. The exploits don't require unscrupulous registries -- they only require that registries not police every URL they register for possible spoofing behavior. And we have clear evidence that registries are simply unable to do that. After all, if we could depend on registries to always do the "right thing", we wouldn't need any restrictions on characters at the protocol level!

Q. Why does IDNA 2003 map eszett (ß) to "ss", and map final sigma (ς) to sigma (σ), and delete ZWJ/ZWNJ?

A. This is to provide full case insensitivity, following the Unicode Standard. These characters are anomalous: the standard uppercase of ß is "SS", the same as the uppercase of "ss", and the uppercase of ς is Σ, the same as the uppercase of σ. For full case insensitivity (with transitivity), {ss, ß, SS} and {σ, ς, Σ} need to be treated as equivalent, with one of each set chosen as the representative in the mapping. That is what is done in the Unicode Standard, which was followed by IDNA 2003.
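
Python's str.casefold implements this full case folding, so the equivalence classes are easy to verify:

    # Each equivalence class collapses to a single representative.
    print('ß'.casefold(), 'SS'.casefold(), 'ss'.casefold())   # ss ss ss
    print('ς'.casefold(), 'Σ'.casefold(), 'σ'.casefold())     # σ σ σ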

Note that the text "ΒόλοΣ.com", which appears on http://Βόλος.com, illustrates this: the normal case mapping of Σ is to σ. If σ and ς are not treated as case variants, there wouldn't be a match between ΒόλοΣ and Βόλος.

In German, the situation is even more complicated:

    • In Switzerland, "ss" is uniformly used instead of ß.

    • The recent spelling reform in Germany and Austria changed whether ß or ss is used in many words. For example, http://Schloß.de was the spelling before 1996, and http://Schloss.de is correct after.

    • Recently, in Unicode 5.1, an uppercase version of ß was added (ẞ), since it is attested in some cases. It is unknown, however, whether it will ever become the preferred uppercase. Unicode now treats all of these as a single equivalence class for case-insensitive matching: {ss, ß, SS, ẞ}. See also the Unicode FAQ.

IDNA 2003 deletes ZWJ, ZWNJ and other characters that are themselves invisible but may affect rendering. IDNA 2008 allows them, but only in limited contexts.

Q. Why allow ZWJ/ZWNJ at all?

A. During the development of Unicode, the ZWJ and ZWNJ were intended only for presentation -- that is, they would make no difference in the semantics of a word. Thus the comparison format should and does delete them. That comparison format, however, should never really be seen by users -- it is just a transient form used for comparison.

Unfortunately, the way the DNS works, this comparison format (with the transformations of eszett and final sigma, and the deletion of ZWJ/ZWNJ) ends up being visible to the user.

There are words, such as the name of the country of Sri Lanka, that require preservation of the Special Case features (in this case, ZWJ) in order to appear correct to end users when the URL comes back from the DNS server. That is what motivated the IDNA working group to retain the Special Cases.

Q. What is the motivation for allowing arbitrary (Custom) mappings?

A. It is unclear, because as with many instances in the rationale document, there are no concrete examples that would help justify that choice. However, it appears that the driving case was the Turkic dotted/dotless I issue. Unfortunately, instead of limiting the provision to that one case (which would be painful enough), the IDNA 2008 specification currently allows for any mapping. For example, some people disagree on the case mappings for Greek, and even French. The spec makes no distinction between useful mappings (like those for compatibility with 2003) and potentially bizarre or pernicious ones.
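
The Turkic case shows why locale-sensitive mapping makes lookup indeterminate. A sketch (the Turkish-specific mapping is hand-written here, since Python's str.lower is locale-independent):

    word = 'TÜRKIYE'

    # Locale-independent lowercasing: ASCII I becomes dotted i.
    print(word.lower())                               # 'türkiye'

    # Turkish-convention lowercasing: ASCII I becomes dotless ı.
    print(word.translate({ord('I'): 'ı'}).lower())    # 'türkıye'

Two users with different local mappings would thus look up two different punycode forms of the same visible URL.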

Q. Why doesn't IDNA 2008 (or for that matter IDNA 2003) restrict allowed domains on the basis of language?

A. It is extremely difficult to restrict on the basis of language, because the letters used in a particular language are not well defined. The "core" letters typically are, but many others are typically accepted in loan words, and have perfectly legitimate commercial and social use.

It is a bit easier to maintain a bright line based on script differences between characters: every Unicode character has a defined script (or is Common/Inherited). Even there it is problematic to have that as a restriction. Some languages (Japanese) require multiple scripts. And in most cases, mixtures of scripts are harmless. One can have SONY日本.com with no problems at all -- while there are many cases of "homographs" (visually confusable characters) within the same script that a restriction based on script doesn't deal with.
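
Detecting which scripts occur in a label is itself straightforward. A sketch using the third-party regex module, which supports Unicode script properties (the candidate list here is illustrative, not exhaustive):

    import regex  # third-party; supports \p{Script=...} properties

    CANDIDATES = ('Latin', 'Greek', 'Cyrillic', 'Han', 'Hiragana', 'Katakana')

    def scripts_in(label: str) -> set:
        """Return which candidate scripts occur in the label."""
        return {s for s in CANDIDATES
                if regex.search(rf'\p{{Script={s}}}', label)}

    print(scripts_in('SONY日本'))   # Latin and Han -- harmless mixing
    print(scripts_in('pаypаl'))    # Latin and Cyrillic -- the а is Cyrillic

Such a test says nothing about same-script homographs, though, which is the harder problem described above.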

It would have been of some aid to remove historic scripts (like cuneiform) from the protocol, but the IDNA working group didn't agree to that. See Unicode Specific Character Adjustments, Table 4.

The rough consensus among the working group is that script/language mixing restrictions are not appropriate for the lowest-level protocol. So in this respect, IDNA 2008 is no different than IDNA 2003. IDNA doesn't try to attack the homograph problem, because it is too difficult to have a bright line. Effective solutions depend on information or capabilities outside of the protocol's control, such as language restrictions appropriate for a particular registry, the language of the user looking at this URL, the ability of a UI to display suspicious URLs with special highlighting, and so on.

Responsible registries will have their own rules, since they can apply such restrictions. For example, DENIC can decide on a restricted set of characters appropriate for German. Apps also take certain precautions -- MSIE, Safari, and Chrome all display domain names in Unicode only if the user's language(s) typically use the scripts in those domain names. Firefox is the odd man out, expecting TLD registries to publish rules to Firefox management's liking. There is more on the kinds of techniques that implementations can use on the Unicode web site, at [Unicode Security Considerations].

Q. Why care about the changes to ß and ς?

A. If we were starting with a blank slate, it would be feasible to keep ß/ẞ separate from ss/SS (although note that the German national standards body doesn't: the uppercase of ß is SS, which is why they got connected in the first place in Unicode). Similarly, it would have been possible to separate σ/ς (again, joined because they have a common uppercase Σ).

But we are not starting with a clean slate: we are faced with changes to an existing widely-deployed standard, over 6 years old, which is a long time in terms of the web. We have two effective options:

A. Maintain compatibility with IDNA2003.

B. Deviate from IDNA2003: try to separate them, leaving users with a de-facto indeterminate mapping, where the same URL can lead different users to different sites.

Why would this happen? Well, for some indefinite period of time, both IDNA2003 and IDNA2008 clients will exist. So you get on a friend's machine, go to your bank site, and get to a spoof site. Moreover, in an effort to maintain compatibility for clients, most client software will do a dual lookup: first try one, then the other. If someone comes in with an intervening registration for a spoof site, then a URL that used to work for you now leads to the spoof site.

Now perhaps the NICs for de, at, and ch will address ß/ẞ/ss/SS (and the NIC for gr will address σ/ς/Σ) by bundling or blocking. But bundling or blocking defeats the purpose of separating them, and there is little reason to think that .com, .biz, .whatever will all do the same -- not to speak of the many, many more registries below the top level.