I've had a chance to do some data mining, and it is now clear which are
the most prominent characters that are remapped under the current
scheme (the relative frequencies vary, as one might expect, depending
on the language): they are case variants, width variants, and
A copy of this email is at http://www.macchiato.com/unicode/idna/remap
Now, my position is still that the
simplest and most compatible option open to us is to simply map with
NFKC + Casefold. However, in the interest of getting this process
moving, I offer the following as a possible compromise approach. It
limits the remapped characters to those that are the most useful in
practice. I'll first give the proposal, then list some of the details.
A. Tables document
Add a new type of character: REMAP. A character is REMAP if it meets all of the following criteria:
- The character is not PVALID or CONTEXTO
- The character is mapped by NFKC_Casefold
- The character is a LetterDigit or Pd
- If remapped by the Unicode property NFKC_Casefold*, then the resulting character(s) are all PVALID or CONTEXTO
- The character has one of the following Decomposition_Type values: canonical, initial, medial, final, isolated, wide, narrow, or compat
- The character does not have the Script value: Hangul
The REMAP characters are removed from DISALLOWED, so that the TABLES values form a partition (all the values are disjoint).
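As a rough illustration only (not part of the proposal), the six criteria above can be sketched in Python. Several things here are assumptions: `is_valid` is a hypothetical caller-supplied predicate standing in for the PVALID/CONTEXTO lookup from the Tables document; `nfkc_casefold` only approximates the Unicode NFKC_Casefold property (for instance, it does not remove default-ignorable code points); LetterDigit is taken to mean the general categories Ll, Lu, Lo, Nd, Lm, Mn, Mc; and the Hangul test looks at character names because Python's unicodedata module does not expose the Script property.

```python
import unicodedata

# Assumption: LetterDigit = these General_Category values.
LETTER_DIGIT = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}
ALLOWED_DT = {"canonical", "initial", "medial", "final",
              "isolated", "wide", "narrow", "compat"}


def nfkc_casefold(s):
    """Approximation of NFKC_Casefold using stdlib normalization and casefold."""
    folded = unicodedata.normalize("NFKC", s).casefold()
    return unicodedata.normalize("NFKC", folded)


def decomposition_type(ch):
    """Return the Decomposition_Type of ch, or None if it has no decomposition."""
    d = unicodedata.decomposition(ch)
    if not d:
        return None
    if d.startswith("<"):
        return d[1:d.index(">")]   # e.g. "<wide> 0041" -> "wide"
    return "canonical"             # untagged mappings are canonical


def is_remap(ch, is_valid):
    """Apply the six REMAP criteria to a single character.

    is_valid(ch) is a hypothetical predicate: True if ch is PVALID or CONTEXTO.
    """
    if is_valid(ch):                                  # criterion 1
        return False
    mapped = nfkc_casefold(ch)
    if mapped == ch:                                  # criterion 2
        return False
    cat = unicodedata.category(ch)
    if cat not in LETTER_DIGIT and cat != "Pd":       # criterion 3
        return False
    if not all(is_valid(c) for c in mapped):          # criterion 4
        return False
    if decomposition_type(ch) not in ALLOWED_DT:      # criterion 5
        return False
    if "HANGUL" in unicodedata.name(ch, ""):          # criterion 6 (name-based stand-in)
        return False
    return True
```

With a toy `is_valid` that accepts only lowercase ASCII letters, digits, and hyphen, this classifies U+FF21 FULLWIDTH LATIN CAPITAL LETTER A and U+FF0D FULLWIDTH HYPHEN-MINUS as REMAP, and rejects U+2474 and U+FFA1.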
B. Protocols document
Change section 5.3 so as to require:
- Mapping all REMAP characters according to the Unicode property NFKC_Casefold,
- Then normalizing the result according to NFC.
Change section 5.5 by adding a clause like the others (prohibiting REMAP in U-Labels):
o Labels containing remapped code points, i.e., those that are
assigned to the "REMAP" category in the permitted character
The rest of the tests for U-Label remain unchanged. The registration process remains unchanged, so characters are not remapped in the registration process.
C. Defs document
- Define REMAP
- Define an M-Label to be a label that, if remapped according to B1+B2, results in a U-Label.
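To make the B1+B2 mapping and the M-Label definition concrete, here is a minimal sketch. It is an illustration under stated assumptions rather than a reference implementation: `is_remap` and `is_u_label` are hypothetical caller-supplied predicates (the real ones would come from the Tables and Protocols documents), and `nfkc_casefold` is only an approximation of the Unicode NFKC_Casefold property.

```python
import unicodedata


def nfkc_casefold(s):
    """Approximation of NFKC_Casefold using stdlib normalization and casefold."""
    folded = unicodedata.normalize("NFKC", s).casefold()
    return unicodedata.normalize("NFKC", folded)


def remap(label, is_remap):
    """B1: map every REMAP character; B2: normalize the result to NFC."""
    mapped = "".join(nfkc_casefold(ch) if is_remap(ch) else ch for ch in label)
    return unicodedata.normalize("NFC", mapped)


def is_m_label(label, is_remap, is_u_label):
    """A label is an M-Label if remapping it yields a U-Label."""
    return is_u_label(remap(label, is_remap))
```

Note that only characters classified as REMAP are touched; anything else passes through unchanged and is left for the U-Label tests to accept or reject.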
Details on REMAP
Caveats: sizes may change depending on tweaks/changes to the formulation, etc.
||PVALID or CONTEXTx
||Not PValid or ContextO
||& NFKC/Case mapped
||& LetterDigit or Pd
|& Results are PValid or ContextO
|& Results are allowed DTs & not Hangul
Note that you can get a detailed view of all of these by going to http://unicode.org/cldr/utility/list-unicodeset.jsp, pasting the Set of Characters into the input box, and hitting Show Set.
|Set of Characters
||[À-ÅÇ-ÏÑ-ÖÙ-ÝĀĂĄĆĈĊČĎĒĔĖĘĚĜĞĠĢĤĨĪĬĮİ ĴĶĹĻĽŃŅŇŌŎŐŔŖŘŚŜŞŠŢŤŨŪŬŮŰŲŴŶŸŹŻŽ ƠƯǍǏǑǓǕǗǙǛǞǠǢǦǨǪǬǮǴǸǺǼǾȀȂȄȆȈȊȌȎ ȐȒȔȖȘȚȞȦȨȪȬȮȰȲΆΈ-ΊΌΎΏΪΫϴЀЁЃЇЌ-Ў ЙѶӁӐӒӖӚӜӞӢӤӦӪӬӮӰӲӴӸḀḂḄḆḈḊḌḎḐḒ ḔḖḘḚḜḞḠḢḤḦḨḪḬḮḰḲḴḶḸḺḼḾṀṂṄṆṈṊṌṎṐṒ ṔṖṘṚṜṞṠṢṤṦṨ ṪṬṮṰṲṴṶṸṺṼṾẀẂẄẆẈẊẌẎẐẒẔ ẠẢẤẦẨẪẬẮẰẲẴẶ ẸẺẼẾỀỂỄỆỈỊỌỎỐỒỔỖỘỚỜỞỠỢ ỤỦỨỪỬỮỰỲỴỶỸἈ-ἏἘ-ἝἨ-ἯἸ-ἿὈ-ὍὙὛὝὟὨ-ὯᾸ-ᾺῈῊῘ-ῚῨ-ῪῬῸῺΩKÅ]
||Case and NFC
||[ﭔﭘﭜﭠﭤﭨﭬﭰﭴﭸﭼﮀﮐﮔﮘﮜﮢﮨﮬﯕﯦﯨﯸﯻﯾﲗ-ﳞﴭ-ﴳﵐﵒ-ﵗﵙﵜﵝﵠﵡﵣ ﵥﵨﵫﵭﵰﵲﵳ ﵷﵽ ﶃﶆﶈ-ﶊﶌ-ﶏﶒ-ﶕﶘﶝﶴﶵﶸﶺﷃ-ﷅﺋﺑﺗﺛﺟﺣﺧﺳﺷﺻﺿﻃﻇﻋﻏﻓﻗﻛﻟﻣﻧﻫﻳ]
||[ﭑﭓﭗﭛﭟﭣﭧﭫﭯﭳﭷﭻﭿﮃﮅﮇﮉﮋﮍﮏﮓﮗﮛﮟﮡﮥﮧﮫﮯﮱﯔﯘﯚﯜﯟﯡﯣﯥﯫﯭﯯﯱﯳﯵﯷﯺﯽﱤ-ﲖﴑ-ﴬﴼﵑﵘﵚﵛ ﵞﵟﵢﵤﵦﵧﵩﵪﵬﵮ ﵯﵱﵴ-ﵶﵸ-ﵼﵾ-ﶂﶄﶅﶇﶋ ﶖﶗﶙ-ﶜﶞ-ﶳﶶﶷﶹﶻ-ﷂﷆﷇ ﺂﺄﺆﺈﺊﺎﺐﺔﺖﺚﺞﺢﺦﺪﺬﺮﺰﺲﺶﺺﺾﻂﻆ ﻊﻎﻒﻖﻚﻞﻢﻦﻪﻮﻰﻲﻶﻸﻺﻼ]
||[ﭐﭒﭖﭚﭞﭢﭦﭪﭮﭲﭶﭺﭾﮂﮄﮆﮈﮊﮌﮎﮒﮖﮚﮞﮠﮤﮦﮪﮮﮰﯓﯗﯙﯛﯝﯞﯠﯢ ﯤﯪﯬﯮ ﯰﯲﯴﯶﯹﯼﰀ-ﱝﳵ-ﴐﴽﷰ-ﷹﺀﺁﺃﺅﺇﺉﺍﺏﺓﺕﺙﺝﺡﺥﺩﺫﺭﺯﺱﺵﺹﺽ ﻁﻅﻉﻍﻑﻕﻙﻝﻡﻥﻩﻭﻯﻱﻵﻷﻹﻻ]
Rationale for inclusion of clauses
The character is not PVALID or CONTEXTO
- This guarantees that the class of REMAP is disjoint from the others.
The character is mapped by NFKC_Casefold
- These are the only affected characters. We don't want to map differently than this, because that would cause compatibility problems.
The character is a LetterDigit or Pd
- This limits the input characters by eliminating symbols, punctuation, etc.
- The Pd is included only to pick up the fullwidth hyphen.
If remapped by the Unicode property NFKC_Casefold*, then the resulting character(s) are all PVALID or CONTEXTO
- This condition is not strictly necessary. Because of the way in which REMAP is used in the protocol above, if the mapping produces a character that is not PVALID, the label would fail the later tests. So as far as I'm concerned, this could be dropped. However, restricting the characters in this way will probably make a character listing clearer to people.
- The derived property NFKC_Casefold is being added to Unicode 5.2, and is already present in the 5.2 beta. It provides a convenient way to fold characters for identifiers (and not just for IDNA). It is defined in http://unicode.org/reports/tr44/tr44-3.html, and the characters affected are listed in http://unicode.org/Public/5.2.0/ucd/ under DerivedNormalizationProps. If we don't want to wait for U5.2, we could define it on our own, but it would be convenient to use it, as long as we release in October or later.
The character has one of the following Decomposition_Type values: canonical, initial, medial, final, isolated, wide, narrow, or compat
- The initial/medial/final/isolated forms are all Arabic presentation forms, such as:
U+FE91 ( ﺑ ) ARABIC LETTER BEH INITIAL FORM
- The narrow/wide are all width variants, such as:
U+FF71 ( ｱ ) HALFWIDTH KATAKANA LETTER A
- The compat forms include various digraphs or forms such as:
U+013F ( Ŀ ) LATIN CAPITAL LETTER L WITH MIDDLE DOT,
U+01C6 ( ǆ ) LATIN SMALL LETTER DZ WITH CARON
Most compat characters are, however, eliminated by other conditions, such as:
U+2474 ( ⑴ ) PARENTHESIZED DIGIT ONE, which is eliminated by both condition #2 and #3
- The excluded decomposition types are: font, super, sub; vertical; circle, fraction, nobreak, small, square
The character does not have the Script value: Hangul
- This is consistent with the exclusion in Tables of OldHangulJamo. Sample exclusions are:
U+3131 ( ㄱ ) HANGUL LETTER KIYEOK
U+FFA1 ( ﾡ ) HALFWIDTH HANGUL LETTER KIYEOK
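For anyone who wants to check the Decomposition_Type of the sample characters above, here is a small sketch using Python's standard unicodedata module. The helper name `decomposition_type` is my own. The values reported depend on the Unicode version bundled with the Python build, so treat the output as a convenience check rather than as normative data.

```python
import unicodedata


def decomposition_type(ch):
    """Return the Decomposition_Type of ch, or None if it has no decomposition."""
    d = unicodedata.decomposition(ch)
    if not d:
        return None
    if d.startswith("<"):
        return d[1:d.index(">")]   # tagged (compatibility) mapping, e.g. "<narrow> 30A2"
    return "canonical"             # untagged mappings are canonical


# Sample characters cited in the rationale.
SAMPLES = (0xFE91, 0xFF71, 0x013F, 0x01C6, 0x2474, 0x3131, 0xFFA1)

for cp in SAMPLES:
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.name(ch)}: {decomposition_type(ch)}")
```

The Hangul samples do carry allowed decomposition types (compat and narrow), which is why the separate Script condition is needed to exclude them.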