Remap

I've had a chance to do some data mining, and it is now clear which are the most prominent characters that are remapped under the current scheme (the relative frequencies vary, as one might expect, depending on the language): they are case variants, width variants, and presentation variants.

A copy of this email is at http://www.macchiato.com/unicode/idna/remap

Now, my position is still that the simplest and most compatible option open to us is to map with NFKC + Casefold. However, in the interest of getting this process moving, I offer the following as a possible compromise approach. It limits the remapped characters to those that are the most useful in practice. I'll first give the proposal, then list some of the details afterwards.

Proposal:

A. Tables document

Add a new type of character: REMAP. A character is REMAP if it meets all of the following criteria:

  1. The character is not PVALID or CONTEXTO
  2. The character is mapped by NFKC_Casefold
  3. The character is a LetterDigit or Pd
  4. If remapped by the Unicode property NFKC_Casefold*, then the resulting character(s) are all PVALID or CONTEXTO
  5. The character has one of the following Decomposition_Type values: canonical, initial, medial, final, isolated, wide, narrow, or compat
  6. The character does not have the Script value: Hangul
The REMAP characters are removed from DISALLOWED, so that the TABLES values form a partition (all the values are disjoint).
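
As a rough illustration of how the six criteria combine, here is a minimal Python sketch. It rests on several assumptions that are not part of the proposal: the standard library has no NFKC_Casefold property, so `nfkc_casefold_approx` stands in for it (NFKC, case fold, NFKC again, ignoring the removal of default-ignorable code points); the PVALID/CONTEXTO test is passed in as a caller-supplied predicate (`toy_is_valid` below is purely hypothetical); and the Script=Hangul test is approximated by looking for "HANGUL" in the character name. The LetterDigit categories follow the Tables document.

```python
import unicodedata

# General categories treated as LetterDigit, plus Pd for the hyphen variants.
LETTER_DIGIT_OR_PD = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc", "Pd"}
ALLOWED_DT = {"canonical", "initial", "medial", "final",
              "isolated", "wide", "narrow", "compat"}

def nfkc_casefold_approx(text):
    """Approximation of NFKC_Casefold (assumption: not the real property)."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return unicodedata.normalize("NFKC", folded)

def decomposition_type(ch):
    """Return the lowercase Decomposition_Type tag, 'canonical', or None."""
    decomp = unicodedata.decomposition(ch)
    if not decomp:
        return None
    if decomp.startswith("<"):
        return decomp[1:decomp.index(">")].lower()
    return "canonical"

def looks_hangul(ch):
    """Stand-in for Script=Hangul based on the character name (assumption)."""
    return "HANGUL" in unicodedata.name(ch, "")

def is_remap(ch, is_valid):
    """Apply criteria 1-6; is_valid(ch) means 'PVALID or CONTEXTO'."""
    if is_valid(ch):                                          # 1
        return False
    mapped = nfkc_casefold_approx(ch)
    if mapped == ch:                                          # 2
        return False
    if unicodedata.category(ch) not in LETTER_DIGIT_OR_PD:    # 3
        return False
    if not all(is_valid(c) for c in mapped):                  # 4
        return False
    if decomposition_type(ch) not in ALLOWED_DT:              # 5
        return False
    return not looks_hangul(ch)                               # 6

def toy_is_valid(ch):
    """Hypothetical placeholder for the real PVALID/CONTEXTO tables."""
    stable = nfkc_casefold_approx(ch) == ch
    letterish = unicodedata.category(ch) in {"Ll", "Lo", "Lm", "Nd", "Mn", "Mc"}
    return ch == "-" or (letterish and stable)

for sample in ["\u00c0", "\uff21", "\uff0d", "\ufe91", "\u2474", "\u3131"]:
    print(f"U+{ord(sample):04X}", is_remap(sample, toy_is_valid))
```

A real implementation would replace both stand-ins with the derived NFKC_Casefold data and the generated Tables values.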

B. Protocols document

Change section 5.3 so as to require:
  1. Mapping all REMAP characters according to the Unicode property NFKC_Casefold,
  2. Then normalizing the result according to NFC.
Change section 5.5 by adding a clause like the others (prohibiting REMAP in U-Labels):
   o  Labels containing remapped code points, i.e., those that are
      assigned to the "REMAP" category in the permitted character
      table [IDNA2008-Tables].
The rest of the tests for U-Label remain unchanged. The registration process remains unchanged, so characters are not remapped in the registration process.

C. Defs document

  1. Define REMAP
  2. Define an M-Label to be a label that, when remapped according to B1+B2, results in a U-Label.
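
The mapping steps B1+B2 and the M-Label definition can be sketched as follows. This is only a sketch under stated assumptions: `nfkc_casefold_approx` approximates the NFKC_Casefold property with standard-library calls, and the REMAP and U-Label tests are supplied by the caller (the two `toy_` predicates are hypothetical placeholders, not the real tables or the real U-Label tests).

```python
import unicodedata

def nfkc_casefold_approx(text):
    """Approximation of NFKC_Casefold (assumption: not the real property)."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return unicodedata.normalize("NFKC", folded)

def remap_label(label, is_remap_char):
    """B1: map only the REMAP characters; B2: normalize the result to NFC."""
    mapped = "".join(
        nfkc_casefold_approx(ch) if is_remap_char(ch) else ch
        for ch in label
    )
    return unicodedata.normalize("NFC", mapped)

def is_m_label(label, is_remap_char, is_u_label):
    """C2: an M-Label is one whose remapped form is a U-Label."""
    return is_u_label(remap_label(label, is_remap_char))

def toy_is_remap_char(ch):
    """Hypothetical placeholder: treat uppercase letters as REMAP."""
    return unicodedata.category(ch) == "Lu"

def toy_is_u_label(label):
    """Hypothetical placeholder: lowercase letters, digits, and hyphen only."""
    return bool(label) and all(
        ch == "-" or unicodedata.category(ch) in {"Ll", "Nd"}
        for ch in label
    )

print(remap_label("A\u0300bc", toy_is_remap_char))
print(is_m_label("Example", toy_is_remap_char, toy_is_u_label))
```

The first call shows why the NFC step matters: mapping "A" to "a" leaves a following combining grave accent in place, and only the normalization in B2 composes the pair into U+00E0.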


Details on REMAP

Sizes

Caveats: sizes may change depending on tweaks/changes to the formulation, etc.

Size        Description                              Clause
90,261      PVALID or CONTEXTx
1,023,851   Not PValid or ContextO                   1
5,815       & NFKC/Case mapped                       2
4,513       & LetterDigit or Pd                      3
4,312       & Results are PValid or ContextO         4
2,324       & Results are allowed DTs & not Hangul   5,6

Breakdown

Note that you can get a detailed view of all of these by going to http://unicode.org/cldr/utility/list-unicodeset.jsp, pasting the Set of Characters into the input box, and hitting Show Set.

Percent Count Category Set of Characters
 60.03% 3,491 disallowed ...
 18.61% 1,082 NFC [\u0340\u0341\u0343\u0344\u0374\u0958-\u095F\u09DC\u09DD\u09DF\u0A33\u0A36\u0A59-\u0A5B\u0A5E\u0B5C\u0B5D\u0F43\u0F4D\u0F52\u0F57\u0F5C\u0F69\u0F73\u0F75\u0F76\u0F78\u0F81\u0F93\u0F9D\u0FA2\u0FA7\u0FAC\u0FB9\u1F71\u1F73\u1F75\u1F77\u1F79\u1F7B\u1F7D\u1FBE\u1FD3\u1FE3\uFB1D\uFB1F\uFB2A-\uFB36\uFB38-\uFB3C\uFB3E\uFB40\uFB41\uFB43\uFB44\uFB46-\uFB4E\uF900-\uFA0D\uFA10\uFA12\uFA15-\uFA1E\uFA20\uFA22\uFA25\uFA26\uFA2A-\uFA2D\uFA30-\uFA6A\uFA70-\uFAD9\U0002F800-\U0002FA1D]
 5.97% 347 Case [À-ÅÇ-ÏÑ-ÖÙ-ÝĀĂĄĆĈĊČĎĒĔĖĘĚĜĞĠĢĤĨĪĬĮİ ĴĶĹĻĽŃŅŇŌŎŐŔŖŘŚŜŞŠŢŤŨŪŬŮŰŲŴŶŸŹŻŽ ƠƯǍǏǑǓǕǗǙǛǞǠǢǦǨǪǬǮǴǸǺǼǾȀȂȄȆȈȊȌȎ ȐȒȔȖȘȚȞȦȨȪȬȮȰȲΆΈ-ΊΌΎΏΪΫϴЀЁЃЇЌ-Ў ЙѶӁӐӒӖӚӜӞӢӤӦӪӬӮӰӲӴӸḀḂḄḆḈḊḌḎḐḒ ḔḖḘḚḜḞḠḢḤḦḨḪḬḮḰḲḴḶḸḺḼḾṀṂṄṆṈṊṌṎṐṒ ṔṖṘṚṜṞṠṢṤṦṨ ṪṬṮṰṲṴṶṸṺṼṾẀẂẄẆẈẊẌẎẐẒẔ ẠẢẤẦẨẪẬẮẰẲẴẶ ẸẺẼẾỀỂỄỆỈỊỌỎỐỒỔỖỘỚỜỞỠỢ ỤỦỨỪỬỮỰỲỴỶỸἈ-ἏἘ-ἝἨ-ἯἸ-ἿὈ-ὍὙὛὝὟὨ-ὯᾸ-ᾺῈῊῘ-ῚῨ-ῪῬῸῺΩKÅ]
 0.12% 7 Case and NFC [ΆΈΉΊΎΌΏ]
 0.05% 3 Other Canonical [ϓϔẛ]
 2.94% 171 Initial [ﭔﭘﭜﭠﭤﭨﭬﭰﭴﭸﭼﮀﮐﮔﮘﮜﮢﮨﮬﯕﯦﯨﯸﯻﯾﲗ-ﳞﴭ-ﴳﵐﵒ-ﵗﵙﵜﵝﵠﵡﵣ ﵥﵨﵫﵭﵰﵲﵳ ﵷﵽ ﶃﶆﶈ-ﶊﶌ-ﶏﶒ-ﶕﶘﶝﶴﶵﶸﶺﷃ-ﷅﺋﺑﺗﺛﺟﺣﺧﺳﺷﺻﺿﻃﻇﻋﻏﻓﻗﻛﻟﻣﻧﻫﻳ]
 1.26% 73 Medial [ﭕﭙﭝﭡﭥﭩﭭﭱﭵﭹﭽﮁﮑﮕﮙﮝﮣﮩﮭﯖﯧﯩﯿﳟ-ﳱﴴ-ﴻﺌﺒﺘﺜﺠﺤﺨﺴﺸﺼﻀﻄﻈﻌﻐﻔﻘﻜﻠﻤﻨﻬﻴ]
 4.13% 240 Final [ﭑﭓﭗﭛﭟﭣﭧﭫﭯﭳﭷﭻﭿﮃﮅﮇﮉﮋﮍﮏﮓﮗﮛﮟﮡﮥﮧﮫﮯﮱﯔﯘﯚﯜﯟﯡﯣﯥﯫﯭﯯﯱﯳﯵﯷﯺﯽﱤ-ﲖﴑ-ﴬﴼﵑﵘﵚﵛ ﵞﵟﵢﵤﵦﵧﵩﵪﵬﵮ ﵯﵱﵴ-ﵶﵸ-ﵼﵾ-ﶂﶄﶅﶇﶋ ﶖﶗﶙ-ﶜﶞ-ﶳﶶﶷﶹﶻ-ﷂﷆﷇ ﺂﺄﺆﺈﺊﺎﺐﺔﺖﺚﺞﺢﺦﺪﺬﺮﺰﺲﺶﺺﺾﻂﻆ ﻊﻎﻒﻖﻚﻞﻢﻦﻪﻮﻰﻲﻶﻸﻺﻼ]
 3.80% 221 Isolated [ﭐﭒﭖﭚﭞﭢﭦﭪﭮﭲﭶﭺﭾﮂﮄﮆﮈﮊﮌﮎﮒﮖﮚﮞﮠﮤﮦﮪﮮﮰﯓﯗﯙﯛﯝﯞﯠﯢ ﯤﯪﯬﯮ ﯰﯲﯴﯶﯹﯼﰀ-ﱝﳵ-ﴐﴽﷰ-ﷹﺀﺁﺃﺅﺇﺉﺍﺏﺓﺕﺙﺝﺡﺥﺩﺫﺭﺯﺱﺵﺹﺽ ﻁﻅﻉﻍﻑﻕﻙﻝﻡﻥﻩﻭﻯﻱﻵﻷﻹﻻ]
 1.00% 58 Narrow [ヲ-゚]
 1.08% 63 Wide [-0-9A-Za-z]
 1.01% 59 Compat [µIJijĿŀʼnſDŽ-njDZ-dzϐ-ϒϕϖϰ-ϲϵϹևٵ-ٸำຳໜໝ\u0F77\u0F79ẚℇℵ-ℸff-stﬓ-ﬗﭏ]
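
The figures in the two tables can be cross-checked with a few lines of arithmetic. The sketch below only restates the counts listed above; the reading that the percentages are taken relative to the 5,815 characters of clause 2 is an inference from the numbers, not something stated in the text.

```python
# Counts copied from the Breakdown table above.
breakdown = {
    "disallowed": 3491, "NFC": 1082, "Case": 347, "Case and NFC": 7,
    "Other Canonical": 3, "Initial": 171, "Medial": 73, "Final": 240,
    "Isolated": 221, "Narrow": 58, "Wide": 63, "Compat": 59,
}
mapped_total = 5815   # clause 2 row of the Sizes table
remap_total = 2324    # clauses 5,6 row of the Sizes table

remap_sum = sum(n for name, n in breakdown.items() if name != "disallowed")
percent = {name: round(100 * n / mapped_total, 2) for name, n in breakdown.items()}

print(sum(breakdown.values()) == mapped_total)   # all rows together
print(remap_sum == remap_total)                  # rows other than "disallowed"
print(90261 + 1023851 == 0x110000)               # the first two Sizes rows cover all code points
print(percent)
```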

Rationale for inclusion of clauses

  1. The character is not PVALID or CONTEXTO
    • This guarantees that the class of REMAP is disjoint with the others
  2. The character is mapped by NFKC_Casefold
    • These are the only affected characters. We don't want to map differently than this, because that would cause compatibility problems.
  3. The character is a LetterDigit or Pd
    • This limits the input characters by eliminating symbols, punctuation, etc.
    • The Pd is in only to pick up the fullwidth hyphen.
  4. If remapped by the Unicode property NFKC_Casefold*, then the resulting character(s) are all PVALID or CONTEXTO
    • This condition is not strictly necessary. Because of the way in which REMAP is used in the protocol above, if a character results that is not PVALID, then it would fail the later tests. So as far as I'm concerned, this could be dropped. However, restricting the characters in this way will probably make a character listing clearer to people.
    • The derived property NFKC_Casefold is being added to Unicode 5.2, and is already present in the 5.2 beta. It provides a convenient way to fold characters for identifiers (and not just for IDNA). It is defined in http://unicode.org/reports/tr44/tr44-3.html, and the characters affected are listed in http://unicode.org/Public/5.2.0/ucd/ under DerivedNormalizationProps. If we don't want to wait for U5.2, we can define it on our own, but it would be convenient to use it, as long as we release in October or later.
  5. The character has one of the following Decomposition_Type values: canonical, initial, medial, final, isolated, wide, narrow, or compat
    • The initial/medial/final/isolated forms are all Arabic presentation forms, such as:
      U+FE91 ( ﺑ ) ARABIC LETTER BEH INITIAL FORM
    • The narrow/wide are all width variants, such as:
      U+FF71 ( ア ) HALFWIDTH KATAKANA LETTER A
    • The compat forms include various digraphs or forms such as:
      U+013F ( Ŀ ) LATIN CAPITAL LETTER L WITH MIDDLE DOT,
      U+01C6 ( dž ) LATIN SMALL LETTER DZ WITH CARON
      Most compat characters are, however, eliminated by other conditions, such as:
      U+2474 ( ⑴ ) PARENTHESIZED DIGIT ONE, which is eliminated by both condition #3 and #4
    • The excluded decomposition types are: font, super, sub; vertical; circle, fraction, nobreak, small, square
  6. The character does not have the Script value: Hangul
    • This is consistent with the exclusion in Tables of OldHangulJamo. Sample exclusions are:
      U+3131 ( ㄱ ) HANGUL LETTER KIYEOK
      U+FFA1 ( ᄀ ) HALFWIDTH HANGUL LETTER KIYEOK
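
The sample characters cited in this rationale can be inspected with Python's standard `unicodedata` module, which exposes the general category and the decomposition tag. This is just a convenience sketch; the `describe` helper is made up for illustration, and the values it reports depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

# Sample code points mentioned in the rationale.
SAMPLES = ["\uFE91", "\uFF71", "\u013F", "\u01C6", "\u2474", "\u3131", "\uFFA1"]

def describe(ch):
    """Return (code point, name, general category, decomposition type)."""
    decomp = unicodedata.decomposition(ch)
    if decomp.startswith("<"):
        tag = decomp[1:decomp.index(">")]   # e.g. "<wide> 0041" -> "wide"
    elif decomp:
        tag = "canonical"                   # untagged mappings are canonical
    else:
        tag = "none"
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch), tag)

for ch in SAMPLES:
    print(*describe(ch), sep="  ")
```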