UAX 31 Changes

L2/...

From: Mark Davis

Date: 2009-3-28

I suggest the following changes in UAX 31.

1. Fix ambiguous variables

There are suggested rules for using ZWJ and ZWNJ in http://unicode.org/draft/reports/tr31/tr31.html#Layout_and_Format_Control_Characters

Those rules use the variable $L for two different entities: Left Joining, and Letter (for Indic). Although the two uses are in separate contexts, it would be much clearer without the overlap. There are a few possible alternatives; I suggest:

    • For the Joining specifications of ZWJ/ZWNJ, change $L, $R to $LJ, $RJ
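As a rough illustration of the renamed variables, here is a minimal Python sketch of the ZWNJ cursive-connection rule as I recall it from UAX 31, /$LJ ($T)* ZWNJ ($T)* $RJ/. The rule shape is recalled from that section rather than taken from this document, and the three-character Joining_Type table is a hand-picked sample for illustration only; a real implementation would load Joining_Type from the Unicode Character Database.

```python
import re

# Hypothetical, hand-picked sample of Joining_Type values (illustration only).
JOINING_TYPE = {
    "\u0628": "D",  # ARABIC LETTER BEH: Dual_Joining
    "\u0627": "R",  # ARABIC LETTER ALEF: Right_Joining
    "\u064E": "T",  # ARABIC FATHA: Transparent
}

def char_class(types):
    """Build a regex character class from the sample table."""
    chars = sorted(c for c, t in JOINING_TYPE.items() if t in types)
    return "[" + "".join("\\u%04X" % ord(c) for c in chars) + "]"

LJ = char_class({"D", "L"})  # proposed name for the old joining $L
RJ = char_class({"D", "R"})  # proposed name for the old joining $R
T = char_class({"T"})        # transparent characters

# ZWNJ (U+200C) is allowed between a left-joining-capable character and a
# right-joining-capable one, with transparent marks permitted on either side.
ZWNJ_RULE = re.compile(LJ + T + "*" + "\u200C" + T + "*" + RJ)

def zwnj_context_ok(text):
    return ZWNJ_RULE.search(text) is not None
```

With the distinct names, $L remains free for Letter in the Indic rules without any risk of confusion.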

2. Add Default Ignorable Code Points to Table 4 Candidate Characters for Exclusion from Identifiers

In http://unicode.org/draft/reports/tr31/tr31.html#Specific_Character_Adjustments,

add a row:

[:Default_Ignorable_Code_Point=True:] Default Ignorable Code Points (See Section 2.3)

[Rationale: we already say that DIs should be excluded, with certain exceptions in Section 2.3, which has a lot of detail on the topic. This just makes that relationship more visible.]

3. Add Unicode 5.2 Characters to Table 3/4 (Candidates for Inclusion/Exclusion)

Add to Table 4 (Exclusion) the following scripts (this is a rough cut, so feedback is welcome):

Archaic / Historic

    • Old Turkic

    • Old South Arabian

    • Imperial Aramaic

    • Inscriptional Parthian

    • Inscriptional Pahlavi

    • Avestan

    • Egyptian Hieroglyphs

    • Javanese

Limited Use

    • Samaritan

    • Kaithi

    • Tai Viet

    • Bamum

    • Lisu

Add the following to Table 5. Recommended Scripts

    • Meetei Mayek

    • Tai Tham

4. Add U+0640 ( ‎ـ‎ ) ARABIC TATWEEL as a candidate character for exclusion.

We have the following tables in http://unicode.org/draft/reports/tr31/tr31.html#Specific_Character_Adjustments

    • Table 3. Candidate Characters for Inclusion in Identifiers

    • Table 4. Candidate Characters for Exclusion from Identifiers

A. I suggest adding a row to Table 4, being

[\u0640] Arabic Tatweel

B. Alternatively, one could break Table 4 into two tables:

Table 4a. Candidate Characters Identified by Code Point for Exclusion from Identifiers

Containing only Tatweel

Table 4b. Candidate Characters Identified by Property for Exclusion from Identifiers

Containing the current Table 4 contents

(Ken favors a two table solution; I think it is simpler with one.)
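To make the effect of option A concrete, here is a minimal Python sketch of an identifier profile that removes Tatweel. It assumes Python's str.isidentifier() as a stand-in for the default XID_Start/XID_Continue definition (it follows whatever Unicode version the Python build ships with), and the names EXCLUDED and is_profile_identifier are hypothetical.

```python
# Code points excluded by the profile; here only the proposed Table 4 row.
EXCLUDED = {0x0640}  # ARABIC TATWEEL

def is_profile_identifier(name: str) -> bool:
    """True if name is a default identifier and contains no excluded code point."""
    return name.isidentifier() and not any(ord(ch) in EXCLUDED for ch in name)

# BEH + TATWEEL + BEH is accepted by the default definition, because
# Tatweel has General_Category Lm, but is rejected by the profile.
sample = "\u0628\u0640\u0628"
default_ok = sample.isidentifier()
profile_ok = is_profile_identifier(sample)
```

Whether Tatweel sits in one table or in a separate by-code-point table, an implementation would treat it the same way: as one more member of the exclusion set.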

5. Add Characters from IDNA Tables Document

The IDNA tables document (draft) contains certain exceptions that we should review, in http://tools.ietf.org/html/draft-ietf-idnabis-tables#section-2.6.

The following characters are not in the Unicode identifier definition XID_Continue (after subtracting characters that are affected by case folding and NFKC), nor are they in the Candidates for Inclusion.

Greek And Coptic - Numeral signs

U+0375 ( ͵ ) GREEK LOWER NUMERAL SIGN

Arabic - Signs for Sindhi

U+06FD ( ‎۽‎ ) ARABIC SIGN SINDHI AMPERSAND

U+06FE ( ‎۾‎ ) ARABIC SIGN SINDHI POSTPOSITION MEN

Tibetan - Marks and signs

U+0F0B ( ་ ) TIBETAN MARK INTERSYLLABIC TSHEG

Katakana - Conjunction and length marks

U+30FB ( ・ ) KATAKANA MIDDLE DOT

Of these, I'd recommend that we add U+30FB ( ・ ) KATAKANA MIDDLE DOT to Table 3. Candidate Characters for Inclusion in Identifiers, since it serves a function somewhat like an underbar. The others have gotten into the IDNA specification (draft), but there doesn't seem to be any compelling rationale for that. However, others may know more about them and may present good reasons for inclusion in UAX #31.
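For anyone who wants to check why these five characters fall outside the default identifier definition, the following Python sketch looks up their General_Category values. It is only an approximation: it tests the categories that feed ID_Start/ID_Continue (letters, Nl, Mn, Mc, Nd, Pc) and ignores Other_ID_Start, Other_ID_Continue, Pattern_Syntax, and the NFKC adjustments, and the results reflect the Unicode version bundled with the Python build.

```python
import unicodedata

# General categories that contribute to ID_Continue (approximation, see above).
ID_CONTINUE_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl", "Mn", "Mc", "Nd", "Pc"}

# The five characters listed above from the IDNA tables draft.
CANDIDATES = [0x0375, 0x06FD, 0x06FE, 0x0F0B, 0x30FB]

def category_report(code_points):
    """Map each code point to (name, General_Category, category-is-identifier-like)."""
    report = {}
    for cp in code_points:
        ch = chr(cp)
        cat = unicodedata.category(ch)
        report[cp] = (unicodedata.name(ch), cat, cat in ID_CONTINUE_CATEGORIES)
    return report

report = category_report(CANDIDATES)
for cp, (name, cat, in_id) in report.items():
    print(f"U+{cp:04X} {name}: gc={cat}, identifier category={in_id}")
```

All five are symbols or punctuation by General_Category, so adding any of them to identifiers requires an explicit entry such as a Table 3 row.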

Note that the following is part of Pattern_Syntax, and thus not part of XID_Continue. Pattern_Syntax is immutable, and required to be disjoint from identifiers, and yet this character was added in that range, which was probably a mistake.

Supplemental Punctuation - Medievalist punctuation

U+2E2F ( ⸯ ) VERTICAL TILDE

Of the characters that Unicode has, and IDNA doesn't, I don't see any need to make any changes. Some of them are principled differences, like the omission of connector punctuation, and others are not, like the omission of Hangul Jamo.

5.1 Background

For completeness, the following lists the exceptions in the 05 version of that document, organized by type.

*PVALID: // would otherwise have been DISALLOWED

00DF; PVALID # LATIN SMALL LETTER SHARP S

03C2; PVALID # GREEK SMALL LETTER FINAL SIGMA

06FD; PVALID # ARABIC SIGN SINDHI AMPERSAND

06FE; PVALID # ARABIC SIGN SINDHI POSTPOSITION MEN

0F0B; PVALID # TIBETAN MARK INTERSYLLABIC TSHEG

3007; PVALID # IDEOGRAPHIC NUMBER ZERO

*CONTEXTO: // would otherwise have been DISALLOWED

00B7; CONTEXTO # MIDDLE DOT

0375; CONTEXTO # GREEK LOWER NUMERAL SIGN (KERAIA)

05F3; CONTEXTO # HEBREW PUNCTUATION GERESH

05F4; CONTEXTO # HEBREW PUNCTUATION GERSHAYIM

30FB; CONTEXTO # KATAKANA MIDDLE DOT

*CONTEXTO: // would otherwise have been PVALID

002D; CONTEXTO # HYPHEN-MINUS

02B9; CONTEXTO # MODIFIER LETTER PRIME

0660; CONTEXTO # ARABIC-INDIC DIGIT ZERO

0661; CONTEXTO # ARABIC-INDIC DIGIT ONE

0662; CONTEXTO # ARABIC-INDIC DIGIT TWO

0663; CONTEXTO # ARABIC-INDIC DIGIT THREE

0664; CONTEXTO # ARABIC-INDIC DIGIT FOUR

0665; CONTEXTO # ARABIC-INDIC DIGIT FIVE

0666; CONTEXTO # ARABIC-INDIC DIGIT SIX

0667; CONTEXTO # ARABIC-INDIC DIGIT SEVEN

0668; CONTEXTO # ARABIC-INDIC DIGIT EIGHT

0669; CONTEXTO # ARABIC-INDIC DIGIT NINE

06F0; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT ZERO

06F1; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT ONE

06F2; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT TWO

06F3; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT THREE

06F4; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT FOUR

06F5; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT FIVE

06F6; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT SIX

06F7; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT SEVEN

06F8; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT EIGHT

06F9; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT NINE

0483; CONTEXTO # COMBINING CYRILLIC TITLO

3005; CONTEXTO # IDEOGRAPHIC ITERATION MARK

303B; CONTEXTO # VERTICAL IDEOGRAPHIC ITERATION MARK

*DISALLOWED: // would otherwise have been PVALID

302E; DISALLOWED # HANGUL SINGLE DOT TONE MARK

302F; DISALLOWED # HANGUL DOUBLE DOT TONE MARK
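The list above has a regular shape (code point; status # name), so it is easy to load for comparison against identifier data. The following Python sketch parses lines in that shape; the function name parse_exceptions is hypothetical, and the sample input is three lines copied from the list.

```python
def parse_exceptions(lines):
    """Parse 'XXXX; STATUS # NAME' lines into {code_point: (status, name)}."""
    table = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines and the '*PVALID: // ...' style group headers.
        if not line or line.startswith("*"):
            continue
        data, _, comment = line.partition("#")
        code, _, status = data.partition(";")
        table[int(code.strip(), 16)] = (status.strip(), comment.strip())
    return table

SAMPLE = """\
*PVALID: // would otherwise have been DISALLOWED
00DF; PVALID # LATIN SMALL LETTER SHARP S
30FB; CONTEXTO # KATAKANA MIDDLE DOT
302E; DISALLOWED # HANGUL SINGLE DOT TONE MARK
"""

exceptions = parse_exceptions(SAMPLE.splitlines())
```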

5.2 Characters in IDNA draft

Here is the current set, as of the current draft and Unicode 5.1. You can paste it into http://unicode.org/cldr/utility/list-unicodeset.jsp to explore it, or to compare it against XID_Continue.

[\-0-9a-z·ß-öø-ÿāăąćĉċčďđēĕėęěĝğġģĥħĩīĭįıĵķĸĺļľłńņňŋōŏőœŕŗřśŝşšţťŧũūŭůűųŵŷźżžƀƃƅƈƌƍƒƕƙ-ƛƞơƣƥƨƪƫƭưƴƶƹ-ƻƽ-ǃǎǐǒǔǖǘǚǜǝǟǡǣǥǧǩǫǭǯǰǵǹǻǽǿȁȃȅȇȉȋȍȏȑȓȕȗșțȝȟȡȣȥȧȩȫȭȯȱȳ-ȹȼȿɀɂɇɉɋɍɏ-ʯʹ-ˁˆ-ˑˬˮ̀-̿͂͆-͎͐-ͯͱͳ͵ͷͻ-ͽΐά-ώϗϙϛϝϟϡϣϥϧϩϫϭϯϳϸϻϼа-џѡѣѥѧѩѫѭѯѱѳѵѷѹѻѽѿҁ҃-҇ҋҍҏґғҕҗҙқҝҟҡңҥҧҩҫҭүұҳҵҷҹһҽҿӂӄӆӈӊӌӎӏӑӓӕӗәӛӝӟӡӣӥӧөӫӭӯӱӳӵӷӹӻӽӿԁԃԅԇԉԋԍԏԑԓԕԗԙԛԝԟԡԣՙա-ֆ֑-ׇֽֿׁׂׅׄא-תװ-״ؐ-ؚء-ٞ٠-٩ٮ-ٴٹ-ۓە-ۜ۟-۪ۨ-ۿܐ-݊ݍ-ޱ߀-ߵߺँ-ह़-्ॐ-॔ॠ-ॣ०-९ॱॲॻ-ॿঁ-ঃঅ-ঌএঐও-নপ-রলশ-হ়-ৄেৈো-ৎৗৠ-ৣ০-ৱਁ-ਃਅ-ਊਏਐਓ-ਨਪ-ਰਲਵਸਹ਼ਾ-ੂੇੈੋ-੍ੑੜ੦-ੵઁ-ઃઅ-ઍએ-ઑઓ-નપ-રલળવ-હ઼-ૅે-ૉો-્ૐૠ-ૣ૦-૯ଁ-ଃଅ-ଌଏଐଓ-ନପ-ରଲଳଵ-ହ଼-ୄେୈୋ-୍ୖୗୟ-ୣ୦-୯ୱஂஃஅ-ஊஎ-ஐஒ-கஙசஜஞடணதந-பம-ஹா-ூெ-ைொ-்ௐௗ௦-௯ఁ-ఃఅ-ఌఎ-ఐఒ-నప-ళవ-హఽ-ౄె-ైొ-్ౕౖౘౙౠ-ౣ౦-౯ಂಃಅ-ಌಎ-ಐಒ-ನಪ-ಳವ-ಹ಼-ೄೆ-ೈೊ-್ೕೖೞೠ-ೣ೦-೯ംഃഅ-ഌഎ-ഐഒ-നപ-ഹഽ-ൄെ-ൈൊ-്ൗൠ-ൣ൦-൯ൺ-ൿංඃඅ-ඖක-නඳ-රලව-ෆ්ා-ුූෘ-ෟෲෳก-าิ-ฺเ-๎๐-๙ກຂຄງຈຊຍດ-ທນ-ຟມ-ຣລວສຫອ-າິ-ູົ-ຽເ-ໄໆ່-ໍ໐-໙ༀ་༘༙༠-༩༹༵༷༾-གང-ཇཉ-ཌཎ-དན-བམ-ཛཝ-ཨཪ-ཬཱིེུ-ྀྂ-྄྆-ྋྐ-ྒྔ-ྗྙ-ྜྞ-ྡྣ-ྦྨ-ྫྭ-ྸྺ-ྼ࿆က-၉ၐ-႙ა-ჺሀ-ቈቊ-ቍቐ-ቖቘቚ-ቝበ-ኈኊ-ኍነ-ኰኲ-ኵኸ-ኾዀዂ-ዅወ-ዖዘ-ጐጒ-ጕጘ-ፚ፟ᎀ-ᎏᎠ-Ᏼᐁ-ᙬᙯ-ᙶᚁ-ᚚᚠ-ᛪᜀ-ᜌᜎ-᜔ᜠ-᜴ᝀ-ᝓᝠ-ᝬᝮ-ᝰᝲᝳក-ឳា-៓ៗៜ៝០-៩᠐-᠙ᠠ-ᡷᢀ-ᢪᤀ-ᤜᤠ-ᤫᤰ-᤻᥆-ᥭᥰ-ᥴᦀ-ᦩᦰ-ᧉ᧐-᧙ᨀ-ᨛᬀ-ᭋ᭐-᭙᭫-᭳ᮀ-᮪ᮮ-᮹ᰀ-᰷᱀-᱉ᱍ-ᱽᴀ-ᴫᴯᴻᵎᵫ-ᵷᵹ-ᶚ᷀-᷿ᷦ᷾ḁḃḅḇḉḋḍḏḑḓḕḗḙḛḝḟḡḣḥḧḩḫḭḯḱḳḵḷḹḻḽḿṁṃṅṇṉṋṍṏṑṓṕṗṙṛṝṟṡṣṥṧṩṫṭṯṱṳṵṷṹṻṽṿẁẃẅẇẉẋẍẏẑẓẕ-ẙẜẝẟạảấầẩẫậắằẳẵặẹẻẽếềểễệỉịọỏốồổỗộớờởỡợụủứừửữựỳỵỷỹỻỽỿ-ἇἐ-ἕἠ-ἧἰ-ἷὀ-ὅὐ-ὗὠ-ὧὰὲὴὶὸὺὼᾰᾱᾶῆῐ-ῒῖῗῠ-ῢῤ-ῧῶ‌‍ⅎↄⰰ-ⱞⱡⱥⱦⱨⱪⱬⱱⱳⱴⱶ-ⱻⲁⲃⲅⲇⲉⲋⲍⲏⲑⲓⲕⲗⲙⲛⲝⲟⲡⲣⲥⲧⲩⲫⲭⲯⲱⲳⲵⲷⲹⲻⲽⲿⳁⳃⳅⳇⳉⳋⳍⳏⳑⳓⳕⳗⳙⳛⳝⳟⳡⳣⳤⴀ-ⴥⴰ-ⵥⶀ-ⶖⶠ-ⶦⶨ-ⶮⶰ-ⶶⶸ-ⶾⷀ-ⷆⷈ-ⷎⷐ-ⷖⷘ-ⷞⷠ-ⷿⸯ々-〇〪-〭〱-〵〻〼ぁ-ゖ゙゚ゝゞァ-ヾㄅ-ㄭㆠ-ㆷㇰ-ㇿ㐀-䶵一-鿃ꀀ-ꒌꔀ-ꘌꘐ-ꘫꙁꙃꙅꙇꙉꙋꙍꙏꙑꙓꙕꙗꙙꙛꙝꙟꙣꙥꙧꙩꙫꙭ-꙯꙼꙽ꙿꚁꚃꚅꚇꚉꚋꚍꚏꚑꚓꚕꚗꜗ-ꜟꜣꜥꜧꜩꜫꜭꜯ-ꜱꜳꜵꜷꜹꜻꜽꜿꝁꝃꝅꝇꝉꝋꝍꝏꝑꝓꝕꝗꝙꝛꝝꝟꝡꝣꝥꝧꝩꝫꝭꝯꝱ-ꝸꝺꝼꝿꞁꞃꞅꞇꞈꞌꟻ-ꠧꡀ-ꡳꢀ-꣄꣐-꣙꤀-꤭ꤰ-꥓ꨀ-ꨶꩀ-ꩍ꩐-꩙가-힣﨎﨏﨑﨓﨔﨟﨡﨣﨤﨧-﨩ﬞ︠-︦ﹳ𐀁-𐀋𐀍-𐀦𐀨-𐀺𐀼𐀽𐀿-𐁍𐁐-𐁝𐂀-𐃺𐇽𐊀-𐊜𐊠-𐋐𐌀-𐌞𐌰-𐍀𐍂-𐍉𐎀-𐎝𐎠-𐏃𐏈-𐏏𐐨-𐒝𐒠-𐒩𐠀-𐠅𐠈𐠊-𐠵𐠷𐠸𐠼𐠿𐤀-𐤕𐤠-𐤹𐨀-𐨃𐨅𐨆𐨌-𐨓𐨕-𐨗𐨙-𐨳𐨸-𐨿𐨺𒀀-𒍮𠀀-𪛖]
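The comparison against XID_Continue can also be done offline. The following Python sketch handles only the simple notation used in the set above (literal characters, backslash escapes, and a-b ranges inside one pair of brackets), not full UnicodeSet syntax, and it approximates XID_Continue by asking whether "a" followed by the character is a valid Python identifier. The function names are hypothetical, and the sample is just the first few items of the set.

```python
def parse_simple_set(text):
    """Expand a flat '[...]' set of literals, '\\x' escapes, and 'a-b' ranges."""
    body = text.strip()
    if body.startswith("[") and body.endswith("]"):
        body = body[1:-1]
    # First pass: resolve backslash escapes into (char, was_escaped) tokens.
    tokens, i = [], 0
    while i < len(body):
        if body[i] == "\\" and i + 1 < len(body):
            tokens.append((body[i + 1], True))
            i += 2
        else:
            tokens.append((body[i], False))
            i += 1
    # Second pass: an unescaped '-' between two tokens denotes a range.
    result, j = set(), 0
    while j < len(tokens):
        ch = tokens[j][0]
        if j + 2 < len(tokens) and tokens[j + 1] == ("-", False):
            end = tokens[j + 2][0]
            result.update(chr(cp) for cp in range(ord(ch), ord(end) + 1))
            j += 3
        else:
            result.add(ch)
            j += 1
    return result

def not_xid_continue(chars):
    """Characters that cannot continue a Python identifier (XID_Continue proxy)."""
    return {c for c in chars if not ("a" + c).isidentifier()}

# First few items of the set above, written with escapes for portability.
prefix = parse_simple_set("[\\-0-9a-z\u00b7\u00df-\u00f6\u00f8-\u00ff]")
outside = not_xid_continue(prefix)
```

Pasting the full set in place of the sample gives the list of IDNA-permitted characters that the default identifier definition rejects, subject to the Unicode version of the Python build.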