From: Mark Davis
Date: 2009-08-14 (revised)
In U5.2, add a section to UAX#29 noting that the following characters are candidates for tailoring by adding them to MidLetter. [Use language consistent with the other bullets]
[\-\u058A\u1806\u2010\u2011\u2E17\u30A0\uFE63\uFF0D][\u058A\u0F0B\u30A0\u30FB]
U+002D ( - ) HYPHEN-MINUS
U+058A ( ֊ ) ARMENIAN HYPHEN
U+0F0B ( ་ ) TIBETAN MARK INTERSYLLABIC TSHEG
U+1806 ( ᠆ ) MONGOLIAN TODO SOFT HYPHEN
U+2010 ( ‐ ) HYPHEN
U+2011 ( ‑ ) NON-BREAKING HYPHEN
U+2E17 ( ⸗ ) DOUBLE OBLIQUE HYPHEN
U+30A0 ( ゠ ) KATAKANA-HIRAGANA DOUBLE HYPHEN
U+30FB ( ・ ) KATAKANA MIDDLE DOT
U+FE63 ( ﹣ ) SMALL HYPHEN-MINUS
U+FF0D ( － ) FULLWIDTH HYPHEN-MINUS
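To illustrate what such a tailoring changes, here is a minimal Python sketch. It is not the UAX#29 algorithm: it only mimics the "letter, mid-letter, letter" rule, uses `str.isalpha()` as a stand-in for ALetter, and `DEFAULT_MID` is a small illustrative subset chosen for this example rather than the real property values. The function name `split_words` is hypothetical.

```python
# Candidate characters from the list above, to be added to MidLetter by tailoring.
HYPHEN_CANDIDATES = set(
    "\u002D\u058A\u0F0B\u1806\u2010\u2011\u2E17\u30A0\u30FB\uFE63\uFF0D"
)

# Assumption: a small illustrative subset of untailored mid-word characters,
# not the full MidLetter/MidNumLet property.
DEFAULT_MID = set("'\u2019:\u00B7")


def split_words(text, mid_letter=DEFAULT_MID):
    """Return runs of letters; one mid_letter character is kept inside a
    word only when it sits directly between two letters."""
    words = []
    current = ""
    for i, ch in enumerate(text):
        if ch.isalpha():
            current += ch
        elif (
            ch in mid_letter
            and current
            and i + 1 < len(text)
            and text[i + 1].isalpha()
        ):
            # Letter on both sides: do not break here.
            current += ch
        else:
            if current:
                words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


untailored = split_words("James Smith-Faley")
tailored = split_words("James Smith-Faley", DEFAULT_MID | HYPHEN_CANDIDATES)
```

Without the tailoring, the hyphen splits "Smith-Faley" into two words; with the candidates added, it stays one word.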
In U5.2, add bulleted text to UAX#29 discussing name validation characters, and giving guidelines for adding characters to the word characters allowed above.
In U5.2, add pointers in UAX#31 and in the word break text of UAX#29 indicating the relationship between identifiers and words (and that the character sets are not the same).
Add a test for consistency between the WB properties and Table 3 (with the known exceptions) to the invariant tests.
====
For entry field validation, implementations sometimes need to know which characters can occur in personal names. While it is a bit fuzzy exactly what this means, they want to distinguish between characters like those in "James Smith-Faley, Jr." and those in "!#@♥≠". Note that it is important to be reasonably lenient: it is extremely annoying for people not to be able to add legitimate names, like "di Silva", because those names have characters like space.
Typically, these personal name validations should not be language-specific; I might be using a website in a language other than the one for my name, for example. While a more sophisticated validation might use context among characters, a basic validation just wants to know "what characters can be part of names?". The text should explain that:
It is only a guideline, and may need tailoring for different environments
It is a lenient, non-language-specific set - for language-specific characters one should see CLDR.
Mention characters:
It includes characters that may not be appropriate for identifiers, and those that would not be parts of words.
It does not include contextual tests
Additional tests may be needed in cases where security is at issue.
The set can be narrowed if name fields are split out. For example, "," may not be necessary if titles are split out; if titles are not allowed, "." may not be necessary.
The word characters include some characters that may be part of words in a broad sense, such as the colon in Swedish "c:a" or the hyphenation points in a dictionary word, that would not normally be part of names.
Explain the use of NFKC in name validation
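The points above can be sketched as a basic, non-contextual check in Python. This is only an illustration under stated assumptions: the allowed punctuation in `EXTRA_NAME_CHARS` is a guess assembled from characters mentioned in this document (plus space and comma from the "James Smith-Faley, Jr." example), and the function name `is_plausible_name` is hypothetical. Normalizing to NFKC first means that compatibility variants, such as fullwidth forms, do not need to be listed separately.

```python
import unicodedata

# Assumption: an illustrative set of non-letter characters allowed in names.
# It would need tailoring for a given environment.
EXTRA_NAME_CHARS = set(
    " ,.'-\u2019\u00B7\u058A\u05F3\u05F4\u0F0B\u2010\u30A0\u30FB"
)


def is_plausible_name(text):
    """Lenient, non-language-specific check with no contextual tests."""
    # Fold compatibility variants (fullwidth, small forms) to their
    # ordinary counterparts before testing individual characters.
    normalized = unicodedata.normalize("NFKC", text)
    if not normalized.strip():
        return False
    for ch in normalized:
        if ch in EXTRA_NAME_CHARS:
            continue
        # Accept any letter (L*) or combining mark (M*).
        if unicodedata.category(ch)[0] in ("L", "M"):
            continue
        return False
    return True
```

A check like this accepts "di Silva" and "James Smith-Faley, Jr." and rejects "!#@♥≠". It says nothing about ordering or repetition of characters, so it is not sufficient where security is at issue.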
U+0027 ( ' ) APOSTROPHE
U+002E ( . ) FULL STOP
U+003A ( : ) COLON
U+005F ( _ ) LOW LINE
U+00B7 ( · ) MIDDLE DOT
U+0387 ( · ) GREEK ANO TELEIA
U+05F3 ( ׳ ) HEBREW PUNCTUATION GERESH
U+05F4 ( ״ ) HEBREW PUNCTUATION GERSHAYIM
U+2018 ( ‘ ) LEFT SINGLE QUOTATION MARK
U+2019 ( ’ ) RIGHT SINGLE QUOTATION MARK
U+2024 ( ․ ) ONE DOT LEADER
U+2027 ( ‧ ) HYPHENATION POINT
U+203F ( ‿ ) UNDERTIE
U+2040 ( ⁀ ) CHARACTER TIE
U+2054 ( ⁔ ) INVERTED UNDERTIE
U+FE13 ( ︓ ) PRESENTATION FORM FOR VERTICAL COLON
U+FE33 ( ︳ ) PRESENTATION FORM FOR VERTICAL LOW LINE
U+FE34 ( ︴ ) PRESENTATION FORM FOR VERTICAL WAVY LOW LINE
U+FE4D ( ﹍ ) DASHED LOW LINE
U+FE4E ( ﹎ ) CENTRELINE LOW LINE
U+FE4F ( ﹏ ) WAVY LOW LINE
U+FE52 ( ﹒ ) SMALL FULL STOP
U+FE55 ( ﹕ ) SMALL COLON
U+FF07 ( ＇ ) FULLWIDTH APOSTROPHE
U+FF0E ( ． ) FULLWIDTH FULL STOP
U+FF1A ( ： ) FULLWIDTH COLON
U+FF3F ( ＿ ) FULLWIDTH LOW LINE
0027 (') APOSTROPHE
002D (-) HYPHEN-MINUS
002E (.) FULL STOP
003A (:) COLON
00B7 (·) MIDDLE DOT
058A (֊) ARMENIAN HYPHEN
05F3 (׳) HEBREW PUNCTUATION GERESH
05F4 (״) HEBREW PUNCTUATION GERSHAYIM
0F0B ( ་ ) TIBETAN MARK INTERSYLLABIC TSHEG
200C () ZERO WIDTH NON-JOINER*
200D () ZERO WIDTH JOINER*
2010 (‐) HYPHEN
2019 (’) RIGHT SINGLE QUOTATION MARK
2027 (‧) HYPHENATION POINT
30A0 (゠) KATAKANA-HIRAGANA DOUBLE HYPHEN
30FB ( ・ ) KATAKANA MIDDLE DOT
The following are the word characters from #29, minus Cf:
[\p{alpha}\p{WB=Extend}\p{WB=FO}\p{WB=LE}\p{WB=ML}\p{WB=MB}\p{WB=EX}-\p{cf}]
And the Identifier characters from #31 (including Table 1):
[[:L:][:Nl:][:Mn:][:Mc:][\u0027\u002D\u002E\u003A\u00B7\u058A\u05F3
\u05F4\u0F0B\u200C\u200D\u2010\u2019\u2027\u30A0\u30FB][:Pc:]
-[:Pattern_Syntax:]
-[:Pattern_White_Space:]]
Here are Word characters minus Identifier characters.
U+0027 ( ' ) APOSTROPHE
U+002E ( . ) FULL STOP
U+003A ( : ) COLON
U+0387 ( · ) GREEK ANO TELEIA
U+0488 ( ҈ ) COMBINING CYRILLIC HUNDRED THOUSANDS SIGN
U+0489 ( ҉ ) COMBINING CYRILLIC MILLIONS SIGN
U+06DE ( ۞ ) ARABIC START OF RUB EL HIZB
U+2018 ( ‘ ) LEFT SINGLE QUOTATION MARK
U+2019 ( ’ ) RIGHT SINGLE QUOTATION MARK
U+2024 ( ․ ) ONE DOT LEADER
U+2027 ( ‧ ) HYPHENATION POINT
U+20DD ( ⃝ ) COMBINING ENCLOSING CIRCLE
U+20DE ( ⃞ ) COMBINING ENCLOSING SQUARE
U+20DF ( ⃟ ) COMBINING ENCLOSING DIAMOND
U+20E0 ( ⃠ ) COMBINING ENCLOSING CIRCLE BACKSLASH
U+20E2 ( ⃢ ) COMBINING ENCLOSING SCREEN
U+20E3 ( ⃣ ) COMBINING ENCLOSING KEYCAP
U+20E4 ( ⃤ ) COMBINING ENCLOSING UPWARD POINTING TRIANGLE
U+24B6 ( Ⓐ ) CIRCLED LATIN CAPITAL LETTER A
..
U+24E9 ( ⓩ ) CIRCLED LATIN SMALL LETTER Z
U+2E2F ( ⸯ ) VERTICAL TILDE
U+A670 ( ꙰ ) COMBINING CYRILLIC TEN MILLIONS SIGN
U+A671 ( ꙱ ) COMBINING CYRILLIC HUNDRED MILLIONS SIGN
U+A672 ( ꙲ ) COMBINING CYRILLIC THOUSAND MILLIONS SIGN
U+FE13 ( ︓ ) PRESENTATION FORM FOR VERTICAL COLON
U+FE52 ( ﹒ ) SMALL FULL STOP
U+FE55 ( ﹕ ) SMALL COLON
U+FF07 ( ＇ ) FULLWIDTH APOSTROPHE
U+FF0E ( ． ) FULLWIDTH FULL STOP
U+FF1A ( ： ) FULLWIDTH COLON
And the Identifier Characters minus the Word Characters:
U+058A ( ֊ ) ARMENIAN HYPHEN
U+0F0B ( ་ ) TIBETAN MARK INTERSYLLABIC TSHEG
U+30A0 ( ゠ ) KATAKANA-HIRAGANA DOUBLE HYPHEN
U+30FB ( ・ ) KATAKANA MIDDLE DOT
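Difference listings like the two above can be produced mechanically once both sets are available as code points. The following Python sketch shows the idea on hypothetical samples: `word_sample` and `identifier_sample` contain only a few members taken from the lists in this document, not the full sets, and the helper name `listing` is made up for this example.

```python
import unicodedata


def listing(code_points):
    """Format code points in the style used in this document,
    sorted by code point value."""
    lines = []
    for cp in sorted(code_points):
        ch = chr(cp)
        name = unicodedata.name(ch, "<no name>")
        lines.append(f"U+{cp:04X} ( {ch} ) {name}")
    return lines


# Hypothetical samples: a few members of each set, taken from the lists above.
word_sample = {0x0027, 0x00B7, 0x0387, 0x05F3}
identifier_sample = {0x00B7, 0x05F3, 0x058A, 0x30FB}

word_minus_identifier = listing(word_sample - identifier_sample)
identifier_minus_word = listing(identifier_sample - word_sample)

for line in identifier_minus_word:
    print(line)
```

Running the same two subtractions over the full property-based sets would require the Word_Break and Pattern_Syntax data, which the Python standard library does not provide.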