Personal Name Validation Characters
L2/09-232R
From: Mark Davis
Date: 2009-08-14 (revised)
Proposal
In U5.2, add a section to UAX#29 describing that the following characters are candidates for tailoring to add to MidLetter. [Use language consistent with the other bullets]
[\-\u058A\u0F0B\u1806\u2010\u2011\u2E17\u30A0\u30FB\uFE63\uFF0D]
U+002D ( - ) HYPHEN-MINUS
U+058A ( ֊ ) ARMENIAN HYPHEN
U+0F0B ( ་ ) TIBETAN MARK INTERSYLLABIC TSHEG
U+1806 ( ᠆ ) MONGOLIAN TODO SOFT HYPHEN
U+2010 ( ‐ ) HYPHEN
U+2011 ( ‑ ) NON-BREAKING HYPHEN
U+2E17 ( ⸗ ) DOUBLE OBLIQUE HYPHEN
U+30A0 ( ゠ ) KATAKANA-HIRAGANA DOUBLE HYPHEN
U+30FB ( ・ ) KATAKANA MIDDLE DOT
U+FE63 ( ﹣ ) SMALL HYPHEN-MINUS
U+FF0D ( － ) FULLWIDTH HYPHEN-MINUS
In U5.2, add bulleted text to UAX#29 discussing name validation characters and giving guidelines for adding characters to the word characters allowed above.
In U5.2, add pointers and text in UAX#31 and in the word break section of UAX#29 indicating the relationship between identifiers and words (and noting that the character sets are not the same).
Add a test for consistency between the WB properties and Table 3 (with the known exceptions) to the invariant tests.
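As a rough illustration of the first item, the sketch below approximates the effect of tailoring the candidate characters into MidLetter: a candidate character keeps a word together only when it sits between two letters. This is not the UAX#29 algorithm. The function name split_words and the regex stand-in for "letter" are assumptions made for this example.

```python
import re

# Candidate characters from the list in the first proposal item.
MID_LETTER_CANDIDATES = (
    "-\u058A\u0F0B\u1806\u2010\u2011\u2E17\u30A0\u30FB\uFE63\uFF0D"
)

# Rough stand-in for "letter": \w minus digits and underscore.
# Combining marks are not covered, so this is only an approximation.
_LETTER = r"[^\W\d_]"


def split_words(text, mid_letters=""):
    """Return runs of letters, optionally joined across mid_letters."""
    if mid_letters:
        joiner = "[" + re.escape(mid_letters) + "]"
        pattern = f"{_LETTER}+(?:{joiner}{_LETTER}+)*"
    else:
        pattern = f"{_LETTER}+"
    return re.findall(pattern, text)


# Untailored: the hyphen splits the name into two words.
print(split_words("Smith-Faley arrived"))
# Tailored: the hyphen behaves like MidLetter and the name stays whole.
print(split_words("Smith-Faley arrived", MID_LETTER_CANDIDATES))
```

A trailing candidate character (as in "well- done") does not join anything, because the pattern requires a letter on both sides.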
====
For entry field validation, implementations sometimes need to know which characters can occur in personal names. While it is a bit fuzzy exactly what this means, they want to distinguish between characters like those in "James Smith-Faley, Jr." and those in "!#@♥≠". Note that it is important to be reasonably lenient: it is extremely annoying for people not to be able to add legitimate names, like "di Silva", because those names have characters like space.
Typically, these personal name validations should not be language-specific; I might be using a website in a language other than the one for my name, for example. While a more sophisticated validation might use context among characters, a basic validation just wants to know "what characters can be part of names?". The text should explain that:
It is only a guideline, and may need tailoring for different environments
It is a lenient, non-language-specific set; for language-specific characters, one should see CLDR.
Mention characters:
It includes characters that may not be appropriate for identifiers, and those that would not be parts of words.
It does not include contextual tests
Additional tests may be needed in cases where security is at issue.
The set can be narrowed if name fields are split out. For example, "," may not be necessary if titles are split out; if titles are not allowed, "." may not be necessary.
The word characters include some characters that may be part of words in a broad sense, such as the ":" in Swedish "c:a" or the hyphenation points in a dictionary word, but that would not normally be part of names.
Explain the use of NFKC in name validation
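The points above can be sketched as a basic, non-contextual check. This is a minimal illustration under stated assumptions, not proposed text: the function name is_plausible_name and the contents of EXTRA_NAME_CHARS are choices made for this example, drawn from characters discussed in this document. The input is normalized to NFKC first so that compatibility variants (fullwidth, small, and vertical forms) are tested as their ordinary counterparts.

```python
import unicodedata

# Illustrative extras beyond letters and marks (an assumption, not a
# recommended set): space, apostrophe, period, comma, hyphen-minus,
# and a few characters from the lists in this document.
EXTRA_NAME_CHARS = set(
    " '.,-\u00B7\u058A\u05F3\u05F4\u0F0B\u2010\u2019\u30A0\u30FB"
)


def is_plausible_name(text):
    """Lenient, language-independent check with no contextual tests."""
    normalized = unicodedata.normalize("NFKC", text)
    if not normalized.strip():
        return False  # empty or whitespace-only input
    for ch in normalized:
        if ch in EXTRA_NAME_CHARS:
            continue
        # Accept any letter (L*) or combining mark (M*).
        if unicodedata.category(ch)[0] in ("L", "M"):
            continue
        return False
    return True


print(is_plausible_name("James Smith-Faley, Jr."))
print(is_plausible_name("di Silva"))
print(is_plausible_name("!#@\u2665\u2260"))
```

If titles and suffixes are collected in separate fields, "," and "." could be dropped from EXTRA_NAME_CHARS, as noted above. Where security is at issue, a check like this would need to be supplemented with additional tests.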
Background Information
Characters added by Word Boundaries
Basic Latin - ASCII punctuation and symbols
U+0027 ( ' ) APOSTROPHE
U+002E ( . ) FULL STOP
U+003A ( : ) COLON
U+005F ( _ ) LOW LINE
Latin 1 Supplement - Latin-1 punctuation and symbols
U+00B7 ( · ) MIDDLE DOT
Greek And Coptic - Punctuation
U+0387 ( · ) GREEK ANO TELEIA
Hebrew - Additional punctuation
U+05F3 ( ׳ ) HEBREW PUNCTUATION GERESH
U+05F4 ( ״ ) HEBREW PUNCTUATION GERSHAYIM
General Punctuation - General punctuation
U+2018 ( ‘ ) LEFT SINGLE QUOTATION MARK
U+2019 ( ’ ) RIGHT SINGLE QUOTATION MARK
U+2024 ( ․ ) ONE DOT LEADER
U+2027 ( ‧ ) HYPHENATION POINT
U+203F ( ‿ ) UNDERTIE
U+2040 ( ⁀ ) CHARACTER TIE
U+2054 ( ⁔ ) INVERTED UNDERTIE
Vertical Forms - Glyphs for vertical variants
U+FE13 ( ︓ ) PRESENTATION FORM FOR VERTICAL COLON
CJK Compatibility Forms - Glyphs for vertical variants
U+FE33 ( ︳ ) PRESENTATION FORM FOR VERTICAL LOW LINE
U+FE34 ( ︴ ) PRESENTATION FORM FOR VERTICAL WAVY LOW LINE
CJK Compatibility Forms - Overscores and underscores
U+FE4D ( ﹍ ) DASHED LOW LINE
U+FE4E ( ﹎ ) CENTRELINE LOW LINE
U+FE4F ( ﹏ ) WAVY LOW LINE
Small Form Variants - Small form variants
U+FE52 ( ﹒ ) SMALL FULL STOP
U+FE55 ( ﹕ ) SMALL COLON
Halfwidth And Fullwidth Forms - Fullwidth ASCII variants
U+FF07 ( ＇ ) FULLWIDTH APOSTROPHE
U+FF0E ( ． ) FULLWIDTH FULL STOP
U+FF1A ( ： ) FULLWIDTH COLON
U+FF3F ( ＿ ) FULLWIDTH LOW LINE
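Many of the characters in the list above are compatibility variants of a few basic ones, which is one reason NFKC is relevant to name validation. The sketch below folds the list through NFKC using Python's unicodedata module and prints what remains. The code points are hand-copied from the list above; the names WORD_BOUNDARY_ADDITIONS and fold_nfkc are assumptions made for this example.

```python
import unicodedata

# Code points hand-copied from "Characters added by Word Boundaries".
WORD_BOUNDARY_ADDITIONS = [
    0x0027, 0x002E, 0x003A, 0x005F, 0x00B7, 0x0387, 0x05F3, 0x05F4,
    0x2018, 0x2019, 0x2024, 0x2027, 0x203F, 0x2040, 0x2054,
    0xFE13, 0xFE33, 0xFE34, 0xFE4D, 0xFE4E, 0xFE4F, 0xFE52, 0xFE55,
    0xFF07, 0xFF0E, 0xFF1A, 0xFF3F,
]


def fold_nfkc(code_points):
    """Return the set of characters left after NFKC-normalizing each one."""
    folded = set()
    for cp in code_points:
        folded.update(unicodedata.normalize("NFKC", chr(cp)))
    return folded


folded = fold_nfkc(WORD_BOUNDARY_ADDITIONS)
for ch in sorted(folded):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

After folding, the fullwidth, small, and vertical forms no longer appear as separate entries, so a validator that normalizes its input only needs to list the basic characters.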
Table 3. Candidate Characters for Inclusion in Identifiers
0027 (') APOSTROPHE
002D (-) HYPHEN-MINUS
002E (.) FULL STOP
003A (:) COLON
00B7 (·) MIDDLE DOT
058A (֊) ARMENIAN HYPHEN
05F3 (׳) HEBREW PUNCTUATION GERESH
05F4 (״) HEBREW PUNCTUATION GERSHAYIM
0F0B ( ་ ) TIBETAN MARK INTERSYLLABIC TSHEG
200C () ZERO WIDTH NON-JOINER*
200D () ZERO WIDTH JOINER*
2010 (‐) HYPHEN
2019 (’) RIGHT SINGLE QUOTATION MARK
2027 (‧) HYPHENATION POINT
30A0 (゠) KATAKANA-HIRAGANA DOUBLE HYPHEN
30FB ( ・ ) KATAKANA MIDDLE DOT
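The consistency test mentioned in the proposal could look roughly like the following. This is a sketch built on hand-copied data rather than the real property files: the sets are transcribed from the two lists above, the helper name table3_gaps is hypothetical, and treating the two Cf characters as the known exceptions is an assumption made for this example.

```python
# Table 3 code points, hand-copied from the list above.
TABLE_3 = {
    0x0027, 0x002D, 0x002E, 0x003A, 0x00B7, 0x058A, 0x05F3, 0x05F4,
    0x0F0B, 0x200C, 0x200D, 0x2010, 0x2019, 0x2027, 0x30A0, 0x30FB,
}

# Hand-copied from "Characters added by Word Boundaries".
WORD_BOUNDARY_ADDITIONS = {
    0x0027, 0x002E, 0x003A, 0x005F, 0x00B7, 0x0387, 0x05F3, 0x05F4,
    0x2018, 0x2019, 0x2024, 0x2027, 0x203F, 0x2040, 0x2054,
    0xFE13, 0xFE33, 0xFE34, 0xFE4D, 0xFE4E, 0xFE4F, 0xFE52, 0xFE55,
    0xFF07, 0xFF0E, 0xFF1A, 0xFF3F,
}

# Assumed exceptions: ZWNJ and ZWJ, both general category Cf.
KNOWN_EXCEPTIONS = {0x200C, 0x200D}

# The characters proposed as MidLetter tailoring candidates.
PROPOSED_MID_LETTER = {
    0x002D, 0x058A, 0x0F0B, 0x1806, 0x2010, 0x2011, 0x2E17,
    0x30A0, 0x30FB, 0xFE63, 0xFF0D,
}


def table3_gaps(table3, word_chars, known_exceptions=frozenset()):
    """Return Table 3 code points that the word characters do not cover."""
    return sorted(table3 - word_chars - known_exceptions)


gaps = table3_gaps(TABLE_3, WORD_BOUNDARY_ADDITIONS, KNOWN_EXCEPTIONS)
print([f"U+{cp:04X}" for cp in gaps])
# Every remaining gap is one of the proposed tailoring candidates.
print(set(gaps) <= PROPOSED_MID_LETTER)
```

An actual invariant test would read the Word_Break property values from the Unicode Character Database instead of relying on transcribed lists.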
Word Characters vs Identifier Characters
The following are the word characters from #29, minus Cf:
[\p{alpha}\p{WB=Extend}\p{WB=FO}\p{WB=LE}\p{WB=ML}\p{WB=MB}\p{WB=EX}-\p{cf}]
And the Identifier characters from #31 (including table 1)
[[:L:][:Nl:][:Mn:][:Mc:][\u0027\u002D\u002E\u003A\u00B7\u058A\u05F3
\u05F4\u0F0B\u200C\u200D\u2010\u2019\u2027\u30A0\u30FB][:Pc:]
-[:Pattern_Syntax:]
-[:Pattern_White_Space:]]]
Here are Word characters minus Identifier characters.
Basic Latin - ASCII punctuation and symbols
U+0027 ( ' ) APOSTROPHE
U+002E ( . ) FULL STOP
U+003A ( : ) COLON
Greek And Coptic - Punctuation
U+0387 ( · ) GREEK ANO TELEIA
Cyrillic - Historic miscellaneous
U+0488 ( ҈ ) COMBINING CYRILLIC HUNDRED THOUSANDS SIGN
U+0489 ( ҉ ) COMBINING CYRILLIC MILLIONS SIGN
Arabic - Koranic annotation signs
U+06DE ( ۞ ) ARABIC START OF RUB EL HIZB
General Punctuation - General punctuation
U+2018 ( ‘ ) LEFT SINGLE QUOTATION MARK
U+2019 ( ’ ) RIGHT SINGLE QUOTATION MARK
U+2024 ( ․ ) ONE DOT LEADER
U+2027 ( ‧ ) HYPHENATION POINT
Combining Diacritical Marks For Symbols - Enclosing diacritics
U+20DD ( ⃝ ) COMBINING ENCLOSING CIRCLE
U+20DE ( ⃞ ) COMBINING ENCLOSING SQUARE
U+20DF ( ⃟ ) COMBINING ENCLOSING DIAMOND
U+20E0 ( ⃠ ) COMBINING ENCLOSING CIRCLE BACKSLASH
Combining Diacritical Marks For Symbols - Additional enclosing diacritics
U+20E2 ( ⃢ ) COMBINING ENCLOSING SCREEN
U+20E3 ( ⃣ ) COMBINING ENCLOSING KEYCAP
U+20E4 ( ⃤ ) COMBINING ENCLOSING UPWARD POINTING TRIANGLE
Enclosed Alphanumerics - Circled Latin letters
U+24B6 ( Ⓐ ) CIRCLED LATIN CAPITAL LETTER A
..
U+24E9 ( ⓩ ) CIRCLED LATIN SMALL LETTER Z
Supplemental Punctuation - Medievalist punctuation
U+2E2F ( ⸯ ) VERTICAL TILDE
Cyrillic Extended B - Combining numeric signs
U+A670 ( ꙰ ) COMBINING CYRILLIC TEN MILLIONS SIGN
U+A671 ( ꙱ ) COMBINING CYRILLIC HUNDRED MILLIONS SIGN
U+A672 ( ꙲ ) COMBINING CYRILLIC THOUSAND MILLIONS SIGN
Vertical Forms - Glyphs for vertical variants
U+FE13 ( ︓ ) PRESENTATION FORM FOR VERTICAL COLON
Small Form Variants - Small form variants
U+FE52 ( ﹒ ) SMALL FULL STOP
U+FE55 ( ﹕ ) SMALL COLON
Halfwidth And Fullwidth Forms - Fullwidth ASCII variants
U+FF07 ( ＇ ) FULLWIDTH APOSTROPHE
U+FF0E ( ． ) FULLWIDTH FULL STOP
U+FF1A ( ： ) FULLWIDTH COLON
And the Identifier Characters minus the Word Characters
Armenian - Punctuation
U+058A ( ֊ ) ARMENIAN HYPHEN
Tibetan - Marks and signs
U+0F0B ( ་ ) TIBETAN MARK INTERSYLLABIC TSHEG
Katakana - Katakana punctuation
U+30A0 ( ゠ ) KATAKANA-HIRAGANA DOUBLE HYPHEN
Katakana - Conjunction and length marks
U+30FB ( ・ ) KATAKANA MIDDLE DOT