Personal Name Validation Characters

L2/09-232R

From: Mark Davis

Date: 2009-08-14 (revised)

Proposal

    1. In U5.2, add a section to UAX#29 describing that the following characters are candidates for tailoring to add to MidLetter. [Use language consistent with the other bullets]

      • [\-\u058A\u0F0B\u1806\u2010\u2011\u2E17\u30A0\u30FB\uFE63\uFF0D]

      • U+002D ( - ) HYPHEN-MINUS

      • U+058A ( ֊ ) ARMENIAN HYPHEN

      • U+0F0B ( ་ ) TIBETAN MARK INTERSYLLABIC TSHEG

      • U+1806 ( ᠆ ) MONGOLIAN TODO SOFT HYPHEN

      • U+2010 ( ‐ ) HYPHEN

      • U+2011 ( ‑ ) NON-BREAKING HYPHEN

      • U+2E17 ( ⸗ ) DOUBLE OBLIQUE HYPHEN

      • U+30A0 ( ゠ ) KATAKANA-HIRAGANA DOUBLE HYPHEN

      • U+30FB ( ・ ) KATAKANA MIDDLE DOT

      • U+FE63 ( ﹣ ) SMALL HYPHEN-MINUS

      • U+FF0D ( － ) FULLWIDTH HYPHEN-MINUS

    2. In U5.2, add bulleted text to UAX#29 discussing name validation characters and giving guidelines for adding characters to the word characters allowed above.

    3. In U5.2, add pointers in UAX#31 and in the word break text of UAX#29 indicating the relationship between identifiers and words (and that the character sets are not the same).

    4. Add a test for consistency between the WB properties and Table 3 (with the known exceptions) to the invariant tests.
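To illustrate what tailoring MidLetter means in item 1, the following is a minimal sketch, not the UAX#29 algorithm itself. It models only the rule that a MidLetter character between two letters does not cause a word break, and it approximates "letter" with Python's `str.isalpha()` plus trailing combining marks. The names `split_words` and `TAILORED_MID_LETTERS` are hypothetical.

```python
import unicodedata

# The tailoring candidates listed in item 1 of the proposal.
TAILORED_MID_LETTERS = frozenset(
    "\u002D\u058A\u0F0B\u1806\u2010\u2011\u2E17\u30A0\u30FB\uFE63\uFF0D"
)


def split_words(text, mid_letters):
    """Split text into words, keeping a mid-letter character inside a word
    only when it sits directly between two letters (simplified model)."""
    words, current = [], []
    for i, ch in enumerate(text):
        is_mark = unicodedata.category(ch).startswith("M")
        if ch.isalpha() or (is_mark and current):
            # Letters, and combining marks that follow a letter, extend the word.
            current.append(ch)
        elif (ch in mid_letters and current
              and i + 1 < len(text) and text[i + 1].isalpha()):
            # Letter, mid-letter character, letter: no break on either side.
            current.append(ch)
        else:
            # Anything else ends the current word.
            if current:
                words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


print(split_words("Smith-Faley", frozenset()))           # ['Smith', 'Faley']
print(split_words("Smith-Faley", TAILORED_MID_LETTERS))  # ['Smith-Faley']
```

Without the tailoring, the hyphen splits the name into two words; with it, the hyphenated name is treated as a single word.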

====

For entry field validation, implementations sometimes need to know which characters can occur in personal names. While it is a bit fuzzy exactly what this means, they want to distinguish between characters like those in "James Smith-Faley, Jr." and those in "!#@♥≠". Note that it is important to be reasonably lenient: it is extremely annoying for people not to be able to add legitimate names, like "di Silva", because those names have characters like space.

Typically, these personal name validations should not be language-specific; I might be using a website in a language other than the one for my name, for example. While a more sophisticated validation might use context among characters, a basic validation just wants to know "what characters can be part of names?". The text should explain that:

    1. It is only a guideline, and may need tailoring for different environments

    2. It is a lenient, non-language-specific set - for language-specific characters one should see CLDR.

    3. Mention characters:

      • [,.[:whitespace:]]

      • U+002C ( , ) COMMA

      • U+002E ( . ) FULL STOP

      • [:whitespace:]

    4. It includes characters that may not be appropriate for identifiers, and those that would not be parts of words.

    5. It does not include contextual tests

    6. Additional tests may be needed in cases where security is at issue.

    7. The set can be narrowed if name fields are split out. For example, "," may not be necessary if titles are split out; if titles are not allowed, "." may not be necessary.

    8. The word characters contain some characters that may be part of words in a broad sense, such as the colon in Swedish "c:a" or the hyphenation points in a dictionary word, that might not normally be part of names.

    9. Explain the use of NFKC in name validation
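The guidelines above can be sketched as a basic, context-free check. This is only an illustration under stated assumptions: it approximates the word characters with general categories L* and M* from Python's `unicodedata` module (which has no Word_Break property), it uses `str.isspace()` as a stand-in for [:whitespace:], and it adds the apostrophe characters from the word boundary lists. The function names are hypothetical, and the set would need tailoring for a real environment.

```python
import unicodedata

# Hyphen-like characters proposed above as MidLetter tailoring candidates.
HYPHEN_CANDIDATES = frozenset(
    "\u002D\u058A\u0F0B\u1806\u2010\u2011\u2E17\u30A0\u30FB\uFE63\uFF0D"
)
# Comma and full stop from item 3, plus apostrophes (an assumption of this sketch).
NAME_PUNCTUATION = frozenset(",.'\u2019")


def is_name_character(ch):
    """Lenient, non-language-specific test for a single character."""
    if ch in HYPHEN_CANDIDATES or ch in NAME_PUNCTUATION:
        return True
    if ch.isspace():
        return True
    # Letters (L*) and combining marks (M*) approximate the word characters.
    return unicodedata.category(ch)[0] in ("L", "M")


def is_plausible_name(text):
    """Normalize to NFKC, then require every character to be allowed.
    No contextual or security checks are performed."""
    normalized = unicodedata.normalize("NFKC", text)
    if not normalized.strip():
        return False
    return all(is_name_character(ch) for ch in normalized)


print(is_plausible_name("James Smith-Faley, Jr."))  # True
print(is_plausible_name("di Silva"))                # True
print(is_plausible_name("!#@♥≠"))                   # False
```

Normalizing to NFKC first folds compatibility variants such as U+FF0D FULLWIDTH HYPHEN-MINUS to their ordinary counterparts, so the allowed set can stay small.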

Background Information

Characters added by Word Boundaries

Basic Latin - ASCII punctuation and symbols

U+0027 ( ' ) APOSTROPHE

U+002E ( . ) FULL STOP

U+003A ( : ) COLON

U+005F ( _ ) LOW LINE

Latin 1 Supplement - Latin-1 punctuation and symbols

U+00B7 ( · ) MIDDLE DOT

Greek And Coptic - Punctuation

U+0387 ( · ) GREEK ANO TELEIA

Hebrew - Additional punctuation

U+05F3 ( ‎׳‎ ) HEBREW PUNCTUATION GERESH

U+05F4 ( ‎״‎ ) HEBREW PUNCTUATION GERSHAYIM

General Punctuation - General punctuation

U+2018 ( ‘ ) LEFT SINGLE QUOTATION MARK

U+2019 ( ’ ) RIGHT SINGLE QUOTATION MARK

U+2024 ( ․ ) ONE DOT LEADER

U+2027 ( ‧ ) HYPHENATION POINT

U+203F ( ‿ ) UNDERTIE

U+2040 ( ⁀ ) CHARACTER TIE

U+2054 ( ⁔ ) INVERTED UNDERTIE

Vertical Forms - Glyphs for vertical variants

U+FE13 ( ︓ ) PRESENTATION FORM FOR VERTICAL COLON

CJK Compatibility Forms - Glyphs for vertical variants

U+FE33 ( ︳ ) PRESENTATION FORM FOR VERTICAL LOW LINE

U+FE34 ( ︴ ) PRESENTATION FORM FOR VERTICAL WAVY LOW LINE

CJK Compatibility Forms - Overscores and underscores

U+FE4D ( ﹍ ) DASHED LOW LINE

U+FE4E ( ﹎ ) CENTRELINE LOW LINE

U+FE4F ( ﹏ ) WAVY LOW LINE

Small Form Variants - Small form variants

U+FE52 ( ﹒ ) SMALL FULL STOP

U+FE55 ( ﹕ ) SMALL COLON

Halfwidth And Fullwidth Forms - Fullwidth ASCII variants

U+FF07 ( ＇ ) FULLWIDTH APOSTROPHE

U+FF0E ( ． ) FULLWIDTH FULL STOP

U+FF1A ( ： ) FULLWIDTH COLON

U+FF3F ( ＿ ) FULLWIDTH LOW LINE

Table 3. Candidate Characters for Inclusion in Identifiers

0027 (') APOSTROPHE

002D (-) HYPHEN-MINUS

002E (.) FULL STOP

003A (:) COLON

00B7 (·) MIDDLE DOT

058A (֊) ARMENIAN HYPHEN

05F3 (׳) HEBREW PUNCTUATION GERESH

05F4 (״) HEBREW PUNCTUATION GERSHAYIM

0F0B ( ་ ) TIBETAN MARK INTERSYLLABIC TSHEG

200C () ZERO WIDTH NON-JOINER*

200D () ZERO WIDTH JOINER*

2010 (‐) HYPHEN

2019 (’) RIGHT SINGLE QUOTATION MARK

2027 (‧) HYPHENATION POINT

30A0 (゠) KATAKANA-HIRAGANA DOUBLE HYPHEN

30FB ( ・ ) KATAKANA MIDDLE DOT
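The consistency test requested in item 4 of the proposal could look roughly like the sketch below. It works only on the code points transcribed from this document, not on the real Word_Break property data. Which characters count as "known exceptions" is an assumption here: the MidLetter tailoring candidates from item 1 and format characters (General_Category Cf, the starred entries in Table 3). The function name is hypothetical.

```python
import unicodedata

# Transcribed from "Characters added by Word Boundaries" in this document.
WORD_BOUNDARY_ADDITIONS = {
    0x0027, 0x002E, 0x003A, 0x005F, 0x00B7, 0x0387, 0x05F3, 0x05F4,
    0x2018, 0x2019, 0x2024, 0x2027, 0x203F, 0x2040, 0x2054,
    0xFE13, 0xFE33, 0xFE34, 0xFE4D, 0xFE4E, 0xFE4F, 0xFE52, 0xFE55,
    0xFF07, 0xFF0E, 0xFF1A, 0xFF3F,
}

# Transcribed from Table 3 above.
TABLE_3 = {
    0x0027, 0x002D, 0x002E, 0x003A, 0x00B7, 0x058A, 0x05F3, 0x05F4,
    0x0F0B, 0x200C, 0x200D, 0x2010, 0x2019, 0x2027, 0x30A0, 0x30FB,
}

# Tailoring candidates from item 1 (assumed to be known exceptions).
TAILORING_CANDIDATES = {
    0x002D, 0x058A, 0x0F0B, 0x1806, 0x2010, 0x2011, 0x2E17,
    0x30A0, 0x30FB, 0xFE63, 0xFF0D,
}


def unexplained_table3_characters():
    """Return Table 3 code points that are neither word-boundary additions
    nor covered by the assumed exceptions."""
    missing = TABLE_3 - WORD_BOUNDARY_ADDITIONS
    return {
        cp for cp in missing
        if cp not in TAILORING_CANDIDATES
        and unicodedata.category(chr(cp)) != "Cf"
    }


# The invariant holds for the lists in this document.
assert unexplained_table3_characters() == set()
```

An invariant test of this shape fails as soon as a character is added to Table 3 without either a matching Word_Break assignment or an entry in the exception list.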

Word Characters vs Identifier Characters

The following are the word characters from #29, minus Cf:

[\p{alpha}\p{WB=Extend}\p{WB=FO}\p{WB=LE}\p{WB=ML}\p{WB=MB}\p{WB=EX}-\p{cf}]

And the Identifier characters from #31 (including table 1)

[[:L:][:Nl:][:Mn:][:Mc:][\u0027\u002D\u002E\u003A\u00B7\u058A\u05F3

\u05F4\u0F0B\u200C\u200D\u2010\u2019\u2027\u30A0\u30FB][:Pc:]

-[:Pattern_Syntax:]

-[:Pattern_White_Space:]]

Here are Word characters minus Identifier characters.

Basic Latin - ASCII punctuation and symbols

U+0027 ( ' ) APOSTROPHE

U+002E ( . ) FULL STOP

U+003A ( : ) COLON

Greek And Coptic - Punctuation

U+0387 ( · ) GREEK ANO TELEIA

Cyrillic - Historic miscellaneous

U+0488 ( ҈ ) COMBINING CYRILLIC HUNDRED THOUSANDS SIGN

U+0489 ( ҉ ) COMBINING CYRILLIC MILLIONS SIGN

Arabic - Koranic annotation signs

U+06DE ( ۞ ) ARABIC START OF RUB EL HIZB

General Punctuation - General punctuation

U+2018 ( ‘ ) LEFT SINGLE QUOTATION MARK

U+2019 ( ’ ) RIGHT SINGLE QUOTATION MARK

U+2024 ( ․ ) ONE DOT LEADER

U+2027 ( ‧ ) HYPHENATION POINT

Combining Diacritical Marks For Symbols - Enclosing diacritics

U+20DD ( ⃝ ) COMBINING ENCLOSING CIRCLE

U+20DE ( ⃞ ) COMBINING ENCLOSING SQUARE

U+20DF ( ⃟ ) COMBINING ENCLOSING DIAMOND

U+20E0 ( ⃠ ) COMBINING ENCLOSING CIRCLE BACKSLASH

Combining Diacritical Marks For Symbols - Additional enclosing diacritics

U+20E2 ( ⃢ ) COMBINING ENCLOSING SCREEN

U+20E3 ( ⃣ ) COMBINING ENCLOSING KEYCAP

U+20E4 ( ⃤ ) COMBINING ENCLOSING UPWARD POINTING TRIANGLE

Enclosed Alphanumerics - Circled Latin letters

U+24B6 ( Ⓐ ) CIRCLED LATIN CAPITAL LETTER A

..

U+24E9 ( ⓩ ) CIRCLED LATIN SMALL LETTER Z

Supplemental Punctuation - Medievalist punctuation

U+2E2F ( ⸯ ) VERTICAL TILDE

Cyrillic Extended B - Combining numeric signs

U+A670 ( ꙰ ) COMBINING CYRILLIC TEN MILLIONS SIGN

U+A671 ( ꙱ ) COMBINING CYRILLIC HUNDRED MILLIONS SIGN

U+A672 ( ꙲ ) COMBINING CYRILLIC THOUSAND MILLIONS SIGN

Vertical Forms - Glyphs for vertical variants

U+FE13 ( ︓ ) PRESENTATION FORM FOR VERTICAL COLON

Small Form Variants - Small form variants

U+FE52 ( ﹒ ) SMALL FULL STOP

U+FE55 ( ﹕ ) SMALL COLON

Halfwidth And Fullwidth Forms - Fullwidth ASCII variants

U+FF07 ( ＇ ) FULLWIDTH APOSTROPHE

U+FF0E ( ． ) FULLWIDTH FULL STOP

U+FF1A ( ： ) FULLWIDTH COLON

And the Identifier Characters minus the Word Characters

Armenian - Punctuation

U+058A ( ֊ ) ARMENIAN HYPHEN

Tibetan - Marks and signs

U+0F0B ( ་ ) TIBETAN MARK INTERSYLLABIC TSHEG

Katakana - Katakana punctuation

U+30A0 ( ゠ ) KATAKANA-HIRAGANA DOUBLE HYPHEN

Katakana - Conjunction and length marks

U+30FB ( ・ ) KATAKANA MIDDLE DOT