L2/09-232R
From: Mark Davis
Date: 2009-08-14 (revised)

Proposal
For entry field validation, implementations sometimes need to know which characters can occur in personal names. While it is a bit fuzzy exactly what this means, they want to distinguish between characters like those in "James Smith-Faley, Jr." and those in "!#@♥≠". Note that it is important to be reasonably lenient: it is extremely annoying for people not to be able to add legitimate names, like "di Silva", because those names have characters like space. Typically, these personal name validations should not be language-specific; I might be using a website in a language other than the one for my name, for example. While a more sophisticated validation might use context among characters, a basic validation just wants to know "what characters can be part of names?". The text should explain that:
Background Information

Characters added by Word Boundaries

Basic Latin - ASCII punctuation and symbols
Latin 1 Supplement - Latin-1 punctuation and symbols
  U+00B7 ( · ) MIDDLE DOT
Greek And Coptic - Punctuation
  U+0387 ( · ) GREEK ANO TELEIA
Hebrew - Additional punctuation
General Punctuation - General punctuation
  U+2018 ( ‘ ) LEFT SINGLE QUOTATION MARK
  U+2019 ( ’ ) RIGHT SINGLE QUOTATION MARK
  U+2024 ( ․ ) ONE DOT LEADER
  U+2027 ( ‧ ) HYPHENATION POINT
  U+203F ( ‿ ) UNDERTIE
  U+2040 ( ⁀ ) CHARACTER TIE
  U+2054 ( ⁔ ) INVERTED UNDERTIE
Vertical Forms - Glyphs for vertical variants
  U+FE13 ( ︓ ) PRESENTATION FORM FOR VERTICAL COLON
CJK Compatibility Forms - Glyphs for vertical variants
  U+FE33 ( ︳ ) PRESENTATION FORM FOR VERTICAL LOW LINE
  U+FE34 ( ︴ ) PRESENTATION FORM FOR VERTICAL WAVY LOW LINE
CJK Compatibility Forms - Overscores and underscores
Small Form Variants - Small form variants
Halfwidth And Fullwidth Forms - Fullwidth ASCII variants
  U+FF07 ( ＇ ) FULLWIDTH APOSTROPHE
  U+FF0E ( ． ) FULLWIDTH FULL STOP
  U+FF1A ( ： ) FULLWIDTH COLON
  U+FF3F ( ＿ ) FULLWIDTH LOW LINE

Table 3. Candidate Characters for Inclusion in Identifiers
0027 (') APOSTROPHE
002D (-) HYPHEN-MINUS

Word Characters vs Identifier Characters

The following are the word characters from #31, minus Cf:

[\p{alpha}\p{WB=Extend}\p{WB=FO}\p{WB=LE}\p{WB=ML}\p{WB=MB}\p{WB=EX}-\p{cf}]

And the Identifier characters from #31 (including table 1):

[[:L:][:Nl:][:Mn:][:Mc:][\u0027\u002D\u002E\u003A\u00B7\u058A\u05F3\u05F4\u0F0B\u200C\u200D\u2010\u2019\u2027\u30A0\u30FB][:Pc:] -[:Pattern_Syntax:] -[:Pattern_White_Space:]]]

Here are Word characters minus Identifier characters.

Basic Latin - ASCII punctuation and symbols
Greek And Coptic - Punctuation
  U+0387 ( · ) GREEK ANO TELEIA
Cyrillic - Historic miscellaneous
  U+0488 ( ҈ ) COMBINING CYRILLIC HUNDRED THOUSANDS SIGN
  U+0489 ( ҉ ) COMBINING CYRILLIC MILLIONS SIGN
Arabic - Koranic annotation signs
  U+06DE ( ۞ ) ARABIC START OF RUB EL HIZB
General Punctuation - General punctuation
  U+2018 ( ‘ ) LEFT SINGLE QUOTATION MARK
  U+2019 ( ’ ) RIGHT SINGLE QUOTATION MARK
  U+2024 ( ․ ) ONE DOT LEADER
  U+2027 ( ‧ ) HYPHENATION POINT
Combining Diacritical Marks For Symbols - Enclosing diacritics
  U+20DD ( ⃝ ) COMBINING ENCLOSING CIRCLE
  U+20DE ( ⃞ ) COMBINING ENCLOSING SQUARE
  U+20DF ( ⃟ ) COMBINING ENCLOSING DIAMOND
  U+20E0 ( ⃠ ) COMBINING ENCLOSING CIRCLE BACKSLASH
Combining Diacritical Marks For Symbols - Additional enclosing diacritics
  U+20E2 ( ⃢ ) COMBINING ENCLOSING SCREEN
  U+20E3 ( ⃣ ) COMBINING ENCLOSING KEYCAP
  U+20E4 ( ⃤ ) COMBINING ENCLOSING UPWARD POINTING TRIANGLE
Enclosed Alphanumerics - Circled Latin letters
Supplemental Punctuation - Medievalist punctuation
  U+2E2F ( ⸯ ) VERTICAL TILDE
Cyrillic Extended B - Combining numeric signs
  U+A670 ( ꙰ ) COMBINING CYRILLIC TEN MILLIONS SIGN
  U+A671 ( ꙱ ) COMBINING CYRILLIC HUNDRED MILLIONS SIGN
  U+A672 ( ꙲ ) COMBINING CYRILLIC THOUSAND MILLIONS SIGN
Vertical Forms - Glyphs for vertical variants
  U+FE13 ( ︓ ) PRESENTATION FORM FOR VERTICAL COLON
Small Form Variants - Small form variants
Halfwidth And Fullwidth Forms - Fullwidth ASCII variants

And the Identifier Characters minus the Word Characters

Armenian - Punctuation
  U+058A ( ֊ ) ARMENIAN HYPHEN
Tibetan - Marks and signs
  U+0F0B ( ་ ) TIBETAN MARK INTERSYLLABIC TSHEG
Katakana - Katakana punctuation
  U+30A0 ( ゠ ) KATAKANA-HIRAGANA DOUBLE HYPHEN
Katakana - Conjunction and length marks
  U+30FB ( ・ ) KATAKANA MIDDLE DOT
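
To make the basic, language-independent entry-field check described in the proposal concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the function names and the small allow-list (space, apostrophe, hyphen-minus, full stop, comma, middle dot, right single quotation mark) are hypothetical choices, not a character set defined by this document, and the check uses only General_Category values from the standard unicodedata module rather than the Word_Break or Pattern_Syntax properties used in the set expressions above.

```python
import unicodedata

# Hypothetical allow-list chosen for illustration; the proposal does not fix this set.
# It covers space, apostrophe, hyphen-minus, full stop, comma,
# MIDDLE DOT (U+00B7) and RIGHT SINGLE QUOTATION MARK (U+2019).
EXTRA_NAME_CHARS = frozenset(" '-.,\u00B7\u2019")


def is_name_char(ch):
    """Return True if ch is a letter, a combining mark, or in the allow-list."""
    if ch in EXTRA_NAME_CHARS:
        return True
    category = unicodedata.category(ch)
    # L* covers all letters; Mn and Mc are nonspacing and spacing combining marks.
    return category.startswith("L") or category in ("Mn", "Mc")


def rejected_chars(name):
    """Return the characters of name that fail the basic check, in order."""
    return [ch for ch in name if not is_name_char(ch)]


def is_plausible_name(name):
    """Basic check: the name is non-empty and every character is allowed."""
    return bool(name) and not rejected_chars(name)


if __name__ == "__main__":
    for sample in ("James Smith-Faley, Jr.", "di Silva", "!#@\u2665\u2260"):
        print(repr(sample), is_plausible_name(sample), rejected_chars(sample))
```

Returning the rejected characters, rather than only a yes/no answer, lets an entry form tell the user which character caused the rejection. A more sophisticated validation that uses context among characters would need more than this per-character test.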