
Validating UCD

The Unicode character property values in the UCD files can be validated by means of regular expressions. Such validation can also be useful in testing implementations that return property values. The method of validation depends on the type of property, as described below. These expressions use Perl syntax, but may of course be converted to other formal conventions for use with other regular expression engines.

Enumerated and Binary Properties

These properties can be validated by generating a regular expression from the PropertyValueAliases.txt file. For example, to validate East_Asian_Width property values, parse the following lines and produce a regular expression that is the alternation of all of the listed values, both short and long.

# East_Asian_Width (ea)

ea ; A         ; Ambiguous
ea ; F         ; Fullwidth
ea ; H         ; Halfwidth
ea ; N         ; Neutral
ea ; Na        ; Narrow
ea ; W         ; Wide

The regular expression would thus be /A|Ambiguous|F|Fullwidth|H|Halfwidth|N|Neutral|Na|Narrow|W|Wide/
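
The construction of that expression can be sketched in Python as follows. This is a minimal illustration rather than part of any UCD tooling: the helper name build_enum_regex and the inline SAMPLE text are assumptions made for the example, and the match is anchored with re.fullmatch so that a short value such as N cannot succeed on a mere prefix of a longer string.

```python
import re

# Sample lines in the PropertyValueAliases.txt format (copied from the text above).
SAMPLE = """\
# East_Asian_Width (ea)

ea ; A         ; Ambiguous
ea ; F         ; Fullwidth
ea ; H         ; Halfwidth
ea ; N         ; Neutral
ea ; Na        ; Narrow
ea ; W         ; Wide
"""

def build_enum_regex(text, prop):
    """Hypothetical helper: collect every alias listed for `prop` and join them with '|'."""
    values = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments, skip blank lines
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if fields[0] == prop:
            values.extend(f for f in fields[1:] if f)
    return "|".join(re.escape(v) for v in values)

ea_pattern = build_enum_regex(SAMPLE, "ea")
print(ea_pattern)
print(bool(re.fullmatch(ea_pattern, "Na")))         # a listed value
print(bool(re.fullmatch(ea_pattern, "Narrowish")))  # not a listed value
```

With Perl itself, the same anchoring would be written by wrapping the alternation as /^(?:...)$/.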

Hybrid Properties

The CCC property is both an enumerated property and a numeric property. Its validating regular expression is formed as in Enumerated and Binary Properties, with the numbers 0..255 added as further alternatives.
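
As a rough sketch of the hybrid case, the following Python fragment joins a list of aliases with the numbers 0 through 255. The short alias list is only an illustrative subset chosen for the example; a real run would take the complete list from PropertyValueAliases.txt.

```python
import re

# Illustrative subset of ccc aliases (an assumption for this example);
# the full list would come from PropertyValueAliases.txt.
ccc_aliases = ["NR", "Not_Reordered", "OV", "Overlay", "A", "Above"]

# Add the numeric values 0..255 as further alternatives.
numbers = [str(n) for n in range(256)]
ccc_pattern = "|".join(ccc_aliases + numbers)

for value in ["Above", "230", "255", "256", "-1"]:
    print(value, bool(re.fullmatch(ccc_pattern, value)))
```

Listing the numbers explicitly keeps the expression trivially correct for the 0..255 range, at the cost of a long pattern; a hand-written numeric subexpression would be shorter but easier to get wrong.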

Unihan Properties

The validating regular expressions for these properties are described in http://www.unicode.org/reports/tr38/.

Other Properties

Regular expressions for the other properties in the UCD are provided in Table 18a. Table 18 supplies common subexpressions used in the regular expressions in Table 18a.

Table 18. Common Subexpressions

Variable            Value
$positiveDecimal    [0-9]+\.[0-9]+
$decimal            -?$positiveDecimal
$optionalDecimal    -?[0-9]+(\.[0-9]+)?
$name               [a-zA-Z0-9]+([-_ ][a-zA-Z0-9]+)*
$codepoint          (10|[A-F0-9])?[A-F0-9]{4}

Table 18a. Regular Expressions for Property Values

Abbr       Name                        Regex for Allowable Values
age        Age                         /$positiveDecimal|unassigned/
nv         Numeric_Value               /$decimal/ (Field 2)
                                       /$optionalDecimal/ (Field 3)
blk        Block                       /$name/
sc         Script                      /$name/
dm         Decomposition_Mapping       /$codepoint+/
FC_NFKC    FC_NFKC_Closure             /$codepoint+/
cf         Case_Folding                /$codepoint+/
lc         Lowercase_Mapping           /$codepoint+/
tc         Titlecase_Mapping           /$codepoint+/
uc         Uppercase_Mapping           /$codepoint+/
sfc        Simple_Case_Folding         /$codepoint/
slc        Simple_Lowercase_Mapping    /$codepoint/
stc        Simple_Titlecase_Mapping    /$codepoint/
suc        Simple_Uppercase_Mapping    /$codepoint/
bmg        Bidi_Mirroring_Glyph        /$codepoint/
isc        ISO_Comment                 /$name/
na1        Unicode_1_Name              /$name/
na         Name                        /$name/
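
The way the $variables are substituted into the property expressions can be sketched in Python as shown here. The names SUBEXPRESSIONS, expand and is_valid are hypothetical, and the definitions are one assumed reading of Table 18: the decimal point is escaped, the hyphen is placed first in the character class, and code points are limited to hexadecimal digits.

```python
import re

# Assumed readings of the Table 18 subexpressions (see the lead-in).
SUBEXPRESSIONS = {
    "positiveDecimal": r"[0-9]+\.[0-9]+",
    "decimal": r"-?$positiveDecimal",
    "optionalDecimal": r"-?[0-9]+(\.[0-9]+)?",
    "name": r"[a-zA-Z0-9]+([-_ ][a-zA-Z0-9]+)*",
    "codepoint": r"(10|[A-F0-9])?[A-F0-9]{4}",
}

def expand(pattern):
    """Hypothetical helper: replace each $variable by its definition.

    Each definition is wrapped in a non-capturing group so that a
    following quantifier, as in $codepoint+, applies to the whole of it.
    """
    def substitute(match):
        return "(?:" + expand(SUBEXPRESSIONS[match.group(1)]) + ")"
    return re.sub(r"\$([A-Za-z]+)", substitute, pattern)

def is_valid(pattern, value):
    """Return True if `value` matches the expanded pattern in full."""
    return re.fullmatch(expand(pattern), value) is not None

print(is_valid("$positiveDecimal|unassigned", "5.1"))   # age
print(is_valid("$name", "Latin-1 Supplement"))          # blk
print(is_valid("$codepoint", "10FFFF"))                 # bmg, sfc, ...
print(is_valid("$codepoint", "G041"))                   # not hexadecimal
```

Note that, as written in the table, /$codepoint+/ allows no separator between successive code points, so this sketch accepts 00410301 but not 0041 0301.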


[Ed Note: some of the anchors in http://www.unicode.org/reports/tr44/tr44-3.html have a spurious trailing "_html". They should be removed]
[Ed Note: I just modified the expressions to pull out the subexpressions. I'm hoping that Eric can fix up the actual values, and add missing ones.]