Validating UCD

The Unicode character property values in the UCD files can be validated by means of regular expressions. Such validation can also be useful in testing implementations that return property values. The method of validation depends on the type of property, as described below. These expressions use Perl syntax, but may of course be converted to other conventions for use with other regular expression engines.

Enumerated and Binary Properties

These properties can be validated by generating a regular expression from the PropertyValueAliases.txt file. For example, to validate East_Asian_Width property values, parse the following lines and produce a regular expression that is the alternation of each of the values.

# East_Asian_Width (ea)

ea ; A ; Ambiguous
ea ; F ; Fullwidth
ea ; H ; Halfwidth
ea ; N ; Neutral
ea ; Na ; Narrow
ea ; W ; Wide

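The regular expression would thus be /A|Ambiguous|F|Fullwidth|H|Halfwidth|N|Neutral|Na|Narrow|W|Wide/. When it is applied, the expression should be anchored so that it must match the entire value field; otherwise a longer string that merely contains a valid value (such as "Wider") would also be accepted.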
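The generation step described above can be sketched in Python rather than Perl. This is only an illustration under stated assumptions: the function name build_value_regex and the embedded SAMPLE text are hypothetical, the parser handles only the "property ; alias ; alias" line shape shown in the excerpt, and the real PropertyValueAliases.txt may use conventions that this sketch does not cover.

```python
import re

# Excerpt of PropertyValueAliases.txt as quoted in this section (not the full file).
SAMPLE = """\
# East_Asian_Width (ea)

ea ; A ; Ambiguous
ea ; F ; Fullwidth
ea ; H ; Halfwidth
ea ; N ; Neutral
ea ; Na ; Narrow
ea ; W ; Wide
"""

def build_value_regex(text, prop):
    """Hypothetical helper: compile an alternation of all value aliases of `prop`."""
    values = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()      # drop comments and whitespace
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if fields[0] != prop:                     # first field is the property alias
            continue
        for alias in fields[1:]:                  # remaining fields are value aliases
            if alias and alias not in values:
                values.append(alias)
    return re.compile("|".join(re.escape(v) for v in values))

ea_regex = build_value_regex(SAMPLE, "ea")

# fullmatch anchors the expression at both ends, so the whole value must match.
print(bool(ea_regex.fullmatch("Na")))
print(bool(ea_regex.fullmatch("Wider")))
```

Using fullmatch (or an explicitly anchored pattern in other engines) avoids depending on the order of the alternatives, which matters here because "N" is a prefix of both "Na" and "Neutral".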

Hybrid Properties

The Canonical_Combining_Class (ccc) property is both an enumerated property and a numeric property. Its validating regular expression is formed as described in Enumerated and Binary Properties, with the addition of the numbers 0..255.
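A minimal Python sketch of the hybrid case follows. The helper name build_hybrid_regex is hypothetical, and CCC_ALIASES is only a small illustrative subset of the aliases; a real validator would take the full list from PropertyValueAliases.txt. The numeric range 0..255 is taken from the text above.

```python
import re

# Small illustrative subset of Canonical_Combining_Class value aliases.
CCC_ALIASES = ["NR", "Not_Reordered", "OV", "Overlay"]

def build_hybrid_regex(aliases, low=0, high=255):
    """Hypothetical helper: alternation of the aliases plus the decimal numbers low..high."""
    numbers = [str(n) for n in range(low, high + 1)]
    alternatives = [re.escape(a) for a in aliases] + numbers
    return re.compile("|".join(alternatives))

ccc_regex = build_hybrid_regex(CCC_ALIASES)

# fullmatch requires the entire value to match one alternative.
print(bool(ccc_regex.fullmatch("230")))
print(bool(ccc_regex.fullmatch("Overlay")))
print(bool(ccc_regex.fullmatch("256")))
```

Enumerating the numbers keeps the sketch close to the wording above; an engine-specific numeric subexpression could be substituted, provided it accepts exactly the same range.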

Unihan Properties

The validating regular expressions for these properties are described in http://www.unicode.org/reports/tr38/.

Other Properties

Regular expressions for the other properties in the UCD are provided in Tables 18 and 18a. Table 18 supplies common subexpressions used in the regular expressions in Table 18a.

Table 18. Common Subexpressions

Table 18a. Regular Expressions for Property Values

[Ed Note: Some of the anchors in http://www.unicode.org/reports/tr44/tr44-3.html have a spurious trailing "_html". They should be removed.]

[Ed Note: I just modified the expressions to pull out the subexpressions. I'm hoping that Eric can fix up the actual values, and add missing ones.]