Validating UCD
The Unicode character property values in the UCD files can be validated by means of regular expressions. Such validation can also be useful in testing implementations that return property values. The method of validation depends on the type of property, as described below. These expressions use Perl syntax, but may of course be converted to other formal conventions for use with other regular expression engines.
Enumerated and Binary Properties
These properties can be validated by generating a regular expression from the PropertyValueAliases.txt file. For example, to validate the East_Asian_Width property values, parse the following lines, and produce a regular expression which is the alternation of each of the values.
# East_Asian_Width (ea)
ea ; A ; Ambiguous
ea ; F ; Fullwidth
ea ; H ; Halfwidth
ea ; N ; Neutral
ea ; Na ; Narrow
ea ; W ; Wide
The regular expression would thus be /A|Ambiguous|F|Fullwidth|H|Halfwidth|N|Neutral|Na|Narrow|W|Wide/
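The generation step for the East_Asian_Width example can be sketched in Python as follows. This is a minimal illustration, not a normative procedure: the function name build_value_pattern and the inline SAMPLE_LINES string are hypothetical, and the sketch assumes that fields are separated by semicolons and that "#" starts a comment. It also assumes that a whole field is being validated, so it uses a full match rather than an unanchored search.

```python
import re

# Sample lines taken from the East_Asian_Width example above.
SAMPLE_LINES = """\
# East_Asian_Width (ea)
ea ; A ; Ambiguous
ea ; F ; Fullwidth
ea ; H ; Halfwidth
ea ; N ; Neutral
ea ; Na ; Narrow
ea ; W ; Wide
"""

def build_value_pattern(lines, prop_alias):
    """Collect every value alias listed for prop_alias and join them with '|'."""
    values = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if fields[0] == prop_alias:
            # Every field after the property alias is a value alias.
            values.extend(v for v in fields[1:] if v)
    return "|".join(re.escape(v) for v in values)

pattern = build_value_pattern(SAMPLE_LINES.splitlines(), "ea")
validator = re.compile(pattern)

print(pattern)
print(bool(validator.fullmatch("Na")))     # a listed value
print(bool(validator.fullmatch("Wider")))  # not a listed value
```

Using a full match (or anchoring the alternation in Perl) matters here because short aliases such as N and W would otherwise match as substrings of invalid values.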
Hybrid Properties
The Canonical_Combining_Class (ccc) property is both an enumerated property and a numeric property. Its validating regular expression is formed as described in Enumerated and Binary Properties, with the addition of the numbers 0..255.
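A hybrid pattern of this kind might be assembled as in the following Python sketch. The alias list is only an illustrative subset; in practice it would be generated from PropertyValueAliases.txt. The names CCC_ALIASES, NUMBER_0_255, and is_valid_ccc are hypothetical, and rejecting leading zeros (such as "007") is an assumption of this sketch rather than something the text specifies.

```python
import re

# Illustrative subset of Canonical_Combining_Class value aliases (assumption:
# the full list would be read from PropertyValueAliases.txt).
CCC_ALIASES = ["NR", "Not_Reordered", "OV", "Overlay", "A", "Above"]

# Decimal integers 0..255, without leading zeros (an assumption of this sketch).
NUMBER_0_255 = r"25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9]"

# Enumerated aliases first, then the numeric alternatives.
ccc_pattern = "|".join(re.escape(a) for a in CCC_ALIASES) + "|" + NUMBER_0_255
ccc_validator = re.compile(ccc_pattern)

def is_valid_ccc(value):
    """Return True if value is a listed alias or an integer in 0..255."""
    return ccc_validator.fullmatch(value) is not None

print(is_valid_ccc("230"))
print(is_valid_ccc("Above"))
print(is_valid_ccc("256"))
```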
Unihan Properties
The validating regular expressions for these properties are described in http://www.unicode.org/reports/tr38/.
Other Properties
Regular expressions for the other properties in the UCD are provided in Tables 18 and 18a. Table 18 supplies common subexpressions used in the regular expressions in Table 18a.
Table 18. Common Subexpressions
Table 18a. Regular Expressions for Property Values
[Ed Note: some of the anchors in http://www.unicode.org/reports/tr44/tr44-3.html have a spurious trailing "_html". They should be removed]
[Ed Note: I just modified the expressions to pull out the subexpressions. I'm hoping that Eric can fix up the actual values, and add missing ones.]