The Unicode character property values in the UCD files can be validated
by means of regular expressions. Such validation can also be useful
in testing implementations that return property values. The method
of validation depends on the type of property, as described below.
These expressions use Perl syntax, but may of course be converted
to other formal conventions for use
with other regular expression engines.

Enumerated and Binary Properties

These properties can be validated by generating a regular expression from the PropertyValueAliases.txt file. For example, to validate values of the East_Asian_Width property, parse the following lines and produce a regular expression that is the alternation of all of the values, joined with "|".

    # East_Asian_Width (ea)
    ea ; A ; Ambiguous
    ea ; F ; Fullwidth
    ea ; H ; Halfwidth
    ea ; N ; Neutral
    ea ; Na ; Narrow
    ea ; W ; Wide

The regular expression would thus be

    /A|Ambiguous|F|Fullwidth|H|Halfwidth|N|Neutral|Na|Narrow|W|Wide/

Hybrid Properties

The CCC property is both an enumerated property and a numeric property. Its validating regular expression is formed as described in Enumerated and Binary Properties, with the numbers 0..255 added as further alternatives.

Unihan Properties

The validating regular expressions for these properties are described in http://www.unicode.org/reports/tr38/.

Other Properties

Regular expressions for the other properties in the UCD are provided in Tables 18 and 18a. Table 18 supplies common subexpressions used in the regular expressions in Table 18a.

Table 18. Common Subexpressions
Table 18a. Regular Expressions for Property Values
[Ed Note: some of the anchors in http://www.unicode.org/reports/tr44/tr44-3.html have a spurious trailing "_html". They should be removed.]

[Ed Note: I just modified the expressions to pull out the subexpressions. I'm hoping that Eric can fix up the actual values, and add missing ones.]
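
The procedure described under Enumerated and Binary Properties, and its extension under Hybrid Properties, could be sketched roughly as follows. This is a minimal illustration in Python rather than Perl (the alternation syntax is the same in both), not a normative implementation. The names `SAMPLE_LINES`, `value_aliases`, `build_pattern`, and `is_valid` are hypothetical, and the two CCC aliases are only a small sample chosen for the illustration. A real validator would read the lines from PropertyValueAliases.txt instead of an inline sample.

```python
import re

# Sample lines copied from the East_Asian_Width example in this section.
# Assumption: a real validator would read PropertyValueAliases.txt instead.
SAMPLE_LINES = """\
# East_Asian_Width (ea)
ea ; A ; Ambiguous
ea ; F ; Fullwidth
ea ; H ; Halfwidth
ea ; N ; Neutral
ea ; Na ; Narrow
ea ; W ; Wide
""".splitlines()


def value_aliases(lines, prop):
    """Collect every value listed for the short property name `prop`."""
    values = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if fields[0] == prop:
            values.extend(field for field in fields[1:] if field)
    return values


def build_pattern(values, extra=()):
    """Join the escaped values, plus any extra alternatives, with '|'."""
    return "|".join([re.escape(value) for value in values] + list(extra))


def is_valid(value, pattern):
    """True if the whole of `value` matches one of the alternatives."""
    return re.fullmatch(pattern, value) is not None


# Enumerated property: alternation of all listed values.
ea_pattern = build_pattern(value_aliases(SAMPLE_LINES, "ea"))

# Hybrid property: the same construction, with the numbers 0..255 added.
# Assumption: only two sample CCC aliases are used here for brevity.
ccc_pattern = build_pattern(
    ["NR", "Not_Reordered"],
    extra=[str(n) for n in range(256)],
)

print(ea_pattern)
print(is_valid("Na", ea_pattern), is_valid("256", ccc_pattern))
```

Matching the whole string (here with `re.fullmatch`) matters because some values are prefixes of others, for example N and Na. An unanchored search would accept any string that merely contains a valid value.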