Unihan property names

Mark Davis, 2009-07-08

I was working on the following action. We were not able to reach consensus in the editorial committee on how to proceed, so I'm bringing this to the UTC.

Action 117-A040 (Mark Davis): Update PropertyAliases.txt and PropertyValueAliases.txt with the Unihan properties. L2/08-352, UCD, 2008-11-12, 2008-11-12

The document is: http://www.unicode.org/L2/L2008/08352-stability-prop.html. That document didn't specify the property names or aliases, and I also needed default values. I followed the analogy of kRSUnicode, which shows up as:

# Unicode_Radical_Stroke (URS)

# @missing: 0000..10FFFF; Unicode_Radical_Stroke; <none>

CompatibilityVariant and the numeric properties, however, take the usual defaults for String and Numeric properties (<code point> and NaN, respectively).

Note that we should probably list CompatibilityVariant as a derived property in UAX #44: according to UAX #38, it is derived from the compatibility decompositions in UnicodeData.txt, and is thus (I presume) simply filtered down to CJK ideographs.


We really ought to have a FAQ explaining just what the difference is between the properties Ideographic and Unified_Ideograph, since it is nearly impossible to guess from the names. It appears that Ideographic is really a derived property, equal to the following. Is this relationship by intent or by accident?

Unified_Ideograph + HANGZHOU numerals + compatibility ideographs + U+3006 + U+3007





On that basis, here is what I came up with.




# ================================================

# Numeric Properties

# ================================================

CJK_AC ; CJK_AccountingNumeric ; kAccountingNumeric

CJK_ON ; CJK_OtherNumeric ; kOtherNumeric

CJK_PN ; CJK_PrimaryNumeric ; kPrimaryNumeric


# ================================================

# String Properties

# ================================================


CJK_CV ; CJK_CompatibilityVariant ; kCompatibilityVariant


# ================================================

# Miscellaneous Properties

# ================================================

IIC ; IICore ; kIICore

IRG_G ; IRG_GSource ; kIRG_GSource

IRG_H ; IRG_HSource ; kIRG_HSource

IRG_J ; IRG_JSource ; kIRG_JSource

IRG_K ; IRG_KSource ; kIRG_KSource

IRG_KP ; IRG_KPSource ; kIRG_KPSource

IRG_T ; IRG_TSource ; kIRG_TSource

IRG_U ; IRG_USource ; kIRG_USource

IRG_V ; IRG_VSource ; kIRG_VSource


URS ; Unicode_Radical_Stroke ; kRSUnicode




# CJK_AccountingNumeric (CJK_AC)

# @missing: 0000..10FFFF; CJK_AccountingNumeric; NaN

# CJK_CompatibilityVariant (CJK_CV)

# @missing: 0000..10FFFF; CJK_CompatibilityVariant; <code point>

# CJK_OtherNumeric (CJK_ON)

# @missing: 0000..10FFFF; CJK_OtherNumeric; NaN

# CJK_PrimaryNumeric (CJK_PN)

# @missing: 0000..10FFFF; CJK_PrimaryNumeric; NaN

# IICore (IIC)

# @missing: 0000..10FFFF; IICore; <none>

# IRG_GSource (IRG_G)

# @missing: 0000..10FFFF; IRG_GSource; <none>

# IRG_HSource (IRG_H)

# @missing: 0000..10FFFF; IRG_HSource; <none>

# IRG_JSource (IRG_J)

# @missing: 0000..10FFFF; IRG_JSource; <none>

# IRG_KPSource (IRG_KP)

# @missing: 0000..10FFFF; IRG_KPSource; <none>

# IRG_KSource (IRG_K)

# @missing: 0000..10FFFF; IRG_KSource; <none>

# IRG_TSource (IRG_T)

# @missing: 0000..10FFFF; IRG_TSource; <none>

# IRG_USource (IRG_U)

# @missing: 0000..10FFFF; IRG_USource; <none>

# IRG_VSource (IRG_V)

# @missing: 0000..10FFFF; IRG_VSource; <none>
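
As an illustration of how a consumer might read the @missing lines above, here is a minimal Python sketch. The names parse_missing and default_value are hypothetical, and the interpretation of the three default tokens (<code point> as the character itself, <none> as no value, NaN as a floating-point not-a-number) is an assumption made for this example, not something the data files prescribe for any particular implementation.

```python
import math
import re

# Matches lines such as:
#   # @missing: 0000..10FFFF; CJK_AccountingNumeric; NaN
MISSING_RE = re.compile(
    r"^#\s*@missing:\s*([0-9A-Fa-f]{4,6})\.\.([0-9A-Fa-f]{4,6})"
    r"\s*;\s*([^;]+?)\s*;\s*(.+?)\s*$"
)


def parse_missing(text):
    """Return {property_name: (first, last, default_token)} from @missing lines."""
    defaults = {}
    for line in text.splitlines():
        m = MISSING_RE.match(line.strip())
        if m:
            first, last = int(m.group(1), 16), int(m.group(2), 16)
            defaults[m.group(3)] = (first, last, m.group(4))
    return defaults


def default_value(defaults, prop, cp):
    """Interpret the default token for code point cp (assumed mapping)."""
    first, last, token = defaults[prop]
    if not first <= cp <= last:
        raise ValueError("code point outside the @missing range")
    if token == "<code point>":
        return chr(cp)      # string property: maps to itself
    if token == "<none>":
        return None         # no value
    if token == "NaN":
        return math.nan     # numeric property: not a number
    return token            # any other token is returned unchanged


SAMPLE = """
# CJK_AccountingNumeric (CJK_AC)
# @missing: 0000..10FFFF; CJK_AccountingNumeric; NaN
# CJK_CompatibilityVariant (CJK_CV)
# @missing: 0000..10FFFF; CJK_CompatibilityVariant; <code point>
# IICore (IIC)
# @missing: 0000..10FFFF; IICore; <none>
"""

defaults = parse_missing(SAMPLE)
```

With this sketch, default_value(defaults, "CJK_CompatibilityVariant", 0x4E00) returns the character U+4E00 itself, while the numeric and miscellaneous properties fall back to NaN and no value.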

Here is some of the back-and-forth on this, edited heavily for brevity.

I would really really prefer that we don't invent any new names, long or short.

For the long names, the kFoo names are perfectly adequate in my opinion, they are what the Unihan users know (at the very least, they have to be the "preferred" long names).

For the short names, with 88 properties, they are bound to be impossible to remember (e.g. for me CJK_ON suggests kJapaneseOn rather than kOtherNumeric).


I can understand that, but we also need to be consistent with what we've already done with kRSUnicode and other properties. Note that the kXXX names are retained as aliases, and there'd be no problem with your continuing to use them in the XML.

Having the k form (which can be very long) be the short alias is simply bizarre. As far as the UCD properties were concerned, the tags in Unihan were just gorp; this is the point at which we are really fully recognizing (some of) them as UCD properties, and we should give them consistent names, as we have *already done* with kRSUnicode.

Note that with Unicode_Radical_Stroke, we didn't even use the k form as an *alias*; the official name was all and only the first two fields below. In the above, I added the 'k' form as an alias.

URS ; Unicode_Radical_Stroke ; kRSUnicode
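
To make the alias arrangement concrete, here is a minimal Python sketch that parses lines in the three-field form shown above and resolves the short name, the long name, or the k form to the same long property name. The function names parse_alias_lines and resolve are hypothetical, and the folding (ignoring case, underscores, hyphens, and spaces) is a simplification of the loose matching described in UAX #44, not a complete implementation of it.

```python
def _loose(name):
    """Fold a property name for loose comparison (simplified)."""
    return "".join(ch for ch in name.lower() if ch not in "_- ")


def parse_alias_lines(text):
    """Map every loosely folded name or alias to its long property name."""
    lookup = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if len(fields) < 2:
            continue  # not an alias line
        long_name = fields[1]  # second field is the long name
        for name in fields:
            lookup[_loose(name)] = long_name
    return lookup


def resolve(lookup, name):
    """Return the long property name for any alias, or None if unknown."""
    return lookup.get(_loose(name))


SAMPLE = """
# Numeric Properties
CJK_AC ; CJK_AccountingNumeric ; kAccountingNumeric
CJK_ON ; CJK_OtherNumeric ; kOtherNumeric
URS ; Unicode_Radical_Stroke ; kRSUnicode
"""

aliases = parse_alias_lines(SAMPLE)
print(resolve(aliases, "kRSUnicode"))  # Unicode_Radical_Stroke
print(resolve(aliases, "cjk_on"))      # CJK_OtherNumeric
```

Under this scheme, existing Unihan users can keep writing kRSUnicode or kOtherNumeric, and a lookup keyed on any of the three fields lands on the same property.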