
*TODO Updated Confusable Data

Subject: Updated Confusable Data
Date: 2009-08-06
From: Mark Davis
To: UTC

I regenerated the confusables for UTS#39: Unicode Security Mechanisms, adding some characters from Mozilla and some related characters. I'd appreciate any feedback on additions/corrections. I've not really looked at any of the 5.1/5.2 additions, so help there would be especially appreciated, and we will want to release the new data soon after U5.2 is released.

The draft source is at:


Adding x ≈ y (x is visually confusable with y) to that source adds new confusables to the data. (The x ≈ y relationship is expressed in the file by the standard ";" delimiter between the two items, as in standard Unicode data files.)
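As a rough illustration of reading such a source, here is a minimal Python sketch. It assumes each line holds two fields separated by ";", that each field is a sequence of space-separated hexadecimal code points, and that "#" starts a comment. The draft source's real layout may differ; the function name `parse_confusable_pairs` and the sample lines are hypothetical.

```python
def parse_confusable_pairs(text):
    """Return a list of (x, y) string pairs from ';'-delimited source lines."""
    pairs = []
    for raw_line in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 2:
            continue  # not an "x ; y" line
        # Assumption: each field is a run of space-separated hex code points.
        x, y = (
            "".join(chr(int(cp, 16)) for cp in field.split())
            for field in fields[:2]
        )
        pairs.append((x, y))
    return pairs


# Hypothetical sample lines, not copied from the draft source.
sample = """
05D9 ; 0027   # HEBREW LETTER YOD, APOSTROPHE
01C3 ; 0021   # LATIN LETTER RETROFLEX CLICK, EXCLAMATION MARK
"""
pairs = parse_confusable_pairs(sample)
```

Each resulting pair stands for one x ≈ y statement, ready to be fed into whatever step builds the equivalence classes.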

Note that confusability means that in some common fonts, at common UI font sizes, the characters look similar enough to be mistaken for one another. So, for example, look at the following:
(‎ י ‎)	05D9	 HEBREW LETTER YOD
(‎ ' ‎) 0027 APOSTROPHE
The YOD doesn't look much like the apostrophe in a traditional font, but it does look like it in some modern fonts; so it gets added.

The draft summary result is at:

This is after generating equivalence classes. What that means is that if we had x ≈ y and y ≈ z in the source, then it also adds y ≈ x, z ≈ y, x ≈ z, and z ≈ x. The equivalence classes apply transitivity and symmetry not only to whole strings but also to substrings. That means that if we have x ≈ y, y ≈ wz, and q ≈ ym in the source, then the equivalence class also adds x ≈ wz and q ≈ xm. This file displays these equivalence classes by picking a representative and mapping all others to it. For example:
	(‎ ! ‎)	0021	 EXCLAMATION MARK
← (‎ ǃ ‎) 01C3 LATIN LETTER RETROFLEX CLICK
← (‎ ！ ‎) FF01 FULLWIDTH EXCLAMATION MARK
This means that { ! ǃ ！ } are all in the same equivalence class. Similarly, all of the following are in the same equivalence class.
	(‎ / ‎)	002F	 SOLIDUS
← (‎ 丿 ‎) 4E3F CJK UNIFIED IDEOGRAPH-4E3F # →⼃→
← (‎ 〳 ‎) 3033 VERTICAL KANA REPEAT MARK UPPER HALF
← (‎ ⼃ ‎) 2F03 KANGXI RADICAL SLASH
← (‎ ᜵ ‎) 1735 PHILIPPINE SINGLE PUNCTUATION
← (‎ ⁁ ‎) 2041 CARET INSERTION POINT
← (‎ ⁄ ‎) 2044 FRACTION SLASH
← (‎ ∕ ‎) 2215 DIVISION SLASH
← (‎ ╱ ‎) 2571 BOX DRAWINGS LIGHT DIAGONAL UPPER RIGHT TO LOWER LEFT
← (‎ ⧸ ‎) 29F8 BIG SOLIDUS
← (‎ ／ ‎) FF0F FULLWIDTH SOLIDUS
The comments (after #) indicate that the mapping is indirect. That is, the above comment indicates that it is added because 丿 (in some fonts) looks like ⼃ which (in some fonts) looks like /.
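
The whole-item part of this closure (symmetry and transitivity, then mapping each member to a representative) can be sketched with a union-find structure. This is only an illustration, not the generator actually used for the data: the function names are hypothetical, the rule for picking a representative (shortest string, then lowest code point) is an assumption, and the substring propagation described above is not covered.

```python
def build_classes(pairs):
    """Group items into equivalence classes from a list of (x, y) pairs."""
    parent = {}

    def find(item):
        # Register unseen items as their own root, then walk up to the root.
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]  # path halving
            item = parent[item]
        return item

    for x, y in pairs:
        root_x, root_y = find(x), find(y)
        if root_x != root_y:
            parent[root_y] = root_x  # merge the two classes

    classes = {}
    for item in list(parent):
        classes.setdefault(find(item), set()).add(item)
    return list(classes.values())


def representative(members):
    # Assumption: prefer the shortest string, then the lowest code point.
    return min(members, key=lambda s: (len(s), s))


def to_mapping(pairs):
    """Map every non-representative item to its class representative."""
    mapping = {}
    for members in build_classes(pairs):
        rep = representative(members)
        for member in members:
            if member != rep:
                mapping[member] = rep
    return mapping


# 4E3F ≈ 2F03 and 2F03 ≈ 002F, plus two pairs for the exclamation mark.
pairs = [("\u4E3F", "\u2F03"), ("\u2F03", "/"), ("\u01C3", "!"), ("\uFF01", "!")]
mapping = to_mapping(pairs)
```

Here U+4E3F ends up mapped to the solidus even though no source pair relates the two directly, which is the kind of indirect mapping the "#" comments record.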

New Script and Character Proposals

I'm also thinking that it would be a good idea to ask explicitly, in script proposals, for a list of possible confusables. That would help keep the confusable data up to date for new scripts, especially historic ones.