Updated Confusable Data

I regenerated the confusables for UTS #39: Unicode Security Mechanisms, adding some characters from Mozilla and some related characters. I'd appreciate any feedback on additions or corrections. I haven't really looked at any of the Unicode 5.1/5.2 additions, so help there would be especially appreciated; we will want to release the new data soon after Unicode 5.2 is released.

The draft source is at:

http://unicode.org/repository/*checkout*/draft/reports/tr39/data/source/formatted-source.txt

Adding x ≈ y (x is visually confusable with y) to that source adds new confusables to the data. (In the file, the x ≈ y relationship is expressed with the ";" delimiter between the two items, as in standard Unicode data files.)
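As a rough illustration of how a ";"-delimited entry could be read, here is a minimal Python sketch. It assumes each field is a space-separated list of hexadecimal code points and that "#" starts a comment, as in many Unicode data files; the actual layout of formatted-source.txt may differ, and the function name parse_confusable_line is hypothetical.

```python
def parse_confusable_line(line):
    """Parse one source line into a list of strings, one per ';'-separated field.

    Assumed (not confirmed) layout: hex code points separated by spaces,
    fields separated by ';', and '#' starting a trailing comment.
    Returns None for blank or comment-only lines.
    """
    data = line.split("#", 1)[0].strip()  # drop the comment, if any
    if not data:
        return None
    fields = [field.strip() for field in data.split(";")]
    # Turn each field's hex code points into the string they spell.
    return ["".join(chr(int(cp, 16)) for cp in field.split()) for field in fields]


# Example: the YOD / APOSTROPHE pair written as x ; y
pair = parse_confusable_line("05D9 ; 0027 # HEBREW LETTER YOD ; APOSTROPHE")
```

Each parsed line yields the items that the source declares confusable with one another, ready to be fed into whatever builds the final data.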

Note that confusability means that in some common fonts, at common UI font sizes, the characters look similar enough to be mistaken for one another. For example, consider the following:

(‎ י ‎) 05D9 HEBREW LETTER YOD

(‎ ' ‎) 0027 APOSTROPHE

The YOD doesn't look much like the apostrophe in a traditional font, but the two do look alike in some modern fonts, so the pair gets added.

The draft summary result is at:

http://unicode.org/repository/*checkout*/draft/reports/tr39/data/confusablesSummary.txt

This is the result after generating equivalence classes. That is, if the source contains x ≈ y and y ≈ z, then the process also adds y ≈ x, z ≈ y, x ≈ z, and z ≈ x. The equivalence classes apply transitivity and symmetry not only to whole strings but also to substrings: if the source contains x ≈ y, yw ≈ z, and q ≈ ym, then the equivalence classes also include xw ≈ z and q ≈ xm. The file displays each equivalence class by picking a representative and mapping all other members to it. For example:

(‎ ! ‎) 0021 EXCLAMATION MARK

(‎ ǃ ‎) 01C3 LATIN LETTER RETROFLEX CLICK

(‎ ！ ‎) FF01 FULLWIDTH EXCLAMATION MARK

This means that { ! ǃ ！ } are all in the same equivalence class. Similarly, all of the following are in the same equivalence class:

(‎ / ‎) 002F SOLIDUS

(‎ 丿 ‎) 4E3F CJK UNIFIED IDEOGRAPH-4E3F # →⼃→

(‎ 〳 ‎) 3033 VERTICAL KANA REPEAT MARK UPPER HALF

(‎ ⼃ ‎) 2F03 KANGXI RADICAL SLASH

(‎ ᜵ ‎) 1735 PHILIPPINE SINGLE PUNCTUATION

(‎ ⁁ ‎) 2041 CARET INSERTION POINT

(‎ ⁄ ‎) 2044 FRACTION SLASH

(‎ ∕ ‎) 2215 DIVISION SLASH

(‎ ╱ ‎) 2571 BOX DRAWINGS LIGHT DIAGONAL UPPER RIGHT TO LOWER LEFT

(‎ ⧸ ‎) 29F8 BIG SOLIDUS

(‎ ／ ‎) FF0F FULLWIDTH SOLIDUS

A comment (after #) indicates that the mapping is indirect. In the list above, the comment on 4E3F indicates that 丿 was added because (in some fonts) it looks like ⼃, which (in some fonts) looks like /.
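The equivalence-class step described above can be sketched in Python with a union-find structure. This is only an illustration under stated assumptions, not the generator actually used for the data: the names UnionFind, build_classes, and close_over_substrings are hypothetical, the representative is simply the smallest string by code point order (the real data may choose representatives differently), and the substring handling covers only single-character substitutions inside longer strings.

```python
from collections import defaultdict


class UnionFind:
    """Disjoint sets of strings; the root of each set is its smallest member."""

    def __init__(self):
        self.parent = {}

    def find(self, item):
        self.parent.setdefault(item, item)
        while self.parent[item] != item:
            # Path halving: point item at its grandparent as we walk up.
            self.parent[item] = self.parent[self.parent[item]]
            item = self.parent[item]
        return item

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            lo, hi = sorted((ra, rb))
            self.parent[hi] = lo  # smaller root stays the representative


def build_classes(pairs):
    """Apply symmetry and transitivity to whole items; map representative -> members."""
    uf = UnionFind()
    for x, y in pairs:
        uf.union(x, y)
    classes = defaultdict(set)
    for item in list(uf.parent):
        classes[uf.find(item)].add(item)
    return dict(classes)


def close_over_substrings(pairs):
    """Return a predicate same_class(a, b) that also honors substitutions in substrings.

    Simplification: only single-character pairs are substituted inside longer strings.
    """
    char_uf = UnionFind()
    for x, y in pairs:
        if len(x) == 1 and len(y) == 1:
            char_uf.union(x, y)

    def fold(text):
        # Replace every character by the representative of its class.
        return "".join(char_uf.find(ch) for ch in text)

    string_uf = UnionFind()
    for x, y in pairs:
        string_uf.union(fold(x), fold(y))

    def same_class(a, b):
        return string_uf.find(fold(a)) == string_uf.find(fold(b))

    return same_class


# x ≈ y and y ≈ z in the source put all three exclamation marks in one class.
classes = build_classes([("!", "\u01c3"), ("\u01c3", "\uff01")])

# x ≈ y, yw ≈ z, q ≈ ym in the source imply xw ≈ z and q ≈ xm.
same_class = close_over_substrings([("x", "y"), ("yw", "z"), ("q", "ym")])
```

Folding each string to its per-character representatives before joining classes is what lets xw and yw land in the same class without listing every substituted variant explicitly.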

New Script and Character Proposals

I'm also thinking that it would be a good idea to ask script proposals to include an explicit list of possible confusables. That would help keep the data up to date for new scripts, especially historic ones.