JavaScript Character Picker

We're looking at a JavaScript Character Picker that would allow people to insert arbitrary Unicode characters, and would appreciate feedback.

The idea is to use a main category on the left as a pull-down menu. When a main category is chosen, it populates a subcategory pull-down menu in the center. Picking one of the subcategories fills in the table of buttons, one per character. Clicking on a button inserts the character. Hovering over a button shows the code point. There is a mockup at http://macchiato.com/picker/MyApplication.html. This mockup is purely for looking at the character categories and ordering, not the UI!

Design Goals

    1. Data size is important, so we wouldn't have names for the characters, or be able to search by name. However, we can add names for a small number of characters, exposed when the user hovers over the button. These are currently the following:

    1. Again because of data size, we are excluding the following:

      • [[:cn:][:cs:][:co:][:cc:][:deprecated:]]

      • We may exclude more, such as Han characters. This can be controlled via switches.

    2. The categorization does not need to be a partition. That is, a character can be in multiple categories if it makes it easier to find.

    3. If a user doesn't have fonts that cover all the characters, then boxes will show up. It is too difficult to filter those out, because they may vary according to browser capabilities and loaded fonts.

    4. Each set of buttons should contain roughly 100 items, and at most about 1,000.

    5. In most cases, the buttons should be sorted. However, for scripts where the order is not that important or code point order is good enough, they may be left unsorted to save on data size.

    6. The data for the categorization and the ordering of buttons is generated by a Java program, which has full access to Unicode properties via UnicodeSet and other classes in ICU4J. In particular, the goals are that:

      1. The data can be updated to new versions of Unicode easily.

      2. The data can be customized to exclude or include certain categories of characters at build time.

    7. The JavaScript program knows nothing about Unicode properties, and is driven purely by generated data.
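As a rough illustration of the exclusion set in the design goals above, the UnicodeSet pattern can be approximated with a JavaScript regular expression that uses Unicode property escapes. This is only a sketch: the real filtering is done by the Java generator with ICU4J, the name isExcluded is hypothetical, and the result depends on the Unicode version built into the JavaScript engine.

```javascript
// Approximation of [[:cn:][:cs:][:co:][:cc:][:deprecated:]]:
// unassigned, surrogate, private use, control, and deprecated code points.
// Assumption: the engine supports Unicode property escapes (ES2018+).
const EXCLUDED = /^[\p{Cn}\p{Cs}\p{Co}\p{Cc}\p{Deprecated}]$/u;

// Hypothetical helper: true if the code point would be left out of the picker.
function isExcluded(codePoint) {
  return EXCLUDED.test(String.fromCodePoint(codePoint));
}
```

For example, isExcluded(0x41) is false for "A", while isExcluded(0xE000) is true because U+E000 is a private use character.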

Known Issues

    1. More needs to be done on the categorization and the ordering of buttons, so feedback is welcome - please send to markdavis@google.com (Subject: Picker ...)

    2. Some of the subcategories show up empty (no buttons) in the Mockup.

    3. There is no effort to make the mockup look pretty - it is only for trying out the categories and ordering.

    4. The blowup of a character has a link to the Unicode utilities or Unihan database; again, this is purely for testing.

Mockup Sorting

The current sorting of buttons is according to the following levels, in order.

    1. ASCII

    2. Normalized letters (NFKD)

    3. Unnormalized letters

    4. Unicode Collation Algorithm Order

    5. Code point order

Each level is skipped if the characters have the same value at that level. For example, if one character is ASCII and the other isn't, the ASCII goes first. If both are or are not ASCII, then the next level is tested, and so on. The final level will always distinguish different code points.
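The level-by-level comparison described above could be sketched as a chain of comparators, where a result of 0 means "same value at this level, try the next one". This is a hedged sketch, not the generator's code (which is written in Java): the names levels, compareButtons, and sortButtons are hypothetical, Intl.Collator is only an approximation of Unicode Collation Algorithm order, and levels 2 and 3 are left as a placeholder because their exact definition is not spelled out here.

```javascript
// "und" asks for the root collation; an engine may fall back to its default locale.
const uca = new Intl.Collator("und");

// Each level returns <0, 0, or >0 for a pair of single-character strings.
const levels = [
  // Level 1: ASCII characters go first.
  (a, b) => Number(b.codePointAt(0) < 0x80) - Number(a.codePointAt(0) < 0x80),
  // Levels 2 and 3 (normalized and unnormalized letters) would slot in here.
  // Level 4: collation order (approximating the UCA).
  (a, b) => uca.compare(a, b),
  // Level 5: code point order, which always separates different code points.
  (a, b) => a.codePointAt(0) - b.codePointAt(0),
];

function compareButtons(a, b) {
  for (const level of levels) {
    const diff = level(a, b);
    if (diff !== 0) return diff; // this level distinguishes the two characters
  }
  return 0; // only reached when a and b are the same code point
}

// Sorts the characters of a string, one array element per code point.
function sortButtons(chars) {
  return Array.from(chars).sort(compareButtons).join("");
}
```

With these levels, "b" sorts before "é" because the ASCII level decides first, even though collation order alone would put "é" before "b".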

There can be exceptions to this. For example, decimal numbers have sorting turned off, which lines them up nicely in 10 columns.

We may tune this ordering of buttons further.

The ordering of the main categories and of the subcategories within the main ones also needs improvement.

Mockup Categorization

The current categories are derived as much as possible from the script and general_category Unicode properties. They are supplemented by several other sources:

Data Sources

    • The "subblocks", that is, the subheaders marked with @@ in NamesList.txt. While these are not intended as a categorization, they are pretty useful.

    • The kRSUnicode radical-stroke info from Unihan

      • The radical character is derived from U+2F00 + (radical number - 1), in the Kangxi Radicals block, for normal cases, and by looking at the ".0" data for the simplified radicals -- uggh.

    • To get archaic characters, Table 4 of UAX #31

    • A categorization of scripts by the continent where they originate (or subcontinent).

    • A full Hangul decomposition

    • Decomposition Type.

    • The renaming and reorganizing are table-driven as much as possible, using a list of regex strings and replacements. These renamings can change both main categories and subcategories.
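A table of regex strings and replacements like the one in the last bullet might look roughly as follows. This is an illustrative sketch only: the real table lives in the Java generator, the "Main/Sub" path format and the names RENAME_RULES and renameCategory are assumptions, and the second rule is made up to show a regrouping.

```javascript
// Hypothetical rule table: [pattern, replacement] pairs, applied in order.
const RENAME_RULES = [
  [/^Mark\b/, "General Diacritic"], // a renaming mentioned in this document
  [/^Symbol\/(Chess|Go)\b.*$/, "Symbol/Game Pieces"], // made-up grouping rule
];

// Applies every rule in turn to a "Main/Sub" category path (assumed format).
function renameCategory(path, rules = RENAME_RULES) {
  return rules.reduce(
    (name, [pattern, replacement]) => name.replace(pattern, replacement),
    path
  );
}
```

Because each rule sees the output of the previous one, a rule can change the main category, the subcategory, or both.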

Organization

The categories are generally groups of scripts, or General_Category values for characters whose script is Common or Inherited.

    • Letters with no script (like supplementary Math letters) are treated like symbols.

    • Symbols also add some categories based on decomposition_type, like Superscript, Subscript.

    • Archaic characters are in their own subcategories.

    • Hangul is subcategorized by the first component of its full decomposition.

    • Han is ordered by radical.

    • Invisibles include: format, whitespace, and others ([:default_ignorable_code_point:])
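To illustrate the Hangul rule in the list above: the first component of a syllable's full decomposition is its leading consonant jamo, which JavaScript can obtain through NFD normalization. The sketch below is an assumption about how such grouping could be expressed, not the generator's Java code, and the function names are hypothetical.

```javascript
// Returns the first code point of the full (NFD) decomposition,
// which for a precomposed Hangul syllable is its leading consonant jamo.
function leadingJamo(syllable) {
  return Array.from(syllable.normalize("NFD"))[0];
}

// Groups syllables into subcategories keyed by leading jamo,
// preserving the input order within each group.
function groupByLeadingJamo(syllables) {
  const groups = new Map();
  for (const s of syllables) {
    const key = leadingJamo(s);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(s);
  }
  return groups;
}
```

For example, 가 (U+AC00) and 각 (U+AC01) both fall under ᄀ (U+1100), while 나 (U+B098) falls under ᄂ (U+1102).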

There are some renamings for clarity, e.g. "Mark" => "General Diacritic".

    • Sometimes the subcategory is replaced by the subblock; sometimes it is combined with it.

    • These turn out to be too fine-grained, so we'll have to group them. For example, we might try organizing nonspacing marks by their CCC values (Above, Below,...), or putting all the game pieces (Chess, Go, etc.) together.

There are still problems with this categorization, so we'll need to tune it further. In particular, the use of blocks and subblocks for Symbols and Punctuation helps, but we need to play with the groupings. We probably also want to move some of the more unusual Latin letters into a separate category.

Data Format

The generated data is stored in the Mockup with the following format:

{{"Symbol"},

{"Arrows","↕↜↡↤↧↯⇐⇍⇑⇓⇕"},

{"Block Elements","▀"},

...

},

...

{{"Number"},

{"Decimal","0٠۰߀०০੦૦୦௦౦೦൦๐໐༠၀႐០᠐᥆᧐᭐᮰᱀᱐꘠꣐꤀꩐0"},

...

},

...

For JavaScript, this will be close to the same, but translated into JSON.
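One possible JSON translation, shown here as a JavaScript literal, keeps the same nesting: each main category is an array whose first element holds its name and whose remaining elements are [subcategory, characters] pairs. The exact JSON shape has not been decided, so this layout and the helper names are assumptions; the Decimal string is abbreviated.

```javascript
// Hypothetical JSON shape mirroring the mockup data above.
const pickerData = [
  [["Symbol"], ["Arrows", "↕↜↡↤↧↯⇐⇍⇑⇓⇕"], ["Block Elements", "▀"]],
  [["Number"], ["Decimal", "0٠۰"]],
];

// Names for the main category pull-down menu.
function mainCategories(data) {
  return data.map((group) => group[0][0]);
}

// Names for the subcategory pull-down menu of one main category.
function subcategories(data, main) {
  const group = data.find((g) => g[0][0] === main);
  return group ? group.slice(1).map(([name]) => name) : [];
}

// The (still encoded) character string behind one table of buttons.
function characterString(data, main, sub) {
  const group = data.find((g) => g[0][0] === main);
  const entry = group && group.slice(1).find(([name]) => name === sub);
  return entry ? entry[1] : "";
}
```

The picker would call mainCategories once, subcategories whenever the main menu changes, and characterString whenever a subcategory is picked.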

The string after the subcategory is an ordered list of characters. A non-PUA character represents itself. A PUA character U+E000 + N stands for the N code points that follow the preceding character, so "A\uE002" means ABC (the 2 characters following the A). This works because every run of consecutive buttons is shorter than the PUA range, and PUA characters themselves are excluded from the picker. Storing the data this way compresses the characters from ~100K down to ~8K. These figures do not represent actual bandwidth, which would depend on the encoding and the zipping. If necessary, still smaller storage might use deltas instead, which would work better for UTF-8 encoded JavaScript.
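The PUA range encoding can be sketched in a few lines of JavaScript. This is a minimal illustration under the assumptions that the PUA in question is U+E000..U+F8FF and that a PUA character never appears first in a string; the function names are hypothetical and the actual encoder is part of the Java generator.

```javascript
const PUA_START = 0xE000;
const PUA_END = 0xF8FF;

// Expands "A\uE002" into "ABC": a PUA character U+E000 + N adds the
// N code points that follow the previously emitted character.
function decodeRanges(encoded) {
  const out = [];
  let last = -1; // code point of the most recently emitted character
  for (const ch of encoded) {
    const cp = ch.codePointAt(0);
    if (cp >= PUA_START && cp <= PUA_END && last >= 0) {
      const count = cp - PUA_START;
      for (let k = 1; k <= count; k++) out.push(String.fromCodePoint(last + k));
      last += count;
    } else {
      out.push(ch);
      last = cp;
    }
  }
  return out.join("");
}

// Inverse operation: collapses runs of consecutive code points.
function encodeRanges(chars) {
  const cps = Array.from(chars, (c) => c.codePointAt(0));
  let out = "";
  for (let i = 0; i < cps.length; ) {
    let run = 0; // number of consecutive code points after cps[i]
    while (
      i + run + 1 < cps.length &&
      cps[i + run + 1] === cps[i + run] + 1 &&
      run < PUA_END - PUA_START
    ) {
      run++;
    }
    out += String.fromCodePoint(cps[i]);
    if (run > 0) out += String.fromCodePoint(PUA_START + run);
    i += run + 1;
  }
  return out;
}
```

Here decodeRanges("A\uE002") yields "ABC" and encodeRanges("ABC") yields "A\uE002"; characters that are not part of a run pass through unchanged.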

Character Appearance

Characters may have different appearances depending on the browser and font. For example, these characters may appear with different shapes (glyphs):

♔⌛⚖☤⚜⚝☸☚⛁🀰☃

In my browser they appear as in the following image:

If the user doesn't have a font, or if the browser doesn't do good fallback (finding a font with a glyph for the character if available on the user's system), then the user may see boxes or other representations instead of an appropriate glyph.