insert arbitrary Unicode characters, and would appreciate feedback.
The idea is to use a main category on the left as a pull-down menu. When
that main category is chosen, it populates a subcategory pull-down
menu in the center. Picking one of the subcategories fills in the
table of buttons, one per character. Clicking a button inserts the
character. Hovering over a button shows the code point. There is a mockup at http://macchiato.com/picker/MyApplication.html.
This mockup is purely for looking at the character categories and ordering, not the UI!
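The two-level menu and the hover text can be pictured with a small sketch. This is Python for illustration only, not the mockup's code; the name `PICKER_DATA`, the sample categories, and the helper functions are all hypothetical.

```python
# Hypothetical sample data: main category -> subcategory -> characters.
PICKER_DATA = {
    "Symbol": {
        "Arrows": "\u2190\u2191\u2192\u2193",
        "Currency": "$\u00a2\u00a3\u00a5\u20ac",
    },
    "Punctuation": {
        "Dash": "\u2010\u2013",
    },
}


def hover_text(ch):
    """Return the code point label shown on hover, e.g. U+20AC."""
    return f"U+{ord(ch):04X}"


def subcategories(main):
    """List the subcategories that populate the center pull-down."""
    return list(PICKER_DATA[main])


def buttons(main, sub):
    """Return (character, hover text) pairs, one per button."""
    return [(ch, hover_text(ch)) for ch in PICKER_DATA[main][sub]]
```

Choosing "Symbol" would offer "Arrows" and "Currency" in the second menu, and choosing "Currency" would produce five buttons.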
Again because of data size, we are excluding the following:
Data size is important, so we wouldn't have names for the characters, or be
able to search by name. However, we can add names for a small number of
characters, exposed when the user hovers over the button. These are
currently the following:
The categorization does not need to be a partition. That is, a character
can be in multiple categories if it makes it easier to find.
If a user doesn't have fonts that cover all the characters, then boxes
will show up. It is too difficult to filter those out, because they may
vary according to browser capabilities and loaded fonts.
- We may exclude more, such as Han characters. This can be controlled via switches.
Each set of buttons should contain somewhere around 100 items. They should be at most about 1,000.
In most cases, the buttons should be sorted. However, for scripts where the order is not that important or code point order is good enough, they may be left unsorted
to save on data size.
The data for the categorization and the ordering of buttons is generated by
a Java program, which has full access to Unicode properties via
UnicodeSet and other classes in ICU4J. In particular, the goals are that:
- The data can be updated to new versions of Unicode easily.
- The data can be customized to exclude or include certain categories of characters at build time.
- More needs to be done on the categorization and the ordering of buttons, so feedback is welcome - please send to email@example.com (Subject: Picker ...)
- Some of the subcategories show up empty (no buttons) in the Mockup.
- There is no effort to make the mockup look pretty - it is only for trying out the categories and ordering.
- The blowup of a character has a link to the Unicode utilities or Unihan database; again, this is purely for testing.
The current sorting of buttons is according to the following levels, in order.
- Normalized letters (NFKD)
- Unnormalized letters
- Unicode Collation Algorithm Order
- Code point order
A level is skipped if the characters have the same value at that level.
For example, if one character is ASCII and the other isn't, the ASCII
goes first. If both are or are not ASCII, then the next level is
tested, and so on. The final level will always distinguish different characters.
There can be exceptions to this. For example,
decimal numbers have sorting turned off (that organizes them nicely
with 10 columns).
We may tune this ordering of buttons further.
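A minimal sketch of the level-by-level comparison, in Python with the standard library only (the real generator is Java with ICU4J). Several things here are assumptions: the ASCII level is taken from the example above, the NFKD level is read as "compare the NFKD forms", and the Unicode Collation Algorithm level is left out because the Python standard library has no UCA implementation. The function names are hypothetical.

```python
import unicodedata


def sort_key(ch):
    """Build a tuple key; earlier elements are compared first.

    A later element is only consulted when all earlier ones are equal,
    which is the "skip the level if the values are the same" rule.
    """
    return (
        0 if ord(ch) < 0x80 else 1,          # ASCII before non-ASCII
        unicodedata.normalize("NFKD", ch),   # assumed meaning of the NFKD level
        ord(ch),                             # code point: always distinguishes
    )


def order_buttons(chars, sort=True):
    """Return the button order; sort=False keeps the input order,
    as for decimal numbers, which have sorting turned off."""
    return sorted(chars, key=sort_key) if sort else list(chars)
```

With this key, `a` and `b` come first, and among the non-ASCII characters superscript two, fullwidth a, and e-acute are ordered by their NFKD forms `2`, `a`, and `e` plus combining acute.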
The ordering of the main categories and of the subcategories within the main ones also needs improvement.
The current categories are derived as much as possible from the script and general_category Unicode properties. They are supplemented by several other sources:
- The "subblocks", that is, the subheaders marked with @@ in NamesList.txt. While these are not intended as a categorization, they are pretty useful.
- The kRSUnicode radical-stroke info from Unihan
- The radical character is derived from U+2F00 + (radical number - 1), in the Kangxi Radicals block, for normal cases, and from looking at the ".0" data for the simplified radicals -- uggh.
- To get archaic characters, Table 4 of UAX #31
- A categorization of scripts by the continent where they originate (or subcontinent).
- A full Hangul decomposition
- Decomposition Type.
- The renaming and reorganizing are table-driven as much as possible, using a list of regex strings and replacements. These renamings can change both main categories and subcategories.
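The table-driven renaming in the last item might look roughly like the following sketch. It is written in Python rather than the generator's Java, and the rule table, the "main/sub" label format, and the function name are hypothetical; the rules are sample data, not the actual table.

```python
import re

# Hypothetical rule table: (regex, replacement), applied in order to "main/sub".
RENAME_RULES = [
    (re.compile(r"^Mark/"), "General Diacritic/"),
    # A rule may also move a subcategory to a different main category.
    (re.compile(r"^Letter/(Math.*)$"), r"Symbol/\1"),
]


def rename(main, sub):
    """Apply every rule to the combined label, then split it again."""
    label = f"{main}/{sub}"
    for pattern, replacement in RENAME_RULES:
        label = pattern.sub(replacement, label)
    new_main, _, new_sub = label.partition("/")
    return new_main, new_sub
```

Matching against the combined label is what lets a single rule change the main category, the subcategory, or both.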
The categories are generally groups of scripts, or by General_Category values for common or inherited script.
- Letters with no script (like supplementary Math letters) are treated like symbols.
- Symbols also add some categories based on decomposition_type, like Superscript, Subscript.
- Archaic characters are in their own subcategories.
- Hangul is subcategorized by first component of full decomposition
- Han is ordered by radical.
- Invisibles include: format, whitespace, and others ([:default_ignorable_code_point:])
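Two of the rules in this list can be tried out with Python's `unicodedata` module, as a rough stand-in for the ICU4J properties. The function names are hypothetical. Note that the Python standard library does not expose the Script property, so the script-based grouping is not shown.

```python
import unicodedata


def hangul_subcategory(syllable):
    """Return the first component (leading consonant jamo) of the
    full canonical decomposition of a Hangul syllable."""
    return unicodedata.normalize("NFD", syllable)[0]


def decomposition_type(ch):
    """Return the compatibility decomposition tag (e.g. 'super', 'sub'),
    'canonical' for an untagged decomposition, or None if there is none."""
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return None
    if mapping.startswith("<"):
        return mapping[1:mapping.index(">")]
    return "canonical"
```

For example, the syllable U+D55C decomposes to U+1112 U+1161 U+11AB, so it would be filed under U+1112; superscript two reports `super` and subscript two reports `sub`.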
There are some renamings for clarity, e.g. "Mark" => "General Diacritic".
- Sometimes the subcategory is replaced by the subblock; sometimes it is combined with it.
- These turn out to be too fine-grained, so we'll have to group them. For example, we might try organizing nonspacing marks by their CCC values (Above, Below,...), or putting all the game pieces (Chess, Go, etc.) together.
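As a rough illustration of the CCC idea, the sketch below groups marks by canonical combining class using Python's `unicodedata`. Only the two most common classes are named (230 = Above, 220 = Below); the fallback label and the function name are hypothetical, and this is one possible grouping, not a decision.

```python
import unicodedata

# Names for the two most common canonical combining classes.
CCC_NAMES = {230: "Above", 220: "Below"}


def group_marks(marks):
    """Group combining marks by canonical combining class (CCC)."""
    groups = {}
    for ch in marks:
        ccc = unicodedata.combining(ch)
        name = CCC_NAMES.get(ccc, f"Other (ccc={ccc})")
        groups.setdefault(name, []).append(ch)
    return groups
```

Here the combining acute and grave accents (U+0301, U+0300) land in "Above", and the combining dot below (U+0323) lands in "Below".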
There are still problems with this categorization, so we'll need to tune it further. In
particular, the use of Blocks and subblocks for Symbols and Punctuation helps, but we need to play with the groupings. We probably also want to
separate some of the bizarro Latin letters into a separate category.
The generated data is stored in the Mockup with the following format:
The string after the subcategory is an ordered list of characters. A
non-PUA character represents itself. A PUA character means a range of
count N following the last character: so "A\uE002" means ABC
(2 characters following the A). This can be done because all of the
button ranges are smaller than the PUA range. Storing it this way
compresses the characters from ~100K down to ~8K. This wouldn't
represent actual bandwidth, which would depend on the encoding and the
zipping. If necessary, still smaller storage might use deltas instead,
Characters may have different appearances depending on the browser and font. For example, these characters may appear with different shapes (glyphs):
In my browser they appear as in the following image:
If the user doesn't have a font, or if the browser doesn't do good fallback (finding a font with a glyph for the character if available on the user's system), then the user may see boxes or other representations instead of an appropriate glyph.