Globalization Gotchas
I'm preparing a presentation for the next Unicode conference in March, and have been thinking about doing one on the pitfalls that people stumble into when using Unicode in globalizing software.
I have the following draft list, and would like to collect others. If you have any suggestions for additions or improvements, I would appreciate them. (At this point I'm not worried about grammar, spelling, etc. And these are only bullet items; they would have some patter along with them.)
Unicode
Unicode encodes characters, not glyphs: U+0067 → g g g g g g g g g g g g g...
Unicode does not encode characters by language: French, German, English j have the same code point even though all have different pronunciations; Chinese 大 (da) has the same code point as Japanese 大 (dai).
The word character means different things to different people: glyphs, code points, bytes, code units, or user-perceived characters (grapheme clusters). Say which one you mean.
Some APIs/protocols will count lengths in code points, and others in bytes (or other code units). Make sure you don't mix them up.
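For instance, in Java the same one-code-point string has three different "lengths" depending on which unit you count (a minimal sketch; the class name is just for illustration):

```java
import java.nio.charset.StandardCharsets;

public class LengthDemo {
    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF: one code point, stored as a surrogate pair
        String s = "\uD834\uDD1E";

        System.out.println(s.length());                                 // 2 UTF-16 code units
        System.out.println(s.codePointCount(0, s.length()));            // 1 code point
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);  // 4 UTF-8 bytes
    }
}
```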
Character and block names may be misleading: U+034F COMBINING GRAPHEME JOINER doesn't join graphemes.
Never use unassigned code points; those will be used in future versions of Unicode. If you need your own characters, use private use or non-characters; there are plenty!
Be prepared to handle (at least not corrupt!) any incoming code points from U+0000 to U+10FFFF: if your system is running a back-level version of Unicode, you may receive transmitted unassigned code points from later versions.
Watch for "UCS-2" implementations. They use UTF-16 text, but don't support characters above U+FFFF; they may also accidentally produce isolated surrogates.
Don't limit API parameters to a single character (and definitely not to a single code unit!). What users think of as a single character (e.g. ẍ, ch) may be a sequence in Unicode.
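Counting user-perceived characters takes an iterator, not a length field. A sketch with Java's standard BreakIterator (class name illustrative; assumes the JDK's character-break rules, which treat base + combining mark as one grapheme):

```java
import java.text.BreakIterator;
import java.util.Locale;

public class GraphemeDemo {
    public static void main(String[] args) {
        String s = "e\u0301";  // "é" as e + COMBINING ACUTE ACCENT: two code points

        BreakIterator bi = BreakIterator.getCharacterInstance(Locale.ROOT);
        bi.setText(s);
        int graphemes = 0;
        while (bi.next() != BreakIterator.DONE) graphemes++;

        System.out.println(s.length());  // 2 code units
        System.out.println(graphemes);   // 1 user-perceived character
    }
}
```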
Use U+2060 word joiner instead of U+FEFF for everything but the BOM function.
Use the latest version of Unicode: as well as newly added characters, it contains many corrections, and more processes are guaranteed to be stable.
Avoid using private-use (PUA) characters. If you simply must use them, minimize the opportunity for collision by picking an unusual range.
Character Conversion
text ↔ 74 65 78 74
Length in bytes may not be N * length in characters
One character in charset X may not be one character in Unicode; the ordering may also be different.
UTF-8, UTF-16, and UTF-32 are all equally Unicode: each encoding form can represent every Unicode code point.
Always use "shortest form" UTF-8. (1) It's the Law. (2) It reduces security attacks.
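The classic attack is the overlong encoding of '/' used to smuggle path separators past validators. A conformant decoder must reject it, as Java's does (class name illustrative):

```java
import java.nio.charset.StandardCharsets;

public class OverlongDemo {
    public static void main(String[] args) {
        // 0xC0 0xAF is a non-shortest-form ("overlong") encoding of '/' (U+002F)
        byte[] overlong = { (byte) 0xC0, (byte) 0xAF };

        // A conformant decoder treats it as malformed, not as '/':
        String decoded = new String(overlong, StandardCharsets.UTF_8);
        System.out.println(decoded.contains("/"));       // false
        System.out.println(decoded.contains("\uFFFD"));  // true: replaced
    }
}
```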
If a protocol allows a choice of charsets, always tag with the correct one. Not all text is correctly tagged with its charset, so character detection may sometimes be necessary. But remember, it's always a guess.
IANA / MIME charset names are ill-defined: vendors often convert the same charset in different ways. Shift-JIS 0x5C → U+005C or U+00A5: different, unrelated characters with unrelated glyphs.
When converting, never simply omit characters that cannot be converted; at least substitute U+FFFD (when converting to Unicode) or 0x1A (when converting to bytes) to reduce security problems.
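With Java's CharsetEncoder you can make the substitution explicit rather than silently dropping data (class name illustrative; encoding "aé" to US-ASCII as an example):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

public class SubstituteDemo {
    public static void main(String[] args) throws CharacterCodingException {
        CharsetEncoder enc = Charset.forName("US-ASCII").newEncoder()
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith(new byte[] { 0x1A });  // SUB, not silent omission

        ByteBuffer out = enc.encode(CharBuffer.wrap("a\u00E9"));  // "aé"
        System.out.println(out.remaining());     // 2 bytes: nothing was dropped
        System.out.println(out.get(1) == 0x1A);  // true: é became SUB
    }
}
```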
User Input
keyboard → text
If you develop your own text editor, use the OS APIs to handle IMEs (Input Method Engines) for Chinese, Japanese, Korean,...
If you are using "type-ahead" to get to a position in a list (eg typing "Jo" gets to the first element starting with those characters), allow arbitrary input. This is often easiest with visible fields.
If your password field can contain characters that require an IME, a screen pop-up box may reveal the password to onlookers.
Text Analysis
text → i
Segmentation
Combining characters always follow their base: ä is a + ◌̈, not ◌̈ + a.
Words are not just sequences of letters.
Properties
Use properties such as Alphabetic, not hard-coded lists: isAlphabetic(), \p{Alphabetic} in regex
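In Java, both the regex engine and the Character class expose the binary Alphabetic property directly (class name illustrative):

```java
import java.util.regex.Pattern;

public class PropertyDemo {
    public static void main(String[] args) {
        // \p{IsAlphabetic} is the Unicode Alphabetic property, not [a-zA-Z]
        Pattern alpha = Pattern.compile("\\p{IsAlphabetic}+");

        System.out.println(alpha.matcher("\u03A9\u03BC\u03AD\u03B3\u03B1").matches()); // Ωμέγα: true
        System.out.println(alpha.matcher("abc123").matches());                         // false: digits aren't Alphabetic
        System.out.println(Character.isAlphabetic('\u03A9'));                          // true
    }
}
```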
Some General Category properties aren't what you think: use White_Space (not Zs), Alphabetic (not L), Lowercase (not Ll),...
Generally use Script instead of Block: not all Greek characters are in the Greek block. Many characters (punctuation, symbols, accents) used to write a language are shared (Script=Common or Inherited).
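A concrete case, using Java's standard Script and Block lookups (class name illustrative): ἀ (U+1F00) is Script=Greek but lives in the Greek Extended block, not the Greek block.

```java
public class ScriptVsBlock {
    public static void main(String[] args) {
        int ch = 0x1F00;  // ἀ GREEK SMALL LETTER ALPHA WITH PSILI

        // The Script property says "Greek"; the Block is "Greek Extended"
        System.out.println(Character.UnicodeScript.of(ch));  // GREEK
        System.out.println(Character.UnicodeBlock.of(ch));   // GREEK_EXTENDED
    }
}
```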
Characters may change property values between versions of Unicode (except for specific ones).
Identifiers
When designing syntax, don't use characters outside of Pattern_Syntax for syntax characters; don't use characters outside of Pattern_Whitespace for whitespace characters.
For user-visible identifiers, use XID_Start and XID_Continue as a basis. Profiles may expand or narrow from there.
Watch out for security attacks that use visual similarity to slip bogus text (e.g. counterfeit URLs) past human eyes (spoofing): writing “paypal.com” with a Cyrillic “а” to phish for users’ account information.
Comparison (Collation): Searching, Sorting, Matching
There are two binary orders: code point/UTF-8/UTF-32 order and UTF-16 order, where U+10000 < U+E000 (since U+10000 is encoded as D800 DC00)
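Java strings compare in UTF-16 code unit order, so the two orders disagree exactly as described (class name illustrative):

```java
public class BinaryOrders {
    public static void main(String[] args) {
        String supp = "\uD800\uDC00";  // U+10000, stored as a surrogate pair
        String pua  = "\uE000";        // U+E000

        // UTF-16 code unit order: U+10000 sorts first, since 0xD800 < 0xE000
        System.out.println(supp.compareTo(pua) < 0);                   // true

        // Code point order (matches UTF-8/UTF-32 byte order): U+E000 sorts first
        System.out.println(supp.codePointAt(0) > pua.codePointAt(0));  // true
    }
}
```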
Only use binary order internally; no user expects A < Z < a < z < Ç < ä. Apply normalization to get a unique form, so that C◌̧ = Ç.
UCA order is a far better base than binary order for meeting user expectations: a < A < ä < Ç = C◌̧ < z < Z
Ordering depends on context and language:
china < China < chinas < danish
ae < æ < af
z < æ (Danish)
c < d < ... h < ch < i (Slovak)
► http://www.unicode.org/reports/tr10/#Common_Misperceptions
Real language-sensitive order requires tailoring on top of UCA
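Java's Collator is UCA-based and locale-tailored, and its results differ visibly from binary String comparison (a small sketch; class name illustrative, and the Danish result depends on the JDK's locale data):

```java
import java.text.Collator;
import java.util.Locale;

public class CollationDemo {
    public static void main(String[] args) {
        // Binary (code unit) order puts all ASCII uppercase before lowercase:
        System.out.println("a".compareTo("Z") > 0);  // true: 'a' (97) is after 'Z' (90)

        // A UCA-based Collator gives the order users expect:
        Collator en = Collator.getInstance(Locale.ENGLISH);
        System.out.println(en.compare("a", "Z") < 0);  // true: a before Z

        // Tailoring: Danish sorts æ after z (assuming Danish locale data is present)
        Collator da = Collator.getInstance(new Locale("da"));
        System.out.println(da.compare("\u00E6", "z"));
    }
}
```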
Don't mix up "stable" and "deterministic" sorting; they are very different.
Protocols must precisely define the comparison operations: LDAP doesn't specify the comparison operation, so lookup may fail (or falsely succeed!) because of that. Aside from getting the wrong results, this opens up opportunity for security attacks.
Text Transformations
text → TEXT, τεξτ, …
Normalization
The ordering of accents in a normalization form may not be the typical type-in order.
Normalization does not distribute over concatenation; don't assume NFC(x + y) = NFC(x) + NFC(y)
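Java's standard Normalizer shows the failure directly (class name illustrative): normalizing the parts separately leaves a combining mark that would have composed across the boundary.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

public class NfcConcatDemo {
    public static void main(String[] args) {
        String x = "e", y = "\u0301";  // COMBINING ACUTE ACCENT

        String whole = Normalizer.normalize(x + y, Form.NFC);  // composes to "é" (U+00E9)
        String parts = Normalizer.normalize(x, Form.NFC)
                     + Normalizer.normalize(y, Form.NFC);      // stays "e" + combining accent

        System.out.println(whole.equals("\u00E9"));  // true
        System.out.println(whole.equals(parts));     // false
    }
}
```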
People assume that NFC always composes, but some characters decompose in NFC.
Here are the current maximal expansion factors for each form (U4.1):
Trivia: In Unicode 4.0 there are exactly 3 characters that are different in all 4 normalization forms: ϓ, ϔ, ẛ
Case Conversion
As well as lowercase and uppercase, there is also titlecase: dz ↔ DZ ↔ Dz
Strings may expand: heiß → HEISS → heiss.
Here is a table of the maximum possible expansions (U4.1):
Casing is context-dependent: ΌΣΟΣ → όσος
Casing may be language-dependent: istanbul ↔ İSTANBUL.
(But don't use language-dependent casing for language-independent structures, like file-system B-Trees.) Here is a table that shows the upper and lowercasing behavior of Turkic vs Normal case mappings:
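Both effects (expansion and language dependence) show up directly in Java's standard casing APIs (class name illustrative):

```java
import java.util.Locale;

public class CaseDemo {
    public static void main(String[] args) {
        // Expansion: ß has no single uppercase letter, so the string grows
        System.out.println("hei\u00DF".toUpperCase(Locale.ROOT));      // HEISS (4 chars -> 5)

        // Language dependence: Turkish maps i to dotted İ (U+0130)
        System.out.println("istanbul".toUpperCase(new Locale("tr")));  // İSTANBUL
        System.out.println("istanbul".toUpperCase(Locale.ROOT));       // ISTANBUL
    }
}
```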
Up to Unicode 5.0 (not yet released), case folding was not stable. (Two versions of Unicode could have different results from toCaseFold(S), even though all the characters in S are in both versions.)
Don't use the Lowercase_Letter (Ll) or Uppercase_Letter (Lu) values of General_Category; these were constrained to form a partition. Use the separate binary properties Lowercase and Uppercase instead.
There are two different types of lowercase:
Lowercase, the binary property. The character is lowercase in form, but not necessarily in function.
Functionally Lowercase. isCased(x) & isLowercase(x). See Section 3.13 of TUS.
Also three corresponding types of uppercase:
Uppercase and titlecase overlap:
Transliteration
Transliteration (Ελληνικά ↔ Ellēniká) is not the same as Translation (Ελληνικά ↔ Greek)
Transliteration may vary by language: Путин ↔ Putin, Poutine, ...
Горбачёв ↔ Gorbachev, Gorbacev, Gorbatchev, Gorbačëv, Gorbachov, Gorbatsov, Gorbatschow, ...
Transcription is a lossy transliteration: Ελληνικά → Ellinika → Ελλινικα
Rendering
text → text → i
Rendering is context-dependent, and dependent on the font: going character-by-character gives the wrong results.
Glyphs may change shape depending on their surroundings:
A single glyph may result from multiple characters:
Multiple glyphs may result from a single character:
The memory storage order (logical order) may not be the same as the visual order. Don't assume that contiguous text = contiguous display:
Good rendering systems will handle customary type-in order for text plus canonical order. Excellent ones will do any canonically-equivalent order, but those are rare.
There may be differences in the customary glyphs for different languages; specify the font or the language where they have to be distinguished.
Never render a missing glyph as "?": it can cause security problems.
Combining characters normally stack outwards: ā́ is a + ◌̄ + ◌́, not a + ◌́ + ◌̄. Don't simply overlay diacritics: it can cause security problems.
Line breaking is not just breaking at spaces
Globalization
in de_DE: 1.234,00£ ↔ <GBP, 0.10011010010×2¹¹>
Unicode ≠ Globalization. Unicode provides the basis for software globalization, but there's more work to be done...
Don't simply concatenate strings to make messages: the order of components differs by language. Use Java MessageFormat or equivalent.
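With MessageFormat the pattern, not the code, controls argument order, so translators can reorder freely (class name and patterns illustrative):

```java
import java.text.MessageFormat;

public class MessageDemo {
    public static void main(String[] args) {
        // Same arguments, reordered per language by the pattern alone:
        String en = MessageFormat.format("{0} files in {1}", "3", "tmp");
        String de = MessageFormat.format("In {1} sind {0} Dateien", "3", "tmp");

        System.out.println(en);  // 3 files in tmp
        System.out.println(de);  // In tmp sind 3 Dateien
    }
}
```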
Don't put any translatable strings into your code; make sure those are separated into a resource file.
Don't assume everyone can read the Latin alphabet. Don't assume icons and symbols mean the same around the world.
Store and transmit neutral-format data wherever possible. Convert that data to the user's preferred formats as "close" to the user as possible. Eg, use Windows Filetime for binary times.
► http://icu.sourceforge.net/userguide/universalTimeScale.html
Tag all data explicitly. Trying to algorithmically determine character encoding and language isn't easy, and can never be exact.
Don't confuse User-Interface language (menus, dialog, help-system,...) with Data language (body text, spreadsheet cells). Globalized programs need to handle, as data, more languages than they have localized UI for.
Formatting and parsing of dates, times, numbers, currencies, ... are locale-dependent; even calendar systems may vary. Use globalization APIs that use appropriate data.
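Even plain number formatting flips separators between locales, as a quick Java sketch shows (class name illustrative; exact output assumes the JDK's standard locale data):

```java
import java.text.NumberFormat;
import java.util.Locale;

public class FormatDemo {
    public static void main(String[] args) {
        double n = 1234567.89;
        // Grouping and decimal separators swap roles between these two locales:
        System.out.println(NumberFormat.getNumberInstance(Locale.US).format(n));      // 1,234,567.89
        System.out.println(NumberFormat.getNumberInstance(Locale.GERMANY).format(n)); // 1.234.567,89
    }
}
```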
Where OS facilities are not adequate or cross-platform solutions are needed, use ICU (International Components for Unicode) for C, C++, and Java. People have built wrappers for other languages, too.
Locale data typically uses "fallback": if the data isn't found in en_US, look in en. Beware of discrepancies between how this is done in different systems: Java ResourceBundle (J2SE), Java Standard Tag Library (JSTL), JavaServer Faces (JSF), Apache HTTP,...
English is a relatively compact language; others may require more characters (eg in database fields) and more screen real estate (in UIs). Allocate space flexibly.
Identification
Use RFC 3066 (or its successor) for language IDs, not ISO 639 alone.
Locale IDs are extensions of language IDs, they are not the same. Use CLDR for locale IDs.
If you heuristically compute territory IDs, timezone IDs, currency IDs, etc. (eg from browser settings) make sure the user can override that and pick an explicit value.
Always use an explicit currency ID (ISO 4217). Don't assume the currency is the same as the display locale: <RUR, 1.23457×10³> ↔ 1 234,57р. in Russian, but Rub 1,234.57 in English.
Don't assume that a country always has the same currency.
Don't assume the timezone ID is implied by the user's locale. For the best timezone information, use the TZ database; use CLDR for timezone names.
Computations with mixed timezones or missing timezones are tricky (XML Schema is missing real timezones: you'll have to roll your own datatype).
Be prepared for instabilities in currencies, territories, and timezones. Unfortunately, you also need to worry about instabilities in the IDs themselves: eg, ISO reused CS for two different territories.
Java
In MessageFormat, watch for words like can't, since ASCII ' has syntactic meaning. Use a real apostrophe (U+2019) where possible: can’t.
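The ASCII apostrophe silently disables the placeholder, which is easy to miss (class name illustrative):

```java
import java.text.MessageFormat;

public class ApostropheDemo {
    public static void main(String[] args) {
        // ASCII ' starts a quoted section, so the placeholder is never replaced:
        System.out.println(MessageFormat.format("can't open {0}", "f.txt"));      // cant open {0}

        // Either double it, or use the real apostrophe U+2019:
        System.out.println(MessageFormat.format("can''t open {0}", "f.txt"));     // can't open f.txt
        System.out.println(MessageFormat.format("can\u2019t open {0}", "f.txt")); // can’t open f.txt
    }
}
```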
In Date and Calendar, the months are numbered from 0 (February is month number 1!). However, weeks and days are numbered from 1.
Java serialized text isn't UTF-8, though it's close. U+0000 and supplementary code points are encoded differently.
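DataOutputStream.writeUTF shows the difference for U+0000: real UTF-8 uses the single byte 0x00, while Java's "modified UTF-8" writes the two-byte sequence C0 80 (class name illustrative):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class ModifiedUtf8Demo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        new DataOutputStream(baos).writeUTF("\u0000");
        byte[] javaForm = baos.toByteArray();  // 2-byte length prefix, then the data

        // Real UTF-8 encodes U+0000 as one byte...
        System.out.println("\u0000".getBytes(StandardCharsets.UTF_8).length);  // 1

        // ...but writeUTF emits the overlong pair C0 80 so no 0x00 appears in the data
        System.out.printf("%02X %02X%n", javaForm[2], javaForm[3]);            // C0 80
    }
}
```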
Java globalization support is pretty outdated: use ICU to supplement it.
Java ResourceBundle (J2SE), Java Standard Tag Library (JSTL), JavaServer Faces (JSF), Apache HTTP server, etc. all provide some locale determination mechanism and facility; but they all differ in details.
JavaScript
Always encode characters above U+007F with escapes (\uxxxx).
There is an HTML mechanism to specify the charset of the JavaScript source, but it is not widely implemented. The JDK tool native2ascii can be used to convert the files to use escapes.