NFC FAQ

What is NFC?

For various reasons, Unicode sometimes has multiple representations of the same character. For example, each of the following sequences (the first two being single-character sequences) represents the same character:

    1. U+00C5 ( Å ) LATIN CAPITAL LETTER A WITH RING ABOVE

    2. U+212B ( Å ) ANGSTROM SIGN

    3. U+0041 ( A ) LATIN CAPITAL LETTER A + U+030A ( ̊ ) COMBINING RING ABOVE

These sequences are called canonically equivalent. The first of these forms is called NFC - for Normalization Form C, where the C is for composition. For more information, see the introduction of UAX #15: Unicode Normalization Forms. A function transforming a string S into the NFC form can be abbreviated as toNFC(S), while one that tests whether S is in NFC is abbreviated as isNFC(S).
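
As a minimal sketch in Java (using the JDK's built-in java.text.Normalizer; libraries such as ICU provide equivalent APIs), these two functions might look like this:

    import java.text.Normalizer;

    public class NfcBasics {

        // toNFC(S): returns the NFC form of S.
        static String toNFC(String s) {
            return Normalizer.normalize(s, Normalizer.Form.NFC);
        }

        // isNFC(S): tests whether S is already in NFC.
        static boolean isNFC(String s) {
            return Normalizer.isNormalized(s, Normalizer.Form.NFC);
        }

        public static void main(String[] args) {
            String angstrom = "\u212B";           // U+212B ANGSTROM SIGN
            System.out.println(isNFC(angstrom));  // false
            System.out.println(toNFC(angstrom));  // Å, i.e. U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
        }
    }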

Can a string in NFC have any combining marks (accents)?

Yes. In general, normalizing to NFC creates precomposed characters where possible. For example, the sequence <A + grave> is transformed into À:

Source Sequence:
    U+0041 ( A ) LATIN CAPITAL LETTER A
    U+0300 ( ̀ ) COMBINING GRAVE ACCENT

NFC:
    U+00C0 ( À ) LATIN CAPITAL LETTER A WITH GRAVE

However:

    • There may be no suitable "base character" for an accent to attach to (consider an isolated grave combining mark, or one after a Q).

    • In some circumstances, combining marks may actually separate from a base character (see the next question).
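
A quick way to see both of the behaviors above is a minimal sketch using java.text.Normalizer; the values in the comments are what the calls return:

    import java.text.Normalizer;
    import java.text.Normalizer.Form;

    public class ComposeWherePossible {
        public static void main(String[] args) {
            // <A + grave> composes to the single character U+00C0 (À) ...
            System.out.println(Normalizer.normalize("A\u0300", Form.NFC).length());  // 1
            // ... but there is no precomposed Q-with-grave, so the mark stays,
            // and the two-character sequence is itself valid NFC.
            System.out.println(Normalizer.normalize("Q\u0300", Form.NFC).length());  // 2
            System.out.println(Normalizer.isNormalized("Q\u0300", Form.NFC));        // true
        }
    }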

What are the cases where combining marks separate?

These cases are listed as composition exclusions in the Unicode Character Database. In particular, composition is frozen to a specific version of Unicode to maintain backwards compatibility, so newly added precomposed characters do not compose except in special circumstances.

In addition, there are cases where the addition of a combining character causes a precomposed character to break apart, such as the following:

Source Sequence:
    U+1EA5 ( ấ ) LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE
    U+0328 ( ̨ ) COMBINING OGONEK

NFC:
    U+0105 ( ą ) LATIN SMALL LETTER A WITH OGONEK
    U+0302 ( ̂ ) COMBINING CIRCUMFLEX ACCENT
    U+0301 ( ́ ) COMBINING ACUTE ACCENT
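
A small sketch using java.text.Normalizer shows the precomposed character breaking apart; it simply prints the code points of the NFC result:

    import java.text.Normalizer;
    import java.text.Normalizer.Form;

    public class PrecomposedBreaksApart {
        public static void main(String[] args) {
            String source = "\u1EA5\u0328";  // ấ + COMBINING OGONEK
            String nfc = Normalizer.normalize(source, Form.NFC);
            // Prints "0105 0302 0301": the ogonek composed with the underlying a,
            // while the circumflex and acute are left as combining marks.
            nfc.codePoints().forEach(cp -> System.out.printf("%04X ", cp));
            System.out.println();
        }
    }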

What about stability?

Normalizing to NFC is subject to stringent stability requirements to maintain backwards compatibility. For more information, see Normalization Stability Policies.

Does normalizing to NFC have any effect if it can't compose characters?

Yes. Normalizing to NFC also puts certain sequences of combining marks into a well-defined order. For example, the combining marks are reordered below.

Source Sequence:
    U+0051 ( Q ) LATIN CAPITAL LETTER Q
    U+0300 ( ̀ ) COMBINING GRAVE ACCENT
    U+0323 ( ̣ ) COMBINING DOT BELOW

NFC:
    U+0051 ( Q ) LATIN CAPITAL LETTER Q
    U+0323 ( ̣ ) COMBINING DOT BELOW
    U+0300 ( ̀ ) COMBINING GRAVE ACCENT
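
The reordering can be observed the same way; the dot below has Canonical_Combining_Class 220 and the grave has 230, so the dot below sorts first. A minimal sketch:

    import java.text.Normalizer;
    import java.text.Normalizer.Form;

    public class CombiningMarkOrder {
        public static void main(String[] args) {
            String source = "Q\u0300\u0323";  // Q, grave (CCC 230), dot below (CCC 220)
            String nfc = Normalizer.normalize(source, Form.NFC);
            // Prints "0051 0323 0300": the dot below (CCC 220) now precedes the grave (CCC 230).
            nfc.codePoints().forEach(cp -> System.out.printf("%04X ", cp));
            System.out.println();
        }
    }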

Is normalizing to NFC lossy?

No. Unicode considers canonically equivalent sequences to be fundamentally the same -- the fact that there are multiple representations is an artifact of the encoding. Normalizing to NFC preserves that canonical equivalence. Even CJK compatibility ideographs are canonical variants of the corresponding 'ordinary' unified ideograph, in the sense that either character could appear with either glyphic form. As a matter of fact, the glyphic shapes in the source standards (e.g., JIS) have changed over time. The Unicode Consortium does recognize that particular glyphic shapes are sometimes important, and has developed a much more comprehensive mechanism to deal with that: see UTS #37: Ideographic Variation Database.

For a given user, the appearance of a character may change after NFC normalization because of deficient font or rendering support. For example, the user may have a font with a glyph for U+00C5 ( Å ) LATIN CAPITAL LETTER A WITH RING ABOVE, but not for U+212B ( Å ) ANGSTROM SIGN. If an identifier containing the angstrom sign is normalized to NFC, what was formerly a box or missing glyph will then display properly. A font is called NFC-Complete if, whenever it has appropriate glyphs for the characters in a string S, it also has appropriate glyphs for the characters in toNFC(S). Ideally every font should be NFC-Complete, because it takes only a small amount of work to add the extra cmap entries, plus some ligatures for the few cases where characters decompose.

Note: NFC is sometimes confused with NFKC, which is considered lossy, and is more appropriate for loose matching.

How much text is already NFC?

This really depends on the kind of text involved. For example, according to data at Google:

    • ~99.98% of web HTML page content characters are definitely NFC.

      • Content means after discarding markup, and doing entity resolution.

    • ~99.9999% of web HTML page markup characters are definitely NFC.

      • Because so much of markup is plain ASCII.

Of course, this represents data from just a sample of the web (in this case, the total characters in 700,000,000 documents as of January 23, 2009). Proportions may differ somewhat depending on which sample is chosen, and may change over time as the percentage of languages on the web changes. Normalization is important for some languages like Vietnamese and other languages of South and Southeast Asia even though their percentage of total text is small.

Why does the above say "definitely NFC"?

Because characters can be divided into three classes:

    • those that are always valid in NFC,

    • those that are never valid in NFC, and

    • those that may or may not be valid in NFC, depending on the context.

This information is captured in the NFC_Quick_Check (NFC_QC) property, found in DerivedNormalizationProps.txt in the Unicode Character Database. For the figures gathered above, only the characters that are always valid in NFC were counted. The actual proportion of NFC text will be even higher, because some percentage of the "maybe" cases will also occur in a context that makes them valid NFC.

What's an example of a character that may or may not be valid in NFC?

The character U+030A ( ̊ ) COMBINING RING ABOVE.

    • The string <U+0031 ( 1 ) DIGIT ONE, U+030A ( ̊ ) COMBINING RING ABOVE> is valid NFC, because the ring above doesn't combine with the 1 character.

    • The string <U+0041 ( A ) LATIN CAPITAL LETTER A + U+030A ( ̊ ) COMBINING RING ABOVE> is not valid NFC, because the ring above does combine with the A character.

For a full list of the "maybe" characters, see Unicode Utilities. Note that only about half of these ever combine with previous characters.
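
A minimal sketch of this context dependence, using the JDK's full (not quick) check:

    import java.text.Normalizer;
    import java.text.Normalizer.Form;

    public class MaybeCharacterContext {
        public static void main(String[] args) {
            // U+030A has NFC_Quick_Check = Maybe: whether it is allowed in NFC
            // depends entirely on what precedes it.
            System.out.println(Normalizer.isNormalized("1\u030A", Form.NFC));  // true: nothing for the ring to combine with
            System.out.println(Normalizer.isNormalized("A\u030A", Form.NFC));  // false: it combines with the A
            System.out.println(Normalizer.normalize("A\u030A", Form.NFC));     // Å, i.e. U+00C5
        }
    }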

Isn't normalizing to NFC costly in terms of performance?

Not really. Here's a sample comparison of numbers. Figures for Lowercase (LC) are added as a "control" - for comparison to a simple function that also needs to look at each character.

[Timing table not reproduced here; the measured implementations were: * = ICU, UTF-16; † = Google, UTF-8; ‡ = Java (JDK), UTF-16.]

These were measured with 100,000,000 iterations, after warmups, on 2.4 GHz Intel CPUs (of different generations), in nanoseconds rounded to two decimal places. Of course, times will vary with different programming languages, CPUs, platforms, and environments. The cost also goes down significantly with longer strings, because the initialization cost of the NFC routines is amortized over more text.

The key fact (as we saw above) is that the vast majority of text is already in NFC. So while it is not uncommon for a lowercasing operation to do the extra work of converting an uppercase character, it is quite rare for a toNFC function to do any work beyond scanning the text for characters that might require normalization. So over typical samples of text, normalization may on average actually be faster than lowercasing.
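
Applications often exploit this with a check-first idiom; a minimal sketch in Java: check first, and normalize only when the check fails.

    import java.text.Normalizer;
    import java.text.Normalizer.Form;

    public class NormalizeIfNeeded {
        // In typical text the check almost always succeeds, so the normalize()
        // call (and the work it implies) is almost never reached.
        static String nfc(String s) {
            return Normalizer.isNormalized(s, Form.NFC) ? s : Normalizer.normalize(s, Form.NFC);
        }
    }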

Parsing tokens is very performance-sensitive; won't normalizing be too costly?

Normalizing tokens (such as identifiers in a markup language or programming language) doesn't have to cause any significant performance hit. When identifiers are being parsed, a few key facts can be taken into account.

    • Most identifiers are ASCII-only.

    • Any string consisting entirely of characters below U+0300 is unaffected by normalizing to NFC.

Building on these facts, tokenization code can use the following strategy.

    1. Loop over characters < U+0300

      1. This is the fast-path case.

      2. In a very large percentage of cases, this suffices for identifiers, and the loop is exited.

    2. If a character ≥ U+0300 is found, drop into a slower loop that accesses a table of per-character information, typically a trie.

      1. In that precomputed table, store a bit for whether the character might require NFC normalization. The Unicode Consortium makes this data available with a property called NFC_Quick_Check.

      2. If the bit is on, set a flag and continue.

      3. If we find another character < U+0300, then we can go back to the first loop.

    3. When the routine exits, that flag can be tested to see whether normalization needs to be called.

Because of the character frequencies discussed above, the flag will almost never be set, so the added cost is minimal: small fractions of a percent. Using the above figures, it would add perhaps 20 ns per identifier. Optimized state-table implementations can use the same kind of approach.
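
A sketch of this strategy in Java is shown below; isIdentifierChar and mightNeedNfc are hypothetical stand-ins for the parser's real character-classification tables (the latter would be built from the NFC_Quick_Check data).

    import java.text.Normalizer;
    import java.text.Normalizer.Form;

    public class IdentifierScanner {

        // Hypothetical stand-ins for the parser's real character tables (illustration only).
        static boolean isIdentifierChar(char c) { return Character.isUnicodeIdentifierPart(c); }
        // Real code would look this up in a precomputed trie built from NFC_Quick_Check data;
        // answering "true" for everything is just a conservative placeholder.
        static boolean mightNeedNfc(char c) { return true; }

        // Scans one identifier starting at 'start' and returns it in NFC form.
        static String nextIdentifier(String text, int start) {
            int i = start;
            boolean maybeNeedsNfc = false;

            scan:
            while (i < text.length() && isIdentifierChar(text.charAt(i))) {
                // 1. Fast path: characters below U+0300 are never affected by NFC.
                while (text.charAt(i) < '\u0300') {
                    i++;
                    if (i == text.length() || !isIdentifierChar(text.charAt(i))) break scan;
                }
                // 2. Slow path: at or above U+0300, consult the table and set the flag.
                while (text.charAt(i) >= '\u0300') {
                    if (mightNeedNfc(text.charAt(i))) maybeNeedsNfc = true;
                    i++;
                    if (i == text.length() || !isIdentifierChar(text.charAt(i))) break scan;
                }
                // A character below U+0300 was seen again: fall back to the fast path.
            }

            String id = text.substring(start, i);
            // 3. The flag is almost never set, so this call is almost never made.
            return maybeNeedsNfc ? Normalizer.normalize(id, Form.NFC) : id;
        }
    }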

But if many identifiers are unnormalized, won't performance still be an issue?

With the right strategy, normalization need only happen once per identifier. In typical parsers, when an identifier is found it is entered into a symbol table, associating some information. This process can be modified slightly as in the following pseudo-code, so as to add multiple entries where necessary. By doing this, each variant form of the identifier only needs to be normalized once; subsequently it will already be in the symbol table. That lowers the amortized cost to far less than 20 ns per occurrence.

    t = parser.getToken();
    info = symbolTable.get(t);
    if (info != null) continue;  // skip the rest if we already have the token in the symbol table
    info = new Info(...);
    if (parser.mayNeedNfc()) {
        nfc = Normalizer.normalize(t, Normalizer.Form.NFC);
        if (!t.equals(nfc)) {
            symbolTable.put(nfc, info);  // extra entry under the normalized spelling
        }
    }
    symbolTable.put(t, info);

What would be the worst possible case for normalizing to NFC?

The worst performance for most implementations would be something like this: a very long string (say, one million characters) consisting of a base character followed by combining marks (characters with CCC != 0), skewed towards non-BMP characters and sorted by CCC in reverse order. CCC is the Unicode Canonical_Combining_Class property. For good measure, you could throw in one of the cases where NFC is not the shortest form, as above. The reason this is the worst case is that most implementations just sort the combining marks using a bubble sort, which has very good behavior for short sequences and rather bad behavior for long ones. If an implementation really wanted to protect against this worst case, it could add a cheap test for long sequences and switch to a different sorting algorithm in that case.
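
To make the sorting step concrete, here is a sketch of the kind of bubble sort over combining classes that canonical ordering amounts to; it assumes ICU4J only for the UCharacter.getCombiningClass lookup and works on an array of code points. For a long run of marks in reverse CCC order it performs on the order of n² exchanges, which is exactly the worst case described above.

    import com.ibm.icu.lang.UCharacter;

    public class CanonicalOrderSketch {
        // Canonical ordering, sketched as the bubble sort most implementations use:
        // exchange two adjacent characters when the first has a higher
        // Canonical_Combining_Class (CCC) than the second and the second's CCC is non-zero.
        // Starters (CCC = 0) are never moved, so they act as barriers.
        static void canonicalOrder(int[] codePoints) {
            boolean swapped = true;
            while (swapped) {
                swapped = false;
                for (int i = 1; i < codePoints.length; i++) {
                    int cccPrev = UCharacter.getCombiningClass(codePoints[i - 1]);
                    int cccCurr = UCharacter.getCombiningClass(codePoints[i]);
                    if (cccCurr != 0 && cccPrev > cccCurr) {
                        int tmp = codePoints[i - 1];
                        codePoints[i - 1] = codePoints[i];
                        codePoints[i] = tmp;
                        swapped = true;
                    }
                }
            }
        }
    }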

How should I implement NFC normalization?

Well, it is really best not to roll your own. The best course you can take is to use an already tested and optimized library, such as ICU, which supplies C, C++, and Java implementations.

But if you have to write your own, it is fairly straightforward to do a simple implementation by following the reference code in the specification, UAX #15: Unicode Normalization Forms. However, be sure to test your implementation thoroughly, because there are some tricky edge cases. The Unicode Consortium supplies a test file (NormalizationTest.txt), but I recommend also testing against a known implementation like ICU's. For full confidence in your result, use a monkey test, where you throw random strings at both implementations and compare the results.
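
A monkey test might look like the following sketch, assuming ICU4J is available as the reference implementation and with myNfc standing in for the implementation under test (here it is just the JDK's normalizer, purely as a placeholder):

    import java.text.Normalizer;
    import java.util.Random;

    import com.ibm.icu.text.Normalizer2;

    public class NfcMonkeyTest {

        // Stand-in for the implementation under test.
        static String myNfc(String s) {
            return Normalizer.normalize(s, Normalizer.Form.NFC);
        }

        public static void main(String[] args) {
            Normalizer2 reference = Normalizer2.getNFCInstance();  // ICU as the known-good implementation
            Random random = new Random(42);

            for (int trial = 0; trial < 1_000_000; trial++) {
                // Build a short random string of BMP code points, skipping surrogates.
                StringBuilder sb = new StringBuilder();
                int length = 1 + random.nextInt(8);
                for (int i = 0; i < length; i++) {
                    int cp;
                    do { cp = random.nextInt(0x10000); } while (cp >= 0xD800 && cp <= 0xDFFF);
                    sb.appendCodePoint(cp);
                }
                String s = sb.toString();

                // Note: a difference in supported Unicode versions between the
                // two libraries can also surface as a mismatch here.
                if (!reference.normalize(s).equals(myNfc(s))) {
                    throw new AssertionError("Mismatch for input: " + s);
                }
            }
            System.out.println("All trials matched the reference implementation.");
        }
    }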

What if I want to optimize?

The simplest optimization, and a huge win, is to use the NFC_Quick_Check property. Scan through the string; only if you hit a character whose value is NO or MAYBE do you have to do anything. If you never hit one of those cases (the most frequent situation), you exit with no changes. If you do hit one, you need to back up a bit to find the last starter character, and normalize the text from there up to the next "safe" character (one that doesn't interact with anything before it). For more information, see Detecting Normalization Forms.
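
ICU's Normalizer2 API exposes this style of optimization directly; a sketch of the idiom (assuming ICU4J):

    import com.ibm.icu.text.Normalizer2;

    public class SpanNormalize {
        static final Normalizer2 NFC = Normalizer2.getNFCInstance();

        // Returns s in NFC, touching only the part of the string that needs work.
        static String toNfc(String s) {
            // Length of the prefix that already passes the quick check with YES.
            int alreadyGood = NFC.spanQuickCheckYes(s);
            if (alreadyGood == s.length()) {
                return s;  // the common case: nothing to do
            }
            // Keep the good prefix and normalize only the remainder; normalizeSecondAndAppend
            // takes care of any interaction across the boundary between the two parts.
            StringBuilder result = new StringBuilder(s.length()).append(s, 0, alreadyGood);
            return NFC.normalizeSecondAndAppend(result, s.substring(alreadyGood)).toString();
        }
    }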

It is especially important with an optimized implementation to have a complete suite of tests, as discussed above.

NFC normalization requires large tables, right?

Like many other cases, there is a tradeoff between size and performance. You can use very small tables, at some cost in performance. (Even there, the actual performance cost depends on how often normalization needs to be invoked, as discussed above.)

For an analysis of the situation, see Normalization Footprint. It is a bit out of date, but gives a sense of the magnitude. For comparison, ICU's optimized tables for NFC take 44 kB (UTF-16) and Google's optimized tables for NFC take 46 kB (UTF-8).

Anything else I should be aware of when using normalized strings?

A couple of things:

    1. While NFC is preserved under substringing, it is not preserved under concatenation. That is, A and B might each be normalized, yet their concatenation AB might not be (see the sketch after this list).

    2. Normalizing to NFC can grow or shrink the length of a string.
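
A small sketch of both points, using java.text.Normalizer (ICU's normalizeSecondAndAppend, shown earlier, handles concatenation without re-normalizing the whole result):

    import java.text.Normalizer;
    import java.text.Normalizer.Form;

    public class ConcatenationCaveat {
        public static void main(String[] args) {
            String a = "A";       // already NFC
            String b = "\u0300";  // COMBINING GRAVE ACCENT: also NFC on its own
            System.out.println(Normalizer.isNormalized(a, Form.NFC));      // true
            System.out.println(Normalizer.isNormalized(b, Form.NFC));      // true
            System.out.println(Normalizer.isNormalized(a + b, Form.NFC));  // false: <A, grave> composes

            // So after concatenating normalized pieces, re-normalize the result
            // (which here also shrinks it from two characters to one, U+00C0 À).
            System.out.println(Normalizer.normalize(a + b, Form.NFC).length());  // 1
        }
    }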

Where can I get more information?