Character Frequency by Unicode Version

This page: http://www.macchiato.com/unicode/character-frequency-by-unicode-version

The image below shows how frequently Unicode characters are used on the web, grouped by the Unicode version in which each character first appeared. One must not read too much into this data; some factors to keep in mind are:
  • certain characters occur with much higher frequency in specific languages; if a character is low in frequency, that doesn't mean it is unimportant for some modern orthography.
  • there is a certain amount of noise in the data. Below a certain threshold the counts are unreliable because they are more likely to come from garbled files.
  • the data is from a sample of the web (a question for the reader: how many pages are there on the web?)
  • the counts are not weighted by usage (views); that is, an occurrence of 'é' on Le Monde.fr counts the same as one on Jeau Bleau's Facebook page, even though the former is viewed vastly more often.
  • it only includes public web pages; in particular, it does not include the vast amount of character data in emails and other sources.
For dates of Unicode versions, see http://www.unicode.org/history/publicationdates.html
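
As a rough illustration of the grouping described above, the following Python sketch sums per-character occurrence counts by the Unicode version in which each character first appeared. It is not the method used to produce the figures on this page. The version table is a simplified excerpt in the layout of the Unicode Character Database file DerivedAge.txt, the counts in sample_counts are made up, and the names parse_ages, version_of, and frequency_by_version are hypothetical.

```python
from collections import Counter

# Simplified excerpt in the layout of DerivedAge.txt ("start[..end] ; version").
# Only a few ranges are listed here; the real file covers every assigned character.
AGE_LINES = """\
0020..007E    ; 1.1
00A0..00FF    ; 1.1
20AC          ; 2.1
1F600         ; 6.1
"""


def parse_ages(text):
    """Return a sorted list of (start, end, version) tuples."""
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        span, version = (part.strip() for part in line.split(";"))
        start, _, end = span.partition("..")
        ranges.append((int(start, 16), int(end or start, 16), version))
    return sorted(ranges)


def version_of(ch, ranges):
    """Return the version in which ch first appeared, or 'unknown'."""
    cp = ord(ch)
    for start, end, version in ranges:
        if start <= cp <= end:
            return version
    return "unknown"


def frequency_by_version(char_counts, ranges):
    """Sum per-character occurrence counts by Unicode version."""
    totals = Counter()
    for ch, count in char_counts.items():
        totals[version_of(ch, ranges)] += count
    return totals


# Hypothetical occurrence counts, not real web data.
sample_counts = {"e": 1000, "é": 50, "€": 7, "😀": 3}
totals = frequency_by_version(sample_counts, parse_ages(AGE_LINES))
print(dict(totals))
```

Each occurrence counts once regardless of how often its page is viewed, which matches the unweighted counting noted in the list above.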


Below are the numbers, including the ten most frequent characters.


The second sheet gives the numbers by script. Where a character has no explicit script, its general category is given instead.

Character Frequency by Script/Category
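
The script-or-category labelling described above can be sketched in Python as follows. This is only an illustration under stated assumptions: SCRIPT_RANGES is a tiny hypothetical table covering a few Latin and Cyrillic letters (a real tally would take script values from the Unicode Character Database), and the function names are invented here. The fallback uses the standard library's unicodedata.category.

```python
import unicodedata
from collections import Counter

# Hypothetical, deliberately tiny script table for illustration only.
SCRIPT_RANGES = [
    (0x0041, 0x005A, "Latin"),     # A-Z
    (0x0061, 0x007A, "Latin"),     # a-z
    (0x0410, 0x044F, "Cyrillic"),  # basic Cyrillic letters
]


def script_or_category(ch):
    """Return the script name if the table has one, else the general category."""
    cp = ord(ch)
    for start, end, script in SCRIPT_RANGES:
        if start <= cp <= end:
            return script
    return unicodedata.category(ch)  # e.g. 'Po', 'Zs', 'Nd'


def frequency_by_label(text):
    """Count characters in text by script, or by general category as a fallback."""
    return Counter(script_or_category(ch) for ch in text)


counts = frequency_by_label("Hi, мир 42!")
print(dict(counts))
```

Punctuation, spaces, and digits are shared across scripts, so they end up under their general categories (Po, Zs, Nd) rather than under a script name.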
