A blog by Darren Burns

Hey 👋 I'm Darren.
I'm a software engineer based in Edinburgh, Scotland

Unicode & emoji 🚀

January 20, 2021
emoji | unicode | python

Unicode Codepoints

Unicode codepoints are used to represent characters. A codepoint is just a number. Every displayable character is represented by a sequence of one or more codepoints.

However, not every codepoint corresponds to a character. Some codepoints are non-printable, and instead function as "modifiers" (for example, joining characters together or switching to "right-to-left" text mode).

The characters we use in English are generally represented using a single codepoint. For example, the codepoint for "a" is 97 (base 10).

Typically, we write codepoints in hexadecimal (base 16) instead of base 10. So, "a" is represented by the codepoint 0x61. Sometimes you'll see them written like U+0061.

When writing code in Python or JavaScript we can write the codepoint like "\u0061". "\u0061" represents a single character, and can be used in a string and will behave just like the character "a".

>>> "a" == "\u0061"
True

>>> "bab" == "b\u0061b"
True

In Python, for codepoints greater than 0xFFFF we need to use a capital U and pad the literal to 8 hexadecimal digits. For example, we'd write the codepoint 0x10a00 as "\U00010a00".
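
As a small sketch of the notations above, the snippet below compares the short and long escape forms and formats codepoints in the U+XXXX style. The helper name `to_u_notation` is made up for this illustration and isn't part of Python.

```python
def to_u_notation(char: str) -> str:
    """Format a single-codepoint string in the conventional U+XXXX style."""
    # ord() only accepts strings that contain exactly one codepoint.
    return f"U+{ord(char):04X}"


# "\u" takes exactly 4 hex digits, "\U" takes exactly 8.
short_form = "\u0061"
long_form = "\U00000061"

print(short_form == long_form == "a")  # True
print(to_u_notation("a"))              # U+0061
print(to_u_notation("\U00010a00"))     # U+10A00
```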

Converting between codepoints and characters

In Python, we can convert a character to a codepoint using the built-in ord function.

>>> ord("a")
97

Conversely, given a codepoint, we can find out which character it represents using the built-in chr function.

>>> chr(97)
'a'
>>> chr(0x61)
'a'

That is, chr and ord are the inverse of each other.
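
A quick way to convince ourselves of this is a round trip over a handful of single-codepoint characters. The sample list here is arbitrary, chosen only for illustration.

```python
# A few characters that each consist of exactly one codepoint.
samples = ["a", "Z", "\u00fc", "\U0001F642"]

for char in samples:
    codepoint = ord(char)
    # chr() undoes ord(), so we get the original character back.
    assert chr(codepoint) == char
    # !a prints an ASCII-safe representation of the character.
    print(f"{char!a} -> {codepoint} (0x{codepoint:X})")
```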

However, this approach isn't recommended. For one, it implies that characters correspond to a single codepoint. In reality, characters are often represented by multiple codepoints. This is particularly true for emoji and for East Asian languages such as Chinese, Japanese, and Korean (in i18n terms these languages are often referred to as "CJK").

Look what happens when we ask for the codepoint corresponding to "é":

>>> ord("é")
Traceback (most recent call last):
TypeError: ord() expected a character, but string of length 2 found

As hinted at in the exception message, the character "é" actually consists of two codepoints, so ord raises a TypeError. Let's look at what's happening:

>>> len("é")  # Looks like 'é' is indeed 2 codepoints
2

>>> for codepoint in "é":  # Let's look at the 2 codepoints it consists of
...     print(codepoint)

e
 ́
>>> for codepoint in "é":  # Look up the two codepoints
...     print(ord(codepoint))

101
769

The len function returns the number of codepoints the string contains (not the number of bytes or the number of glyphs that appear on screen when printed).

If we iterate through a string in Python, we're actually iterating over the codepoints that make up the string.

From the example above, we can see that the character "é" consists of two codepoints: "e" (101), and " ́" (769).
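
If the raw numbers aren't very telling, the standard library's unicodedata module can look up the official name of each codepoint. Here's a rough sketch; `describe` is a hypothetical helper written for this post, and the string is spelled out with escapes so there's no doubt about which codepoints it contains.

```python
import unicodedata


def describe(text: str) -> list[str]:
    """Return one 'U+XXXX NAME' line per codepoint in the string."""
    return [
        f"U+{ord(char):04X} {unicodedata.name(char, '<unnamed>')}"
        for char in text
    ]


decomposed = "e\u0301"  # "e" followed by a combining acute accent

for line in describe(decomposed):
    print(line)
# U+0065 LATIN SMALL LETTER E
# U+0301 COMBINING ACUTE ACCENT

print(len(decomposed))                  # 2 codepoints
print(len(decomposed.encode("utf-8")))  # 3 bytes once encoded as UTF-8
```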

Canonical equivalence

Visually identical characters can even be represented as different sequences of codepoints.

For example, consider the character "ü". This character can be represented in two different ways:

>>> print("\u00FC")  # As a single codepoint
ü

>>> print("\u0075\u0308")  # Multiple codepoints
ü

>>> "ü" == "ü"  # Visually identical, but the codepoints differ
False

>>> "\u00FC" == "\u0075\u0308"  # Exact same check as above
False

>>> len("ü")  # Single codepoint version
1

>>> len("ü")  # Two codepoint version
2

This probably isn't the behaviour we'd expect. In reality, we'd want to treat "ü" and "ü" as being the same character.

We can get around this using normalisation. The Python standard library comes to the rescue here with the unicodedata module.

By normalising two strings into a canonical representation, we can check their equivalence as expected.

>>> import unicodedata
>>> normalised = unicodedata.normalize("NFC", "\u0075\u0308")
>>> other_normalised = unicodedata.normalize("NFC", "\u00FC")
>>> normalised == other_normalised
True

You'll want to normalise Unicode strings at the boundary of your system (as early as possible!) to ensure that you're always dealing with the canonical representation.
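
One way this might look in practice is a tiny function that every incoming string passes through. This is only a sketch: `normalise_input` is a hypothetical name, and choosing NFC as the canonical form is an assumption, not a rule.

```python
import unicodedata


def normalise_input(raw: str) -> str:
    """Normalise text as it enters the system (NFC is assumed here)."""
    return unicodedata.normalize("NFC", raw)


composed = "\u00fc"     # single codepoint
decomposed = "u\u0308"  # "u" followed by a combining diaeresis

print(composed == decomposed)                                    # False
print(normalise_input(composed) == normalise_input(decomposed))  # True
print(len(normalise_input(decomposed)))                          # 1

# NFD goes the other way and splits the character apart.
print(len(unicodedata.normalize("NFD", composed)))               # 2
```

Whichever form you choose, the important part is applying the same one everywhere, so that comparisons, dictionary keys and database lookups all see the same representation.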

Emoji

Around half of all emoji correspond to a single Unicode codepoint. For example, 🙂 is represented by the codepoint "\U0001F642".

>>> print("\U0001F642")
🙂

The rest are represented by sequences of codepoints.

Combining emoji (👩 + 🎨 = 👩‍🎨)

Zero-Width Joiners

What happens when you "combine" the emoji for "woman" ("👩" == "\U0001F469") with the emoji for "artist palette" ("🎨" == "\U0001F3A8")?

You get a "woman artist" 👩‍🎨 , of course!

👩 + 🎨 = 👩‍🎨!

But how do we combine emoji codepoint sequences in this way?

Here's a hint. When we do len("👩‍🎨") the result is 3. We already know that "woman" and "artist palette" are represented by one codepoint each, so there must be another codepoint in there.

A zero width joiner (or zwidge/ZWJ) is a Unicode codepoint (0x200d) used to combine the definitions of codepoints that appear at each side of it. It has no visual representation and takes up no space (although implementations may have it take a small amount of space):

>>> print("x\u200dx")
x‍x

The zwidge is used to combine two emoji codepoints into a single glyph. We take the codepoints at each side, and combine them into one. So, to combine the "woman" and "artist palette" emoji we just need to place a zwidge \u200D between them:

>>> woman = "👩"
>>> zwidge = "\u200d"
>>> artist_palette = "🎨"
>>> print(woman + zwidge + artist_palette)
👩‍🎨
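
To double-check the claim about the length, we can list the codepoints inside the combined emoji. The string is written with escapes here so the three parts are easy to see.

```python
# woman + zero width joiner + artist palette
woman_artist = "\U0001F469\u200d\U0001F3A8"

print(len(woman_artist))  # 3

# Show each codepoint in U+XXXX notation.
print([f"U+{ord(char):04X}" for char in woman_artist])
# ['U+1F469', 'U+200D', 'U+1F3A8']
```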

This idea can be extended to even more complex emoji. The "family" emoji make for good examples:

Family: Woman, Girl: 👩 + ZWJ + 👧 = 👩‍👧
Family: Man, Girl, Boy: 👨 + ZWJ + 👧 + ZWJ + 👦 = 👨‍👧‍👦

By placing a ZWJ between each of the constituent members of a family, we produce a single emoji representing the combined family.

To write out the distinct codepoints that form the "Family: Man, Girl, Boy" emoji, we would do "\U0001F468\u200D\U0001F467\u200D\U0001F466".

This, of course, can be printed as you would expect:

>>> print("\U0001F468\u200D\U0001F467\u200D\U0001F466")
👨‍👧‍👦
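
Going the other way, splitting on the ZWJ recovers the individual family members. A small sketch (the variable names are just for illustration):

```python
import unicodedata

ZWJ = "\u200d"
family = "\U0001F468\u200D\U0001F467\u200D\U0001F466"

# Splitting on the joiner gives back the constituent emoji.
members = family.split(ZWJ)

print(len(family))   # 5 codepoints: 3 emoji + 2 joiners
print(len(members))  # 3
print([unicodedata.name(member) for member in members])
# ['MAN', 'GIRL', 'BOY']

# Joining them again reproduces the original sequence.
print(ZWJ.join(members) == family)  # True
```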

There are over 1000 different combinations of emojis you can construct using zero-width joiners as of Unicode 14.0. You can see them all here.

Note that zero-width joiners are only required where the second codepoint is not a combining character or modifier. Recall that in our \u0075\u0308 (ü) example above we didn't need a ZWJ. That's because the accent (\u0308) is a combining character by default. It's not intended to exist in isolation.
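
The unicodedata module lets us see this difference for ourselves. As a rough illustration, unicodedata.combining returns a codepoint's canonical combining class (0 means "not a combining character"), and unicodedata.category returns its general category:

```python
import unicodedata

diaeresis = "\u0308"  # the accent from the "ü" example
zwj = "\u200d"        # zero width joiner

# A non-zero combining class marks a combining character.
print(unicodedata.combining(diaeresis))  # 230
print(unicodedata.combining("u"))        # 0
print(unicodedata.combining(zwj))        # 0

# "Mn" is a non-spacing mark, "Cf" is a format character.
print(unicodedata.category(diaeresis))   # Mn
print(unicodedata.category(zwj))         # Cf
```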

Modifying skin tone

Another example of this is emoji modifier codepoints which modify skin colour:

Person: 🧑 U+1F9D1
Medium-dark skin tone: U+1F3FE

Put them together (no ZWJ needed!) and you get... 🧑🏾 a person with medium-dark skin tone!

Skin tone modifiers work as part of more complex emojis too. As long as skin is being shown in the emoji, the modifier will (generally) work.

For example, take the woman technologist emoji 👩‍💻, consisting of the codepoints U+1F469, U+200D, U+1F4BB.

The first codepoint represents a woman emoji. If we place a skin tone modifier after it, we can adjust the skin tone of the woman behind the computer.

skin_modifiers = ["", "\U0001F3FB", "\U0001F3FD", "\U0001F3FF"]
for mod in skin_modifiers:
    emoji = f"\U0001F469{mod}\u200d\U0001F4BB"
    print(emoji)

👩‍💻
👩🏻‍💻
👩🏽‍💻
👩🏿‍💻
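
The example above uses three of the modifiers. There are five in total, occupying the consecutive codepoints U+1F3FB to U+1F3FF, so we can sketch a loop over all of them. `with_skin_tone` is a hypothetical helper, and how each result renders depends on your font and terminal.

```python
import unicodedata

# The five emoji skin tone modifiers sit in one contiguous range.
SKIN_TONE_MODIFIERS = [chr(cp) for cp in range(0x1F3FB, 0x1F3FF + 1)]


def with_skin_tone(base: str, modifier: str) -> str:
    """Place a skin tone modifier directly after a base emoji (no ZWJ)."""
    return base + modifier


person = "\U0001F9D1"
for modifier in SKIN_TONE_MODIFIERS:
    # Print the modifier's official name next to the modified emoji.
    print(unicodedata.name(modifier), with_skin_tone(person, modifier))
```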

Country flags

Country flags are, in general, represented by a pair of special codepoints.

Each codepoint represents a letter in that country's code. These codepoints are called "Regional Indicator Symbols" and are different from their ASCII counterparts. Put these symbols together and you have a "Regional Indicator Pair"!

Let's take the flag of Japan 🇯🇵 for example. The country code for Japan is "JP".

To create the flag for Japan from codepoints, we'd use the Regional Indicator Symbol for "J" (U+1F1EF), and the Regional Indicator Symbol for "P" (U+1F1F5).

>>> j = "\U0001F1EF"
>>> p = "\U0001F1F5"
>>> print(j)
🇯
>>> print(p)
🇵
>>> print(j + p)
🇯🇵
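
Since the Regional Indicator Symbols for A to Z are laid out consecutively starting at U+1F1E6, we can compute them from a two-letter country code. Here's a minimal sketch; `flag_for` is a made-up helper, and whether a given pair actually renders as a flag depends on your platform.

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # Regional Indicator Symbol Letter A


def flag_for(country_code: str) -> str:
    """Turn a two-letter country code into a Regional Indicator Pair."""
    if len(country_code) != 2 or not (
        country_code.isascii() and country_code.isalpha()
    ):
        raise ValueError(f"expected two ASCII letters, got {country_code!r}")
    # Shift each letter from the ASCII range into the indicator range.
    return "".join(
        chr(REGIONAL_INDICATOR_A + ord(letter) - ord("A"))
        for letter in country_code.upper()
    )


print(flag_for("JP") == "\U0001F1EF\U0001F1F5")  # True
print(flag_for("jp"))  # same pair as above, rendered as a flag where supported
```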

Although the vast majority of flags work like this, there are some exceptions (particularly around flags that were added to Unicode in recent years).

For example, the flag of Scotland 🏴󠁧󠁢󠁳󠁣󠁴󠁿 consists of a sequence of 7 (SEVEN) codepoints called an "Emoji Tag Sequence."
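
Those seven codepoints are a black flag (U+1F3F4), five "tag" characters spelling the region code "gbsct", and a cancel tag (U+E007F) that ends the sequence. Each tag character sits 0xE0000 above its ASCII counterpart. As a rough sketch (`subdivision_flag` is a hypothetical helper):

```python
BLACK_FLAG = "\U0001F3F4"
CANCEL_TAG = "\U000E007F"
TAG_BASE = 0xE0000  # tag characters mirror ASCII at this offset


def subdivision_flag(region_code: str) -> str:
    """Build an Emoji Tag Sequence for a region code such as 'gbsct'."""
    # Map each lowercase ASCII letter to its tag character.
    tags = "".join(chr(TAG_BASE + ord(char)) for char in region_code.lower())
    return BLACK_FLAG + tags + CANCEL_TAG


scotland = subdivision_flag("gbsct")
print(len(scotland))  # 7
print([f"U+{ord(char):04X}" for char in scotland])
```

Very few region codes are widely supported as emoji (England, Scotland and Wales being the common ones), so other inputs will most likely render as a plain black flag.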

References & useful links

Related: See this blog post from Spotify in 2013, which describes an account hijacking attack that exploited a flaw in Unicode normalisation (tl;dr: they support Unicode usernames, and two usernames could be normalised to the same string).