
For a full list of BASHing data blog posts see the index page.


Mojibake madness

This is the fifth blog post in a series about character encoding mishaps. (See posts 1, 2, 3 and 4.) My aim in each case is to understand how perfectly good characters get mangled into mojibake, otherwise known as gibberish. Today's victims come from two recent data audits and deserve a lot of sympathy. In both cases the datasets I checked were UTF-8 encoded.


Repeatedly run over. Dataset 1 had a number of strings where one character had become eight. Here are two of them:

Hejný   >   HejnÃƒÆ’Ã‚Â½
ý = Latin small letter y with acute (U+00FD) became...

Ã = Latin capital letter A with tilde (U+00C3)
ƒ = Latin small letter f with hook (U+0192)
Æ = Latin capital letter AE (U+00C6)
’ = right single quotation mark (U+2019)
Ã = Latin capital letter A with tilde (U+00C3)
‚ = single low-9 quotation mark (U+201A)
Â = Latin capital letter A with circumflex (U+00C2)
½ = vulgar fraction one half (U+00BD)
 

d’Italia   >   dÃ¢â‚¬â„¢Italia
’ = right single quotation mark (U+2019) became...

Ã = Latin capital letter A with tilde (U+00C3)
¢ = cent sign (U+00A2)
â = Latin small letter a with circumflex (U+00E2)
‚ = single low-9 quotation mark (U+201A)
¬ = not sign (U+00AC)
â = Latin small letter a with circumflex (U+00E2)
„ = double low-9 quotation mark (U+201E)
¢ = cent sign (U+00A2)

I can derive the first mojibake by shuttling in and out of UTF-8 and Windows-1252 encodings. In what follows each byte is represented in hexadecimal:

ý starts as the 2-byte c3 bd in UTF-8.
On to Windows-1252, where each byte is interpreted separately, giving Ã (c3) and ½ (bd).
Back to UTF-8, where the two characters become Ã (c3 83) and ½ (c2 bd).
Back to Windows-1252: Ã (c3) ƒ (83) Â (c2) ½ (bd).
Back to UTF-8: Ã (c3 83) ƒ (c6 92) Â (c3 82) ½ (c2 bd).
Back to Windows-1252: Ã (c3) ƒ (83) Æ (c6) ’ (92) Ã (c3) ‚ (82) Â (c2) ½ (bd).
Back to UTF-8 for the final encoding of the (now) eight characters.
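
The shuttle above can be replayed in a few lines of Python. This is only a sketch: the helper name misread is made up for illustration, and it assumes that Python's cp1252 codec behaves like whatever software actually handled Dataset 1, which is not known.

```python
def misread(text, wrong="cp1252", right="utf-8"):
    """Encode text as UTF-8, then wrongly decode those bytes one at a time."""
    return text.encode(right).decode(wrong)


# Start with the original character and apply the same mistake three times.
stages = ["ý"]
for _ in range(3):
    stages.append(misread(stages[-1]))

# Show the character count, the string and its UTF-8 bytes at each stage.
for s in stages:
    hex_bytes = " ".join(f"{b:02x}" for b in s.encode("utf-8"))
    print(len(s), s, hex_bytes)
```

Each pass doubles the character count here (1, 2, 4, 8), because every character produced by the previous pass is itself a 2-byte character in UTF-8.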

The second mojibake is more interesting because two of the intermediate characters are 3-byte ones in UTF-8:

’ starts as the 3-byte e2 80 99 in UTF-8.
On to Windows-1252: â (e2) € (80) ™ (99)
Back to UTF-8: â (c3 a2) € (e2 82 ac) ™ (e2 84 a2)
Back to Windows-1252: Ã (c3) ¢ (a2) â (e2) ‚ (82) ¬ (ac) â (e2) „ (84) ¢ (a2)
And finish by encoding those eight characters back into UTF-8.
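
The same steps can be run forwards and then backwards in Python. The function names to_mojibake and from_mojibake are hypothetical, and the sketch assumes Python's cp1252 codec matches the Windows-1252 handling that damaged the data.

```python
def to_mojibake(text):
    """One bad round trip: UTF-8 bytes wrongly read as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def from_mojibake(text):
    """Undo one bad round trip: back to bytes, then decode as UTF-8."""
    return text.encode("cp1252").decode("utf-8")


original = "d’Italia"
mangled = to_mojibake(to_mojibake(original))        # two bad round trips
repaired = from_mojibake(from_mojibake(mangled))    # undo both

print(mangled)
print(repaired)
```

The reversal works for this string because every byte involved has a character in Python's cp1252 table. Five byte values (81, 8d, 8f, 90 and 9d) are undefined in that codec, so text containing them would raise an error rather than round-trip cleanly.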

The second mishap could have been avoided if the right single quote in "d’Italia" had been a simple apostrophe. It's always a good idea to use the simplest version of a character in a plain text dataset, like plain quotes instead of curly quotes.
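
As a rough illustration of that advice, curly quotes can be swapped for plain ones before a dataset is shared. The function name plain_quotes and the choice of four characters are assumptions for this sketch; a real dataset may need a longer mapping.

```python
# Map curly quotation marks to their plain ASCII equivalents.
PLAIN = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def plain_quotes(text):
    """Return text with curly quotes replaced by plain quotes."""
    return text.translate(PLAIN)


cleaned = plain_quotes("d\u2019Italia")
print(cleaned)

# ASCII bytes are the same in UTF-8 and Windows-1252, so a misreading
# between those two encodings leaves the cleaned string untouched.
print(cleaned.encode("utf-8") == cleaned.encode("cp1252"))
```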


Combination punches. Dataset 2 suffered a back-and-forth between UTF-8 and the 1-byte Mac OS Roman encoding on an Apple machine. Here are some of the victims:

left single quotation mark (e2 80 98) became ‚ (e2) Ä (80) ò (98) = ‚Äò
right single quotation mark (e2 80 99) became ‚ (e2) Ä (80) ô (99) = ‚Äô
en dash (e2 80 93) became ‚ (e2) Ä (80) ì (93) = ‚Äì
em dash (e2 80 94) became ‚ (e2) Ä (80) î (94) = ‚Äî
The ë (c3 ab) in "Couëron" became √ (c3) ´ (ab) in "Cou√´ron"
The Ü (c3 9c) in "Überwachung" became √ (c3) ú (9c) in "√úberwachung"
The ř (c5 99) in "Přibylova" became ≈ (c5) ô (99) in "P≈ôibylova"
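
These byte-by-byte readings can be checked with Python's mac_roman codec. The helper name utf8_read_as_mac_roman is invented for this sketch, and it assumes the codec matches the Mac OS Roman table used on the machine that mangled Dataset 2.

```python
def utf8_read_as_mac_roman(text):
    """Encode text as UTF-8, then wrongly decode each byte as Mac OS Roman."""
    return text.encode("utf-8").decode("mac_roman")


samples = [
    "\u2018",            # left single quotation mark
    "\u2019",            # right single quotation mark
    "\u2013",            # en dash
    "\u2014",            # em dash
    "Cou\u00ebron",      # e with diaeresis
    "\u00dcberwachung",  # capital U with diaeresis
    "P\u0159ibylova",    # r with caron
]

for s in samples:
    print(s, "->", utf8_read_as_mac_roman(s))
```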

Worse yet, the original text in Dataset 2 had combining characters. For example, the German träger was formed from an a followed by a combining diaeresis ¨. The combining diaeresis had the original UTF-8 encoding cc 88, and going byte-by-byte into Mac OS Roman the word became traÃàger, because cc encodes Ã and 88 encodes à.
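
A small Python sketch shows why the combining form matters. It uses unicodedata.normalize to build both versions of the word; using NFD to stand in for "a followed by a combining diaeresis" is an assumption about how the original text was composed.

```python
import unicodedata

word = "träger"

# NFD splits the a-umlaut into "a" plus combining diaeresis (U+0308, bytes cc 88).
decomposed = unicodedata.normalize("NFD", word)
# NFC keeps a single precomposed a-umlaut (U+00E4, bytes c3 a4).
precomposed = unicodedata.normalize("NFC", word)

mangled_decomposed = decomposed.encode("utf-8").decode("mac_roman")
mangled_precomposed = precomposed.encode("utf-8").decode("mac_roman")

print(mangled_decomposed)
print(mangled_precomposed)
```

The two forms look identical on screen but produce different mojibake, which is a useful clue when working out what the original text contained.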

In the same dataset, the title of a Romanian journal,
 
   Buletinul Societății de Științe din București, România,
 
suffered serious injuries and was hospitalised as
 
   Buletinul SocietaÃÜtÃ¶ ii de SÃ ciintÃ e din BucuresÃ ci, RomaÃÇnia

It looks at first like the non-ASCII letters were represented with the ASCII base character plus a combining character, and the combining character was then read byte-by-byte in Mac OS Roman:

Societății   >   SocietaÃÜtÃ¶ ii
Combining breve (UTF-8 cc 86) after "a" gives Ã (cc) Ü (86)
Combining comma below (UTF-8 cc a6) after "t" gives Ã (cc) ¶ (a6)
   But why the space after ¶?
România   >   RomaÃÇnia
Combining circumflex (UTF-8 cc 82) after "a" gives Ã (cc) Ç (82)

But this doesn't work for the letters with combining comma below in Științe din București, and why is the "t" missing from the mojibake-d "București"? This one's another mojibake mystery, for now.
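
The combining-character hypothesis can be tested by computing what it predicts for each word. This is a sketch only: predicted_mojibake is a hypothetical helper, and NFD normalisation is assumed to approximate how the base-plus-combining text was originally built.

```python
import unicodedata


def predicted_mojibake(text):
    """Decompose to base + combining characters, encode as UTF-8,
    then wrongly decode each byte as Mac OS Roman."""
    decomposed = unicodedata.normalize("NFD", text)
    return decomposed.encode("utf-8").decode("mac_roman")


words = [
    "Societ\u0103\u021bii",   # Societății
    "\u0218tiin\u021be",      # Științe
    "Bucure\u0219ti",         # București
    "Rom\u00e2nia",           # România
]

for word in words:
    print(word, "->", predicted_mojibake(word))
```

Under these assumptions the prediction for România agrees with the dataset and the one for Societății agrees apart from the unexplained space, while the predictions for Științe and București keep the "t" and so do not match what was found. That is consistent with the mystery remaining unsolved.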


Last update: 2021-05-19
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License