
Mojibake bonanza

Dozens of gibberished words! New mojibake puzzles! That's what I found recently in a UTF-8 dataset from the Museum of Comparative Zoology (MCZ) at Harvard University. Below are my attempts at reconstructing a few of the mojibake histories.


UTF-8 > Windows-1252 > UTF-8

In this scenario, a two-byte UTF-8 character is first read as two one-byte characters in a Windows-1252 program. The separate Windows-1252 characters are later converted to their two-byte UTF-8 equivalents. The table below shows hexadecimal values.

Original            UTF-8    Windows-1252    UTF-8           Mojibake
Herich-Schäffer*    c3 a4    c3, a4          c3 83, c2 a4    Herich-SchÃ¤ffer
Lefèbvre            c3 a8    c3, a8          c3 83, c2 a8    LefÃ¨bvre
Médanos             c3 a9    c3, a9          c3 83, c2 a9    MÃ©danos
Cañon               c3 b1    c3, b1          c3 83, c2 b1    CaÃ±on
Falcón              c3 b3    c3, b3          c3 83, c2 b3    FalcÃ³n
Oberthür            c3 bc    c3, bc          c3 83, c2 bc    OberthÃ¼r

*Gottlieb August Wilhelm Herrich-Schäffer (1799-1874), German entomologist. The spelling here is the one in the MCZ dataset.

In a three-byte version of this sequence, "Aug. trip ‘83" (left single quote; hex e2 80 98 in UTF-8) was read by a Windows-1252 program as the three single characters "â" (e2), "€" (80) and "˜" (98). The three characters were then converted to UTF-8: "Aug. trip â€˜83".
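
The two-byte and three-byte cases above can be reproduced with a few lines of Python. This is only a sketch: the function names are my own, and it assumes Python's built-in cp1252 codec is a fair stand-in for whatever Windows-1252 program actually handled the MCZ data.

```python
def misread_utf8(text, wrong_encoding="cp1252"):
    """Encode text as UTF-8, then decode those bytes with a one-byte encoding."""
    return text.encode("utf-8").decode(wrong_encoding)


def utf8_hex(text):
    """Return the UTF-8 bytes of text as space-separated hex pairs."""
    return " ".join(f"{byte:02x}" for byte in text.encode("utf-8"))


def repair(garbled, wrong_encoding="cp1252"):
    """Undo one round of the damage; raises an error if the text doesn't fit."""
    return garbled.encode(wrong_encoding).decode("utf-8")


# \u00e4 is a-umlaut, \u00fc is u-umlaut, \u2018 is the left single quote
for original in ("Herich-Sch\u00e4ffer", "Oberth\u00fcr", "Aug. trip \u201883"):
    garbled = misread_utf8(original)
    print(original, "->", garbled, "->", repair(garbled))

print(utf8_hex(misread_utf8("\u00e4")))  # c3 83 c2 a4
```

Python's cp1252 codec leaves five bytes undefined (81, 8d, 8f, 90 and 9d), so a strict decode fails on any UTF-8 character that contains one of them. None of the examples above do.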


UTF-8 > Mac OS Roman > UTF-8

A similar scenario, but this time the UTF-8 original was processed as one-byte characters on a Mac.

Original     UTF-8    Mac OS Roman    UTF-8                 Mojibake
Volcán       c3 a1    c3, a1          e2 88 9a, c2 b0       Volc√°n
Jordão       c3 a3    c3, a3          e2 88 9a, c2 a3       Jord√£o
Açu          c3 a7    c3, a7          e2 88 9a, c3 9f       A√ßu
Tapirapé     c3 a9    c3, a9          e2 88 9a, c2 a9       Tapirap√©
Felíx        c3 ad    c3, ad          e2 88 9a, e2 89 a0    Fel√≠x
Dueñas       c3 b1    c3, b1          e2 88 9a, c2 b1       Due√±as
Jerónimo     c3 b3    c3, b3          e2 88 9a, e2 89 a5    Jer√≥nimo
Vanhöffen    c3 b6    c3, b6          e2 88 9a, e2 88 82    Vanh√∂ffen
Izúcar       c3 ba    c3, ba          e2 88 9a, e2 88 ab    Iz√∫car
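
The Mac OS Roman rows can be checked the same way. Again a sketch with made-up function names, assuming Python's built-in mac_roman codec matches the encoding the Mac program used.

```python
def misread_as_mac_roman(text):
    """Encode text as UTF-8, then decode each byte as a Mac OS Roman character."""
    return text.encode("utf-8").decode("mac_roman")


def utf8_hex(text):
    """Return the UTF-8 bytes of text as space-separated hex pairs."""
    return " ".join(f"{byte:02x}" for byte in text.encode("utf-8"))


# \u00e1 is a-acute, \u00e7 is c-cedilla, \u00ed is i-acute, \u00f6 is o-umlaut
for original in ("Volc\u00e1n", "A\u00e7u", "Fel\u00edx", "Vanh\u00f6ffen"):
    print(original, "->", misread_as_mac_roman(original))

# Byte c3 always becomes the square root sign (e2 88 9a in UTF-8);
# the second byte decides what follows it.
print(utf8_hex(misread_as_mac_roman("\u00ed")))  # e2 88 9a e2 89 a0
```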

Mac OS Roman > Windows-1252 > UTF-8

"S‹o Paulo" started out as "São Paulo" on a Mac, where the "Latin small a with tilde" has the hex encoding 8b. When read on a Windows machine, hex 8b became the Windows-1252 character "single left-pointing angle quotation mark". That character was then converted to UTF-8 as hex e2 80 b9 in the MCZ dataset.

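Here the first step is an encode rather than a decode, because the text began life as Mac OS Roman bytes. A minimal Python sketch, assuming the built-in mac_roman and cp1252 codecs and using variable names of my own:

```python
original = "S\u00e3o Paulo"                # \u00e3 is a-tilde

mac_bytes = original.encode("mac_roman")   # a-tilde is the single byte 8b on a Mac
misread = mac_bytes.decode("cp1252")       # Windows-1252 reads 8b as U+2039
final_bytes = misread.encode("utf-8")      # U+2039 is e2 80 b9 in UTF-8

print(mac_bytes.hex())
print(misread)
print(final_bytes.hex())
```
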

UTF-8 > Windows-1252 > Mac OS Roman > Windows-1252 > UTF-8

"José" possibly became "JosÌ©" in 4 steps:

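One chain that matches the heading above and reproduces the observed string can be traced in Python. It's a guess at a possible history, not a confirmed one, and the function name is my own.

```python
def trace_jose(original="Jos\u00e9"):
    """Follow UTF-8 > Windows-1252 > Mac OS Roman > Windows-1252 > UTF-8."""
    utf8_bytes = original.encode("utf-8")       # e-acute is c3 a9
    as_cp1252 = utf8_bytes.decode("cp1252")     # c3 a9 read as two characters
    mac_bytes = as_cp1252.encode("mac_roman")   # those two saved as cc a9 on a Mac
    as_cp1252_again = mac_bytes.decode("cp1252")  # cc a9 read on Windows
    final_bytes = as_cp1252_again.encode("utf-8")  # stored as c3 8c c2 a9
    return as_cp1252, mac_bytes, as_cp1252_again, final_bytes


for stage in trace_jose():
    print(stage)
```
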

Bafflers

I can't figure out (yet) what happened in the following 4 cases. Some of the MCZ dataset strings were generated by OCR of specimen labels, so OCR error might be partly to blame.


P.S. Another interesting feature of the MCZ dataset is the variety of substitutes for degrees in latitude/longitude figures. I'm glad I didn't have to check whether all of these formats had been correctly converted to decimal degrees (in the decimalLatitude and decimalLongitude fields in the dataset):
 
10°18'N    the true degree symbol, hex c2 b0
25º10'E    the "masculine ordinal indicator", hex c2 ba
20˚34'N    the "ring above" character, hex cb 9a
20*52'04"S    an asterisk
31, 53.104 N    a comma
20'22'48"S 148'35'45"E    an apostrophe
11.40 S    no symbol at all
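
If the check did have to be done, something like the following Python sketch might be a starting point. The pattern and the function name to_decimal_degrees are my own assumptions, not anything from the MCZ data. It handles one coordinate at a time, treats the first separator as the degree sign whatever it is, and gives up (returns None) on "11.40 S", which could mean 11.40 degrees or 11 degrees 40 minutes.

```python
import re

COORDINATE = re.compile(
    r"""^\s*(\d+)\s*
        [\u00b0\u00ba\u02da*,']\s*    # degree sign or one of its substitutes
        (\d+(?:\.\d+)?)\s*'?\s*       # minutes, with an optional apostrophe
        (?:(\d+(?:\.\d+)?)\s*"\s*)?   # optional seconds, ending in a double quote
        ([NSEW])\s*$""",
    re.VERBOSE,
)


def to_decimal_degrees(coordinate):
    """Convert one degrees-minutes(-seconds) string to signed decimal degrees."""
    match = COORDINATE.match(coordinate)
    if match is None:
        return None  # no recognisable degree separator, so don't guess
    degrees, minutes, seconds, hemisphere = match.groups()
    value = int(degrees) + float(minutes) / 60 + float(seconds or 0) / 3600
    return -value if hemisphere in "SW" else value


samples = ["10\u00b018'N", "25\u00ba10'E", "20\u02da34'N", "20*52'04\"S",
           "31, 53.104 N", "20'22'48\"S", "11.40 S"]
for sample in samples:
    print(sample, "->", to_decimal_degrees(sample))
```

A paired string like 20'22'48"S 148'35'45"E would need to be split into its two halves before being passed in.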


Last update: 2020-12-16
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License