Dozens of gibberished words! New mojibake puzzles! That's what I found recently in a UTF-8 dataset from the Museum of Comparative Zoology (MCZ) at Harvard University. Below are my attempts at reconstructing a few of the mojibake histories.
UTF-8 > Windows-1252 > UTF-8
In this scenario, a two-byte UTF-8 character is first read by a Windows-1252 program as two one-byte characters. Each of those Windows-1252 characters is later converted to its two-byte UTF-8 equivalent. The table below shows the hexadecimal byte values at each stage.
|Original||UTF-8||Windows-1252||UTF-8||Mojibake|
|Herich-Schäffer*||c3 a4||c3, a4||c3 83, c2 a4||Herich-SchÃ¤ffer|
|Lefèbvre||c3 a8||c3, a8||c3 83, c2 a8||LefÃ¨bvre|
|Médanos||c3 a9||c3, a9||c3 83, c2 a9||MÃ©danos|
|Cañon||c3 b1||c3, b1||c3 83, c2 b1||CaÃ±on|
|Falcón||c3 b3||c3, b3||c3 83, c2 b3||FalcÃ³n|
|Oberthür||c3 bc||c3, bc||c3 83, c2 bc||OberthÃ¼r|
*Gottlieb August Wilhelm Herrich-Schäffer (1799-1874), German entomologist. The spelling here is the one in the MCZ dataset.
In a three-byte version of this sequence, "Aug. trip ‘83" (left single quote; hex e2 80 98 in UTF-8) was read by a Windows-1252 program as the three single characters "â" (e2), "€" (80) and "˜" (98). The three characters were then converted to UTF-8: "Aug. trip â€˜83".
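The round trip in this section can be reproduced with a minimal Python sketch. It assumes that Python's built-in "cp1252" codec behaves like the Windows program that misread the bytes, and the function name mojibake is just a label chosen for this example (bytes.hex with a separator needs Python 3.8 or later).

```python
def mojibake(text: str, misread_as: str = "cp1252") -> str:
    """Encode text as UTF-8, then misread the bytes with a one-byte codec."""
    return text.encode("utf-8").decode(misread_as)


# Two-byte cases from the table, plus the three-byte left single quote (U+2018)
for original in ["Médanos", "Cañon", "Aug. trip \u201883"]:
    garbled = mojibake(original)
    # Show the garbled string and the UTF-8 bytes it is finally stored as
    print(original, "->", garbled, "|", garbled.encode("utf-8").hex(" "))
```

The same damage can often be undone by running the steps in reverse, garbled.encode("cp1252").decode("utf-8"), although that fails if a byte has no Windows-1252 character assigned to it.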
UTF-8 > Mac OS Roman > UTF-8
This is a similar scenario, but this time the bytes of the UTF-8 original were read as one-byte Mac OS Roman characters on a Mac.
|Original||UTF-8||Mac OS Roman||UTF-8||Mojibake|
|Volcán||c3 a1||c3, a1||e2 88 9a, c2 b0||Volc√°n|
|Jordão||c3 a3||c3, a3||e2 88 9a, c2 a3||Jord√£o|
|Açu||c3 a7||c3, a7||e2 88 9a, c3 9f||A√ßu|
|Tapirapé||c3 a9||c3, a9||e2 88 9a, c2 a9||Tapirap√©|
|Felíx||c3 ad||c3, ad||e2 88 9a, e2 89 a0||Fel√≠x|
|Dueñas||c3 b1||c3, b1||e2 88 9a, c2 b1||Due√±as|
|Jerónimo||c3 b3||c3, b3||e2 88 9a, e2 89 a5||Jer√≥nimo|
|Vanhöffen||c3 b6||c3, b6||e2 88 9a, e2 88 82||Vanh√∂ffen|
|Izúcar||c3 ba||c3, ba||e2 88 9a, e2 88 ab||Iz√∫car|
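The rows in this table can be checked with a short Python sketch, assuming that Python's built-in "mac_roman" codec matches the Mac program involved. The function name misread_on_mac is hypothetical.

```python
def misread_on_mac(text: str) -> str:
    """Encode text as UTF-8, then read each byte as a Mac OS Roman character."""
    return text.encode("utf-8").decode("mac_roman")


for original in ["Volcán", "Açu", "Jerónimo"]:
    garbled = misread_on_mac(original)
    # The second field is the UTF-8 encoding of the garbled string
    print(original, "->", garbled, "|", garbled.encode("utf-8").hex(" "))
```

Because c3 is the square root sign in Mac OS Roman, every two-byte UTF-8 character starting with c3 turns into a pair beginning with "√".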
Mac OS Roman > Windows-1252 > UTF-8
"S‹o Paulo" started out as "São Paulo" on a Mac, where the "Latin small a with tilde" has the hex encoding 8b. When read on a Windows machine, hex 8b became the Windows-1252 character "single left-pointing angle quotation mark". That character was then converted to UTF-8 as hex e2 80 b9 in the MCZ dataset.
UTF-8 > Windows-1252 > Mac OS Roman > Windows-1252 > UTF-8
"José" possibly became "JosÌ©" in 4 steps:
- The original was in UTF-8, where "é" is a two-byte character, hex c3 a9
- The string went to a Windows program where each byte was read separately, giving "Ã" (c3) and "©" (a9)
- Next to a Mac, where "Ã" is hex cc, not c3, but "©" is again hex a9
- Back to Windows, where cc a9 was read as "Ì" (cc in Windows-1252) and "©" (a9)
- From Windows to a UTF-8 environment, where "Ì" was converted to hex c3 8c and "©" to c2 a9
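The proposed chain can be replayed in Python as a plausibility check. This is only a sketch of the guessed history, with the built-in "cp1252" and "mac_roman" codecs assumed to match the programs involved; the variable names are arbitrary.

```python
utf8_bytes = "José".encode("utf-8")             # é is the two bytes c3 a9
on_windows = utf8_bytes.decode("cp1252")        # bytes read separately: Ã and ©
mac_bytes = on_windows.encode("mac_roman")      # Ã is cc on a Mac, © stays a9
back_on_windows = mac_bytes.decode("cp1252")    # cc is Ì in Windows-1252
final_bytes = back_on_windows.encode("utf-8")   # Ì -> c3 8c, © -> c2 a9

print(on_windows, mac_bytes.hex(" "), back_on_windows, final_bytes.hex(" "))
```

That the replay ends with the bytes seen in the dataset shows the chain is consistent, not that it is what actually happened.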
I can't figure out (yet) what happened in the following 4 cases. Some of the MCZ dataset strings were generated by OCR of specimen labels, so OCR error might be partly to blame.
- In "GroseÂSmith", there's an invisible soft hyphen after the "Â" and the UTF-8 encoding is hex c3 82, c2 ad. There might have been a soft hyphen after the ordinary one in the original "Grose-Smith".
- Gualeguaychú > Gualeguaych£
- Biológica > Biol¢gica
- Herrich-Schä[f?]fer > Herrich-Schè‡Ÿfer
P.S. Another interesting feature of the MCZ dataset is the variety of substitutes for degrees in latitude/longitude figures. I'm glad I didn't have to check whether all of these formats had been correctly converted to decimal degrees (in the decimalLatitude and decimalLongitude fields in the dataset):
10°18'N the true degree symbol, hex c2 b0
25º10'E the "masculine ordinal indicator", hex c2 ba
20˚34'N the "ring above" character, hex cb 9a
20*52'04"S an asterisk
31, 53.104 N a comma
20'22'48"S 148'35'45"E an apostrophe
11.40 S no symbol at all
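The three degree look-alikes in the list above are hard to tell apart by eye. A small Python sketch using the standard unicodedata module can name them and show their UTF-8 bytes; the helper name describe is hypothetical.

```python
import unicodedata


def describe(ch: str) -> tuple:
    """Return the Unicode name and the UTF-8 bytes (as hex) of one character."""
    return unicodedata.name(ch), ch.encode("utf-8").hex(" ")


# Degree sign, masculine ordinal indicator, ring above
for ch in ["\u00b0", "\u00ba", "\u02da"]:
    name, hex_bytes = describe(ch)
    print(ch, name, hex_bytes)
```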
Last update: 2020-12-16
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License