Mojibake bonanza

Dozens of gibberished words! New mojibake puzzles! That's what I found recently in a UTF-8 dataset from the Museum of Comparative Zoology (MCZ) at Harvard University. Below are my attempts at reconstructing a few of the mojibake histories.

UTF-8 > Windows-1252 > UTF-8

In this scenario, a two-byte UTF-8 character is first read as two one-byte characters in a Windows-1252 program. The separate Windows-1252 characters are later converted to their two-byte UTF-8 equivalents. The table below shows hexadecimal values.

Original	UTF-8 bytes	Read as Windows-1252	Re-encoded as UTF-8	Mojibake
Herich-Schäffer*	c3 a4	c3, a4	c3 83, c2 a4	Herich-SchÃ¤ffer
Lefèbvre	c3 a8	c3, a8	c3 83, c2 a8	LefÃ¨bvre
Médanos	c3 a9	c3, a9	c3 83, c2 a9	MÃ©danos
Cañon	c3 b1	c3, b1	c3 83, c2 b1	CaÃ±on
Falcón	c3 b3	c3, b3	c3 83, c2 b3	FalcÃ³n
Oberthür	c3 bc	c3, bc	c3 83, c2 bc	OberthÃ¼r

*Gottlieb August Wilhelm Herrich-Schäffer (1799-1874), German entomologist. The spelling here is the one in the MCZ dataset.
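
The rows in the table can be reproduced mechanically. Here is a minimal Python sketch; Python is used only for illustration and is not necessarily how the table was built. The helper names `misread` and `hex_bytes` are hypothetical, and `cp1252` is Python's codec name for Windows-1252.

```python
def misread(text: str, wrong_codec: str) -> str:
    """Encode text as UTF-8, then decode those bytes with a one-byte codec."""
    return text.encode("utf-8").decode(wrong_codec)


def hex_bytes(text: str) -> str:
    """Space-separated hex of the UTF-8 encoding of text."""
    return text.encode("utf-8").hex(" ")


for original in ["Médanos", "Cañon", "Oberthür"]:
    # Each two-byte character turns into two Windows-1252 characters.
    print(original, "->", misread(original, "cp1252"))

print(hex_bytes("é"))                     # c3 a9
print(hex_bytes(misread("é", "cp1252")))  # c3 83 c2 a9
```

The separator argument to `bytes.hex()` needs Python 3.8 or later.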

In a three-byte version of this sequence, "Aug. trip ‘83" (left single quote; hex e2 80 98 in UTF-8) was read by a Windows-1252 program as the three single characters "â" (e2), "€" (80) and "˜" (98). The three characters were then converted to UTF-8: "Aug. trip â€˜83".
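
Because every byte in this string has a Windows-1252 character, the damage can be undone by running the steps backwards. The sketch below is an illustration in Python, with hypothetical function names `garble` and `repair`. It assumes the string went through exactly one UTF-8 > Windows-1252 > UTF-8 round trip and nothing else.

```python
def garble(text: str) -> str:
    """UTF-8 bytes misread as Windows-1252 characters."""
    return text.encode("utf-8").decode("cp1252")


def repair(text: str) -> str:
    """Reverse of garble: map the characters back to bytes, then read as UTF-8."""
    return text.encode("cp1252").decode("utf-8")


original = "Aug. trip \u201883"   # \u2018 is the left single quote, e2 80 98
garbled = garble(original)

print(garbled)                    # Aug. trip â€˜83
print(repair(garbled) == original)  # True
```

`repair` raises a `UnicodeError` if the string contains characters that Windows-1252 lacks, or if the recovered bytes are not valid UTF-8. That is a useful signal that some other history is involved.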

UTF-8 > Mac OS Roman > UTF-8

A similar scenario, but this time the UTF-8 original was processed as one-byte characters on a Mac.

Original	UTF-8 bytes	Read as Mac OS Roman	Re-encoded as UTF-8	Mojibake
Volcán	c3 a1	c3, a1	e2 88 9a, c2 b0	Volc√°n
Jordão	c3 a3	c3, a3	e2 88 9a, c2 a3	Jord√£o
Açu	c3 a7	c3, a7	e2 88 9a, c3 9f	A√ßu
Tapirapé	c3 a9	c3, a9	e2 88 9a, c2 a9	Tapirap√©
Felíx	c3 ad	c3, ad	e2 88 9a, e2 89 a0	Fel√≠x
Dueñas	c3 b1	c3, b1	e2 88 9a, c2 b1	Due√±as
Jerónimo	c3 b3	c3, b3	e2 88 9a, e2 89 a5	Jer√≥nimo
Vanhöffen	c3 b6	c3, b6	e2 88 9a, e2 88 82	Vanh√∂ffen
Izúcar	c3 ba	c3, ba	e2 88 9a, e2 88 ab	Iz√∫car
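
The same kind of check works for the Mac OS Roman table. In this Python sketch the function name `table_row` is hypothetical, and `mac_roman` is Python's codec name for Mac OS Roman. It prints the hex columns for each accented character in the table.

```python
def table_row(char: str, wrong_codec: str = "mac_roman"):
    """Return (UTF-8 hex, misread characters, re-encoded UTF-8 hex) for one character."""
    raw = char.encode("utf-8")
    misread = raw.decode(wrong_codec)
    # Hex of each misread character once it is written back out as UTF-8.
    reencoded = ", ".join(c.encode("utf-8").hex(" ") for c in misread)
    return raw.hex(" "), misread, reencoded


for char in "áãçéíñóöú":
    print(char, *table_row(char), sep="\t")
```

Every row starts with e2 88 9a because the lead byte c3 is the square root sign in Mac OS Roman.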

Mac OS Roman > Windows-1252 > UTF-8

"S‹o Paulo" started out as "São Paulo" on a Mac, where the "Latin small a with tilde" has the hex encoding 8b. When read on a Windows machine, hex 8b became the Windows-1252 character "single left-pointing angle quotation mark". That character was then converted to UTF-8 as hex e2 80 b9 in the MCZ dataset.

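The "São Paulo" history can be traced one hop at a time. This is a Python sketch with made-up variable names, assuming Python's `mac_roman` and `cp1252` codecs match the Mac and Windows programs involved.

```python
original = "São Paulo"

mac_bytes = original.encode("mac_roman")      # "ã" is stored as the single byte 8b
seen_on_windows = mac_bytes.decode("cp1252")  # 8b is "‹" in Windows-1252
stored = seen_on_windows.encode("utf-8")      # "‹" becomes e2 80 b9

print(mac_bytes.hex(" "))
print(seen_on_windows)
print(stored.hex(" "))
```
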
UTF-8 > Windows-1252 > Mac OS Roman > Windows-1252 > UTF-8

The original was in UTF-8, where "é" is a two-byte character, hex c3 a9.
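
Taking the heading at face value, the hops after that starting point can be traced mechanically. The Python sketch below is one literal reading of the chain and not a record of what actually happened to the MCZ data. The variable names are made up for illustration.

```python
original = "é"  # c3 a9 in UTF-8

# UTF-8 > Windows-1252: the two bytes are read as two separate characters.
first_pass = original.encode("utf-8").decode("cp1252")

# Windows-1252 > Mac OS Roman > Windows-1252: those characters are saved
# as Mac OS Roman bytes, and the bytes are then read as Windows-1252.
second_pass = first_pass.encode("mac_roman").decode("cp1252")

# > UTF-8: the result is written out as UTF-8.
final_bytes = second_pass.encode("utf-8")

for label, text in [("start", original), ("after hop 1", first_pass),
                    ("after hop 2", second_pass)]:
    print(label, text, text.encode("utf-8").hex(" "), sep="\t")
```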

Bafflers

I can't figure out (yet) what happened in the following 4 cases. Some of the MCZ dataset strings were generated by OCR of specimen labels, so OCR error might be partly to blame.

In "GroseÂSmith", there's an invisible soft hyphen after the "Â" and the UTF-8 encoding is hex c3 82, c2 ad. There might have been a soft hyphen after the ordinary one in the original "Grose-Smith".
Gualeguaychú > Gualeguaych£
Biológica > Biol¢gica
Herrich-Schä[f?]fer > Herrich-Schè‡Ÿfer
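
A brute-force search is one way to look for candidate histories. The Python sketch below tries every "written as X, read as Y" pair from a short list of codecs and reports the pairs that turn the original into the garbled string. The function name `explain` and the codec list are assumptions; `cp437` and `cp850` (DOS code pages) are included as guesses. A hit is only a hypothesis about the dataset's history.

```python
from itertools import permutations

# Candidate encodings; the two DOS code pages are speculative additions.
CANDIDATES = ["utf-8", "cp1252", "latin-1", "mac_roman", "cp437", "cp850"]


def explain(original: str, garbled: str):
    """List (written_as, read_as) codec pairs that map original to garbled."""
    hits = []
    for written_as, read_as in permutations(CANDIDATES, 2):
        try:
            if original.encode(written_as).decode(read_as) == garbled:
                hits.append((written_as, read_as))
        except UnicodeError:
            # The character is missing from written_as, or the bytes are
            # invalid in read_as; this pair cannot explain the string.
            continue
    return hits


print(explain("Gualeguaychú", "Gualeguaych£"))
print(explain("Biológica", "Biol¢gica"))
```

The DOS code pages are worth a look for these two cases: both 437 and 850 put "ú" at hex a3 and "ó" at hex a2, and Windows-1252 shows those bytes as "£" and "¢".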

P.S. Another interesting feature of the MCZ dataset is the variety of substitutes for degrees in latitude/longitude figures. I'm glad I didn't have to check whether all of these formats had been correctly converted to decimal degrees (in the decimalLatitude and decimalLongitude fields in the dataset):

10°18'N    the true degree symbol, hex c2 b0
25º10'E    the "masculine ordinal indicator", hex c2 ba
20˚34'N    the "ring above" character, hex cb 9a
20*52'04"S    an asterisk
31, 53.104 N    a comma
20'22'48"S 148'35'45"E    an apostrophe
11.40 S    no symbol at all
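
A quick tally of which substitute each string uses is a reasonable first step before any conversion check. The Python sketch below only classifies the character that follows the leading whole number and makes no attempt at converting to decimal degrees. The names `degree_marker`, `NAMES` and `PATTERN` are hypothetical, and the sample strings are the ones listed above.

```python
import re
from collections import Counter

NAMES = {
    "\u00b0": "degree sign",
    "\u00ba": "masculine ordinal indicator",
    "\u02da": "ring above",
    "*": "asterisk",
    ",": "comma",
    "'": "apostrophe",
    "": "no symbol",
}

# Leading whole number, optional spaces, then at most one known substitute.
PATTERN = re.compile(r"^\s*\d+\s*([\u00b0\u00ba\u02da*,']?)")


def degree_marker(coord: str) -> str:
    """Name the character used in place of the degree sign, if any."""
    match = PATTERN.match(coord)
    return NAMES[match.group(1)] if match else "unparsed"


samples = ["10\u00b018'N", "25\u00ba10'E", "20\u02da34'N", "20*52'04\"S",
           "31, 53.104 N", "20'22'48\"S 148'35'45\"E", "11.40 S"]

print(Counter(degree_marker(s) for s in samples))
```

Note that "no symbol" covers anything not in the list, including a plain decimal point as in "11.40 S".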

Last update: 2020-12-16
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License