
For a list of BASHing data 2 blog posts see the index page.


Mojibake with 2 hearts and 52 bytes

Most of the mojibake I puzzle over starts with alphabetic characters in UTF-8 encoding.

For example, the starting point might be the German word schön. In UTF-8, the ö is a 2-byte character, while the other characters are 1-byte:

[Image: hexcha output for "schön", with ö shown as the two bytes c3 b6]

The hexcha function returns a list of characters and their byte values in hexadecimal:
 
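hexcha() {
    # For each character of "$1": print the character, a tab, then its bytes in hex
    while read -rn1 char; do
        echo -ne "$char\t" && hexdump -e '/1 "%02x" " "' <<< "$char" | sed 's/ 0a //'; echo
    done <<< "$1" | sed '$d'    # drop the spurious last line from the here-string's trailing newline
}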

That's all well and good if a program can understand UTF-8 encoding, but if it's a simple-minded Windows program it might interpret the UTF-8 characters one byte at a time, in Windows-1252 encoding. The result is schÃ¶n, where Ã is the 1-byte character with hex value c3, and ¶ is b6.

If the Windows interpretation is sent to a program that converts the string back to UTF-8, we get two 2-byte characters:

[Image: hexcha output for "schÃ¶n" in UTF-8, with Ã shown as c3 83 and ¶ as c2 b6]

Pass that new UTF-8 string to a Windows program that converts it to single-byte encoding again and you'll have this splendid bit of nonsense: schÃƒÂ¶n. Quite long mojibake can be inadvertently generated by ping-ponging back and forth between encodings, for example HejnÃƒÆ’Ã‚Â½ from Hejný (see this BASHing data post) - 1 character expanding to 8 characters!
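The ping-pong can be reproduced on the command line. What follows is a minimal sketch, assuming an iconv that knows the name CP1252; the helper name misread is made up for this illustration. Each call to misread reads the incoming UTF-8 bytes as if they were Windows-1252 and writes the result back out as UTF-8, which is one full round of the game.

```shell
# Hypothetical helper (name invented for this sketch): treat the bytes on stdin
# as Windows-1252 and re-encode the resulting characters as UTF-8.
misread() {
    iconv -f CP1252 -t UTF-8
}

# The inputs are written with octal escapes (\303\266 is ö, \303\275 is ý)
# so the example doesn't depend on the encoding of the terminal or script file.
printf 'sch\303\266n\n' | misread                        # schÃ¶n
printf 'sch\303\266n\n' | misread | misread              # schÃƒÂ¶n
printf 'Hejn\303\275\n' | misread | misread | misread    # HejnÃƒÆ’Ã‚Â½
```

Three rounds are enough to turn the single ý into 8 characters. This only works while every byte is defined in Windows-1252; a strict converter may stop with an error on the undefined bytes 81, 8d, 8f, 90 and 9d.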

I recently audited a UTF-8 dataset of biological records with a mojibake string that baffled me:
Hermosa orquídea Ã¢Â<U+009D>Â¤Ã¯Â¸Â<U+008F>Ã°Å¸Â§Â¡

What happened?! I couldn't work out what the starting alphabetic characters might be. Fortunately I could go to the original record, which was on an iNaturalist observation page:

[Image: screenshot of the iNaturalist observation page]

OK, the start was a red heart emoji and an orange heart emoji. There is a saying among my people, "Strange are the ways of emojis", and these two characters are strange indeed.

The red heart is the Unicode "heavy black heart", ❤, U+2764, e2 9d a4, modified by an invisible following character, "variation selector-16", U+FE0F, ef b8 8f. An emoji-wise program (like your browser, I hope) will interpret these two characters together as ❤️.

The orange heart 🧡 is a 4-byte character, U+1F9E1, f0 9f a7 a1.
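The byte values quoted for these code points can be checked with a little arithmetic. The function below is a minimal sketch of the standard UTF-8 bit layout (1 to 4 bytes depending on the size of the code point); the name utf8_hex is invented for this illustration, and the function does not reject surrogates or values above U+10FFFF.

```shell
# Hypothetical helper: print the UTF-8 bytes, in hex, for one Unicode code point.
# Usage: utf8_hex 0x2764
utf8_hex() {
    cp=$(( $1 ))
    if [ "$cp" -lt 128 ]; then          # 1 byte:  0xxxxxxx
        printf '%02x\n' "$cp"
    elif [ "$cp" -lt 2048 ]; then       # 2 bytes: 110xxxxx 10xxxxxx
        printf '%02x %02x\n' \
            $(( 0xC0 | (cp >> 6) )) $(( 0x80 | (cp & 0x3F) ))
    elif [ "$cp" -lt 65536 ]; then      # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        printf '%02x %02x %02x\n' \
            $(( 0xE0 | (cp >> 12) )) $(( 0x80 | ((cp >> 6) & 0x3F) )) \
            $(( 0x80 | (cp & 0x3F) ))
    else                                # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        printf '%02x %02x %02x %02x\n' \
            $(( 0xF0 | (cp >> 18) )) $(( 0x80 | ((cp >> 12) & 0x3F) )) \
            $(( 0x80 | ((cp >> 6) & 0x3F) )) $(( 0x80 | (cp & 0x3F) ))
    fi
}

utf8_hex 0x2764     # heavy black heart:     e2 9d a4
utf8_hex 0xFE0F     # variation selector-16: ef b8 8f
utf8_hex 0x1F9E1    # orange heart:          f0 9f a7 a1
```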

(1) The starting point was 10 bytes for the two emojis in UTF-8 encoding:
e2 9d a4 ef b8 8f f0 9f a7 a1.

(2) Neither 9d nor 8f is defined in Windows-1252, but the other bytes are, so interpreting the 10 bytes in Windows-1252 we get these 10 characters:
â<U+009D>¤ï¸<U+008F>ðŸ§¡

(3) In Unicode, U+009D is the control character "operating system command" and U+008F is "single-shift 3". In UTF-8 encoding, those two characters are c2 9d and c2 8f. Converting the 10 characters in the Windows-1252 string to UTF-8 gives these 20 bytes:
c3 a2 c2 9d c2 a4 c3 af c2 b8 c2 8f c3 b0 c5 b8 c2 a7 c2 a1

(4) Now back to Windows-1252, byte by byte:
Ã¢Â<U+009D>Â¤Ã¯Â¸Â<U+008F>Ã°Å¸Â§Â¡

(5) And the final conversion to UTF-8 returns the Windows-1252 string as 52 bytes:
c3 83 c2 a2 c3 82 3c 55 2b 30 30 39 44 3e c3 82 c2 a4 c3 83 c2 af c3 82 c2 b8 c3 82 3c 55 2b 30 30 38 46 3e c3 83 c2 b0 c3 85 c2 b8 c3 82 c2 a7 c3 82 c2 a1

So 10 bytes blew out to 52. It would be nice to know exactly what programs were used in this chain and what the step-by-step results were, but it certainly looks like this particular bit of mojibake arose from a game of UTF-8/Windows-1252 ping-pong.
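For anyone who wants to replay steps (1) to (5), here is one possible reconstruction as a shell sketch. It is not a claim about the programs that actually handled the record. The function name cp1252_to_utf8 and its two modes are invented for this illustration, and the handling of the undefined bytes is an assumption chosen to match the steps: in the first round 9d and 8f pass through as the control characters U+009D and U+008F, and in the second round they are written out as the visible text <U+009D> and <U+008F>. The work is done byte by byte because iconv's CP1252 table may reject the undefined bytes outright.

```shell
# Hypothetical helper: read the bytes on stdin as Windows-1252, write UTF-8.
# $1 says what to do with the five bytes Windows-1252 leaves undefined
# (an assumption made to match the steps in the post):
#   c1      - pass them through as C1 control characters (9d becomes U+009D)
#   literal - write the visible text <U+009D> instead
cp1252_to_utf8() {
    mode=${1:-c1}
    od -An -tx1 -v | tr -s ' ' '\n' | while read -r h; do
        [ -n "$h" ] || continue
        byte="\\0$(printf '%03o' "0x$h")"    # e.g. 9d -> \0235, for printf %b
        case $h in
            81|8d|8f|90|9d)
                if [ "$mode" = literal ]; then
                    printf '<U+00%s>' "$(printf '%s' "$h" | tr 'a-f' 'A-F')"
                else
                    printf '%b' "$byte" | iconv -f ISO-8859-1 -t UTF-8
                fi ;;
            *)  printf '%b' "$byte" | iconv -f CP1252 -t UTF-8 ;;
        esac
    done
}

# Step (1): the two hearts as 10 UTF-8 bytes, written as octal escapes
hearts() { printf '\342\235\244\357\270\217\360\237\247\241'; }

hearts | cp1252_to_utf8 c1 | od -An -tx1 -v                             # the 20 bytes of step (3)
hearts | cp1252_to_utf8 c1 | cp1252_to_utf8 literal | od -An -tx1 -v    # the 52 bytes of step (5)
hearts | cp1252_to_utf8 c1 | cp1252_to_utf8 literal | wc -c             # 52
```

Running iconv once per byte is slow, but for a string this short that doesn't matter, and it keeps the special cases easy to see.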


Last update: 2024-02-09
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License