
Mojibake detective: the case of the Greek claw

There's something a little strange about this mojibake:

[Image: the mojibake as displayed. From wordplays.com]

The Wikipedia source for the first section ("CHELA") does indeed have a 3-byte en-dash on either side of "also called a claw, nipper, or pincer". It's represented in the Wikipedia page code as –, which is e2 80 93 in UTF-8. The wordplays.com page-builder has faithfully reproduced those 3 bytes as single-byte Windows-1252 characters, namely e2 = â, 80 = € and 93 = “. And yes, the wordplays.com page has "charset=utf-8" in its header.
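The byte-by-byte misreading can be checked with a short Python sketch. This is only an illustration of the effect, not a reconstruction of whatever the wordplays.com page-builder actually runs; the variable names are my own, and "cp1252" is Python's name for Windows-1252.

```python
import unicodedata

# UTF-8 encoding of the en dash (U+2013): three bytes.
utf8_bytes = "\u2013".encode("utf-8")

# Misread each byte as one Windows-1252 character.
misread = utf8_bytes.decode("cp1252")

print(utf8_bytes.hex(" "))  # e2 80 93

# Show which character each single byte turns into.
for byte, char in zip(utf8_bytes, misread):
    print(f"{byte:02x} -> U+{ord(char):04X} {unicodedata.name(char)}")
```

The loop lists one Windows-1252 character per byte, which is how a single en dash becomes three characters on the page.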

What happened next is less obvious. Wikipedia has the Greek word χηλή in its page code (chi, eta, lambda, eta with tonos). How did those four characters become the eight characters Ï‡Î·Î»Î®?

Hidden in the Wikipedia page code around the Greek word is a reference to a Wiktionary page:

[Image: Wikipedia page code with the Wiktionary reference]

and those eight bytes CF 87 CE B7 CE BB CE AE represent Ï‡Î·Î»Î® in Windows-1252 encoding. But in UTF-8 those bytes taken two at a time spell Greek letters:

[Image: shell session decoding the bytes as UTF-8]
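The same comparison can be sketched in Python, decoding the eight bytes once as UTF-8 and once as Windows-1252. The variable names are illustrative only.

```python
# The eight bytes quoted above.
raw = bytes.fromhex("CF 87 CE B7 CE BB CE AE")

# UTF-8 reads them two at a time: one Greek letter per byte pair.
as_utf8 = raw.decode("utf-8")

# Windows-1252 reads them one at a time: one character per byte.
as_cp1252 = raw.decode("cp1252")

print(len(as_utf8), as_utf8)
print(len(as_cp1252), as_cp1252)
```

The first line of output should show four characters and the second eight, matching the four Greek letters and their eight-character mojibake.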

Note also the Wikipedia trick of enclosing the Greek letters in a lang span, which I can do here, too: χηλή. (See this page's source code.) So did the wordplays.com page-builder extract those eight bytes from the Wiktionary URL? Or did the word itself get extracted as UTF-8, then get read as Windows-1252?
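Either route would give the same result, which is why the output alone can't settle the question. A rough Python sketch, assuming the Wiktionary link carries the word as percent-encoded UTF-8 bytes (the exact form of the link is my assumption here):

```python
from urllib.parse import quote, unquote

word = "χηλή"

# Route 1: the word percent-encoded in a URL, then unescaped as Windows-1252.
escaped = quote(word)  # percent-encodes the UTF-8 bytes
from_url = unquote(escaped, encoding="cp1252")

# Route 2: the word's UTF-8 bytes read directly as Windows-1252.
from_text = word.encode("utf-8").decode("cp1252")

print(escaped)
print(from_url == from_text)
```

Both routes start from the same eight bytes, so the comparison on the last line holds and the mojibake is identical either way.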

In the next ("ETYMOLOGY") section the etymonline.com page has the Greek word in italics and its literal spelling in the page code:

[Image: etymonline.com page code]

What's happened is that the two bytes in the UTF-8 encoding of ē, namely c4 and 93, have again been interpreted byte-by-byte in Windows-1252 encoding to give Ä and “. The etymonline.com page has <meta charSet="utf-8"/> in its header.
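Because this kind of mojibake is a plain one-byte-per-character misreading, it can usually be reversed by going back the other way. A minimal Python sketch, with hypothetical helper names of my own; it assumes every garbled character maps back to a Windows-1252 byte and that the recovered bytes are valid UTF-8, and it raises an error otherwise:

```python
def misread_as_cp1252(text: str) -> str:
    """Encode as UTF-8, then wrongly decode each byte as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def repair(mojibake: str) -> str:
    """Undo the misreading: recover the original bytes, then decode as UTF-8."""
    return mojibake.encode("cp1252").decode("utf-8")


garbled = misread_as_cp1252("ē")  # UTF-8 bytes c4 93, read one byte at a time
print(garbled, "->", repair(garbled))
```

A few byte values (0x81, 0x8d, 0x8f, 0x90 and 0x9d) have no character in Python's cp1252 codec, so text whose UTF-8 bytes include them can't make this round trip.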

No resolution in this case, except for me to suggest that Microsoft software is incompatible with the modern Web and that Web developers should always pay close attention to encoding, but I guess that's whistling in the wind.




Last update: 2025-10-17
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License