
For a full list of BASHing data blog posts see the index page.


What is +ACY- doing in the data?

Almost all the data files I audit are in UTF-8, but the files have often started out in other encodings. This can lead to some hilarious mojibake and loads of fun for me as I try to reverse the encoding conversion failures.

Last week a file appeared with mojibake I'd never seen before. Here are the original characters followed by the character strings as they appeared in the audit file:

Á --> +AME-
á --> +AOE-
& --> +ACY-
é --> +AOk-
í --> +AO0-
ó --> +APM-
ö --> +APY-
ü --> +APw-

After some googling I found the culprit, namely RFC 2152. This 1997 recommendation (not a standard) defines "UTF-7" and is now regarded as obsolete. UTF-7 was designed to be an email-safe way to handle Unicode characters by converting them into 7-bit US ASCII strings.

To show how UTF-7 works, I'll encode the Latin capital letter a with acute, "Á", Unicode U+00C1.
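The steps can be sketched in the shell. This is only a minimal sketch, assuming an iconv that supports UTF-16BE (glibc and GNU libiconv both do) and the coreutils base64 command; the variable names are illustrative. As I read RFC 2152, a non-ASCII character is written as UTF-16 big-endian bytes, those bytes are base64-encoded without the trailing "=" padding, and the result is wrapped in "+" and "-".

```shell
#!/bin/sh
# "Á" (U+00C1) as UTF-8 bytes C3 81, written in octal so the script stays pure ASCII
char=$(printf '\303\201')

# Step 1: re-encode as UTF-16BE, giving the two bytes 00 C1
#         in bits: 00000000 11000001
# Step 2: base64 regroups those bits in sixes, padding the last group with zeros:
#         000000 001100 000100  ->  0, 12, 4  ->  A, M, E
#         (base64 also appends "=" padding, which UTF-7 drops)
b64=$(printf '%s' "$char" | iconv -f UTF-8 -t UTF-16BE | base64 | tr -d '=\n')

# Step 3: wrap the base64 block in the UTF-7 shift characters "+" and "-"
utf7="+${b64}-"
printf '%s\n' "$utf7"
```

The printed string should match the first row of the table above.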

To decode a small number of UTF-7 strings, I recommend the excellent string-functions.com Web page with a UTF-7-vs-US-ASCII table.

To convert a block of text riddled with UTF-7 constructions, the best bet is iconv on the command line:

iconv -f UTF-7 -t UTF-8 file > newfile
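A quick way to check the conversion before running it on a whole file is to pipe a few strings from the table above through iconv. This is a sketch that assumes your iconv build accepts UTF-7 as a source encoding (`iconv -l` lists what is available); the variable name is only for illustration.

```shell
#!/bin/sh
# Four UTF-7 strings taken from the table earlier in this post, one per line
decoded=$(printf '%s\n' '+AME-' '+AOE-' '+ACY-' '+AOk-' | iconv -f UTF-7 -t UTF-8)

# Should print the original characters, one per line
printf '%s\n' "$decoded"
```

If the characters come back as expected, the same -f UTF-7 -t UTF-8 options can be applied to the full file.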

A UTF-7 code block is an example of a reversible replo: it's easily fixed by changing the encoding of the file.


Last update: 2021-07-14
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License