
For a full list of BASHing data blog posts see the index page.


Apple + Microsoft = character confusion

A Mac-using client wanted to save a Microsoft Word .docx as a plain text document. The .docx was stored in an iCloud folder. After downloading the file and opening it in Word for Mac, the client chose "Save without formatting (.txt)". What could go wrong?

Well, first of all, the .docx original had carriage return + linefeed (CRLF) line endings. The saved text file had only carriage returns. Remember CR-only line endings? From OS 9 and earlier?
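CR-only line endings are easy to detect and repair from the shell. The following is a minimal sketch that counts CR and LF bytes and then translates CR to LF; the file names and sample text are hypothetical stand-ins, not the client's file.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)

# Hypothetical sample: two lines, each terminated by a bare CR.
printf 'first line\rsecond line\r' > "$workdir/cr_only.txt"

# Count CR and LF bytes to confirm the line-ending style.
cr_before=$(tr -cd '\r' < "$workdir/cr_only.txt" | wc -c | tr -d ' ')
lf_before=$(tr -cd '\n' < "$workdir/cr_only.txt" | wc -c | tr -d ' ')
echo "before: CR=$cr_before LF=$lf_before"

# Translate every CR to LF. This is only appropriate for CR-only files;
# on a CRLF file it would double every line break.
tr '\r' '\n' < "$workdir/cr_only.txt" > "$workdir/lf.txt"

lf_after=$(tr -cd '\n' < "$workdir/lf.txt" | wc -c | tr -d ' ')
echo "after: LF=$lf_after"
```

Counting first is worthwhile because the right fix depends on what is actually in the file: a CRLF file needs its CRs deleted rather than translated.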

Second, the .docx original was in UTF-8 encoding, according to the "properties" .xml files in the .docx archive and my own character encoding check. The saved text, on the other hand...

[Image: mess1]

Examining the text file with xxd, I saw that non-ASCII characters were encoded in Mac OS Roman. For example, á (U+00E1) was represented by the single byte 87 (hex), which is the Mac OS Roman encoding, not the 2-byte c3 a1 of UTF-8 or the 1-byte e1 of Windows-1252.
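The byte values above can be cross-checked by feeding single bytes through iconv and dumping the result as hex. This is a small sketch, not the check used on the client's file; the hex helper is a hypothetical convenience function, and it uses od rather than xxd only because od is part of POSIX.

```shell
# Print the bytes on stdin as lowercase hex with no spaces.
hex() { od -An -tx1 | tr -d ' \n'; }

# Byte 0x87 (octal 207) read as Mac OS Roman, re-encoded as UTF-8.
from_mac=$(printf '\207' | iconv -f macintosh -t utf-8 | hex)

# Byte 0xe1 (octal 341) read as Windows-1252, re-encoded as UTF-8.
from_win=$(printf '\341' | iconv -f cp1252 -t utf-8 | hex)

echo "macintosh 87 -> utf-8 $from_mac"
echo "cp1252    e1 -> utf-8 $from_win"
```

Both conversions should arrive at the same two UTF-8 bytes, c3 a1, which confirms that 87 and e1 are two different single-byte spellings of the same character.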

OK, iconv to the rescue with

iconv -f macintosh -t utf-8 < saved.txt > saved_utf8.txt
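A slightly fuller sketch of the same step, assuming you also want to repair the CR-only line endings and confirm that the result is valid UTF-8. The sample bytes below are a hypothetical stand-in for saved.txt.

```shell
workdir=$(mktemp -d)

# Hypothetical stand-in for saved.txt: "más" and "sí" in Mac OS Roman,
# each line ending in a bare CR (á = 0x87 = \207, í = 0x92 = \222).
printf 'm\207s\rs\222\r' > "$workdir/saved.txt"

# Re-encode to UTF-8, then turn the CR line endings into LF.
iconv -f macintosh -t utf-8 < "$workdir/saved.txt" |
    tr '\r' '\n' > "$workdir/saved_utf8.txt"

# iconv exits non-zero when its input is not valid in the source
# encoding, so a UTF-8 to UTF-8 pass doubles as a validity check.
if iconv -f utf-8 -t utf-8 < "$workdir/saved_utf8.txt" > /dev/null 2>&1; then
    utf8_ok=yes
else
    utf8_ok=no
fi
echo "valid UTF-8 after conversion: $utf8_ok"
```

The order of the two pipeline stages does not matter here, since the CR byte (0d) is the same in Mac OS Roman and UTF-8 and never appears inside a multi-byte UTF-8 sequence.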

All those non-ASCII letters were now correctly encoded in UTF-8. Problem solved? Not quite.

The client had used hyphens (U+2010, hex e2 80 90) instead of the more usual hyphen-minuses (U+002D, hex 2d) in the original .docx. Whatever routine had converted UTF-8 to Mac OS Roman hadn't recognised the Unicode hyphen characters, and had replaced each of them in the text file with "?". Being perfectly good ASCII characters, the "?" had passed through the iconv step unchanged. I searched for "?" in the UTF-8 text file to correct these where needed.
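One way to run that search from the command line is sketched below. The sample text is hypothetical, and the second grep is a heuristic of my own for this example (an assumption, not a rule): a "?" with a letter or digit on both sides is more likely to be a lost hyphen than real punctuation.

```shell
workdir=$(mktemp -d)

# Hypothetical sample of converted text: line 1 has a "?" that replaced
# a hyphen, line 2 has a genuine question mark, line 3 has both.
cat > "$workdir/saved_utf8.txt" <<'EOF'
a well?known problem
What could go wrong?
Is this a long?term fix?
EOF

# Every line containing "?", with line numbers, for manual review.
grep -n -F '?' "$workdir/saved_utf8.txt"

# Heuristic: "?" flanked by alphanumeric characters on both sides.
suspects=$(grep -n -E '[[:alnum:]]\?[[:alnum:]]' "$workdir/saved_utf8.txt")
echo "$suspects"
```

The heuristic will miss a replaced hyphen that sits next to a space or at the end of a line, so the full list from the first grep still needs a human eye.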

I suspect there might be settings deep within Word for Mac's menus that would have prevented this stuff-up and saved the .docx to text with the original UTF-8 encoding. As a data specialist I'm just bemused that "don't change encodings or line endings when saving as text" isn't the default.


Last update: 2022-02-09
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License