Character encoding - 2

Uninterpretable characters

Sometimes a UTF-8 file contains characters that no program can interpret. Each of these appears on a webpage, or in a terminal or text editor, as the Unicode replacement character (U+FFFD; hex value ef bf bd in UTF-8), usually displayed as a diamond enclosing a contrasting question mark, like this: �

A replacement character should be replaced with whatever character was originally there, not simply deleted. Working out what the original character was may be difficult, but finding a replacement character by its hex value is easy with sed or AWK:

$ cat file
There's one in the next line
� is a replacement character
This line doesn't have one
$ sed -n '/\xef\xbf\xbd/p' file
� is a replacement character
$ awk '/\xef\xbf\xbd/ {print NR": "$0}' file
2: � is a replacement character
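If you prefer grep, the shell's ANSI-C quoting can supply the same hex bytes. This is a sketch assuming GNU grep; the demo file is made up for illustration:

```shell
# Build a demo file containing a replacement character (U+FFFD = ef bf bd)
printf 'There is one in the next line\n\xef\xbf\xbd is a replacement character\n' > demo.txt

# $'...' makes the shell expand the hex escapes before grep sees them;
# -n prefixes each matching line with its line number
grep -n $'\xef\xbf\xbd' demo.txt
# → 2:� is a replacement character
```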

When auditing data I include replacement characters in my searches for gremlin characters.

Replacement characters can also appear when non-UTF-8 files are viewed in a UTF-8 locale. In one real-world example, copy.csv was ISO-8859-1 encoded. When I grepped in my UTF-8 locale for byte sequences that aren't valid UTF-8 characters, 9850 of them were found in the ISO-8859-1 file, and they appeared in my terminal as replacement characters. When the file was converted to UTF-8 with iconv, the invalid sequences disappeared, replaced by the UTF-8 equivalents of the original ISO-8859-1 characters.
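That workflow can be sketched with a made-up one-line file (copy.csv here is a stand-in, not the real table). It assumes GNU grep and glibc iconv, and that a C.UTF-8 locale is available:

```shell
# Make a small ISO-8859-1 file: "café" with é as the single byte 0xe9
printf 'caf\xe9\n' > copy.csv

# In a UTF-8 locale, lines containing invalid byte sequences can be found
# with grep: -a treats the file as text, -x matches whole lines only and
# -v inverts, so lines that are NOT entirely valid characters are printed
LC_ALL=C.UTF-8 grep -axv '.*' copy.csv

# Convert to UTF-8; the e9 byte becomes the two-byte
# UTF-8 encoding of é (c3 a9)
iconv -f ISO-8859-1 -t UTF-8 copy.csv > copy-utf8.csv
```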


Mysterious question marks

Data items that have a complicated encoding history (like UTF-8 > windows-1252 > UTF-8) can accumulate question marks in place of original characters. The question marks were inserted by programs en route that couldn't interpret the encoding of the characters concerned. Since a question mark is a valid character in most encodings, once inserted it's always a question mark. Unlike some replacement characters (see above), it can't be converted back to an original character by changing the encoding. The original character is lost and can only be re-inserted after considering context, external sources or the advice of the data manager or custodian.
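The persistence is easy to demonstrate: a question mark survives a round trip between encodings unchanged, so no conversion can recover the lost character. A minimal sketch with iconv, using a string from the table below ("O?Keefe" presumably began as "O'Keefe" with a curly apostrophe):

```shell
# The "?" is valid in both UTF-8 and ISO-8859-1, so converting
# back and forth leaves it exactly as it is
echo 'O?Keefe' | iconv -f UTF-8 -t ISO-8859-1 | iconv -f ISO-8859-1 -t UTF-8
# → O?Keefe
```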

I use a regex to find all the question marks in a data table, together with their immediate context. In the command below, grep extracts matches consisting of zero or more 'not space or tab' characters, a literal question mark, then zero or more 'not space or tab' characters. The results shown are from a real-world data table.

$ grep -o "[^[:blank:]]*?[^[:blank:]]*" table | sort | uniq -c
2 Abendroth?s
2 adresse?e
2 anÌ?os
4 ant?s
8 author?s
8 Beru?cksichtigung
2 brou?ích
19 d?Entomologie
2 ?echoslovenia
2 l?Algérie
2 O?Keefe
1 P?ísp?vek

Once found, the question marks can be replaced globally with sed, as in this 'ganged' command (see also the bulk replace pages):

$ sed 's/adresse?e/adressée/g;s/Beru?cksichtigung/Berücksichtigung/g' table

For routine use I've saved that grep command as a function, qwords. To avoid capturing fields full of URLs with "?" in them, the function excludes strings with "http":

qwords() { grep -o "[^[:blank:]]*?[^[:blank:]]*" "$1" \
| grep -v "http" | sort | uniq -c; }

Another time-saver is the following script, called qfinder, which finds each of the fields with non-URL question marks in a table and saves their contents to a file called qlist-[filename], together with their line and field numbers. The script then works on qlist-[filename] to get the fields containing the question marks, and offers to scan the uniquified results with the less command.

awk -F"\t" '{for (i=1;i<=NF;i++) \
{if ($i ~ /\?/ && $i !~ /http/) \
{print NR FS i FS $i}}}' "$1" | sort -t $'\t' -nk2 -nk1 > qlist-"$1"
declare -a label=($(head -n1 "$1" | tr '\t' '\n'))
qflds=$(cut -f2 qlist-"$1" | sort -n | uniq)
echo -e "Table \"$1\" has \"?\" words in the following field(s):"
for k in $(echo "$qflds"); do echo -e " field $k ($(echo ${label[$k-1]})) \
on $(awk -F"\t" -v fld="$k" '$2==fld' qlist-"$1" | wc -l) lines"; done
read -p "Show uniquified results with less? (y/n) " foo
case $foo in
n) exit 0 ;;
y) less -fX <(cut -f2- qlist-"$1" | sort -n | uniq);;
esac


Detecting encoding fails

Replacement characters and mysterious question marks are signs of failed conversions, but there are others. Unfortunately, these others are much harder to find. I use a script I call graph to tally all the visible characters in the table and print their hex values. The script also finds invisible soft hyphens and non-breaking spaces, which are discussed on the gremlin characters page.

# pv draws a progress bar while the file is read
pv -w 50 -pbt "$1" \
| awk 'BEGIN {FS=""} {for (i=1;i<=NF;i++) if ($i ~ /[[:graph:]]/) \
{arr[$i]++}} END {for (j in arr) printf("%s\t%s\n",arr[j],j)}' \
| sort -t $'\t' -k2 \
| while read -r line; do printf "%s\t" "$line"; cut -f2 <<<"$line" \
| hexdump -e '/1 "%02x" " "' | sed 's/ 0a //'; echo; done | column

In the very large table sam1, the graph script found a few characters not expected in the table, like a copyright symbol (screenshot shows first portion of result):


Grepping for the copyright symbol turns up the odd combination ©" (copyright symbol followed by a double quote):
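The search is the qwords pattern with © in place of the question mark. Here it's run on a small made-up sample rather than on sam1:

```shell
# Sample tab-separated lines imitating the find; sam1 itself isn't shown here
printf 'id1\tsome text ©" more text\nid2\tplain text\n' > sample.txt

# Extract each whitespace-delimited string containing ©, then tally
grep -o "[^[:blank:]]*©[^[:blank:]]*" sample.txt | sort | uniq -c
# prints the count and the combination:  1 ©"
```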


This combination is almost certainly an encoding fail. What was the original character or characters? In this case as in others, it's probably best to ask the data compiler.