On this page:
    Detecting gremlins
    Visualising gremlins
    Removing gremlins


TSV: This marker means that the recipe only works with tab-separated data tables.


Detecting gremlins

By "gremlins" I mean invisible characters other than the plain space, horizontal tab and linefeed. The following script (I call it "gremlins") is my gremlin detector and takes the table name as its argument. The first part reports on Windows carriage returns, soft hyphens and no-break spaces. The second part looks for other control characters, and relies on a list of these characters and their "core" hexadecimal values (main byte) in a file called "chars" in my ~/scripts folder. You can download "chars" here. Note that in UTF-8 encoding, the gremlins with "core" hex values from 80 up are two-byte characters with hex c2 as the first byte.
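The two-byte rule can be checked directly. The sketch below uses a hypothetical helper, utf8_bytes (it is not part of the "gremlins" script), which builds the UTF-8 form of a gremlin from its "core" hex value and dumps the resulting bytes with od. It assumes bash, and it only covers core values up to bf, which is the range that matters here.

```shell
#!/bin/bash
# utf8_bytes is a hypothetical helper, not part of the "gremlins" script.
# It prints, in hex, the UTF-8 bytes of the character with the given "core" hex value.
utf8_bytes() {
    local core=$1
    if (( 16#$core < 128 )); then
        printf "\x$core"          # below hex 80: a single byte
    else
        printf "\xc2\x$core"      # hex 80 to bf: c2 is the first byte
    fi | od -An -tx1 | tr -d ' \n'
    echo
}

utf8_bytes 0d    # carriage return: prints 0d
utf8_bytes ad    # soft hyphen: prints c2ad
utf8_bytes a0    # no-break space: prints c2a0
utf8_bytes 93    # STS control character: prints c293
```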

Script to detect and tally gremlin characters

(An earlier version of this script also looked for replacement characters, which aren't invisible. I now use the "rcwords" function to find the replacement character, �.)

#!/bin/bash
red="\033[1;31m"
blue="\033[1;34m"
reset="\033[0m"
printf "\nFirst check for gremlins, please wait...\n\n"
wincr=$(grep -cP "\r" "$1")
if [ "$wincr" -eq "0" ]; then
    wc=none
else
    wc=$(awk -F"\r" 'NF>1 {a+=(NF-1); b++} END {print a" in "b" records"}' "$1")
fi
shy=$(grep -c $'\xc2\xad' "$1")
if [ "$shy" -eq "0" ]; then
    sh=none
else
    sh=$(awk -F"\xc2\xad" 'NF>1 {c+=(NF-1); d++} END {print c" in "d" records"}' "$1")
fi
nbsp=$(grep -c $'\xc2\xa0' "$1")
if [ "$nbsp" -eq "0" ]; then
    nb=none
else
    nb=$(awk -F"\xc2\xa0" 'NF>1 {c+=(NF-1); d++} END {print c" in "d" records"}' "$1")
fi
printf "$red%s$reset has:\n\nWindows carriage returns (\\\r, hex 0d): $blue%s$reset\nSoft hyphens (hex ad): $blue%s$reset\nNo-break spaces (hex a0): $blue%s$reset\n" "$1" "$wc" "$sh" "$nb"
printf "_ _ _ _ _ _ _ _ _ _ _ \n"
printf "\nChecking now for gremlin control characters, please wait...\n"
awk 'BEGIN {FS=""; for (n=0;n<256;n++) ord[sprintf("%c",n)]=n; list="\x00|\x01|\x02|\x03|\x04|\x05|\x06|\x07|\x08|\x0b|\x0c|\x0e|\x0f|\x10|\x11|\x12|\x13|\x14|\x15|\x16|\x17|\x18|\x19|\x1a|\x1b|\x1c|\x1d|\x1e|\x1f|\x7f|\xc2\x80|\xc2\x81|\xc2\x82|\xc2\x83|\xc2\x84|\xc2\x85|\xc2\x86|\xc2\x87|\xc2\x88|\xc2\x89|\xc2\x8a|\xc2\x8b|\xc2\x8c|\xc2\x8d|\xc2\x8e|\xc2\x8f|\xc2\x90|\xc2\x91|\xc2\x92|\xc2\x93|\xc2\x94|\xc2\x95|\xc2\x96|\xc2\x97|\xc2\x98|\xc2\x99|\xc2\x9a|\xc2\x9b|\xc2\x9c|\xc2\x9d|\xc2\x9e|\xc2\x9f"} {if ($0 ~ list) {for (i=1;i<=NF;i++) if ($i ~ list) {b[$i]++}}} END {for (j in b) printf("%s\t%02x\n", b[j],ord[j])}' "$1" > /tmp/list
echo
if [ -s /tmp/list ]; then
    awk -v BLUE="$blue" -v RESET="$reset" 'BEGIN {FS=OFS="\t"} FNR==NR {a[$1]=$2;next} {print a[$2]" (hex "$2"): " ,BLUE$1RESET}' ~/scripts/chars /tmp/list
else
    printf "No gremlin control characters found\n\n"
fi
echo
rm /tmp/list
exit 0
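
The first part of the script uses two different counts: grep -c gives the number of records containing the gremlin, while the awk command splits each record on the gremlin and so tallies every occurrence. A minimal sketch of the difference follows. The demo table and the wrapper name count_gremlin are made up for illustration, and the gremlin is passed as literal bytes with bash's $'...' quoting rather than as an escape string in -F.

```shell
#!/bin/bash
# count_gremlin is a hypothetical wrapper around the awk tally used in the script above.
# $1 = the gremlin as literal bytes, $2 = the table
count_gremlin() {
    awk -F"$1" 'NF>1 {c+=(NF-1); d++} END {print c" in "d" records"}' "$2"
}

# Made-up demo table: 3 no-break spaces spread over 2 of its 3 records
demo=$(mktemp)
printf 'a\xc2\xa0b\tc\nd\te\nf\xc2\xa0g\xc2\xa0h\ti\n' > "$demo"

grep -c $'\xc2\xa0' "$demo"          # records with at least one no-break space: 2
count_gremlin $'\xc2\xa0' "$demo"    # occurrences and records: 3 in 2 records
rm "$demo"
```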

The "gremlins" script at work on the data table "ver1":

[Screenshot: gremlins]
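
The second part of the script labels each tallied hex value with a name looked up in "chars". The sketch below shows that lookup on its own. The two-column layout of "chars" (core hex value, tab, name) is inferred from the script rather than copied from the real file, the tally is made up, and label_tally is a hypothetical name. The colour codes are left out.

```shell
#!/bin/bash
# Assumed layout of "chars" (inferred from the script): core hex value, tab, character name
chars=$(mktemp)
printf '1e\tRS\n93\tSTS\n94\tCCH\n' > "$chars"

# Made-up tally in the layout of /tmp/list: count, tab, core hex value
list=$(mktemp)
printf '1\t1e\n12\t93\n12\t94\n' > "$list"

# label_tally is a hypothetical name for the lookup step:
# read "chars" into an array keyed on hex value, then label each line of the tally
label_tally() {
    awk 'BEGIN {FS="\t"} FNR==NR {name[$1]=$2; next} {print name[$2]" (hex "$2"): "$1}' "$1" "$2"
}

label_tally "$chars" "$list"
rm "$chars" "$list"
```

With the made-up files above, the first line printed is "RS (hex 1e): 1".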

Visualising gremlins

The "gremfinder" script (below) looks for individual gremlin characters identified by the "gremlins" script, based on their "core" hex value (main byte). It automatically adds "c2" to the "core" hex value where needed in UTF-8 encoding, and locates gremlins by field. The script takes as its two arguments the name of the data table and the "core" hex value. It generates a plain-text, tab-separated table with record number, field number and data item (with the gremlin); the table is named "[selected hex value]-list-table".

If wanted, the script then prints a uniquified list of data items in each field, with the invisible gremlin visualised as a space with yellow background coloring. The printing is done from less with two options: -R to allow ANSI colors, and -X to leave the output on screen after quitting less (press "q" to return to the prompt).
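
The highlighting step can be tried on its own. In this sketch, highlight is a hypothetical helper and the data item is made up; the two colour variables are defined the same way as in the script.

```shell
#!/bin/bash
yelbkg=$(printf "\033[103m")    # bright yellow background
reset=$(printf "\033[0m")       # back to default colours

# highlight is a hypothetical helper: replace every gremlin ($1, as literal bytes)
# on standard input with a space on a yellow background
highlight() {
    sed "s/$1/${yelbkg} ${reset}/g"
}

# Made-up data item with a record separator (hex 1e) hidden in it.
# Append "| less -RX" to page through a long list instead of printing it directly.
printf 'Schmid \x1eEggr\n' | highlight $'\x1e'
```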

Interactive script to extract data items containing a selected gremlin character   TSV
(More information here)

#!/bin/bash
yelbkg=$(printf "\033[103m")
reset=$(printf "\033[0m")
if ((128 > $(printf "%d" "0x$2"))); then
    char=$(printf "\x$2")
else
    char=$(printf "\xc2\x$2")
fi
awk -F"\t" -v grem="$char" '$0 ~ grem {for (i=1;i<=NF;i++) {if ($i ~ grem) {print NR FS i FS $i}}}' "$1" | sort -t $'\t' -nk2 -nk1 > "$2"-list-"$1"
echo
echo "Table \"$1\" has \"$2\"-containing words in the following field(s):"
cut -f2 "$2"-list-"$1" | sort | uniq -c | sed 's/[ ]*//;s/[ ]/\t/' | awk -F"\t" '{print "\tfield "$2" in "$1" records"}'
echo
read -r -p "Show uniquified results with less? (y/n) " foo
echo
if [ "$foo" == "n" ]; then
    exit 0
else
    cut -f2- "$2"-list-"$1" | sort -n | uniq | sed "s/$char/${yelbkg} ${reset}/g" | less -RX
fi
exit 0
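
To see what the list table looks like, the awk command from the script can be run on a small table. This is only a sketch: locate_gremlin is a hypothetical wrapper name, the table is made up, and the gremlin is passed as literal bytes rather than built from a "core" hex value.

```shell
#!/bin/bash
# locate_gremlin is a hypothetical wrapper around the awk command in "gremfinder".
# $1 = table, $2 = gremlin as literal bytes
# Prints record number, field number and data item, tab-separated
locate_gremlin() {
    awk -F"\t" -v grem="$2" '$0 ~ grem {for (i=1;i<=NF;i++) {if ($i ~ grem) {print NR FS i FS $i}}}' "$1"
}

# Made-up table: STS (hex c2 93) in field 2 of record 2 and in field 1 of record 3
demo=$(mktemp)
printf 'id\tname\n7\tJones\xc2\x93\nk\xc2\x93\tSmith\n' > "$demo"

# Sort by field number, then by record number
locate_gremlin "$demo" $'\xc2\x93' | sort -t $'\t' -k2,2n -k1,1n
rm "$demo"
```

The output has one line per affected data item, so a record with the gremlin in two fields appears twice.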

In the screenshot below, "gremfinder" is looking first for the STS (hex c2 93) gremlin in "ver1", then for the CCH (hex c2 94) gremlin. These two are often paired as shown. Hex 93 alone is a left double quotation mark in Windows 1252 encoding, and hex 94 is a right double quotation mark. When the conversion to UTF-8 goes wrong, for example because the Windows 1252 text was treated as ISO-8859-1 (Latin-1), the characters become hex c2 93 (STS in UTF-8; Unicode U+0093) and c2 94 (CCH in UTF-8; Unicode U+0094).

[Screenshot: gremfinder]
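
One way the STS and CCH pair can arise is sketched below with iconv: the same two bytes are converted to UTF-8 twice, once declared as Windows 1252 and once as ISO-8859-1. The helper name hexdump_as is hypothetical, and the sketch assumes an iconv that accepts the encoding names WINDOWS-1252 and ISO-8859-1.

```shell
#!/bin/bash
# hexdump_as is a hypothetical helper: convert standard input to UTF-8,
# treating it as encoding $1, and print the resulting bytes in hex
hexdump_as() {
    iconv -f "$1" -t UTF-8 | od -An -tx1 | xargs
}

# In Windows 1252 the curly double quotes are the single bytes 93 and 94
printf '\x93\x94' | hexdump_as WINDOWS-1252    # e2 80 9c e2 80 9d (real quotation marks)
printf '\x93\x94' | hexdump_as ISO-8859-1      # c2 93 c2 94 (STS and CCH gremlins)
```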

Removing gremlins

Gremlins can be destroyed or replaced globally with tr or sed using the gremlin hex values. It's better to check first where the gremlins are, as described in the preceding section. In the real-world example below, a space and a record separator (RS; hex 1e) have taken the place of a hyphen in the name 'Schmid-Eggr' in record 1468269 of the data table "taxa". Just deleting the gremlin would not correct the text. Here sed replaces the space+RS with a hyphen in record 1468269.

[Screenshot: gremremove]
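
A minimal sketch of this kind of repair, on a made-up three-record table rather than on "taxa". The helper name fix_record is hypothetical, and the exact command used in the example above may differ.

```shell
#!/bin/bash
# Made-up table: record 2 has space+RS (hex 1e) where a hyphen should be
demo=$(mktemp)
printf 'Smith\nSchmid \x1eEggr\nJones\n' > "$demo"

# fix_record is a hypothetical helper: in record $1 of table $2,
# replace the first space+RS with a hyphen
fix_record() {
    local rs=$'\x1e'
    sed "${1}s/ ${rs}/-/" "$2"
}

fix_record 2 "$demo"       # record 2 becomes Schmid-Eggr

# Deleting every RS with tr (octal 036) leaves "Schmid Eggr" instead
tr -d '\036' < "$demo"
rm "$demo"
```

Restricting sed to one record number keeps the replacement from touching other records where the same gremlin might need a different fix.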