logo

On this page:
    Combining characters
    Character classes

On the Characters 1 page:
    Is the table in UTF-8?
    Converting to UTF-8
    Tally visible characters

On the Characters 2 page:
    Reversible replos
    Reconstructable replos
    Researchable replos

On the Characters 3 page:
    Detecting gremlins
    Visualising gremlins
    Removing gremlins


Combining characters

Some characters can be coded in more than one way in Unicode. For example, there are two different ways to represent a Latin "a" with an umlaut (or diaeresis). You can code the single character ä, or you can code a plain a plus a diaeresis ¨. The isolated diaeresis character is a combining character.

I call a character pair built with a combining character a "combo". Combos create data-processing headaches and should be normalized to their equivalent single characters when data cleaning. Combining characters can be detected using a regular expression class with the function "combocheck", which takes table as its one argument:

Detect and count combining characters
(More information here)

combocheck() { grep -cP "\p{M}" "$1" }

The combining characters most used with Latin letters are in the Unicode block Combining Diacritical Marks, which runs from code point U+0300 to U+036F (hexadecimal cc 80 through cc bf, and cd 80 through cd af).

Print a list of all strings containing combining characters,
together with their record numbers

(More information here)

awk -b '{for (i=1;i<=NF;i++) {if ($i ~ /\xcc[\x80-\xbf]/ || $i ~ /\xcd[\x80-\xaf]/) print NR,$i}}' table

Print a list of unique strings containing combining characters,
together with their frequencies

(More information here)

awk -b '{for (i=1;i<=NF;i++) {if ($i ~ /\xcc[\x80-\xbf]/ || $i ~ /\xcd[\x80-\xaf]/) print $i}}' table | sort | uniq -c

Find the hexadecimal codes for the unique strings
found by the previous command (without their frequencies)

(More information here)

awk -b '{for (i=1;i<=NF;i++) {if ($i ~ /\xcc[\x80-\xbf]/ || $i ~ /\xcd[\x80-\xaf]/) print $i}}' table | sort | uniq | xxd -c1 | cut -d" " -f2-

I use sed to replace combos with their single-character equivalents. For example, the string "café" in the first command (screenshot below) contains a plain "e" (hex 65) and a combining acute accent (hex cc 81). The combo is replaced with the single character "Latin small letter e with acute" (hex c3 a9).

combo

Character classes

The function below lets you quickly check the contents of a POSIX character class, like [:alnum:] and [:punct:], for use in regular expressions. The command relies on a plain-text file called "256chars". It's a list of the first 256 Unicode characters, either as the character itself or an abbreviation, along with a short description of the character, its Unicode code point and the order number in decimal, hexadecimal and octal, together with the hex value in UTF-8 encoding. You can download "256chars" here. The function "showchars" takes as its one argument the POSIX class abbreviation (alnum, punct, cntrl, etc), and prints the character and its UTF-8 hex value.

Print an inventory of a POSIX character class
(More information here)

showchars() { awk -F"\t" -v foo="$1" 'BEGIN {class=sprintf("[[:%s:]]",foo)} NR>1 {x=sprintf("%c",$2); if (x ~ class) print "\033[34m"$5"\033[0m "$7}' path/to/256chars | column; }

classes