logo

On this page:
    Blank fields
    Blank records

On the Structure 1 page:
    Fields per record
    Embedded newlines
    Broken CSVs


TSV This marker means that the recipe only works with tab-separated data tables.


Blank fields

A data table might have a complete header but only blank data items in some fields. These fields aren't completely empty because they have a field name in the header, so they're "pseudo-blank". The following script (I call it "empties") will identify any pseudo-blank fields and list them with their field numbers in a new file, "table_emptyfields". It also offers the option of building a new table without the pseudo-blank fields.

Interactive script to find empty fields and build a new data table without them   TSV
(More information here)

#!/bin/bash
 
awk -F"\t" 'NR>1 {for (i=1;i<=NF;i++) a[i]+=length($i)} END {for (j in a) {if (a[j]==0) print j}}' "$1" > /tmp/flds
head -n1 "$1" | tr '\t' '\n' | nl -w1 > /tmp/allflds
if [ ! -s /tmp/flds ]; then
    echo
    echo "No empty fields in $1" && rm /tmp/flds /tmp/allflds && exit
else
    awk -F"\t" 'FNR==NR {b[$1]=$2; next} $1 in b {print $1":"b[$1]}' /tmp/allflds /tmp/flds > "$1"_emptyfields
    echo
    echo "$(wc -l < "$1"_emptyfields) empty field(s) in "$1" table"
    echo
    read -p "Build table with non-empty fields only? (y/n) " build
    if [[ "$build" == "y" ]]; then
        read -p "What name for the new table? " name
        cut --complement -f"$(paste -d',' -s /tmp/flds)" "$1" > "$name"
    else
        rm /tmp/flds /tmp/allflds && exit
    fi
fi
rm /tmp/flds /tmp/allflds
exit 0

The "empties" script runs fairly slowly with big tables, so I use a modification that has a progress bar generated by the pv utility. The first AWK command in the script is replaced by:

pv -w 50 -pbt "$1" | awk -F"\t" 'NR>1 {for (i=1;i<=NF;i++) a[i]+=length($i)} END {for (j in a) {if (a[j]==0) print j}}' > /tmp/flds

empties

Blank records

Print a list of record numbers for completely blank records,
or for records containing only tabs or whitespace

awk 'NF==0 {print NR}' table
 
awk '!NF {print NR}' table

Blank records are fairly rare in real-world data tables. More common are records that contain one or more essential data items (like a unique record ID) and nothing else, or maybe a "?" or "-" or two. These are "pseudo-blank" records. Examples are on lines 4, 5, 9 and 12 in the tab-separated table below, called "table1".

table

The "susscount" function prints a tally of records with "suspect" fields containing no alphanumeric characters.

Tally records with suspiciously "empty" fields   TSV
(Earlier version explained here)

susscount() { awk -F"\t" '{c=0; for (i=1;i<=NF;i++) if ($i !~ /[[:alnum:]]/) {c++}} {print "records with "c" suspect fields"}' "$1" | sort -r | uniq -c; }

susscount

Note that the "susscount" function will return a high number of lines if the data table contains many blank data items. It is mainly for use with tables that would normally be well-filled.

To inspect the records with suspect fields, modify the function to print all records with a suspect field count equal to N, where N is based on the result of "susscount". The modified function is "showsuss".

Print records with suspiciously "empty" fields   TSV

showsuss() { awk -F"\t" -v cnt="$2" '{c=0; for (i=1;i<=NF;i++) if ($i !~ /[[:alnum:]]/) {c++}} c==cnt {print}' "$1"; }

showsuss