logo

On this page:
    Fields per record
    Embedded newlines

On the Structure 2 page:
    Blank fields
    Blank records


TSV This marker means that the recipe only works with tab-separated data tables.


Fields per record

In a table with N fields, every record should have exactly N fields. A record with more or fewer than N fields is broken and needs fixing before any further data auditing or cleaning is done.

Tally the number of fields per record   TSV
(The "broken" function takes table as its one argument. The optional sed command after uniq -c converts the numbering to tab-separated.)

broken() { awk -F"\t" '{print NF}' "$1" | sort | uniq -c | sed 's/^[ ]*//;s/ /\t/' | sort -nr; }

broken

Interactive script to tally the number of fields per record
and identify anomalous records
   TSV
(I call this script "fldnos" — see screenshot)

#!/bin/bash
 
blue="\033[1;34m"
reset="\033[0m"
 
var1=$(awk -F"\t" '{print NF}' "$1" | sort | uniq -c | sed 's/^[ ]*//;s/ /\t/' | sort -nr)
var2=$(echo "$var1" | wc -l)
 
if [ "$var2" -eq "1" ]; then
printf "All $blue$(echo "$var1" | cut -f1)$reset records in $blue$1$reset have $blue$(echo "$var1" | cut -f2)$reset fields\n"
else
printf "$blue$1$reset has:\n"
echo "$var1" | awk -F"\t" -v BLUE="$blue" -v RESET="$reset" '{print BLUE $1 RESET" records with "BLUE $2 RESET" fields"}'
echo
read -p "Show line numbers with wrong field totals? (y/n)" var3
    case "$var3" in
    n) exit 0;;
    y) echo "$var1" | sort -nr | tail -n +2 | cut -f2> /tmp/wrongs
    awk -F"\t" -v BLUE="$blue" -v RESET="$reset" 'FNR==NR {arr[$0]; next} (NF in arr) {print "line "BLUE FNR RESET": "BLUE NF RESET" fields"}' /tmp/wrongs "$1"
    rm /tmp/wrongs;;
    esac
fi
exit 0

fldnos

Embedded newlines

Broken records must be repaired before any further auditing or cleaning is done on a data table. A few breaks in a small table can be repaired in a text editor. Breaks in larger tables can be fixed on the command line with sed or AWK.

The most common causes of broken records are newline characters embedded within data items. Multi-line data items are allowed in spreadsheets, and when a spreadsheet with multi-line items is copy/pasted into a text file, each line takes up one field in the record. Prevention is better than cure, and newlines within spreadsheet cells are best removed while the data items are still in the spreadsheet, for example by replacing each newline with a comma.

Command-line repair methods depend on the nature of the record breaks. The simplest case is that there are only a few broken records, and the records are "simply" broken: whatever was in the data item has been split over two or more lines. The fix is to specify the line (record) number of each of the pre-break lines and then just delete the newline character that caused the break. "tableA" has five records, two of which are simply broken:

newlines1

The repair is a little more difficult if the records aren't simply broken, for example if the newline-separated items made up a list, as in "tableB" in the next screenshot. In this case the newlines should be replaced with a suitable separator, such as a comma and a space:

newlines2

When there are many broken records, it's easiest to put the line (record) numbers of the pre-break lines in a temporary file, the store those numbers in an AWK array and do the repair with AWK:

newlines3

For more information on fixing broken records, see the BASHing data posts here and here. To be honest, most of the data tables I see don't have many (or any) broken records, and I do repairs in a text editor, not on the command line.