logo

Broken records - 1

In a table with N fields, every record should have exactly N fields. A record with more or less than N fields is broken and needs fixing.

Check for broken records using AWK to count fields, either with a typed-in command or with the broken function (to save typing):

$ awk -F"\t" '{print NF}' "$1" | sort | uniq -c | sed 's/^[ ]*//;s/ /\t/' | sort -nr
 
$ broken() { awk -F"\t" '{print NF}' "$1" | sort | uniq -c | sed 's/^[ ]*//;s/ /\t/' | sort -nr; }

For a 10000-record file called table with 20 fields, the result should be

$ broken table
10000 20

If there are many records with other than the expected number of fields, those records can be exported for separate inspection. In this particular case:

$ awk -F"\t" 'NF!=20' table > inspect

The 'broken' check can also be done on an all-fields-double-quoted CSV by using its field separator:

$ awk -F'","' '{print NF}' table.csv | sort -n | uniq -c

Examples of broken records are shown below and on the next page. At the bottom of this page is a script that gives more information than the plain broken function.


Too many fields

In file1 there's an unneeded tab embedded in the 3rd field of record 2:

$ cat file1
aaa   bbb   ccc
ddd   eee   ff   f
ggg   hhh   iii
jjj   kkk   lll
 
$ broken file1
3 3
1 4
 
$ awk -F"\t" 'NF==4 {print NR": "$0}' file1
2: ddd   eee   ff   f

To fix this:

$ sed '2s/ff\tf/fff/' file1
aaa   bbb   ccc
ddd   eee   fff
ggg   hhh   iii
jjj   kkk   lll
 
$ awk -F"\t" 'NR==2 {print $1"\t"$2"\t"$3$4; next} 1' file1
aaa   bbb   ccc
ddd   eee   fff
ggg   hhh   iii
jjj   kkk   lll
 
$ awk 'BEGIN {FS=OFS="\t"} NR==2 {print $1,$2,$3$4; next} 1' file1
aaa   bbb   ccc
ddd   eee   fff
ggg   hhh   iii
jjj   kkk   lll

In file2, the extra within-field tab is here replaced with a single whitespace:

$ cat file2
aaa   bbb   ccc
ddd   eee   ff   X
ggg   hhh   iii
jjj   kkk   lll
 
$ broken file2
3 3
1 4
 
$ awk -F"\t" 'NF==4 {print NR": "$0}' file2
2: ddd   eee   ff   X
 
----------
 
$ sed '2s/ff\tX/ff X/' file2
aaa   bbb   ccc
ddd   eee   ff X
ggg   hhh   iii
jjj   kkk   lll
 
$ awk -F"\t" 'NR==2 {print $1"\t"$2"\t"$3" "$4; next} 1' file2
aaa   bbb   ccc
ddd   eee   ff X
ggg   hhh   iii
jjj   kkk   lll
 
$ awk 'BEGIN {FS=OFS="\t"} NR==2 {print $1,$2,$3" "$4; next} 1' file2
aaa   bbb   ccc
ddd   eee   ff X
ggg   hhh   iii
jjj   kkk   lll

In file3, field 2 of record 2 has been separated off by 2 tabs instead of 1:

$ cat file3
aaa   bbb   ccc
ddd      eee   fff
ggg   hhh   iii
jjj   kkk   lll
 
$ broken file3
3 3
1 4
 
$ awk -F"\t" 'NF==4 {print NR": "$0}' file3
2: ddd      eee   fff
 
----------
 
$ sed '2s/d\t\te/d\te/' file3
aaa   bbb   ccc
ddd   eee   fff
ggg   hhh   iii
jjj   kkk   lll
 
$ awk -F"\t" 'NR==2 {print $1"\t"$3"\t"$4; next} 1' file3
aaa   bbb   ccc
ddd   eee   ff X
ggg   hhh   iii
jjj   kkk   lll
 
$ awk 'BEGIN {FS=OFS="\t"} NR==2 {print $1,$3,$4; next} 1' file3
aaa   bbb   ccc
ddd   eee   ff X
ggg   hhh   iii
jjj   kkk   lll

Note that in this last example, sed is addressed to line 2 and the actual contents of fields 1 and 2. In a real-world case, running sed '2s/\t\t/\t/' or sed '2s/\t\t/\t/g' could delete genuinely blank fields.

On to too few fields...


The following script (I call it "fldnos") is more informative than the broken function. The screenshots below the script show it in action on 2 files.

#!/bin/bash
 
gray="\033[1;37m"
reset="\033[0m"
 
var1=$(awk -F"\t" '{print NF}' "$1" | sort | uniq -c | sed 's/^[ ]*//;s/ /\t/' | sort -nr)
var2=$(echo "$var1" | wc -l)
 
if [ "$var2" -eq "1" ]; then
   printf "All $gray$(echo "$var1" | cut -f1)$reset lines in $gray$1$reset have $gray$(echo "$var1" | cut -f2)$reset fields\n"
else
   printf "$gray$1$reset has:\n"
   echo "$var1" | awk -F"\t" -v GRAY="$gray" -v RESET="$reset" '{print GRAY $1 RESET" lines with "GRAY $2 RESET" fields"}'
   echo
   read -p "Show line numbers with wrong field totals? (y/n)" var3
      case "$var3" in
         n) exit 0;;
         y) echo "$var1" | sort -nr | tail -n +2 | cut -f2> /tmp/wrongs
         awk -F"\t" -v GRAY="$gray" -v RESET="$reset" 'FNR==NR {arr[$0]; next} (NF in arr) {print "line "GRAY FNR RESET": "GRAY NF RESET" fields"}' /tmp/wrongs "$1"
         rm /tmp/wrongs;;
      esac
fi
 
exit 0

broken2
 
broken3