
On this page:
    Tally single field
    Tally multiple fields
    Pattern matching
    Duplicates
    Partial duplicates

On the Content 2 page:
    Pseudo-duplicates
    Truncation
    Field disagreements
    Missing but expected
    Geochecking
    Domain schizophrenia


TSV: This marker means that the recipe only works with tab-separated data tables.


Tally single field

The most-used function in my data auditing work is "tally". It takes all the data items in a single field of a tab-separated table, excluding the header line, then sorts the items, uniquifies them and reports the frequency of each unique item. The function takes the table name and field number as its two arguments. (Field numbers are obtained from the "fields" function.) A scan of "tally" results will reveal character, structure and format problems in a field, as well as pseudo-duplication.

Sort, uniquify and count all the data items in a field    TSV
(The optional sed command tweaks the output from uniq -c, left-justifying the number and separating it from the uniquified data item with a tab.)

tally() { tail -n +2 "$1" | cut -f"$2" | sort | uniq -c | sed 's/^[ ]*//;s/ /\t/'; }

I run "tally" sequentially on a series of fields, usually on a working version of the data table from which empty fields have been removed. To move from one field to another, I up-arrow to repeat the command, then backspace the old field number and replace it with the new field number. If I expect the field to have many different data items, I pipe the output of "tally" to less for paging, or sometimes to column if the data items are all very short strings (like year numbers).

A tally of field 13 in the table "biblio":

[Screenshot: tally of field 13 in "biblio"]

Tally multiple fields

Sort, uniquify and count all the data items in multiple fields    TSV

tail -n +2 table | cut -f[field number],[another field number]... | sort | uniq -c

This is a recipe for jointly checking fields with related content. The first screenshot (below) shows the beginning part of a joint tally of the country field (field 23) and countryCode field (field 24) in the data table "occ". Angola and Kenya both have records with missing but expected data items. DRC and Ivory Coast records are pseudo-duplicated because of a no-break space and a spelling error, respectively.
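Filled in for this case, the recipe command is:

tail -n +2 occ | cut -f23,24 | sort | uniq -c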

[Screenshot 1: beginning of the joint tally of "country" and "countryCode" in "occ"]
[Screenshot 2: later part of the joint tally]

Pattern matching

I've learned the hard way that data cleaning based on a pattern match is best done by careful attention to the address of the pattern. As a simple example, a "tally" of the "Month" field (field 11) in the table "demo" shows that March has been represented by "03" in one record, instead of by "3":
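Using the "tally" function defined above, the check is:

tally demo 11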

[Screenshot: tally of the "Month" field in "demo"]

It would be a mistake to globally replace "03" with "3" in the table, because there could be many other "03" strings in "demo" (e.g. AU140362):

sed 's/03/3/g' demo > demo1    #Wrong!!

It would also be dangerous to identify the record number, then target just that record, for the same reason:
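A sketch of the record-number check with AWK, assuming the month is the whole content of field 11 (the command in the screenshot may differ):

awk -F"\t" '$11=="03" {print NR}' demo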

[Screenshot: finding the record number with AWK]

Identifying the record number (as above, with AWK) is a good idea if your next step is to modify that record in a text editor. (In the Geany text editor, pressing Ctrl + l [letter "ell"] and entering a line (record) number will take you to that record, flagged with an arrow.) On the command line, the safe way to do an insertion, deletion or replacement when cleaning data is to address both the record(s) and the field(s) for that action.
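One safe approach with AWK might look like this sketch, where [record number] is the number found above (not necessarily the exact command in the screenshot):

awk -F"\t" 'BEGIN {OFS="\t"} NR==[record number] {$11="3"} 1' demo > demo1

Because this rewrites the whole of field 11 in just that one record, no other "03" string in the table can be affected.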

[Screenshot: a record- and field-addressed replacement]

See also data item operations.


Duplicates

Exact, character-for-character duplicate records are easily found:

Display exact duplicate records

sort table | uniq -D

[Screenshot: exact duplicate records found with sort and uniq -D]

However, it's unlikely that you'll find exact, character-for-character duplicate records in a data table, especially if the table derives from a database which has added a unique ID string to each record. Removing that unique ID can reveal duplicate records:
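As a sketch, assuming the unique ID is in field 1:

cut -f2- table | sort | uniq -D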

[Screenshot: duplicate records revealed after the unique ID is removed]

Partial duplicates

Each of the records in the table "partials" (screenshot below) is unique, even if the ID (IDField1) is ignored. It might be of interest, though, to check the table for partial duplicates, where only some fields are duplicated. The equivalent in a customer database might be records with the same address and phone number, but a differently spelled name.

[Screenshot: the "partials" table]

The following command looks for records in "partials" with duplicate data items in fields 3 and 5. Although the "partials" table is processed twice by AWK, command execution is very fast. The follow-up sorting orders the duplicates by field 3 and by ID. For more information, see this BASHing data post.
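(In the first pass through "partials", FNR is equal to NR and AWK counts each field 3 + field 5 combination in the array "a"; in the second pass, FNR restarts but NR keeps climbing, so AWK prints every record whose combination was counted more than once.)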

awk -F"\t" 'FNR==NR {a[$3,$5]++; next} a[$3,$5]>1' partials partials | sort -t $'\t' -k3 -k1

[Screenshot: partial duplicates in fields 3 and 5 of "partials"]

The "two-pass" AWK command can be adapted to any number of fields. However, in a real-world case I had to modify the command because many of the records had blanks in the two fields being checked, and a blank is a duplicate of any other blank. I checked for blanks after noticing that the count of partial duplicates in fields 2 and 17 of "col1" was suspiciously large:

[Screenshot: partial-duplicate counts for fields 2 and 17 of "col1"]
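A quick blank check, as a sketch (other approaches would work just as well):

awk -F"\t" '$2=="" || $17==""' col1 | wc -l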

The modified command added the condition that field 2 was not blank. This reduced the number of partial-duplicate records to two. Apart from differences in the unique record ID (field 1) and the language of the institution name (field 16, "Národní muzeum" vs "National Museum of the Czech Republic"), the two records were identical:

[Screenshot: the two remaining partial-duplicate records]
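For reference, the modification might have looked like this sketch (possibly not the exact command used):

awk -F"\t" 'FNR==NR {if ($2 != "") a[$2,$17]++; next} $2 != "" && a[$2,$17]>1' col1 col1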