
TSV This marker means that the recipe only works with tab-separated data tables.
A data auditing procedure
Check and fix if needed:
- character encoding
- structure (TSV?)
- gremlins
- Windows line endings
- broken records
- blank records
- empty fields
- character encoding failures
- excess whitespace
- multiple versions of the same character
- combining characters
- unmatched brackets
- correct and consistent formatting (names, dates, latitude/longitude, measurement units etc)
- inappropriate data items in a field
- truncated data items
- disagreements between fields
- missing but expected data items
- duplicates
- partial duplicates
- pseudo-duplicates
- related-field issues
...and anything else you notice while doing these checks, especially from field tallies!
Reporting results
If a data table has only been audited (not cleaned), a report can be based on results of the various checks done (see above) as documented in the auditing log. My practice in reporting is to group the results by problem, rather than review the table problematic-record by problematic-record.
If a data table has been cleaned, the "reportlist" script (below) will generate a before-and-after report. The script should only be used after any records have been added, deleted or checked and fixed for breaks, as the script assumes that the before-and-after tables have the same number of records and the same number of fields, in the same order.
The "reportlist" script takes as its two arguments the "before" and "after" versions of the data table and prints the results to screen (screenshot below). It also generates a time-stamped file of results called table-changes-[current date and time] in the working directory.
Script to list changes by line (record) and field TSV
#!/bin/bash
stamp=$(date +%F_%H:%M)
paste "$1" "$2" > merged
totf=$(head -1 "$1" | tr '\t' '\n' | wc -l)
awk -F "\t" -v f="$totf" '{for (i=1;i<=f;i++) if ($i != $(i+f)) print "line "NR", field "i": ["$i"] > ["$(i+f)"]"}' merged \
| tee "$1"-changes-"$stamp"
rm merged
exit 0
For example, here's an original table, "demo_old":

And here's the cleaned version, "demo":

