logo

On this page:
    A data auditing procedure
    Reporting results

On the Extras 2 page:
    Single-record viewing
    Single-record paging (YAD)
    Bulk replacements (YAD)
    Field-aware search

On the Extras 3 page:
    A better uniq count
    Reformat tab-separated items
    Visualise single spaces
    A special date check
    A special cross-file check

On the Extras 4 page:
    Progressive split on regex
    Choose a command template
    Hex values in a text string
    Top and bottom of a text file


TSV This marker means that the recipe only works with tab-separated data tables.


A data auditing procedure

Check and fix if needed:

  1. character encoding
  2. structure (TSV?)
  3. gremlins
  4. Windows line endings
  5. broken records
  6. blank records
  7. empty fields
  8. character encoding failures
  9. excess whitespace
  10. multiple versions of the same character
  11. combining characters
  12. unmatched brackets
  13. correct and consistent formatting (names, dates, latitude/longitude, measurement units etc)
  14. inappropriate data items in a field
  15. truncated data items
  16. disagreements between fields
  17. missing but expected data items
  18. duplicates
  19. partial duplicates
  20. pseudo-duplicates
  21. related-field issues

...and anything else you notice while doing these checks, especially from field tallies!


Reporting results

If a data table has only been audited (not cleaned), a report can be based on results of the various checks done (see above) as documented in the auditing log. My practice in reporting is to group the results by problem, rather than review the table problematic-record by problematic-record.

If a data table has been cleaned, the "reportlist" script (below) will generate a before-and-after report. The script should only be used after any records have been added, deleted or checked and fixed for breaks, as the script assumes that the before-and-after tables have the same number of records and the same number of fields, in the same order.

The "reportlist" script takes as its two arguments the "before" and "after" versions of the data table and prints the results to screen (screenshot below). It also generates a time-stamped file of results called table-changes-[current date and time] in the working directory.

Script to list changes by line (record) and field    TSV

#!/bin/bash
stamp=$(date +%F_%H:%M)
paste "$1" "$2" > merged
totf=$(head -1 "$1" | tr '\t' '\n' | wc -l)
awk -F "\t" -v f="$totf" '{for (i=1;i<=f;i++) if ($i != $(i+f)) print "line "NR", field "i": ["$i"] > ["$(i+f)"]"}' merged \
| tee "$1"-changes-"$stamp"
rm merged
exit 0

For example, here's an original table, "demo_old":

demo_old

And here's the cleaned version, "demo":

demo reportlist