logo

On this page:
    Date formatting
    Lat/lon formatting
    Punctuation checks
    Data item fits

On the Format 1 page:
    Fix Windows line endings
    Multiple character versions
    Unmatched braces
    Excess whitespace
    Malformed markup


TSV This marker means that the recipe only works with tab-separated data tables.


Date formatting

Because there are so many different date formats, there are no simple recipes for checking (or cleaning) every kind of date. A data cleaner's first job is to check for consistency. If a date field mainly uses the format MM/DD/YYYY, then all dates in the field should have that format. A second job is to check for malformed and impossible dates, like "19927", "3016" or "4/31/2001". Both checks are most easily done with a field tally.

The BASHing data blog has posts on other details of date checking. How to validate ISO 8601 dates without regex explains the use of AWK commands for checking individual ISO 8601 date components and for ensuring that the start date in an ISO 8601 interval is before the finish date. Changing the month format: a fairly general solution offers a table of month variations (e.g. January, Jan, 1, 01, i, I) and shows how it can be used to build or convert dates in a range of formats (examples below).

One task I do fairly regularly with dates is an example of checking for field disagreements. The tables for this check have an ISO 8601 date field, a year field, a month field and a day field, spreading a date across four tab-separated fields. To check for disagreements across the four fields I adapt this command:

For demonstration purposes the full date field number is E, the year field number Y, the month field number M and the day field number D.
 
awk -F"\t" 'NR>1 {split($E,a,"-"); \
if (a[1]!=$Y || a[2]!=$M || a[3]!=$D) \
print $E FS $Y FS $M FS $D}' [table] \
| sort | uniq -c

dates1

Dates in spreadsheets are a well-known nightmare and may create problems when copy/pasting from a spreadsheet into a text file. The safest way I've found to get single-field dates out of a spreadsheet is to convert the date to text-only format within the spreadsheet before copy/pasting. Multiple-field dates in a spreadsheet are a welcome alternative; have year, month and day in separate fields. The separate date items can be combined into a single-date field in various ways after the data have been moved to a text file:

dates2

For an explanation of the "months" table (in ~/scripts/months) see this BASHing data post.


Lat/lon formatting

I've included this section in the Cookbook because I so often see poor formatting of latitude/longitude figures when auditing data. The most common problems are format inconsistency, poor choice of characters and over-accurate figures.

Format inconsistency. There are three widely used lat/lon formats:

    Degrees-minutes-seconds: 48°51'53'' N, 2°20'56'' E
    Decimal minutes: 48°51.88' N, 2°20.93' E
    Decimal degrees: 48.8647 2.3489

The first two have "subformats". For example, there may or may not be a comma between latitude and longitude when both are in the same data item. The direction letter (N,E,S,W) may precede the figure rather than follow it, and there may not be a space between figure and letter (or letter and figure). Single numbers may or may not have a leading zero (48°5'N vs 48°05'N).

Decimal degrees should not have format variations. The correct use is latitude first, longitude second; latitudes north of the Equator positive, south of the Equator negative; longitudes east from the Prime Meridian to 180° positive, longitudes west from the Prime Meridian to 180° negative.

Whichever format is chosen, it should be used consistently in every data item in a latitude, longitude or combined latitude/longitude field.

Poor choice of characters. I've seen three different attempts at a degree character in a single data table: degree sign (U+00B0, hex c2 b0), masculine ordinal indicator (U+00BA, hex c2 ba) and combining ring above (U+030A, hex cc 8a). The correct character to use is the degree sign. Note that this is a 2-byte character in UTF-encoding but only a 1-byte character (hex b0) in the commonly seen Windows-1252 encoding. Incorrect conversion to UTF-8 leaves the Windows byte as a replacement character (�) rather than a degree sign; iconv does the conversion correctly.

I've also seen several different quote characters used in degrees-minutes-seconds and decimal minutes figures. The correct characters to use are apostrophe ("single quote", U+0027, hex 27) for minutes and quotation mark ("double quote", U+0022, hex 22) for seconds. Two apostrophes one after the other for seconds is poor formatting and unnecessary.

The negative sign in decimal degrees for latitudes south and longitudes west should be a hyphen (or "minus", U+002D, hex 2d), not an en dash (U+2013, hex e2 80 93).

Overaccurate figures. "The building is at -33.8903169365705 151.198409720645."

The BASHing data blog has posts here and here on this issue. Opinions are divided. Some people think it's fine to include for latitude and longitude as many decimal places as the GIS system or GPS or Google Maps/Earth provides. Other people (me among them) think that doing so demonstrates a lack of understanding of measurement error and calculation error. For an independent view of what each of the decimal places means, see the "Precision" table in this Wikipedia article.


Punctuation checks

Proofreading of text includes checking to see whether punctuation marks have been used correctly. Checks of this kind are easy to do on data tables, but ad hoc grep commands based on regular expressions are best, based on the punctuation found in a particular data table. Some examples:

Any commas not followed by a space?
(apple,banana, cherry)
grep -n ",[^ ]" table
 
Any commas followed by a space?
(apple,banana, cherry)
grep -n ",[ ]" table
 
Any opening braces not preceded, or closing braces not followed by a space?
(On the 25th(Monday), we...)
grep -En "[^ ](|)[^ ]" table
 
Any "end" punctuation preceded by a space?
(Such a thing  !)
grep -n "[ ][.,;:\!?]" table
 
Any "doubled-up" commas?
(Smith,, Frank | Hopkins, Charles)
grep -En ",{2,}" table


Data item fits

In most data tables, each non-blank data item will begin and end with an alphanumeric character and will "fit" the field neatly. If not, the item might have been truncated, or some data cleaning might be needed.

The "itemfits" function takes the table name as its one argument and returns line (record) number and field number, then the data item within square braces. Blank data items are excluded, but leading and trailing whitespace are detected. The "itemfits" output should be piped to less to avoid flooding the terminal if you suspect that many data items will be "non-fits".

Locate data items that don't begin and end with an alphanumeric character    TSV

itemfits() { awk -F"\t" '{for (i=1;i<=NF;i++) if ($i != "" && ($i !~ /^[[:alnum:]]/ || $i !~ /[[:alnum:]]$/)) print "line "NR", field "i":\n ["$i"]"}' "$1"; }

fits