For a list of BASHing data 2 blog posts see the index page.


5, 7, 8, 9, 10, 12, 14, 15, 17. Any advance on 17?

I reckon I could do better than 17 with A Data Cleaner's Cookbook, but I'm suspicious of headlines with numbers. They remind me of the Confucian Four Cardinal Principles and Eight Virtues and all those silly "Top Ten" lists. Nevertheless, I'll go numerological here and ask:

     Are there 3 data quality issues that these 9 websites all agree are troublesome?

Yes, and the big 3 are duplicate data, inaccurate data and missing (or incomplete) data.

The tricky one is missing. It's easy enough to find data gaps and inappropriate NULLs on the command line, but what next? If you don't know the customer's email address or date of birth, or the location of a traffic incident, or the zinc content in a contaminated soil sample, you can't just invent those data items. To fill the gaps you need to go out into the world beyond the data tables and search for answers.
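
The "easy enough" part, finding the gaps, might look something like the sketch below, which counts blank or NULL-like entries per field in a tab-separated table. The sample table, the field names and the set of strings treated as "missing" are all invented for illustration; a real table will have its own conventions for marking a gap.

```python
import csv
import io
from collections import Counter

# Assumed markers for "missing"; adjust to suit the dataset at hand
MISSING = {"", "NULL", "null", "NA", "N/A", "-", "?"}

def count_missing(tsv_text):
    """Return a Counter of missing-value counts keyed by field name."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    counts = Counter()
    for row in reader:
        for field, value in row.items():
            if field is None:
                continue  # surplus fields in an over-long row; ignored here
            # value is None when a row has fewer fields than the header
            if value is None or value.strip() in MISSING:
                counts[field] += 1
    return counts

# Invented sample table (tab-separated, with a header line)
sample = (
    "id\tspecies\tlatitude\tlongitude\n"
    "1\tAus bus\t-41.2\t146.1\n"
    "2\t\t-41.3\tNULL\n"
    "3\tCus dus\tNA\t\n"
)

for field, n in sorted(count_missing(sample).items()):
    print(f"{field}: {n} missing")
```

The counts tell you where the gaps are, and nothing more. What belongs in those gaps is a separate question.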

Data analysis, on the other hand, has its own "fixes" for missing data. One is listwise deletion, or "data dropping". Something missing from a record? Throw the record out!
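
A minimal sketch of listwise deletion, assuming records held as dictionaries and a small, assumed set of missing-value markers (the records and field names are invented):

```python
# Assumed markers for "missing"
MISSING = {"", "NULL", "NA"}

def listwise_delete(records):
    """Keep only the records in which no field is missing."""
    return [
        rec for rec in records
        if all(v is not None and str(v).strip() not in MISSING
               for v in rec.values())
    ]

# Invented sample records
records = [
    {"id": "1", "zinc_ppm": "120", "site": "A"},
    {"id": "2", "zinc_ppm": "", "site": "B"},
    {"id": "3", "zinc_ppm": "95", "site": "NA"},
    {"id": "4", "zinc_ppm": "210", "site": "C"},
]

kept = listwise_delete(records)
print(f"kept {len(kept)} of {len(records)} records")
```

In this invented sample a single gap in each of two records costs half the table, which is the obvious price of the method.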

Then there's imputation, which refers to a group of statistical methods. When a data field contains numerical values, you could fill a blank with the mean or median from the same field. If the field values are categorical, you could fill a blank with the mode.
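
As a rough illustration of those three fills, here is a sketch using Python's standard `statistics` module. The field values are invented, and `None` stands in for a blank:

```python
from statistics import mean, median, mode

def impute(values, method):
    """Replace None entries with a summary of the non-missing entries."""
    present = [v for v in values if v is not None]
    summarise = {"mean": mean, "median": median, "mode": mode}[method]
    fill = summarise(present)
    return [fill if v is None else v for v in values]

# Invented sample fields
zinc = [120, None, 95, 210, None, 95]            # numerical
habitat = ["forest", None, "heath", "forest"]    # categorical

print(impute(zinc, "mean"))
print(impute(zinc, "median"))
print(impute(habitat, "mode"))
```

Note that every blank in a field gets the same fill value, so the filled field looks less variable than the real data probably were.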

Multiple imputation is a group of more sophisticated techniques. In one sub-group of these methods, instead of just guessing the missing value in Field X based on non-missing values in Field X, you generate an imputed value by first looking at how the non-missing values in Field X relate to non-missing values in other fields.
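
The "relate Field X to other fields" idea can be sketched with a single least-squares line. To be clear, this is plain regression imputation and not multiple imputation proper, which would generate several differently filled copies of the table and pool the results. The field names (`depth_cm`, `zinc_ppm`) and the values are invented, and the helper functions are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def regression_impute(rows, x_key, y_key):
    """Fill missing y values from x, using a line fitted to complete rows."""
    complete = [(r[x_key], r[y_key]) for r in rows
                if r[x_key] is not None and r[y_key] is not None]
    slope, intercept = fit_line([x for x, _ in complete],
                                [y for _, y in complete])
    filled = []
    for r in rows:
        r = dict(r)  # copy, so the original rows are left untouched
        if r[y_key] is None and r[x_key] is not None:
            r[y_key] = slope * r[x_key] + intercept
        filled.append(r)
    return filled

# Invented sample: zinc content falling with sample depth
rows = [
    {"depth_cm": 10, "zinc_ppm": 200},
    {"depth_cm": 20, "zinc_ppm": 160},
    {"depth_cm": 30, "zinc_ppm": None},
    {"depth_cm": 40, "zinc_ppm": 80},
]

for r in regression_impute(rows, "depth_cm", "zinc_ppm"):
    print(r)
```

The imputed value is only as good as the assumed relationship between the two fields, which is exactly the assumption that fails for the kind of data discussed next.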

For a very readable overview of imputation methods, see Rahul Rego's 2025 blog post on SkillCamper.

The biodiversity datasets I work with are largely non-numerical and don't lend themselves to statistical or algorithmic repair methods, even when analysis is worth doing. If someone has inadvertently left out a species name in a table of species occurrences, it's a big mistake to guess the missing species from the names of other species at the same location. If the coordinates (latitude/longitude) are missing but there's a generalised locality name in the record, like "Manitoba", it's another big mistake to fill the blank with averaged coordinates for other Manitoba records in the same table.

I rely instead on the "Pardon me..." approach, where I ask the data provider if they know what the missing values might be. If the answer is "We don't know", then the missing values stay missing.

Biodiversity datasets are rich in missing values, because that's the nature of biological records. Not all data items are noted down when a record is created, and it's usually difficult or impossible to fill in the blanks later on.

I worked once at a museum that had specimens (think of a stuffed bird or a seashell) with no locality, no date, no collector name, no donor name and no donation date. Orphaned items like these were simply left out of the museum's collection database — a nice demonstration of "data dropping".


Last update: 2025-09-05
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License