Three kangaroos in the ocean
A record might be deliberately deleted from a dataset for any of several reasons. The deleted record:
- duplicates another record in the dataset
- isn't relevant (it was added to the wrong dataset)
- is entirely blank apart from a unique identifier code
- lacks important data items
- has malformed data items
- has contradictory data items
- has unexpected and slightly unbelievable data items (outlier record)
- comes from an untrustworthy source
I've seen the term data cleaning applied to deletions like these, but deletion isn't cleaning, it's filtering. The quality of the dataset isn't improved by deletion, as it is by cleaning. Instead, a new, smaller dataset is generated with less information than was present in the original, larger dataset.
My habit as a data auditor is "don't delete, investigate". Here's an interesting example, a heat map showing relative numbers of records for the Eastern Grey Kangaroo:
Notice the records in the North Pacific, Indian and South Atlantic Oceans (arrows)? Eastern Greys aren't marine, in my experience...
...so what's the story? Well, the map comes from the Global Biodiversity Information Facility (GBIF), and GBIF got the records from the Atlas of Living Australia (ALA). If you were doing a study on Eastern Grey Kangaroo occurrences you might be tempted to discard those three oceanic records. But with a little digging, two of the three can be located on land:
North Pacific Ocean. The ALA record (accessed 2020-08-30) says this occurrence was noted by citizen scientist Kylie Carman on 2013-12-30 from the "mt bucca biodiversity project", and the supplied longitude and latitude were "151.58419799804688,24.48559951710993". (Too many decimal places, sigh...) On the same day at the same spot, Ms Carman also observed a Little Eagle (ALA record).
In 2018 Ms Carman owned the Mt Bucca property, ca 30 km west of Bundaberg in Queensland, Australia. The property is a wildlife sanctuary in Humane Society International's "Wildlife Land Trust" program. If the latitude of the observation were negative ("-24.485..."), we'd be in the right part of the world, but that particular lat/lon is ca 70 km NW of the property. There's a photo (below) of the "Mt Bucca Biodiversity Project" property sign on Flickr taken on 2013-12-13. Unfortunately the photo's EXIF data don't include GPS readings.
So the exact location might be doubtful, but a little investigating has shifted the record from the North Pacific Ocean to southeast Queensland near Bundaberg.
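A suspected sign flip like this one can be tested mechanically. What follows is a minimal Python sketch, not anything ALA or GBIF is known to run: it tries every sign combination of the coordinates (plus a lat/lon swap) and keeps the variants that land inside a rough bounding box for Australia. The box limits and the function names are assumptions made for illustration.

```python
from itertools import product

# Rough bounding box for mainland Australia plus Tasmania.
# ASSUMPTION: illustrative limits only, not an authoritative boundary.
# Order: (min_lat, max_lat, min_lon, max_lon)
AUS_BBOX = (-44.0, -10.0, 112.0, 154.0)


def in_bbox(lat, lon, bbox=AUS_BBOX):
    """True if the point lies inside the bounding box."""
    min_lat, max_lat, min_lon, max_lon = bbox
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def plausible_variants(lat, lon, bbox=AUS_BBOX):
    """Return sign-flipped and swapped variants of (lat, lon) inside bbox."""
    found = []
    # Try the values as given, then with lat and lon swapped.
    for a, b in ((lat, lon), (lon, lat)):
        # Try all four sign combinations for each ordering.
        for sign_a, sign_b in product((1, -1), repeat=2):
            candidate = (sign_a * a, sign_b * b)
            if in_bbox(*candidate) and candidate not in found:
                found.append(candidate)
    return found


# The Mt Bucca record as supplied (positive latitude).
print(plausible_variants(24.48559951710993, 151.58419799804688))
```

For the Mt Bucca record, only the negative-latitude variant survives the check. A hit like that is a lead to follow up, not proof of the true location.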
Indian Ocean. This ALA record (accessed 2020-08-30) comes from the Tasmanian Museum and Art Gallery (TMAG). It's a 2007 collection of an Eastern Grey from Maria Island (Tasmania), to which the kangaroos were introduced about 50 years ago.
TMAG gave the lat/lon as "-54.6183 146.8933", which is a long, long way south of Maria Island. The spatial mistake was spotted by an ALA user on 2012-09-02. A TMAG staff member responded on 2013-11-13, saying "Thank you for alerting me to the incorrect coordinates for this specimen. The correct information is Latitude -42.633 Longitude 148.833. Our database has now been ammended and the coordinates will appear correctly after our next update is sent to the ALA."
The amended lat/lon is also wrong: it's in the ocean 60 km east of Maria Island. I don't know whether TMAG checked and corrected this amendment or sent it to ALA at some point, but in 2020 that kangaroo is still in the southern Indian Ocean, not at its correct location on Maria Island.
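An offset like this can be checked with a great-circle distance. Here is a minimal Python sketch using the haversine formula on a spherical Earth. The reference position for Maria Island (about 42.63 S, 148.08 E) is my own approximation for illustration, not a value taken from TMAG or ALA, and the function name is hypothetical.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


# ASSUMPTION: approximate position of Maria Island, for illustration only.
MARIA_ISLAND = (-42.63, 148.08)

original = (-54.6183, 146.8933)   # lat/lon first supplied by TMAG
amended = (-42.633, 148.833)      # lat/lon in the TMAG reply

for label, point in (("original", original), ("amended", amended)):
    print(label, round(haversine_km(*point, *MARIA_ISLAND)), "km from Maria Island")
```

If the assumed island position is reasonable, the amended point should come out some tens of kilometres offshore and the original point well over a thousand kilometres away.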
South Atlantic Ocean. This one is another citizen science sighting, by Meg Bourne on 2013-05-09. The supplied lat/lon in the ALA record (accessed 2020-08-30) is "-35.184669494628906, -35.184669494628906", a duplication that would set alarm bells ringing in any data auditor's head. The location was only specified as "Suburban residential area surrounded by bush land and farm land".
That (duplicated) latitude cuts through SE Australia just south of Adelaide and just north of Canberra, most of it Eastern Grey country. I've been unable to find other records from the same observer, date or latitude, so the longitude can't be recovered and this Eastern Grey record remains useless.
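Both warning signs seen in these records, a latitude repeated in the longitude field and an implausible number of decimal places, are easy to flag automatically. The Python sketch below is one way to do it. The function name, the flag wording and the five-decimal threshold are assumptions for illustration, not an established standard.

```python
def coordinate_flags(lat_text, lon_text, max_decimals=5):
    """Return a list of warnings for a lat/lon pair supplied as text."""
    flags = []
    lat_text, lon_text = lat_text.strip(), lon_text.strip()

    # Identical strings in both fields suggest a copy-paste or export error.
    if lat_text == lon_text:
        flags.append("lat and lon identical")

    for name, text, limit in (("lat", lat_text, 90.0), ("lon", lon_text, 180.0)):
        try:
            value = float(text)
        except ValueError:
            flags.append(f"{name} not numeric")
            continue
        if abs(value) > limit:
            flags.append(f"{name} out of range")
        # Count the digits after the decimal point in the text as supplied.
        decimals = len(text.partition(".")[2])
        if decimals > max_decimals:
            flags.append(f"{name} has {decimals} decimal places")
    return flags


# The South Atlantic record as supplied.
print(coordinate_flags("-35.184669494628906", "-35.184669494628906"))
```

Working on the text as supplied, rather than on parsed floats, matters here: the decimal-place count is lost once a value has been converted to a number. A flag only marks a record for investigation. It doesn't say what the correct value is.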
I haven't found much advice online about how to document deletions. In the few cases where I've deleted records from a dataset I was auditing, I cut out the records and pasted them into a new file, together with an in-record note explaining why I thought the deletion was justified. A simpler but slightly less transparent practice is to keep a log (see "Pro tips" on the page linked) of all operations on a dataset, including deletions.
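The cut-and-paste practice described above could be scripted along these lines. This is a minimal Python sketch under my own assumptions: records are rows in a CSV file, the note goes into an extra `deletion_note` field, and the function names and file names are hypothetical.

```python
import csv
from datetime import date


def set_aside(rows, should_remove, reason):
    """Split rows (dicts) into kept and removed; removed rows get a dated note."""
    kept, removed = [], []
    note = f"{date.today().isoformat()}: {reason}"
    for row in rows:
        if should_remove(row):
            # Copy the row and add the note, leaving the original untouched.
            removed.append({**row, "deletion_note": note})
        else:
            kept.append(row)
    return kept, removed


def write_rows(path, rows):
    """Write a list of dicts to a CSV file (does nothing for an empty list)."""
    if not rows:
        return
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


# Hypothetical example rows; real data would be read with csv.DictReader.
records = [
    {"id": "a1", "lat": "-35.184669", "lon": "-35.184669"},
    {"id": "a2", "lat": "-24.4856", "lon": "151.5842"},
]
kept, removed = set_aside(
    records,
    lambda r: r["lat"] == r["lon"],
    "latitude duplicated in longitude field; location not recoverable",
)
write_rows("kept.csv", kept)
write_rows("set_aside.csv", removed)
```

Nothing is lost this way: every record ends up in one of the two files, and each set-aside record carries its own dated justification.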
But before deleting any record I hum "Don't discard that point, my friend". (I recommend the version by Country Joe and the Fish.)
Last update: 2020-09-30
The blog posts on this website are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.