
For a full list of BASHing data blog posts, see the index page.


Dog and cat data

The Australian government's open data portal has a surprisingly large amount of data on dogs and cats. Nearly all of it comes from local councils with open data policies, since it's local government in Australia that registers domestic animals, regulates animal numbers on non-farm properties and answers the call when someone complains about a wandering dog.

An example of a complaints dataset comes from the Townsville (Queensland) City Council and includes 33,526 complaints recorded monthly from October 2013 to December 2018. The data is provided as a CSV and the CSV is unproblematic: no commas or quotes within fields. There are five fields:

[Image: the five field names in the Townsville CSV]

and dog complaints outnumber cat complaints by about nine to one:

[Image: tally of Townsville complaints by animal]

uniqc is a handy alias I use to left-justify and tab-separate the tally numbers from uniq -c:
 
alias uniqc="uniq -c | sed 's/^[ ]*//;s/ /\t/'"
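
Here's a minimal sketch of uniqc at work. It uses a shell function rather than the alias, because aliases aren't expanded in scripts by default, and it assumes GNU sed, which reads \t in the replacement as a tab. The sample lines and their three field names are made up for illustration and are not the council's data.

```shell
# Function form of the uniqc alias shown above.
# Assumes GNU sed, which reads \t in the replacement as a tab.
uniqc() { uniq -c | sed 's/^[ ]*//;s/ /\t/'; }

# Made-up sample standing in for a complaints CSV (not council data)
sample='month,animal,type
2018-01,Dog,Noise
2018-01,Cat,Wandering
2018-02,Dog,Attack
2018-02,Dog,Noise'

# Tally field 2 (animal), skipping the header line
animal_tally=$(printf '%s\n' "$sample" | awk -F"," 'NR>1 {print $2}' | sort | uniqc)
printf '%s\n' "$animal_tally"

# Tally the number of comma-separated fields on each line;
# a single value suggests there are no stray commas inside fields
field_tally=$(printf '%s\n' "$sample" | awk -F"," '{print NF}' | sort | uniqc)
printf '%s\n' "$field_tally"
```

The second tally is a quick way to check that a CSV is "unproblematic" in the sense used above: if every line has the same field count and grep finds no double quotes, splitting on commas is probably safe.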

Cat-lovers will note that unlike dogs, no Townsville cats were reported for being aggressive, attacking or making noise:

[Image: Townsville complaint types tallied for dogs and for cats]

Brisbane (Queensland) City Council's complaints data aren't as neatly compiled as Townsville's. The 2,101 complaints recorded in July-September 2018 are broken up by "Category: Type" (field 2) and "Category: Reporting Level" (field 3), with overlap of data item types:

[Image: Brisbane complaints tallied by field 2 and field 3]

Even the "Attack" category is a bit muddled:

[Image: entries in the Brisbane "Attack" category]

Brisbane's animal permit dataset (2019-01-03) is better organised. All 107,039 dog registrations are current, and the "Animal: Breed" category (field 4) gives an insight into dog breed popularity. Here are the top 10 registrations:

[Image: top 10 dog breeds by number of registrations]
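
A top-10 tally like this one can be built with the usual sort and uniq pipeline. The sketch below is one way to do it, not necessarily the command used for the image. The sample rows and all field names other than the breed in field 4 are made up.

```shell
# Made-up sample standing in for the permit CSV; only field 4 (breed) is used
sample='permit,suburb,type,breed
1,A,Dog,Labrador
2,B,Dog,Poodle
3,C,Dog,Labrador
4,D,Dog,Kelpie
5,E,Dog,Labrador
6,F,Dog,Poodle'

# Extract field 4, count each breed, sort by count (largest first), keep the top 10
top_breeds=$(printf '%s\n' "$sample" | awk -F"," 'NR>1 {print $4}' \
    | sort | uniq -c | sort -nr | head -n 10)
printf '%s\n' "$top_breeds"
```

Splitting on commas only works if no field contains an embedded comma, so it's worth checking the field counts first.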

The full list of breeds in the Brisbane permit dataset doesn't include "mongrel", but there are seven blank entries and 560 "Unknown".

Another interesting dataset for exploring pet data comes from the City of Greater Dandenong (Victoria). It lists primary breed, primary colour, de-sexed status (yes/no) and year and month of birth (in ISO 8601 format! Yay!) for 3,363 cats up to 2017-02-18. The oldest cat in the register has the nominal birthday May 1985, making it almost 34 years old if it's still alive. Here's the registration spread by birth-year:

awk -F"," 'NR>1 {year[substr($6,1,4)]++} END {for (i in year) print i "\t" year[i]}' cgdcatsdetails.csv | sort | column

[Image: cat registrations tallied by birth year]

Field 6 in "cgdcatsdetails.csv" is the birthday in the form YYYY-MM, where fields are comma-separated (-F","). For each line after the header (NR>1), AWK's "substr" function extracts the year from field 6 (the first 4 characters, starting at position 1, because AWK numbers string positions from 1) and uses it as an index in the array "year", where occurrences of that year are totalled (++). When the end of file is reached, the END statement prints each year and its total occurrences. The sort command puts the results in order by year, and column arranges them in columns across the terminal.
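
To see the birth-year tally on something small, here's a sketch that runs the same kind of AWK command on a few made-up lines. Only the YYYY-MM birthday in field 6 follows the description above; the other field names and all the values are invented. The start position in substr is 1, since AWK counts characters from 1.

```shell
# Made-up stand-in for cgdcatsdetails.csv; only field 6 (YYYY-MM) matters here
sample='id,breed,colour,desexed,suburb,birth
1,DSH,Black,Y,X,2010-05
2,DSH,Tabby,Y,X,1985-05
3,DLH,White,N,X,2010-11
4,DSH,Ginger,Y,X,2015-03'

# Count registrations per birth year (first 4 characters of field 6),
# then sort the "year<TAB>count" lines by year
tally=$(printf '%s\n' "$sample" \
    | awk -F"," 'NR>1 {year[substr($6,1,4)]++} END {for (i in year) print i "\t" year[i]}' \
    | sort)
printf '%s\n' "$tally"
```

The sort step is needed because AWK's "for (i in year)" loop visits array indices in no guaranteed order.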

The oddest data I've noticed among the "animal" datasets on data.gov.au is in Logan City Council's Deceased Animal Collection Requests. For each of the 4,304 requests, the council has entered the suburb name, the latitude/longitude of the suburb in decimal degrees to five or six decimal places, and the same again to 15 decimal places. Carbrook, for instance, is at -27.673862 153.25624 and at -27.673861999297635 153.256240000388146.
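
The decimal-place counts for the Carbrook figures are easy to confirm in the shell. This is a small sketch; decimal_places is a hypothetical helper name, not something from the dataset or the council.

```shell
# decimal_places (hypothetical helper): print the number of digits after the decimal point
decimal_places() {
    frac=${1#*.}              # strip everything up to and including the first "."
    printf '%s\n' "${#frac}"  # length of what remains
}

# The four Carbrook coordinates quoted above
for coord in -27.673862 153.25624 -27.673861999297635 153.256240000388146; do
    printf '%s\t%s\n' "$coord" "$(decimal_places "$coord")"
done
```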

I've posted before about spatial data with too many decimal places, but this dataset could be a world-beater. Not only is the lat/lon duplicated in each record, but those 15-place figures locate the suburb's latitude to the nearest tenth of a nanometre, about half the diameter of a chlorine atom.
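
The arithmetic behind that "tenth of a nanometre" can be sketched with AWK. It assumes one degree of latitude spans roughly 111 km, which is an approximation, and compares the ground distance covered by the sixth and fifteenth decimal places.

```shell
# Assumption: one degree of latitude spans roughly 111 km (111,000 m)

# Sixth decimal place (1e-6 degree), in metres
six_dp_m=$(awk 'BEGIN {printf "%.3f", 111000 * 1e-6}')

# Fifteenth decimal place (1e-15 degree), in nanometres (1 m = 1e9 nm)
fifteen_dp_nm=$(awk 'BEGIN {printf "%.3f", 111000 * 1e-15 * 1e9}')

printf '6th decimal place:  about %s m\n' "$six_dp_m"
printf '15th decimal place: about %s nm\n' "$fifteen_dp_nm"
```

Under that assumption, six decimal places already pin a latitude down to about 11 cm, which is far more than a suburb needs.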


Last update: 2019-03-31