The data worker's guide to psiphiorrhea

A dataset I recently audited had a record for a marine specimen observed at latitude 6.47457312, longitude -52.5741239, depth 103.8799973 metres. I've changed the coordinates (but not their number of decimal places) to protect the data owner's privacy.

While those coordinates aren't as impressive as the
-33.8903169365705 151.198409720645
I blogged about in 2019 for a huge building in Sydney, Australia, they still specify the specimen's underwater location ±0.55 millimetres in latitude. And the depth measurement is ±0.00005 millimetres.
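
The arithmetic behind those two figures can be sketched in a few lines of Python. This is only an illustration, with some assumptions: a reported number is taken to be good to half a unit in its last decimal place, one degree of latitude is taken as roughly 111 km, and the function and constant names are made up for this example.

```python
# Rough average length of one degree of latitude, in metres.
# This is an assumed round figure for illustration only.
METRES_PER_DEGREE_LAT = 111_000

def decimal_places(text: str) -> int:
    """Count the digits after the decimal point in a number written as text."""
    _, _, fraction = text.partition(".")
    return len(fraction)

def implied_half_width(text: str) -> float:
    """Half of one unit in the last decimal place, in the number's own units."""
    return 0.5 * 10 ** -decimal_places(text)

# Latitude is in degrees, so convert degrees -> metres -> millimetres.
lat_mm = implied_half_width("6.47457312") * METRES_PER_DEGREE_LAT * 1000

# Depth is already in metres, so convert metres -> millimetres.
depth_mm = implied_half_width("103.8799973") * 1000

print(f"latitude: ±{lat_mm:.3f} mm")
print(f"depth:    ±{depth_mm:.5f} mm")
```

The numbers are passed in as text on purpose, because converting them to floats first would lose the count of decimal places that the recorder actually wrote down.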

I suspect that the marine recorder might be afflicted with psiphiorrhea. I concocted this word (pronounced siff-ee-oh-REE-uh) from Greek roots meaning "digit or numeral" and "flux". In the same way that someone who talks far too much is exhibiting logorrhea, or excessive word-iness, someone who uses far too many digits in their numbers is exhibiting psiphiorrhea, or excessive digit-iness.

The psiphiorrhea sufferer cannot be persuaded to use fewer digits. No amount of explaining about significant figures or measurement error will convince the psiphiorrheic that their numbers are absurd.

When questioned about superfluous digits, the psiphiorrheic data compiler will double down and emit evasive excuses:

That's what it said on the Thing-o-meter readout!
My teacher in high school told me never to round off
It's because the single-precision (mumble) floating radix point (mumble)
I didn't notice; the number must have wrapped in the spreadsheet cell

Psiphiorrhea is particularly at home in the sciences, where lots of decimal places make data look more respectable. Writing in 2020 about quantitative methods in linguistics, European scholar Jan Vanhove comments:

False precision abounds. Numbers are falsely precise if they suggest that the information on which they are based is more fine-grained than it actually was, like saying that the Big Bang happened 13,800,000,023 years ago because it has been twenty-three years since you learnt that it happened 13.8 billion years ago. Falsely precise numbers can also imply that an inference beyond the sample can be made with greater accuracy than is warranted by the uncertainty about that inference, like when a pollster projects that a party will gain 37.14% of the vote but the margin of error is 2 percentage points.

Of course, as data workers we need to be aware that long strings of decimal places can sometimes be important, especially in finance. If you're tabulating currency exchange rates, for example, the Indian rupee figure in
USD1 = INR74.497182 (mid-market quote on 2021-07-13)
is meaningful, especially for large-scale conversions. In a $1 million transaction, the difference between 74.497182 and 74.497181 is 1 whole rupee, which will buy you 1 serve of Center Fresh chewing gum.
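
That one-rupee difference is easy to check. The sketch below uses Python's decimal module so that the sixth decimal place isn't blurred by binary floating point. The variable names are made up for this example.

```python
from decimal import Decimal

amount_usd = Decimal("1000000")   # the $1 million transaction
rate_a = Decimal("74.497182")     # the mid-market quote
rate_b = Decimal("74.497181")     # same quote, one unit lower in the last place

# Difference in rupees between converting at the two rates
difference_inr = amount_usd * (rate_a - rate_b)
print(difference_inr)
```

Here every one of the six decimal places carries information, which is what separates an exchange rate from a latitude given to the nearest millimetre.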

So how should data workers deal with psiphiorrhea? The answer might seem counter-intuitive, but it's simple: ask for more numbers.

Can we get another field from you in that table? Right next to the one where you've got "18.000163"? What we need is an error estimate, maybe something like plus-minus X%. Doesn't matter how many decimal places. You will? That's great.

Putting an error estimate next to a ridiculously over-precise data item is a win for everybody. The psiphiorrhea sufferer gets to play with more digits, the data user can appreciate just how nonsensical the data item really is, and the data worker gets to say "Don't look at me, I just work with the data I'm given" with a clear conscience.
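
As a rough sketch of what the data user can do once that extra field exists, the Python below takes the "18.000163" value and a purely hypothetical error estimate of ±2% (the post doesn't give one) and shows the value at the precision the error supports. The function name and the rounding rule, which stops at the first significant digit of the error, are assumptions for illustration.

```python
import math

def meaningful_decimals(abs_error: float) -> int:
    """Decimal places down to the first significant digit of a positive error.

    A negative result means the error is in the tens, hundreds and so on.
    """
    return -math.floor(math.log10(abs_error))

value_text = "18.000163"   # the over-precise field quoted above
percent_error = 2.0        # hypothetical error estimate, not from the dataset

value = float(value_text)
abs_error = value * percent_error / 100

# Never show fewer than zero decimal places
places = max(meaningful_decimals(abs_error), 0)

reported_places = len(value_text.partition(".")[2])
print(f"reported: {value_text} ({reported_places} decimal places)")
print(f"supported: {value:.{places}f} ± {abs_error:.{places}f}")
```

With these made-up inputs, six reported decimal places shrink to one. The sketch assumes the error estimate is greater than zero, since an error of exactly zero has no logarithm.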

Last update: 2021-07-21
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License