For a full list of *BASHing data* blog posts see the index page.

# How to find the missing parts of a series

My wife has a data table with a unique serial number for each of its records. The table gets frequent edits and she wanted to check which of the numbers might be missing as a result of past deletions. There are thousands of numbers, so eyeballing for gaps in the series isn't practical.

She gave me a text file ("UIs") with an unsorted list of the numbers. To discover the first and last ones I did a numerical **sort** of "UIs" (*sort -n*) and piped the result to **sed** to print just the first and last lines in the sorted list:

sort -n UIs | sed -n '1p;$p'

I then used **comm** to compare the sorted list of numbers in "UIs" with the sorted full range of numbers from 1 to 7255, built with **seq**. The *-3* option for **comm** suppresses the lines the two lists have in common. I found 24 missing numbers:

comm -3 <(sort UIs) <(sort <(seq 1 7255)) | column

Well, that was easy, but it was just a series of numbers. What if the series wasn't so simple? How to build a complete series of other kinds of strings, to compare with the series from which items might be missing?

**Product codes.** Below is an unsorted list of serial product codes, filename "SKUs". One is missing from the series.

BRM109Z

BRM112Z

BRM99Z

BRM116Z

BRM96Z

BRM102Z

BRM117Z

BRM101Z

BRM120Z

BRM104Z

BRM98Z

BRM105Z

BRM118Z

BRM114Z

BRM119Z

BRM115Z

BRM111Z

BRM107Z

BRM108Z

BRM100Z

BRM110Z

BRM113Z

BRM97Z

To get the start and finish codes I'll use **sort** and **sed** again. A simple sort would put "BRM99Z" after "BRM100Z", so I've used **sort**'s *-V* option for sorting version numbers:

sort -V SKUs | sed -n '1p;$p'

I'll build the complete series of product codes with a BASH **for** loop and **echo**:

comm -3 <(sort SKUs) <(sort <(for i in {96..120}; do echo "BRM"$i"Z"; done))

**A hierarchical series.** I see these occasionally in my data auditing work. For example, a biologist has 6 serially numbered samples from 4 different methods at 3 different sites, or 72 samples in all. Each sample is coded *Site-Method-SampleNumber*. In "samples" (below), the sites are coded BAR, MOG and SPA; the methods are coded BRU, PIT, SIE and SWE; and one of the 72 samples is missing:

MOG-SIE-5 | BAR-SWE-2 | SPA-SWE-4 | SPA-SWE-2 |

SPA-SIE-1 | BAR-PIT-5 | SPA-PIT-5 | MOG-SWE-6 |

MOG-BRU-4 | MOG-PIT-4 | BAR-BRU-2 | BAR-SWE-6 |

BAR-BRU-6 | MOG-SWE-5 | BAR-SWE-3 | SPA-PIT-3 |

MOG-SIE-6 | MOG-SIE-2 | SPA-BRU-1 | BAR-BRU-5 |

BAR-SWE-4 | SPA-SIE-3 | MOG-PIT-5 | MOG-PIT-6 |

MOG-SIE-3 | SPA-SIE-4 | MOG-SWE-1 | BAR-SIE-5 |

SPA-PIT-1 | SPA-PIT-4 | BAR-BRU-3 | BAR-SWE-5 |

MOG-BRU-3 | BAR-PIT-3 | MOG-SIE-4 | BAR-PIT-4 |

BAR-SIE-1 | SPA-BRU-2 | MOG-SWE-4 | BAR-PIT-1 |

BAR-BRU-4 | SPA-SIE-5 | BAR-SIE-4 | BAR-PIT-2 |

MOG-PIT-1 | SPA-BRU-3 | SPA-BRU-6 | SPA-SWE-6 |

MOG-BRU-2 | MOG-PIT-2 | BAR-SWE-1 | BAR-BRU-1 |

BAR-SIE-3 | BAR-PIT-6 | SPA-SIE-6 | MOG-PIT-3 |

SPA-BRU-5 | SPA-SWE-5 | MOG-BRU-1 | MOG-SWE-3 |

SPA-SWE-1 | MOG-BRU-5 | SPA-SIE-2 | SPA-PIT-6 |

BAR-SIE-2 | SPA-BRU-4 | BAR-SIE-6 | MOG-SIE-1 |

MOG-BRU-6 | SPA-PIT-2 | SPA-SWE-3 |

To "reconstitute" this table as a simple unsorted list, copy/paste it into a text file and do

sed 's/\t/\n/g' file | awk NF > samples.

To build and check the complete series, I'll use three BASH arrays and iterate through them as nested entities, **echo**ing the output as a suitably formatted string:

site=(BAR MOG SPA); method=(BRU PIT SIE SWE); sample=(1 2 3 4 5 6)

comm -3 <(sort samples) \

<(sort <(for i in ${site[@]}; do \

for j in ${method[@]}; do \

for k in ${sample[@]}; do \

echo $i"-"$j"-"$k; done; done; done))

**Update.** Reader Markus Weihs suggests a much simpler solution to the hierarchical series problem, using BASH brace expansion. The command

echo {BAR,MOG,SPA}-{BRU,PIT,SIE,SWE}-{1..6}

builds all the possible combinations as a space-separated series:

The space-separated series could be turned into a list with either **tr " " "\n"** or **xargs -n 1**, but an even simpler method is to use **printf**:

comm -3 <(sort samples) \

<(sort <(printf "%s\n" {BAR,MOG,SPA}-{BRU,PIT,SIE,SWE}-{1..6})

Last update: 2021-02-03

The blog posts on this website are licensed under a

Creative Commons Attribution-NonCommercial 4.0 International License