
A bulk replacement GUI with YAD

I sometimes need to tidy up data tables containing pseudo-duplicate data items. The example below is from a real-world dataset and is part of a tally of a certain field. The tally function ignores the header and generates a sorted list of data items and their frequencies.

1   P. Fernández, I. Porras & J.A. varela
1   P. Fernández , I. Porras & J.A. Varela
1   P. Fernández & I. Porras & J.A. Varela
1   P. Fernández I. Porras & J.A. Varela
1   P. Fernández, I. Porras & J. A. Varela
7   P. Fernández, I. Porras & J.A Varela
923   P. Fernández, I. Porras & J.A. Varela
2   P. Fernandez, I. Porras & J. Varela
2   P. Férnandez, I. Porras & J. Varela
2   P. Fernández, I. Porras & J.Varela
35   P. Fernández, I. Porras & J. Varela
1   P.Fernández, I. Porras & J. Varela
6   P. Fernández, I. Porras y J.A. Varela
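
A tally like the one above can be built from a few standard tools. The function below is only a sketch and not necessarily the tally function used for this listing; the name "tally_sketch", its arguments and the demo data are assumptions made for illustration. It skips the header line, cuts out one tab-separated field and counts each distinct item.

```shell
# Hypothetical helper (an assumption, not the tally function used in this post):
# usage: tally_sketch FILE FIELDNUMBER
tally_sketch() {
    tail -n +2 "$1" |      # skip the header line
    cut -f"$2" |           # keep only the chosen tab-separated field
    sort | uniq -c |       # sort, then count each distinct item
    awk '{n = $1; sub(/^ *[0-9]+ /, ""); print n "\t" $0}'   # "  3 item" becomes "3<TAB>item"
}

# Small demo with made-up data
demo=$(mktemp)
printf 'road\tcode\n'                >  "$demo"
printf 'Old Storys Creek Road\tA\n'  >> "$demo"
printf 'Old Storys Creek Rd\tB\n'    >> "$demo"
printf 'Old Storys Creek Road\tC\n'  >> "$demo"
tally_sketch "$demo" 1
rm -f "$demo"
```

The tab between frequency and item matters later on, because it is what lets the frequencies be dropped with cut -f2.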

Tidying-up (or "normalising") means that I pick one of the variants (or an edited version of it) as the preferred form, then replace every instance of the other variants with it:

Choose "P. Fernández, I. Porras & J.A. Varela", replace others to get
983   P. Fernández, I. Porras & J.A. Varela

Doing this work on the command line, I found myself making tedium-caused errors, so I wrote a shell script (below) to do the job more visibly in a GUI. I'll demonstrate how the script works using the simple tab-separated file "table":


Doing a tally on field 1 gives this list:


Suppose I want to correct and normalise the Old Storys Creek Road entries. I enter brgy table in the terminal: the script's name ("brgy") followed by the table's name as its argument. A YAD window opens on the right side of my desktop:


Using highlight/middle-click-paste, I copy the variants and their frequencies from the tally output in the terminal to the top entry box in the YAD dialog ("Items to be replaced"). I then write a new replacement text in the middle entry box ("Replace with") as I've done here, or highlight/middle-click-paste a replacement text from the top entry box to the middle one, and enter the field number in the bottom entry box:


When I click on the "Replace" button, all the top-box entries are replaced in the table, and a new, time-stamped file is created which backs up the original table. The frequencies in the top entry box are ignored.


I can add other entries to the top entry box from elsewhere in the tally output, because YAD "form" entries are editable. I can also modify a pasted-in replacement text in the middle entry box before hitting the "Replace" button.

After a replacement, the YAD dialog disappears and reappears, blank and ready for more replacements in the selected field. To do replacements in a different field, I quit brgy, do a tally on the other field so I have copy-able text, then re-enter brgy filename.

This GUI method works well in my tidying-up and the progressive backups (time-stamped pre-replacement files) are good insurance. Tidying up is still a tedious job, but that's unavoidable.

The brgy script:

#!/bin/bash

while true; do
choice=$(yad --geometry=400x600+1450+100 --title="" --align=center \
--button="Quit":1 --button="Replace":0 \
--form \
--field="Items to be replaced:":TXT \
--field="Replace with:" \
--field="In field number..." \
"" "" "")
case $? in
1) exit 0;;
0) cp "$1" "$1".$(date +"%Y-%m-%d_%T") && \
awk -v REPL="$(echo "$choice" | cut -d"|" -f2)" \
-v FLD="$(echo "$choice" | cut -d"|" -f3)" \
'BEGIN {FS=OFS="\t"} FNR==NR {a[$0]; next} $FLD in a {$FLD=REPL} 1' \
<(echo -e "$choice" | cut -d"|" -f1 | cut -f2) "$1" > temp && \
mv temp "$1" && \
continue;;
252) exit 0;;
esac
done
exit 0

If the YAD dialog gets a "0" return from a click of the "Replace" button, the first thing that happens is that cp copies the table to a duplicate with a time-stamp in "YYYY-MM-DD_HH:MM:SS" format as a filename suffix.
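
The backup step can be tried on its own. This is a minimal sketch: the function name "backup_with_stamp" and the demo file are hypothetical, and the only thing taken from brgy is the date format, in which %T is shorthand for %H:%M:%S.

```shell
# Hypothetical helper: copy FILE to FILE.YYYY-MM-DD_HH:MM:SS and print the copy's name
backup_with_stamp() {
    stamp=$(date +"%Y-%m-%d_%T")      # %T is the same as %H:%M:%S
    cp "$1" "$1.$stamp" && echo "$1.$stamp"
}

# Demo on a throwaway file
demo=$(mktemp)
printf 'some\tdata\n' > "$demo"
copy=$(backup_with_stamp "$demo")
ls -1 "$demo"*                        # the original plus its time-stamped copy
rm -f "$demo" "$copy"
```

Because the time-stamp goes down to the second, each replacement made more than a second after the previous one gets its own backup file.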
In the example above, the YAD dialog generates as the "choice" variable the string
2\tOld Storeys Creek Road\n2\tOld Storys Creek Rd|Old Storys Creek Road|1|
where "|" separates the form field contents. This string is cut to feed AWK with the replacement string (field 2) and field number (field 3) as shell variables.
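
The cutting can be checked in the terminal without opening the dialog. The sketch below hard-codes the example string shown above; the variable names "repl" and "fld" are assumptions for illustration, and printf '%s\n' stands in for the script's plain echo.

```shell
# The example "choice" string, with \t and \n as literal two-character sequences
choice='2\tOld Storeys Creek Road\n2\tOld Storys Creek Rd|Old Storys Creek Road|1|'

repl=$(printf '%s\n' "$choice" | cut -d'|' -f2)   # second "|" section: replacement text
fld=$(printf '%s\n' "$choice" | cut -d'|' -f3)    # third "|" section: field number

echo "Replace with: $repl"
echo "Field number: $fld"
```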
AWK is told in a BEGIN statement that input and output lines contain tab-separated fields, because the files I work with are always TSVs.
The variants are extracted from "choice" by passing it through echo -e, so that "\n" is read as a newline and "\t" as a tab, and cutting out the first "|"-separated section. The tab-separated frequencies in field 1 are then dropped (with cut -f2). The resulting list is passed to AWK through process substitution, and AWK builds an array "a" whose index strings are the variants.
AWK next moves to the file being processed and, whenever the whole content of the selected field matches one of the array strings, replaces it ($FLD in a {$FLD=REPL}). AWK also prints every line in the file, changed or not (with "1").
The output is written to a new file, "temp", which mv then renames to the name of the file being processed, overwriting it.
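
The variant extraction and the AWK replacement can be tried together on a throwaway table. This is a sketch under stated assumptions: the function names "variants" and "replace_variants" and the demo table are made up, printf '%b' is used in place of echo -e to expand the escapes, and the result is printed to the terminal instead of being written to "temp".

```shell
# Expand \t and \n, keep the first "|" section, drop the frequency column
variants() { printf '%b\n' "$1" | cut -d'|' -f1 | cut -f2; }

# Hypothetical helper: replace_variants CHOICE FILE prints FILE with the variants replaced
replace_variants() {
    variants "$1" |
    awk -v REPL="$(printf '%s\n' "$1" | cut -d'|' -f2)" \
        -v FLD="$(printf '%s\n' "$1" | cut -d'|' -f3)" \
        'BEGIN {FS=OFS="\t"}
         FNR==NR {a[$0]; next}   # first input, the variants: fill the array
         $FLD in a {$FLD=REPL}   # table: replace a field that matches a variant
         1' - "$2"
}

# Made-up demo table (tab-separated, with a header line)
table=$(mktemp)
printf 'road\tcode\n'                        >  "$table"
printf 'Old Storeys Creek Road\tA\n'         >> "$table"
printf 'Old Storys Creek Rd\tB\n'            >> "$table"
printf 'Another Road\tOld Storys Creek Rd\n' >> "$table"

choice='2\tOld Storeys Creek Road\n2\tOld Storys Creek Rd|Old Storys Creek Road|1|'
replace_variants "$choice" "$table"
rm -f "$table"
```

Here the variants reach AWK on standard input (the "-" argument) rather than through process substitution as in brgy; the FNR==NR logic is the same either way. The last demo line has a variant in field 2, which is left alone because only field 1 was selected.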
The whole process is repeated from the first appearance of the YAD dialog by putting it in a while true loop with a continue statement as the last command in the replacement command sequence.
The odd-looking "252" for exiting in the case statement is the exit status YAD returns when "The dialog has been closed by pressing Esc or used the window functions to close the dialog" (from the YAD man page).
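
The control flow of the loop can be followed without a desktop session. In this sketch "fake_dialog" is a hypothetical stand-in for the yad call that simply returns a status it is handed, and "run_loop" walks through a preset series of button presses, counting how many times the replacement branch would have run.

```shell
# Stand-in for the yad call (an assumption): return the exit status given as argument
fake_dialog() { return "$1"; }

# run_loop STATUS...: simulate one dialog appearance per status
run_loop() {
    replacements=0
    while true; do
        [ $# -gt 0 ] || break          # no more simulated button presses
        fake_dialog "$1"; status=$?
        shift
        case $status in
            0) replacements=$((replacements + 1)); continue;;   # "Replace": go round again
            1|252) break;;                                      # "Quit", Esc or window close
        esac
    done
    echo "$replacements"
}

# Two clicks on "Replace", then the window is closed
run_loop 0 0 252
```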
And yes, "brgy" is a near-acronym for "bulk replacement GUI with YAD"!

Last update: 2020-01-13
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License