
For a full list of BASHing data blog posts see the index page.


A dog-cat-horse-turtle problem

Sometimes the text-processing problems posted on Stack Exchange have so many solutions, it's hard to decide which is best.

A problem like that was posted in the "Unix & Linux" section in December 2021:

I have this file: 'dog', 'cat', 'horse', 'turtle'
 
I want to convert the line to:
 
dog
cat
horse
turtle

As of Christmas Eve 2021, there were suggested solutions based on AWK, grep, sed, csvformat (from the csvkit package) and Python. Below are some goodies.
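To try the commands that follow, the one-line input file can be recreated as in this small sketch. The file name dcht is taken from the commands in this post; the exact contents (including the trailing newline) are an assumption based on the question as quoted above.

```shell
# Recreate the one-line input file; the outer double quotes protect the apostrophes.
printf '%s\n' "'dog', 'cat', 'horse', 'turtle'" > dcht

# Show the file to confirm its contents.
cat dcht
```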


grep. It doesn't get much simpler than asking grep to find strings of letters and, with the -o option, to "Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line" (quoted from the man page).

grep -o "[[:alpha:]]*" dcht



AWK. This elegant solution from Stack Exchange contributor "rowboat" needs some explaining. With the input record separator RS set to an apostrophe, the first record is the empty string before the leading apostrophe, the second is "dog", the third ", ", the fourth "cat", the fifth ", " and so on. AWK checks each record number and selects the records whose number leaves no remainder after division by 2 (!(NR%2)). With no action given, AWK prints those even-numbered records, each followed by the default output record separator, which is a newline.

awk '!(NR%2)' RS=\' dcht
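To see the record splitting for yourself, each record can be printed together with its number. This is a minimal sketch, assuming the input file dcht holds the single line from the question (the printf line recreates it). The square brackets are only there to make empty and whitespace-only records visible.

```shell
# Recreate the input file so the example is self-contained.
printf '%s\n' "'dog', 'cat', 'horse', 'turtle'" > dcht

# Print every record with its record number, using an apostrophe as RS.
awk '{printf "%d: [%s]\n", NR, $0}' RS=\' dcht
```

Records 2, 4, 6 and 8 hold the four words. The text after the final apostrophe is only the line's trailing newline, which becomes an odd-numbered ninth record and so is not selected by !(NR%2).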


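sed. Here I've modified the solution offered by "schrodigerscatcuriosity". The first substitution converts all the word separators (', ') to newlines, and the second deletes the remaining leading and trailing apostrophes. Note the double quotes (") around the sed command, which allow me to have apostrophes in the regular expressions. The \n in the replacement is understood as a newline by GNU sed; other sed implementations may treat it differently.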

sed "s/', '/\n/g;s/'//g" dcht


Other shell tools. The OP first tried tr, but couldn't get it to work. tr could be used to replace any punctuation or spaces with newlines, then to squeeze the newlines to single occurrences, like this:

tr -s "[:punct:][:space:]" "\n" < dcht


The output begins with a blank line, because the first apostrophe is also replaced by a newline; the blank line could be removed by piping the output to AWK with the pattern NF.
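The AWK clean-up can be sketched as follows, again assuming the input file dcht from the question (recreated by the printf line). The pattern NF is true only for lines with at least one field, so the empty first line is dropped.

```shell
# Recreate the input file so the example is self-contained.
printf '%s\n' "'dog', 'cat', 'horse', 'turtle'" > dcht

# Turn punctuation and whitespace into newlines, squeeze repeated newlines,
# then let AWK keep only the non-empty lines.
tr -s "[:punct:][:space:]" "\n" < dcht | awk NF
```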

Below I first delete the commas with tr, then pass the result to xargs, which with -n1 prints each argument on a new line. xargs treats the apostrophes as quoting characters and strips them.

tr -d "," < dcht | xargs -n1
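The apostrophes disappear because xargs reads them as quoting characters, which also means a quoted item containing a space should stay together as one argument. Here is a small sketch with a made-up input line (the 'guinea pig' item is hypothetical and not part of the original question):

```shell
# A hypothetical input line with a two-word item inside the apostrophes.
printf '%s\n' "'guinea pig', 'cat'" |
    tr -d "," |   # delete the commas
    xargs -n1     # one argument per echo; the quotes keep "guinea pig" together
```

This relies on the items being cleanly quoted; an item with an unmatched apostrophe in it would make xargs complain.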


Other suggestions for short, simple solutions are welcome!


Last update: 2022-01-19
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License