banner

For a full list of BASHing data blog posts see the index page.  RSS


TSV to CSV on the CLI (if you really have to)

Regular visitors to this blog will know that I don't like the CSV format. It's awful. In my humble opinion, data workers should aim to use invisible tabs (TSV) or visible pipes (PSV) as field separators in delimited text tables. Sometimes, though, data workers are required to convert a perfectly good TSV or PSV to a CSV. What to do?

I don't recommend opening the TSV or PSV in spreadsheet software and saving the result as a CSV, unless there are no leading or trailing quotes in the data items, or umatched quotes generally. The original quotes might well disappear in the saved CSV.

There are a number of TSV-to-CSV programs for the command line. One is in Haskell, for example, and there also routines to do the job in Perl and Python. But if the individual fields in the TSV don't contain commas or quotes, the TSV-to-CSV conversion is simple — use tr:

tsv2csv1

AWK can also do the job. Set the input field separator to a tab, the output field separator to a comma, rebuild each record with (for example) $1=$1, and print all lines (1):

tsv2csv2

With TSVs containing commas or quotes, the conversion is no longer simple. Here are the rules from the more-or-less-widely-accepted specification for the CSV format, RFC4180:

AWK is again our friend, since it can modify individual fields. In pseudo-code, a TSV-to-CSV command would look like this, where AWK loops through the fields in each line one by one:

awk -v FS="\t" -v FS="," '{for (i=1;i<=NF;i++) \
if (conditions in field "i") {modify field "i"} \
1' file.tsv > file.csv

For that non-problematic example, the code works like this:

tsv2csv3

Following the RFC4180 rules, the code could be:

awk -v FS="\t" -v OFS="," \
'{for (i=1;i<=NF;i++) \
{x=gensub(/"/,"\"\"","g",$i); \
if (x ~ /"/ || x ~ /,/) $i="\""x"\""; \
else $i=$i}} 1' file.tsv > file.csv

The first thing AWK does with each field is precede any quote in the field with an escape quote, storing the result in the variable "x". If there are no quotes in the field, gensub returns the field with no changes:
x=gensub(/"/,"\"\"","g",$i)
 
Next, AWK checks to see whether "x" contains a quote or a comma. If it does, the field is bracketed with quotes:
if (x ~ /"/ || x ~ /,/) $i="\""x"\""
 
If "x" doesn't contain a quote or a comma, the field is left unchanged:
else $i=$i

To test this command I'll use the TSV "test", which has 38 lines of the form

aaa[TAB][field 2][TAB]ccc

Field 2 contains the following data items, here laid out as a PSV:

bbbb|bb,bb|bbbb,|,bbbb|bb,bb,|
,bb,bb|b"bb"b|b"b,b"b|b"bb"b,|
,b"bb"b|b"b,b"b,|,b"b,b"b|"b"bb"b"|"b"b,b"b"|
"b"bb"b,"|",b"bb"b"|"b"b,b"b,"|
",b"b,b"b"|"bbbb"|"bb"bb|,"bbbb"|
,"bb"bb|b""bb""b|b""b,b""b|b""bb""b,|
,b""bb""b|b""b,b""b,|,b""b,b""b|
""b""bb""b""|""b""b,b""b""|""b""bb""b,""|
"",b""bb""b""|""b""b,b""b,""|"",b""b,b""b""|
""bbbb""|""bb""bb|,""bbbb""|,""bb""bb

And here's the result of the conversion of "test" to a CSV, arranged so you can compare the original and the converted lines:

tsv2csv4

To make that clearer I'll colour the escape quotes red and the bracketing quotes blue:

tsv2csv5

I can save the AWK command in the shell function "tsv2csv" for ease of use:

tsv2scv() { awk -v FS="\t" -v OFS="," '{for (i=1;i<=NF;i++) {x=gensub(/"/,"\"\"","g",$i); if (x ~ /"/ || x ~ /,/) $i="\""x"\""; else $i=$i}} 1' "$1"; }

All bets are off if any of the TSV fields contains an unmatched quote, i.e. if the number of quotes in the field is odd. That needs to be checked for and fixed before using the function. A simple checker for unmatched quotes is:

awk -F"\t" '{for (i=1;i<=NF;i++) \
{x=$i; if (gsub("\"","",$i)%2) \
print "line "NR", field "i": "x}}' file

tsv2csv6

But really, are you absolutely certain that you need to convert that TSV to a CSV?


Last update: 2021-10-13
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License