Getting to TSV

Tab-separated format (TSV) is the default for Darwin Core tables built with GBIF's Integrated Publishing Toolkit (IPT). The usual filename suffix for TSV tables is .txt, although on the command line it's easier to use short, suffix-free filenames while checking, like ev for a working copy of event.txt.

Most of the functions and scripts on this website assume that the data table is a TSV. These will not work on comma-separated tables (CSV). For more on why TSV is better than CSV, see the Cookbook.


CSV to TSV

The most reliable way I know to convert a CSV to a TSV on the command line is with csvformat from the csvkit group of utilities:

csvformat -T table.csv > table.tsv

csvformat will also build a TSV from a CSV file that uses semicolons or pipes as field separators. Use the -d option to specify the separator:

csvformat -d ";" -T semicolontable.csv > semicolontable.tsv
csvformat -d "|" -T pipetable.csv > pipetable.tsv

You may get an error message with csvformat if the CSV is not UTF-8, so convert to UTF-8 first.


Spreadsheet to TSV

If you copy a rectangular array of cells in a spreadsheet and paste from the clipboard into a text editor, the data items in the cells will automatically be tab-separated. This method saves you having to choose between spreadsheet export options for field separator, and will avoid the quoting of data items.

To select a rectangular array of cells (in Microsoft Excel, LibreOffice Calc or Gnumeric), first make the top left cell active. If that cell is "A1", do this with Ctrl + Home. To complete the selection, use Ctrl + Shift + End.

If you copy and paste cells from a Windows program like Microsoft Excel, the encoding of the TSV may not be UTF-8 (it depends on the settings in your text editor), so check the encoding after copy-pasting.


Watch for tabs

CSV-to-TSV and spreadsheet-to-TSV will give you too many tab-separated fields if there are tabs already in the data table. While this problem can be detected with a structure check, you can also grep for tabs in a CSV. grep -cP "\t" will count pre-existing tabs and grep -nP "\t" will print lines with tabs and their line numbers. Tab positions can be visualised by replacing tabs with a marker, such as "{HERE}":

finding tabs

If your grep is not PCRE-enabled, you can use grep [-c/n] $'\t' for this check.


Kill the quotes

If a TSV has been built from a CSV with a text editor, it may preserve the CSV's quotes around data items and internal quotes, like this:

CSV: "Tendulkar,Sachin Ramesh",1973-04-24,"nickname ""Little Master"""
TSV: "Tendulkar,Sachin Ramesh"    1973-04-24    "nickname ""Little Master"""

These quotes are data noise. If there aren't too many they can be found with a tally and deleted in a text editor. With care they can be removed with sed replacements:

sed -E 's/^"(.*)"$/\1/;s/"+/"/g;s/\t"/\t/g;s/"\t/\t/g' file

deleting quotes