Windows carriage returns

On Linux and Mac machines, a newline is built with just one character, the UNIX linefeed '\n' ('LF'). On Windows computers, a newline is created using two characters, one after the other: '\r\n' ('CRLF'), where '\r' is called a 'carriage return' ('CR'). Carriage returns aren't necessary in a data table and can cause problems in data cleaning. (For examples of problems, see this BASHing data post.)
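
To see the difference at the byte level, here is a minimal sketch. The file names unix_sample and win_sample are made up for illustration, and the tab-separated content is an assumption.

```shell
# Build two one-line samples: one with a UNIX ending, one with a Windows ending
printf 'aaa\tbbb\n' > unix_sample
printf 'aaa\tbbb\r\n' > win_sample

# od -c prints each byte; the Windows sample shows \r just before \n
od -c unix_sample
od -c win_sample

# The Windows sample is one byte longer because of the extra CR
wc -c < unix_sample
wc -c < win_sample
```

The only difference between the two files is the single '\r' byte before the linefeed.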

There are several ways to find CR characters. You can use sed -n 'l' to visualise any '\r' in a table, and grep to select out the lines with a CR and print their line numbers. Alternatively, a CR character will be shown as '^M' if you use cat -v, where the '-v' option shows non-printing characters other than tabs and linefeeds. In the example below, the file winCR has an invisible Windows carriage return at the end of the first line:

$ cat winCR
aaa   bbb
ccc   ddd
eee   fff
$ sed -n 'l' winCR
$ sed -n 'l' winCR | grep -n "\\r"
$ cat -v winCR | grep -n "\^M"
1:aaa   bbb^M

It's wise to run these commands with grep's '-c' option first rather than '-n'. The '-c' option returns only the number of lines with a CR, and if that number is big, you avoid having a large number of lines printed at high speed in your terminal. If your grep supports Perl-compatible regular expressions (the '-P' option), you can search for '\r' directly and count the lines that contain it.

$ sed -n 'l' winCR | grep -c "\\r"
$ cat -v winCR | grep -c "\^M"
$ grep -cP "\r" winCR
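
If your grep lacks '-P', one possible workaround is to pass grep a literal CR held in a shell variable. The sketch below also counts the CR characters themselves with tr, since grep's '-c' counts lines, not characters. The file name winCR_sample and the variable name cr are made up for illustration.

```shell
# Hypothetical sample: line 1 has two CRs (one mid-line, one at the end)
printf 'aaa\tb\rbb\r\nccc\tddd\n' > winCR_sample

# Store a literal CR; command substitution strips only trailing newlines
cr=$(printf '\r')

# Count the lines that contain at least one CR
grep -c "$cr" winCR_sample

# Count the CR characters themselves: delete everything except CRs, then count bytes
tr -cd '\r' < winCR_sample | wc -c
```

Here the line count is 1 and the character count is 2, because both CRs are in the first line.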

Another command to find carriage returns is file, which will report on line endings if they're different from a single linefeed, but won't count them:

$ file example.csv
example.csv:      ASCII text, with CRLF line terminators

The easiest way to remove all Windows carriage returns from a table is with tr:

$ tr -d '\r' < table > table_without_CR
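
A quick way to confirm that tr did its job is to count CR-containing lines before and after. This is only a sketch; table_sample and table_sample_without_CR are hypothetical file names.

```shell
# Hypothetical two-line table with Windows line endings
printf 'aaa\tbbb\r\nccc\tddd\r\n' > table_sample

# Delete every CR, writing the result to a new file
tr -d '\r' < table_sample > table_sample_without_CR

# Literal CR for grep (works in plain sh as well as bash)
cr=$(printf '\r')

# Lines with a CR before the fix
grep -c "$cr" table_sample

# Lines with a CR after the fix; grep exits non-zero when nothing matches
grep -c "$cr" table_sample_without_CR || true
```

The first count is 2 and the second is 0.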

Deleting all the carriage returns could be a mistake, however, if any of them are within data items. The screenshot below shows a real-world example. In the file afd1, I used sed to replace each of the 2 carriage returns in line 67893 with a single space. Note that this was an 'in-place' edit with sed's '-i' option.

[Screenshot: CR fix]
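
This kind of line-specific fix can be sketched as follows. It is not necessarily the exact command used on afd1. It assumes GNU sed (for '-i' without a suffix and for '\r' in the pattern), and the file afd_sample is a small made-up stand-in.

```shell
# Hypothetical 3-line table; line 2 has two CRs inside a data item
printf 'id\tnote\n1\tfirst\rsecond\rthird\n2\tplain\n' > afd_sample

# Replace every CR on line 2 only with a single space, editing in place (GNU sed)
sed -i '2s/\r/ /g' afd_sample

# Check the result: any remaining CR would show up as \r
sed -n 'l' afd_sample
```

Restricting the substitution to one line number leaves the rest of the file untouched, which matters when other CRs in the table need different handling.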