Windows carriage returns
On Linux and Mac machines, a newline is built with just one character, the UNIX linefeed '\n' ('LF'). On Windows computers, a newline is created using two characters, one after the other: '\r\n' ('CRLF'), where '\r' is called a 'carriage return' ('CR'). Carriage returns aren't necessary in a data table and can cause problems in data cleaning.
There are several ways to find CR characters. You can use sed -n 'l' to visualise any '\r' in a table, and grep to select the lines with a CR and print their line numbers. Alternatively, a CR character will be shown as '^M' if you use cat -v, where the '-v' option shows non-printing characters other than tabs and linefeeds. In the example below, the file winCR has an invisible Windows carriage return at the end of the first line:
$ cat winCR
$ sed -n 'l' winCR
$ sed -n 'l' winCR | grep -n '\\r'
$ cat -v winCR | grep -n "\^M"
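The commands above can be tried on a small test file. The following is a minimal sketch, assuming a hypothetical two-line file called demoCR (not the winCR file used above) in which only the first line ends with CRLF:

```shell
# Build a hypothetical two-line file; only the first line ends in CRLF.
printf 'first line\r\nsecond line\n' > demoCR

# cat -v shows the CR as ^M, and grep -n reports the line number.
found=$(cat -v demoCR | grep -n '\^M')
echo "$found"
# prints 1:first line^M

# od -c lists every byte, so the \r before the first \n is unambiguous.
od -c demoCR
```

Checking with od -c is a useful cross-check when it isn't clear whether a '^M' on screen is a real CR or the two literal characters '^' and 'M'.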
It's wise to run these commands with grep's '-c' option first rather than '-n'. The '-c' option returns only the number of lines with a CR, and if that number is big, you avoid having a large number of lines printed at high speed in your terminal. If your grep supports Perl-type regular expressions, you can search for '\r' directly.
$ sed -n 'l' winCR | grep -c '\\r'
$ cat -v winCR | grep -c "\^M"
$ grep -cP "\r" winCR
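If your grep lacks the '-P' option (as is often the case with BSD-derived greps), one possible workaround is to pass grep a literal CR held in a shell variable. The sketch below assumes a hypothetical sample file, demoCR, and also shows one way to count the CR characters themselves rather than the lines containing them:

```shell
# Hypothetical sample: line 1 has two CRs (one embedded), line 2 has none.
printf 'one\rtwo\r\nthree\n' > demoCR

# Put a literal CR in a variable; command substitution strips only newlines.
cr=$(printf '\r')

# Number of lines containing at least one CR.
lines_with_cr=$(grep -c "$cr" demoCR)

# Total number of CR characters: delete everything except CR, count bytes.
# The final tr removes the padding some versions of wc add.
total_cr=$(tr -cd '\r' < demoCR | wc -c | tr -d ' ')

echo "lines with CR: $lines_with_cr"
echo "CR characters: $total_cr"
```

A difference between the two numbers is a hint that at least one line holds more than one CR, which may mean that some CRs are embedded in data items.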
You can strip away everything except the line numbers from the grep -n result with a cut command, by specifying a colon as field delimiter for cut:
$ sed -n 'l' winCR | grep -n '\\r' | cut -d ':' -f1 > list_of_records_with_CR
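A list of line numbers like this can be used to pull the affected records out of the table for inspection. The following is a sketch only, using hypothetical files demo_table and demo_list. It builds the list with cat -v rather than sed -n 'l', because sed's 'l' command folds long lines (GNU sed wraps at 70 characters by default), so on wide tables the numbers from the sed pipeline may not match record numbers:

```shell
# Hypothetical three-line table; lines 1 and 3 end in CRLF.
printf 'id\tname\r\n1\tfoo\n2\tbar\r\n' > demo_table

# Line numbers of records containing a CR.
cat -v demo_table | grep -n '\^M' | cut -d ':' -f1 > demo_list

# First pass (the list): remember the wanted line numbers.
# Second pass (the table): print only lines whose number was remembered.
# This assumes demo_list is not empty.
matches=$(awk 'NR == FNR { want[$1]; next } FNR in want' demo_list demo_table | cat -v)
printf '%s\n' "$matches"
```

The final cat -v is there so that the CRs in the selected records stay visible as '^M' in the terminal.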
The easiest way to remove all Windows carriage returns from a table is with tr:
$ tr -d '\r' < table > table_without_CR
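It may be worth confirming afterwards that no CRs remain and that the number of lines is unchanged. A minimal sketch, assuming hypothetical file names demo_table and demo_table_without_CR:

```shell
# Hypothetical two-line table with CRLF line endings.
printf 'a\tb\r\nc\td\r\n' > demo_table

# Remove every CR, as in the tr command above.
tr -d '\r' < demo_table > demo_table_without_CR

# Count any CR characters left behind (should be 0).
remaining=$(tr -cd '\r' < demo_table_without_CR | wc -c | tr -d ' ')

# Line counts before and after should be the same.
before=$(wc -l < demo_table | tr -d ' ')
after=$(wc -l < demo_table_without_CR | tr -d ' ')

echo "CRs remaining: $remaining"
echo "lines before: $before, lines after: $after"
```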
Deleting all the carriage returns could be a mistake, however, if any of them are within data items. The screenshot below shows a real-world example. In the file afd1, I used sed to replace each of the 2 carriage returns in line 67893 with a single space. Note that this was an 'in-place' edit with sed's '-i' option.
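The general idea of that kind of edit can be sketched as follows. This is not the command used on afd1. It is an illustration on a hypothetical file, demo_table, with the CRs embedded in line 2, and it writes to a new file rather than editing in place, since the syntax of the '-i' option differs between GNU and BSD sed:

```shell
# Hypothetical table: line 2 has two CRs embedded in a data item.
printf 'id\tnote\n1\tfirst\rsecond\rthird\n2\tplain\n' > demo_table

# Hold a literal CR in a variable so the sed pattern works in any sed.
cr=$(printf '\r')

# On line 2 only, replace every CR with a single space.
sed "2s/${cr}/ /g" demo_table > demo_fixed

# Show the repaired line; no ^M should appear.
fixed_line=$(sed -n '2p' demo_fixed | cat -v)
echo "$fixed_line"
```

Restricting the substitution to a line number keeps the edit from touching CRs elsewhere in the table, which can then be dealt with separately.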