How to watermark a UTF-8 plain text file

Watermarking plain text isn't easy. Plain text files don't have headers (or magic numbers), and although you can insert invisible control characters, those characters may get revealed by text editors and word processors.

For example, suppose I put the ASCII control character "CAN" ("cancel") in a text string. It's invisible in a terminal:

wmark1

but might appear in one form or another in text applications:

wmark2

Top to bottom: Geany text editor, FeatherPad text editor, LibreOffice Writer

What's more, the file command will report that a text file with an ASCII control character is "data", not "text":

wmark3
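The CAN behaviour can be reproduced roughly as follows. This is a minimal sketch, not the commands behind the screenshots: the file name can.txt and the sample string are made up for illustration, and cat -v stands in here for an application that reveals control characters.

```shell
# Work in a throwaway directory (can.txt is a hypothetical file name).
tmp=$(mktemp -d)

# \030 is the octal escape for CAN (U+0018); printf understands it directly.
printf 'water\030mark\n' > "$tmp/can.txt"

# Plain cat sends the raw byte to the terminal, which normally draws nothing for it.
cat "$tmp/can.txt"

# cat -v makes non-printing characters visible; CAN is shown in caret notation.
shown=$(cat -v "$tmp/can.txt")
printf '%s\n' "$shown"

# file inspects the content; skipped quietly if file is not installed.
command -v file >/dev/null && file "$tmp/can.txt"
```

In caret notation CAN appears as ^X, so the second command prints water^Xmark.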

One solution is to do the watermarking with the Unicode left-to-right mark, U+200E (called LTRM in this post; Unicode's own abbreviation is LRM). It's invisible in a terminal and doesn't upset the file command:

wmark4
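To check that the mark is really present even though nothing shows on screen, the bytes can be dumped. The sketch below assumes a hypothetical file ltrm.txt and builds the LTRM from its UTF-8 encoding (the three bytes E2 80 8E, written as octal escapes) so that it does not depend on the shell's locale.

```shell
tmp=$(mktemp -d)

# U+200E in UTF-8 is E2 80 8E, which is 342 200 216 in octal.
ltrm=$(printf '\342\200\216')

# One line of text with the invisible mark in front (ltrm.txt is a made-up name).
printf '%sThere is a street somewhere\n' "$ltrm" > "$tmp/ltrm.txt"

# Dump only the first three bytes as hex and squeeze out the spacing.
first3=$(od -An -tx1 -N3 "$tmp/ltrm.txt" | tr -d ' \n')
printf '%s\n' "$first3"

# file should still treat this as text; skipped quietly if file is not installed.
command -v file >/dev/null && file "$tmp/ltrm.txt"
```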

and it's also invisible in text applications:

wmark5

I picked this character because its appearance at the beginning of a line isn't deeply suspicious. The text file might contain right-to-left strings in future, so the LTRM plausibly just marks the start of text strings that need to be read from left to right. (Right-to-left text added later would begin with a right-to-left mark, Unicode U+200F.)

A simple way to use the LTRM is to place it at the beginnings of a chosen pattern of lines. The following text file ("fileD") is the opening stanza of the poem The Dream of Houses by the Assyrian-Iraqi poet Sargon Boulus (1944-2007). (Translation by Sinan Antoon.)

There is a street somewhere
lined with houses
Washed by the whiteness of memory
one ceiling after another
I move about inside them
Storming like a night
Fashioning stairs out of my words
Voices too faint to be heard by anyone

I can add an LTRM at the beginning of lines 1, 3, 4 and 6 with sed to build the invisibly watermarked "fileE":

sed $'1s/^/\u200e/;3,4s/^/\u200e/;6s/^/\u200e/' fileD > fileE

wmark6
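A quick way to confirm that the watermarked file differs from the original is to compare sizes. The sketch below recreates fileD in a temporary directory and applies the same sed edits, with one assumption that differs from the command above: the LTRM is built from its UTF-8 bytes with printf rather than with $'\u200e', since the \u escape needs BASH 4.2 or later and a UTF-8 locale.

```shell
tmp=$(mktemp -d)

# Recreate fileD: the eight-line stanza quoted above.
cat > "$tmp/fileD" <<'EOF'
There is a street somewhere
lined with houses
Washed by the whiteness of memory
one ceiling after another
I move about inside them
Storming like a night
Fashioning stairs out of my words
Voices too faint to be heard by anyone
EOF

# U+200E as raw UTF-8 bytes (E2 80 8E), independent of the locale.
ltrm=$(printf '\342\200\216')

# Same line selection as in the post: lines 1, 3, 4 and 6.
sed "1s/^/$ltrm/;3,4s/^/$ltrm/;6s/^/$ltrm/" "$tmp/fileD" > "$tmp/fileE"

# Four marks of three bytes each should add 12 bytes.
sizeD=$(wc -c < "$tmp/fileD")
sizeE=$(wc -c < "$tmp/fileE")
echo "added bytes: $((sizeE - sizeD))"
```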

Finally, a bit of code to reveal my watermark, "1-3-4-6":

awk $'/^\u200e/ {print NR}' fileE | paste -s -d"-"

wmark7
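The reveal step can also be wrapped in a small function. This is only a sketch under a few assumptions: show_watermark is a hypothetical helper name, the eight-line test file is a stand-in for fileD, the LTRM is passed to AWK as a variable instead of being written into the pattern, and paste is given an explicit "-" operand because some implementations require one with -s.

```shell
# U+200E as raw UTF-8 bytes (E2 80 8E).
ltrm=$(printf '\342\200\216')

# Print the numbers of the lines that start with the mark, joined by hyphens.
# show_watermark is a hypothetical helper, not part of the original post.
show_watermark() {
    awk -v mark="$ltrm" 'index($0, mark) == 1 { print NR }' "$1" | paste -s -d- -
}

tmp=$(mktemp -d)

# Stand-in for fileD: eight short lines, "line 1" to "line 8".
printf 'line %s\n' 1 2 3 4 5 6 7 8 > "$tmp/plain"

# Watermark lines 1, 3, 4 and 6 as in the post.
sed "1s/^/$ltrm/;3,4s/^/$ltrm/;6s/^/$ltrm/" "$tmp/plain" > "$tmp/marked"

found=$(show_watermark "$tmp/marked")
printf '%s\n' "$found"
```

Run on the marked file this prints 1-3-4-6; run on the unmarked file it prints an empty line.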

I'm sure there are better ways to do the watermarking job, but this one is fairly inconspicuous. LTRMs look innocent at the beginning of lines, especially line 1. Unfortunately, it's quite easy to remove LTRMs and with them the watermarking signal, leaving the text apparently unchanged!
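To illustrate how fragile the signal is, a single sed substitution is enough to strip it. The sketch below uses an eight-line stand-in for fileD and hypothetical file names (plain, marked, stripped); the LTRM is again built from its UTF-8 bytes.

```shell
# U+200E as raw UTF-8 bytes (E2 80 8E).
ltrm=$(printf '\342\200\216')

tmp=$(mktemp -d)

# Stand-in for fileD, then the watermarked copy.
printf 'line %s\n' 1 2 3 4 5 6 7 8 > "$tmp/plain"
sed "1s/^/$ltrm/;3,4s/^/$ltrm/;6s/^/$ltrm/" "$tmp/plain" > "$tmp/marked"

# Delete every LTRM, wherever it occurs.
sed "s/$ltrm//g" "$tmp/marked" > "$tmp/stripped"

# cmp -s is silent and succeeds only if the two files are byte-identical.
if cmp -s "$tmp/plain" "$tmp/stripped"; then
    stripped_ok=yes
else
    stripped_ok=no
fi
echo "watermark removed, text unchanged: $stripped_ok"
```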

Note the "$" in front of the single-quoted sed and AWK scripts. This ANSI-C quoting lets BASH expand the Unicode escape "\unnnn" into the actual character before sed and AWK see it. For more, see this BASHing data post.
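The difference between the two quoting styles can be seen by comparing what the shell actually stores. A minimal sketch, assuming BASH 4.2 or later and a UTF-8 locale for the expanded form; the variable names plain and ansi are made up for illustration.

```shell
# Single quotes: the shell passes the six characters \ u 2 0 0 e through untouched.
plain='\u200e'

# ANSI-C quoting: BASH expands the escape itself before any command runs.
ansi=$'\u200e'

# The single-quoted string is still six literal characters long.
echo "single-quoted length: ${#plain}"

# Byte dump of the expanded value; e2 80 8e is expected in BASH 4.2+ under a UTF-8 locale.
printf '%s' "$ansi" | od -An -tx1
```

With plain single quotes, sed would receive the literal text \u200e and would have to interpret it itself, which not every sed or AWK version does.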


Last update: 2021-11-24
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License