I have always had a problem with the concept of intellectual property. The great Western tradition of post-Enlightenment values has always placed the free flow of art and ideas on a pedestal, as a sacrosanct cornerstone of a just society. That the ideas living in our heads and flowing from our lips are the domain of no king, pope, or policeman is one of the most important cultural norms to have emerged from the Enlightenment into modern liberal democracies. The legal constructs associated with intellectual property, in my evaluation, cannot be reconciled with this. A corpuscle of information cannot be at once free to be spoken or expressed and also be the property of some individual or corporation. Information theory, the fantastic field pioneered by Claude Shannon, only swells my distaste for intellectual property. We know now that with simple coding, all information is reducible to a common binary form. Film, print, music, photography: all are merely collections of ordered bits. This makes the idea of owning information all the more ridiculous, as the process can just as easily be reversed: a song can be represented by a string of Shakespeare quotations, and a movie can be rendered as a musical score. As an illustration of this, I’ve written a short program that takes any file and converts it to a long, rambling nonsense-poem. Poetry as Piracy.
Making the Wordlists
The first step is building a set of words for our poems, categorized by grammatical type. To do this, I downloaded the English Wiktionary. I then used grep, sed, and awk to split it into plain lists of words: nouns, past-tense verbs, present-participle verbs, and adjectives. I then shuffled these lists and trimmed them down so that the length of each was a power of 2. I didn’t need to do this, but it simplified the work slightly, because a list of 2^n words stores exactly n bits in each word choice. In the end, I was left with 17 bits of information stored in each noun (131,072 words), 13 bits in each past-tense verb (8,192), 13 bits in each present-participle verb (8,192), and 15 bits in each adjective (32,768).
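The shuffle-and-trim step might look something like this in Python. This is only a sketch: the original work was done with grep, sed, and awk, and the function name `trim_to_power_of_two`, the de-duplication, and the fixed shuffle seed are all assumptions made for illustration.

```python
import random


def trim_to_power_of_two(words, seed=0):
    """Shuffle a word list and keep the largest power-of-two prefix.

    Returns (trimmed_words, bits) where len(trimmed_words) == 2 ** bits.
    The seed is a hypothetical parameter: the encoder and decoder only
    need to agree on the final order of each list.
    """
    words = sorted(set(words))  # de-duplicate so each index names one word
    if not words:
        raise ValueError("word list is empty")
    random.Random(seed).shuffle(words)
    bits = len(words).bit_length() - 1  # floor(log2(len(words)))
    return words[: 1 << bits], bits
```

A list of 131,072 or more nouns (but fewer than 262,144) would come back with exactly 131,072 entries and `bits == 17`.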
Sentence Skeletons
I then decided on two rough sentence skeletons:
The ADJECTIVE NOUN PAST-VERBED the ADJECTIVE NOUN.
ADJECTIVE NOUN is PRESENT-VERBING the ADJECTIVE NOUN.
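Each of those sentences can store 77 bits of information: two adjectives (2 × 15), two nouns (2 × 17), and one verb (13). A 1 Mb file, for example, will require roughly 13,000 sentences if that means one megabit (1,000,000 / 77 ≈ 12,987), or about a novel's worth of words; a full megabyte needs eight times as many, over 100,000 sentences. If that 1 Mb file were a copyrighted song, you would not in fact have the freedom to print and distribute your nice new novel (not that you would want to; it would be random nonsense).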
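The capacity arithmetic can be checked in a few lines of Python. The bit widths come from the word lists described earlier; the dictionary keys and function names are made up for this sketch, and reading "1 Mb" as one megabit (1,000,000 bits) is an assumption.

```python
import math

# Bits carried by one word of each type, as given in the text.
BITS = {"noun": 17, "past_verb": 13, "present_verb": 13, "adjective": 15}


def sentence_bits(verb_key):
    """Bits stored by one skeleton: two adjectives, two nouns, one verb."""
    return 2 * BITS["adjective"] + 2 * BITS["noun"] + BITS[verb_key]


def sentences_needed(total_bits, bits_per_sentence):
    """Number of sentences required to hold total_bits, rounded up."""
    return math.ceil(total_bits / bits_per_sentence)


per_sentence = sentence_bits("past_verb")            # 2*15 + 2*17 + 13
megabit = sentences_needed(1_000_000, per_sentence)  # 1 Mb read as a megabit
megabyte = sentences_needed(8_000_000, per_sentence) # 1 MB read as 10**6 bytes
```

Both skeletons give the same total because the two verb lists are the same size.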
Encoding the File
Now, 77 bits is a bit awkward, since it isn't a whole number of bytes. But just choosing between the two sentence types gives me 1 more bit of information. I also get punctuation at the end. If I end each sentence with either a period, an exclamation mark, two exclamation marks, or three exclamation marks, that gets me an extra 2 bits. This brings me up to 80 bits per sentence, or 10 bytes. I can now easily encode my data as nonsense poetry! I use the first bit to select the sentence type (and with it the verb form), the next two bits to decide between a period and the exclamation series, and the remaining 77 bits to pick the words of the sentence itself. If my file's length in bytes isn’t a multiple of 10, I simply add an additional line at the end:
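Here is one way the 10-byte-to-sentence step could be written in Python. It is a sketch under several assumptions that the text does not specify: big-endian byte order, the order of the word fields within the 77 bits, which bit value selects which skeleton, which 2-bit value selects which punctuation, and all of the names (`FIELDS`, `split_chunk`, `render_sentence`, the `words` dictionary keys).

```python
PUNCTUATION = [".", "!", "!!", "!!!"]  # assumed mapping for values 0..3

# (field name, width in bits), most significant first; widths sum to 80.
FIELDS = [
    ("skeleton", 1), ("punct", 2),
    ("adj1", 15), ("noun1", 17), ("verb", 13), ("adj2", 15), ("noun2", 17),
]


def split_chunk(chunk):
    """Split a 10-byte chunk into the seven bit fields of one sentence."""
    if len(chunk) != 10:
        raise ValueError("expected exactly 10 bytes")
    value = int.from_bytes(chunk, "big")
    remaining = 80
    fields = {}
    for name, width in FIELDS:
        remaining -= width
        fields[name] = (value >> remaining) & ((1 << width) - 1)
    return fields


def render_sentence(fields, words):
    """Turn bit fields into a sentence; `words` maps a category to its list."""
    adj, noun = words["adjective"], words["noun"]
    subject = f"{adj[fields['adj1']]} {noun[fields['noun1']]}"
    obj = f"the {adj[fields['adj2']]} {noun[fields['noun2']]}"
    if fields["skeleton"] == 0:  # assumed: 0 selects the past-tense skeleton
        body = f"The {subject} {words['past_verb'][fields['verb']]} {obj}"
    else:
        body = f"{subject} is {words['present_verb'][fields['verb']]} {obj}"
    return body + PUNCTUATION[fields["punct"]]
```

Encoding a whole file would then be a loop over successive 10-byte slices, with any shorter final slice handled separately.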
All that remains are NUM memories and NUM regrets.
Here the first NUM is the leftover bytes read as a single base-10 number, and the second NUM is the count of leftover bytes. The count is needed because any leading zero bytes would otherwise be lost in the conversion to decimal.
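The final line and its inverse can be sketched in Python as follows. The function names are hypothetical, and big-endian byte order is an assumption; the point of the example is that the byte count restores leading zeros that the decimal number alone would drop.

```python
def trailer_line(leftover):
    """Build the closing line for the bytes that did not fill a sentence."""
    value = int.from_bytes(leftover, "big")  # assumed big-endian
    return f"All that remains are {value} memories and {len(leftover)} regrets."


def parse_trailer(line):
    """Recover the leftover bytes, including any leading zero bytes."""
    tokens = line.split()
    value, count = int(tokens[4]), int(tokens[7])
    return value.to_bytes(count, "big")


line = trailer_line(b"\x00\x00\x07")
restored = parse_trailer(line)
```

Without the second number, the three bytes `00 00 07` and the single byte `07` would both be written as "7 memories" and could not be told apart.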
Decoding the File
Decoding the file is as simple as reading in each line, checking which sentence type it is and what punctuation ends it, looking up each word's position in its list, and packing the results back into the original binary form!
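A sketch of decoding a single line in Python might look like the following. It assumes a particular bit layout (sentence type, then punctuation, then adjective, noun, verb, adjective, noun, packed big-endian), that every word is a single token with no spaces or trailing punctuation of its own, and that no adjective is literally "The". The `lookup` argument is a hypothetical dictionary mapping each category to a word-to-index dictionary.

```python
PUNCTUATION = [".", "!", "!!", "!!!"]  # assumed mapping for values 0..3


def decode_line(line, lookup):
    """Turn one sentence back into the 10 bytes it encodes."""
    body = line.rstrip(".!")
    punct = PUNCTUATION.index(line[len(body):])
    tokens = body.split()
    if len(tokens) != 7:
        raise ValueError("not a recognised sentence")
    if tokens[0] == "The":  # "The ADJ NOUN VERBED the ADJ NOUN"
        skeleton = 0
        _, adj1, noun1, verb, _, adj2, noun2 = tokens
        verb_index = lookup["past_verb"][verb]
    else:                   # "ADJ NOUN is VERBING the ADJ NOUN"
        skeleton = 1
        adj1, noun1, _, verb, _, adj2, noun2 = tokens
        verb_index = lookup["present_verb"][verb]
    parts = [
        (punct, 2),
        (lookup["adjective"][adj1], 15), (lookup["noun"][noun1], 17),
        (verb_index, 13),
        (lookup["adjective"][adj2], 15), (lookup["noun"][noun2], 17),
    ]
    value = skeleton  # the sentence-type bit is the most significant bit
    for index, width in parts:
        value = (value << width) | index
    return value.to_bytes(10, "big")
```

Building each word-to-index dictionary once, for example with `{word: i for i, word in enumerate(nouns)}`, keeps every lookup constant-time even for the 131,072-entry noun list.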