If you've read Mark Danielewski's House of Leaves, then you are no doubt familiar with the madwoman in its attic, Pelafina Lièvre. Her letters to her son Johnny add one more voice to this already polyvocal text, and though her words reside in Appendix IIE, her melancholy and madness encompass a space much broader than the backmatter to which they are relegated. The echoes and redirections of her voice wind through every other page -- so much so that even after reading and re-reading, I still find Pelafina's letters opening new doors between passages once familiar in isolation, now made strange through the lurking affinity of a phrase or two.
Twitter is also a domain rich in polyphonic textuality, and if you've spent much time there at all, you may have encountered an account or two that you knew or suspected to be fake. Some of these are parodies; some are created to boost a paying client's follower count. Others aim for something a little weirder, like the famous @horse_ebooks, which pushes its characteristic blend of link spam and Dadaist art to its hundreds of thousands of readers multiple times a day.
I made a fake Twitter account for Pelafina, @pelafina_lievre, and I'm writing this blog entry to explain a bit of how I did it and why.
I've been interested in Twitter bots for their literary and poetic value, and recently I've been motivated and educated by a Google+ community of other bot enthusiasts. Darius Kazemi (@tinysubversions), the master of bots and other weird Internet stuff, puts all of his code on GitHub, and I've learned a good deal from it. Mark Sample (@samplereality) demonstrates how much mileage one can achieve through a clever combination of sources, as in @WhitmanFML, and the sublime lexicoludics of Adam Parrish's @PowerVocabTweet are simply brilliant.
Now, before working with Pelafina, I'd made a few simple bots of my own. When my classes blog (which is almost always), it's often useful to connect the blogs to our Twitter conversation using an RSS-based auto-tweeter. If This Then That (IFTTT) makes this very simple to set up. In a similar vein, my @digital_cfps is a utility bot that finds and tweets CFP announcements with keywords that target digital media studies. For this, I use Yahoo Pipes for some simple RSS filtering and send the RSS output of that filtered feed to IFTTT for tweeting. Simple stuff, but I find it useful for discovering opportunities I might have otherwise missed:
CFP: Call for Chapter Proposals, Book Project on Video Games “Rated M for Mature: Sex and Sexuality in Video Ga... http://t.co/Jeeusw8aC1
— Ones and Zeroes (@digital_cfps) June 5, 2013
My first take on a creative bot was @BUBLBOBL_EBOOKS, and as its _EBOOKS surname and all-caps stylization indicate, I'm hoping that much of this bot's output places it within that dubious "weird Twitter" genre. I think it often succeeds:
HIGHER POINTS ARE SCORED SUPER DRUNK ``insert`
— BUBBLE BOBBLE EBOOKS (@bublbobl_ebooks) June 18, 2013
In this case, I started with the ROM for the arcade game Bubble Bobble and scraped it for sequences that looked like text strings: any series of three or more bytes in a row with values less than 127 (that is, within the standard ASCII range). I put those all into a database, and then I have a Perl script that crams these strings together randomly. Since fewer than half of the stored strings are actually words, there's plenty of typographic noise:
DEDE"" #NAME? BUBBY {ejej PUSH _Ooo hxU@ ZxZx 0uvv DWw
— BUBBLE BOBBLE EBOOKS (@bublbobl_ebooks) June 14, 2013
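My actual scraper is a Perl one-off, but the heuristic is easy to sketch in Python (the demo bytes below are invented stand-ins for the real ROM data):

```python
def extract_strings(rom_bytes, min_len=3):
    """Collect runs of min_len or more consecutive bytes whose values are
    below 127 (i.e., within the standard ASCII range), as described above."""
    strings, run = [], []
    for b in rom_bytes:
        if b < 127:                    # looks like ASCII text
            run.append(chr(b))
        else:                          # non-ASCII byte ends the run
            if len(run) >= min_len:
                strings.append(''.join(run))
            run = []
    if len(run) >= min_len:            # don't drop a run at end-of-file
        strings.append(''.join(run))
    return strings

# Invented demo data standing in for the Bubble Bobble ROM:
data = bytes([0x90, 0xFF]) + b"BUBBY" + bytes([0xC8]) + b"OK" + bytes([0xD0]) + b"PUSH"
print(extract_strings(data))           # → ['BUBBY', 'PUSH'] ("OK" is too short)
```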
My tweet-composition routine works like this: select 10 strings from the database, start building a tweet from the longest string in that group, choose a random target length less than 140, and then pad the rest with however many more strings from that group as will fit. I'm using Perl for this since I'm already familiar with its text-manipulation affordances, and, in what is certainly overkill for this application, I store the strings in a MySQL database. This database also records when a string has been used so that, until the whole thing periodically resets, each new group of 10 will only include unused strings. The Twitter posting is all handled by the Net::Twitter package, which has a very simple interface. This all lives on a web server where I trigger it via a cron job I set up in cPanel.
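That composition step looks roughly like this in Python (my real version is Perl plus MySQL; this sketch keeps the pool in a plain list and skips the used-string bookkeeping):

```python
import random

def compose_tweet(pool, limit=140):
    """Sample 10 strings, seed the tweet with the longest, pick a random
    target length under the limit, then pad with whatever else fits."""
    group = random.sample(pool, min(10, len(pool)))
    group.sort(key=len, reverse=True)
    tweet = group.pop(0)                       # start from the longest string
    target = random.randint(len(tweet), limit - 1)
    for s in group:
        if len(tweet) + 1 + len(s) <= target:  # +1 for the joining space
            tweet += ' ' + s
    return tweet

pool = ["BUBBY", "PUSH", "SUPER DRUNK", "HIGHER POINTS", "ZxZx", "0uvv", "DWw", "hxU@"]
print(compose_tweet(pool))
```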
So I definitely like @BUBLBOBL_EBOOKS for what it is, but it's obviously limited. It only has about 400 text strings to work with, and it runs through them all about every two weeks. More importantly, textual silliness is relatively easy to come by. I wanted to try something different.
Markov chains are a common method for generating plausible but silly-sounding strings of text. The concept sounds complicated when explained, but, as I discovered, it's actually rather elegantly simple once you get down into it. Basically, you start with a sample of text (or numbers or whatever) and analyze it for statistical pairs. These pairs might form a database, say, which could answer a question like, "Given an occurrence of Word A, what is the statistical likelihood that it is followed by Word B?" With that knowledge, one can easily generate a new string of text that exhibits those same statistical likelihoods.
Open your words.
— Pelafina Lièvre (@pelafina_lievre) June 29, 2013
For large sets, I imagine it's more efficient to actually compute and store percentages, but for my purposes, it was easiest to create an associative array (a hash reference, since I'm working in Perl) that maps every word in my source text to a list of every word that immediately follows it in the source. Here's what part of that hash looks like when I use Pelafina's letters as my source:
sole => conflict
malicious => in
serve => you you that the
mighty => heart.
what => an interest you but you he I I they
To generate a chain, I choose one word, then look up that word's list of next-words and randomly choose one from that list. That new word now becomes the input, and I look up its list of next-words. (Note that because each list of next-words can contain the same word multiple times, choosing one at random still gives me a probability-based selection, without the extra work of having to store those probabilities somewhere.)
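A compact Python sketch of that scheme (the real implementation is Perl; the sample text here is invented):

```python
import random
from collections import defaultdict

def build_chain(text):
    """First-order model: map each word to the list of words that follow it.
    Duplicates are kept, so a uniform random choice from the list is
    already weighted by frequency -- no stored probabilities needed."""
    words = text.split()
    chain = defaultdict(list)
    for a, b in zip(words, words[1:]):
        chain[a].append(b)
    return chain

def generate(chain, start, length=12):
    out = [start]
    for _ in range(length - 1):
        nexts = chain.get(out[-1])
        if not nexts:                  # dead end: no recorded successor
            break
        out.append(random.choice(nexts))
    return ' '.join(out)

sample = "open your words. open your heart. open your eyes my love."
chain = build_chain(sample)
print(chain["open"])                   # → ['your', 'your', 'your']
print(generate(chain, "open"))
```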
The only other tuning I do to construct a tweet is to make sure I start on what looks like a sentence-starting word, and I try, if possible, to end on terminal punctuation. That's all the intervention it takes to create nearly grammatically complete sentences -- even longish ones -- much of the time:
Quite rightly, he went inside and I muster under looming fame and murmured over your letters.
— Pelafina Lièvre (@pelafina_lievre) July 7, 2013
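That boundary tuning amounts to very little code. A Python approximation (the capitalization test and the tiny demo chain are my own simplifications):

```python
import random

TERMINAL = ('.', '!', '?')

def tune(chain, limit=140):
    """Start from a word that looks sentence-initial (capitalized) and
    try to stop on terminal punctuation before hitting the length cap."""
    starters = [w for w in chain if w[:1].isupper()]
    out = [random.choice(starters)]
    while len(' '.join(out)) < limit:
        nexts = chain.get(out[-1])
        if not nexts:
            break
        out.append(random.choice(nexts))
        if out[-1].endswith(TERMINAL):  # ended a sentence -- stop here
            break
    return ' '.join(out)[:limit]

chain = {"Open": ["your"], "your": ["words."], "words.": ["Open"]}
print(tune(chain))                      # → Open your words.
```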
Originally -- because of my misunderstanding of Markov chains -- I built my script so that the database hash stored each word followed by the next few words, like in the 4-word hash snippet below:
sole => conflict was with gravity.
malicious => in his manner now,
serve => you hot chocolate and you well, and as that purpose. A happy the young cub. Happy
mighty => heart. Even Marine Man
what => an impression! Of course interest you receive in you must, but realize but the doing alone you saw in me. he assumed. It is I would and would I write will only they pejoratively call medicine.
I can create text from this database in the same way, just using the last word of my 4-word "stem" as the new input word for finding the next branch to attach. This does produce some interesting constructions, but as (someone) pointed out to me, this is linking n-grams, not words, so it isn't truly a Markov implementation. I was trying for n-order Markov chains, which actually work by adding words in the opposite direction from what I'd thought. So, for example, a second-order chain would be generated by asking, "Given Word A followed by Word B, what is the probability that Word C comes next?" Here's part of my database hash for that method:
unleashing arrows => like
you ever => come accept understand forgive
but your => lookings mother
some instruction => in
make it => a
via a => night
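In code, a proper second-order chain only changes the key: it becomes a two-word context. A Python sketch (again, my real code is Perl, and the sample text is invented):

```python
import random
from collections import defaultdict

def build_second_order(text):
    """Second-order model: the key is a (Word A, Word B) pair and the
    values are the words observed to follow that pair in the source."""
    words = text.split()
    chain = defaultdict(list)
    for a, b, c in zip(words, words[1:], words[2:]):
        chain[(a, b)].append(c)
    return chain

def generate(chain, seed, length=12):
    """seed is a (word, word) pair taken from the source text."""
    out = list(seed)
    for _ in range(length - 2):
        nexts = chain.get((out[-2], out[-1]))
        if not nexts:
            break
        out.append(random.choice(nexts))
    return ' '.join(out)

sample = "did you ever come home did you ever accept"
chain = build_second_order(sample)
print(chain[("you", "ever")])          # → ['come', 'accept']
```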
It was pretty easy to adapt my code to work either way, but what matters most is the text output. My goal is to create something that speaks with Pelafina's voice but that uses new combinations of words to generate new poetic images or insights. And when her coherence fails -- as it inevitably will, since her source text includes long sequences of acrostics and whatever is going on in the September 19, 1988 letter -- perhaps my Pelafina persona could even elicit sympathy. So it becomes a question of tuning the algorithm to produce the right balance of coherence and chaos while still leaving (in her own words) "great glaciers of clarity". Below are some examples of the output produced by different settings. For each list, the first number indicates the length of string used as input.
So which of these is the most like Pelafina in the book? Which sounds the most original? Once you go over 3 words in either direction, I've found, the output chains are going to be mostly direct quotes from the source text, and while that ensures a proper tone, it doesn't create anything new. In the end, I chose the simple first-order Markov ("1,1" in the list above), and I created a bot that updates Twitter much like BUBLBOBL. The main difference is that Pelafina doesn't need to keep a database of strings -- she just uses a .txt file OCR'd from House of Leaves.
With Pelafina, I'd like to say I've created something poetic here, but in the end, I'm not sure that's the case. I did, however, make several choices that limited and shaped Pelafina's output, so whether or not she has her own voice is a result of things I did. The algorithm tuning was a big one, but there were others.
Well, whether it's poetry or not, I learned a lot in the process of making these bots, and I hope that my documenting that process here is beneficial or at least interesting. @BUBLBOBL_EBOOKS isn't very noteworthy at a code level, but I've put my Pelafina code on GitHub in case anyone would like a closer look. It's probably not very elegant code, but I've tried to comment thoroughly.
For me, the takeaway is that Markov chaining for text generation is actually pretty simple -- and that, after a while, the output starts to sound about the same. I have a few other bots in mind that may use similar algorithms, but I've noticed that many of the Twitter bots I like actually start with templates and fill in the blanks Mad-Libs-style. There's merit to that approach: the programmer gets more control over the tone of the output (through carefully chosen word lists) as well as, of course, its structure. I'd like to try that approach as well.