Twitter bots, Markov chains and "large slices of clarity"

If you've read Mark Danielewski's House of Leaves, then you are no doubt familiar with the madwoman in its attic, Pelafina Lièvre. Her letters to her son Johnny add one more voice to this already polyvocal text, and though her words reside in Appendix IIE, her melancholy and madness encompass a space much broader than the backmatter to which they are relegated. The echoes and redirections of her voice wind through every other page -- so much so that even after reading and re-reading, I still find Pelafina's letters opening new doors between passages once familiar in isolation, now made strange through the lurking affinity of a phrase or two.

Twitter is also a domain rich in polyphonic textuality, and if you've spent much time there at all, you may have encountered an account or two that you knew or suspected to be fake. Some of these are parodies, and some are created in order to boost a paying client's follower count. Others try for something a little weirder, like the famous @horse_ebooks, for example, which pushes its characteristic blend of link spam and Dadaist art to its hundreds of thousands of readers multiple times a day.

I made a fake Twitter account for Pelafina, @pelafina_lievre, and I'm writing this blog entry to explain a bit of how I did it and why.

I've been interested in Twitter bots for their literary and poetic value, and recently I've been motivated and educated by a Google+ community of other bot enthusiasts. Darius Kazemi (@tinysubversions), the master of bots and other weird Internet stuff, puts all of his code on GitHub, and I've learned a good deal from it. Mark Sample (@samplereality) demonstrates how much mileage one can get from a clever combination of sources, as in @WhitmanFML, and the sublime lexicoludics of Adam Parrish's @PowerVocabTweet are simply brilliant.

Now, before working with Pelafina, I'd made a few simple bots of my own. When my classes blog (which is almost always), it's often useful to connect the blogs to our Twitter conversation using an RSS-based auto-tweeter. If This Then That makes this very simple to set up. In a similar vein, my @digital_cfps is a utility bot that finds and tweets CFP announcements with keywords that target digital media studies. For this, I use Yahoo Pipes for some simple RSS filtering, and send the RSS output of that filtered feed to IFTTT for tweeting. Simple stuff, but I find it useful for discovering opportunities I might have otherwise missed.

My first take on a creative bot was @BUBLBOBL_EBOOKS, and as its _EBOOKS surname and all-caps stylization indicate, I'm optimistic that much of this bot's output places it in that dubious "weird Twitter" genre. I think it often succeeds.

In this case, I started with the ROM for the arcade game Bubble Bobble and scraped it for sequences that looked like text strings: any series of three or more bytes in a row with values less than 127 (that is, within the standard ASCII range). I put all of those into a database, and then I have a Perl script that crams the strings together randomly. Since fewer than half of the stored strings are actually words, there's plenty of typographic noise.
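
For the curious, the scan itself amounts to little more than a regex over the raw bytes. Here's a minimal sketch rather than my actual script: the ROM filename is a placeholder, and I've limited the match to printable ASCII instead of the whole sub-127 range so the captured strings stay readable.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Slurp the ROM as raw bytes (the filename is just a placeholder).
    open my $fh, '<:raw', 'bublbobl.rom' or die "Can't open ROM: $!";
    my $rom = do { local $/; <$fh> };
    close $fh;

    # Collect runs of three or more printable-ASCII bytes.
    my @strings;
    while ($rom =~ /([\x20-\x7E]{3,})/g) {
        push @strings, $1;
    }

    print "$_\n" for @strings;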

My tweet-composition works like this: select 10 strings from the database, start building a tweet from the longest string in that group, choose a random target length less than 140, and then pad the rest with however many more strings from that 10 it can fit. I'm using Perl for this since I'm already familiar with its text manipulation affordances and, in what is certainly overkill for this application, I store the strings in a MySQL database. This database also updates when a string has been used so that, until the whole thing periodically resets, each new group of 10 will only include unused strings. The Twitter posting is all handled by the Net::Twitter package, which has a very simple interface. This all lives on a web server where I trigger it via a cron job I set up in cPanel.
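
Stripped of the MySQL bookkeeping, the compose-and-post step looks roughly like the sketch below. The sample strings and the OAuth credentials are placeholders, and the padding loop is only an approximation of what my script does.

    use strict;
    use warnings;
    use List::Util qw(shuffle);
    use Net::Twitter;

    # Stand-ins for the ten strings pulled from the database.
    my @strings = ('BUBBLE', 'BOBBLE', 'INSERT COIN', 'HURRY UP!', 'EXTEND',
                   'GAME OVER', 'ROUND 1', 'SECRET ROOM', 'THANK YOU', 'BUB & BOB');

    # Start with the longest string, then pad toward a random target length.
    my @pool   = sort { length $b <=> length $a } @strings;
    my $tweet  = shift @pool;
    my $target = 60 + int(rand(80));   # some random target under 140

    for my $s (shuffle @pool) {
        last if length($tweet) >= $target;
        next if length($tweet) + length($s) + 1 > 140;
        $tweet .= " $s";
    }

    # Post it. The four OAuth values are placeholders, of course.
    my $nt = Net::Twitter->new(
        traits              => ['API::RESTv1_1'],
        consumer_key        => 'KEY',
        consumer_secret     => 'SECRET',
        access_token        => 'TOKEN',
        access_token_secret => 'TOKEN_SECRET',
    );
    $nt->update($tweet);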

So I definitely like @BUBLBOBL_EBOOKS for what it is, but it's obviously limited. It only has about 400 text strings to work with, and it runs through them all about every two weeks. More importantly, textual silliness is relatively easy to come by. I wanted to try something different.

Markov chains are a common method for generating plausible but silly-sounding strings of text. The concept sounds complicated when explained, but, as I discovered, it's actually rather elegantly simple once you dig into it. Basically, you start with a sample of text (or numbers or whatever) and analyze it for statistical pairs. These pairs might form a database, say, which could answer a question like, "Given an occurrence of Word A, what is the statistical likelihood that it is followed by Word B?" With that knowledge, one can easily generate a new string of text that reproduces those same statistical likelihoods.

For large sets, I imagine it's more efficient to actually compute and store percentages, but for my purposes, it was easiest to create an associative array (a hash reference, since I'm working in Perl) keyed on every word in my source text, with each word associated with a list of every word that immediately follows it in the source. Here's what part of that hash looks like when I use Pelafina's letters as my source:

sole    => 
     conflict
malicious   => 
     in
serve   => 
     you
     you
     that
     the
mighty  => 
     heart.
what    => 
     an
     interest
     you
     but
     you
     he
     I
     I
     they

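Building that structure takes only a few lines of Perl. Here's a minimal sketch of the idea, using a hash of array references keyed on each word; the filename is a placeholder for my OCR'd source text.

    use strict;
    use warnings;

    # Read the whole source text (placeholder filename).
    open my $fh, '<', 'pelafina.txt' or die "Can't open source: $!";
    my $text = do { local $/; <$fh> };
    close $fh;

    # For every word, record the word that follows it.
    my @words = split ' ', $text;
    my %next;
    for my $i (0 .. $#words - 1) {
        push @{ $next{ $words[$i] } }, $words[$i + 1];
    }
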
To generate a chain, I choose one word, then look up that word's array of next words and randomly choose a word from that list. That new word now becomes the input, and I look up its list of next-words. (Note that since each list of next-words can contain the same word multiple times, choosing one at random still gives me a probability-weighted selection, without the extra work of having to store those probabilities somewhere.)
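
In code, that walk is just a loop that keeps replacing the current word with a random member of its next-word list. A sketch, continuing from the %next hash and @words list built above:

    # Walk the chain: pick a starting word, then repeatedly hop to a random successor.
    my $word  = $words[ int rand @words ];
    my @chain = ($word);

    while (length(join ' ', @chain) < 120) {        # rough length budget
        my $followers = $next{$word} or last;       # stop at a dead end
        $word = $followers->[ int rand @$followers ];
        push @chain, $word;
    }

    print join(' ', @chain), "\n";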

The only other tuning I do to construct a tweet is to make sure I start on what looks like a sentence-starting word, and I try, if possible, to end on terminal punctuation. That's all the intervention it takes to end up with nearly grammatically complete sentences -- even longish ones -- much of the time.
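
That tuning boils down to a couple of regex checks plus retries, something in the spirit of the sketch below. The patterns and the retry budget are illustrative, and generate_chain() is a hypothetical wrapper around the random walk above, not a function from my actual script.

    # Prefer chains that start on a capitalized word and end on terminal punctuation.
    sub looks_like_start { $_[0] =~ /^[A-Z]/ }
    sub looks_like_end   { $_[0] =~ /[.!?]["']?$/ }

    my $tweet;
    for (1 .. 50) {                        # arbitrary retry budget
        my $candidate = generate_chain();  # hypothetical wrapper around the walk above
        next unless looks_like_start($candidate);
        next unless looks_like_end($candidate);
        $tweet = $candidate;
        last;
    }
    $tweet //= generate_chain();           # give up and take whatever comes out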

Originally -- because of my misunderstanding of Markov chains -- I built my script so that the database hash stored each word followed by the next few words, like in the 4-word hash snippet below:

sole    => 
     conflict was with gravity. 
malicious   => 
     in his manner now, 
serve   => 
     you hot chocolate and 
     you well, and as 
     that purpose. A happy 
     the young cub. Happy 
mighty  => 
     heart. Even Marine Man 
what    => 
     an impression! Of course 
     interest you receive in 
     you must, but realize 
     but the doing alone 
     you saw in me. 
     he assumed. It is 
     I would and would 
     I write will only 
     they pejoratively call medicine.

I can create text from this database in the same way, just using the last word of my 4-word "stem" as the new input word for finding the next branch to attach. This does produce some interesting constructions, but as (someone) pointed out to me, this is linking n-grams rather than single words, so it isn't truly a Markov implementation. I was trying for n-order Markov chains, which actually work by adding words in the opposite direction from what I'd thought. So, for example, a second-order chain would be generated by asking, "Given Word A followed by Word B, what is the probability that Word C comes next?" Here's part of my database hash for that method:

unleashing arrows   => 
     like 
you ever    => 
     come 
     accept 
     understand 
     forgive 
but your    => 
     lookings 
     mother 
some instruction    => 
     in 
make it     => 
     a 
via a   => 
     night 

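For the record, the proper n-order version is only a small change to the hash-building loop: the key becomes a run of n words joined together instead of a single word. A sketch for a second-order chain, reusing the @words list from the earlier snippet:

    # Second-order version: the key is a two-word sequence; the value lists every
    # word that follows that pair in the source text.
    my $order = 2;
    my %next2;
    for my $i (0 .. $#words - $order) {
        my $key = join ' ', @words[ $i .. $i + $order - 1 ];
        push @{ $next2{$key} }, $words[ $i + $order ];
    }

    # When generating, the lookup key is always the last $order words added so far.
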
It was pretty easy to adapt my code to work either way, but what matters most is the text output. My goal is to create something that speaks with Pelafina's voice but that uses new combinations of words to generate new poetic images or insights. And when her coherence fails -- as it inevitably will, since her source text includes long sequences of acrostics and whatever is going on in the September 19, 1988 letter -- perhaps my Pelafina persona could even elicit sympathy. So it becomes a question of tuning the algorithm to produce the right balance of coherence and chaos while still leaving (in her own words) "great glaciers of clarity". Below are some examples of the output produced by different settings. For each list, the first number indicates the number of words used as the input key, and the second the number of words added to the chain at each step.

1,1

  • Quite rightly, he knows I forgive you a brighter note, I write you happy now?
  • It'll be brave insights. Mead Halls rejoice. All of it.
  • Sisyphian task. People here think your ankles, threatening to keep us all.

2,1

  • I keep forgetting you are still with the wind under your belt. Five reams of paper and pencil.
  • He could semper fi that meal all the letters I wrote you. The first thing they did not spoil your feelings for me.
  • Donnie had instances like that. At least that much is obvious. The shape that gave you shape?

3,1

  • You've listened to tyrants and lost faith in your instincts. I am with you. Director claims I outdid Lear.
  • I hardly wrote a thing. I remember when your father would take me flying. It will become my heart.
  • As you're well aware, he loved more than anything to fly.

2,2

  • Olympus. Like Donnie, you too were born with substantial faculties.
  • Did he really break your nose? Am I being silly?
  • Every afternoon too. P.s. Time will grant you a place.

1,3

  • Leon failed? Never could write you like a grown man.
  • Apparently your new family thinks of you hot chocolate and plans. With too am suspended in just one week!
  • You've listened to tyrants and lost faith in your instincts.

1,4

  • Please forgive me please. Please. Please.Do not forget your father stopped me and took me to The Whalestoe.
  • I forgive you for failing to fall from the vine.
  • I forgive you for failing to fall from the vine. Hearing it makes my ears bleed.

So which of these is the most like Pelafina in the book? Which sounds the most original? Once you go over 3 words in either direction, I've found, the output chains are going to be mostly direct quotes from the source text, and while that ensures a proper tone, it doesn't create anything new. In the end, I chose the simple first-order Markov ("1,1" in the list above), and I created a bot that updates Twitter much like BUBLBOBL. The main difference is that Pelafina doesn't need to keep a database of strings -- she just uses a .txt file OCR'd from House of Leaves.

I'd like to say I've created something poetic with Pelafina, but in the end, I'm not sure that's the case. I did, however, make several choices that limited and shaped her output, so whether or not she has her own voice is a result of decisions I made. The algorithm tuning was a big one, but there were others:

  1. I used only the letters from House of Leaves itself, not those in the separately-published Whalestoe Letters. This was just the OCR I had on hand. (By the way, thanks, Mark!)
  2. I left in all of those letters, including ones from May 8, 1987 (which is written as an acrostic) and September 19, 1988 (which is also meaningless at the standard syntactic level). Keeping all this in was a risk, in a way, since my algorithm is supposed to learn from word order and these two letters (and others) lack meaningful word order, but keeping them in adds an interesting bit of chaos to the output. I find I can still tell when she dips into these letters, though, since they naturally include some wholly unique chains.
  3. I trimmed out the salutations and closings in each letter. These just seemed repetitively formal in ways that wouldn't necessarily flow well into a tweet. (A sketch of this trimming follows the list.)
  4. I privileged sentences. Again, my goal was to make Pelafina's output at least somewhat plausible for the context on Twitter, so I wanted her to have some basic, succinct coherence. So I use some regex to give her a pretty good chance of ending with a period. I'm afraid, however, that this is what leads to her occasionally repeating herself, so I may change my mind about this later.
  5. I didn't completely correct OCR errors. Doing so would improve the output, probably.
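
As mentioned in item 3, a little preprocessing strips the salutations and closings. The sketch below is a guess at the file's shape rather than my actual cleanup: it assumes each salutation and sign-off sits on its own line and matches a couple of hypothetical patterns, and it would run on $text before the word split in the earlier snippet.

    # Drop salutation-ish and closing-ish lines before building the chain database.
    # These patterns are stand-ins; the real ones depend on how the OCR'd file looks.
    my @kept;
    for my $line (split /\n/, $text) {
        next if $line =~ /^\s*My dear\b/i;                    # opening salutation
        next if $line =~ /^\s*(?:Love|Mommy|P\.)\s*,?\s*$/i;  # sign-off line
        push @kept, $line;
    }
    $text = join "\n", @kept;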

Well, whether it's poetry or not, I learned a lot in the process of making these bots, and I hope that my documenting that process here is beneficial or at least interesting. @BUBLBOBL_EBOOKS isn't very noteworthy at a code level, but I've put my Pelafina code on GitHub in case anyone would like to take a closer look. It's probably not very elegant code, but I've tried to comment thoroughly.

For me, the takeaway is that Markov chaining for text generation is actually pretty simple, but after a while, the output starts to sound about the same. I have a few other bots in mind that may use similar algorithms, but I've noticed that many of the Twitter bots I like actually start with templates and fill in their parts Mad-Libs-style. There's merit to that approach in that the programmer gets more control (through carefully choosing word lists) over the tone of the output as well as (of course) the structure. I'd like to try out that approach as well.
