Thursday, 25 April 2013

Cheating (sort-of) at Draw Something

A year or so ago Draw Something was all the rage. For those who don't remember, it's basically Pictionary: each turn you draw a word and your opponent/partner has to guess it from a given set of letters, then it alternates. Usually it's lots of fun, but occasionally you get something ridiculous and unintelligible.

What?
Unwilling to accept defeat and too stubborn to ask for help, I set out to solve the problem automatically. On the assumption that bad Draw Something images contain basically no information, I decided to approach the problem using the text alone.

The idea: find every permutation of the given letters that is the correct length and sort them by how frequently they are used. Initially I intended to use Google search hits to rank the letter combinations; obviously, this would be unbelievably slow and, as it turns out, would violate Google's terms of service. Eventually I settled on a much better solution: use a local copy of Wikipedia. It's important to use something other than a dictionary, because Draw Something frequently uses brands and pop-culture references that wouldn't crop up in most standard dictionaries.
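The whole approach fits in a few lines. Here's a minimal sketch, assuming a dictionary called wordCount that maps each word in the Wikipedia dump to how often it appears (building that dictionary is the real work, covered below):

from itertools import permutations

def rank_candidates(letters, length, wordCount):
    candidates = set(''.join(p) for p in permutations(letters, length)) # every distinct arrangement of the right length
    return sorted((word for word in candidates if word in wordCount),   # keep only words Wikipedia has seen,
                  key=wordCount.get, reverse=True)                      # most frequent first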

The article text alone was around 30GB (unzipped) when I grabbed it at the start of 2012, so lazy parsing is essential. I used Python's ElementTree module (xml.etree), but using it lazily turned out to be non-obvious. Even when using the etree.iterparse function, all processed nodes are kept in memory forever; they're just loaded one by one. The answer, obvious in retrospect, is to delete each node after reading its contents using myElement.clear(). After separating each word of the text, I added it to a big dictionary mapping words to their frequency counts; that dictionary then gets pickled and saved to disk, ready for later use.
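The bare iterparse-and-clear pattern, stripped of everything specific to this project, looks something like this (the file name here is made up):

import xml.etree.ElementTree as etree

context = etree.iterparse("articles.xml", events=('start', 'end'))
event, root = context.next()  # the very first event is the 'start' of the root element
for event, elem in context:
    if event == 'end':
        if elem.tag.endswith('text'):
            pass  # read elem.text here, before it gets thrown away
        elem.clear()  # free the finished element's contents
        root.clear()  # and detach it from the root so nothing accumulates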

Here's the code:
from glob import glob
import cPickle as pickle
from itertools import ifilter
import xml.etree.ElementTree as etree
import string
import re
import unicodedata

wordCount = {}

identity = lambda x: x

def incrementCount(word):
    wordCount[word] = wordCount.get(word, 0) + 1

regexWordSeparators = "&[^;]+?;"  # HTML entities like &lt; act as word separators
charWordSeparators = "()[]{}|.,/_=:&;\\-^~"
charSeparatorTranslation = string.maketrans(charWordSeparators, " "*len(charWordSeparators))
rubbishChars = (string.punctuation + string.digits).translate(string.maketrans("",""), charWordSeparators) # punctuation/digits that aren't separators get deleted outright
def countWords(text):
    text = unicodedata.normalize('NFKD', u'%s' % text).encode('ascii','ignore') # strip accents, drop anything non-ASCII

    text = re.sub(regexWordSeparators, " ", text).replace("%20", " ") # replace HTML entities like &lt; with spaces
    text = text.translate(charSeparatorTranslation, rubbishChars).lower() # clean and prepare for splitting (by whitespace) into words
    text = ifilter(identity, text.split()) # Remove empty strings if there are any

    map(incrementCount, text)

def processElement(elem, event):
    txt = None
    if event == 'end' and elem.tag[-4:] == 'text':
        txt = elem.text
    
    elem.clear() # Delete from element tree in memory!!
    root.clear() # Delete empty element from root (which
    #   has some tiny memory footprint leading to a leak that matters with HUGE files)
    return txt


# Parse XML Tree:
itertree = etree.iterparse(glob("./*.xml")[0], events=('end', 'start')) # We don't care about any events except start and end


# Define Article-Text Generator:
firstEvent, root = itertree.next()
textSnippets = ifilter(identity, (processElement(elem, event) for event, elem in itertree))


map(countWords, textSnippets) # Do the counting work here

# Store
with open("wordCount.pckl", 'wb') as f:
    pickle.dump(wordCount, f)
    
print "All of Wikipedia processed! %d words found :)" % len(wordCount)

It takes about an hour to run and the resulting file is 180MB or so; that's a lot of words! It turns out that when you have as much text as Wikipedia, occasional typos cause every word to permute into dozens of misspellings. Ideally a good threshold (say 5 occurrences) could be used to filter words before pickling them. Rather than waiting another hour, I just filtered the words after matching them to permutations, which makes every search slower because it has to load and search 180MB of dictionary before filtering.
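If I were regenerating the counts, that threshold would just be a small filter on the dictionary before it gets pickled; something like this sketch (the 5-occurrence cut-off is the same arbitrary guess as above):

THRESHOLD = 5 # anything rarer than this is more likely a typo than a word

wordCount = dict((word, count) for word, count in wordCount.iteritems()
                 if count >= THRESHOLD)

with open("wordCount.pckl", 'wb') as f:
    pickle.dump(wordCount, f)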

Speaking of permutations, Python's itertools library has some wonderfully simple functions for lazy combinatorics. Generating potential words of a certain length is just: permutations('abbdejmnorux', 7)
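Each item permutations yields is a tuple of letters rather than a string, so the candidates need joining before they can be looked up, and since the rack has a repeated 'b' the same word can come out more than once (which is why the final script de-duplicates them). For 12 letters taken 7 at a time that's 12 × 11 × 10 × 9 × 8 × 7 × 6 = 3,991,680 tuples. A quick look at the first few:

from itertools import islice, permutations

for candidate in islice(permutations('abbdejmnorux', 7), 3):
    print ''.join(candidate) # prints abbdejm, abbdejn, abbdejo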

The number of permutations blows up combinatorially in general, but the fixed, small sizes in this problem keep it well within reach of modern hardware. Here's the final code:
import cPickle as pickle
from itertools import permutations, ifilter, ifilterfalse, islice


# Parameters #

LETTERS = "abbdejmnorux"
LENGTH = 7


# Definitions #

with open("wordCount.pckl", 'rb') as f:
    wikiWordCount = pickle.load(f)

def usageFrequency(word):
    return wikiWordCount.get(word, 0)

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    # unique_everseen('ABBCcAD', str.lower) --> A B C D
    # This python recipe is taken from: http://docs.python.org/2/library/itertools.html
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in ifilterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element


# Specification of Output #

candidates = unique_everseen(permutations(LETTERS, LENGTH))

candidates = (''.join(candidate) for candidate in candidates) # join each tuple of letters into a string

# zip candidate strings to usage frequency, sort+filter by this value, then unzip and dump frequency info
candidates = (pair[0] for pair in sorted(
        ifilter(
            lambda pair: pair[1],
            ((candidate, usageFrequency(candidate)) for candidate in candidates)
        ),
        key = lambda pair: -pair[1]
    )
)


# Evaluation & Output #

print '\n'.join(reversed(list(islice(candidates, 0, 200))))

print "\ndone!\n"
This takes around 30 seconds to crawl through every possible mix of letters, sort them by usage frequency and print out the top few; that's pretty good!

Here's the tail of the output:
...
moderna
jourdan
bombard
broaden
majored
dunmore
unarmed
abdomen
rebound
bermuda

done!

Rebound. The answer was rebound. Obvious now, right?

Confusing Draw Something image taken from http://baddrawsomething.tumblr.com/

