## Thursday, 28 August 2014

### Fahrenheit 451 (a non-technical post)

It's been a while since I updated this blog, so I'm going to cross-post a literary thought from a private Facebook discussion:

Fahrenheit 451 has strong anti-feminine sentiments; the main characters are men, the only thoughtful woman is a woman-in-the-freezer, women are always used as the examples of societal airheaded-naivete, but criticising this seems to play into the book's own narratives of censorship:
"Now let's take up the minorities in our civilization, shall we? Bigger the population, the more minorities. Don't step on the toes of the dog-lovers, the cat-lovers, doctors, lawyers, merchants, chiefs, Mormons, Baptists, Unitarians, second-generation Chinese, Swedes, Italians, Germans, Texans, Brooklynites, Irishmen, people from Oregon or Mexico. The people in this book, this play, this TV serial are not meant to represent any actual painters, cartographers, mechanics anywhere. The bigger your market, Montag, the less you handle controversy, remember that! All the minor minor minorities with their navels to be kept clean. Authors, full of evil thoughts, lock up your typewriters. They did. Magazines became a nice blend of vanilla tapioca. Books, so the damned snobbish critics said, were dishwater. No wonder books stopped selling, the critics said. But the public, knowing what it wanted, spinning happily, let the comic-books survive."

In the postscript to my edition, Bradbury explicitly addresses this complaint (among others) noting that "there's more than one way to burn a book". He further recounts that, ironically, many students in the US have been given editions which are partially censored (mostly swearwords like 'damn' and 'hell').

Here's an interesting idea: the book is improved because because it simultaneously cries-out for criticism of it's gender-stereotypes (or even a re-write) while explicitly warning against the dangers of special-interest censorship. A progressive reading this book in the 21st century needs to wage, in their own mind, the societal-censorship battle that Bradbury describes.

Note: It was written in the 50s, which is an explanation if not an excuse.
Note 2: Women aren't a minority, but they are an interest group in the same style as the others and Bradbury explicitly addresses 'feministas' as a minority in the postscript of my edition.

## Sunday, 9 February 2014

### Pascal's Triangle

While smoothing some evenly-spaced time series data a while ago, I found the need to generate rows from Pascal's triangle; for my own entertainment, I tried to come up with some minimal (but not totally unreadable) code:

def pascalRow(n):
" Returns the nth row of Pascal's triangle"
return [1] if n<=0 else reduce(lambda row, n: row[:-1] + [(row[-1] + n), n], pascalRow(n-1), [0])


This is short and quite fast, but severely lacking in readability; a trait only made slightly excusable by it's modularity, and clarity of function. The question you may be asking though, is 'why Pascal's triangle in the first place'?
Gaussian blurring/smoothing is just another instance of the general rule that 'Gaussians come up everywhere'. When dealing with discreet data, it's important to remember that the binomial distribution is the discrete equivalent of the Gaussian. The $n^{th}$ row of Pascal's triangle is a list of the binomial coefficients $\binom nk$. These are the weightings (up to a normalisation factor of $\frac{1}{2^n}$), describing how each raw data-point will contribute to it's neighbors:

$$x_i^{smoothed} = \sum\limits_{k = 0}^{n} \binom nk x_{i-\frac{n}{2} + k}$$

Each smoothed point is thus heavily influenced by its near neighbours, but not by its more distant ones. Note that $n$ determines the radius of the smoothing filter (in units of your data's sampling period).

This kind of clunky, discrete sum is just a special case of a convolution. Convolutions generalise the notion of smoothing to filtering, enabling a huge range of operations (notably Fourier Transforms) to be performed with these same kinds of discrete sums. Convolutions generalise to to higher dimensions (and continuous spaces) as well. In 2D, the weights or 'filter kernel' become a matrix; filtering is then sliding every pixel in the kernel matrix over each pixel in the data matrix. This gives a complexity of $O(n^2m^2)$ for an $m \times m$ kernel. Gaussians (binomials) have some special properties that can make these sums more efficient however: multidimensional Gaussians are separable. A 2D Gaussian is just the product of two orthogonal 1D Gaussians: $G(x, y) = G(x)G(y)$ [Note that the separated components of a 2D Gaussian might not align with the x and y axes of your data matrix. This is only a problem if the Gaussian is anisotropic, ie. if it has different widths in different directions].

This means that convolutions with Gaussian kernels can be done one dimension at a time; sliding each data-pixel over each pixel in two length $m$ binomial/Gaussian kernels, reducing the time complexity to $O(n^2m)$. This reduction extends to higher dimensions too - you can always break down a Gaussian into 1D components and convolute them sequentially.

## Saturday, 27 April 2013

### 3D Printing Nylon

 Oozy, webbed nylon printing
Nylons can be exceptionally strong, much more so than ABS or PLA, especially for tensile stresses. That makes them a tempting materials for 3D printing mechanical parts, so I grabbed some Taulman Nylon 618 to check it out.

At first my prints were unbelievably oozy - the nylon just flowed from the nozzle and leaked all over the print. I also had problems getting it to adhere to standard blue tape. Apparently both of these are pretty common.

 Damaged bed surface after print removal
I addressed the bed-sticking problem with a cheap sheet of MDF board; 618 nylon sticks to cellulose really well. In fact, it works so well that the board's surface gets damaged each time I remove a print; ideally I'll be able to find some tape-mesh to decrease the contact area, or maybe just something less sticky, I suppose I could also just make it easy to remove the wooden bed and keep switching them out (given how cheap the stuff is).

The oozing problem was far more insidious. Initially, I tried to use retraction, but it had no effect (actually it made things worse by causing the print head to sit still momentarily, melting the plastic beneath it and letting it flow). I then happened upon a great idea; get rid of retraction entirely. Get the print head moving so quickly that not much plastic has time to flow while it's moving. In combination with the 'avoid crossing perimeters' flag in slic3r (which forces the printer to complete each part separately, getting rid of that spiderweb of inter-part trails), and the switch to a 0.2mm nozzle produced magnificent results:
 A lego block being printed with very clean, ooze-free walls
 Notice that only a single line of oozed filament conencts each component thanks to the 'avoid crossing perimeters' flag
I also had to significantly lower the extrusion multiplier to aid in ooze-prevention, though this may just be the result of thicker-than specified filament. I'm still tweaking the extrusion rate settings to get perfect, smooth surfaces, but the intermediate results are totally functional:
 InMoov index finger - the surface is still a bit rough, but the sanded joints are slippery-smooth and its many times stronger than the ABS versions I've managed to make (which suffer from cracking and shattering near the weak, stress-bearing pivot points).
Printing at the full 245°C significantly improved the strength of printed parts and didn't seem to have any apparent adverse effects on the teflon of the Printrbot Ubis hotend (though be aware that every thermistor will have a slightly different calibration). Fortunately the Ubis hotend only includes teflon on the outside of the PEEK barrel, so it should be a few degrees cooler than the nylon inside. I didn't use any bed heating at all. Use of a cooling fan while printing turned out to have a slight detrimental effect.

Nylon loves to absorb water, water that boils and bubbles out of the extruder during printing; it's often recommended to dry it out in an oven at low heat before printing, but I haven't found that to be necessary. There's a small trail of steam from the print nozzle, but there appear to be no adverse effects, though this may not be true if you live in a high-humidity area.

I could comfortably hammer the printed lego parts without damage; a massive improvement over ABS. Despite some reports that gluing/bonding 618 is ineffective, I found that superglue worked perfectly on sanded nylon parts (see the index-finger above).

Apparently, natural nylon is very easy to colour with fabric dyes too; this allows each printed part to be coloured separately, removing the need for multiple rolls in different pre-dyed colours (though I have yet to test this).

All in all I'm quite impressed; nylon 618 is a fiddly material for printing, but the results are certainly worth it for any part that needs significant mechanical strength.

## Thursday, 25 April 2013

### Powersets

While looking over purported Google interview questions, I found one that was just 'write some code to make powersets'. It turns out that the powerset of a set $S$ is the set of all its subsets (including itself):
$$\mathcal{P}( S ) = \{s \subseteq S \}$$
My first attempt at a solution was an ugly iterative algorithm that used lists to represent sets. An hour or so of tinkering led me to this lazy, recursive solution using python's built in set type (which is a hash table and thus has constant-time lookups):
def subsetsOfLength(S, l):
''' Return a generator of all subsets of S that are of length l.

Note: this function uses recursion, so l must be less than your python implementation's
recursion limit (default 1000 for C Python).
'''

if l > len(S):
pass # Return a length-zero iterator when subsets larger than S itself are requested.

elif l == 0:
yield set() # The empty set is the only set of length 0

elif l == 1:
for s in S:
yield {s} # Just yield each element if only length-one elements are required

# This clause isn't strictly required, but it saves len(S) redundant function calls that would all
# return the empty set (to be unioned with x)

else:
S = S.copy() # Avoid mangling the original parameter

while S:
x = {S.pop()} # x is a set containing a single, arbitrary element from S

# Yield the union of x with every possible subset of S\{x} (that has
# the correct length to make the union be length l):
for y in subsetsOfLength(S, l-1):
yield x.union(y)

def powerset(S):
''' Return a generator for the powerset (the set of all [non-strict] subsets) of some set S
'''

for l in xrange(len(S)+1):
for x in subsetsOfLength(S, l):
yield x

if __name__ == '__main__': # Module Test/Demo
print 'Length 3 subsets of S=[1, 5]'
print '\n'.join((str(x) for x in subsetsOfLength(set(range(1, 6)), 3)))
print
print 'Powerset of S=[1, 3]'
print '\n'.join((str(x) for x in powerset(set(range(1, 4)))))
print
print 'The length of the powerset of S with |S| = 12 is %d' % sum(1 for _ in (powerset(set(range(12)))))
# This could be calculated a-priori, as 2^|S|, but that wouldn't be a test

Here's the test output:

Length 3 subsets of S=[1, 5]
set([1, 2, 3])
set([1, 2, 4])
set([1, 2, 5])
set([1, 3, 4])
set([1, 3, 5])
set([1, 4, 5])
set([2, 3, 4])
set([2, 3, 5])
set([2, 4, 5])
set([3, 4, 5])

Powerset of S=[1, 3]
set([])
set([1])
set([2])
set([3])
set([1, 2])
set([1, 3])
set([2, 3])
set([1, 2, 3])

The length of the powerset of S with |S| = 12 is 4096

### Prime Numbers

A few months ago, during a lecture on Haskell and functional programming, we were introduced to the Sieve of Eratosthenes, an elegant way to generate prime numbers. We were shown a neat implementation in Haskell that used just a few lines; I wanted to see if I could replicate that neatness/laziness in Python.

from itertools import takewhile, chain, count, islice
from collections import deque

knownPrimes = [2] # Numbers are tested by counting upwards from here. This also acts as a cache.

def storeAndReturn(p):
knownPrimes.append(p)
return p

def isNotDivisibleByKnownPrimes(n):
return all((n % p) != 0 for p in takewhile(lambda p: p**2 <= n, knownPrimes))

def primeFilter(sequence): return (storeAndReturn(i) for i in sequence if isNotDivisibleByKnownPrimes(i))

def allPrimes(): return chain(knownPrimes, primeFilter(count(knownPrimes[-1] + 1)))

def primesUpTo(limit): return takewhile(lambda p: p < limit, allPrimes())

def firstNPrimes(n): return islice(allPrimes(), None, n, None)

def nthPrime(n): return deque(firstNPrimes(n), maxlen=1).pop() # deque construction is apparently the
# fastest way to consume an iterator (partly because it's raw C)

if __name__ == '__main__': # Module Test
print "Primes below 10:"
print '\n'.join(map(str, primesUpTo(10)))

print "First 10 primes:"
print '\n'.join(map(str, firstNPrimes(10)))

print "10,000th prime:"
print nthPrime(10000)


I succeeded, and even added some more interesting ways to use the allPrimes() generator (although a few of the one-liners should probably be broken-out to aid readability).

Primes below 10:
2
3
5
7
First 10 primes:
2
3
5
7
11
13
17
19
23
29
10,000th prime:
104729

Neat, huh?

I love infinite generators - it's flat-out cool to be able to create an object describing every prime number using just a few lines.

### Cheating (sort-of) at Draw Something

A year or so ago Draw Something was all the rage. For those who don't remember, it's basically pictionary - each turn you draw a word and your opponent/partner has to guess it from a  given set of letters, then it alternates. Usually its lots of fun, but occasionally, you get something ridiculous and unintelligible.

 What?
Unwilling to accept defeat and too stubborn to ask for help, I set out to solve the problem automatically. On the assumption that bad draw something images contain basically no information, I decided to approach the problem using the text alone.

The idea: find all combinations and permutations of the given letters that were of the correct length and sort them by how frequently they are used. Initially I intended to use Google search hits to rank the letter combinations, obviously, this would be unbelievably slow and, as it turns out, violates Google's terms of service. Eventually I settled on a much better solution; use a local copy of Wikipedia. It's important to use something other than a dictionary because Draw Something frequently uses brands and pop-culture references that would't crop up in most standard dictionaries.

The article text alone was around 30GB (unzipped) when I grabbed it at the start of 2012, so lazy parsing is essential. I used python's etree module, but using it lazily turned out to be non obvious. Even when using the etree.iterparse function, all processed nodes are kept in memory forever, they're just loaded one-by-one. The answer, obvious in retrospect, is to delete each node after reading its contents using myElement.clear(). After separating each word of text, I added it to a big dictionary mapping words to their frequency-counts, that dictionary then gets pickled and saved to disk, ready for later use.

Here's the code:
from glob import glob
import cPickle as pickle
from itertools import ifilter
import xml.etree.ElementTree as etree
import string
import re
import unicodedata

wordCount = {}

identity = lambda x: x

def incrementCount(word):
wordCount[word] = wordCount.get(word, 0) + 1

regexWordSeperators = "&[^;]+?;"
charWordSeperators = "()[]{}|.,/_=:&;\\-^~"
charSperatorTraslation = string.maketrans(charWordSeperators, " "*len(charWordSeperators))
rubbishChars = (string.punctuation + string.digits).translate(string.maketrans("",""), charWordSeperators)
def countWords(text):
global debugcount

text = unicodedata.normalize('NFKD', u'%s' % text).encode('ascii','ignore') # Much nicer conversion!

text = re.sub(regexWordSeperators, " ", text).replace("%20", " ") # replace html characters like &lt; with spaces
text = text.translate(charSperatorTraslation, rubbishChars).lower() # clean and prepare for
# splitting (by whitespace) into words
text = ifilter(identity, text.split()) # Remove empty strings if there are any

map(incrementCount, text)

def processElement(elem, event):
txt = None
if event == 'end' and elem.tag[-4:] == 'text':
txt = elem.text

elem.clear() # Delete from element tree in memory!!
root.clear() # Delete empty element from root (which
#   has some tiny memory footprint leading to a leak that matters with HUGE files)
return txt

# Parse XML Tree:
itertree = etree.iterparse(glob("./*.xml")[0], events=('end', 'start')) # We don't care about any events except start and end

# Define Article-Text Generator:
firstEvent, root = itertree.next()
textSnippets = ifilter(identity, (processElement(elem, event) for event, elem in itertree))

map(countWords, textSnippets) # Do the counting work here

# Store
with open("wordCount.pckl", 'wb') as f:
pickle.dump(wordCount, f)

print "All of Wikipedia processed! %d words found :)" % len(wordCount)


It takes about an hour to run and the resulting file is 180MB or so; that's a lot of words! it turns out that when you have as much text as is on Wikipedia, occasional typos cause every word to permute into dozens of misspellings. Ideally a good threshold (say 5 occurrences) could be used to filter words before pickling them. Rather than waiting another hour, I just filtered the words after matching them to permutations -  it makes every search slower because it has to load and search 180MB of dictionary before filtering.

Speaking of permutations, Python's itertools library has some wonderfully simple functions for lazy combinatorics. Generating potential words of a certain length is just: permutations('abbdejmnorux', 7)

Combinatorial permutations are naturally enormous, but the fixed sizes for this problem make that insignificant on modern hardware. Here's the final code:
import cPickle as pickle
from itertools import permutations, ifilter, ifilterfalse, islice

# Parameters #

LETTERS = "abbdejmnorux"
LENGTH = 7

# Definitions #

with open("wordCount.pckl", 'rb') as f:

def usageFrequency(word):
return wikiWordCount.get(word, 0)

def unique_everseen(iterable, key=None):
"List unique elements, preserving order. Remember all elements ever seen."
# unique_everseen('AAAABBBCCDAABBB') --> A B C D
# unique_everseen('ABBCcAD', str.lower) --> A B C D
# This python recipe is taken from: http://docs.python.org/2/library/itertools.html
seen = set()
if key is None:
for element in ifilterfalse(seen.__contains__, iterable):
yield element
else:
for element in iterable:
k = key(element)
if k not in seen:
yield element

# Specification of Output #

candidates = unique_everseen(permutations(LETTERS, LENGTH))

candidates = (reduce(lambda x,y: x+y, candidate) for candidate in candidates)

# zip candidate strings to usage frequency, sort+filter by this value, then unzip and dump frequency info
candidates = (pair[0] for pair in sorted(
ifilter(
lambda pair: pair[1],
((candidate, usageFrequency(candidate)) for candidate in candidates)
),
key = lambda pair: -pair[1]
)
)

# Evaluation & Output #

print '\n'.join(reversed(list(islice(candidates, 0, 200))))

print "\ndone!\n"
This takes around 30 seconds to crawl though every possible mix of letters, sort them by usage frequency and print out the top few; that's pretty good!

Also:
...
moderna
jourdan
bombard
majored
dunmore
unarmed
abdomen
rebound
bermuda

done!

Rebound. The answer was rebound. Obvious now, right?

Confusing Draw Something image taken from http://baddrawsomething.tumblr.com/

### This blog

For a long time I've been intending to document and share my various projects and hacks; this blog hopefully represents the end of my laziness in that regard. For a while I'm going to be retroactively documenting some recent projects, but soon, this blog should become a place where I can share progress and work though problems as I find them.