The wake-sleep algorithm for unsupervised neural networks

Before we start, let’s get some Python plumbing out of the way:

import numpy as np
from itertools import islice
# I usually program in languages where this is built in :)
# https://stackoverflow.com/questions/6822725/rolling-or-sliding-window-iterator-in-python

def window(seq, n=2):
    """Returns a sliding window (of width n) over data from the iterable:
       s -> (s0, s1, ..., s[n-1]), (s1, s2, ..., sn), ...
    """
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        result = result[1:] + (elem,)
        yield result

… back to regularly scheduled programming

The report starts here:

The wake-sleep algorithm for unsupervised neural networks

Paper by Geoffrey E. Hinton, Peter Dayan, Brendan J. Frey and Radford M. Neal

Read, interpreted and Pythonized by Vadim Liventsev for Graphical Models of Statistical Inference course at Skoltech

What is this all about?

Artificial neural networks are typically used for supervised learning: given some training data with inputs and outputs, create a model that maps inputs to outputs as closely as possible. This paper introduces the wake-sleep algorithm for unsupervised learning (the training data contains only inputs) with neural networks and explains its theoretical underpinnings.

How does it work?

The wake-sleep algorithm mostly comes down to 2 clever tricks:

Clever trick 1: the probability-entropy approach

A neural network can be seen as a coding scheme: the first layer takes an input vector and the last one outputs its representation (a code). If a reference set of “correct” representations is known, one can minimize the mean squared deviation of the actual representations from the reference ones. But that is not necessary. Instead, one can take another approach, inspired by coding theory: minimize the complexity of the representations, i.e. set the weights of the connections in the neural network so that inputs are encoded in the most compact way possible.

Sounds simple, but developing this idea into an actual algorithm requires some preparation:

Meet Binary Stochastic Neuron

A binary stochastic neuron is a neuron that takes a vector of binary values and outputs 0 or 1, with the probability of firing given by the logistic function

$$P(s_v = 1) = \frac{1}{1 + \exp\left(-b_v - \sum_u s_u w_{uv}\right)}$$

where $s_v$ is the output of neuron $v$, $b_v$ is the bias of neuron $v$, $s_u$ is the output of neuron $u$ and $w_{uv}$ is the weight of the connection from neuron $u$ to neuron $v$

or, in Python:

class StochasticNeuron():
    def __init__(self, input_weights = np.array([]), bias = 0):
        self.input_weights = np.array(input_weights)
        self.bias = bias
        
    def activation_probability(self, inputs):
        inputs = np.array(inputs)
        return 1 / (1 + np.exp(- self.bias - np.sum(self.input_weights * inputs)))
    
    def activation_function(self, inputs):
        return np.random.binomial(1, self.activation_probability(inputs))
    
    def __call__(self, inputs):
        return self.activation_function(inputs)

For example,

n = StochasticNeuron([0, -0.5, 1], 0)
n.activation_probability([False, True, False])
0.3775406687981454
n([False, True, False])
0
n.activation_probability([False, True, True])
0.6224593312018546
n([False, True, True])
0

And we can see that the neuron is stochastic by calculating its output several times in a row:

[n([False, True, False]) for i in range(15)]
[1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1]
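
As a quick sanity check (mine, not the paper’s), the empirical firing rate over many samples should approach the activation probability:

# Sanity check: the average of many samples approaches the logistic activation probability
samples = [n([False, True, False]) for i in range(10000)]
np.mean(samples)                                # roughly 0.38
n.activation_probability([False, True, False])  # 0.3775...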

Stochastic neural networks

A layer of a neural network is just a tuple of neurons

# Unbiased, unadjusted layer
def create_layer(input_size, size):
    return [StochasticNeuron(np.array([1.0 for i in range(input_size)]), 0) for i in range(size)]
layer = create_layer(3, 10)
inp = [1, 2, -4]
[neuron(inp) for neuron in layer]
[1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

Once again, let’s infer the output of the layer several times and note the stochasticity:

# Once again, it is stochastic
[[neuron(inp) for neuron in layer] for i in range(5)]
[[0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 1, 1, 0, 0, 0],
 [0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
 [0, 1, 0, 1, 0, 0, 0, 0, 0, 0]]

A multi-layer neural network is built by stacking several layers together, with the outputs of layer n serving as the inputs of layer n+1:

def create_network(layer_sizes):
    return [create_layer(prev_size, size) for prev_size, size in window(layer_sizes, 2)]

Let’s build the network displayed below:

(Figure: a feed-forward network with layers of 3, 4 and 2 neurons)

net = create_network([3, 4, 2])

The state of the entire network can be inferred by calculating neuron outputs layer-by-layer:

def infer(layers, inp):
    state = [inp]
    for layer in layers:
        state.append([neuron(state[-1]) for neuron in layer])
    return state

We can try to set the input vector to 1,0,0 and infer the state of the network

state = infer(net, [True, False, False])
state
[[True, False, False], [0, 1, 1, 1], [1, 1]]

OK, but what does it have to do with graphical models?

Great question! The image below is (probably) the most cited graphical model:

(Figure: the classic rain/sprinkler/wet-grass Bayesian network)

It is a Bayesian network describing the reasoning of a certain British detective who woke up late on a foggy British day to notice that his British lawn was wet. The arcs (edges, arrows) of the graph define the conditional probabilities of the lawn being wet because of rain or because of a sprinkler.

Notice that there is little meaningful difference between the nodes of a Bayesian network and our binary stochastic neurons. This is easy to demonstrate:

sprinkler_neuron = StochasticNeuron([-10], 0)
grass_neuron = StochasticNeuron([5,5], 0)
rain = True
print("Rain: " + str(bool(rain)))
sprinkler = sprinkler_neuron([rain])
print("Sprinkler active: " + str(bool(sprinkler)))
grass_wet = grass_neuron([rain, sprinkler])
print("Grass wet: " + str(bool(grass_wet)))
Rain: True
Sprinkler active: False
Grass wet: True
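
To make the correspondence concrete, we can read the conditional probability table of the “grass wet” node straight off the neuron (a small illustration of mine, not an example from the paper):

# Tabulate P(grass wet | rain, sprinkler) implied by the neuron's weights
for r in (False, True):
    for s in (False, True):
        p = grass_neuron.activation_probability([r, s])
        print("P(wet | rain=%s, sprinkler=%s) = %.3f" % (r, s, p))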

So, essentially, stochastic neural networks and graphical models are two languages for describing the same mathematical structure.

But didn’t we set out to solve learning, not inference?

Yes! Statistical inference and learning are, in a sense, complementary problems: inference asks “how would this neural network/graphical model behave?” while learning asks “how do we build a neural network/graphical model that behaves the way we want?”. Hinton et al. defined “the way we want” as minimizing the description length of the input together with the network’s representation of it.

Consider the following problem: you have n objects, some of which are more likely to occur than others, as defined by some probability distribution, and you have to come up with a code for every object so that the expected length of a code is as short as possible. This problem is less theoretical than it might seem: languages (natural and programming languages alike) attempt to solve this exact problem by describing concepts that occur often with short words (dog) and rare ones with long words (concatenation). Languages are far from perfect at it, but if a perfect language were to exist, every word in it would have length $-\log P$, where $P$ is the probability of that word.
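
For instance, a quick back-of-the-envelope calculation (the word frequencies below are made up purely for the sake of illustration):

# Made-up frequencies, just to show that frequent words deserve short codes
frequencies = {"dog": 0.05, "cat": 0.04, "concatenation": 0.0001}
for word, p in frequencies.items():
    print("%s: ideal code length = %.1f bits" % (word, -np.log2(p)))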

$-\log P$ is called the description length (Shannon’s source coding theorem), and it applies to any probability distribution, since every probability distribution, in theory, defines a coding scheme. Our stochastic neural network is no exception: once the input vector and the weights of all connections are fixed, every possible output of the net has a description length.

It is derived as follows:

The cost C of describing the state of a single neuron is

$$C(s_j^\alpha) = -s_j^\alpha \log p_j^\alpha - (1 - s_j^\alpha) \log\left(1 - p_j^\alpha\right)$$

where $\alpha$ is the network state (the aggregate of the states of all neurons), $s_j^\alpha$ is the state of neuron $j$ given network state $\alpha$ and $p_j^\alpha$ is the activation probability of neuron $j$, i.e. $P(s_j^\alpha = 1)$
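
Or, as a small Python sketch (the function name neuron_cost is mine, not the paper’s; the costs come out in nats because we use the natural logarithm):

def neuron_cost(s, p):
    # C(s) = -s*log(p) - (1-s)*log(1-p): cheap when the actual state s is likely under p
    return -s * np.log(p) - (1 - s) * np.log(1 - p)

neuron_cost(1, 0.9), neuron_cost(0, 0.9)  # approximately (0.105, 2.303)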

Then the cost of describing state α in its entirety is simply the cost of describing all the hidden states in all the hidden layers plus the cost of describing the input vector given the hidden states:

$$C(\alpha, d) = C(\alpha) + C(d \mid \alpha)$$

where $d$ is the input vector. Note that the decomposition of $C(\alpha, d)$ into two addends is very meaningful: $C(d \mid \alpha)$ indicates how well the state of the network represents the input vector, while $C(\alpha)$ corresponds to the complexity of said state.
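
To make the decomposition tangible, here is a sketch of the cost of a whole state under one set of connections, reusing neuron_cost from above (the helper and its name are mine; for simplicity it ignores the cost of the very first layer of the state, which has no layer feeding into it in this toy setup):

def state_cost(layers, state):
    # Sum the per-neuron cost C(s_j) over every layer, conditioning each
    # neuron's activation probability on the state of the previous layer
    total = 0.0
    for layer, (inputs, outputs) in zip(layers, window(state, 2)):
        for neuron, s in zip(layer, outputs):
            total += neuron_cost(s, neuron.activation_probability(inputs))
    return total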

And if we were to minimize this cost (we are) using gradient descent, we would (we will) use the following delta rule derived from the above formula:

$$\Delta w_{kj} = \varepsilon \, s_k^\alpha \left( s_j^\alpha - p_j^\alpha \right)$$

where $\varepsilon$ is the learning rate and $s_k^\alpha$ is the state of neuron $k$ in the layer feeding into neuron $j$

Long time, no Python! Let’s implement it!

def learn(layers, state, rate):
    for layer, (inputs, outputs) in zip(layers, window(state, 2)):
        for neuron, output in zip(layer, outputs):
            # Yes, I could greatly optimize it by caching probabilities
            # As soon as I have some spare time
            p = neuron.activation_probability(inputs)
            # Delta rule: w_kj <- w_kj + rate * s_k * (s_j - p_j)
            neuron.input_weights += rate * np.array(inputs) * (output - p)
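
And a quick sanity check (again mine, not the paper’s): repeatedly applying learn to one fixed state should make that state cheaper to describe, i.e. state_cost from above should go down:

# Gradient descent on the description length of a single fixed state
demo_net = create_network([3, 4, 2])
demo_state = infer(demo_net, [True, False, False])
print(state_cost(demo_net, demo_state))
for i in range(50):
    learn(demo_net, demo_state, 0.1)
print(state_cost(demo_net, demo_state))  # should be noticeably smaller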

Clever trick 2: using 2 reverse neural nets to train each other

There are several issues that haven’t been addressed yet:

  • Our delta rule requires state α to already exist, and using some dummy state is a bad idea, since that would jeopardize the quality of learning
  • We have a network that outputs representations of the input vector, but we don’t have a good way to reconstruct the input vector from its representations

Hinton et al proposed the following solution:

(Figure: a stack of layers in which each pair of adjacent layers is linked by both recognition and generative connections)

This architecture can be thought of either as a single network where each pair of adjacent layers is connected by two sets of connections (one in each direction), or as two networks with a shared state: one responsible for recognition (building a representation of the input data), the other for generation (reconstructing the input vector).

Learning happens iteratively, each iteration consists of:

  1. Recognition inference. Calculating state α.
  2. Generative learning. Changing the generative weights so that d is better described by α, using the delta rule mentioned above.
  3. Generative inference. Calculating a new (fantasy) d.
  4. Recognition learning. Changing the recognition weights so that α is better described by d.

English -> Python translation:

def wakesleep(layer_sizes, inputs, learning_rate, iterations):
    recognition_connections = create_network(layer_sizes)
    generative_connections = create_network(list(reversed(layer_sizes)))
    state = [inputs]

    for i in range(iterations):
        state = infer(recognition_connections, state[-1]) # state now runs bottom-up: input first, representation last
        learn(generative_connections, reversed(state), learning_rate) # the generative net reads it top-down
        state = infer(generative_connections, state[-1]) # state now runs top-down: representation first, fantasy input last
        learn(recognition_connections, reversed(state), learning_rate) # the recognition net reads it bottom-up

    return reversed(state)
tuple(wakesleep([2, 10, 30], [-1, 1], 1, 100))
([1, 1],
 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 [0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1])

Voilà!