Visualizing RNNs

Josh Varty

After reading about the unreasonable effectiveness of recurrent neural networks, I wanted to learn more. One thing I struggle with is visualizing what's actually going on. In particular, I have a lot of trouble reconciling the forward-pass code below with the standard diagram of an unrolled RNN.
To help me understand exactly what's going on I want visualize each step of the forward pass. What do our inputs look like? What do our weights look like? How do things look at the beginning of training when compared to the end?
In order to tackle this we're going to have to use very small weight matrices or things will get out of hand very quickly. Even Andrej Karpathy's simple Min-Char RNN has weight matrices with thousands of parameters. No matter how hard we try, we won't be able to put all those numbers on screen.

An Untrained RNN

With simplicity in mind, instead of predicting the next character in English text (as Min-Char RNN does), let’s build a network that predicts the next output in the sequence:
1,0,1,0,1,0,1,0,1,0 ... 
This seems simple, maybe even too simple, but a sequence like this allows us to use small weight matrices, the largest of which is just 3x3.
For this visualization we’ll use a modified version of Min-Char RNN. Our vocab_size is just 2 (each input character is 1 or 0) and since we’re predicting such a simple pattern we’ll use a hidden_size of 3.
Our network is fed an input of 1,0,1,0 and is tasked with predicting the target output sequence 0,1,0,1, one character at a time. Since our network is untrained, note that it outputs probabilities of roughly 0.50 at each time step.
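Concretely, the inputs and targets are just the same stream offset by one step. A minimal sketch (the variable names here are mine, chosen to match the forward-pass code below):

```python
# The repeating pattern the network should learn.
sequence = [1, 0, 1, 0, 1]

# At each time step the network sees one character and must predict the next,
# so the targets are simply the inputs shifted left by one position.
inputs = sequence[:-1]   # [1, 0, 1, 0]
targets = sequence[1:]   # [0, 1, 0, 1]
```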
Click "Play" below or step through at your own pace:
import numpy as np

# Setup: binary vocabulary, 3 hidden units, input 1,0,1,0 and target 0,1,0,1
vocab_size, hidden_size = 2, 3
inputs, targets = [1, 0, 1, 0], [0, 1, 0, 1]
hidden_state_prev = np.zeros((hidden_size, 1))

# Initialize weights and biases
input_weights_U = np.random.randn(hidden_size, vocab_size) * 0.01
hidden_weights_W = np.random.randn(hidden_size, hidden_size) * 0.01
hidden_bias = np.zeros((hidden_size, 1))
output_weights_V = np.random.randn(vocab_size, hidden_size) * 0.01
output_bias = np.zeros((vocab_size, 1))

# Forward pass
xs, hidden_states, outputs, probabilities = {}, {}, {}, {}
loss = 0
hidden_states[-1] = np.copy(hidden_state_prev)
for t in range(len(inputs)):
    # One-hot encode the input character
    xs[t] = np.zeros((vocab_size, 1))
    character = inputs[t]
    xs[t][character] = 1
    target = targets[t]
    # Compute hidden state
    hidden_states[t] = np.tanh(input_weights_U @ xs[t] + hidden_weights_W @ hidden_states[t-1] + hidden_bias)
    # Compute output logits and softmax probabilities
    outputs[t] = output_weights_V @ hidden_states[t] + output_bias
    probabilities[t] = np.exp(outputs[t]) / np.sum(np.exp(outputs[t]))
    # Accumulate cross-entropy loss
    loss += -np.log(probabilities[t][target, 0])
This animation looks pretty cool, but you probably won't learn much at first glance. Take a minute to match up the weights in the diagram with the weights in the code. You can hover over them and the relationship will be highlighted for you. If you're really feeling up for it, you could work out the matrix multiplications by hand. The big takeaway here is that the untrained network knows nothing: its weights are small random numbers, so the hidden states barely move away from zero and the softmax assigns a probability of roughly 0.50 to each character at every time step.
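The claim about the untrained outputs can be checked numerically. Because the weights are drawn from a normal distribution scaled by 0.01 and the biases start at zero, every logit is tiny, and a softmax over two near-zero logits lands almost exactly at 0.50/0.50. A quick standalone sanity check (not part of the original code):

```python
import numpy as np

np.random.seed(0)
vocab_size, hidden_size = 2, 3

# Same initialization scheme as Min-Char RNN: small random weights, zero biases.
U = np.random.randn(hidden_size, vocab_size) * 0.01
W = np.random.randn(hidden_size, hidden_size) * 0.01
V = np.random.randn(vocab_size, hidden_size) * 0.01

x = np.zeros((vocab_size, 1)); x[1] = 1              # one-hot input "1"
h = np.tanh(U @ x + W @ np.zeros((hidden_size, 1)))  # hidden state stays near 0
logits = V @ h                                       # logits on the order of 1e-4
probs = np.exp(logits) / np.sum(np.exp(logits))

print(probs.ravel())  # both entries within a fraction of a percent of 0.5
```

This also explains the starting loss: each step contributes about -log(0.5) ≈ 0.693 to the cross-entropy.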

A Trained RNN

So now that we've seen a network that doesn't work, let's take a look at one that does. After training our network with gradient descent (a visualization I'll save for another time) we're left with weights that look like:
So what happens when we run the network using these weights?
# Initialize weights and biases
input_weights_U = ...  # trained via gradient descent
hidden_weights_W = ... # trained via gradient descent
hidden_bias = ...      # trained via gradient descent
output_weights_V = ... # trained via gradient descent
output_bias = ...      # trained via gradient descent
# Forward pass
xs, hidden_states, outputs, probabilities = {}, {}, {}, {}
loss = 0
hidden_states[-1] = np.copy(hidden_state_prev)
for t in range(len(inputs)):
    # One-hot encode the input character
    xs[t] = np.zeros((vocab_size, 1))
    character = inputs[t]
    xs[t][character] = 1
    target = targets[t]
    # Compute hidden state
    hidden_states[t] = np.tanh(input_weights_U @ xs[t] + hidden_weights_W @ hidden_states[t-1] + hidden_bias)
    # Compute output logits and softmax probabilities
    outputs[t] = output_weights_V @ hidden_states[t] + output_bias
    probabilities[t] = np.exp(outputs[t]) / np.sum(np.exp(outputs[t]))
    # Accumulate cross-entropy loss
    loss += -np.log(probabilities[t][target, 0])
This time our network outputs probabilities for 0 and 1 with almost 100% confidence. If you pay close attention to hidden_state, you'll notice that it also alternates, between roughly [1, -1, 1] and [-1, 1, -1]. I can't pretend that I know why this is, but it's interesting to see.
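The alternation is less mysterious if you try building such a network by hand. The weights below are hand-picked for illustration (they are not the trained values from the animation), but they show one way a 3-unit RNN can solve the 1,0,1,0 task: large weights push tanh into saturation, so the hidden state snaps to opposite corners of the cube on alternating inputs.

```python
import numpy as np

# Hand-picked weights (illustrative only, not the trained values).
# Large entries saturate tanh, so hidden states land near +/-1.
U = np.array([[-5.0,  5.0],
              [ 5.0, -5.0],
              [-5.0,  5.0]])           # column k responds to input character k
W = np.zeros((3, 3))                   # recurrence isn't needed for this task
V = np.array([[ 3.0, -3.0,  3.0],      # logit for "0"
              [-3.0,  3.0, -3.0]])     # logit for "1"

h = np.zeros((3, 1))
preds = []
for char in [1, 0, 1, 0]:
    x = np.zeros((2, 1)); x[char] = 1
    h = np.tanh(U @ x + W @ h)         # alternates near [1,-1,1] and [-1,1,-1]
    logits = V @ h
    probs = np.exp(logits) / np.sum(np.exp(logits))
    preds.append(int(np.argmax(probs)))

print(preds)  # [0, 1, 0, 1] -- the opposite of each input, with high confidence
```

The trained network presumably found something similar: saturated hidden states are a natural resting place for tanh units driven by large weights.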
In fact, the more you think about it, the more interesting it is that hidden_state learns anything at all. Hidden state is meant to encode useful information about what we've seen in the past, but this problem doesn't depend on past information at all: the network really only needs to care about the current input (1 or 0).

Making Hidden State Useful

If we want hidden state to be useful, we'll have to give it a problem where it's actually needed. We'll modify our original sequence slightly and have our network predict the next character in
1,1,0,1,1,0,1,1,0, ... 
Now our network can no longer get away with simply predicting the opposite of the input. It will have to take special care to determine whether we're on the first or second 1 when predicting the output.
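You can see why recurrence becomes essential here: if the hidden weights W were zero, the hidden state (and therefore the output) would be a function of the current character alone, so the network would be forced to make the same prediction after the first 1 and the second 1, even though the correct answers differ (1 and then 0). A small demonstration with arbitrary input weights and W set to zero:

```python
import numpy as np

np.random.seed(1)
U = np.random.randn(3, 2)              # arbitrary input weights
W = np.zeros((3, 3))                   # no recurrence: the network is memoryless
V = np.random.randn(2, 3)

def step(char, h_prev):
    x = np.zeros((2, 1)); x[char] = 1
    h = np.tanh(U @ x + W @ h_prev)    # with W = 0 this ignores h_prev entirely
    logits = V @ h
    return h, np.exp(logits) / np.sum(np.exp(logits))

# Run the 1,1,0,1 input sequence from the animation below.
h = np.zeros((3, 1))
probs_by_step = []
for char in [1, 1, 0, 1]:
    h, p = step(char, h)
    probs_by_step.append(p)

# The outputs after the first and second 1 are identical, so no memoryless
# network can answer "1" at one step and "0" at the other.
assert np.allclose(probs_by_step[0], probs_by_step[1])
```

Only a nonzero W lets the network carry a "which 1 is this?" flag forward from one step to the next.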
Below is an RNN trained to respond to input characters 1,1,0,1 with 1,0,1,1. Keep an eye on hidden_state as the animation progresses.
# Initialize weights and biases
input_weights_U = ...  # trained via gradient descent
hidden_weights_W = ... # trained via gradient descent
hidden_bias = ...      # trained via gradient descent
output_weights_V = ... # trained via gradient descent
output_bias = ...      # trained via gradient descent
# Forward pass
xs, hidden_states, outputs, probabilities = {}, {}, {}, {}
loss = 0
hidden_states[-1] = np.copy(hidden_state_prev)
for t in range(len(inputs)):
    # One-hot encode the input character
    xs[t] = np.zeros((vocab_size, 1))
    character = inputs[t]
    xs[t][character] = 1
    target = targets[t]
    # Compute hidden state
    hidden_states[t] = np.tanh(input_weights_U @ xs[t] + hidden_weights_W @ hidden_states[t-1] + hidden_bias)
    # Compute output logits and softmax probabilities
    outputs[t] = output_weights_V @ hidden_states[t] + output_bias
    probabilities[t] = np.exp(outputs[t]) / np.sum(np.exp(outputs[t]))
    # Accumulate cross-entropy loss
    loss += -np.log(probabilities[t][target, 0])
This time hidden state is actually encoding useful information that we can observe directly. Whenever the network sees the first 1 in the sequence, hidden state is set to [0.72, 0.58, 0.96]. You can see this at the first and final steps of the animation. In contrast, when the network sees the second 1, hidden state is set to [-0.76, 0.89, 0.95]. Varying hidden state like this allows our network to output the proper probabilities at each step even though the input (1) and the parameters (U, W and V) are identical at both steps.
Indeed, much of the power of RNNs stems from hidden state and from interesting ways of convincing our networks to either remember or forget different things about their input sequences. Andrej's original blog post covers this in detail for a number of different domains.
Interested in more about RNNs? Follow me on Twitter at: @ThisIsJoshVarty