They work with sequences (e.g. stock data, natural language)
We can train them with backprop, the same way we do with convolutional nets
They are flexible and work with a variety of structures
One-to-many: Take an image and return a caption for it.
Many-to-one: Take a movie review and return whether it's positive or negative.
Many-to-many: Take an English sentence and return a Spanish sentence.
One thing I struggle with is visualizing what's actually going on.
In particular I have a lot of trouble reconciling this code:
with this image:
To help me understand exactly what's going on, I want to visualize each step of the forward pass. What do our
inputs look like? What do our weights look like? How do things look at the beginning of training when compared
to the end?
To tackle this, we're going to have to use very small weight matrices, or things
will get out of hand very quickly. Even Andrej Karpathy's simple Min-Char RNN has weight matrices with
thousands of parameters. No matter how hard we try, we won't be able to put all those numbers
on screen.
An Untrained RNN
With simplicity in mind, instead of predicting the next character in English text (as Min-Char RNN does), let’s build a network that predicts the next
output in the sequence: 1,0,1,0,1,0,1,0,1,0...
This
seems simple. Maybe even too simple, but a sequence like this will allow us to have small weight matrices, the largest of
which will be just 3x3.
For this visualization we’ll use a modified version of Min-Char RNN.
Our vocab_size is just 2 (each input character is 1 or 0) and since we’re predicting such a simple pattern we’ll use a hidden_size of 3.
Our network is fed an input of 1,0,1,0 and is tasked
with predicting the target output sequence 0,1,0,1, one
character at a time. Since our network is untrained, note that it outputs probabilities of roughly
0.50 at each time step.
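If you'd like to follow along in code as well as in the animation, here's a minimal sketch of how such a tiny network might be set up before training. The shapes come straight from the description above (a vocab_size of 2 and a hidden_size of 3); the small random initialization scale is my own assumption, not necessarily what the animation uses.

import numpy as np

vocab_size = 2    # our only characters are 0 and 1
hidden_size = 3   # three hidden units keep every matrix tiny

# Input-to-hidden, hidden-to-hidden and hidden-to-output weights.
# The largest of these, hidden_weights_W, is only 3x3.
input_weights_U  = np.random.randn(hidden_size, vocab_size) * 0.01   # 3x2
hidden_weights_W = np.random.randn(hidden_size, hidden_size) * 0.01  # 3x3
output_weights_V = np.random.randn(vocab_size, hidden_size) * 0.01   # 2x3
hidden_bias = np.zeros((hidden_size, 1))                             # 3x1
output_bias = np.zeros((vocab_size, 1))                              # 2x1

hidden_state_prev = np.zeros((hidden_size, 1))  # hidden state before the first step

inputs  = [1, 0, 1, 0]  # what the network is fed
targets = [0, 1, 0, 1]  # what it should predict at each step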
Click "Play" below or step through at your own pace:
This animation looks pretty cool, but you probably won't learn much at first glance. Take a minute to match up the weights in the diagram with the weights in the code. You can hover over them and it will highlight the relationship for you.
If you're really feeling up for it, you could work out the matrix multiplications by hand.
The big takeaways here are:
Computing hidden state is most of the work
We use the same weights for every input of our forward pass
hidden_states[t-1] changes as we go through each step
We output probabilities for the target character at each step
Since our network is untrained, the output probabilities at each step are roughly 50% for 0 and 50% for 1.
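To see why an untrained network lands near 50/50, it helps to push a single input through one step by hand. The sketch below continues from the initialization sketch earlier (so the weight values are tiny random numbers, purely illustrative): because the resulting logits are nearly equal, the softmax splits the probability almost evenly.

# One step of the forward pass with the untrained weights from the earlier sketch.
x = np.zeros((vocab_size, 1))
x[1] = 1  # one-hot encoding of the input character "1"

hidden_state = np.tanh(input_weights_U @ x +
                       hidden_weights_W @ hidden_state_prev +
                       hidden_bias)

logits = output_weights_V @ hidden_state + output_bias
probs = np.exp(logits) / np.sum(np.exp(logits))

print(probs)  # both entries come out close to 0.5 because the logits are near zero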
A Trained RNN
So now that we've seen a network that doesn't work, let's take a look at one that does. After training our network with gradient descent (a visualization I'll save for another time) we're left with weights that look like:
So what happens when we run the network using these weights?
# Initialize weights and biases
input_weights_U = ...   # trained via gradient descent
hidden_weights_W = ...  # trained via gradient descent
hidden_bias = ...       # trained via gradient descent
output_weights_V = ...  # trained via gradient descent
output_bias = ...       # trained via gradient descent

# Forward pass
xs, hidden_states, outputs, probabilities = {}, {}, {}, {}
loss = 0
hidden_states[-1] = np.copy(hidden_state_prev)

for t in range(len(inputs)):
    # One-hot-encode the input character
    xs[t] = np.zeros((vocab_size, 1))
    character = inputs[t]
    xs[t][character] = 1
    target = targets[t]

    # Compute hidden state
    hidden_states[t] = np.tanh(input_weights_U @ xs[t] + hidden_weights_W @ hidden_states[t-1] + hidden_bias)

    # Compute output and probabilities
    outputs[t] = output_weights_V @ hidden_states[t] + output_bias
    probabilities[t] = np.exp(outputs[t]) / np.sum(np.exp(outputs[t]))

    # Compute cross-entropy loss
    loss += -np.log(probabilities[t][target, 0])
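One thing worth noticing about the loss line at the bottom of this code: cross-entropy makes it easy to quantify how much a confident network gains over an unsure one. A quick check of the numbers (the 0.99 below is just an illustrative stand-in for "almost 100%" confidence):

import numpy as np

# Per-step cross-entropy loss at two levels of confidence in the correct character.
print(-np.log(0.5))   # untrained, ~50% confidence: about 0.693 per step
print(-np.log(0.99))  # trained, almost 100% confidence: about 0.01 per step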
This time our network outputs probabilities for 0 and 1 with almost 100% confidence. If you pay close attention to hidden_state, you'll notice that it
also alternates between [1,-1,1] and [-1,1,-1]. I can't pretend that I know why this is, but it's interesting to see.
In fact, the more we think about it, the more interesting it is that hidden_state learns anything at all. Hidden state is meant to encode useful information about things we've seen in the past, but our problem doesn't depend on past information.
Our network should really only care about the current input (1 or 0).
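If you'd rather verify these hidden state values than take the animation's word for it, you can print them after running the forward pass above. A minimal sketch, assuming the trained weights have been loaded into the same variable names the code uses:

# Inspect what the network did at each step of the forward pass above.
for t in range(len(inputs)):
    print("step", t,
          "input", inputs[t],
          "hidden", hidden_states[t].ravel().round(2),
          "probs", probabilities[t].ravel().round(2))
# With the trained weights, the hidden state should flip between roughly
# [1, -1, 1] and [-1, 1, -1], and the probability of the correct next
# character should be close to 1 at every step.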
Making Hidden State Useful
If we want hidden state to be useful, we'll have to give it a problem where it's actually needed. We'll modify our original sequence slightly and have our network predict the next character in
1,1,0,1,1,0,1,1,0, ...
Now our network can no longer get away with simply predicting the opposite of the input. It will have to take special care to determine whether we're on the first or second 1 when predicting the output.
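Concretely, the targets for this new task are just the same repeating pattern shifted forward by one character. A quick sketch of how the inputs and targets used below line up:

# The repeating pattern and its next-character targets (the pattern shifted by one).
pattern = [1, 1, 0, 1, 1, 0, 1, 1, 0]
inputs  = pattern[:4]    # [1, 1, 0, 1]
targets = pattern[1:5]   # [1, 0, 1, 1]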
Below is an RNN trained to respond to input characters 1,1,0,1 with 1,0,1,1.
Keep an eye on hidden_state as the animation progresses.
# Initialize weights and biases
input_weights_U = ...   # trained via gradient descent
hidden_weights_W = ...  # trained via gradient descent
hidden_bias = ...       # trained via gradient descent
output_weights_V = ...  # trained via gradient descent
output_bias = ...       # trained via gradient descent

# Forward pass
xs, hidden_states, outputs, probabilities = {}, {}, {}, {}
loss = 0
hidden_states[-1] = np.copy(hidden_state_prev)

for t in range(len(inputs)):
    # One-hot-encode the input character
    xs[t] = np.zeros((vocab_size, 1))
    character = inputs[t]
    xs[t][character] = 1
    target = targets[t]

    # Compute hidden state
    hidden_states[t] = np.tanh(input_weights_U @ xs[t] + hidden_weights_W @ hidden_states[t-1] + hidden_bias)

    # Compute output and probabilities
    outputs[t] = output_weights_V @ hidden_states[t] + output_bias
    probabilities[t] = np.exp(outputs[t]) / np.sum(np.exp(outputs[t]))

    # Compute cross-entropy loss
    loss += -np.log(probabilities[t][target, 0])
This time hidden state is actually encoding useful information we can observe directly. Whenever the network sees the first 1 in
the sequence, hidden state is set to [0.72, 0.58, 0.96]. You can see this at the first and final steps of the animation. In contrast,
when the network sees the second 1, hidden state is set to [-0.76, 0.89, 0.95]. Varying hidden state like this allows our network
to output the proper probabilities at each step despite our inputs (1) and parameters (U, W and V) being the same.
Indeed, much of the power of RNNs stems from hidden state and from interesting ways of convincing our network to either remember or forget different things about the input sequences.
Andrej's original blog post covers this in detail for a number of different domains.
Interested in more about RNNs? Follow me on Twitter at: @ThisIsJoshVarty