- They work with sequences (e.g. stock data, natural language)
- We can train them with backprop, the same way we do with convolutional nets
- They are flexible and work with a variety of structures
- **One-to-many**: Take an image and return a caption for it.
- **Many-to-one**: Take a movie review and return whether it's positive or negative.
- **Many-to-many**: Take an English sentence and return a Spanish sentence.

`1`, `0`, `1`, `0`, `1`, `0`, `1`, `0`, `1`, `0`, `...`
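One way to phrase this as training data is to split the sequence into inputs and next-character targets (the variable names here are my own, not from the original):

```python
# Split the alternating sequence into inputs and next-character targets.
sequence = [1, 0] * 5                        # 1, 0, 1, 0, ...
inputs, targets = sequence[:-1], sequence[1:]
print(inputs[:4])    # [1, 0, 1, 0]
print(targets[:4])   # [0, 1, 0, 1]
```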

This seems simple. `vocab_size` is just `2` (each input character is `1` or `0`) and, since we're predicting such a simple pattern, we'll use a `hidden_size` of `3`, which makes our hidden weight matrix only `3x3`.
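Concretely, those two numbers fix every parameter shape in the network. A quick sketch (the layer roles match the forward-pass code below; the short names `U`, `W`, `V` are assumptions):

```python
import numpy as np

vocab_size, hidden_size = 2, 3

U = np.random.randn(hidden_size, vocab_size)   # input -> hidden: 3x2
W = np.random.randn(hidden_size, hidden_size)  # hidden -> hidden: 3x3
V = np.random.randn(vocab_size, hidden_size)   # hidden -> output: 2x3

print(U.shape, W.shape, V.shape)  # (3, 2) (3, 3) (2, 3)
```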
Our network takes the input sequence `1`, `0`, `1`, `0` and is tasked with predicting the target output sequence `0`, `1`, `0`, `1`, one character at a time. Since our network is untrained, note that it outputs probabilities of roughly `0.50` at each time step.
```python
# Initialize weights and biases
input_weights_U = np.random.randn(hidden_size, vocab_size) * 0.01
hidden_weights_W = np.random.randn(hidden_size, hidden_size) * 0.01
hidden_bias = np.zeros((hidden_size, 1))
output_weights_V = np.random.randn(vocab_size, hidden_size) * 0.01
output_bias = np.zeros((vocab_size, 1))

# Forward pass
xs, hidden_states, outputs, probabilities = {}, {}, {}, {}
loss = 0
hidden_states[-1] = np.copy(hidden_state_prev)

for t in range(len(inputs)):
    # One-hot-encode the input character
    xs[t] = np.zeros((vocab_size, 1))
    character = inputs[t]
    xs[t][character] = 1
    target = targets[t]

    # Compute hidden state
    hidden_states[t] = np.tanh(input_weights_U @ xs[t]
                               + hidden_weights_W @ hidden_states[t-1]
                               + hidden_bias)

    # Compute output and probabilities
    outputs[t] = output_weights_V @ hidden_states[t] + output_bias
    probabilities[t] = np.exp(outputs[t]) / np.sum(np.exp(outputs[t]))

    # Compute cross-entropy loss
    loss += -np.log(probabilities[t][target, 0])
```

- Computing the hidden state is most of the work
- We use the same weights for every input of our forward pass
- `hidden_states[t-1]` changes as we go through each step
- We output probabilities for the target character at each step
- Since our network is untrained, the output probabilities are:
  - 50% `0`
  - 50% `1`
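You can check the roughly-`0.50` outputs with a self-contained version of the forward pass above (the seed and the inlined input sequence are my additions):

```python
import numpy as np

np.random.seed(0)
vocab_size, hidden_size = 2, 3

# Untrained weights: small random values, as in the listing above
input_weights_U = np.random.randn(hidden_size, vocab_size) * 0.01
hidden_weights_W = np.random.randn(hidden_size, hidden_size) * 0.01
hidden_bias = np.zeros((hidden_size, 1))
output_weights_V = np.random.randn(vocab_size, hidden_size) * 0.01
output_bias = np.zeros((vocab_size, 1))

inputs = [1, 0, 1, 0]
h = np.zeros((hidden_size, 1))
for character in inputs:
    x = np.zeros((vocab_size, 1))
    x[character] = 1
    h = np.tanh(input_weights_U @ x + hidden_weights_W @ h + hidden_bias)
    o = output_weights_V @ h + output_bias
    p = np.exp(o) / np.sum(np.exp(o))
    print(p.ravel())  # both entries stay close to 0.5
```

With weights this small, the hidden state stays near zero, both logits stay near zero, and the softmax lands near `[0.5, 0.5]` at every step.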

```python
# Initialize weights and biases
input_weights_U = ...   # trained via gradient descent
hidden_weights_W = ...  # trained via gradient descent
hidden_bias = ...       # trained via gradient descent
output_weights_V = ...  # trained via gradient descent
output_bias = ...       # trained via gradient descent

# Forward pass
xs, hidden_states, outputs, probabilities = {}, {}, {}, {}
loss = 0
hidden_states[-1] = np.copy(hidden_state_prev)

for t in range(len(inputs)):
    # One-hot-encode the input character
    xs[t] = np.zeros((vocab_size, 1))
    character = inputs[t]
    xs[t][character] = 1
    target = targets[t]

    # Compute hidden state
    hidden_states[t] = np.tanh(input_weights_U @ xs[t]
                               + hidden_weights_W @ hidden_states[t-1]
                               + hidden_bias)

    # Compute output and probabilities
    outputs[t] = output_weights_V @ hidden_states[t] + output_bias
    probabilities[t] = np.exp(outputs[t]) / np.sum(np.exp(outputs[t]))

    # Compute cross-entropy loss
    loss += -np.log(probabilities[t][target, 0])
```

After training, our network predicts `0` and `1` with almost 100% confidence. If you pay close attention to `hidden_state` you'll notice that it also alternates between `[1, -1, 1]` and `[-1, 1, -1]`. I can't pretend that I know why this is, but it's interesting to see.
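One plausible reading (my speculation, not the article's): `tanh` saturates, so once training pushes the pre-activations away from zero, each hidden unit gets pinned near `+1` or `-1`:

```python
import numpy as np

# tanh squashes any input into (-1, 1) and flattens out fast,
# so moderately large pre-activations already land near +/-1.
for z in [0.5, 2.0, 5.0]:
    print(z, round(float(np.tanh(z)), 4))
```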
It's somewhat surprising that `hidden_state` learns anything at all. Hidden state is meant to encode useful information about things we've seen in the past, but our problem doesn't depend on past information. Our network should really only care about the current input (`1` or `0`).
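To make this concrete, the alternating task is solved perfectly by a memoryless rule, "predict the opposite of the current input" (a sketch to make the point, not part of the original code):

```python
# No hidden state needed: flip the current input.
inputs = [1, 0, 1, 0]
targets = [0, 1, 0, 1]
predictions = [1 - x for x in inputs]
print(predictions == targets)  # True
```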
Let's try a harder sequence:

`1`, `1`, `0`, `1`, `1`, `0`, `1`, `1`, `0`, `...`

Now our network can no longer get away with simply predicting the opposite of the input. It will have to take special care to determine whether we're on the first or second `1` when predicting the output.
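You can see the ambiguity directly: in this pattern the character after a `1` is sometimes `1` and sometimes `0`, so no memoryless rule can be right every time (a small check, my own addition):

```python
# Collect every character that follows a 1 in the repeating 1, 1, 0 pattern.
sequence = [1, 1, 0] * 4
follows_one = {sequence[i + 1] for i in range(len(sequence) - 1) if sequence[i] == 1}
print(follows_one)  # {0, 1}: the current input alone can't decide
```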
Our network takes the input sequence `1`, `1`, `0`, `1` and must respond with `1`, `0`, `1`, `1`. Keep an eye on `hidden_state` as the animation progresses.
```python
# Initialize weights and biases
input_weights_U = ...   # trained via gradient descent
hidden_weights_W = ...  # trained via gradient descent
hidden_bias = ...       # trained via gradient descent
output_weights_V = ...  # trained via gradient descent
output_bias = ...       # trained via gradient descent

# Forward pass
xs, hidden_states, outputs, probabilities = {}, {}, {}, {}
loss = 0
hidden_states[-1] = np.copy(hidden_state_prev)

for t in range(len(inputs)):
    # One-hot-encode the input character
    xs[t] = np.zeros((vocab_size, 1))
    character = inputs[t]
    xs[t][character] = 1
    target = targets[t]

    # Compute hidden state
    hidden_states[t] = np.tanh(input_weights_U @ xs[t]
                               + hidden_weights_W @ hidden_states[t-1]
                               + hidden_bias)

    # Compute output and probabilities
    outputs[t] = output_weights_V @ hidden_states[t] + output_bias
    probabilities[t] = np.exp(outputs[t]) / np.sum(np.exp(outputs[t]))

    # Compute cross-entropy loss
    loss += -np.log(probabilities[t][target, 0])
```

When our network sees the first `1` in the sequence, hidden state is set to `[0.72, 0.58, 0.96]`. You can see this at the first and final steps of the animation. In contrast, when the network sees the second `1`, hidden state is set to `[-0.76, 0.89, 0.95]`. Varying hidden state like this allows our network to output the proper probabilities at each step despite our inputs (`1`) and parameters (`U`, `W` and `V`) being the same.
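What the two hidden-state vectors are effectively encoding is a single bit: "was the previous input also a `1`?" A hand-written state machine with just that bit (my sketch, not the trained RNN) predicts the pattern perfectly:

```python
def predict(inputs):
    """Predict the next character of the repeating 1, 1, 0 pattern."""
    prev, out = None, []
    for x in inputs:
        if x == 0:
            out.append(1)        # a 0 is always followed by 1
        elif prev == 1:
            out.append(0)        # second 1 in a row: 0 comes next
        else:
            out.append(1)        # first 1: another 1 comes next
        prev = x
    return out

print(predict([1, 1, 0, 1]))  # [1, 0, 1, 1], matching the targets above
```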