# Training Neural Networks with Theano

**Training neural networks** involves quite a few tricky bits. We try to make everything clear and easy to understand, to get you training your neural networks as quickly as possible.

**Theano** allows us to write relatively concise code that follows the structure of the underlying maths. To run the code yourself, download the notebook at https://github.com/ASIDataScience/training-neural-networks-notebook

## Recognising hand-written digits

We will train a network to classify digits. More precisely, we want a network that when presented with an image of a hand-written digit will tell us what digit it is `('0', '1', ..., '9')`

. The data we will use for this task is known as the MNIST dataset. It has a long tradition in neural networks research, as the dataset is quite small but still very tricky to classify correctly.

You can download the MNIST dataset which we are using.

### Import Theano

You can install theano with the Python package manager `pip`

. At the command line type

`pip install theano`

or check out the theano documentation if you run into trouble

The `theano.tensor`

module contains useful functionality for manipulating vectors and matrices, like we will be doing here, so let's import it along with the full package.

```
import theano
import theano.tensor as T
import numpy as np
```

With the data loaded into a variable `mnist_dataset`

, we split it into three part: a training set, used to teach the network to recognize digits, a validation set that could be used to tune and compare models, and finally a test set, used to see how well it has learned.

We then prepare the datasets, splitting each set into the images and their labels, and store them in Theano shared variables, a bit of theano magic that's explained later. Understanding this massaging of the data isn't crucial.

```
train_set, valid_set, test_set = mnist_dataset
prepare_data = lambda x: (theano.shared(x[0].astype('float64')),
theano.shared(x[1].astype('int32')))
(training_x, training_y), (valid_x, valid_y), (test_x, test_y) = map(prepare_data, (train_set, valid_set, test_set))
```

Let's look at the first three images in the training set, then set about building a machine learning model that has the ability to recognise digits itself!

We've defined a function `plot_mnist_digit`

exactly for printing out the training images - it's just for prettiness.

```
for i in range(3):
plot_mnist_digit(training_x.get_value()[i]) # training_x is a theano object containing *images*
print ('Image class: '+ str(training_y.get_value()[i])) # training_y is a theano object containing *labels*
```

`Image class: 5`

`Image class: 0`

`Image class: 4`

# The neural network model

The neural network model that we will build defines a probability distribution

*P*(*Y* = *y* | *X* = x; *θ*),

where *Y* represents the image class, which means it is a random variable that can take the values `0-9`

, *X* represents the image pixels and is a vector-valued random variable (we collapse the image matrix into a vector), and *θ* is the set of model parameters that we are going to learn.

In this tutorial we build two models: first implementing a logistic regression model, then extending it to a neural network model.

## Multi-class logistic regression

The equation for our first model is give by

*P*(*Y* = *y* | x; *θ*) ∝ [*σ*(x^{T}W+b^{T})]_{y},

where [x]_{i} is the *i*th entry of vector x, and the use of the proportionality symbol ∝ means that the probability is equal to the expression on the right hand side times a constant chosen such that ∑_{y}*P*(*Y* = *y* | x; *θ*) = 1

The parameter set *θ* for this model is *θ* = {W, b}, where W is a matrix and b is a vector. We also use the non-linearity *σ* given by

$$\sigma(t) = \frac{1}{1+e^{-t}}$$

When applied to a vector or matrix, the sigmoid function *σ*(*t*) is applied entrywise.

## Understanding the logistic regression model

One way to think about the logistic regression model is that it takes the input (x), puts it through a linear combination (x^{T}W + b^{T}) and then finally through a non-linearity: *σ*(x^{T}W + b^{T}).

The result of this operation is a vector representing the entire discrete distribution over all the possible classes - in our case the ten possible digits `0-9`

. To get the probability of a particular class *y* = 6 we extract the 6*t**h* entry of the probability vector: [*σ*(x^{T}W+b^{T})]_{y}

Graphically the model looks like this:

Each indiviual entry of the vectors **x** and *P*(*Y*) is shown as a circle -- known as the units (or artificial *neurons*) of the network. We have *D* input units (the dimensionality of the input vector, which is the flattened matrix of pixels) and *C* output units (the number of classes, which is the digits `0-9`

). The model parameters **W** and **b** are represented as arrows. We also show the application of the sigmoid functions, but we do not represent the normalization that makes the probabilities sum up to 1.

Another way to write the model above is using the SoftMax function. A good exercise is deriving the SoftMax function based on the fact that we can also write the same model using SoftMax and the equal sign instead of the proportionality sign:

*P*(*Y* = *y* | x; *θ*)=[SoftMax(Wx+b)]_{y},

## Neural network with a single hidden layer

Our second model is given by

$$P(Y = y \ |\ \boldsymbol{\textrm{x}} ; \theta) = \left[ \textrm{SoftMax} \left( \boldsymbol{\textrm{h}} ^\boldsymbol{\textrm{T}} \boldsymbol{\textrm{W}}_\boldsymbol{\textrm{hy}} + \boldsymbol{\textrm{b}}^\boldsymbol{\textrm{T}} _\boldsymbol{\textrm{hy}}\right)\right]_y, \\
\boldsymbol{\textrm{h}} = \tanh \left( \boldsymbol{\textrm{x}} ^\boldsymbol{\textrm{T}} \boldsymbol{\textrm{W}}_\boldsymbol{\textrm{xh}} + \boldsymbol{\textrm{b}}^\boldsymbol{\textrm{T}}_\boldsymbol{\textrm{xh}}\right)$$

This is very similar to the logistic regression model. Here we have introduced a new vector-valued random variable *h*. We call this a 'hidden' variable or 'latent' variable, as we do not have any data observations for it. This variable may not even correspond to any quantity in the real world, but we use it to increase the power of our statistical model.

We now also have more parameters: *θ* = {W_{xh}, W_{hy}, b_{xh}, b_{hy}}.

tanh is the hyperbolic tangent function given by

$$\tanh(t) = \frac{e^t-e^{-t}}{e^t+e^{-t}}$$

Like with the sigmoid, when applied to a vector or matrix, tanh function is applied entrywise.

Graphically, this model looks like this:

Now our depiction shows a hidden layer with *M* units (this number can be different from the number of input neurons and number of output neurons), and we have two different nonlinearities in the graph: *t**a**n**h* and sigmoids (but again we are not graphically representing the SoftMax normalization).

## Teaching the network

The way we're going to make our network learn is by trying to find some values for our parameters *θ* so that the network is as likely as possible to guess the correct class. That is, given a data set of training images x_{1}, x_{2}, …, x_{N} and correct labels *y*_{1}, *y*_{2}, …, *y*_{N}, we want to find the parameters that maximize probability of the correct labels given the images.

This method of choosing parameters is called *maximum likelihood* (ML), and we can express it mathematically as finding the parameters *θ* which maximize the likelihood function:

*θ*^{*} = argmax_{θ} *P*(*Y*_{1} = *y*_{1}, *Y*_{2} = *y*_{2}, …, *Y*_{N} = *y*_{N} | X_{1} = x_{1}, X_{2} = x_{2}, …, X_{N} = x_{N}; *θ*)

And since our data points are independent, we can write this joint probability as a product of probabilities.

## Writing the likelihood function for an entire dataset

In our likelihood function above, each random variable pair (*X*_{1}, *Y*_{1}),(*X*_{2}, *Y*_{2}),…,(*X*_{N}, *Y*_{N}) refers to a single data point. But since virtually all of our computations need to deal with multiple data points, we will find it both useful and computationally efficient to express both the mathematics and our Python code in terms of datasets. Thus, we will express the scalar random variables *Y*_{1}, *Y*_{2}, …, *Y*_{N} as the vector-valued random variable *Y*, and the vector-valued random variables *X*_{1}, *X*_{2}, …, *X*_{N} as the matrix-valued random variable *X*, where the matrix *X* has as many **rows** as there are data points.

Using this notation, we rewrite the maximum likelihood equation above:

*θ*^{*} = argmax_{θ} *P*(*Y* = y | *X* = X; *θ*)

Similarly, we can specify the logistic regression model it terms of multiple datapoints:

*P*(*Y* | *X* = X; *θ*)=SoftMax(XW+1b^{T})

Here the result is a matrix of probabilities with as many rows as there are data points, and as many columns as there are classes. We also consider the SoftMax to normalize the result of the linear combination (XW+1b^{T}) in such a way that each row of the result is a proper probability distribution summing to 1. Note that we have had to multiply the bias vector b by the vertical vector of all ones 1 in order to add the bias term for every single data point.

The neural network model equations follow a similar pattern, which it would be a good exercise to write out for yourself.

## Computing the log-likelihood of a dataset

In most machine learning applications, it is better to maximize the log-likelihood rather than the likelihood. This is done because the log-likelihood tends to be simpler to compute and more numerically stable than the likelihood. In terms of the math, this doesn't make things much more complicated, as all we need to add is a log in front of the likelihood:

*θ*^{*} = argmax_{θ} log*P*(*Y* = y | *X* = X; *θ*)

Since the logarithm of a product is the sum of the logarithms, we can write:

$$\log P( Y = \boldsymbol{\textrm{y}}\ | \ X = \boldsymbol{\textrm{X}} ; \theta) = \sum_{i=1}^N \log P( Y_i = \boldsymbol{\textrm{y}}_i\ | \ X_i = \boldsymbol{\textrm{X}}_i ; \theta)$$

that is, the log joint probability is the sum of the log marginal probabilities.

Now let's plug in the probability of a dataset from above to obtain:

log*P*(*Y* = y | *X* = X; *θ*)=∑[log(SoftMax(XW+1b^{T}))]_{⋅, y}

where we use the notation [M]_{⋅, y} to mean that from the matrix M we construct a new vector (*a*_{1}, *a*_{2}, …, *a*_{n}) such that *a*_{i} = *M*_{i, yi}∀*i*. We use a slight abuse of notation and we use the sum symbol to indicate the summation of the entries in the vector; we need a summation because the log joint probability is the sum of the log marginal probabilities.

## Classifying a dataset

Once we have a set of parameters *θ* that we are happy with, we can use the model to classify new data.

We have built a model that gives us a distribution over classes given the data. How do we assign a class to a new data point? The simplest way is to choose the class with the highest probability under the model to be the class we assign. We can write this mathematically, again using vector notation:

$$\hat{\boldsymbol{\textrm{y}}} = {\arg\max} \ P(Y \ |\ \boldsymbol{\textrm{X}} ; \theta) $$

# Gradient ascent

Computing the maximum likelihood parameters is computationally unfeasible, so we're going to use a method called *gradient ascent* to find a set of parameters that are really good but perhaps not the absolute best.

The idea of gradient ascent is very simple. Given a function *f* : ℝ^{n} → ℝ, we want to iteratively find points *f*(x_{n}) such that the value of the function gets progressively higher, that is: *f*(x_{n + 1})>*f*(x_{n})∀*n*. One way to do this is taking the direction of steepest ascent, which is just the gradient of the function $\frac{\partial f}{\partial \boldsymbol{\textrm{x}}}$, and taking a step in that direction times a constant *λ* known as the *learning rate* that describes how big the step should be. We express this mathematically as:

$$ \boldsymbol{\textrm{x}}_{n+1} \leftarrow \boldsymbol{\textrm{x}}_n + \lambda \frac{\partial f}{\partial \boldsymbol{\textrm{x}}} $$

The last detail is choosing the starting point, x_{0}, which we can arbitrarily choose by setting to zero or to some random value. Graphically, the algorithm looks like this, with each color representing the path from a different starting point:

Note that this method tends to find the top of the nearest hill ('local' maximum), and not the overall best point ('global' maximum).

It is also not guaranteed to increase the value of the function at each step; if the learning rate is too large, the algorithm could potentially jump across the top of the hill to a lower point on the other side of the hill. Much research goes into optimization methods, and many neural networks models are trained with methods that are more complicated than gradient ascent as it's presented here, but this same idea is at the base of all of those methods.

Finally, most people talk about and use gradient **descent** on the **negative** log-likelihood rather than gradient **ascent** on the log-likelihood; this is because gradient descent is the standard algorithm in the field of optimization. Gradient ascent in used in this tutorial to keep things a bit simpler.

## Optimizing our model with gradient ascent

This is extremely easy: we just apply the equation above to our parameters taking the gradient of the log-likelihood:

$$\begin{align} \boldsymbol{\textrm{W}} &\leftarrow \boldsymbol{\textrm{W}} + \lambda \frac{\partial \log P( Y = \boldsymbol{\textrm{y}}\ | \ X = \boldsymbol{\textrm{X}} ; \theta)}{\partial \boldsymbol{\textrm{W}}} \\ \boldsymbol{\textrm{b}} &\leftarrow \boldsymbol{\textrm{b}} + \lambda \frac{\partial \log P( Y = \boldsymbol{\textrm{y}}\ | \ X = \boldsymbol{\textrm{X}} ; \theta)}{\partial \boldsymbol{\textrm{b}}} \\ \end{align}$$# Coding with Theano

Coding with Theano is extremely simple: once we have the equations behind the model, we pretty much type them directly. Since we will be training on multiple data points, we are going to encode the model as we wrote it for datasets, using vectors and matrices.

We'll start with the logistic regression. Our model is:

*P*(*Y* | X; *θ*)=SoftMax(XW+1b^{T})

The first thing we do is declare all of the variables (X, y, W, b) that we will be using and their types like you would in Java or C:

```
n_classes = 10 # each digit is one of 0-9
dims = 28 * 28 # our input data is flattened 28x28 matrices of image pixels
X = T.dmatrix() # Theano double matrix
y = T.ivector() # Theano integer vector
W = theano.shared(np.zeros([dims,n_classes])) # Theano shared double matrix
b = theano.shared(np.zeros(n_classes)) # Theano shared double vector
```

As you can see above, we defined *W* and *b* to be *shared* variables. This means that the values of these variables are persistent -- their values live on after we have run Theano operations. This is opposed to regular Theano variables, which only take values when Theano runs, and otherwise only exist in the abstract.

The reason for making *W* and *b* shared variables is that we want to run multiple iterations of gradient descent, and to do that, we need their values to persist. Furthermore, we want to find good parameters through training, but we will then want to use the same parameters for prediction, so we need them to be persistent

Let's now write our statistical model in Theano. We basically copy the following equation into code:

*P*(*Y* | X; *θ*)=SoftMax(XW+1b^{T})

`P = T.nnet.softmax(T.dot(X,W) + b) # the matrix of probabilities of all classes for all data points`

Theano provides us with a ** T.nnet.softmax** function to compute SoftMax, correctly normalizing the probabilities so that each row of the matrix

**is a proper probability distribution that sums to 1.**

`P`

Note that we didn't need to multiply ** b** by the

**1**vector, Theano will do the correct addition for us automatically, just like numpy would do it. Here is a simple numpy example illustrating this:

```
amatrix = np.zeros((3,2)) # 3x2 matrix of all zeros
avector = np.array((1,2)) # the vector [1,2]
amatrix + avector
```

```
array([[ 1., 2.],
[ 1., 2.],
[ 1., 2.]])
```

Our next equation is the log-likelihood (**LL**) of a dataset:

log*P*(*Y* = y | *X* = X; *θ*)=∑[log(SoftMax(XW+1b^{T}))]_{⋅, y}

`LL = T.mean(T.log(P)[T.arange(P.shape[0]), y]) # the log-likelihood (LL)`

OK... there's a lot going on here.

We have used two important tricks: - using mean instead of sum, - using the strange indexing `[T.arange(P.shape[0]), y]`

Let's go over them one by one.

#### Use the mean instead of the sum

Imagine that we were to construct a new dataset that contains each data point in the original dataset twice. Then the log-likelihood of the new dataset will be double that of the original. More importantly, the gradient of the log-likelihood of the new dataset will also be double the gradient of the log-likelihood of the original dataset. But we would like the size of the gradient to not depend on the amount of duplication in our dataset, and the easiest way to accomplish that is to divide the gradient by the size of the dataset.

Since taking the mean is equivalent to taking the sum and then dividing by the number of data points, what we are computing here is a type of "normalized" log-likelihood that will cause our gradient descent algorithm to be robust to change in dataset size. The quantity we are computing in the code can be more precisely described as the average log-likelihood for a single datapoint.

#### Use indexing to create a vector from a matrix

The second thing we do is use ** [T.arange(P.shape[0]), y]** to apply the mathematical operation denoted in the equation by [⋅]

_{⋅, y}, that is, constructing a new vector (

*a*

_{1},

*a*

_{2}, …,

*a*

_{n}) from a matrix M such that

*a*

_{i}=

*M*

_{i, yi}∀

*i*.

As cryptic as it may be, this is a peculiar, but standard numpy way to index. For example, given a matrix

**M = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]]

**

If we wanted to extract one element from each row, say the 1st element from the first row, and the last from the others

**M[0,0], M[1,3], M[2,3]

**

We could write that as a single index expression by combining the indexes

**M[(0,1,2), (0,3,3)]

**

But now the first index is just (0…#rows − 1), or, in code, ** np.arange(M.shape[0])**.

So we can write:

```
M = np.array([[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12]])
M[np.arange(M.shape[0]), (0,3,3)]
```

`array([ 1, 8, 12])`

We're done with the model. There's one more thing we need to do, and that is specify the gradient updates

```
learning_rate = 0.5 # we tuned this parameter by hand
updates = [
[W, W + learning_rate * T.grad(LL, W)],
[b, b + learning_rate * T.grad(LL, b)]
]
```

## Theano functions

OK, we're done coding the model. What do we do next?

When working with Theano, the next step is to create a Theano function. A Theano function is the basic unit of Theano code that we call to do something. In our case, this something will be performing a single gradient ascent iteration.

We create a Theano function by calling the function ** theano.function** (yes, we create a function by calling a function).

**has four important parameters that we provide in order to get a Theano function:**

`theano.function`

`inputs`

-- the list of input variables of the Theano function, similar to the inputs of a Python function`outputs`

-- the list of output variables of the Theano function, similar to the inputs of a Python function`updates`

-- a list of updates for shared variables, in the format we used above when we defined the variable`updates`

`givens`

-- a dictionary that allows substituting some variables from the model with other variables

In our case, we want the input to be the training dataset, the updates to be the gradient ascent updates, and while we don't really need an output, it will be helpful to get the log-likelihood as an output to see that we are doing the right things.

However, we will use the ** givens** instead of the

**to provide the data to the function.**

`input`

Doing it this way is more efficient, as we've already loaded up the training dataset into memory as a shared Theano function, when we first loaded the data.

Our Theano function will look like this:

```
training_function = theano.function(
inputs = [], # use givens instead of the inputs as it's more efficient
outputs = LL, # output log-likelihood just to check that it is improving
updates = updates, # these are the gradient updates, the one part that's really important
givens = {X: training_x, # we indicate that the model variables X and y defined in the abstract
y: training_y} # should take the values in the shared variables training_x and training_y
)
```

OK, let's run ten iterations of our code and see what the log-likelihood does

```
for i in range(10):
current_LL = training_function()
print("Log-likelihood = " + str(current_LL) + "\t\t" +
"Average probability of the correct class = " + str(np.exp(current_LL))
)
```

```
Log-likelihood = -2.302585093 Average probability of the correct class = 0.0999999999998
Log-likelihood = -1.83912090251 Average probability of the correct class = 0.158957103494
Log-likelihood = -1.52505436796 Average probability of the correct class = 0.217609225573
Log-likelihood = -1.31424405922 Average probability of the correct class = 0.268677350658
Log-likelihood = -1.16824101592 Average probability of the correct class = 0.310913352197
Log-likelihood = -1.06176739247 Average probability of the correct class = 0.34584402773
Log-likelihood = -0.98204553767 Average probability of the correct class = 0.374544170518
Log-likelihood = -0.919423093595 Average probability of the correct class = 0.398749015602
Log-likelihood = -0.869394886438 Average probability of the correct class = 0.419205139229
Log-likelihood = -0.827688588441 Average probability of the correct class = 0.437058341404
```

## Using the model for classification

Great, it appears we're improving the log-likelihood of the model on the training data. Now to use it to classify. Recall our equation that expresses how to use the model to get the class:

$$\hat{\boldsymbol{\textrm{y}}} = {\arg\max} \ P(Y \ |\ \boldsymbol{\textrm{X}} ; \theta) $$

Let's put that in code:

`y_hat = T.argmax(P, axis=1)`

Note that we had to specify **axis=1**, that is, we want to get the **argmax** for each **row**, as each row represents the distribution for one datapoint.

### Create a Theano function that classifies a dataset

Similarly to the training function, the classification function will use givens to pass in the test dataset, and output ** y_hat** which we just defined

```
classification_function = theano.function(
inputs = [],
outputs = y_hat,
givens = {X:test_x} # don't need the true labels test_y here
)
```

Now let's run the classification once, and print the first ten images, the true labels, and the labels assigned by the model

```
test_y_hat = classification_function()
print ("Classification error: "+ str(100 * (1 - np.mean(test_y_hat test_y.get_value()))) + "%")
for i in range(10):
plot_mnist_digit(
test_x.get_value()[i] # test_x is a theano object of images
)
print ('Image class: \t\t' + str(test_y.get_value()[i])) # test_y is a theano object of *true labels*
print ('Model-assigned class: \t' + str(test_y_hat[i])) # test_y_hat is a theano object of *predicted labels*
```

`Classification error: 15.12%`

```
Image class: 7
Model-assigned class: 7
```

```
Image class: 2
Model-assigned class: 2
```

```
Image class: 1
Model-assigned class: 1
```

```
Image class: 0
Model-assigned class: 0
```

```
Image class: 4
Model-assigned class: 4
```

```
Image class: 1
Model-assigned class: 1
```

```
Image class: 4
Model-assigned class: 4
```

```
Image class: 9
Model-assigned class: 9
```

```
Image class: 5
Model-assigned class: 2
```

```
Image class: 9
Model-assigned class: 9
```

## Training a neural network

So far we have trained a logistic regression model. The neural network model is so similar that we can imlepement it with just a few changes to the code.

We need

- to declare the hidden layer variable - to decide on the size of the hidden layer (we'll keep it small so it will run on your personal computer) - new parameters

```
H = T.dmatrix() # Theano double matrix
hidden\_layer\_size = 20
W_xh = theano.shared(0.01 * np.random.randn(dims, hidden_layer_size))
W_hy = theano.shared(np.zeros([hidden_layer_size, n_classes]))
b_xh = theano.shared(np.zeros(hidden_layer_size))
b_hy = theano.shared(np.zeros(n_classes))
```

Remember our model? Let's write it using matrices and vector for an entire dataset:

$$\boldsymbol{\textrm{H}} = \tanh \left( \boldsymbol{\textrm{X}} \boldsymbol{\textrm{W}}_\boldsymbol{\textrm{xh}} + \boldsymbol{\textrm{1b}}^\boldsymbol{\textrm{T}}_\boldsymbol{\textrm{xh}}\right), \\
P(Y \ |\ \boldsymbol{\textrm{x}} ; \theta) = \textrm{SoftMax} \left( \boldsymbol{\textrm{H}} \boldsymbol{\textrm{W}}_\boldsymbol{\textrm{hy}} + \boldsymbol{\textrm{1b}}^\boldsymbol{\textrm{T}} _\boldsymbol{\textrm{hy}}\right)$$

Let's code it up:

```
H = T.tanh( T.dot(X, W_xh) + b_xh )
P = T.nnet.softmax( T.dot(H, W_hy) + b_hy )
```

Now let's add the gradient updates:

```
LL = T.mean(T.log(P)[T.arange(P.shape[0]), y]) # the log-likelihood (LL)
updates = [
[W_xh, W_xh + learning_rate * T.grad(LL, W_xh)],
[W_hy, W_hy + learning_rate * T.grad(LL, W_hy)],
[b_xh, b_xh + learning_rate * T.grad(LL, b_xh)],
[b_hy, b_hy + learning_rate * T.grad(LL, b_hy)],
]
```

### Redefining Log-loss

The one extremely important thing we did here was redefined ** LL**.

This is a crucial point about how Theano works:

Whenever we define a theano variable, like we did with**, we create a new object. When we define a new theano variable in terms of another Theano variable, like we did with**

`P`

**, using ****

`LL`

`LL = T.mean(T.log(P)[T.arange(P.shape[0]), y])`

** we implicitly create a new object for **`LL`

** that has a reference to our variable ** P** we just defined.

Now the crucial part: say we redefine ** P**. Then our variable

**still has a reference to the**

`LL`

*old*variable

**, and we need to update the reference to**

`P`

**by re-running the definition for**

`LL`

**for everything to work correctly.**

`LL`

Bugs in Theano are very commonly produced by exactly this. It is a good reason to always use Theano in scripts rather than in a notebook like we are here.

Phew, we are now ready to train!

```
training_function = theano.function(
inputs = [],
outputs = LL,
updates = updates,
givens = {X: training_x[:5000], # use only 10% of the data so model not too complicated
y: training_y[:5000]} # to train on a personal computer
)
for i in range(60): # train more than for logistic regression as this model is more complex
current_LL = training_function()
print(
"Log-likelihood = " + str(current_LL) + "\t\t" +
"Average probability of the correct class = " + str(np.exp(current_LL))
)
```

```
Log-likelihood = -2.30258509299 Average probability of the correct class = 0.1
Log-likelihood = -2.30123532532 Average probability of the correct class = 0.100135067902
Log-likelihood = -2.2998167966 Average probability of the correct class = 0.100277213167
Log-likelihood = -2.29812291581 Average probability of the correct class = 0.100447214752
Log-likelihood = -2.29590367206 Average probability of the correct class = 0.100670379142
Log-likelihood = -2.29281799209 Average probability of the correct class = 0.100981495471
Log-likelihood = -2.28837617693 Average probability of the correct class = 0.101431034253
Log-likelihood = -2.28186920166 Average probability of the correct class = 0.102093195481
Log-likelihood = -2.27228942655 Average probability of the correct class = 0.103075924982
Log-likelihood = -2.25826075997 Average probability of the correct class = 0.104532133214
Log-likelihood = -2.23801271431 Average probability of the correct class = 0.1066702782
Log-likelihood = -2.20944653423 Average probability of the correct class = 0.109761380875
Log-likelihood = -2.17036118024 Average probability of the correct class = 0.114136385658
Log-likelihood = -2.11893562814 Average probability of the correct class = 0.120159454814
Log-likelihood = -2.05448392668 Average probability of the correct class = 0.128158957933
Log-likelihood = -1.9781664624 Average probability of the correct class = 0.138322624675
Log-likelihood = -1.89308937088 Average probability of the correct class = 0.150605812178
Log-likelihood = -1.80356444392 Average probability of the correct class = 0.16471073844
Log-likelihood = -1.71390604967 Average probability of the correct class = 0.180160699808
Log-likelihood = -1.62743071587 Average probability of the correct class = 0.196433620113
Log-likelihood = -1.54612544874 Average probability of the correct class = 0.213071934689
Log-likelihood = -1.47089832383 Average probability of the correct class = 0.22971903039
Log-likelihood = -1.40197788868 Average probability of the correct class = 0.246109704628
Log-likelihood = -1.33918043475 Average probability of the correct class = 0.262060356155
Log-likelihood = -1.28205618089 Average probability of the correct class = 0.27746619282
Log-likelihood = -1.23000534858 Average probability of the correct class = 0.292291014336
Log-likelihood = -1.1823856608 Average probability of the correct class = 0.306546549485
Log-likelihood = -1.13858981262 Average probability of the correct class = 0.320270344716
Log-likelihood = -1.09808364787 Average probability of the correct class = 0.333509593518
Log-likelihood = -1.06041545036 Average probability of the correct class = 0.346311905034
Log-likelihood = -1.02521125894 Average probability of the correct class = 0.358720674451
Log-likelihood = -0.992165825303 Average probability of the correct class = 0.37077279169
Log-likelihood = -0.96103293577 Average probability of the correct class = 0.382497586412
Log-likelihood = -0.931615847613 Average probability of the correct class = 0.393916686506
Log-likelihood = -0.903757946709 Average probability of the correct class = 0.405044659855
Log-likelihood = -0.877333927776 Average probability of the correct class = 0.415890228318
Log-likelihood = -0.852241954255 Average probability of the correct class = 0.426457760591
Log-likelihood = -0.828397170186 Average probability of the correct class = 0.436748759536
Log-likelihood = -0.805726712165 Average probability of the correct class = 0.446763140352
Log-likelihood = -0.784166136418 Average probability of the correct class = 0.456500202015
Log-likelihood = -0.76365701148 Average probability of the correct class = 0.465959288932
Log-likelihood = -0.744145358272 Average probability of the correct class = 0.475140201108
Log-likelihood = -0.725580637598 Average probability of the correct class = 0.484043433528
Log-likelihood = -0.707915058331 Average probability of the correct class = 0.492670316262
Log-likelihood = -0.691103068952 Average probability of the correct class = 0.501023101114
Log-likelihood = -0.675100970071 Average probability of the correct class = 0.509105013643
Log-likelihood = -0.65986663098 Average probability of the correct class = 0.516920271045
Log-likelihood = -0.645359309369 Average probability of the correct class = 0.524474059803
Log-likelihood = -0.631539569966 Average probability of the correct class = 0.531772469537
Log-likelihood = -0.61836928705 Average probability of the correct class = 0.538822386202
Log-likelihood = -0.605811706762 Average probability of the correct class = 0.545631354182
Log-likelihood = -0.593831541869 Average probability of the correct class = 0.552207420302
Log-likelihood = -0.582395074029 Average probability of the correct class = 0.558558973142
Log-likelihood = -0.571470244429 Average probability of the correct class = 0.564694589
Log-likelihood = -0.561026720601 Average probability of the correct class = 0.570622892704
Log-likelihood = -0.551035933447 Average probability of the correct class = 0.576352438247
Log-likelihood = -0.541471083378 Average probability of the correct class = 0.581891611356
Log-likelihood = -0.532307117743 Average probability of the correct class = 0.587248554016
Log-likelihood = -0.523520683596 Average probability of the correct class = 0.592431109513
Log-likelihood = -0.515090060711 Average probability of the correct class = 0.597446785713
```

## Test the model

```
y_hat = T.argmax(P, axis=1)
test\_y\_hat = classification_function()
print ("Classification error: " + str(100 * (1 - np.mean(test_y_hat test_y.get_value()))) + "%")
for i in range(10):
plot_mnist_digit(
test_x.get_value()[i] # test_y is a theano object of *images*
)
print ('Image class: \t\t' + str(test_y.get_value()[i])) # test_y is a theano object of *true labels*
print ('Model-assigned class: \t' + str(test_y_hat[i])) # test_y_hat is a theano object of *predicted labels*
```

`Classification error: 15.12%`

```
Image class: 7
Model-assigned class: 7
```

```
Image class: 2
Model-assigned class: 2
```

```
Image class: 1
Model-assigned class: 1
```

```
Image class: 0
Model-assigned class: 0
```

```
Image class: 4
Model-assigned class: 4
```

```
Image class: 1
Model-assigned class: 1
```

```
Image class: 4
Model-assigned class: 4
```

```
Image class: 9
Model-assigned class: 9
```

```
Image class: 5
Model-assigned class: 2
```

```
Image class: 9
Model-assigned class: 9
```

## We've trained a neural network with Theano!

Along the way we used a good chunk of mathematics, a hand-full of tricks but required very few lines of code.

Hopefully this leads nicely to using Theano more, in particular we would be excited to see you set theano to work with: smarter training algorithms, parallelized training, and training more popular network architectures.

Good luck!

By Gabi Teodoru.