Applications for the ASI Data Science Fellowship are now open:

https://www.asidatascience.com/fellowship

We are always looking for new companies to partner with in our Fellowship program. For more information please email: info@asidatascience.com

Robert (or Rabbie) Burns was a Scottish poet whose corpus includes ‘Address to a Haggis’ and the New Year’s Eve favourite ‘Auld Lang Syne’. Each year on January 25th, people throughout the UK come together to celebrate his life, for a night that typically revolves around two of Scotland’s most famous culinary exports: haggis and whisky. This year, the ASI team decided to celebrate Burns Night in a creative manner: building a robot to produce Burns-esque poetry^{1}.

How can a machine generate meaningful text? You can think of a couple of ways to approach this problem. The first strategy is to start with a big collection of words - say all of the words that Burns ever used - and train the algorithm to pick collections of words that form plausible sentences. The second, more fine-grained option, is to generate words letter by letter. To do this, we have a collection of possible letters (A-Z), and we train the algorithm to select letters that produce real words. We do this over and over, and we end up with sentences.

In both cases, the problem boils down to one of probability. We want the machine to generate words and sentences which seem plausible, which is another way of saying that they have high probability. For example, the sentence ‘I ate some haggis’ has a higher probability than the sentence ‘carpet grab bag leg’. Similarly, the combination of letters arranged as ‘the’ is very probable; the combination ‘xyo’ is not very probable.

We can go one step further: in order to generate sentences that sound like they were written by Robert Burns, we train on sentences written by Burns. The algorithm thus learns to generate sentences that have a high probability in ‘Burns language’, which, as you’ll appreciate if you’ve read any of his poems, is not quite the same as normal English. In this manner, we could teach a machine to write like Burns, or Wikipedia, or Shakespeare.

So how do we actually do this? We need to specify a model by which the bot can learn what sentences are probable, and then produce them. Here we describe an approach using Markov chains (don’t worry, we unpack that term below); in a later blog post we will discuss how neural networks can provide a powerful alternative approach.

We used a Markov chain model to generate sentences in a word-by-word fashion. Despite being rather intimidatingly named, Markov models capture a simple idea: that you can work out the likely next word given the past few words. These past few words constitute a state. The model is trained by splitting up the training corpus into these ‘states’, and for each state noting what the likely next word is. By doing this repeatedly, we can generate sequences of plausible words.

For example, let’s assume we trawl our text and find that we have the phrases ‘the man jumped’ and ‘the man cooked’. If we define our state as ‘the man’, we can quickly see that the only possible next words are ‘jumped’ and ‘cooked’, which are each 50% likely. Other words, like ‘banana’ and ‘woman’, are impossible (that is, p=0). Note also that other words which might be plausible, like ‘ran’ or ‘awoke’, are also assigned p=0, because they didn’t appear in the training material. The model doesn’t know anything apart from what it saw in the data - no knowledge of semantics, or syntax, or grammar. It’s a rather ignorant model, but it still does a surprisingly good job! If you’d like to have a go at generating your own Markov model, you can do so here, courtesy of Daniel Ecer.
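The idea can be sketched in a few lines of Python. This is a toy from-scratch version, not the Markovify code we actually used, and the miniature corpus is invented:

```python
import random
from collections import defaultdict

def train_markov(corpus, state_size=2):
    """Map each state (a tuple of words) to the words observed after it."""
    words = corpus.split()
    model = defaultdict(list)
    for i in range(len(words) - state_size):
        state = tuple(words[i:i + state_size])
        model[state].append(words[i + state_size])
    return model

def generate(model, state, length=10):
    """Random walk: repeatedly sample a likely next word given the current state."""
    output = list(state)
    for _ in range(length):
        choices = model.get(tuple(output[-len(state):]))
        if not choices:
            break  # dead end: this state never appeared in training
        output.append(random.choice(choices))
    return ' '.join(output)

corpus = "the man jumped over the wall and the man cooked some haggis"
model = train_markov(corpus, state_size=2)
# From the state ('the', 'man'), the only observed next words are
# 'jumped' and 'cooked', each with probability 1/2.
print(sorted(model[('the', 'man')]))  # ['cooked', 'jumped']
```

Because next words are stored as a list with repeats, sampling uniformly from the list automatically reproduces the observed probabilities.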

The process of generating sequences of words is visualised below. To start our sentence, we pick a random state, ‘the man’, from the corpus. We then perform a random walk through the model. At each step, we choose from the likely next words according to their probabilities. So if a word is 90% likely to follow from a given state, we pick it 90% of the time in our random walk. Here ‘jumped’ and ‘cooked’ are equiprobable, and we end up choosing ‘cooked’. This gives us our next state, ‘man cooked’, and the process begins again, up to some specified sentence length. For this state, we’ve coloured the possible choices by their probability, with more opaque corresponding to higher probability. You can see that we actually ended up selecting the second most probable option (pork), generating the new state ‘cooked pork’.

We implemented this in Python using a neat open-source package called Markovify. Our corpus was provided by the impeccable Alberto Favaro, who scraped it from here. Using a state size of 3, we found that we could produce rather nice poems:

But I look to the North; But what is a watery grave?

Wha will crack to me my lovely maid.

Chorus.-Carle, an the King come.

What says she my dear, my native soil!

By specifying the first word of each sequence, one can also produce acrostics, spelling out words with the first letter of each line:

And here's the flower that I loe best,

So may no ruffian-feeling in my breast, I can feel, by its throbbings

In vain auld age his body batters, In vain the burns cam down like waters

The final (unnecessary) step was to place the code into a Slackbot, so that we could integrate it directly into ASI’s Slack channel. With the help of a nice guide and a bit of hacking, we ended up with our very own Burns bot, providing novel poetry on demand:

Over the next couple of posts, we’ll unpack how we packaged our algorithm into a working bot (and give you the opportunity to try out the Burns bot for yourself), and discuss more sophisticated approaches to language generation using a flavour of neural network built from Long Short-Term Memory (LSTM) units.

^{1} Being civilised and respectful types, we also drank some whisky.

*Setrak Balian, ASI Data Science, London*

Imagine a virtual world in which policymakers can try out different policies without having to suffer the real-world consequences.

ASI Data Science teamed up with the Department for Education (DfE) to build a simulated world of teachers and schools in order to investigate the effects of different policies on the long-standing issues of teacher recruitment and retention.

For any complex simulation, the ingredients can be boiled down to three. First, there must be actors (or variables) in the simulation. For the teacher world, there were two types: teachers and schools.

Second, there must be some interaction between the actors so that they don’t just act independently or at random. This ingredient drives collective changes in time (dynamics) and was modelled as teachers moving between different schools as well as in and out of the school system as a whole. Policies can then be applied to either certain teachers or to the system as a whole. The effectiveness of policies may then be assessed by observing the behaviour of teachers in time - i.e., how many of them move to a more attractive school? Do they move out of the school system?

Third, and most important of all, is data. Data provides the crucial link to the real world and gives credibility to any simulation. The data used was DfE’s extensive teacher database^{1}, which includes teacher-level information such as subjects and hours taught, age and gender, Ofsted scores of their schools, and much more.

Suppose every individual teacher were included in the simulation. It would be extremely difficult to keep track of them all. Unsupervised machine learning, a type of artificial intelligence, can help by reducing the entire teacher population to representative groups based on their historical characteristics in the database. The machine learning algorithm used by the ASI-DfE team reduced roughly 200,000 secondary teachers to 8 representative actors.

Eight actors are much easier to track than hundreds of thousands but that’s not the only advantage. This so-called clustering is also useful in creating a narrative: it uses data to segment teachers into distinct types. These types can then be used for effectively targeting policies.
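As an illustration of the clustering step only (this toy k-means sketch is not the actual DfE model, and the teacher features here are invented):

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Toy k-means: each point is a tuple of numeric features."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iterations):
        # Assign each point to its nearest centroid (squared Euclidean distance)
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Move each centroid to the mean of its cluster
        for j, cluster in enumerate(clusters):
            if cluster:
                centroids[j] = tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
    return centroids, clusters

# Invented features: (hours taught per week, years of experience)
teachers = [(30, 2), (32, 3), (31, 2), (12, 20), (10, 22), (11, 19)]
centroids, clusters = kmeans(teachers, k=2)
```

Each resulting centroid plays the role of one "representative teacher": an average of the group it summarises.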

The outcome was a prototype “teacher world” whose inhabitants were representative teachers identified using unsupervised machine learning on the DfE teacher database. Different policies can now be applied in the simulation and their effects on teacher behaviour compared. The clustering into eight may also prove valuable in targeting policies.

^{1} The Department for Education teacher database was anonymised prior to analysis, and the analysis was performed in a highly secure computing environment.

I’m making the move from particle physics to data science because I want my skills to have a more direct, positive impact on people. My project with ASI Data Science and Lambeth Council gave me that chance.

Rogue landlords in the UK illegally make £4.2 billion per year by forcing two million people to live in homes with severe hazards. These homes are often overcrowded, with a high risk of carbon monoxide poisoning, cockroach and rat infestations, faulty electrics, and mould and damp. Despite rogue landlords being well publicised in the media, last year there were fewer than five hundred convictions; this is, in part, due to the complex and lengthy process of gathering enough evidence for a prosecution. Rogue landlords build up a complex web of lies: they have many different properties, registered in many different names, associated with many different companies.

Over a hundred hours of evidence collection by expert fraud officers goes into investigating a single suspected rogue landlord. Starting with a name or address, investigators spend hour after hour manually searching databases to build up a profile. The results of their searches are written on reams of paper, which makes it difficult to spot the complex connections between people, companies, and properties.

If investigators had a way to visualise all the information in one place, thousands of hours per year could be saved, resulting in taxpayer savings as well as more convictions. I realised investigators needed a tool that was easy to use, quick to run, and accessible from anywhere. They needed a way to see the complex network, and paper lists are not the solution.

In six weeks, I designed and built a web application that automates a manual process which used to take 20 hours of expert time: within two minutes, a search returns the results as an interactive network graph for investigators to explore.

The biggest hurdle was accessing the data. I used publicly available APIs provided by Companies House and the Land Registry, but both required further authorisation. Let’s say I learned a great deal about bureaucracy - a skill which will no doubt be invaluable during my career as a data scientist!

After persevering and being granted access to the data, I could put my technical skills to work. I wrote recursive algorithms to repeatedly query the APIs, and used natural language processing to match records between the databases. I got to grips with NoSQL, storing the recursion results in a Neo4j graph database so that all the information can be easily retrieved when a search is repeated. For me, the biggest challenge was making a web app that looked as good as the results were. In particle physics we don’t care too much about making things that look good, just things that work.
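The recursive expansion can be sketched as follows. This is an illustration only: `fetch_neighbours` is a hypothetical stand-in for the real Companies House and Land Registry API calls, the entities are invented, and a plain dictionary plays the role of the Neo4j database:

```python
def expand(entity, fetch_neighbours, graph=None, depth=2):
    """Recursively query for connected entities, building an
    adjacency-list graph and stopping at a fixed depth."""
    if graph is None:
        graph = {}
    if depth == 0 or entity in graph:
        return graph  # depth limit reached, or entity already visited
    graph[entity] = fetch_neighbours(entity)
    for neighbour in graph[entity]:
        expand(neighbour, fetch_neighbours, graph, depth - 1)
    return graph

# Hypothetical stand-in for an API returning linked entities
links = {
    'Landlord A': ['Company X', 'Company Y'],
    'Company X': ['Landlord A', '1 High St'],
    'Company Y': ['Landlord A'],
    '1 High St': ['Company X'],
}
graph = expand('Landlord A', lambda e: links.get(e, []), depth=3)
```

Starting from one name, the recursion pulls in companies, then the properties and people linked to those companies, which is exactly the web of connections investigators need to see in one place.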

Lambeth Council love the product and are now using it in all fraud investigations. I’m looking forward to finding my next project as a data scientist.

Sam Short took part in the ASI Fellowship September 2016.

Talent, not money, is the biggest constraint facing companies that want to make the most of their data. Most organisations already have huge amounts of data, and an Accenture survey published in 2015 showed that 73% of companies spend more than 20% of their overall technology budget on big data analytics. Even with these resources, many are still unable to use their data effectively to drive decisions. As Experian concluded last summer, most companies are ‘drowning in data, but starving for insights’.

How should you solve the talent constraint? At ASI Data Science, we think we’ve cracked the problem. Since 2014, we have run Europe’s leading programme for top quantitative PhDs, postdocs and software engineers transitioning to data science. This is an eight-week bootcamp where Fellows work directly with sponsoring companies on real data science projects. For companies, this offers an outstanding opportunity to undertake an exploratory data science project at low risk. Then, after seeing the Fellow in a real-world environment, they can choose whether to make the Fellow a job offer. It’s the perfect hiring pipeline for some of the world’s most talented people.

This January marks our seventh ASI Fellowship. Our fellows have completed over 100 projects with firms ranging from seed-stage startups to large multinationals. The ASI Fellowship programme is unique in many ways. Almost 10% of the UK’s STEM subject PhDs apply to join our programme every year, which is remarkable! The quality of talent is exceptional - getting a place is highly competitive and only 1 in 10 qualified applicants will get an offer. And all of this costs less than a typical recruiter’s fee for hiring a data scientist.

Here is a diagram to show how the fellowship works:

But don’t just listen to us. Here is what some of our project companies think:

“The ASI fellowship was a fantastic opportunity for easyJet to experiment with data science in a low risk manner. We hired the Fellows that were working with us, and they have been the foundation of the data science team at easyJet. This is a unique programme that has greatly benefitted easyJet!”

Alberto Villaverde, Head of Data Science at easyJet

“I’ve highly recommended the Fellowship to others. It’s a great way to find very bright Data Scientists and complete well defined Data Science projects.”

Harvinder Atwal, Head of Data Strategy and Advanced Analytics at MoneySuperMarket

"My experience with the ASI and the Fellowship was fantastic - it helped us understand the power of data science, and set the context for ideas in the area for the next phase of work. I've continued to work with ASI, in a variety of capacities, and found that their combination of technical expertise and experience with transformation management is unrivalled."

Phillip Sinclair, Head of Innovation and Growth, Cabinet Office

“I would recommend the Fellowship, particularly to companies that want to get into data science but don’t have existing teams. Hiring data scientists is very difficult due to the churn in the market and the novelty of the technology, and ASI does a lot to help along that journey.”

Gwyn Jones, Data Science Lead, QuintilesIMS

"I would definitely recommend the Fellowship. It's hard to find great talent and the Fellowship is very helpful for hiring - every Fellow you've selected was worth meeting with."

Mait Muntel, CEO at Lingvist

"Being on the Fellowship pushed us out of our comfort zone and let us learn more about data science and its potential value to us. I have already recommended it to others."

Rauno Ringas, CEO at Qminder

"I would absolutely recommend the Fellowship to others. In fact, I already have. If you have some data and an idea that is just out of your reach, this is the programme for you."

Keith Whatling, Analytics Lead at Arriva (a DeutscheBahn company)

"The Fellowship was very well organised and run, every interaction we had with you, your team or the Fellow was positive and we are very impressed with your company and the Fellowship. I would absolutely recommend."

Dave Crane, founder of Smoke Free

Artificial intelligence could transform the public sector, but adoption has been slow. As the Government Office for Science publishes its report into the subject, Richard Sargent, director of ASI Data Science and former performance lead at the Government Digital Service, says leadership and culture will be crucial for success.

Read the full article here: http://publictechnology.net/articles/opinion/changing-culture-what-government-must-do-make-most-ai

We're excited to be in the Sunday Times and City AM this week on being backed by some of the smartest (and data-savviest) investors in Europe.

Read more about it here:

http://www.cityam.com/253029/ai-firm-asi-data-science-raises-15m-investment-backing

*“Only someone who understands something absolutely
can explain it so no one else can understand it.”*

-- Rudnicki's Nobel Prize Principle

Academic research is wonderful and absolutely fundamental to the progress of humanity in every field. I used to be a researcher myself, and I remember the frustration when, after talking with excitement about the incredible results achieved in my field (not by me, but by other amazing researchers), the people I was talking to would ask the feared question, in an unimpressed tone: “Cool, but what is that useful for?”.

When I decided to leave academia, I joined ASI Data Science, and I think the transition has been smooth: the best part of my current job is that I’m required to keep up to date with the latest research trends and to keep studying and experimenting. That said, after moving to industry, it quickly became clear to me that I needed to change the way I worked.

The main area that I needed to improve was communication. A couple of years ago, I remember wondering why I had to spend so much time writing a paper or worrying about presentations instead of conducting more experiments and getting more results. But after having spent time on this side of the barrier, it is now much clearer to me how important it is to communicate results in the right way so that others can understand them.

Industry needs academic research. But research needs industry too, and I am very sorry to see so many missed opportunities due to a lack of communication.

Academic talks are usually incredibly dense: full of content, but also often inscrutable and impossible for others to understand. I see slide after slide crammed full of words, formulas and plots, none of them designed to convey a clear message. It’s almost as if whoever put the slides together thought: “Well, they won’t understand anyway, so why bother trying?” From the other side, I know the real reason is more likely that the researcher doesn’t think they have the time to be bothered with these kinds of details, but this attitude towards presentations is a real put-off for audiences, whether technical or not.

I’m not talking about giving amazing and inspirational talks. I’m talking more about covering the basics and caring about the audience who will be listening. Who are they? What’s their background? What are they interested in? And simple things to fix such as: are my plots readable? Are the labels clear and big enough to be seen? Is there a legend that is legible or should I explain all of the details so my audience will understand, or perhaps even both? And how long will it take for a human to read and understand all of the words I put in this slide whilst I talk over it?

Of course, the content of your presentation won’t be accessible to everyone, especially if you’re talking about cutting-edge research. However, an audience of other academics will have the tools and ability to understand what you are talking about if you put them in a position to do so. And this applies to other audiences too. If you can communicate your research at a level appropriate for industry, it is much more likely to be picked up by companies, leading to possible collaborations and funding.

The output of research is not only the results you get to write in a paper; it also includes the way the results are disseminated, shared with other researchers and put into practice in the real world. Academia and industry should go hand in hand, and my hope is that even a little bit more effort in trying to speak a similar language will help reduce the gap between them. If this does not happen, we will keep on living in two different worlds separated by a glass wall, and cutting-edge research will take much longer to find practical applications, leading to much slower progress. And that would be a shame for everyone.

It's been over half a century since a record audience of 32.3 million people watched England's triumph over West Germany in the 1966 World Cup final, and Britain remains a nation of avid TV watchers. According to Ofcom’s most recent statistics, 95% of households contain a working television set and Britons spend an average of over three and a half hours per day parked in front of the box.

In an age when broadcast TV and on-demand programming are colliding on the same device, how can providers like Sky strike a balance between these two very different types of content?

Maybe Saturday night is movie night, but on weekday afternoons your children simply want endless on-demand repeats of Peppa Pig. Wouldn't it be great if someone could work out these patterns and offer you the right content at the right time, without forcing you to trawl through the same menus every week?

That’s exactly the problem I was tasked with solving on my project with Sky on the ASI Fellowship. This specific problem also proved to be a great example of a much more general question: How do you identify rare events when your data is really unbalanced?

How unbalanced? Well, despite the recent explosion in on-demand content, it turns out that people still spend about two thirds of their time watching ordinary live TV. Compare that to the less than 2% of time spent carrying on with a boxset they’re already part-way through.

The simplest recommender identifies the most popular options and suggests them every time. The catch is that a system which answers every question with “Why not watch BBC1?” is a bit like hiring the Ninja Turtles as wedding caterers: sure, everyone likes pizza, but nobody’s going to be impressed when they turn up expecting a banquet.

Instead I chose a Naive Bayesian classifier. This type of model relies on the probabilities that the data displays certain features given that the user took a specific action. The challenge then becomes one of feature engineering: can we identify a set of variables from which separations between the classes naturally emerge?

In the case of television, I found the trick was to exploit timing patterns in people’s viewing habits. Live broadcasters are clearly tied to a schedule, but so too are on-demand viewers who bookend their Breaking Bad boxset binges between work and other commitments.

Because of this, the features I chose were time-based, focusing on the viewer’s actions at a number of previous time points ranging from an hour to a week ago. In this way, someone who records a programme while they’re at work and watches it when they get home every night at 9pm will have a history which reflects that behaviour. Separating out actions with this specific history therefore filters out a large fraction of the live TV watches that might otherwise overwhelm the prediction.
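A minimal sketch of this kind of classifier, with Laplace smoothing and invented time-lagged viewing features (an illustration of the technique, not the production model):

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Fit a categorical Naive Bayes: class priors plus, per class and
    per feature position, the counts of each observed feature value."""
    priors = Counter(labels)
    counts = defaultdict(Counter)  # (class, feature_index) -> value counts
    for features, label in zip(rows, labels):
        for i, value in enumerate(features):
            counts[(label, i)][value] += 1
    return priors, counts

def predict_nb(priors, counts, features):
    """Score each class by prior * product of smoothed per-feature likelihoods."""
    total = sum(priors.values())
    best, best_score = None, -1.0
    for label, prior in priors.items():
        score = prior / total
        for i, value in enumerate(features):
            seen = counts[(label, i)]
            # Laplace smoothing so unseen values don't zero out the score
            score *= (seen[value] + 1) / (sum(seen.values()) + 2)
        if score > best_score:
            best, best_score = label, score
    return best

# Invented training data: (watching_1h_ago, watching_24h_ago) -> action taken
rows = [('live', 'live'), ('live', 'boxset'), ('off', 'boxset'), ('off', 'boxset')]
labels = ['live', 'live', 'boxset', 'boxset']
priors, counts = train_nb(rows, labels)
print(predict_nb(priors, counts, ('off', 'boxset')))  # 'boxset'
```

Even with the heavy class imbalance towards live TV, a distinctive feature history (off an hour ago, mid-boxset a day ago) can tip the prediction towards the rare class.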

In this way, by reflecting on the structure of the data and engineering features to match these patterns, I was able to cut a path through the jungle of noise and spot the rare beasts lurking within.

**Andy Perch (Fellowship VI)**

Andy's data science career began with a PhD in particle physics at UCL, which involved using neural networks to identify rare electron neutrino interactions against a large background of other particles.

For his Fellowship project, Andy worked at Sky to develop a model that predicts users' viewing intentions based on their past behaviour, allowing for more personalised programme recommendations to be made.

Join us for the last Meetup of 2016!

7th December

**Data Science Lab: Understanding Data & Graphs**

An evening with three data scientists who have used graphs in their work on both theory and practical applications.

Sheldon Hall is a software engineer working at Grakn Labs, where a distributed knowledge graph is being developed. His current role involves research and development of the company's graph-based knowledge representation system and analytics engine. He has a master’s degree in mathematical modelling and scientific computing, a doctorate in nuclear engineering, and is a previous ASI Fellow. Other speakers to be announced.

Details & RSVP here: https://datasciencegraph.eventbrite.co.uk

]]>I was never able to pull off a wheelie with the bike, or to do a walk-the-dog with the yoyo. As many of us have learned at the playground, not knowing these tricks usually means that you won’t be able to hang around with the bigger kids. Not even if you want it so, so much. Fast-forward a few decades and, as it happens, you are looking for a job in data science. The bigger kids have transformed into companies that do awesome things like convolutional neural networks and reinforcement learning. The playground logic, however, still generally applies. Unless you know the right programming stunts, you are unlikely to ace interviews with those amazing companies. And what’s worse, you still can’t pull off a wheelie.

An essential coding trick to learn in preparation for data-science job interviews is sorting algorithms. If you are not familiar with these, and are looking for a gentle introduction, you have landed in the right blog post. Hopefully the code will not show it, but I am a beginner too. Indeed, this blog post originated as an exercise in which I challenged myself to program every algorithm described in 0612’s YouTube video series *Sorting Algorithms Redux*. Encouraged by ASI Data Science, I turned that piece of work into a brief tutorial, adding some text and carefully revising the code. I have learned a lot both from 0612’s videos and from several comments offered by Pascal Bugnion on how to correctly implement the algorithms. Let me take the opportunity to thank them, and to inform the reader that any mistake is entirely my own. If you are getting ready for coding interviews, this blog post should be useful, particularly if you are aiming high!

Two side-comments are in order. (i) The source (Jupyter) notebook for this blog post can be downloaded here. (ii) As I have avoided using further packages, the code relies purely on built-in functions. The only exception is NumPy, Python's main package for scientific computing, which I employed for testing the output of the algorithms.

```
# NumPy will be used only to test the code
import numpy as np
```

```
def argmin(array):
    # Return the index of the smallest element in the array
    arg = 0
    for i in range(1, len(array)):
        if array[i] < array[arg]:
            arg = i
    return arg

def selection_sort(array):
    n = len(array)
    for i in range(n):
        j = argmin(array[i:n]) + i
        array[i], array[j] = array[j], array[i]
    return array
```

Input:

```
selection_sort([]) == []
```

Output:

```
True
```

Input:

```
array = list(np.random.randint(100, size=120))
np.array_equal(selection_sort(array), np.sort(array))
```

Output:

```
True
```

```
def bubble_sort(array):
    n = len(array)
    for i in range(n):
        changed = False
        for j in range(n-i-1):
            if array[j] > array[j+1]:
                array[j], array[j+1] = array[j+1], array[j]
                changed = True
        if not changed:
            break
    return array
```

Input:

```
bubble_sort([]) == []
```

Output:

```
True
```

Input:

```
array = [i for i in range(10)]
bubble_sort(array) == array
```

Output:

```
True
```

Input:

```
array = list(np.random.randint(100, size=120))
np.array_equal(bubble_sort(array), np.sort(array))
```

Output:

```
True
```

```
# Sort while moving forward or backward
def partly_sort(array, i, forward):
    n = len(array)
    if forward:
        # Range of forward traverse
        interval = range(i, n-i-1)
    else:
        # Range of backward traverse
        interval = reversed(range(i, n-i-2))
    # Partly sort
    changed = False
    for j in interval:
        if array[j] > array[j+1]:
            array[j], array[j+1] = array[j+1], array[j]
            changed = True
    return (array, changed)

# Cocktail sort
def cocktail_sort(array):
    n = len(array)
    for i in range(n):
        (array, changed) = partly_sort(array, i, True)
        if not changed:
            break
        (array, changed) = partly_sort(array, i, False)
        if not changed:
            break
    return array
```

Input:

```
cocktail_sort([]) == []
```

Output:

```
True
```

Input:

```
array = [i for i in range(10)]
cocktail_sort(array) == array
```

Output:

```
True
```

Input:

```
array = list(np.random.randint(100, size=120))
np.array_equal(cocktail_sort(array), np.sort(array))
```

Output:

```
True
```

```
def insertion_sort(array):
    n = len(array)
    for i in range(n-1):
        j = i
        while (j+1 > 0) and (array[j] > array[j+1]):
            array[j], array[j+1] = array[j+1], array[j]
            j -= 1
    return array
```

Input:

```
insertion_sort([]) == []
```

Output:

```
True
```

Input:

```
array = list(np.random.randint(100, size=120))
np.array_equal(insertion_sort(array), np.sort(array))
```

Output:

```
True
```

```
def bucket_sort(array, mini, maxi, b, basic_sort=bubble_sort):
    n = len(array)
    # Safety checks
    if n == 0:
        return []
    elif n < b:
        raise ValueError('Number of buckets exceeds length of array.')
    # Bucketing
    step = float(maxi - mini)/b
    buckets = [[] for i in range(b)]
    result = []
    for value in array:
        if value == maxi:
            # Range for last bucket includes both endpoints
            j = b - 1
        else:
            # Ranges for other buckets include left endpoint only
            j = int((value - mini)/step)
        buckets[j].append(value)
    # Sorting and recombining
    for bucket in buckets:
        result.extend(basic_sort(bucket))
    return result
```

Input:

```
mini = 0
maxi = 20
array = np.random.randint(0, 21, size=20)
np.array_equal(bucket_sort(array, mini, maxi, 10), np.sort(array))
```

Output:

```
True
```

Input:

```
mini = 0.
maxi = 20.
array = np.random.uniform(mini, maxi, size=20)
np.array_equal(bucket_sort(array, mini, maxi, 10), np.sort(array))
```

Output:

```
True
```

```
# Get the nth Fibonacci number
def get_fibonacci(n):
    if n <= 2:
        return 1
    else:
        return get_fibonacci(n-1) + get_fibonacci(n-2)
```

Test the function `get_fibonacci(n)`. Input:

```
[get_fibonacci(i) for i in range(1, 10)]
```

Output:

```
[1, 1, 2, 3, 5, 8, 13, 21, 34]
```

Even though `get_fibonacci(n)` illustrates well the mechanics of recursion, it is not an efficient way of generating Fibonacci numbers. One can verify this using the diagram below, which lists the function calls `f(n)` = `get_fibonacci(n)` required to compute the 6th Fibonacci number. It is evident that many quantities are calculated repeatedly; for example, the whole subtree leading to `f(4)` is evaluated twice. These unnecessary operations are wasteful and make `get_fibonacci(n)` very inefficient. As a matter of fact, it is easy to deduce from the structure of the diagram that the function has exponential time complexity.

It may now look like recursion ought to be abandoned as a technique for calculating Fibonacci numbers. This impression, however, turns out to be mistaken. Indeed, fixing the problems that weigh `get_fibonacci(n)` down results in an efficient algorithm, whose time complexity is `O(n)`. As demonstrated in `get_fibonacci_memo(n)` below, the key is to store the intermediate function values as they are calculated. To be specific, the dictionary `lookup` keeps track of all previously encountered Fibonacci numbers, making sure that no time is wasted in computing these again. Notably, the trick of saving partial results is called memoization, and has a central role in dynamic programming.

```
lookup = {}
def get_fibonacci_memo(n):
    if n <= 2:
        return 1
    elif n in lookup:
        return lookup[n]
    else:
        lookup[n] = get_fibonacci_memo(n - 1) + get_fibonacci_memo(n - 2)
        return lookup[n]
```

Test the function `get_fibonacci_memo(n)`. Input:

```
[get_fibonacci_memo(i) for i in range(1,10)]
```

Output:

```
[1, 1, 2, 3, 5, 8, 13, 21, 34]
```

Memoization reduces time complexity by increasing the space complexity of the algorithm: a speedup is obtained at the cost of the memory occupied by `lookup`. This can be regarded as a drawback of memoization, but not in all cases. If the program calculates Fibonacci numbers frequently, having the `lookup` table becomes an advantage: every time `get_fibonacci_memo(n)` is called (for a new value of `n`), the program can refer to the available intermediate results and thus answer more quickly. These remarks are valid beyond the current example with Fibonacci numbers, and apply to memoization as a general coding technique.
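For everyday use, Python's standard library already provides memoization via the `functools.lru_cache` decorator; here is a minimal sketch equivalent to `get_fibonacci_memo(n)`:

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # cache every result, playing the role of the lookup dictionary
def fib(n):
    if n <= 2:
        return 1
    return fib(n - 1) + fib(n - 2)

print([fib(i) for i in range(1, 10)])  # [1, 1, 2, 3, 5, 8, 13, 21, 34]
```

The decorator handles the cache bookkeeping automatically and avoids the explicit global dictionary.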

In the above code, you may have spotted that, even though `lookup` is defined outside `get_fibonacci_memo(n)`, it is accessed and modified inside the function. The dictionary `lookup` is a global variable, visible from any location in the program, including from inside a function, where variables are normally local. The use of global variables carries many dangers: because they can be modified from any line of code, unwanted or incorrect changes become more likely and harder to track down. A common and useful piece of advice is to steer clear of global variables where possible. These arguments motivate the following variant of the memoized algorithm, in which a class encapsulates `lookup`, so that no global variables are used.

```
class Fibonacci(object):
    def __init__(self):
        self._lookup = {}
    def __call__(self, n):
        if n <= 2:
            return 1
        elif n in self._lookup:
            return self._lookup[n]
        else:
            self._lookup[n] = self.__call__(n - 1) + self.__call__(n - 2)
            return self._lookup[n]
```

Test the class `Fibonacci()`. Input:

```
fn = Fibonacci()
[fn(i) for i in range(1, 10)]
```

Output:

```
[1, 1, 2, 3, 5, 8, 13, 21, 34]
```

```
def partition(array):
    n = len(array)
    l = 0
    r = n - 1
    # pointer tracks which end advances at each step
    pointer = (1, 0)
    while l < r:
        if array[l] > array[r]:
            array[l], array[r] = array[r], array[l]
            pointer = pointer[::-1]
        l += pointer[0]
        r -= pointer[1]
    return (array, l)

def quick_sort(array):
    n = len(array)
    if n > 1:
        (array, p) = partition(array)
        array[:p] = quick_sort(array[:p])
        array[p+1:] = quick_sort(array[p+1:])
    return array
```

Input:

```
quick_sort([]) == []
```

Output:

```
True
```

Input:

```
array = list(np.random.randint(100, size=120))
np.array_equal(quick_sort(array), np.sort(array))
```

Output:

```
True
```

```
def get_combo(left, right):
    combo = []
    i = 0
    j = 0
    # Both left[i:] and right[j:] contain elements
    while left[i:] and right[j:]:
        if left[i] < right[j]:
            combo.append(left[i])
            i += 1
        else:
            combo.append(right[j])
            j += 1
    # Either left[i:] or right[j:] is empty
    combo.extend(left[i:])
    combo.extend(right[j:])
    return combo
```

Input:

```
get_combo([],[]) == []
```

Output:

```
True
```

Input:

```
get_combo([0,1],[]) == [0,1]
```

Output:

```
True
```

Input:

```
get_combo([],[0,1]) == [0,1]
```

Output:

```
True
```

Input:

```
get_combo([1, 2, 3, 4, 5, 6, 7], [1, 2, 4, 5, 6])
```

Output:

```
[1, 1, 2, 2, 3, 4, 4, 5, 5, 6, 6, 7]
```

```
def merge_sort(array):
    n = len(array)
    if n > 1:
        i = int(n/2)
        left = merge_sort(array[:i])
        right = merge_sort(array[i:])
        return get_combo(left, right)
    else:
        return array
```

Input:

```
merge_sort([]) == []
```

Output:

```
True
```

Input:

```
array = list(np.random.randint(100, size=120))
np.array_equal(merge_sort(array), np.sort(array))
```

Output:

```
True
```

**Opening image**: J.L. Haven and C. Hettrick, "Bandelore," U.S. Patent 59,745, Nov. 20, 1866. **Video lectures**: 0612, Educational technologist, Singapore. **Folk-dance video**: Algo-rythmics, Sapientia University, Târgu Mureș, Romania. **Coding**: Alberto Favaro with advice from Pascal Bugnion.

Data Science has burst on the scene as a new employment area where demand for candidates outstrips supply. It is also a field that values PhD candidates – particularly from quantitative subjects. If you're wondering whether the academic path is right for you, but cannot imagine wasting the technical skills you've developed in academia, you might want to consider a career in Data Science, 'the sexiest job of the 21st century' according to Harvard Business Review.

Data science is the set of skills and techniques necessary to find, store, process and draw insights from the vast quantities of data now available.

In your PhD, you received amazing training in 90% of the skills required to be a data scientist - things like collecting and interpreting data, testing hypotheses and communicating results. All you need to excel in an exciting new career is to acquire the remaining 10% - the industry-specific skills. This is where the ASI can help.

The ASI offers an 8-week full-time Fellowship, developed in collaboration with industry, specifically aimed at helping PhD graduates and Postdocs in science, maths and engineering transition to a career in data science and data engineering. Thanks to generous support, we are able to provide the course free of charge for accepted Fellows.

This is our 7th Fellowship. Since starting in 2014, we have helped 100+ Fellows transition to careers in data science. ASI Fellowship VII starts on 16th January 2017. Applications are open until 31st October 2016 and Fellows are accepted on a rolling basis. Apply here.

We will be hosting a series of events on careers in data science, and answering any questions about the ASI Fellowship.

- Cambridge, 2-4pm, 30th September, Mill Lane Lecture Rooms
- Imperial College London, 13th October, Sign up here
- UCL, 19th October, Sign up here

We will also be hosting a webinar on October 12th.

**Transforming Large Organisations**

3rd Oct

Richard Sargeant, Director of Transformation at ASI Data Science and founding director of the Government Digital Service, will be hosting a fireside chat with current and past fellows about the work he's been doing in government and Google. Places are strictly limited so please message info@asidatascience.com directly for an invite.

**Plotly x Data Science Lab: Visualisations with Scala**

20th Oct

Run in collaboration with our friends at Plotly, come hear three superb speakers on their experiences using Scala, how to build a Scala client, and tips on data visualisations with Scala. Speakers include Pascal Bugnion, author of Scala for Data Science. More details and RSVP to come.

**See us at**

We'll also be speaking at a number of great conferences this October. Come say hello at:

- IP EXPO, London, Oct 5-6th
- ODSC, London, Oct 7-8th
- Intelligent Data Processing, Barcelona, Oct 10-11th

**Training neural networks** involves quite a few tricky bits. We try to make everything clear and easy to understand, to get you training your neural networks as quickly as possible.

**Theano** allows us to write relatively concise code that follows the structure of the underlying maths. To run the code yourself, download the notebook at https://github.com/ASIDataScience/training-neural-networks-notebook

We will train a network to classify digits. More precisely, we want a network that, when presented with an image of a hand-written digit, will tell us what digit it is (`'0'`, `'1'`, ..., `'9'`). The data we will use for this task is known as the MNIST dataset. It has a long tradition in neural networks research, as the dataset is quite small but still tricky to classify correctly.

You can download the MNIST dataset which we are using.

You can install Theano with the Python package manager `pip`. At the command line, type

`pip install theano`

or check out the Theano documentation if you run into trouble.

The `theano.tensor` module contains useful functionality for manipulating vectors and matrices, like we will be doing here, so let's import it along with the full package.

```
import theano
import theano.tensor as T
import numpy as np
```

With the data loaded into a variable `mnist_dataset`, we split it into three parts: a training set, used to teach the network to recognize digits; a validation set, which could be used to tune and compare models; and a test set, used to see how well the network has learned.

We then prepare the datasets, splitting each set into the images and their labels, and store them in Theano shared variables, a bit of theano magic that's explained later. Understanding this massaging of the data isn't crucial.

```
train_set, valid_set, test_set = mnist_dataset
prepare_data = lambda x: (theano.shared(x[0].astype('float64')),
theano.shared(x[1].astype('int32')))
(training_x, training_y), (valid_x, valid_y), (test_x, test_y) = map(prepare_data, (train_set, valid_set, test_set))
```

Let's look at the first three images in the training set, then set about building a machine learning model that has the ability to recognise digits itself!

We've defined a helper function `plot_mnist_digit` for displaying the training images - it's just for prettiness.

```
for i in range(3):
    plot_mnist_digit(training_x.get_value()[i])              # training_x is a theano object containing *images*
    print('Image class: ' + str(training_y.get_value()[i]))  # training_y is a theano object containing *labels*
```

`Image class: 5`

`Image class: 0`

`Image class: 4`

The neural network model that we will build defines a probability distribution

*P*(*Y* = *y* | *X* = x; *θ*),

where *Y* represents the image class (a random variable that can take the values `0-9`), *X* represents the image pixels and is a vector-valued random variable (we collapse the image matrix into a vector), and *θ* is the set of model parameters that we are going to learn.

In this tutorial we build two models: first implementing a logistic regression model, then extending it to a neural network model.

The equation for our first model is given by

*P*(*Y* = *y* | x; *θ*) ∝ [*σ*(x^{T}W+b^{T})]_{y},

where [x]_{i} is the *i*th entry of vector x, and the use of the proportionality symbol ∝ means that the probability is equal to the expression on the right hand side times a constant chosen such that ∑_{y}*P*(*Y* = *y* | x; *θ*) = 1

The parameter set *θ* for this model is *θ* = {W, b}, where W is a matrix and b is a vector. We also use the non-linearity *σ* given by

$$\sigma(t) = \frac{1}{1+e^{-t}}$$

When applied to a vector or matrix, the sigmoid function *σ*(*t*) is applied entrywise.
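The sigmoid and its entrywise application are easy to check in plain NumPy (a small illustration, not part of the model code):

```python
import numpy as np

def sigmoid(t):
    # 1 / (1 + e^(-t)), applied entrywise to arrays
    return 1.0 / (1.0 + np.exp(-t))

print(sigmoid(np.array([-2.0, 0.0, 2.0])))  # sigmoid(0) = 0.5, and sigmoid(-t) = 1 - sigmoid(t)
```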

One way to think about the logistic regression model is that it takes the input (x), puts it through a linear combination (x^{T}W + b^{T}) and then finally through a non-linearity: *σ*(x^{T}W + b^{T}).

The result of this operation is a vector representing the entire discrete distribution over all the possible classes - in our case the ten possible digits `0-9`. To get the probability of a particular class, say *y* = 6, we extract the 6th entry of the probability vector: [*σ*(x^{T}W+b^{T})]_{y}.

Graphically the model looks like this:

Each individual entry of the vectors **x** and *P*(*Y*) is shown as a circle -- these are the units (or artificial *neurons*) of the network. We have *D* input units (the dimensionality of the input vector, which is the flattened matrix of pixels) and *C* output units (the number of classes, the digits `0-9`). The model parameters **W** and **b** are represented as arrows. We also show the application of the sigmoid functions, but we do not represent the normalization that makes the probabilities sum up to 1.

Another way to write the model above is using the SoftMax function. A good exercise is deriving the SoftMax function based on the fact that we can also write the same model using SoftMax and the equal sign instead of the proportionality sign:

*P*(*Y* = *y* | x; *θ*) = [SoftMax(x^{T}W + b^{T})]_{y}.
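As the exercise suggests, SoftMax simply exponentiates and then normalizes; a plain-NumPy sketch (subtracting the row-wise maximum is a standard numerical-stability trick, an implementation detail not in the text):

```python
import numpy as np

def softmax(Z):
    # subtract the row-wise max for numerical stability; this does not change the result
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

Z = np.array([[1.0, 2.0, 3.0],
              [0.0, 0.0, 0.0]])
P = softmax(Z)
print(P.sum(axis=1))  # each row is a proper distribution summing to 1
```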

Our second model is given by

$$P(Y = y \ |\ \boldsymbol{\textrm{x}} ; \theta) = \left[ \textrm{SoftMax} \left( \boldsymbol{\textrm{h}} ^\boldsymbol{\textrm{T}} \boldsymbol{\textrm{W}}_\boldsymbol{\textrm{hy}} + \boldsymbol{\textrm{b}}^\boldsymbol{\textrm{T}} _\boldsymbol{\textrm{hy}}\right)\right]_y, \\
\boldsymbol{\textrm{h}} = \tanh \left( \boldsymbol{\textrm{x}} ^\boldsymbol{\textrm{T}} \boldsymbol{\textrm{W}}_\boldsymbol{\textrm{xh}} + \boldsymbol{\textrm{b}}^\boldsymbol{\textrm{T}}_\boldsymbol{\textrm{xh}}\right)$$

This is very similar to the logistic regression model. Here we have introduced a new vector-valued random variable *h*. We call this a 'hidden' variable or 'latent' variable, as we do not have any data observations for it. This variable may not even correspond to any quantity in the real world, but we use it to increase the power of our statistical model.

We now also have more parameters: *θ* = {W_{xh}, W_{hy}, b_{xh}, b_{hy}}.

tanh is the hyperbolic tangent function given by

$$\tanh(t) = \frac{e^t-e^{-t}}{e^t+e^{-t}}$$

As with the sigmoid, when applied to a vector or matrix, the tanh function is applied entrywise.

Graphically, this model looks like this:

Now our depiction shows a hidden layer with *M* units (this number can differ from both the number of input neurons and the number of output neurons), and we have two different nonlinearities in the graph: tanh and sigmoids (but again we are not graphically representing the SoftMax normalization).

The way we're going to make our network learn is by trying to find values for the parameters *θ* so that the network is as likely as possible to guess the correct class. That is, given a data set of training images x_{1}, x_{2}, …, x_{N} and correct labels *y*_{1}, *y*_{2}, …, *y*_{N}, we want to find the parameters that maximize the probability of the correct labels given the images.

This method of choosing parameters is called *maximum likelihood* (ML), and we can express it mathematically as finding the parameters *θ* which maximize the likelihood function:

*θ*^{*} = argmax_{θ} *P*(*Y*_{1} = *y*_{1}, *Y*_{2} = *y*_{2}, …, *Y*_{N} = *y*_{N} | X_{1} = x_{1}, X_{2} = x_{2}, …, X_{N} = x_{N}; *θ*)

And since our data points are independent, we can write this joint probability as a product of probabilities.

In our likelihood function above, each random variable pair (*X*_{1}, *Y*_{1}),(*X*_{2}, *Y*_{2}),…,(*X*_{N}, *Y*_{N}) refers to a single data point. But since virtually all of our computations need to deal with multiple data points, we will find it both useful and computationally efficient to express both the mathematics and our Python code in terms of datasets. Thus, we will express the scalar random variables *Y*_{1}, *Y*_{2}, …, *Y*_{N} as the vector-valued random variable *Y*, and the vector-valued random variables *X*_{1}, *X*_{2}, …, *X*_{N} as the matrix-valued random variable *X*, where the matrix *X* has as many **rows** as there are data points.

Using this notation, we rewrite the maximum likelihood equation above:

*θ*^{*} = argmax_{θ} *P*(*Y* = y | *X* = X; *θ*)

Similarly, we can specify the logistic regression model in terms of multiple data points:

*P*(*Y* | *X* = X; *θ*)=SoftMax(XW+1b^{T})

Here the result is a matrix of probabilities with as many rows as there are data points and as many columns as there are classes. The SoftMax normalizes the result of the linear combination (XW+1b^{T}) in such a way that each row is a proper probability distribution summing to 1. Note that we have had to multiply the bias vector b by the vertical vector of all ones 1 in order to add the bias term for every single data point.
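The role of the all-ones vector **1** can be checked numerically: the outer product 1b^{T} stacks the bias onto every row, which NumPy broadcasting also does implicitly (the shapes below are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))   # 4 data points, 3 features
W = rng.standard_normal((3, 2))   # 2 classes
b = np.array([1.0, 2.0])
ones = np.ones((4, 1))

explicit  = X @ W + ones @ b[None, :]  # XW + 1 b^T, written out
broadcast = X @ W + b                  # broadcasting adds b to every row
print(np.allclose(explicit, broadcast))  # True
```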

The neural network model equations follow a similar pattern, which it would be a good exercise to write out for yourself.

In most machine learning applications, it is better to maximize the log-likelihood rather than the likelihood. This is done because the log-likelihood tends to be simpler to compute and more numerically stable than the likelihood. In terms of the math, this doesn't make things much more complicated, as all we need to add is a log in front of the likelihood:

*θ*^{*} = argmax_{θ} log*P*(*Y* = y | *X* = X; *θ*)

Since the logarithm of a product is the sum of the logarithms, we can write:

$$\log P( Y = \boldsymbol{\textrm{y}}\ | \ X = \boldsymbol{\textrm{X}} ; \theta) = \sum_{i=1}^N \log P( Y_i = \boldsymbol{\textrm{y}}_i\ | \ X_i = \boldsymbol{\textrm{X}}_i ; \theta)$$

that is, the log joint probability is the sum of the log marginal probabilities.

Now let's plug in the probability of a dataset from above to obtain:

log*P*(*Y* = y | *X* = X; *θ*)=∑[log(SoftMax(XW+1b^{T}))]_{⋅, y}

where we use the notation [M]_{⋅, y} to mean that from the matrix M we construct a new vector (*a*_{1}, *a*_{2}, …, *a*_{n}) such that *a*_{i} = *M*_{i, yi}∀*i*. We use a slight abuse of notation and we use the sum symbol to indicate the summation of the entries in the vector; we need a summation because the log joint probability is the sum of the log marginal probabilities.

Once we have a set of parameters *θ* that we are happy with, we can use the model to classify new data.

We have built a model that gives us a distribution over classes given the data. How do we assign a class to a new data point? The simplest way is to choose the class with the highest probability under the model to be the class we assign. We can write this mathematically, again using vector notation:

$$\hat{\boldsymbol{\textrm{y}}} = {\arg\max} \ P(Y \ |\ \boldsymbol{\textrm{X}} ; \theta) $$

Computing the maximum likelihood parameters exactly is computationally infeasible, so we're going to use a method called *gradient ascent* to find a set of parameters that is very good, though perhaps not the absolute best.

The idea of gradient ascent is very simple. Given a function *f* : ℝ^{n} → ℝ, we want to iteratively find points x_{n} such that the value of the function gets progressively higher, that is: *f*(x_{n + 1})>*f*(x_{n})∀*n*. One way to do this is to take the direction of steepest ascent, which is just the gradient of the function $\frac{\partial f}{\partial \boldsymbol{\textrm{x}}}$, and take a step in that direction times a constant *λ*, known as the *learning rate*, that describes how big the step should be. We express this mathematically as:

$$ \boldsymbol{\textrm{x}}_{n+1} \leftarrow \boldsymbol{\textrm{x}}_n + \lambda \frac{\partial f}{\partial \boldsymbol{\textrm{x}}} $$

The last detail is choosing the starting point, x_{0}, which we can arbitrarily choose by setting to zero or to some random value. Graphically, the algorithm looks like this, with each color representing the path from a different starting point:

Note that this method tends to find the top of the nearest hill ('local' maximum), and not the overall best point ('global' maximum).

It is also not guaranteed to increase the value of the function at each step; if the learning rate is too large, the algorithm could potentially jump across the top of the hill to a lower point on the other side of the hill. Much research goes into optimization methods, and many neural networks models are trained with methods that are more complicated than gradient ascent as it's presented here, but this same idea is at the base of all of those methods.

Finally, most people talk about and use gradient **descent** on the **negative** log-likelihood rather than gradient **ascent** on the log-likelihood; this is because gradient descent is the standard algorithm in the field of optimization. Gradient ascent is used in this tutorial to keep things a bit simpler.
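The update rule above fits in a few lines of Python. Here is a sketch on a hypothetical one-dimensional objective f(x) = −(x − 3)², whose gradient is −2(x − 3) and whose maximum is at x = 3:

```python
# Gradient ascent on f(x) = -(x - 3)^2, maximised at x = 3
def grad(x):
    return -2.0 * (x - 3.0)  # df/dx

x = 0.0     # arbitrary starting point x_0
lam = 0.1   # learning rate
for _ in range(100):
    x = x + lam * grad(x)   # step in the direction of steepest ascent
print(x)  # converges close to 3
```

With this concave objective there is a single hill, so the iterates approach the maximum; a learning rate that is too large would instead overshoot it, as discussed above.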

This is extremely easy: we just apply the equation above to our parameters taking the gradient of the log-likelihood:

$$\begin{align} \boldsymbol{\textrm{W}} &\leftarrow \boldsymbol{\textrm{W}} + \lambda \frac{\partial \log P( Y = \boldsymbol{\textrm{y}}\ | \ X = \boldsymbol{\textrm{X}} ; \theta)}{\partial \boldsymbol{\textrm{W}}} \\ \boldsymbol{\textrm{b}} &\leftarrow \boldsymbol{\textrm{b}} + \lambda \frac{\partial \log P( Y = \boldsymbol{\textrm{y}}\ | \ X = \boldsymbol{\textrm{X}} ; \theta)}{\partial \boldsymbol{\textrm{b}}} \\ \end{align}$$

Coding with Theano is extremely simple: once we have the equations behind the model, we pretty much type them directly. Since we will be training on multiple data points, we are going to encode the model as we wrote it for datasets, using vectors and matrices.

We'll start with the logistic regression. Our model is:

*P*(*Y* | X; *θ*)=SoftMax(XW+1b^{T})

The first thing we do is declare all of the variables (X, y, W, b) that we will be using and their types like you would in Java or C:

```
n_classes = 10 # each digit is one of 0-9
dims = 28 * 28 # our input data is flattened 28x28 matrices of image pixels
X = T.dmatrix() # Theano double matrix
y = T.ivector() # Theano integer vector
W = theano.shared(np.zeros([dims,n_classes])) # Theano shared double matrix
b = theano.shared(np.zeros(n_classes)) # Theano shared double vector
```

As you can see above, we defined *W* and *b* to be *shared* variables. This means that the values of these variables are persistent -- their values live on after we have run Theano operations. This is opposed to regular Theano variables, which only take values when Theano runs, and otherwise only exist in the abstract.

The reason for making *W* and *b* shared variables is that we want to run multiple iterations of gradient ascent, and to do that, we need their values to persist. Furthermore, we want to find good parameters through training and then use those same parameters for prediction, so we need them to be persistent.

Let's now write our statistical model in Theano. We basically copy the following equation into code:

*P*(*Y* | X; *θ*)=SoftMax(XW+1b^{T})

`P = T.nnet.softmax(T.dot(X,W) + b) # the matrix of probabilities of all classes for all data points`

Theano provides us with a `T.nnet.softmax` function to compute SoftMax, correctly normalizing the probabilities so that each row of the matrix `P` sums to 1.

Note that we didn't need to multiply **b** by the vector of ones **1**: broadcasting adds `b` to every row automatically, as this NumPy example shows:

```
amatrix = np.zeros((3,2)) # 3x2 matrix of all zeros
avector = np.array((1,2)) # the vector [1,2]
amatrix + avector
```

```
array([[ 1., 2.],
       [ 1., 2.],
       [ 1., 2.]])
```

Our next equation is the log-likelihood (**LL**) of a dataset:

log*P*(*Y* = y | *X* = X; *θ*)=∑[log(SoftMax(XW+1b^{T}))]_{⋅, y}

`LL = T.mean(T.log(P)[T.arange(P.shape[0]), y]) # the log-likelihood (LL)`

OK... there's a lot going on here.

We have used two important tricks:

- using the mean instead of the sum,
- using the strange indexing `[T.arange(P.shape[0]), y]`.

Let's go over them one by one.

Imagine that we were to construct a new dataset that contains each data point in the original dataset twice. Then the log-likelihood of the new dataset will be double that of the original. More importantly, the gradient of the log-likelihood of the new dataset will also be double the gradient of the log-likelihood of the original dataset. But we would like the size of the gradient to not depend on the amount of duplication in our dataset, and the easiest way to accomplish that is to divide the gradient by the size of the dataset.

Since taking the mean is equivalent to taking the sum and then dividing by the number of data points, what we are computing here is a type of "normalized" log-likelihood that will cause our gradient descent algorithm to be robust to change in dataset size. The quantity we are computing in the code can be more precisely described as the average log-likelihood for a single datapoint.

The second thing we do is use `[T.arange(P.shape[0]), y]` to apply the mathematical operation denoted in the equation by [⋅]_{⋅, y}.

As cryptic as it may be, this is a peculiar but standard NumPy way to index. For example, given the matrix

`M = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])`

suppose we wanted to extract one element from each row, say the first element from the first row and the last element from the others:

`M[0,0], M[1,3], M[2,3]`

We could write that as a single index expression by combining the indexes:

`M[(0,1,2), (0,3,3)]`

But the first index here is just (0, …, #rows − 1), or, in code, `np.arange(M.shape[0])`.

So we can write:

```
M = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12]])
M[np.arange(M.shape[0]), (0,3,3)]
```

`array([ 1, 8, 12])`

We're done with the model. There's one more thing we need to do, and that is to specify the gradient updates:

```
learning_rate = 0.5 # we tuned this parameter by hand
updates = [
    [W, W + learning_rate * T.grad(LL, W)],
    [b, b + learning_rate * T.grad(LL, b)]
]
```

OK, we're done coding the model. What do we do next?

When working with Theano, the next step is to create a Theano function. A Theano function is the basic unit of Theano code that we call to do something. In our case, this something will be performing a single gradient ascent iteration.

We create a Theano function by calling the function `theano.function` (yes, we create a function by calling a function).

`theano.function` takes several arguments:

- `inputs` -- the list of input variables of the Theano function, similar to the inputs of a Python function
- `outputs` -- the list of output variables of the Theano function, similar to the return values of a Python function
- `updates` -- a list of updates for shared variables, in the format we used above when we defined the variable `updates`
- `givens` -- a dictionary that allows substituting some variables from the model with other variables

In our case, we want the input to be the training dataset, the updates to be the gradient ascent updates, and while we don't really need an output, it will be helpful to get the log-likelihood as an output to see that we are doing the right things.

However, we will use `givens` instead of `inputs` to pass in the training data. Doing it this way is more efficient, as we already loaded the training dataset into memory as shared Theano variables when we first loaded the data.

Our Theano function will look like this:

```
training_function = theano.function(
    inputs = [],              # use givens instead of inputs as it's more efficient
    outputs = LL,             # output the log-likelihood just to check that it is improving
    updates = updates,        # the gradient updates, the one part that's really important
    givens = {X: training_x,  # the model variables X and y, defined in the abstract,
              y: training_y}  # should take the values in the shared variables training_x and training_y
)
```

OK, let's run ten iterations of our code and see what the log-likelihood does

```
for i in range(10):
    current_LL = training_function()
    print("Log-likelihood = " + str(current_LL) + "\t\t" +
          "Average probability of the correct class = " + str(np.exp(current_LL)))
```

```
Log-likelihood = -2.302585093 Average probability of the correct class = 0.0999999999998
Log-likelihood = -1.83912090251 Average probability of the correct class = 0.158957103494
Log-likelihood = -1.52505436796 Average probability of the correct class = 0.217609225573
Log-likelihood = -1.31424405922 Average probability of the correct class = 0.268677350658
Log-likelihood = -1.16824101592 Average probability of the correct class = 0.310913352197
Log-likelihood = -1.06176739247 Average probability of the correct class = 0.34584402773
Log-likelihood = -0.98204553767 Average probability of the correct class = 0.374544170518
Log-likelihood = -0.919423093595 Average probability of the correct class = 0.398749015602
Log-likelihood = -0.869394886438 Average probability of the correct class = 0.419205139229
Log-likelihood = -0.827688588441 Average probability of the correct class = 0.437058341404
```

Great, it appears we're improving the log-likelihood of the model on the training data. Now to use it to classify. Recall our equation that expresses how to use the model to get the class:

$$\hat{\boldsymbol{\textrm{y}}} = {\arg\max} \ P(Y \ |\ \boldsymbol{\textrm{X}} ; \theta) $$

Let's put that in code:

`y_hat = T.argmax(P, axis=1)`

Note that we had to specify **axis=1**, that is, we want to get the **argmax** for each **row**, as each row represents the distribution for one datapoint.

Similarly to the training function, the classification function will use `givens` to pass in the test dataset, and will output `y_hat`, which we just defined:

```
classification_function = theano.function(
    inputs = [],
    outputs = y_hat,
    givens = {X: test_x} # we don't need the true labels test_y here
)
```

Now let's run the classification once, and print the first ten images, the true labels, and the labels assigned by the model

```
test_y_hat = classification_function()
print("Classification error: " + str(100 * (1 - np.mean(test_y_hat == test_y.get_value()))) + "%")
for i in range(10):
    plot_mnist_digit(test_x.get_value()[i])                  # test_x is a theano object of *images*
    print('Image class: \t\t' + str(test_y.get_value()[i]))  # test_y holds the *true labels*
    print('Model-assigned class: \t' + str(test_y_hat[i]))   # test_y_hat holds the *predicted labels*
```

`Classification error: 15.12%`

```
Image class: 7
Model-assigned class: 7
```

```
Image class: 2
Model-assigned class: 2
```

```
Image class: 1
Model-assigned class: 1
```

```
Image class: 0
Model-assigned class: 0
```

```
Image class: 4
Model-assigned class: 4
```

```
Image class: 1
Model-assigned class: 1
```

```
Image class: 4
Model-assigned class: 4
```

```
Image class: 9
Model-assigned class: 9
```

```
Image class: 5
Model-assigned class: 2
```

```
Image class: 9
Model-assigned class: 9
```

So far we have trained a logistic regression model. The neural network model is so similar that we can implement it with just a few changes to the code.

We need:

- to declare the hidden layer variable
- to decide on the size of the hidden layer (we'll keep it small so it will run on your personal computer)
- new parameters

```
H = T.dmatrix() # Theano double matrix
hidden_layer_size = 20
W_xh = theano.shared(0.01 * np.random.randn(dims, hidden_layer_size))
W_hy = theano.shared(np.zeros([hidden_layer_size, n_classes]))
b_xh = theano.shared(np.zeros(hidden_layer_size))
b_hy = theano.shared(np.zeros(n_classes))
```

Remember our model? Let's write it using matrices and vectors for an entire dataset:

$$\boldsymbol{\textrm{H}} = \tanh \left( \boldsymbol{\textrm{X}} \boldsymbol{\textrm{W}}_\boldsymbol{\textrm{xh}} + \boldsymbol{\textrm{1b}}^\boldsymbol{\textrm{T}}_\boldsymbol{\textrm{xh}}\right), \\
P(Y \ |\ \boldsymbol{\textrm{x}} ; \theta) = \textrm{SoftMax} \left( \boldsymbol{\textrm{H}} \boldsymbol{\textrm{W}}_\boldsymbol{\textrm{hy}} + \boldsymbol{\textrm{1b}}^\boldsymbol{\textrm{T}} _\boldsymbol{\textrm{hy}}\right)$$

Let's code it up:

```
H = T.tanh( T.dot(X, W_xh) + b_xh )
P = T.nnet.softmax( T.dot(H, W_hy) + b_hy )
```

Now let's add the gradient updates:

```
LL = T.mean(T.log(P)[T.arange(P.shape[0]), y]) # the log-likelihood (LL)
updates = [
    [W_xh, W_xh + learning_rate * T.grad(LL, W_xh)],
    [W_hy, W_hy + learning_rate * T.grad(LL, W_hy)],
    [b_xh, b_xh + learning_rate * T.grad(LL, b_xh)],
    [b_hy, b_hy + learning_rate * T.grad(LL, b_hy)],
]
```
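Note the `+` sign in the updates: we are doing gradient *ascent*, maximising the log-likelihood rather than minimising a loss. A minimal NumPy sketch of one ascent step on a softmax model shows the effect (random data and the analytic gradient of the mean log-likelihood; all names and sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50, 5, 3                        # samples, features, classes
X = rng.normal(size=(n, d))
y = rng.integers(0, k, size=n)
W = np.zeros((d, k))

def log_likelihood(W):
    logits = X @ W
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    P = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return np.mean(np.log(P[np.arange(n), y])), P

ll_before, P = log_likelihood(W)
Y_onehot = np.eye(k)[y]
grad = X.T @ (Y_onehot - P) / n           # analytic gradient of the mean LL
W = W + 0.5 * grad                        # '+' because we *maximise* LL

ll_after, _ = log_likelihood(W)
assert ll_after > ll_before               # one ascent step improves the LL
```

Theano's `T.grad(LL, W_xh)` computes exactly this kind of derivative for us symbolically, which is the main convenience of the library.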

The one extremely important thing we did here was to redefine **LL**.

This is a crucial point about how Theano works: whenever we define a Theano variable, as we did with `P`, and then use it inside another definition, like

`LL = T.mean(T.log(P)[T.arange(P.shape[0]), y])`

we implicitly create a new object for `LL` that holds a reference to the variable `P` we just defined. Now the crucial part: say we redefine `P`. Our old `LL` still refers to the *old* `P`, so unless we redefine `LL` as well, it will silently keep computing with the previous model.

Bugs in Theano are very commonly produced by exactly this. It is a good reason to always use Theano in scripts rather than in a notebook like we are here.
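Since Theano itself may not be to hand, the same stale-reference pitfall can be sketched in plain Python (hypothetical names; a "graph node" is just an object holding references to its inputs):

```python
# Build a tiny stand-in for a symbolic graph: LL captures a reference to P.
P_old = {"name": "P", "version": 1}
LL = {"op": "mean_log", "input": P_old}  # LL keeps a reference to this P object

# "Redefining" P only rebinds the *name*; LL still points at the old object.
P = {"name": "P", "version": 2}

assert LL["input"] is P_old              # LL silently still uses the old P
assert LL["input"]["version"] == 1       # not the redefined one
```

This is why the neural network code above redefines `LL` after redefining `P`: rebinding a Python name never updates graphs built from the old object.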

Phew, we are now ready to train!

```
training_function = theano.function(
    inputs=[],
    outputs=LL,
    updates=updates,
    givens={X: training_x[:5000],  # use only 10% of the data so the model
            y: training_y[:5000]}  # can be trained on a personal computer
)
for i in range(60):  # train for longer than logistic regression, as this model is more complex
    current_LL = training_function()
    print(
        "Log-likelihood = " + str(current_LL) + "\t\t" +
        "Average probability of the correct class = " + str(np.exp(current_LL))
    )
```

```
Log-likelihood = -2.30258509299 Average probability of the correct class = 0.1
Log-likelihood = -2.30123532532 Average probability of the correct class = 0.100135067902
Log-likelihood = -2.2998167966 Average probability of the correct class = 0.100277213167
Log-likelihood = -2.29812291581 Average probability of the correct class = 0.100447214752
Log-likelihood = -2.29590367206 Average probability of the correct class = 0.100670379142
Log-likelihood = -2.29281799209 Average probability of the correct class = 0.100981495471
Log-likelihood = -2.28837617693 Average probability of the correct class = 0.101431034253
Log-likelihood = -2.28186920166 Average probability of the correct class = 0.102093195481
Log-likelihood = -2.27228942655 Average probability of the correct class = 0.103075924982
Log-likelihood = -2.25826075997 Average probability of the correct class = 0.104532133214
Log-likelihood = -2.23801271431 Average probability of the correct class = 0.1066702782
Log-likelihood = -2.20944653423 Average probability of the correct class = 0.109761380875
Log-likelihood = -2.17036118024 Average probability of the correct class = 0.114136385658
Log-likelihood = -2.11893562814 Average probability of the correct class = 0.120159454814
Log-likelihood = -2.05448392668 Average probability of the correct class = 0.128158957933
Log-likelihood = -1.9781664624 Average probability of the correct class = 0.138322624675
Log-likelihood = -1.89308937088 Average probability of the correct class = 0.150605812178
Log-likelihood = -1.80356444392 Average probability of the correct class = 0.16471073844
Log-likelihood = -1.71390604967 Average probability of the correct class = 0.180160699808
Log-likelihood = -1.62743071587 Average probability of the correct class = 0.196433620113
Log-likelihood = -1.54612544874 Average probability of the correct class = 0.213071934689
Log-likelihood = -1.47089832383 Average probability of the correct class = 0.22971903039
Log-likelihood = -1.40197788868 Average probability of the correct class = 0.246109704628
Log-likelihood = -1.33918043475 Average probability of the correct class = 0.262060356155
Log-likelihood = -1.28205618089 Average probability of the correct class = 0.27746619282
Log-likelihood = -1.23000534858 Average probability of the correct class = 0.292291014336
Log-likelihood = -1.1823856608 Average probability of the correct class = 0.306546549485
Log-likelihood = -1.13858981262 Average probability of the correct class = 0.320270344716
Log-likelihood = -1.09808364787 Average probability of the correct class = 0.333509593518
Log-likelihood = -1.06041545036 Average probability of the correct class = 0.346311905034
Log-likelihood = -1.02521125894 Average probability of the correct class = 0.358720674451
Log-likelihood = -0.992165825303 Average probability of the correct class = 0.37077279169
Log-likelihood = -0.96103293577 Average probability of the correct class = 0.382497586412
Log-likelihood = -0.931615847613 Average probability of the correct class = 0.393916686506
Log-likelihood = -0.903757946709 Average probability of the correct class = 0.405044659855
Log-likelihood = -0.877333927776 Average probability of the correct class = 0.415890228318
Log-likelihood = -0.852241954255 Average probability of the correct class = 0.426457760591
Log-likelihood = -0.828397170186 Average probability of the correct class = 0.436748759536
Log-likelihood = -0.805726712165 Average probability of the correct class = 0.446763140352
Log-likelihood = -0.784166136418 Average probability of the correct class = 0.456500202015
Log-likelihood = -0.76365701148 Average probability of the correct class = 0.465959288932
Log-likelihood = -0.744145358272 Average probability of the correct class = 0.475140201108
Log-likelihood = -0.725580637598 Average probability of the correct class = 0.484043433528
Log-likelihood = -0.707915058331 Average probability of the correct class = 0.492670316262
Log-likelihood = -0.691103068952 Average probability of the correct class = 0.501023101114
Log-likelihood = -0.675100970071 Average probability of the correct class = 0.509105013643
Log-likelihood = -0.65986663098 Average probability of the correct class = 0.516920271045
Log-likelihood = -0.645359309369 Average probability of the correct class = 0.524474059803
Log-likelihood = -0.631539569966 Average probability of the correct class = 0.531772469537
Log-likelihood = -0.61836928705 Average probability of the correct class = 0.538822386202
Log-likelihood = -0.605811706762 Average probability of the correct class = 0.545631354182
Log-likelihood = -0.593831541869 Average probability of the correct class = 0.552207420302
Log-likelihood = -0.582395074029 Average probability of the correct class = 0.558558973142
Log-likelihood = -0.571470244429 Average probability of the correct class = 0.564694589
Log-likelihood = -0.561026720601 Average probability of the correct class = 0.570622892704
Log-likelihood = -0.551035933447 Average probability of the correct class = 0.576352438247
Log-likelihood = -0.541471083378 Average probability of the correct class = 0.581891611356
Log-likelihood = -0.532307117743 Average probability of the correct class = 0.587248554016
Log-likelihood = -0.523520683596 Average probability of the correct class = 0.592431109513
Log-likelihood = -0.515090060711 Average probability of the correct class = 0.597446785713
```

```
y_hat = T.argmax(P, axis=1)
test_y_hat = classification_function()
print("Classification error: " + str(100 * (1 - np.mean(test_y_hat == test_y.get_value()))) + "%")
for i in range(10):
    plot_mnist_digit(
        test_x.get_value()[i]  # test_x is a theano object of *images*
    )
    print('Image class: \t\t' + str(test_y.get_value()[i]))  # test_y is a theano object of *true labels*
    print('Model-assigned class: \t' + str(test_y_hat[i]))  # test_y_hat holds the *predicted labels*
```

`Classification error: 15.12%`

```
Image class: 7
Model-assigned class: 7
```

```
Image class: 2
Model-assigned class: 2
```

```
Image class: 1
Model-assigned class: 1
```

```
Image class: 0
Model-assigned class: 0
```

```
Image class: 4
Model-assigned class: 4
```

```
Image class: 1
Model-assigned class: 1
```

```
Image class: 4
Model-assigned class: 4
```

```
Image class: 9
Model-assigned class: 9
```

```
Image class: 5
Model-assigned class: 2
```

```
Image class: 9
Model-assigned class: 9
```
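The per-image printouts above show nine of the ten sampled digits classified correctly (only the '5' is mistaken for a '2'). As a plain-NumPy sanity check of the error formula `100 * (1 - np.mean(test_y_hat == test_y))`, here are those ten labels as hypothetical stand-ins for the real test arrays (the 15.12% reported above is over the full test set, not this sample):

```python
import numpy as np

# Ten hypothetical labels standing in for test_y_hat and test_y.get_value()
test_y_hat = np.array([7, 2, 1, 0, 4, 1, 4, 9, 2, 9])   # predictions
test_y     = np.array([7, 2, 1, 0, 4, 1, 4, 9, 5, 9])   # true labels

error = 100 * (1 - np.mean(test_y_hat == test_y))        # same formula as above
assert np.isclose(error, 10.0)   # one mismatch in ten digits -> 10% error
```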

Along the way we used a good chunk of mathematics and a handful of tricks, but required very few lines of code.

Hopefully this leads nicely into using Theano more; in particular, we would be excited to see you put Theano to work on smarter training algorithms, parallelised training, and more popular network architectures.

Good luck!

By Gabi Teodoru.

Lorna Brightmore, a recent Fellow on the latest ASI Data Science Fellowship in May 2016, reflects on her experience of her ASI Fellowship project with pet food business Tails. Lorna has since continued working in a full-time role with Tails as a Data Scientist, complete with dogs in the office - sounds like a dream job!

**Tell us about your background, before you decided to move into Data Science?**

Before moving into Data Science, I did my undergraduate in Maths at the University of York, and then a PhD in Random Matrix Theory and Integrable systems at the University of Bristol. My research was on the asymptotics of Toeplitz determinants, and applications to problems in Random Matrix Theory and calculating entanglement in a quantum spin chain.

I had always enjoyed maths and my research, but I realised I was more motivated by solving interesting and complex problems, rather than the actual content of my PhD. So I decided that for me the uncertainty of the post-doc life was not worth it, and I decided instead to find a career with a bit more stability, but that still kept all the things I enjoyed from research.

After finishing my thesis I was in no rush to start work right away! I spent some time volunteering for FareShare in Bristol, before going travelling round Asia for 5 months. Shortly after returning, I started work as an analyst at the Department for Education where I was quantifying policy impacts and forecasting education spending. Whilst the work was always interesting, I felt like I could do better analysis with better tools, and some more background knowledge. This finally led me to data science!

**Why did you choose the ASI Fellowship?**

The career change into data science felt daunting, and my coding ability wasn’t brilliant. So it felt like a steep learning curve and I initially didn’t know where to start in trying to get a job.

The Fellowship was ideal for me as it gave me the technical data science skills I needed, as well as real commercial experience doing an actual data science project.

If I hadn’t done the Fellowship, I probably would have spent the same amount of time learning and applying for jobs, and I wouldn’t have had the commercial experience, business coaching skills or network of fellow data scientists.

**What was the best part of your project or experience on the ASI Data Science Programme?**

My favourite part was the first week, where all the different participating companies pitch their data science projects to you. It really opened my eyes to the potential for data science in industry, and the variety of companies you can work for as a data scientist. Also, the demo day at the end of the Fellowship was a really nice way to end it, and celebrate all the awesome work that everybody had done.

Other than that, I really enjoyed just being around the other fellows and the ASI team. It was a really collaborative environment, and I was surprised that informal chats in the ASI offices with really smart people benefited my project even more than the formal teaching.

**What would be your advice to people looking to enter Data Science in Industry?**

A key first step is to get your Python and SQL skills up to scratch. I’d recommend finding a good online course in each of these and working through it. Codecademy is a good place to start, and Learn Python The Hard Way is also a comprehensive Python course. If you are new to coding, and if you like Maths puzzles, then working through some Project Euler problems is a fun way to practise the basics of coding.

I’d also recommend taking Andrew Ng’s Machine Learning Coursera course. Everybody will tell you that, and they’re not wrong. Also, take a look at the Kaggle website for some good machine learning projects. They have some ‘getting started in data science’ competitions with tutorials that walk you through all aspects of a machine learning problem, from getting and cleaning the data to creating and testing a model.

Finally, I’d recommend just immersing yourself in the data science community. Talk to any data scientist friends you have about what it’s like, and the type of projects they’re working on. In London at least there are a lot of talks and Meetups organised by data scientists, and I’d recommend going to some of these just to see what’s happening, and get to know some people in the field. Also, apply to the ASI Fellowship! This gives you the technical skills, experience on data science projects and a data science community in one go, and has really kick-started my career as a data scientist.

**How was your overall experience on the Fellowship, and what have you been up to since completing the Fellowship?**

I did my Fellowship project with the online dog food subscription business Tails.com, looking at what factors predict whether a customer is likely to churn. I enjoyed the project and all aspects of life at Tails, and since the Fellowship I’ve been working there as their data scientist!

My experience of the ASI Fellowship was really positive, and I now have a data science job in a company I love. At Tails, I’m busy extending my project to different retention scenarios, and planning future projects. My exposure to different projects and possibilities for data science within companies has really given me a head start in doing all of this.
