A couple of weeks ago, I was lucky enough to give a talk at the Hardcore Data Science track at Strata London, “the Lollapalooza of big data conferences”. The quality of the other speakers was intimidatingly good (key people from Scikit-learn and Google DeepMind are just two examples), and the audience was composed of technical people interested in the details of data science applied to everyday life.
My talk was titled ‘Detecting Anomalies in the Real World’. Anomaly detection is a Machine Learning field with a wide range of applications.
As is often the case in Data Science, each anomaly detection problem is unique and must be solved in its own way. The most common challenge in this field is defining precisely what an anomaly is with respect to normal behaviour. Whether a data point is anomalous depends on the context. For example, a cold day in summer could be considered anomalous (well, not so anomalous in the UK…), but the same temperature would be considered perfectly normal in winter. Hence, it is very important to take into account the context in which the data lies.
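To make the role of context concrete, here is a minimal sketch that scores a temperature against the readings of its own season rather than against the whole year. The readings, the season labels, the `is_anomalous` helper and the threshold of 2 standard deviations are all made-up assumptions for illustration, not data or code from the talk.

```python
from statistics import mean, stdev

# Hypothetical daily temperatures in degrees Celsius, grouped by context (season).
history = {
    "summer": [21.0, 23.5, 22.0, 24.0, 20.5, 22.5, 23.0],
    "winter": [4.0, 6.5, 5.0, 7.0, 3.5, 5.5, 6.0],
}

def is_anomalous(temperature, season, threshold=2.0):
    """Flag a reading whose z-score within its own season exceeds the threshold."""
    readings = history[season]
    z = abs(temperature - mean(readings)) / stdev(readings)
    return z > threshold

# The same 6 degree reading is judged differently depending on the context.
cold_summer_day = is_anomalous(6.0, "summer")
same_day_in_winter = is_anomalous(6.0, "winter")
```

With these toy numbers the reading is flagged in summer and accepted in winter, which is the behaviour a context-free threshold on the raw temperature could not reproduce.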
Anomaly Detection can be supervised (where we have data with labels that inform us whether we are observing an anomaly or not) or unsupervised (where we have never seen an anomaly, but we know it might exist). Even in the more fortunate supervised case, the inherent nature of anomalies means the classes that compose our dataset will be very imbalanced, and we must pay extra attention to how we evaluate the models we build.
The accuracy measure, in particular, can be very misleading. For example, let’s think about a predictive maintenance scenario where we want to predict whether an engine is going to fail in the near future, using historical data. Let’s suppose an engine fails about 0.1% of the time. If we train a model that always says the engine is not going to fail, it will still be accurate 99.9% of the time, whilst missing the only information that we care about.
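The arithmetic behind this example can be checked directly. The sketch below builds a hypothetical dataset of 100,000 engine records with a 0.1% failure rate (the dataset size is an assumption chosen for round numbers) and scores a "model" that always predicts no failure.

```python
# Hypothetical dataset: 100,000 engine records with a 0.1% failure rate.
n_records = 100_000
n_failures = 100  # 0.1% of the records

# Ground truth: 1 means the engine failed, 0 means it did not.
y_true = [1] * n_failures + [0] * (n_records - n_failures)
# A "model" that always predicts that the engine will not fail.
y_pred = [0] * n_records

# Accuracy: the share of records where the prediction matches the truth.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / n_records

# Recall: the share of real failures that the model catches.
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = caught / n_failures
```

The accuracy comes out very close to 1 while the recall is exactly 0: the model never catches a single failure.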
Luckily, there are other metrics that take this into account, including precision, recall, F-measure and weighted accuracy. To have an effective model, it is important to manage the trade-off between “we don’t want to miss any anomalies” and “we don’t want to raise too many false alarms” - and this again will depend on the kind of problem we are dealing with.
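As a rough illustration of that trade-off, the sketch below computes precision, recall and the F1 variant of the F-measure at two alarm thresholds. The scores, the labels, the thresholds and the helper names are hypothetical, picked only to make the effect visible.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for the positive (anomaly) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical anomaly scores from some model, with their true labels.
scores = [0.95, 0.80, 0.60, 0.55, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0]

def evaluate(threshold):
    """Raise an alarm for every score at or above the threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return precision_recall_f1(labels, preds)

strict = evaluate(0.7)   # fewer alarms: no false alarms, but one anomaly missed
lenient = evaluate(0.5)  # more alarms: every anomaly caught, one false alarm
```

Lowering the threshold moves the model towards “we don’t want to miss any anomalies”; raising it moves the model towards “we don’t want to raise too many false alarms”.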
Since the track focused on practical applications rather than theory, I talked about two real-world examples I happened to work on during my studies and as a data science consultant at ASI Data Science.
The first application was Foreground Detection in video sequences. In this problem, the foreground objects were considered anomalies that changed the normal state of the frame, which is the background. One of the key challenges is that the background in real-world scenes is not completely static: there may be perturbations due to illumination changes, objects may be dynamic (e.g. tree foliage) and there may be permanent changes in the background that we would like to incorporate into our model over time. A possible solution is a dynamic background model based on dictionary learning that is able to learn the different configurations of the background from the environment. The model updates itself so that the alarms it raises will be true foreground objects, and not configurations that actually belong in the background.
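The idea of a background model that updates itself can be sketched with something much simpler than dictionary learning. The example below swaps in a per-pixel exponential moving average, which is not the method from the talk, purely to show how a permanent change gets absorbed into the background. The function names, the learning rate, the threshold and the tiny one-dimensional "frames" are all assumptions.

```python
def update_background(background, frame, learning_rate=0.1):
    """Blend the new frame into the background estimate (exponential moving average)."""
    return [(1 - learning_rate) * b + learning_rate * f
            for b, f in zip(background, frame)]

def foreground_mask(background, frame, threshold=30.0):
    """Mark pixels that differ from the background by more than the threshold."""
    return [abs(f - b) > threshold for b, f in zip(background, frame)]

# Hypothetical 1-D "frames" of grey-level pixel intensities (0-255).
background = [100.0, 100.0, 100.0, 100.0]
frame_with_object = [102.0, 200.0, 98.0, 101.0]  # pixel 1 is covered by an object

# The new object is flagged as foreground; small fluctuations are not.
mask = foreground_mask(background, frame_with_object)

# If the object stays put (a permanent change), repeated updates absorb it.
for _ in range(50):
    background = update_background(background, frame_with_object)
mask_after_adaptation = foreground_mask(background, frame_with_object)
```

A moving average can only hold one configuration per pixel, so it copes poorly with dynamic backgrounds such as tree foliage; handling several background configurations is the kind of case a richer model like the dictionary-based one is meant to address.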
The second application was fraud detection for credit card transactions, for which we built a model at ASI Data Science for a big e-commerce provider. This was a supervised case, as the client provided us with labelled data. However, there were two problems in addition to the well-known imbalance of the classes. Firstly, the frauds were malicious anomalies, in that they were designed to look like legitimate transactions. Secondly, the client already had in place a system that blocked a high number of transactions, even if they were legitimate. This means that the data we were provided with was biased.
After a careful feature engineering/selection phase, we built a fraud detection system based on a balanced forest classifier. In order to have a more robust model, we trained it on a “clever sample” subset of the data which contained a greater than average number of legitimate transactions that were misclassified as frauds by human experts. In the end, we correctly classified around 97% of frauds, whilst raising fewer than 2% false alarms.
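The post does not say which balanced forest variant was used. One common ingredient of such classifiers is to train each tree on a bootstrap sample that contains as many majority-class records as minority-class ones. The sketch below shows only that sampling step, on made-up transaction ids; the helper name, the data and the number of trees are hypothetical, and the “clever sample” selection described above is not reproduced.

```python
import random

def balanced_bootstrap(records, labels, rng):
    """Draw a bootstrap sample with as many legitimate records as frauds."""
    frauds = [r for r, y in zip(records, labels) if y == 1]
    legit = [r for r, y in zip(records, labels) if y == 0]
    n = len(frauds)
    # Sample with replacement from each class, n records apiece.
    sample = rng.choices(frauds, k=n) + rng.choices(legit, k=n)
    sample_labels = [1] * n + [0] * n
    return sample, sample_labels

# Hypothetical toy data: 1,000 transaction ids, of which 20 are frauds.
records = list(range(1000))
labels = [1 if i < 20 else 0 for i in records]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
# One balanced sample per tree; each would be used to fit one decision tree.
samples = [balanced_bootstrap(records, labels, rng) for _ in range(5)]
```

Each tree then sees frauds and legitimate transactions in equal proportion, so the forest is not dominated by the majority class the way a model trained on the raw data would be.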
In conclusion, there isn’t a general framework for solving anomaly detection problems, and every problem must be studied and considered on its own. It’s important to take into account the context in which the data lives and to select/build the features that describe the data points accurately. If the anomalies that you are trying to detect are malicious, like frauds, then let’s hope the fraudsters make enough mistakes so that you can spot them!
Written by Alessandra Stagliano, PhD, Data Scientist & Consultant at ASI. You can find out more about Strata + Hadoop World here.