Anomaly Detection with Computer Vision

December 13, 2020

Authors: Heimer Rojas, Mia Morton, Abdel Perez, Ximena Andrade

Abstract: As students of Machine Learning, we were given the opportunity to implement an idea of our choosing. We partnered with mentors Divait Parra and Paula Villamarin from LinkedAI and with Cristian Garcia, a Machine Learning Engineer who has contributed to many open-source projects, to create a model that automates anomaly detection on the production line.

After choosing the data set to train and test the model, we were able to successfully detect anomalies 86% to 90% of the time.


An anomaly is an event or item that deviates from what is expected. Anomalies occur far less frequently than standard events. In manufactured products they are usually random; examples include changes in color or texture, scratches, misalignment, missing pieces, and errors in proportions.

Anomaly detection allows us to repair or remove defective parts from the production chain. As a result, manufacturing costs are reduced because defective products are not produced or marketed. In factories, anomaly detection is a useful tool for Quality Control Systems, and it remains a big challenge for Machine Learning Engineers.

Supervised Learning is not recommended for this task for two reasons: anomaly detection depends on features intrinsic to the data, and anomalies make up only a small fraction of a full dataset (training/validation). Direct image comparison might seem a feasible alternative, but real images vary in lighting, object position, distance to the object, and other factors, which rules out a pixel-to-pixel comparison with a standard image, even though pixel-level comparison is integral to detecting anomalies.

In addition to these constraints, our proposal uses Synthetic Data to enlarge the Training Data Set. We chose two kinds of Synthetic Data: random Synthetic Data and Synthetic Data similar to anomalies (see the Data section for more details).

The goal of this project is to classify images as Anomaly or Not Anomaly using Unsupervised Learning, with Synthetic Data as the data augmentation methodology.

This project was proposed by the startup LinkedAI, a Colombian company specializing in data labeling for Artificial Intelligence projects.

Background Research

Anomaly detection is associated with finance and with detecting “bank fraud, medical problems, structural defects, malfunctioning equipment” (Flovik et al, 2018). This project focused on anomaly detection using image datasets, with application on a production line. At the beginning of the project, we familiarized ourselves with the functionality and architecture of Autoencoders with regard to their use in anomaly detection. As part of the data plan, we researched the importance of including synthetic noisy images and real noisy images (Dwibedi et al, 2017).

Having a Data Plan was an important part of this project: we needed a data set with enough original images and enough real noisy images, and we used both synthetic and real images. With real images alone, the images (data) needed for full coverage of an object and its environment may not be available with regard to views and scale; “…distinguishing between such instances requires the dataset to have good coverage of viewpoints and scales of the object” (Dwibedi et al, 2017). The use of synthetic data allows for “good coverage of both instances and viewpoints” (Dwibedi et al, 2017). We created the synthetic image datasets, which included synthetically rendered scenes and objects, using the Flip Library, an open-source Python library created by LinkedAI. In “Cut, Paste and Learn: Surprisingly Easy Synthesis for Instance Detection”, Dwibedi et al (2017) showed through their training and evaluation that training on synthetic datasets gave results comparable to training on real image datasets.

Autoencoder architecture “typically” learns a representation of a dataset by reducing the dimensionality (encoding) of the original input, which creates the bottleneck. From this reduced encoding a reconstruction is generated that is as close to the original as possible. Both the input layer and the output layer of the autoencoder have the same number of nodes. Variational Autoencoders add a layer just before the bottleneck: “the bottleneck value is created by picking it from a random normal distribution” (Patuzzo, 2020). The reconstructed output image carries some reconstruction loss, and the distribution of that loss can be used to define a threshold (Flovik, 2018). The threshold is the value beyond which an input is considered an anomaly.
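The bottleneck and the reconstruction loss can be illustrated with a minimal sketch. It assumes a purely linear autoencoder, whose best weights coincide with a PCA projection, rather than the convolutional model used in this project; the data, the sizes, and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "normal" data: 200 samples with 16 features that lie close
# to a 3-dimensional subspace (a stand-in for flattened images).
basis = rng.normal(size=(3, 16))
normal = rng.normal(size=(200, 3)) @ basis + 0.01 * rng.normal(size=(200, 16))

# Fit a linear autoencoder with a 3-unit bottleneck. For a linear model the
# best encoder/decoder pair is given by the top principal directions.
mean = normal.mean(axis=0)
_, _, vt = np.linalg.svd(normal - mean, full_matrices=False)
components = vt[:3]  # shape (3, 16)

def encode(x):
    """Compress 16 input values down to the 3-value bottleneck."""
    return (x - mean) @ components.T

def decode(z):
    """Expand the bottleneck back to 16 output values."""
    return z @ components + mean

def reconstruction_error(x):
    """Mean squared error between each input and its reconstruction."""
    return np.mean((x - decode(encode(x))) ** 2, axis=-1)

# An input that does not follow the learned structure reconstructs poorly.
anomaly = rng.normal(size=(1, 16)) * 3.0
normal_errors = reconstruction_error(normal)
anomaly_error = reconstruction_error(anomaly)[0]
```

The input and the output have the same number of values, and the anomalous sample ends up with a far larger reconstruction error than any of the normal samples, which is the property a threshold exploits.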

Denoising autoencoders allow the hidden layer to learn “more robust filters” and reduce overfitting. The autoencoder is trained to reconstruct the input “from a corrupted version of it” (Denoising Autoencoders (dA)). Training uses the original image as well as the noisy or “corrupted” image. With the introduction of a stochastic corruption process, the denoising autoencoder is expected to encode the input and then reconstruct the original input by removing the noise (corruption) from the image. According to Vincent et al (2020), “Extracting and Composing Robust Features with Denoising Autoencoders”, the Denoising Autoencoder should be able to find structures and regularities that are characteristic of the input. For images, these structures and regularities would have to be captured “from a combination of many input dimensions” (Vincent et al., 2020). Vincent et al (2020) hypothesize that “robustness to partial destruction of the input” should be a criterion for a “good intermediate representation”.
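A stochastic corruption process can be sketched as follows. This example uses masking noise (randomly zeroing pixels), which is only one of several possible corruption types and is not necessarily the one used in this project; the function name, the drop fraction, and the image sizes are assumptions.

```python
import numpy as np

def corrupt(images, drop_fraction=0.3, seed=0):
    """Return a copy of `images` with a random subset of pixels set to zero."""
    rng = np.random.default_rng(seed)
    keep_mask = rng.random(images.shape) >= drop_fraction
    return images * keep_mask

# Hypothetical batch of 8 grayscale images, 32x32 pixels, values in [0.5, 1).
clean = 0.5 + 0.5 * np.random.default_rng(1).random((8, 32, 32))
noisy = corrupt(clean)

# A denoising autoencoder is trained on (noisy, clean) pairs: the corrupted
# image is the input and the untouched original is the reconstruction target.
training_pairs = list(zip(noisy, clean))
```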

The emphasis, in this case, would be on the ability to obtain and create a large number of images both original and with noise. We used both real and synthetic data to create a significant number of images with which to train our model.

According to Huszar (2016), Dilated Convolutional Autoencoders “support exponential expansion of the receptive field without loss of resolution or coverage.” Maintaining the resolution and coverage of an image is integral both to reconstructing that image with the Dilated Convolutional Autoencoder and to anomaly detection using images. In the decoder stage, this lets the autoencoder produce a much closer approximation of the original image than may result from the “typical” autoencoder structure. Yu et al. (2017), in “Network Intrusion Detection through Stacking Dilated Convolutional Autoencoders”, used Dilated Convolutional Autoencoders with the goal of combining unsupervised learning and CNNs to learn features from large amounts of unlabelled raw traffic data; their interest was in identifying and detecting complex attacks. By allowing “very large receptive fields while only growing the number of parameters logarithmically” (Huszar, 2016), incorporating the feature learning of the unsupervised CNN, and stacking these layers, Yu et al. (2017) achieved a “remarkable performance” from their model.
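The growth of the receptive field can be checked with a little arithmetic. The sketch below assumes stacked stride-1 convolutions with a square kernel; the function name and the layer configurations are illustrative and are not taken from the cited papers.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in pixels, per axis) of stacked stride-1 convolutions."""
    field = 1
    for dilation in dilations:
        field += (kernel_size - 1) * dilation
    return field

# Four 3x3 layers: ordinary convolutions versus dilations that double per layer.
plain = receptive_field(3, [1, 1, 1, 1])      # 9
dilated = receptive_field(3, [1, 2, 4, 8])    # 31
```

Both stacks have the same number of weights, yet the dilated one covers 31 pixels per axis instead of 9, and no pooling or striding is needed, so the resolution is preserved.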


Flip Library (LinkedAI)
Flip is a Python library that allows you to generate synthetic images in a few steps from a small set of images made up of backgrounds and objects (images that are placed on top of the backgrounds). It also allows you to save the results as jpg, json, csv, and Pascal VOC files.

Python Libraries
Several Python libraries were used in this project for different purposes:

Visualization (images, metrics):

Array handling:


Image Similarity Comparison:

Weights & Biases

Weights & Biases is a developer tool that tracks machine learning models and creates visualizations of the model and the training. It functions as a Python library and can be imported with import wandb. It works with TensorFlow, Keras, PyTorch, scikit-learn, Hugging Face, and XGBoost. wandb.config is used to record the inputs and hyperparameters, and the tool tracks metrics and creates visualizations for the inputs, hyperparameters, model, and training, making it easier to see where changes can and need to be made to improve the model.

Method & Structure

We started our project from current Autoencoder architectures that specialize in images and use convolutional networks (see the figures below). After some preliminary tests, and based on the research (see References) and advice from our mentors, we changed to the final architecture.

Fig. Typical architecture for Autoencoders

Use of Dilation Feature

The dilation feature is a special kind of convolution in which holes are inserted into the traditional convolutional kernels. In our project, we applied the dilation feature specifically to the channel dimensions, without impacting the image resolution.
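The idea of inserting holes into a kernel can be made concrete with a small sketch. It shows the standard spatial form of dilation on a 2D kernel and does not reproduce the channel-wise configuration of the project's final architecture; the function name and the example kernel are assumptions.

```python
import numpy as np

def dilate_kernel(kernel, rate):
    """Insert (rate - 1) zeros ("holes") between neighbouring kernel weights."""
    k_h, k_w = kernel.shape
    out = np.zeros(((k_h - 1) * rate + 1, (k_w - 1) * rate + 1), dtype=kernel.dtype)
    out[::rate, ::rate] = kernel
    return out

kernel = np.arange(1, 10).reshape(3, 3)
dilated = dilate_kernel(kernel, rate=2)
```

With a rate of 2, the 3x3 kernel spans a 5x5 area while still holding only nine non-zero weights.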

Fig. Final architecture

Image Similarity

One of the critical points of this project was finding an Image Comparison metric. This metric was used to train the model, to build the histogram, and to calculate the threshold used to classify images as Anomaly or Not Anomaly.

We started with the L2 (Euclidean) distance, computed pixel by pixel, but it failed to identify some of the differences. We then tried the Python ImageHash library with its different hashes (perceptual, average, and difference), but it returned different results for similar images. Finally, we found that the SSIM (Structural Similarity Index Measure) metric gave us a measure of the similarity between a pair of images and, additionally, that it is available in TensorFlow (tf.image.ssim) and can be used as a loss in Keras.
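The difference between the two kinds of metric can be sketched in NumPy. The SSIM below is a simplified, single-window version computed over the whole image, using the usual constants (0.01 and 0.03 times the data range); library implementations such as tf.image.ssim use a sliding window instead, so the values will not match them exactly. The function names and test images are hypothetical.

```python
import numpy as np

def l2_distance(a, b):
    """Euclidean distance between two images, pixel by pixel."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

def global_ssim(a, b, data_range=1.0):
    """SSIM computed once over the whole image (no sliding window)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return float(
        ((2 * mu_a * mu_b + c1) * (2 * cov + c2))
        / ((mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))
    )

rng = np.random.default_rng(0)
image = rng.random((64, 64))                  # hypothetical grayscale image
brighter = np.clip(image + 0.05, 0.0, 1.0)    # same structure, slight light change
scrambled = rng.permutation(image.ravel()).reshape(image.shape)  # structure destroyed
```

In this sketch a small lighting change leaves the SSIM close to 1, while shuffling the pixels drives it toward 0, even though both variants have a non-zero L2 distance from the original.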


After training and evaluating the model with its respective datasets, it was necessary to quantify the similarity between the reconstructed and the original images. Because of the diversity of the original images (e.g. size, position, color, brightness, and other variables), this similarity covered a range of values. We used a histogram to visualize the range and to observe the point beyond which images should be considered Non-Similar.
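One way to turn the histogram into a decision rule is sketched below. The similarity scores are randomly generated stand-ins, and the 1st-percentile cut-off is an assumed rule chosen for illustration, not the threshold used in this project.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical SSIM scores between original and reconstructed images.
normal_scores = np.clip(rng.normal(0.90, 0.03, size=500), 0.0, 1.0)
anomaly_scores = np.clip(rng.normal(0.60, 0.08, size=50), 0.0, 1.0)

# Histogram of the scores for images without anomalies (the visual check).
counts, bin_edges = np.histogram(normal_scores, bins=20, range=(0.0, 1.0))

# One possible rule: flag anything below the 1st percentile of normal scores.
threshold = np.percentile(normal_scores, 1)

def is_anomaly(score):
    """An image is an anomaly when its similarity falls below the threshold."""
    return score < threshold

flagged = is_anomaly(anomaly_scores)
```

Choosing a lower percentile reduces false alarms on normal images at the cost of missing subtle anomalies, so the cut-off is best chosen by inspecting the histogram for each dataset.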

Fig. Example of Histogram


Data

The data used was downloaded from Kaggle: the Surface Crack Detection Dataset (crack dataset) and the casting product image data for quality inspection (casting dataset).

The first, the crack dataset, has 20,000 negative wall images (no cracks) and 20,000 positive images (with cracks). In this case, the cracks were considered anomalies. All images are 227x227 pixels with RGB channels. Examples of each group are shown below.

We used 10,000 images from the group without anomalies to generate different synthetic datasets. The synthetic datasets were of two types: one with noise similar to anomalies (51 images created with Photoshop) and another with noise from random objects such as fruits, plants, and animals (80 free images downloaded from Pixabay). All the images used as noise are in png format with a transparent background. Below are some examples of the two types of datasets used for model training.

The second, the casting dataset, is composed of two groups: one with images of 512x512 pixels (781 images with anomalies and 519 without anomalies) and another with images of 300x300 pixels (3137 without anomalies and 4211 with anomalies). All images have RGB channels. We used the 300x300 pixel images. These images, from Kaggle, were divided into a training set with 91.65% of the data and a test set with the remainder. For this dataset, the anomalies were edge debris, scratches, surface warping, and perforations. Below are some examples of images with and without anomalies.

We used 1,000 images without defects from the training group to generate the synthetic datasets. As in the previous case, we created two types of datasets: one with noise similar to anomalies (51 images created with Photoshop) and the other with noise from random objects such as animals, flowers, and plants (the same 80 images used for the crack dataset). Below are some examples of the images used during model training.

All synthetic data was created using the Flip library. In each generated image, two objects were chosen and placed at random. Three types of transformations were applied to the objects: flip, rotate, and resize. The resulting images were saved in jpg format. The following table shows the datasets used in the project:
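The generation procedure can be approximated with a plain NumPy sketch. This is not the Flip library's API: the function name and parameters are hypothetical, rotation is limited to multiples of 90 degrees, and resizing is done by simple pixel subsampling, all to keep the example short.

```python
import numpy as np

def paste_random_objects(background, objects, n_objects=2, seed=0):
    """Paste randomly transformed RGBA objects at random positions on an RGB background."""
    rng = np.random.default_rng(seed)
    canvas = background.astype(np.float32).copy()
    for _ in range(n_objects):
        obj = objects[int(rng.integers(len(objects)))]
        if rng.random() < 0.5:                        # flip: mirror left-right
            obj = obj[:, ::-1]
        obj = np.rot90(obj, k=int(rng.integers(4)))   # rotate: multiples of 90 degrees
        step = int(rng.integers(1, 3))                # resize: keep every 1st or 2nd pixel
        obj = obj[::step, ::step]
        h, w = obj.shape[:2]
        top = int(rng.integers(0, canvas.shape[0] - h + 1))
        left = int(rng.integers(0, canvas.shape[1] - w + 1))
        alpha = obj[..., 3:4] / 255.0                 # transparency channel of the png
        region = canvas[top:top + h, left:left + w]
        canvas[top:top + h, left:left + w] = alpha * obj[..., :3] + (1 - alpha) * region
    return canvas.astype(np.uint8)

# Hypothetical inputs: a grey 227x227 background and one opaque red 40x20 object.
background = np.full((227, 227, 3), 128, dtype=np.uint8)
red_patch = np.zeros((40, 20, 4), dtype=np.uint8)
red_patch[..., 0] = 255
red_patch[..., 3] = 255
synthetic = paste_random_objects(background, [red_patch])
```

The alpha channel is what lets png objects with a transparent background blend into the scene: fully transparent pixels leave the background untouched.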


Based on the datasets in the tables above, and with the main objective of finding out which variation of the dataset gives the best results, we trained the model with these data and obtained the results shown in the graphs below.

For each dataset, we evaluated several metrics: loss (SSIM), recall, precision, F1, and accuracy. For each experiment, we also evaluated the histogram representing the image similarity between the set of noisy images and the reconstructed images.
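For reference, the classification metrics can be computed from the predicted and true labels as in the sketch below. The labels are made up for illustration and the function name is hypothetical; the anomaly class is treated as the positive class.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1, treating label 1 (anomaly) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical labels: 1 = anomaly, 0 = not anomaly.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

Recall matters most when a missed defect is costly, while precision matters when false alarms interrupt the production line, which is why both are reported alongside accuracy.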

To track and compare our results we used the Weights & Biases library, which provides an easy way to store and compare the results of each experiment.


To keep the number of variables in our environment to a minimum, we decided to always use a dataset of one thousand samples, regardless of the ratio between real data and synthetic data.

In the algorithm, we split the respective dataset into 95% for training and 5% for testing. In addition, our evaluation was carried out only with real data.
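A 95%/5% split of a one-thousand-sample dataset can be sketched with the standard library alone. The file names, the seed, and the function name are placeholders rather than the project's actual code.

```python
import random

def train_test_split(samples, test_fraction=0.05, seed=42):
    """Shuffle `samples` and split them into (train, test) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical file names standing in for the one thousand images per experiment.
dataset = [f"image_{i:04d}.jpg" for i in range(1000)]
train, test = train_test_split(dataset)
```

Fixing the seed keeps the split identical across experiments, so differences in the metrics come from the dataset variation being tested rather than from a different partition.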

Evaluation & Results

The following images show the main results obtained in some experiments. You can find all the results in the following links:

Cracks dataset

Fig. Histogram Cracks

Fig. Anomaly detection in Crack Dataset

For the cracks dataset, the experiments produced similarly excellent results (in the range of 91% to 98%), with no significant differences between them. This behavior is mainly due to variables such as crack size and color in comparison with the images without anomalies.

Casting dataset

Fig. Accuracy and Recall metrics for Casting Dataset

Fig. Histogram Casting E1 & E3

Fig. Anomaly detection in Casting Dataset



Implementing a real Machine Learning project requires several steps, from the initial idea to the implementation of the models. These include dataset selection, collection, and processing.

It is important to have “debugging scripts” in projects that work with images. In our case, we used a script that allowed us to visualize the original dataset, the new synthesized images, and the cleaned images produced by the autoencoder, which enabled us to evaluate the model performance.
