Is Synthetic Data the holy grail of Machine Learning?

April 24, 2020

Paula Villamarin

In today’s world, data is considered an asset and one of the most valuable resources, and truth be told, only a few big players have a strong hold on that currency. The biggest companies in the world are even generous enough to give away machine learning algorithms for free because, in the end, these algorithms are not that valuable without the data that feeds them.

The primary bottleneck in deploying perception models is the creation of training data, because large annotated datasets are scarce. Thanks to the availability of public data, with no more than one hour and a reasonably powerful computer you could train a machine learning model to recognize dog breeds with higher accuracy than most humans. But you are unlikely to find a large labeled dataset containing specific instances in a particular environment. Each new environment with new instances requires its own data collection and annotation, and the manual labor needed to collect and label such data quickly becomes prohibitive. Yet, in recent years, a new data source has emerged, and it’s fundamentally changing the development of machine learning: synthetic data.

What is Synthetic data?

Synthetic data is artificially manufactured rather than generated by real-world events. It is data generated programmatically, and it can help to improve existing datasets. In some cases, it can even be better for training models than data collected from the real world.

You could create photorealistic images of people in random scenes rendered with video games like GTA V, generate thousands of fake customer behavioral profiles, or even generate audio from a given text with a speech synthesis model.
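As a rough illustration of the tabular case, here is a minimal sketch of programmatically generated customer behavioral profiles using only Python’s standard library. The schema (field names, segments, and value ranges) is entirely hypothetical and chosen for illustration; a real generator would model the statistics of the target domain.

```python
import random

# Hypothetical schema: every field name and value range below is an
# illustrative assumption, not taken from a real customer dataset.
SEGMENTS = ["new", "returning", "loyal"]

def make_profile(rng, customer_id):
    """Draw one fake customer behavioral profile."""
    visits = rng.randint(1, 30)         # site visits per month
    purchases = rng.randint(0, visits)  # never more purchases than visits
    return {
        "customer_id": customer_id,
        "segment": rng.choice(SEGMENTS),
        "monthly_visits": visits,
        "monthly_purchases": purchases,
        "avg_basket_usd": round(rng.uniform(5.0, 250.0), 2),
    }

def make_profiles(n, seed=0):
    """Generate n profiles; the seed makes the dataset reproducible."""
    rng = random.Random(seed)
    return [make_profile(rng, i) for i in range(n)]

profiles = make_profiles(1000)
```

Because the generator is seeded, calling `make_profiles` twice with the same arguments yields the same dataset, which is handy when comparing training runs.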

Synthetic Image data

Sample images from Virtual KITTI (first row), and Domain Randomization approach (second row). (pdf)

Synthetic data is an increasingly popular tool for training deep learning models, especially in computer vision. Synthetic image data has a great impact on these models because real-world data comes with its own challenges. Data privacy is one of them. Real-world data is also expensive and time-consuming to obtain, as it has to be captured and then labeled manually. Finally, real-world data is not as perfect as we would like to think: it can be biased toward the environment where it was captured. Synthetic data can help overcome all of these challenges.

How does it work?

Synthetic dataset using Domain Randomization with Segmentation label by LinkedAI

You can use 3D engines to create 3D assets and manipulate them spatially, changing their position, rotation, and lighting. You can also change the texture and the background.

So basically, we can take traditional data augmentation, where flips, rotations, crops, and color variations are used to increase the variety of data, to the next level with synthetic data generation. This allows you to create millions of automatically annotated training samples at scale. Because the system already knows what the object is and where it is, the data is perfectly labeled. In the example above, I created a can of Coke and varied its rotation, lighting, and background to obtain mask labels for the object I wanted to detect.
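To make the "labels come for free" idea concrete, here is a toy sketch in plain Python. It is not the 3D pipeline described above: instead of rendering a 3D asset, it pastes a solid rectangle (a stand-in for the object) onto a random-noise background. The function name `render_sample` and all sizes are assumptions made for illustration. The point it shows is that the generator can emit the mask and bounding box alongside the image, because it chose the object's position itself.

```python
import random

def render_sample(rng, width=32, height=32, obj_w=8, obj_h=12):
    """Paste a solid rectangular 'object' onto a random-noise background.

    Returns the image, a pixel-exact mask, and the bounding box. No manual
    annotation is needed: the generator knows where it placed the object.
    """
    # Background: random grayscale noise stands in for a random scene.
    image = [[rng.randint(0, 255) for _ in range(width)] for _ in range(height)]
    mask = [[0] * width for _ in range(height)]

    # Random placement that keeps the object fully inside the frame.
    x0 = rng.randint(0, width - obj_w)
    y0 = rng.randint(0, height - obj_h)
    brightness = rng.randint(100, 255)  # crude stand-in for lighting variation

    for y in range(y0, y0 + obj_h):
        for x in range(x0, x0 + obj_w):
            image[y][x] = brightness
            mask[y][x] = 1

    # (left, top, right, bottom); right and bottom are exclusive.
    bbox = (x0, y0, x0 + obj_w, y0 + obj_h)
    return image, mask, bbox

rng = random.Random(42)
dataset = [render_sample(rng) for _ in range(100)]
```

Each element of `dataset` is an `(image, mask, bbox)` triple, so the segmentation and detection labels are produced in the same step as the pixels.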

Generating this enormous number of variations, such as swapping the background for completely random scenes and contexts, is called domain randomization. It has recently been proposed as an inexpensive approach that intentionally sets photorealism aside: the environment is randomly perturbed in non-photorealistic ways to force the network to focus on the essential features of the objects it is trying to detect. This has been shown to improve model performance.
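In practice, domain randomization often comes down to sampling a fresh set of scene parameters for every rendered image. The sketch below shows only that sampling step, under stated assumptions: the `SceneParams` fields and their ranges are hypothetical, and the actual rendering (which would be done by a 3D engine) is left out.

```python
import random
from dataclasses import dataclass

@dataclass
class SceneParams:
    """One randomized scene configuration (all fields are hypothetical)."""
    rotation_deg: float     # object rotation around the vertical axis
    texture_rgb: tuple      # flat random color instead of a realistic texture
    background_rgb: tuple   # flat random background color
    light_intensity: float  # arbitrary units: 0.2 is dim, 2.0 is very bright
    num_distractors: int    # unrelated shapes added to clutter the scene

def random_rgb(rng):
    """Pick a uniformly random 8-bit RGB color."""
    return tuple(rng.randint(0, 255) for _ in range(3))

def sample_scene(rng):
    """Draw every scene parameter independently and uniformly at random."""
    return SceneParams(
        rotation_deg=rng.uniform(0.0, 360.0),
        texture_rgb=random_rgb(rng),
        background_rgb=random_rgb(rng),
        light_intensity=rng.uniform(0.2, 2.0),
        num_distractors=rng.randint(0, 10),
    )

rng = random.Random(7)
scenes = [sample_scene(rng) for _ in range(1000)]
```

Nothing here tries to look realistic. The only thing that stays constant from scene to scene is the object itself, and the variety in everything else is what pushes the network to rely on the object's shape.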

This approach has been shown to be successful in tasks such as detecting 3D shapes on a table. The task focused on object localization for robotic manipulation, and it showed that it is possible to train a real-world object detector that is accurate to 1.5 cm using only data from a simulator with non-realistic random textures (pdf).

As noted above, it is possible to train a perception model on entirely synthetic data with great results. But we can also use synthetic data to augment existing real-world datasets, so that the resulting hybrid datasets are better for training models. In this case, the synthetic data is usually used to fill in parts of the data distribution that are underrepresented, which reduces dataset bias.
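One simple way to plan such a hybrid dataset is to count the real samples per class and generate just enough synthetic samples to bring the rare classes up to a target. The sketch below assumes that strategy; the class names and counts are made up for illustration, and balancing every class to the size of the largest one is only one of several reasonable policies.

```python
from collections import Counter

def synthetic_top_up(real_labels, target_per_class=None):
    """Return how many synthetic samples to generate for each class.

    By default, every class is topped up to the size of the largest class
    found in the real data.
    """
    counts = Counter(real_labels)
    target = target_per_class or max(counts.values())
    return {label: max(0, target - n) for label, n in counts.items()}

# Hypothetical real-world dataset with a strong class imbalance.
real_labels = ["car"] * 900 + ["cyclist"] * 80 + ["pedestrian"] * 20

plan = synthetic_top_up(real_labels)
print(plan)  # {'car': 0, 'cyclist': 820, 'pedestrian': 880}
```

The resulting plan leaves the well-represented class untouched and directs the synthetic generation budget toward the classes the real data covers poorly.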

Challenges of Synthetic data

Although synthetic data has proved to have a great impact on Machine Learning models, it isn’t perfect and like everything, it has its limitations:

  1. Synthetic data can be great for detecting simple objects or products, but what about detecting complex objects from nature, such as different kinds of plants, or medical data such as X-rays or MRIs?
  2. As a young approach to improve machine learning, there’s still a lot to do in research to have a better understanding of how it should be applied.
  3. While generating realistic synthetic data has become easier over time, real-world human-annotated data remains a necessary part of machine learning training data.


Synthetic data can be key to democratizing machine learning, making it accessible to more people so they can produce better real-world ML-based solutions.

Also, it can be great to improve existing real-world datasets to reduce bias and tackle edge cases.

Finally, companies are starting to invest more in generating this kind of data. LinkedAI uses a proprietary library to generate customized, elaborate synthetic datasets.

— —

Dealing with small data? Feel free to try LinkedAI Platform or get in touch with us at — our team would love to contribute to your project.
