April 24, 2020
In today’s world, data is considered one of the most valuable resources, and truth be told, only a few big players hold most of it. The biggest companies in the world are even generous enough to give away machine learning algorithms for free because, in the end, these algorithms are not that valuable without the data that feeds them.
What is Synthetic data?
Synthetic data is artificially manufactured rather than generated by real-world events. It is produced programmatically, and it can help improve existing datasets or, in some cases, even be better for training models than data collected from the real world.
You could create photorealistic images of people in random scenes rendered with video game engines or games such as GTA V, generate thousands of fake customer behavioral profiles, or even generate audio from a given text with a speech synthesis model.
Synthetic Image data
Synthetic data is an increasingly popular tool for training deep learning models, especially in computer vision. Synthetic image data has a great impact on these models because real-world data comes with its own challenges. Data privacy is one of them. Real-world data can also be expensive and time-consuming to obtain, since it has to be captured and then labeled manually. Finally, it is not as perfect as we would like to think: it can be biased toward the environment where it was captured. Synthetic data can help overcome all of these challenges.
How does it work?
You can use 3D engines to create 3D assets and manipulate them spatially, changing their position, rotation, and lighting. You can also change the texture and the background.
So basically, we can start from traditional data augmentation, where flips, rotations, crops, and color variations are used to increase the variety of data, and take it to the next level with synthetic data generation. This allows you to create millions of automatically annotated training samples at scale. Because the system already knows what the object is and where it is, the data is perfectly labeled. In the example above, I created a can of Coke and varied its rotation, lighting, and background to obtain mask labels for the object I wanted to detect.
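The idea that labels come for free when the generator places the object itself can be sketched in a few lines. This is only a toy illustration using the Python standard library, not the 3D pipeline described in this post: images are small grayscale grids, `None` marks a transparent object pixel, and the function names `composite` and `random_sample` are hypothetical.

```python
import random


def composite(background, obj, top, left):
    """Paste obj onto a copy of background and return (image, mask).

    background: 2D list of pixel values.
    obj: 2D list of pixel values, where None means "transparent".
    The mask is 1 wherever an object pixel was written, 0 elsewhere.
    """
    image = [row[:] for row in background]
    mask = [[0] * len(background[0]) for _ in background]
    for r, obj_row in enumerate(obj):
        for c, pixel in enumerate(obj_row):
            if pixel is None:
                continue  # transparent pixel: keep the background
            image[top + r][left + c] = pixel
            mask[top + r][left + c] = 1
    return image, mask


def random_sample(obj, height, width, rng):
    """Create one labeled sample: random background, flip, and position."""
    background = [[rng.randint(0, 255) for _ in range(width)]
                  for _ in range(height)]
    if rng.random() < 0.5:
        obj = [row[::-1] for row in obj]  # horizontal flip
    top = rng.randint(0, height - len(obj))
    left = rng.randint(0, width - len(obj[0]))
    return composite(background, obj, top, left)


# A tiny 2x2 "object" with one transparent corner (toy data).
toy_object = [[None, 200],
              [200, 200]]
image, mask = random_sample(toy_object, 8, 8, random.Random(0))
```

Because the mask is written in the same loop that places the object, no manual annotation step is needed, and every call yields a new image with an exact label.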
The enormous number of variations this method can generate, such as swapping the background for completely random scenes and contexts, is the basis of domain randomization. This recently proposed, inexpensive approach intentionally sets photorealism aside: it randomly perturbs the environment in non-photorealistic ways to force the network to focus on the essential features of the objects it is trying to detect, which has been shown to improve model performance.
This approach has proved successful in tasks such as locating 3D shapes on a table. That work focused on object localization for robotic manipulation, and it showed that a real-world object detector accurate to 1.5 cm can be trained using only simulator data with non-realistic random textures (pdf).
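In code, domain randomization mostly comes down to sampling every scene parameter independently before each render. The following is a minimal sketch under stated assumptions: the `SceneParams` fields, their ranges, and the `sample_scene` helper are hypothetical, and the rendering step itself (which would happen in a 3D engine) is left out.

```python
import random
from dataclasses import dataclass


@dataclass
class SceneParams:
    """Hypothetical set of knobs a renderer could expose."""
    rotation_deg: float      # object rotation around the vertical axis
    light_intensity: float   # relative brightness of the light source
    texture_rgb: tuple       # flat, non-realistic object color
    background_id: int       # index into a pool of random backgrounds


def sample_scene(rng, num_backgrounds=100):
    """Draw one random scene; the ranges are illustrative assumptions."""
    return SceneParams(
        rotation_deg=rng.uniform(0.0, 360.0),
        light_intensity=rng.uniform(0.2, 2.0),
        texture_rgb=tuple(rng.randint(0, 255) for _ in range(3)),
        background_id=rng.randrange(num_backgrounds),
    )


rng = random.Random(42)  # fixed seed so the dataset can be regenerated
scenes = [sample_scene(rng) for _ in range(1000)]
```

Each `SceneParams` instance would then be handed to the renderer, so that no two training images share the same combination of pose, lighting, texture, and background.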
As mentioned before, it is possible to train a perception model entirely on synthetically generated data with great results. But we can also use synthetic data to augment existing real-world datasets so that the resulting hybrid datasets are better for training models. In this case, the synthetic data is usually used to fill in parts of the data distribution that are underrepresented, which reduces dataset bias.
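One simple way to build such a hybrid dataset is to count the real samples per class and generate only as many synthetic samples as each underrepresented class is missing. The sketch below assumes a classification-style dataset of `(data, label)` pairs; `synthetic_quota`, `build_hybrid`, and the `generate` callback are hypothetical names, and the callback stands in for whatever synthetic generator is available.

```python
from collections import Counter


def synthetic_quota(real_labels, target_per_class=None):
    """Return how many synthetic samples each class needs.

    By default every class is topped up to the size of the largest one.
    """
    counts = Counter(real_labels)
    if target_per_class is None:
        target_per_class = max(counts.values())
    return {label: max(0, target_per_class - n)
            for label, n in counts.items()}


def build_hybrid(real_samples, generate):
    """Combine real samples with just enough synthetic ones per class.

    real_samples: list of (data, label) pairs.
    generate: callable taking a label and returning one synthetic sample.
    Returns a list of (data, label, source) triples.
    """
    quota = synthetic_quota([label for _, label in real_samples])
    hybrid = [(data, label, "real") for data, label in real_samples]
    for label, missing in quota.items():
        hybrid.extend((generate(label), label, "synthetic")
                      for _ in range(missing))
    return hybrid


# Toy data: 5 real "can" images but only 2 real "bottle" images.
real = [("img", "can")] * 5 + [("img", "bottle")] * 2
hybrid = build_hybrid(real, lambda label: "synthetic_" + label)
```

Keeping the `source` tag on every sample makes it easy to check later how the model performs on real data alone.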
Challenges of Synthetic data
Although synthetic data has proved to have a great impact on machine learning models, it isn’t perfect and, like everything, it has its limitations:
Synthetic data can be key to democratizing machine learning and making it accessible to more people, so they can produce better real-world ML-based solutions.
It can also be a great way to improve existing real-world datasets, reduce bias, and tackle edge cases.
Finally, more companies are starting to generate this kind of data. LinkedAI uses a proprietary library to generate customized, elaborate synthetic datasets.
Dealing with small data? Feel free to try LinkedAI Platform or get in touch with us at email@example.com — our team would love to contribute to your project.