Synthetic Data
As AI and machine learning (ML) become ubiquitous in business and consumer applications, the role of data used in training and evaluation becomes critical. Ideally, we would prefer to use real-world data, as it best describes what we want to work with. However, due to privacy concerns, limited availability, or cost of acquisition and annotation, many companies turn to generated data, or data that is synthetically produced using computers.
What is synthetic data?
Synthetic data is computer-generated to mimic real-world phenomena. It is designed to accurately represent the traits that are within scope, while introducing randomness into all other attributes so that the ML model can generalise and focus on the relevant concerns. It is typically produced for a specific purpose, with the scope managed strictly to keep production costs down.
Synthetic data is often used in conjunction with real-world data, either as a complement or to augment it, enriching the represented scenarios to match needs. Using computers, we can simulate any type of data: the usual formats like tabular, text, images, audio or video, but also more exotic types such as depth, IR or near-IR images, X-rays, digital signals captured by antennas, etc.
With synthetic data generation, businesses can produce large volumes of diverse training data representative of real-world scenarios. This enables data scientists to develop and refine their ML models, resulting in better predictions and outcomes.
Synthetic data is used not only for AI models but also to validate statistical formulations or anonymise subjects. For example, in medical research, synthetic data can simulate patient data to develop and test new treatments or interventions while maintaining patient confidentiality, and software developers have been using it for decades for multiple types of testing.
Benefits of using synthetic data
Synthetic data is becoming increasingly important for several reasons, including its potential to overcome limitations associated with real-world data, such as privacy concerns, bias and cost.
Synthetic data offers several advantages, including:
- Pre-labelling and highly accurate data. Synthetic data annotations are generated automatically, eliminating the need for humans to annotate each image, sentence or audio file. These annotations are also more accurate, complete and richer in detail.
- Rare data. In some scenarios, capturing real-world data is difficult or even impossible. Here, synthetic data is the only way to proceed; it is useful for training, but even more so for validation.
- Avoiding regulatory issues involved in handling personal data. Synthetic data helps companies analyse personal data, such as healthcare records, financial data and web content, protected by privacy and copyright law.
- Fewer biases. Because synthetic data is designed with the desired distributions, we can reduce the bias in the data used to train models. By contrast, when generating data with generative adversarial networks (GANs), the model learns any biases present in the sample data, so those biases will remain or even grow.
- Privacy protection and data security. Synthetic data provides a better way to anonymise data. Instead of manipulating an existing dataset, synthetic data is generated from scratch without containing one-to-one relationships with the original data subjects, eliminating the risk of re-identification.
- Reducing costs. Although synthetic data has higher upfront costs, the marginal cost falls as the volume of generated data grows, making it a cost-effective option in the long run.
- Scalability. Synthetic data can be generated in large volumes, providing more opportunities for testing and training machine learning models.
- Diversity of data. By generating synthetic data, businesses can test their models and systems across different scenarios and conditions.
- Feature relevancy. When data is engineered, we can ‘design in/out’ a certain trait of the data and see its impact on the models' performance. This is useful for improving the efficiency of training, explaining it and performing more detailed ablation studies.
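Several of the benefits above — automatic pre-labelling, designed-in distributions and feature control — can be illustrated with a minimal sketch. The schema, class names and value ranges below are hypothetical, chosen only to show that each record is born with its label attached and that the class balance is a design choice rather than an accident of collection:

```python
import random

# Hypothetical generator: each record is created with its label known
# up front, so annotation is automatic, and the class distribution is
# chosen by design rather than inherited from a biased sample.
def generate_record(label, rng):
    # Assumed toy schema: numeric features whose ranges depend on the class.
    base = 10.0 if label == "defect" else 5.0
    return {
        "feature_a": base + rng.uniform(-1, 1),
        "feature_b": rng.gauss(0, 1),
        "label": label,  # ground-truth annotation comes for free
    }

def generate_dataset(n, class_weights, seed=42):
    rng = random.Random(seed)
    labels = list(class_weights)
    weights = [class_weights[l] for l in labels]
    return [generate_record(rng.choices(labels, weights)[0], rng)
            for _ in range(n)]

# Design in a 50/50 class balance that real-world capture might never yield.
data = generate_dataset(1000, {"defect": 0.5, "ok": 0.5})
```

Because the generator controls every attribute, 'designing out' a feature for an ablation study is as simple as removing it from the record.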
Synthetic data generation techniques
Depending on the type of data that needs to be produced, different techniques are used. The approaches fall into three categories: (1) replicating a small set of real-world samples while introducing variations to create new data points that are similar yet distinct; (2) simulation, where a computer representation of a natural-world phenomenon or object captures the aspects we find relevant; and (3) using foundation models. We have found that solving real-world business problems usually means combining several, if not all, of these techniques to generate synthetic data that is good enough while remaining efficient and cost-effective.
When some real-world data already exists but is not enough for an AI/ML model to learn the patterns, we can create more data by analysing the existing correlations or statistical properties of the existing sample data and developing new data points that match the same patterns but are distinct from the existing ones. Data scientists can do this manually or by using specialised ML/DNN models, adversarial networks or variational autoencoders (VAEs). This is usually suited to tabular data, but good results are obtained on visual data, too.
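As a minimal sketch of this first approach (assuming purely numeric tabular data), we can estimate the mean vector and covariance matrix of a small real sample and then draw new rows from a multivariate normal distribution with the same first- and second-order statistics. Dedicated generators (GANs, VAEs, copulas) capture much richer structure, but the principle is the same; the sample below is synthetic stand-in data, not a real dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the small real-world sample (100 rows, 3 numeric columns).
real = rng.normal(loc=[170.0, 70.0, 40.0], scale=[10.0, 8.0, 5.0],
                  size=(100, 3))

# Estimate the statistical properties of the existing data.
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Draw 10x more synthetic rows matching the estimated statistics;
# each row is new, yet follows the same patterns as the original sample.
synthetic = rng.multivariate_normal(mean, cov, size=1000)
```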
Another solution is to create representations of real-world objects or phenomena, simulate looks and behaviour using computers, and capture the desired data using the simulation software. These virtual worlds must be able to introduce variation programmatically to obtain enough variety in the output dataset for the model to learn the desired capability.
Depending on the type of data, different software tools are used; for visual data, VFX or game engines are the natural choice, while proper CAD tools, like Ansys or Dassault, can be used to simulate various physical phenomena to capture simulated sensor data.
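The simulation approach can be sketched without a full VFX or CAD pipeline. Below, a tiny physics model of a damped oscillation stands in for the 'virtual world'; each run randomises the scene parameters programmatically to give the output dataset variety, and the parameters themselves double as free ground-truth labels. The model and parameter ranges are illustrative assumptions:

```python
import math
import random

# Toy "virtual world": a damped sinusoid standing in for a simulated
# sensor signal captured from a physical model.
def simulate_signal(amplitude, frequency, damping, n_samples=200, dt=0.01):
    return [
        amplitude * math.exp(-damping * i * dt)
        * math.sin(2 * math.pi * frequency * i * dt)
        for i in range(n_samples)
    ]

def generate_dataset(n_runs, seed=7):
    rng = random.Random(seed)
    runs = []
    for _ in range(n_runs):
        # Programmatic variation: every run randomises the scene parameters.
        params = {
            "amplitude": rng.uniform(0.5, 2.0),
            "frequency": rng.uniform(1.0, 5.0),
            "damping": rng.uniform(0.1, 1.0),
        }
        # The parameters are the ground-truth labels, captured for free.
        runs.append({"signal": simulate_signal(**params), "labels": params})
    return runs

dataset = generate_dataset(100)
```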
Foundation models have shown the ability to produce realistic-looking data matching a textual description, or guided by other means such as sample data, so generative models can also be tapped to produce data for training other models. This relies on their ability to learn general representations of our world which, guided by prompts and context windows, can generate new data. The approach is typically used for text, voice and images, most often in scenarios where labelled data is not needed.
Ground-truth annotations can in theory be produced, but doing so means building customised solutions for specific scenarios, and there are no established best practices yet. General audio (music) and video generation have made good progress lately, but no publicly available models can produce good-enough data at this time.
Types of uses for synthetic data
The primary use of synthetic data is training ML/AI models; when this is done, the best practice is to evaluate the model’s performance on real-world data. The opposite is also true: models trained on real-world data can be tested and/or validated on synthetic (or partially synthetic) data.
Two main types of synthetic data are used: full synthetic and hybrid, where synthetic data is used together with real-world data.
Full synthetic data
A full synthetic dataset is entirely generated and will not contain any real-world data. The advantage of this solution is that the data can be engineered to meet specific requirements upfront. This way, bias, imbalance or ethical issues can be avoided. The disadvantage is that the cost is higher as it requires more effort to create realism.
Hybrid data
Sometimes, we choose to combine real-world data with synthetic data. This may be done for several reasons, such as increasing realism by augmenting synthetic data with real-world backgrounds, or creating specific scenarios that are not represented in the real-world data by adding synthetic examples.
At other times, it is not about altering data points but complementing a smaller volume of real-world captured data with synthetic data to cover specific edge cases or increase volume.
What is the difference between synthetic data and data augmentation, simulated data and data masking?
Data augmentation, simulated data and masking refer to data science and machine learning techniques for enhancing or modifying datasets.
Here's a detailed explanation of each:
Synthetic data
Synthetic data is artificially generated data that mimics real-world data but is created from scratch using algorithms or other software tools. It is designed to have similar properties to real data without being an exact copy.
Simulated data
Simulated data is generated by simulating real-world processes or systems, often based on mathematical or physical models. It reflects theoretical scenarios or experiments that may not be practical to conduct.
Data augmentation
Data augmentation involves creating new data points from existing data by applying various transformations. This technique is widely used to increase the size and variability of training datasets, particularly in image and text processing.
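A minimal image-augmentation sketch: new training examples are derived from an existing one by flipping, rotating and adding noise. Libraries such as torchvision or albumentations offer far richer transform pipelines, but the idea is identical; the random array below is a stand-in for a real training image:

```python
import numpy as np

def augment(image, rng):
    # Each transformation yields a new, valid training example
    # derived from the original.
    return [
        np.fliplr(image),   # horizontal flip
        np.flipud(image),   # vertical flip
        np.rot90(image),    # 90-degree rotation
        np.clip(image + rng.normal(0, 0.05, image.shape), 0, 1),  # noise
    ]

rng = np.random.default_rng(1)
image = rng.random((32, 32))     # stand-in for a real training image
augmented = augment(image, rng)  # 4 new examples from 1 original
```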
Data masking
Data masking involves altering data to protect sensitive information while preserving its usability for development, testing and analysis.
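As a minimal masking sketch, direct identifiers can be replaced with stable pseudonyms (so joins across tables still work) while free-text emails are redacted. The field names and salt below are hypothetical:

```python
import hashlib
import re

def pseudonymise(value, salt="demo-salt"):
    # The same input always yields the same token, so masked records
    # remain joinable across tables without exposing the identity.
    return hashlib.sha256((salt + value).encode()).hexdigest()[:12]

def mask_record(record):
    masked = dict(record)
    masked["name"] = pseudonymise(record["name"])
    masked["email"] = re.sub(r"\S+@\S+", "[REDACTED]", record["email"])
    return masked  # non-sensitive fields stay usable for analysis

record = {"name": "Jane Doe", "email": "jane@example.com", "spend": 120.5}
masked = mask_record(record)
```

Note the design trade-off: unlike fully synthetic data, masked records still map one-to-one onto real subjects, so re-identification risk is reduced but not eliminated.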
Synthetic data use cases
Various industries and sectors can benefit from using synthetic data. From healthcare to fraud detection systems, synthetic data has applications almost everywhere.
Synthetic data in healthcare
In a highly regulated industry, synthetic data can help researchers and practitioners access valuable insights without violating patient privacy.
Synthetic data in finance
Synthetic data can be used to model and predict financial trends, test trading algorithms and ensure regulation compliance.
Synthetic data in retail and marketing
Businesses can use synthetic data to optimise pricing strategies, understand customer behaviour and enhance marketing automation.
Synthetic data in automotive
Synthetic data is critical in developing in-cabin sensing or self-driving vehicles, as it allows for extensive testing and validation on a wide variety of scenarios before real-world testing.
Synthetic data in manufacturing and supply chain
Cameras are already used to monitor manufacturing facilities. The same cameras, or other specialised ones, can be used to find defects or monitor safety, yet the data needed to train and validate AI for such scenarios is scarce or may not exist at all.
Synthetic data in technology
Synthetic data is commonly leveraged to accelerate hardware development, enhance machine learning models and scale product innovation and efficiency.
Synthetic data in robotics
Synthetic data can enhance algorithm training, improve sensor accuracy, refine autonomous navigation and accelerate the development of advanced functionalities.
AI-generated synthetic data is set to revolutionise how we share, use and build datasets. Synthetic data is an industry-agnostic, enterprise technology solution used across various fields for data testing, validation and protection.
Further reading
Check out these resources to learn more about synthetic data and its role in people-centric innovation.
The Spark That Drives Machine Learning To Shine
Highlighting the importance of synthetic data in improving ML. Mimicking real-world phenomena, synthetic data is expected to lead AI models by 2030. Synthetic data addresses challenges like data scarcity, privacy and bias, proving essential for effective ML applications.
Read the article
How AI Can Support Customer-Centricity
In this presentation from SIGGRAPH2023, Technical Art Director Jon Hanzelka and Senior Technical Artist Jacob Berrier present ‘Beyond Visible Light: Generating Synthetic Data in Unique Spectrums’, which explores how synthetic data can help ML overcome some of the challenges it faces and how it can innovate processes such as dental X-ray imaging.
Watch the video
Synthetic Data and AI: An in-depth dive into model training
Synthetic data is rapidly becoming a cornerstone in the domain of machine learning. With the increasing complexities of real-world applications, it's imperative to have a consistent supply of high-quality training data that can adapt to this rapidly evolving space. And while traditional data acquisition methods are riddled with challenges ranging from biases to privacy concerns, synthetic data stands out as a viable alternative.
Read the whitepaper