Synthetic data: Definition, types, use cases, benefits

Synthetic Data

As AI and machine learning (ML) become ubiquitous in business and consumer applications, the role of data used in training and evaluation becomes critical. Ideally, we would prefer to use real-world data, as it best describes what we want to work with. However, due to privacy concerns, limited availability, or the cost of acquisition and annotation, many companies turn to generated data, that is, data that is synthetically produced using computers.

What is synthetic data?

Synthetic data is computer-generated to mimic real-world phenomena. It is designed to represent the specified traits that are considered within scope, ensuring these characteristics are accurately covered, while introducing randomness to all other attributes. This allows the ML model to generalise and focus on the relevant concerns. Typically, it is produced for a specific purpose, and the scope is managed strictly to minimise production costs.

Synthetic data is often used in conjunction with real-world data, either to complement it or to augment it, making the represented scenario as complex as the task requires. Using computers, we can simulate any type of data, including the usual formats like tabular, text, images, audio or video, but also more exotic types like depth, infrared (IR) or near-IR images, X-rays and digital signals captured by antennas.

With synthetic data generation, businesses can produce large volumes of diverse training data representative of real-world scenarios. This enables data scientists to develop and refine their ML models, resulting in better predictions and outcomes.

Synthetic data is used not only for AI models but also to validate statistical formulations or anonymise subjects. For example, in medical research, synthetic data can simulate patient data to develop and test new treatments or interventions while maintaining patient confidentiality. Software developers have also been using it for decades for multiple types of testing.

Benefits of using synthetic data

Synthetic data is becoming increasingly important for several reasons, including its potential to overcome limitations associated with real-world data, such as privacy concerns, bias and cost.

Synthetic data offers several advantages, including:

Pre-labelling and highly accurate data. Synthetic data annotations are generated automatically, eliminating the need for humans to annotate each image, sentence or audio file. These annotations are also more accurate, complete and richer in detail.
Rare data. There are scenarios where capturing real-world data is difficult or even impossible. In these cases, synthetic data is the only way to proceed; it is useful for training and, more importantly, for validation.
Avoiding regulatory issues involved in handling personal data. Synthetic data lets companies run analyses that would otherwise require data protected by privacy and copyright law, such as healthcare records, financial data and web content.
Fewer biases. Because we design synthetic data with desired distributions, we can reduce the bias in the data used to train models. This does not hold for every generation method: when generative adversarial networks (GANs) are used, the model learns the existing biases in the sample data, and those biases will remain or likely grow.
Privacy protection and data security. Synthetic data provides a better way to anonymise data. Instead of manipulating an existing dataset, synthetic data is generated from scratch and contains no one-to-one relationships with the original data subjects, which greatly reduces the risk of re-identification.
Reducing costs. Synthetic data is expensive to set up and has higher upfront costs, but as the volume of generated data grows, the cost of each additional sample is small, making it a cost-effective option in the long run.
Scalability. Synthetic data can be generated in large volumes, providing more opportunities for testing and training machine learning models.
Diversity of data. By generating synthetic data, businesses can test their models and systems across different scenarios and conditions.
Feature relevancy. When data is engineered, we can ‘design in’ or ‘design out’ a certain trait of the data and see its impact on the model's performance. This is useful for improving the efficiency of training, explaining the model's behaviour and performing more detailed ablation studies.
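To make the points about pre-labelling and designed distributions concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a production generator: the function name, the two class names and the per-class feature centres are all hypothetical. The class mix is chosen by design, and every sample receives its label at the moment it is generated, so no manual annotation step is needed.

```python
import random

# Hypothetical per-class feature centres; in a real project these would
# come from the scenario design, not from this sketch.
CENTRES = {"normal": (0.0, 0.0), "rare_event": (3.0, 3.0)}

def generate_labelled_samples(n, class_weights, seed=0):
    """Generate n synthetic (features, label) pairs with a designed class mix."""
    rng = random.Random(seed)
    labels = list(class_weights)
    weights = [class_weights[label] for label in labels]
    samples = []
    for _ in range(n):
        # The label is drawn first, so the annotation is known by construction.
        label = rng.choices(labels, weights=weights, k=1)[0]
        cx, cy = CENTRES[label]
        # All other attributes are randomised around the class centre.
        features = (rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
        samples.append((features, label))
    return samples

# Design a balanced dataset, even if the rare event is scarce in reality.
dataset = generate_labelled_samples(1000, {"normal": 0.5, "rare_event": 0.5})
counts = {}
for _, label in dataset:
    counts[label] = counts.get(label, 0) + 1
print(counts)
```

Changing the weights, or removing a class altogether, is one simple way to 'design in' or 'design out' a trait and then compare model performance across the resulting datasets.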

Synthetic data generation techniques

Different techniques apply depending on the type of data that needs to be produced. The approaches fall into three categories: (1) replicating a small set of real-world samples while introducing variations to create new data points that are similar yet distinct; (2) simulation, where we build a computer representation of a natural phenomenon or object and capture the aspects we find relevant; and (3) using foundation models. We have found that, to solve real-world business problems, we usually combine several of these techniques, if not all, to generate synthetic data that is good enough but also efficient and cost-effective.

When some real-world data already exists but is not enough for an AI/ML model to learn the patterns, we can create more data by analysing the correlations or statistical properties of the sample data and developing new data points that match the same patterns but are distinct from the existing ones. Data scientists can do this manually or by using specialised ML/DNN models, such as generative adversarial networks (GANs) or variational autoencoders (VAEs). This approach is usually suited to tabular data, but good results are obtained on visual data, too.
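As a rough illustration of the statistical route, the sketch below fits a two-column Gaussian model to a handful of records and samples new rows that preserve the means, spreads and correlation of the originals. This is a deliberately simple stand-in for the GAN or VAE models mentioned above, not an implementation of them. The sample values and the function names are hypothetical, and the Gaussian assumption is ours; real tabular data often needs a richer model.

```python
import math
import random
import statistics

def fit_gaussian_2d(rows):
    """Estimate means, standard deviations and correlation of (x, y) rows."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    std_x, std_y = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (len(rows) - 1)
    rho = cov / (std_x * std_y)
    return mean_x, mean_y, std_x, std_y, rho

def sample_gaussian_2d(params, n, seed=0):
    """Draw n new (x, y) rows that follow the fitted statistics."""
    mean_x, mean_y, std_x, std_y, rho = params
    rng = random.Random(seed)
    # Guard against tiny negative values caused by floating-point rounding.
    residual = math.sqrt(max(0.0, 1.0 - rho * rho))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x = mean_x + std_x * z1
        # Mixing z1 into y reproduces the fitted correlation between columns.
        y = mean_y + std_y * (rho * z1 + residual * z2)
        rows.append((x, y))
    return rows

# Hypothetical sample: (age, income in thousands) for six people.
real_rows = [(23, 31.0), (35, 48.5), (41, 52.0), (29, 39.0), (52, 61.5), (47, 58.0)]
params = fit_gaussian_2d(real_rows)
synthetic_rows = sample_gaussian_2d(params, 5000)
```

The generated rows are distinct from the six originals, yet refitting the model on them should recover approximately the same statistics, which is a quick sanity check before using such data for training.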

Another solution is to create representations of real-world objects or phenomena, simulate looks and behaviour using computers, and capture the desired data using the simulation software. These virtual worlds must be able to introduce variation programmatically to obtain enough variety in the output dataset for the model to learn the desired capability.

Different software tools are used depending on the type of data. For visual data, VFX or game engines are the natural choice, while CAD and engineering simulation tools, such as those from Ansys or Dassault Systèmes, can be used to simulate various physical phenomena and capture simulated sensor data.
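The idea of introducing variation programmatically can be shown at toy scale without any of those tools. The sketch below is an assumption-laden example, not how a game engine or engineering simulator works internally: it models a hypothetical vibration sensor as a damped sine wave, randomises the physical parameters for every trace, and records those parameters as ground truth. The function names and parameter ranges are invented for illustration.

```python
import math
import random

def simulate_vibration_trace(rng, n_steps=200, dt=0.01):
    """Simulate one noisy sensor trace and return it with its ground truth."""
    # Randomised scene parameters: this is the programmatic variation.
    amplitude = rng.uniform(0.5, 2.0)
    frequency_hz = rng.uniform(1.0, 10.0)
    damping = rng.uniform(0.1, 2.0)
    noise_std = rng.uniform(0.01, 0.1)
    trace = []
    for i in range(n_steps):
        t = i * dt
        clean = amplitude * math.exp(-damping * t) * math.sin(2 * math.pi * frequency_hz * t)
        # Additive Gaussian noise stands in for sensor imperfections.
        trace.append(clean + rng.gauss(0.0, noise_std))
    ground_truth = {
        "amplitude": amplitude,
        "frequency_hz": frequency_hz,
        "damping": damping,
    }
    return trace, ground_truth

def generate_dataset(n_traces, seed=0):
    """Produce n_traces (trace, ground_truth) pairs from one seeded generator."""
    rng = random.Random(seed)
    return [simulate_vibration_trace(rng) for _ in range(n_traces)]

sensor_dataset = generate_dataset(100)
```

Because the simulation knows the parameters it used, each trace arrives already annotated, and widening or narrowing the parameter ranges controls how much variety the model sees.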

Foundation models have shown the ability to produce realistic-looking data that matches a textual description or is guided by other means, e.g. sample data. Generative models can therefore also be used to produce data for training other models.

This relies on their ability to learn general representations of our world which, guided by prompts and context windows, can generate new data. The approach is usually used for text, voice and images, most often in scenarios where there is no need for labelled data.

Ground truth annotations can theoretically be produced, but doing so means creating customised solutions for specific scenarios, and there are no established best practices yet. General audio (music) and video generation has made good progress lately, but there are no publicly available models able to produce good enough data at this time.

Hybrid data

Sometimes, we choose to combine real-world data with synthetic data. This may be done for several reasons, such as increasing realism by augmenting synthetic data with real-world backgrounds, or creating specific scenarios that are not represented in the real-world data by adding synthetic examples.

At other times, it is not about altering data points but complementing a smaller volume of real-world captured data with synthetic data to cover specific edge cases or increase volume.
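A minimal sketch of this complementing approach is shown here, assuming a hypothetical temperature-reading dataset in which the 'overheat' edge case is under-represented. The function names, field names and values are all invented for illustration. Real records are kept untouched, synthetic ones are added only where a class falls short of its target, and every record is tagged with its origin so the two sources can be separated again later.

```python
import random
from collections import Counter

def top_up_with_synthetic(real_records, make_synthetic, target_per_class, seed=0):
    """Return real records plus enough synthetic ones to reach each class target."""
    rng = random.Random(seed)
    combined = [dict(record, source="real") for record in real_records]
    counts = Counter(record["label"] for record in real_records)
    for label, target in target_per_class.items():
        # Only fill the gap; classes already at or above target get nothing added.
        for _ in range(max(0, target - counts[label])):
            record = make_synthetic(label, rng)
            record["source"] = "synthetic"
            combined.append(record)
    return combined

def make_synthetic_reading(label, rng):
    """Hypothetical generator: readings cluster around a per-class centre."""
    centre = 20.0 if label == "normal" else 95.0
    return {"value": rng.gauss(centre, 2.0), "label": label}

real = [
    {"value": 19.5, "label": "normal"},
    {"value": 21.2, "label": "normal"},
    {"value": 20.3, "label": "normal"},
    {"value": 96.1, "label": "overheat"},
]
hybrid = top_up_with_synthetic(real, make_synthetic_reading, {"normal": 3, "overheat": 3})
```

Keeping the source tag makes it possible to evaluate a model on real data only, which helps reveal whether the synthetic portion actually improves performance on the edge case.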
