Evaluating the benefits, costs and utility of synthetic data

About the project

Background

The growing discourse around synthetic data underscores its potential not only for addressing data challenges in a rapidly changing landscape but also for fostering innovation and accelerating advances in data analytics and artificial intelligence. From optimising data sharing and utility, to sustaining and promoting reproducibility, to mitigating disclosure risk, synthetic data has emerged as a solution to various complexities of the data ecosystem.

The funding opportunity

The Economic and Social Research Council, through the UKRI Digital Research Infrastructure Fund and Administrative Data Research UK (ADR UK), aimed to collect evidence and insights on the advantages, financial implications, and practicality of low-fidelity synthetic data through the lens of three key stakeholder groups.

This funding supported three interrelated projects, each tailored to explore synthetic data from distinct viewpoints:

  • Researchers: investigating the impact of accessible synthetic data on research methodologies and outcomes.
  • Data owners and Trusted Research Environments (TREs): examining the benefits, costs, and logistics involved in providing synthetic data.
  • The public: understanding public perceptions and attitudes towards synthetic data usage.

The collective findings from these projects were expected to guide the creation of strategies for the broad-scale production and dissemination of synthetic data. Such strategies should be both efficient and aligned with the expectations and ethical considerations of the public, data owners, and the research community.

Balancing the data scales: A cost-benefit analysis of low-fidelity synthetic data for data owners and providers

The UK Data Service secured funding for the second project of the initiative, focusing on data owners and TREs.

The project proposed a mixed-methods approach centred on three primary goals:

  1. To evaluate the comprehensive costs incurred by data owners and TREs in the creation and ongoing maintenance of low-fidelity synthetic data, from initial production through subsequent upkeep.
  2. To assess the various models of synthetic data sharing, evaluating the implications and efficiencies for data owners and TREs, covering all aspects from pre-ingest to curation procedures, metadata sharing, and data discoverability.
  3. To measure the efficiency improvements for data owners and TREs when synthetic data is available, analysing the impacts on resources, the load on secure environments, and researchers' relative uptake of synthetic versus real datasets.

Project team

Principal Investigator: Cristina Magder
Co-Investigators: Maureen Haaker, Jools Kasmire, Hina Zahid
Researcher: Melissa Ogwayo

Project duration

8 April 2024 – 8 April 2025

Project work packages

The project comprised four work packages:

  1. Conduct a literature review aimed at defining the existing body of knowledge surrounding the creation and dissemination of synthetic data.
  2. Evaluate the perception, readiness, and potential risk aversion of data owners towards the adoption and dissemination of low-fidelity synthetic data.
  3. Engage with a diverse cohort of data owners that already produce and disseminate synthetic data to help describe the existing frameworks, sharing mechanisms, and cost structures associated with low-fidelity synthetic data.
  4. Delve deeper into the operational dimensions of synthetic data usage within secure environments to understand the practical implications, challenges, and opportunities that arise when integrating synthetic data into established workflows.

Recommendations

The project helped develop a set of sequenced recommendations to support the responsible and scalable use of synthetic data in research. These are based on evidence gathered through the four work packages.

You can read more about these recommendations in the final report “The Role of Synthetic Data in Research: Benefits, Costs, and Practical Insights from Data Owners and Trusted Research Environments Experts – A Project Report with Practical Recommendations”.

There is an urgent need for clear, practical guidance on the legal status of synthetic data.

We recommend that policymakers and regulators collaborate with data controllers, data owners, statisticians, and legal experts to develop clear and practical guidance on the legal and governance aspects of synthetic data. This should include use-case examples, licensing guidance, and clarification of whether different levels of data fidelity entail different legal considerations.

Organisations often lack the sustained resources and expertise to develop and maintain synthetic data.

We recommend that funders support long-term investment in the skills, roles, and infrastructure needed to integrate synthetic data into organisational workflows. This could include training programmes for data scientists, analysts, and TRE managers; support for dedicated synthetic data specialist roles or cross-organisational secondments; infrastructure investment (e.g. secure computing environments and software licences); and funding or capacity grants to support pilot work, especially in smaller organisations.

The absence of consistent practices undermines confidence and usability.

We recommend that data creators and TREs actively develop and adopt shared models and quality standards for synthetic data. This entails establishing a quality checklist that rigorously assesses fidelity, structure, and utility, and adopting clear, standardised documentation and metadata guidelines that transparently explain how each dataset was generated and should be used. We further recommend using harmonised disclaimers and labels to unambiguously indicate the synthetic nature and any limitations of the data.
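To make this recommendation more concrete, the sketch below illustrates, in Python, one possible shape for a machine-readable metadata record carrying a harmonised disclaimer and a simple completeness checklist. It is purely illustrative: the field names, checklist items, and disclaimer wording are assumptions made for this example and are not drawn from the project's outputs or any published standard.

    from dataclasses import dataclass, field

    # Illustrative sketch only: field names and values are assumptions for
    # demonstration, not an established documentation standard.

    DISCLAIMER = (
        "This dataset is synthetic. It mirrors the structure of the source "
        "data, but individual records do not correspond to real people."
    )

    @dataclass
    class SyntheticDatasetRecord:
        """Minimal metadata describing how a synthetic dataset was generated
        and how it should (and should not) be used."""
        dataset_id: str
        source_collection: str            # the real collection it mirrors
        fidelity_level: str               # e.g. "low" (structural only)
        generation_method: str            # e.g. "column-wise sampling"
        intended_uses: list[str] = field(default_factory=list)
        known_limitations: list[str] = field(default_factory=list)
        disclaimer: str = DISCLAIMER

        def quality_checklist(self) -> dict[str, bool]:
            """Simple completeness checks standing in for a fuller
            fidelity/structure/utility assessment."""
            return {
                "has_generation_method": bool(self.generation_method),
                "has_intended_uses": bool(self.intended_uses),
                "has_known_limitations": bool(self.known_limitations),
                "has_harmonised_disclaimer": self.disclaimer == DISCLAIMER,
            }

    if __name__ == "__main__":
        record = SyntheticDatasetRecord(
            dataset_id="synth-demo-001",
            source_collection="Hypothetical Survey Collection",
            fidelity_level="low",
            generation_method="column-wise sampling, no cross-variable relationships",
            intended_uses=["code development", "training", "query testing"],
            known_limitations=["not suitable for statistical inference"],
        )
        print(record.quality_checklist())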

Misunderstanding and mistrust are key barriers.

We recommend that the research community takes proactive steps to enhance awareness, training and public engagement regarding synthetic data. Specifically, we recommend developing targeted training modules that equip researchers with a clear understanding of when and how to use synthetic data responsibly, ensuring they can accurately assess its suitability for varied tasks. In addition, we recommend establishing public engagement initiatives to dispel misconceptions about synthetic data.

These steps are intended to build confidence, capacity, and consistency across the UK research infrastructure. A phased approach, starting with legal clarity, followed by resourcing, tools, and outreach, will help synthetic data move from niche innovation to trusted, widely adopted practice.

Project outputs

  • Knowledge exchange and training: The team co-organised a hands-on workshop with the Ministry of Justice, "From Discovery to Analysis", showcasing the use of synthetic data from the Data First Linked Criminal Justice collection. The project also supported the Population Research UK (PRUK) initiative, "Skills Development for Managing Longitudinal Data for Sharing," by co-organising the Introduction to Synthetic Data for Longitudinal Data Managers event.
  • Practical tools and guidance: A Minimal Documentation Standard for Synthetic Data Collections was developed to support consistent minimal documentation for synthetic datasets. This guidance was published openly to inform practice across data owners, data providers and TREs.
  • International contribution: Project team members participated in an international working group which resulted in a co-authored, community-driven statement on synthetic data principles and terminology, published via Zenodo.
  • Open access report: A final, open-access report including findings and sequenced recommendations is published via Zenodo.

Get in touch

If you would like to get in touch with us about the project, please contact us at datasharing@ukdataservice.ac.uk.