Evaluating the benefits, costs and utility of synthetic data

About the project

Background

The growing discourse around synthetic data underscores its potential not only to address data challenges in a fast-changing landscape but also to foster innovation and accelerate advances in data analytics and artificial intelligence. From optimising data sharing and utility, to sustaining and promoting reproducibility, to mitigating disclosure risk, synthetic data has emerged as a solution to various complexities of the data ecosystem.

The funding opportunity

The Economic and Social Research Council, through the UKRI Digital Research Infrastructure Fund and Administrative Data Research UK (ADR UK), aims to collect evidence and insights on the advantages, financial implications, and practicality of low-fidelity synthetic data from the perspective of three key stakeholder groups.

This funding supports three interrelated projects, each tailored to explore synthetic data from distinct viewpoints:

  • Researchers: investigating the impact of accessible synthetic data on research methodologies and outcomes.
  • Data owners and Trusted Research Environments (TREs): examining the benefits, costs, and logistics involved in providing synthetic data.
  • The public: understanding public perceptions and attitudes towards synthetic data usage.

The collective findings from these projects are expected to guide the creation of strategies for the broad-scale production and dissemination of synthetic data. Such strategies will aim to be both efficient and aligned with the expectations and ethical considerations of the public, data owners, and the research community.

Balancing the data scales: A cost-benefit analysis of low-fidelity synthetic data for data owners and providers

The UK Data Service has secured funding for the second project of the initiative, focusing on data owners and TREs.

The project proposes a mixed-methods approach and is centred around three primary goals:

  1. To evaluate the full costs incurred by data owners and TREs in creating and maintaining low-fidelity synthetic data, covering both the initial production of synthetic data and subsequent costs.
  2. To assess the various models of synthetic data sharing, evaluating the implications and efficiencies for data owners and TREs, covering all aspects from pre-ingest to curation procedures, metadata sharing, and data discoverability.
  3. To measure the efficiency improvements for data owners and TREs when synthetic data is available, analysing impacts on resources, secure environment usage load, and researchers' relative uptake of synthetic and real datasets.

Project team

Principal Investigator: Cristina Magder
Co-Investigators: Maureen Haaker, Jools Kasmire, Hina Zahid

Project duration

8 April 2024 – 31 March 2025

Project work packages

Conduct a comprehensive literature review aimed at defining the existing body of knowledge surrounding the dissemination of low-fidelity synthetic data.

Evaluate the perception, readiness, and potential risk aversion of data owners towards the adoption and dissemination of low-fidelity synthetic data.

Engage with a diverse cohort of data owners that already produce and disseminate synthetic data to help describe the existing frameworks, sharing mechanisms, and cost structures associated with low-fidelity synthetic data.

Delve deeper into the operational dimensions of synthetic data usage within secure environments to understand the practical implications, challenges, and opportunities that arise when integrating synthetic data into established workflows.