We are delighted to announce the successful completion of the “Balancing the data scales: A cost-benefit analysis of low-fidelity synthetic data for data owners and providers” project. This year-long initiative examined the costs and benefits of synthetic data for data owners and Trusted Research Environments (TREs).
The project was strategically important for understanding how to advance synthetic data use in the UK research ecosystem. As one of three interrelated studies funded by Administrative Data Research UK (ADR UK), alongside parallel projects on researchers’ needs and public perspectives, the initiative generated valuable insights, practical guidance, and recommendations to help scale synthetic data as a trusted resource in research.
Project aims and overview
Launched in April 2024, the project evaluated how data-owning organisations and Trusted Research Environments can safely and efficiently create and share synthetic data. The project adopted a mixed-methods approach centred on three primary goals:
- Assess costs and benefits: Determine the resources required for generating and maintaining synthetic datasets, from initial production to ongoing curation, and weigh these against the potential efficiencies gained.
- Evaluate sharing models: Examine different models of synthetic data dissemination, covering processes from data ingest and metadata creation to access arrangements, to identify efficient practices for data owners and Trusted Research Environments.
- Measure efficiency gains: Understand how the availability of synthetic data might improve overall research efficiency, for example, by enabling researchers to understand data and refine analyses on synthetic versions before accessing sensitive real data.
Key findings and contributions
The project’s findings were based on a literature review, a survey of data owners, in-depth case studies with four organisations at the forefront of synthetic data creation, and a focus group with professionals from Trusted Research Environments.
Collectively, these activities identified several recurring challenges holding back wider adoption of synthetic data. These included uncertainty about legal and governance frameworks, a lack of dedicated resources and skills in organisations, inconsistent approaches to data generation and documentation, and pockets of misunderstanding or mistrust about what synthetic data can and cannot do. Addressing these issues formed the basis of the project’s recommendations.
A major outcome of the project is a set of four key recommendations, sequenced in priority order, to enable the responsible and scalable use of synthetic data across the UK. Each recommendation is targeted at a specific stakeholder group.
Recommendations for policymakers
The first recommendation is that policymakers should clarify governance and legal frameworks. They should develop clear, practical guidance on synthetic data’s legal status and governance and, in collaboration with data controllers, data owners, statisticians, and legal experts, issue authoritative guidelines that address licensing, intellectual property, and confidentiality for synthetic data. Importantly, this guidance should clarify whether different levels of data fidelity entail different legal considerations. Establishing such clarity will provide a foundation of confidence for all other efforts.
Recommendations for funders
Second, funders should strengthen skills and infrastructure, providing sustained investment in the people and tools required to produce high-quality synthetic data. Research funders should support long-term capacity-building, for example by funding specialist staff positions or training programmes in data synthesis and upgrading technical infrastructure.
Targeted grants could enable pilot projects, particularly helping smaller organisations that lack the resources to experiment with synthetic data independently. Building these skills and infrastructure will ensure synthetic data innovation is not limited to a few well-resourced institutions.
Recommendations for data owners and Trusted Research Environments
Third, data owners and Trusted Research Environments should establish quality standards and sharing models, working toward common standards and best practices for creating and sharing synthetic data. Data-producing organisations and Trusted Research Environments are encouraged to adopt shared quality checklists to rigorously assess synthetic data’s fidelity and utility.
Likewise, instituting standardised documentation and metadata (with transparent details on how the synthetic data were generated and how they should appropriately be used) is critical for usability and trust. The project also recommends using clear, harmonised labels or disclaimers to ensure any user immediately knows that a dataset is synthetic and understands its limitations. By converging on consistent practices, data providers can improve confidence and interoperability in synthetic data across the research infrastructure.
Recommendations for the research community
Finally, the research community should improve awareness, training and public engagement, promoting greater understanding and acceptance of synthetic data among researchers and the public. The research community, including universities, training bodies, and data services, should develop targeted training modules to educate researchers on when and how to use synthetic data appropriately.
Increasing user literacy will help researchers confidently incorporate synthetic datasets into their workflows where suitable. In addition, outreach and public engagement initiatives could dispel misconceptions about synthetic data, ensuring that wider audiences appreciate its benefits and limitations. Proactively addressing mistrust and misinformation will be key to fostering a culture that embraces synthetic data as a valuable tool rather than viewing it with scepticism.
Looking ahead
These recommendations form a phased roadmap for action. The project suggests beginning with urgent legal and governance clarity, followed by bolstering resources and skills, then implementing technical standards, and finally broadening outreach and education.
Taken together, these steps are intended to build confidence, capacity, and consistency in the adoption and use of synthetic data, helping this technology advance from a niche innovation to a trusted, widely adopted practice in the UK. For full details, the complete set of findings and recommendations is available in the project’s final report on Zenodo.
Acknowledgements
The dedication and collaboration of many individuals and organisations made the project’s completion possible. The project team (Cristina Magder, Maureen Haaker, Jools Kasmire, Hina Zahid, and Melissa Ogwayo) are keen to thank all collaborators and participants who contributed their time and insights. They would also like to acknowledge DELIMIT colleagues, Dr Fiona Lugg-Widger and Robert Trubey, for their invaluable support, and offer special thanks to Emily Oliver from ADR UK for her outstanding guidance and help.
This work was supported by ADR UK (Administrative Data Research UK). ADR UK is a partnership transforming the way researchers access the UK’s wealth of public sector data, enabling better-informed policy decisions that improve people’s lives. ADR UK is an Economic and Social Research Council (ESRC) investment (part of UK Research and Innovation).