The UK Data Service continues to explore new ways to improve how data is prepared, described, and delivered to researchers. As part of this work, our Research and Development and Curation teams have been working on MetaCurate-ML, a pilot project initially funded through the ESRC Future Data Services programme, which subsequently received supplementary funding through the EPSRC AI for Science fund.
MetaCurate-ML
A key focus of the project is the FAIR data principles. FAIR stands for data that is Findable, Accessible, Interoperable, and Reusable. In this context, MetaCurate-ML focuses particularly on interoperability, ensuring that data and metadata can be understood and used effectively by both people and machines.
The project is led by Jon Johnson at CLOSER and, in addition to the UK Data Service, involves colleagues from the University of Surrey’s Computer Science Department and the Scottish Centre for Social Research. Each partner contributes to different aspects of the work, from metadata extraction and classification to concept mapping and integration into discovery platforms.
Automatically extracting and structuring metadata
While much of the UK Data Service’s metadata is rich and detailed, a significant proportion is still held in formats such as PDF documents. These formats are useful for human interpretation but limit how metadata can be used by machines. This creates challenges for discovery, reuse, and the efficient curation of data at scale. MetaCurate-ML is addressing this challenge by developing approaches to automatically extract and structure metadata from existing documents.
Our colleagues at the University of Surrey are exploring machine learning techniques to extract metadata from PDF questionnaires, capturing information such as questions, response codes, and response categories.
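To make the target of this extraction concrete, the sketch below shows the kind of structured output involved: questions paired with their response codes and categories. It is not the Surrey team's method. It swaps in simple pattern matching for machine learning, and it assumes the PDF text has already been extracted and follows a hypothetical layout ("Q1. ..." followed by "code category" lines). The names `Question` and `parse_questionnaire` are illustrative only.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Question:
    label: str   # question identifier, e.g. "Q1"
    text: str    # question wording
    responses: dict[int, str] = field(default_factory=dict)  # code -> category


# Assumed (hypothetical) layout: "Q1. Question text" then "1 Category" lines.
QUESTION_RE = re.compile(r"^(Q\d+)\.\s+(.*\S)\s*$")
RESPONSE_RE = re.compile(r"^(-?\d+)\s+(.*\S)\s*$")


def parse_questionnaire(text: str) -> list[Question]:
    """Turn plain questionnaire text into structured question records."""
    questions: list[Question] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        q = QUESTION_RE.match(line)
        if q:
            questions.append(Question(label=q.group(1), text=q.group(2)))
            continue
        r = RESPONSE_RE.match(line)
        if r and questions:
            # Attach the response code to the most recent question.
            questions[-1].responses[int(r.group(1))] = r.group(2)
    return questions


# Made-up sample text, for illustration only.
SAMPLE = """
Q1. What is your current employment status?
1 Employed
2 Unemployed
-9 Refused

Q2. Do you own or rent your home?
1 Own
2 Rent
"""

parsed = parse_questionnaire(SAMPLE)
for question in parsed:
    print(question.label, question.text, question.responses)
```

Real questionnaires vary widely in layout, which is one reason fixed rules like these tend not to generalise and learned approaches are being explored instead.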
At the UK Data Service, we take this PDF-extracted metadata and combine it with existing metadata to create a rich knowledge graph. This knowledge graph is powered by Data Documentation Initiative (DDI) metadata schemas and has several potential uses, such as supporting future enhanced question banks, providing large language models (LLMs) with rich contextual domain knowledge, and driving discovery and linkage.
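As a rough illustration of how extracted and existing metadata might be combined in a graph, the sketch below stores both as subject-predicate-object triples and links a catalogue variable to the question it is based on. Everything here is an assumption made for illustration: the `KnowledgeGraph` class, the identifiers, and the predicate names are hypothetical and are not taken from the DDI schemas or from the project's implementation.

```python
from collections import defaultdict

Triple = tuple[str, str, str]


class KnowledgeGraph:
    """A minimal in-memory triple store (illustrative only)."""

    def __init__(self) -> None:
        self._triples: set[Triple] = set()
        self._by_subject: dict[str, set[Triple]] = defaultdict(set)

    def add(self, subject: str, predicate: str, obj: str) -> None:
        triple = (subject, predicate, obj)
        self._triples.add(triple)
        self._by_subject[subject].add(triple)

    def objects(self, subject: str, predicate: str) -> set[str]:
        """All objects linked from `subject` via `predicate`."""
        return {o for _, p, o in self._by_subject.get(subject, ()) if p == predicate}

    def subjects(self, predicate: str, obj: str) -> set[str]:
        """All subjects linked to `obj` via `predicate`."""
        return {s for s, p, o in self._triples if p == predicate and o == obj}


kg = KnowledgeGraph()

# Existing catalogue metadata (hypothetical identifiers and predicates).
kg.add("var:empstat", "inStudy", "study:example")
kg.add("var:empstat", "label", "Employment status")

# Metadata extracted from a PDF questionnaire (hypothetical).
kg.add("q:Q1", "questionText", "What is your current employment status?")
kg.add("q:Q1", "hasCategory", "1=Employed")

# The link that joins the two sources.
kg.add("var:empstat", "basedOnQuestion", "q:Q1")


def question_text_for(graph: KnowledgeGraph, variable: str) -> set[str]:
    """Follow variable -> question -> wording across both metadata sources."""
    texts: set[str] = set()
    for question in graph.objects(variable, "basedOnQuestion"):
        texts |= graph.objects(question, "questionText")
    return texts


print(question_text_for(kg, "var:empstat"))
```

The value of the graph lies in links like `basedOnQuestion`: once a variable is connected to its source question, a question bank or discovery tool can traverse from one to the other without a person consulting the original PDF.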
One important application of this enhanced metadata is in improving how disclosure risk is assessed. At present, elements of this process rely on manual workflows, often using spreadsheets and specialist tools. By introducing structured metadata and automated classification, MetaCurate-ML supports the development of more efficient and consistent approaches to identifying and managing disclosure risk. This has the potential to reduce manual effort, minimise errors, and improve the scalability of curation processes.
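A minimal sketch of what automated classification over structured metadata could look like is given below. It flags variables whose labels contain words commonly associated with identifying information. This is an assumption-laden stand-in, not the project's classifier or the UK Data Service's disclosure rules: the keyword lists, the category names, and the function `classify_variable` are all hypothetical, and a real workflow would still need curator review.

```python
import re

# Illustrative keyword lists only; not an official disclosure-control vocabulary.
DIRECT_KEYWORDS = {"name", "address", "postcode", "email", "phone"}
INDIRECT_KEYWORDS = {"age", "occupation", "ethnicity", "region", "birth"}


def classify_variable(label: str) -> str:
    """Assign a coarse disclosure-risk flag based on whole words in a label."""
    # Match whole words so that "age" does not fire on "mortgage".
    tokens = set(re.findall(r"[a-z]+", label.lower()))
    if tokens & DIRECT_KEYWORDS:
        return "direct identifier"
    if tokens & INDIRECT_KEYWORDS:
        return "indirect identifier"
    return "no flag"


# Made-up variable labels, for illustration only.
labels = {
    "pcode": "Respondent home postcode",
    "agelb": "Age at last birthday",
    "mortpay": "Mortgage payments per month",
    "satloc": "Satisfaction with local services",
}

flags = {name: classify_variable(label) for name, label in labels.items()}
for name, flag in flags.items():
    print(f"{name}: {flag}")
```

Because the same rules run identically over every variable, a check like this gives consistent results across datasets, which is harder to guarantee when each label is assessed by hand in a spreadsheet.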
The project is also supporting the development of new tools within the service, including a prototype Data Product Builder. This tool demonstrates how enriched metadata can be used to support data curation workflows, enabling curators to explore datasets, identify key variables, and create versions of data that are appropriate for different access conditions.
As a pilot project, MetaCurate-ML is helping to explore how these approaches can support the work of the UK Data Service curation team and improve efficiency across key processes. The insights and tools developed through this work have the potential to inform future developments, including enhancements to our data catalogue, making it easier for users to discover and access the data they need.
More information
The MetaCurate-ML project is a collaborative effort involving colleagues across the UK Data Service and partner organisations. From the UK Data Service, the work is led by the Data and Research Technology team, including Deirdre Lungley, Jacob Joy, and Ivan Evdokimov, working closely with the Curation and Collections teams, including Sharon Bolton, Liz Smy, Reza Afkhami, Cristina Magder and Vlad Viona.
The project also involves contributions from Paul Bradshaw at the Scottish Centre for Social Research, Jon Johnson at CLOSER, and Suparna De and her researchers at the University of Surrey.