About the project
Social science research benefits from accountability and transparency, both of which depend on high-quality, trustworthy data.
Rigorous data curation is still sometimes viewed as a dark art, and easy-to-use tools for checking and cleaning numeric data are not widely used, despite broad awareness of the need to make data FAIR (findable, accessible, interoperable and reusable). For repository staff, checking, cleaning and documenting data can be all too manual and time-consuming.
The UK Data Service developed QAMyData, a free, easy-to-use, open source tool that provides a health check for numeric data. The tool uses automated methods to detect and report on some of the most common problems in survey or other numeric data, such as missingness, duplication, outliers and direct identifiers. Requirements were scoped through a series of engagement exercises with the Service’s own data curation team, other data publishers, managers and quantitative researchers, producing a comprehensive list of ‘tests’ that are typically used when assessing the quality of numeric data files.
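To make these problem types concrete, the following Python sketch shows what checks of this kind might look like in their simplest form. It is illustrative only and is not taken from QAMyData: the function names, the z-score rule for outliers, the email pattern used as a stand-in for direct identifiers, and the sample rows are all assumptions made for this example.

```python
import re
from statistics import mean, stdev

# Deliberately simple email pattern (an assumption for this sketch).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def missing_share(rows, column):
    """Fraction of rows where `column` is None or an empty string."""
    missing = sum(1 for row in rows if row.get(column) in (None, ""))
    return missing / len(rows)


def duplicate_ids(rows, id_column):
    """Identifier values that appear more than once, in first-seen order."""
    seen, dupes = set(), []
    for row in rows:
        value = row[id_column]
        if value in seen and value not in dupes:
            dupes.append(value)
        seen.add(value)
    return dupes


def outliers(rows, column, z_limit=3.0):
    """Numeric values more than `z_limit` sample standard deviations from the mean."""
    values = [row[column] for row in rows if isinstance(row[column], (int, float))]
    if len(values) < 2:
        return []
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > z_limit]


def email_cells(rows):
    """(row index, column) pairs whose string value contains an email address."""
    return [
        (index, column)
        for index, row in enumerate(rows)
        for column, value in row.items()
        if isinstance(value, str) and EMAIL_RE.search(value)
    ]


# Hypothetical survey extract with one problem of each kind.
rows = [
    {"id": 1, "age": 34, "notes": ""},
    {"id": 2, "age": 29, "notes": "contact: jane@example.org"},
    {"id": 3, "age": 41, "notes": ""},
    {"id": 4, "age": 38, "notes": ""},
    {"id": 5, "age": None, "notes": ""},
    {"id": 6, "age": 31, "notes": ""},
    {"id": 7, "age": 45, "notes": ""},
    {"id": 7, "age": 27, "notes": ""},
    {"id": 9, "age": 250, "notes": ""},
]

if __name__ == "__main__":
    print("missing share (age):", missing_share(rows, "age"))
    print("duplicate ids:", duplicate_ids(rows, "id"))
    print("outliers (age):", outliers(rows, "age", z_limit=2.0))
    print("cells with emails:", email_cells(rows))
```

In a sample this small a single extreme value inflates the standard deviation, so the example lowers `z_limit` to 2.0; a production check would more likely use a robust rule or variable-specific valid ranges.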
Data health check
The tool offers a number of configurable tests, categorised into four types: file, metadata, data integrity, and identifiers. Tests can be run on popular file formats, including SPSS, Stata, SAS and CSV. A standard config file holds default settings for each test, such as the threshold for a pass or fail (e.g. detecting value labels that are truncated, email addresses stored as strings, or undefined missing values), and these can easily be adapted to the user’s own desired thresholds. The configuration feature allows the creation of a unique Data Quality Profile. The software creates a ‘data health check’ that reports errors and issues in both a summary and a detailed report, giving the location of each failed test. New tests can easily be added. Data depositors and publishers can act on the results and resubmit the file until a clean bill of health is produced.
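The relationship between configurable thresholds, pass/fail outcomes and reported locations can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not QAMyData's implementation or its real config format: the setting names (`max_missing_share`, `max_value_label_length`), the `Failure` record and the sample data are hypothetical, and a label-length limit is used here as a simple stand-in for a truncated-label test.

```python
from dataclasses import dataclass

# Hypothetical default settings; these are NOT QAMyData's real config keys.
CONFIG = {
    "max_missing_share": 0.10,       # fail if more than 10% of values are missing
    "max_value_label_length": 32,    # fail if a value label exceeds 32 characters
}


@dataclass
class Failure:
    test: str       # name of the test that failed
    location: str   # where in the file the problem was found
    detail: str     # human-readable explanation


def run_checks(columns, value_labels, config):
    """Run both tests and return one Failure per problem found."""
    failures = []
    for name, values in columns.items():
        share = sum(v is None for v in values) / len(values)
        if share > config["max_missing_share"]:
            failures.append(
                Failure("missing_share", f"variable {name}", f"{share:.0%} missing")
            )
    for variable, labels in value_labels.items():
        for code, label in labels.items():
            if len(label) > config["max_value_label_length"]:
                failures.append(
                    Failure(
                        "value_label_length",
                        f"variable {variable}, code {code}",
                        f"label has {len(label)} characters",
                    )
                )
    return failures


def summarise(failures, test_names):
    """Summary report: one pass/fail verdict per test."""
    failed = {failure.test for failure in failures}
    return {name: ("fail" if name in failed else "pass") for name in test_names}


# Hypothetical data: column values plus value labels for a coded variable.
columns = {"age": [34, None, 41, None], "region": [1, 2, 2, 1]}
value_labels = {"region": {1: "North", 2: "South and south-west including islands"}}
tests = ["missing_share", "value_label_length"]

if __name__ == "__main__":
    failures = run_checks(columns, value_labels, CONFIG)
    print(summarise(failures, tests))   # summary report
    for failure in failures:            # detailed report with locations
        print(f"{failure.test}: {failure.location} ({failure.detail})")

    # A more lenient profile: same tests, different threshold.
    lenient = {**CONFIG, "max_missing_share": 0.60}
    print(summarise(run_checks(columns, value_labels, lenient), tests))
```

Keeping the thresholds in a plain mapping separate from the test logic is what lets the same tests produce different verdicts under different profiles.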
The choice of technology followed at least four months of research, experimenting with different open source programming languages and libraries of statistical functions, including R, Python and Clojure, and focussing initially on SPSS and Stata files. The Rust programming language was selected as the best choice, building on the established ReadStat library, which is gaining recognition in the statistical community, where we hope it will be developed further. The QAMyData software is easily downloaded to a laptop or server and can be quickly used and integrated into data cleaning and processing pipelines. It is available to download from the UK Data Service GitHub pages under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
The grant also delivered a training module on what makes a clean and well-documented numeric dataset. A user guide, training exercises and a purposely erroneous dataset were produced and road-tested during early training sessions – see Outputs below.
UK Data Service contribution
- Principal Investigator: Louise Corti, UK Data Service
- Co-Investigator: Vernon Gayle, AQMen, University of Edinburgh
- Researchers: Jon Johnson, Myles Offord, Cristina Magder, Anca Vlad, UK Data Service
- Funders: National Centre for Research Methods (NCRM), Economic and Social Research Council (ESRC)
- Project dates: 8 January 2018 – 31 July 2019 (19 months)
Outputs
- Overview presentation (PDF)
- Table of tests included (PDF)
- Tool and installation documentation
- Config file
- QAMyData Guide: How to install and run (PDF)
- Teaching exercise one: Identifying issues (PDF)
- Teaching exercise two: Using the QAMyData tool (PDF)
- Dummy Dataset (ZIP)
Data curator and researcher workshops
- Full day hands-on NCRM workshop, LSE, 20 February 2019: Assessing Data Quality and Disclosure Risk in Numeric Data
- Presentation, AQMeN Data Wrangling course, London, 13 March 2019
- Half-day hands-on workshop, Assessing Data Quality and Disclosure Risk in Numeric Data, LSHTM, April 2019
- Demonstration, Home Office, April 2019
- Full day workshop, Assessing Data Quality and Disclosure Risk in Numeric Data, University of Edinburgh, 10 June 2019
- Full day workshop, Assessing Data Quality and Disclosure Risk in Numeric Data, Edinburgh Medical School, 11 June 2019
- Demonstration, Scottish Government, 12 June 2019
- Presentation, IASSIST Annual Conference, Sydney, 30 May 2019
- Presentation and demo, final NCRM showcase, University of Manchester, 10-11 June 2019