Anonymising quantitative data

Anonymising quantitative data, such as survey data for secondary research, may involve removing or aggregating variables, reducing precision or generalising detailed information. For the data to remain useful, socio-demographic details must be retained to allow meaningful analysis. The key is to balance privacy with data utility, ensuring that anonymisation methods preserve critical information without compromising participant confidentiality.

Common socio-demographics include age, education level, employment details, ethnicity, national identity, religion, household size, income and other financial information. Retaining these details, even in anonymised form, is often essential to ensure data remains meaningful for secondary research.

Anonymisation methods can be classified into non-perturbative and perturbative techniques, based on how the original data values are handled. While the examples below focus mostly on survey data, these practices apply across a wide range of quantitative data types, such as administrative data, transaction records and experimental data, wherever this kind of information is collected.

Direct identifiers are most often unnecessary for secondary research. They compromise the identity of participants and should not be released unless consent is in place and the benefits of sharing the information outweigh the risks.

For example, remove respondents’ names, addresses (physical, email and IP), telephone numbers, NHS numbers and national insurance numbers.
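
As a minimal sketch, direct identifiers can be dropped programmatically before a dataset is shared. The example below assumes the data sits in a pandas DataFrame; the column names are hypothetical, not a standard schema.

```python
import pandas as pd

# Illustrative survey extract; the column names are hypothetical.
df = pd.DataFrame({
    "name": ["A. Smith", "B. Jones"],
    "email": ["a@example.org", "b@example.org"],
    "age": [34, 51],
    "income": [21478, 51299],
})

# Direct identifiers to strip before release.
direct_identifiers = ["name", "address", "email", "ip_address",
                      "telephone", "nhs_number", "ni_number"]

# errors="ignore" skips identifier columns not present in this file.
anonymised = df.drop(columns=direct_identifiers, errors="ignore")
print(anonymised.columns.tolist())  # ['age', 'income']
```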

Banding (or binning) is an effective method for continuous variables like age and income. By grouping values into broader bands, we reduce the uniqueness of participants, thereby lowering the risk of identification.

For example, raw income values have been collected in the form of £9,996, £21,478, £51,299, £88,599 and £120,987. These can be banded into broader ranges, which maintains data usability while protecting individual financial details, as follows: £9,999 or less, £10,000 – £24,999, £25,000 – £49,999, £50,000 – £74,999, £75,000 – £99,999, £100,000 or more.
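
A minimal sketch of this banding, assuming the incomes are held in a pandas Series; the band edges mirror the ranges above.

```python
import pandas as pd

incomes = pd.Series([9996, 21478, 51299, 88599, 120987])

# Band edges are right-inclusive, so e.g. £24,999 falls in the
# £10,000 - £24,999 band.
bins = [0, 9999, 24999, 49999, 74999, 99999, float("inf")]
labels = ["£9,999 or less", "£10,000 - £24,999", "£25,000 - £49,999",
          "£50,000 - £74,999", "£75,000 - £99,999", "£100,000 or more"]

banded = pd.cut(incomes, bins=bins, labels=labels)
print(banded.tolist())
# ['£9,999 or less', '£10,000 - £24,999', '£50,000 - £74,999',
#  '£75,000 - £99,999', '£100,000 or more']
```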

While survey data usually consists of quantitative information, most surveys also contain questions that allow free-text responses. Generalisation is useful for anonymising these responses, which can often contain detailed or unique information that risks disclosing individual identities. By generalising specific information, such as locations, job descriptions and, at times, information that was not even meant to be collected, we protect privacy while preserving the usability of the data.

For example, participants have been asked why they are not receiving a pension; one of the responses states “I moved abroad, to Paris, for many, many years. I lived on Rue de Rivoli, right next to the Louvre and during that time, I didn’t contribute to a UK pension plan, so I’m not eligible.” A generalised response would be “Not eligible due to years spent living abroad without contributions.”
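
Generalising free text ultimately requires human judgement, since context determines what is disclosive. A simple rule-based pass can, however, apply placeholders for phrases already flagged during review. A minimal sketch, assuming a hand-built replacement table (the phrases and placeholders are illustrative):

```python
# Disclosive phrases identified during manual review of responses;
# this table is illustrative, not an automatic detection method.
REPLACEMENTS = {
    "Rue de Rivoli": "[street]",
    "the Louvre": "[landmark]",
    "Paris": "[city]",
}

def generalise(text: str) -> str:
    """Replace known identifying phrases with generic placeholders."""
    for phrase, placeholder in REPLACEMENTS.items():
        text = text.replace(phrase, placeholder)
    return text

response = ("I moved abroad, to Paris, for many, many years. I lived on "
            "Rue de Rivoli, right next to the Louvre and during that time, "
            "I didn't contribute to a UK pension plan, so I'm not eligible.")
print(generalise(response))
```

Each response still needs a final read-through before release, as a lookup table can only catch phrases that have already been spotted.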

Recoding, sometimes referred to as categorisation, is a method that reduces the number of unique categories within a variable, making it harder to identify individuals. This approach is particularly effective for variables like ethnicity, educational attainment or employment, where detailed subcategories can be merged into broader groups. Using broader categories maintains relevant socio-demographic information while protecting individual identities.

For example, with educational attainment you might collect very detailed information such as 1 to 4 GCSEs grade A* to C, Any GCSEs at other grades, O levels or CSEs (any grades), 1 AS level, NVQ level 1, Foundation GNVQ, Basic or Essential Skills, 5 or more GCSEs (A* to C or 9 to 4), O levels (passes), CSEs (grade 1), BTEC First or General Diploma, RSA Diploma, 2 or more A levels or VCEs, etc. To protect participants, you might wish to recode this to only 7 categories as follows:

This example is in line with the Office for National Statistics standard coding frame for education used during the 2021 Census.

0 No qualifications

1 Level 1 and entry level qualifications

2 Level 2 qualifications

3 Apprenticeship

4 Level 3 qualifications

5 Level 4 qualifications or above

6 Other
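
A minimal sketch of this recoding, assuming a lookup table from detailed responses to the seven broad codes above; only a few mappings are shown, and a real table must cover every category that was collected.

```python
# Partial lookup from detailed qualifications to the broad codes above;
# extend this to cover every detailed category in the data.
RECODE = {
    "No qualifications": 0,
    "1 to 4 GCSEs grade A* to C": 1,
    "NVQ level 1": 1,
    "5 or more GCSEs (A* to C or 9 to 4)": 2,
    "2 or more A levels or VCEs": 4,
}

def recode_education(detailed: str) -> int:
    # Responses missing from the table fall back to 6 ("Other").
    return RECODE.get(detailed, 6)

print(recode_education("NVQ level 1"))                 # 1
print(recode_education("2 or more A levels or VCEs"))  # 4
```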

Top and bottom coding addresses the privacy risk associated with rare, extreme values at the tails of a distribution. This is especially useful for age, where participants at very high or very low ages may be uniquely identifiable, as well as for income and financial variables.

For example, the original data contains the ages 27, 118, 89, 56, 48, 31 and 5. In the anonymised data, 118 and 89 would be top coded as “80 or older” and 5 bottom coded as “18 or younger”, while the remaining ages are retained. By grouping ages above 80 and below 18 into broader categories, we reduce the risk of disclosing identities without losing general age patterns.
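
A minimal sketch, using the thresholds from the example above; appropriate cut-offs depend on the distribution of your own data.

```python
def top_bottom_code(age: int, bottom: int = 18, top: int = 80) -> str:
    """Collapse extreme ages into broad categories; keep the rest."""
    if age <= bottom:
        return f"{bottom} or younger"
    if age >= top:
        return f"{top} or older"
    return str(age)

ages = [27, 118, 89, 56, 48, 31, 5]
print([top_bottom_code(a) for a in ages])
# ['27', '80 or older', '80 or older', '56', '48', '31', '18 or younger']
```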

Perturbation methods introduce small changes to the data values to protect privacy while maintaining the overall data structure. With noise addition, controlled random noise is added to numerical values to obscure exact data points (e.g. adjusting income by a small, random amount). With data swapping, certain values (often categorical) are exchanged between records, so individuals are not linked directly to specific responses. As these methods modify original data points, they are considered more advanced and should be applied with caution to ensure they do not distort analytical outcomes.
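
A minimal sketch of both methods; the ±5% noise scale and the full-column swap are illustrative choices that would need tuning and disclosure-risk assessment in practice.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed for a reproducible sketch

# Noise addition: perturb each income by up to +/-5% of its value.
incomes = np.array([9996, 21478, 51299, 88599, 120987], dtype=float)
noise = rng.uniform(-0.05, 0.05, size=incomes.shape)
perturbed = np.round(incomes * (1 + noise))

# Data swapping: randomly permute a categorical column across records,
# breaking the direct link between individuals and their responses.
regions = np.array(["North", "South", "East", "West", "North"])
swapped = rng.permutation(regions)

print(perturbed)
print(swapped)
```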

Our open source QAMyData tool provides a health check for numerical data, including detection of direct identifiers and of outliers based on user-set thresholds. There are also a number of statistical disclosure control tools for numerical data: the R package sdcMicro, which offers a graphical user interface; μ-ARGUS, developed by Statistics Netherlands for applying statistical disclosure control methods to microdata; and ARX, a versatile open-source data anonymisation tool that supports a range of anonymisation techniques for structured data, including k-anonymity and l-diversity.

When using any tools for anonymising or processing quantitative data, it is crucial to ensure that they are deployed locally and do not rely on external servers or unrestricted cloud services, which could inadvertently expose sensitive data to third parties. Researchers are responsible for verifying that any software used for data processing is configured correctly to prevent data from being uploaded to unknown or insecure entities.