Anonymisation step-by-step

Practical considerations for data anonymisation

Anonymisation ensures that the risk of identifying a data subject in released data is negligible. Careful consideration must be given to the applicable legislation, as this will influence the definition of anonymised data.

In the UK, the Information Commissioner’s Office (ICO) provides in-depth information about personal data, including definitions and considerations for pseudonymisation and anonymisation, within the scope of the UK General Data Protection Regulation and the Data Protection Act 2018.

Anonymisation techniques can differ between quantitative (e.g. survey) data and qualitative (e.g. transcript) data. Before attempting anonymisation, make sure the data are well understood by consulting the available documentation, such as user guides, technical reports, and methodological papers.

A simplified high-level approach can consist of the following three core steps.

Step 1: Find and assess identifiers

  • Start by identifying potential identifiers. This should include both direct identifiers (information that identifies a data subject directly, e.g. names, addresses) and indirect identifiers (information which, when combined, might identify data subjects, e.g. age, sex, educational attainment, occupation).
  • Evaluate the likelihood of re-identification by considering both the data itself and the potential availability of external information that could be linked to it.
  • Key questions to consider:
    • Can the identity of a participant be known from information in the data file?
    • Is there a possibility of inadvertently disclosing or causing harm to a third party based on the information in the data file?
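
One rough way to assess indirect identifiers is to count how many records share each combination of their values; a combination held by a single record deserves a closer look. The sketch below is only an illustration: the records, the field names and the helper `unique_combinations` are all hypothetical, and uniqueness within the file says nothing about external information that could be linked to it.

```python
from collections import Counter

# Hypothetical survey records; every value is invented for illustration.
records = [
    {"age": 34, "sex": "F", "occupation": "nurse"},
    {"age": 34, "sex": "F", "occupation": "nurse"},
    {"age": 51, "sex": "M", "occupation": "teacher"},
    {"age": 29, "sex": "F", "occupation": "pilot"},
]


def unique_combinations(rows, indirect_identifiers):
    """Return the value combinations that are held by exactly one record."""
    counts = Counter(
        tuple(row[key] for key in indirect_identifiers) for row in rows
    )
    return [combo for combo, n in counts.items() if n == 1]


risky = unique_combinations(records, ["age", "sex", "occupation"])
print(risky)
```

Here the teacher and the pilot each stand alone in the file, so those combinations are candidates for the techniques described in Step 2.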

Step 2: Implement anonymisation techniques

  • Ensure that all direct identifiers have been removed (deleted) or pseudonymised (replaced with fake names or codes).
  • Next, address the indirect identifiers you have identified as potentially leading to identification. Techniques may include:
    • Aggregation: Group data points to decrease identifiability. For example, rather than using specific ages, categorise them into broader age ranges such as 20-24, 25-29, 30-34, 35-39 etc.
    • Generalisation: Replace detailed information with more general terms to prevent identification. This is highly applicable to qualitative data such as transcripts, but also to survey data containing string variables. For example, generalise “living in the city of Preston in Lancashire” to “living in a city in the North West of England”.
    • Data-specific techniques: Incorporate specialised techniques such as recoding, top/bottom coding or statistical disclosure control in quantitative data, and methods like blurring or altering features in visual data, or voice distortion in audio data. Consider methods such as blurring or voice distortion carefully because, depending on the context and envisioned usage of the data, they can compromise its usability. Find out more about applicable techniques for qualitative data and for quantitative data, and consult the Government Statistical Service policy for the release of social survey microdata.
  • Key questions to consider:
    • How can the data be altered to prevent identification while retaining its utility for secondary analysis?
    • Are the anonymisation techniques employed sufficient to protect against re-identification? Make sure to consider the nature of the data, including the data type and format, sensitivity, and uniqueness, as well as the intended usage.
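
As a minimal sketch of two of the techniques above, the code below replaces names with codes and aggregates exact ages into five-year bands with a top-coded final category. The function names, the `P001` code scheme and the top-coding threshold of 65 are assumptions made for illustration, not a prescribed method; any table linking codes back to names would need to be stored securely or destroyed.

```python
def pseudonymise(names):
    """Map each distinct name to an arbitrary code such as P001 (assumed scheme)."""
    codes = {}
    for name in names:
        if name not in codes:
            codes[name] = f"P{len(codes) + 1:03d}"
    return codes


def age_band(age, width=5, top=65):
    """Aggregate an exact age into a band; ages at or above `top` are top-coded."""
    if age >= top:
        return f"{top}+"
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


# Invented example values.
names = ["Alice Example", "Bob Example", "Alice Example"]
codes = pseudonymise(names)
coded_names = [codes[name] for name in names]
bands = [age_band(age) for age in (22, 27, 39, 71)]

print(coded_names)  # ['P001', 'P002', 'P001']
print(bands)        # ['20-24', '25-29', '35-39', '65+']
```

Wider bands or a lower top-coding threshold reduce disclosure risk further but also reduce the detail available for secondary analysis, so the parameters are a judgement call for each dataset.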

Step 3: Review the data and re-assess any remaining disclosure risk

  • Ensure that the anonymisation process has been consistently applied across the data.
  • Conduct a thorough review to confirm that no meaningful residual risk of disclosing personal or sensitive information remains. If a low residual risk of disclosure remains, consider the effective anonymisation approach, a concept introduced by the ICO. Check our licence and access framework webpage for further information on data classification.
  • Key questions to consider:
    • Have all identifiers, both direct and indirect, been adequately anonymised or removed?
    • Is there any remaining information that, when combined with other available data, could lead to the identification of individuals?
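
Part of this review can be supported by simple automated checks, sketched below under stated assumptions: the records and field names are invented, and `smallest_group` and `leftover_identifiers` are hypothetical helpers. The first reports the size of the smallest group sharing the same indirect identifiers, and the second looks for known names that survive in free-text fields. Plain substring matching can miss misspellings or flag unrelated words, so it complements a manual read-through rather than replacing it.

```python
from collections import Counter


def smallest_group(rows, indirect_identifiers):
    """Return the size of the smallest group sharing the same identifier values."""
    counts = Counter(
        tuple(row[key] for key in indirect_identifiers) for row in rows
    )
    return min(counts.values())


def leftover_identifiers(rows, known_names):
    """Return (row index, field, name) wherever a known name survives in text."""
    hits = []
    for index, row in enumerate(rows):
        for field, value in row.items():
            for name in known_names:
                if isinstance(value, str) and name.lower() in value.lower():
                    hits.append((index, field, name))
    return hits


# Invented records as they might look after Step 2.
anonymised = [
    {"id": "P001", "age_band": "30-34", "sex": "F", "notes": "Works night shifts."},
    {"id": "P002", "age_band": "30-34", "sex": "F", "notes": "Alice mentioned her manager."},
    {"id": "P003", "age_band": "50-54", "sex": "M", "notes": "No comments."},
]

group_size = smallest_group(anonymised, ["age_band", "sex"])
hits = leftover_identifiers(anonymised, ["Alice", "Bob"])

print(group_size)  # 1
print(hits)        # [(1, 'notes', 'Alice')]
```

In this example both checks point to remaining work: one record is still unique on age band and sex, and a name was left in a notes field.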