Anonymising quantitative data

Anonymising quantitative data may involve removing or aggregating variables, or reducing the precision or the detailed textual meaning of a variable.

A list of primary anonymisation techniques follows:

Remove direct identifiers from a dataset. Such identifiers are often not necessary for secondary research.

Example: Remove respondents’ names, addresses (physical, email and IP), postcode information, institution name and telephone numbers.
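As a minimal sketch of this step, the snippet below drops identifier fields from a list of records. The field names and the sample record are hypothetical and would need to be matched to the variables in the actual dataset.

```python
# Hypothetical field names for direct identifiers; adjust to the dataset at hand.
DIRECT_IDENTIFIERS = {
    "name", "address", "email", "ip_address",
    "postcode", "institution", "telephone",
}

def remove_direct_identifiers(records, identifiers=DIRECT_IDENTIFIERS):
    """Return copies of the records without the listed identifier fields."""
    return [
        {field: value for field, value in row.items() if field not in identifiers}
        for row in records
    ]

# Illustrative record only, not real data.
respondents = [
    {"name": "A. Example", "email": "a@example.org",
     "telephone": "01234 567890", "age": 42, "score": 3.5},
]

cleaned = remove_direct_identifiers(respondents)
# cleaned == [{'age': 42, 'score': 3.5}]
```

Dropping fields in a copy, rather than editing the records in place, keeps the original file untouched so that it can be stored separately under stricter controls.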

Aggregate or reduce the precision of a variable, such as age or place of residence. As a general rule, report the lowest level of geo-referencing that will not potentially breach respondent confidentiality.

The exact scale depends on the type of data collected, but very detailed geo-references, such as full postcodes or the names of small towns or villages, are likely to be problematic.

Coded or categorical variables, which may be potentially revealing, can be aggregated into broader codes. If aggregation of a disclosive variable is not possible, consider whether it should be removed from the dataset.

Example: Record the year of birth rather than the day, month and year; record postcode sectors (first three or four characters) rather than full postcodes; aggregate detailed ‘unit group’ standard occupational classification employment codes up to ‘minor group’ codes by removing the last digit.
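The three reductions in this example could be sketched as follows. The function names are illustrative, and the postcode helper assumes the standard UK layout in which the last three characters form the inward code, so that the sector is the outward code plus the first character of the inward code.

```python
from datetime import date

def year_of_birth(date_of_birth: date) -> int:
    """Keep only the year of a full date of birth."""
    return date_of_birth.year

def postcode_sector(postcode: str) -> str:
    """Reduce a full UK postcode to its sector.

    Assumes the last three characters are the inward code.
    """
    compact = postcode.replace(" ", "").upper()
    outward, inward = compact[:-3], compact[-3:]
    return f"{outward} {inward[0]}"

def soc_minor_group(unit_group_code: str) -> str:
    """Drop the last digit of a unit group code to get the minor group code."""
    return unit_group_code[:-1]

# Illustrative values only.
birth_year = year_of_birth(date(1984, 7, 15))   # 1984
sector = postcode_sector("co4 3sq")             # 'CO4 3'
minor_group = soc_minor_group("2211")           # '221'
```

Occupational codes are handled as strings here so that removing the last digit is a simple truncation and any leading zeros are preserved.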

Generalise the meaning of a detailed text variable by replacing potentially disclosive free-text responses with more general text.

Example: Detailed areas of medical expertise could indirectly identify a doctor. The expertise variable could be replaced by more general text or be coded into generic responses, such as ‘one area of medical speciality’, ‘two or more areas of medical speciality’, etc.
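One way to code such a variable is sketched below. It assumes, purely for illustration, that the free-text response lists specialities separated by semicolons; the label for empty responses is also an assumption and not part of the guidance above.

```python
def generalise_expertise(free_text: str) -> str:
    """Replace a detailed list of specialities with a generic category.

    Assumes specialities are separated by semicolons in the free text.
    """
    areas = [area.strip() for area in free_text.split(";") if area.strip()]
    if not areas:
        # Hypothetical label for blank responses.
        return "no area of medical speciality recorded"
    if len(areas) == 1:
        return "one area of medical speciality"
    return "two or more areas of medical speciality"

# Illustrative responses only.
single = generalise_expertise("paediatric cardiology")
multiple = generalise_expertise("paediatric cardiology; neonatal surgery")
# single   == 'one area of medical speciality'
# multiple == 'two or more areas of medical speciality'
```

Real free-text responses rarely follow a fixed separator, so in practice the coding is often done or checked by hand.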

Restrict the upper or lower ranges of a continuous variable to hide outliers if the values for certain individuals are unusual or atypical within the wider group researched.

In such circumstances, the unusually large or small values might be collapsed into a single code, even if the other responses are kept as actual quantities, or one might code all responses.

Example: Annual salary could be ‘top-coded’ to avoid identifying highly paid individuals. A top code of £100,000 or more could be applied, even if lower incomes are not coded into groups.
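A minimal sketch of top-coding, using the £100,000 threshold from the example. Values below the threshold are kept as actual quantities; the function name and sample salaries are illustrative.

```python
TOP_CODE = 100_000  # threshold taken from the example above

def top_code_salary(salary, threshold=TOP_CODE):
    """Return the salary unchanged below the threshold, else a single top code."""
    if salary >= threshold:
        return f"£{threshold:,} or more"
    return salary

# Illustrative salaries only.
salaries = [23_500, 48_000, 100_000, 250_000]
coded = [top_code_salary(s) for s in salaries]
# coded == [23500, 48000, '£100,000 or more', '£100,000 or more']
```

The same pattern applies to bottom-coding unusually small values by reversing the comparison.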

Anonymise relational data where relations between variables in related or linked datasets, or in combination with other publicly available outputs, may disclose identities.

Example: In confidential interviews on farms, the names of farmers have been replaced with codes, and other confidential information on the nature of the farm businesses and their locations has been disguised to anonymise the data.

However, if related biodiversity data collected on the same farms use the same farmer codes and contain detailed locations, those locations are not confidential, and farmers could be identified by combining the two datasets.

The link between farmer codes and biodiversity location data should be removed, for example, by using separate codes for farmer interviews and for farm locations.
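Separate coding of the two datasets could look like the sketch below, which draws an independent random code for each farm in each dataset so that no shared key remains. The identifiers, prefixes and helper name are hypothetical.

```python
import secrets

def assign_codes(original_ids, prefix):
    """Map each original identifier to a fresh random code."""
    codes = {}
    used = set()
    for original_id in original_ids:
        code = prefix + secrets.token_hex(4)
        while code in used:  # regenerate on the rare collision
            code = prefix + secrets.token_hex(4)
        used.add(code)
        codes[original_id] = code
    return codes

# Hypothetical farm identifiers.
farms = ["farm_01", "farm_02", "farm_03"]

# Independent code sets: one for the interviews, one for the locations.
interview_codes = assign_codes(farms, "INT-")
location_codes = assign_codes(farms, "LOC-")
```

The two mappings are the only remaining link between the datasets, so they would need to be destroyed or held securely apart from the shared data. Record order can also act as a link and may need shuffling before release.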

Anonymise geo-referenced data by replacing point coordinates with non-disclosing features or variables; or, preferably, keep geo-references intact and impose access restrictions on the data instead.

Point data may fix the position of individuals, organisations or businesses studied, which could disclose their identity. Point coordinates may be replaced by larger, non-disclosing geographical areas, such as polygon features (km² grid, postcode district, county), or linear features (random line, road and river).
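Replacing a point with a grid square can be sketched as below. The sketch assumes coordinates in a projected system measured in metres (for example eastings and northings) and a 1 km cell size; both are assumptions for illustration.

```python
import math

def to_grid_cell(easting_m, northing_m, cell_size_m=1000):
    """Return the south-west corner of the grid cell containing the point.

    Assumes projected coordinates in metres.
    """
    x = math.floor(easting_m / cell_size_m) * cell_size_m
    y = math.floor(northing_m / cell_size_m) * cell_size_m
    return (x, y)

# Illustrative coordinates only.
cell = to_grid_cell(601234.5, 225678.9)
# cell == (601000, 225000)
```

A larger cell size gives stronger protection at the cost of spatial detail, so the choice depends on how densely the study area is populated.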

Point data can also be replaced by meaningful alternative variables that typify the geographical position and represent the reason why the locality was selected for the research, such as poverty index, population density, altitude, vegetation type. In this way, the value of data is maintained, whilst removing disclosing geo-references.

A better option may be to keep detailed spatial references intact and to impose access controls on the data instead.

Procedures to anonymise any research data that are destined for sharing or archiving should always be considered together with appropriate informed consent procedures.