Assessing disclosure risk
Researchers are often responsible for keeping the identity of research participants confidential, for example where this has been agreed with participants. Where that is the case, the risks of disclosure need to be considered before, during, and after data collection.
Disclosure risk is assessed by evaluating the key characteristics, or variables in data files, that are most likely to lead to participant identification in a specific project.
These can be either:
- Direct identifiers – such as a person’s name, national insurance number, picture, or detailed geographic location.
- Indirect identifiers – such as a large household size, specialised profession, unusual health conditions or verbatim textual responses to survey questions.
Risk assessment is about managing risk, rather than the removal of all risk. The risk of identification and the risk of harm from exposing data are evaluated together to assess the overall risk.
Risk analysis can involve running frequency analyses of variables to determine low-frequency responses and extreme outliers. This needs to be complemented by qualitative analysis of risk characteristics based on local knowledge of the data and the population and individuals studied.
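As a minimal sketch of the frequency analysis described above, the following Python example counts how often each response occurs for a variable and flags those that fall below a threshold. The function name `low_frequency_values`, the threshold of 3, and the sample records are all assumptions made for illustration; an appropriate threshold depends on the project, and the flagged values still need to be reviewed using local knowledge.

```python
from collections import Counter

def low_frequency_values(records, variable, threshold=3):
    """Return {value: count} for values of `variable` seen fewer than `threshold` times."""
    counts = Counter(record[variable] for record in records)
    return {value: n for value, n in counts.items() if n < threshold}

# Made-up records, for illustration only (not taken from any real survey).
records = [
    {"occupation": "Farmer", "household_size": 5},
    {"occupation": "Farmer", "household_size": 5},
    {"occupation": "Farmer", "household_size": 5},
    {"occupation": "Trader", "household_size": 6},
    {"occupation": "Trader", "household_size": 6},
    {"occupation": "Trader", "household_size": 6},
    {"occupation": "Beekeeper", "household_size": 19},
]

# Report the rare responses for each variable under review.
for variable in ("occupation", "household_size"):
    print(variable, low_frequency_values(records, variable))
```

Here the single 'Beekeeper' and the household of 19 are flagged as low-frequency responses. A count alone does not show whether such a value is actually disclosive in the population studied.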
For example, a single house with solar panels in a small rural village may be highly disclosive if this is not a common feature in that area. It is important to decide whether confidentiality is sufficiently served by concealing individual and household identities, or whether community or other location-specific identifiers also need to be concealed.
Once a disclosure assessment has been completed, relevant strategies for consent protocols, anonymisation, and regulation of data access can be evaluated and applied.
Millennium Villages study example
This table shows examples of some assessed variables, with risk and actions taken, from a review we carried out of household survey data from the Millennium Villages Impact Evaluation project, northern Ghana.
It shows examples of identifiers commonly assessed for disclosure risk (such as age and community), as well as variables for which local knowledge is essential to indicate risk (type of fuel used and wall material of the house).
Extract of household survey variables assessed for disclosure risk:
|Variable|Disclosure risk|Action taken|
|---|---|---|
|Community|Low frequency counts for all named communities; respondents are very easily identifiable (especially in combination with other variables)|Exclude variable from dataset|
|Age|Low counts of respondents aged 75 and over|Top-code age >= 75 as ‘75 and over’|
|Main occupation during last 12 months|Low counts of very specific occupations|Occupations aggregated into standard occupation codes|
|Ethnicity of the household head|Low counts of specific ethnicities|Recode the low-frequency responses (all responses but ‘Mamprusi’ and ‘Builsa’) into ‘Other’|
|Household’s primary type of energy/fuel used for cooking|Very low counts for ‘Gas/LPG’ and ‘Electricity-solar panel’ responses may lead to household identification (especially if combined with other datasets)|Recode all responses into the following main categories: 1 – ‘Firewood’; 2 – ‘Electricity-based’; 3 – ‘Charcoal’; 4 – ‘Other’; 5 – ‘Don’t know’; 6 – ‘NA/missing’|
|Main material of the wall of the house|A number of low-frequency responses; exterior features of buildings are easily identifiable and could result in identification of the household|As the main material of the wall refers to the exterior of a building, it may be advisable to recode the low-frequency and ‘Other’ responses into ‘Other (incl. wood-based and stone-based)’ and retain the remaining groups|
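
Two of the actions in the table, top-coding age and recoding low-frequency responses into ‘Other’, can be sketched in Python as follows. This is an illustrative sketch only, not the procedure used in the review: the function names `top_code_age` and `recode_low_frequency` are hypothetical, and the sample values (including the placeholder labels 'Group A' and 'Group B') are made up.

```python
def top_code_age(age, cutoff=75):
    """Return the age unchanged below the cutoff, otherwise one top category."""
    return f"{cutoff} and over" if age >= cutoff else age

def recode_low_frequency(values, keep, other_label="Other"):
    """Keep the categories named in `keep`; collapse all others into `other_label`."""
    return [value if value in keep else other_label for value in values]

# Made-up values, for illustration only.
ages = [34, 52, 75, 81, 68]
ethnicities = ["Mamprusi", "Builsa", "Mamprusi", "Group A", "Group B"]

print([top_code_age(age) for age in ages])
print(recode_low_frequency(ethnicities, keep={"Mamprusi", "Builsa"}))
```

Both functions reduce detail only for the rare values and leave the common categories usable for analysis.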