Glossary

95% confidence interval

Confidence intervals are used to express the uncertainty associated with a population estimate. For example, imagine we wanted to use a survey to estimate the mean age of a population. The 95% confidence interval tells us that if we sampled this same population many times and generated a confidence interval (CI) each time, about 95% of these CIs would contain the true mean age of the population.
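
The repeated-sampling idea can be checked with a small simulation. The sketch below is illustrative only: the population (normally distributed ages with a true mean of 40 and standard deviation of 12), the sample size and the normal-approximation interval (mean plus or minus 1.96 standard errors) are all assumptions chosen for the example.

```python
import random
import statistics

def coverage(n_surveys=1000, sample_size=100, true_mean=40.0, sd=12.0, seed=1):
    """Share of simulated surveys whose 95% CI contains the true mean age."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_surveys):
        # Draw one simulated survey from the assumed population
        ages = [rng.gauss(true_mean, sd) for _ in range(sample_size)]
        mean = statistics.fmean(ages)
        standard_error = statistics.stdev(ages) / sample_size ** 0.5
        # Normal approximation: mean +/- 1.96 standard errors
        lower = mean - 1.96 * standard_error
        upper = mean + 1.96 * standard_error
        if lower <= true_mean <= upper:
            hits += 1
    return hits / n_surveys

share = coverage()
print(f"{share:.1%} of the simulated intervals contain the true mean")
```

The printed share should be close to, but not exactly, 95%, because the simulation itself is based on a finite number of samples.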

Agent-based modelling (ABM)

A computational model for simulating the actions and interactions of autonomous agents, which can be individual entities or groups like organisations, to understand the behaviour of a system and what governs its outcomes.

Aggregate data

Aggregate or macro data are data about populations, groups, regions or countries. These are data that have been averaged, totalled or otherwise derived from the individual level data found in the survey datasets.
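
As a rough illustration, the sketch below derives aggregate data (a count and a mean per region) from individual-level records. The records, variable names and regions are hypothetical.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical individual-level records: one row per respondent
respondents = [
    {"region": "North", "income": 21000},
    {"region": "North", "income": 27000},
    {"region": "South", "income": 30000},
    {"region": "South", "income": 34000},
    {"region": "South", "income": 26000},
]

# Group the individual incomes by region
incomes_by_region = defaultdict(list)
for row in respondents:
    incomes_by_region[row["region"]].append(row["income"])

# Aggregate: one row per region instead of one row per person
aggregate = {
    region: {"n": len(values), "mean_income": fmean(values)}
    for region, values in incomes_by_region.items()
}
print(aggregate)
```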

AI hallucinations

Outputs from AI models that appear confident and plausible but are completely fabricated or incorrect.

Algorithmic transparency

The principle of making computational systems, AI, and automated decision-making processes understandable, accessible, and open to review.

Analytical process

In quantitative computational research, the analytical process is a systematic approach for examining data, assessing relationships, and exploring trends to derive meaningful insights. This process typically involves steps such as data discovery, data cleaning, modeling, and interpretation.

API (Application Programming Interface)

A set of rules and protocols enabling various software applications to communicate and securely exchange data. In research and data analysis, APIs help learners retrieve data from online platforms, databases, or other digital services for analysis. This facilitates hands-on experience with real-world data and supports skill development in data extraction and manipulation.

Artificial Intelligence (AI)

AI refers to the ability of computer systems to perform tasks typically associated with human intelligence, including learning, reasoning, problem-solving, perception, and decision-making.

Attrition

Sample attrition refers to the loss of study units from a sample after an initial wave of data collection. For example, individuals who take part in the first wave of a longitudinal study may drop out at a subsequent wave.

Augmented data

Data created by adding new records to an existing dataset in order to increase dataset size, address gaps, or improve the performance of machine learning models. These additional records are typically generated from or based on the original data and are designed to resemble or preserve its key characteristics. See also observed data and synthetic data.

Augmented reality (AR)

Augmented reality refers to the real-time integration of digital information into a user’s environment. This technology superimposes digital content onto the real world, enhancing the user’s perception without replacing their view of reality.

Automation

The application of technology, programs, robotics, or processes to achieve outcomes with minimal human input.

Autonomous agent (reinforcement learning)

A system that makes decisions and takes actions within an environment without direct human guidance, learning from rewards or penalties to improve its performance. For example, a self-driving car continuously decides how to navigate based on sensor input. See also Reinforcement learning.

Bias

Sampling bias occurs when a sample statistic does not accurately reflect the true value of the parameter in the target population. Sample estimates might be too high or too low compared to the true population values. This may arise where the sample is not representative of the population.

Source: SAGE Research Methods.

Branching (version control)

Creating a separate working version of a project (a branch) within a version control system, allowing changes to be developed independently from the main version.

Case

A survey case is a unit for which values are captured. Typically, surveys use individuals, families/households or institutions/organisations as observation units (cases). In survey datasets, cases are usually stored in rows.

Catalogue record

Within a data catalogue, a catalogue record provides essential metadata for the dataset(s) and access to the accompanying documentation. These records typically include a title, details about the data creators, a descriptive overview of the dataset content, information about the sample, and details about the data access conditions, facilitating efficient data discovery and understanding.

Categorical variable

A variable that can take one value from a set of discrete and mutually exclusive responses. For example, a marital status variable can include the categories of single (never married), married, civil partnership, divorced, widowed etc. and respondents can be assigned only one value from this list.

Chain of Thought (CoT) prompting

A technique that improves Large Language Model (LLM) performance on complex tasks by prompting them to produce intermediate reasoning steps before giving a final answer. By dividing problems into sequential, logical steps like “showing your work”, CoT improves accuracy in arithmetic, commonsense, and symbolic reasoning.

Change logs

A change log is a document that records all modifications made to a project, software, or document over time. It functions as a historical record of changes, aiding teams in tracking updates, recognising improvements, and ensuring transparency.

Choropleth map

Choropleth maps colour or shade different areas according to a range of values, e.g. population density or per-capita income.

Cloud-based repositories (version control)

Online platforms that provide version control capabilities, allowing code and project files to be stored, shared, and collaboratively edited from any location. Common examples include GitHub, Google Cloud Source Repositories, and Azure Repos.

Cluster sampling

The process of dividing a population into groups, then selecting a simple random sample of groups and sampling everyone in those groups. An example of this is geographical clustering, which is often efficiently applied in face-to-face surveys. Clustering of addresses limits travel for interviewers and so allows survey producers to sample more respondents for a given budget.

Sources: An Introduction to Statistical Methods and Data Analysis.
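
A minimal sketch of the procedure described above: select a simple random sample of groups, then include everyone in the selected groups. The sampling frame, the cluster names and the function name `cluster_sample` are hypothetical.

```python
import random

# Hypothetical sampling frame: addresses grouped into geographical clusters
clusters = {
    "Sector 1": ["addr_01", "addr_02", "addr_03"],
    "Sector 2": ["addr_04", "addr_05"],
    "Sector 3": ["addr_06", "addr_07", "addr_08"],
    "Sector 4": ["addr_09", "addr_10"],
}

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly select whole clusters and keep every unit inside them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: clusters[name] for name in chosen}

sampled = cluster_sample(clusters, n_clusters=2, seed=3)
print(sampled)
```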

Clusters

Groups of data points that appear closer together than the rest of the data in a dataset because they share similar characteristics or patterns. Clusters are often discussed alongside outliers, which are data points that differ substantially from the broader patterns in the data.

Codebook

A codebook describes the contents, structure, and layout of a data collection. Codebooks begin with basic front matter, including the study title, name of the principal investigator(s), table of contents, and an introduction describing the purpose and format of the codebook. Some codebooks also include methodological details, such as how weights were computed, and data collection instruments, while others, especially with larger or more complex data collections, leave those details for a separate user guide and/or data collection instrument.

Cohort

A group of people who share a characteristic, usually birth year. See also: cohort study.

Source: Cambridge Dictionary

Cohort study

Cohort studies chart the lives of groups of individuals who experience the same life events within a given time period.

Source: Closer Learning Hub

Commit

In version control, the action of recording a set of changes to your project’s history (e.g. git commit). When you commit, you are telling the version control system to record exactly what has changed since the last save, creating a permanent entry in the project’s changelog that you can return to at any time.

Computational environment

Encompasses all the hardware, software, and network settings in which a computer program operates, including the operating system, CPU, and libraries.

Computational social sciences (CSS)

An interdisciplinary field that uses mathematical algorithms, advanced data analysis, and computational modelling to investigate and predict human behaviour and social patterns.

Context engineering

The discipline of designing systems that deliver the precise, relevant information and tools a Large Language Model (LLM) needs at the right moment. It goes beyond static prompt engineering by dynamically selecting, organising, and inserting data such as retrieved documents, user history, or structured API results into the context window to enhance the AI agent’s reliability and effectiveness.

Context window

The amount of text a large language model can read and process at one time. Think of it as the model’s working memory: it can only consider what falls within this window when generating a response. Information outside the window is not visible to the model.

Control variables

A control variable is a variable that is included in an analysis in order to control or eliminate its influence on the variables of interest. For example, if we are looking at the relationship between having a university degree and smoking prevalence, we might need to consider the impact of age at the same time. Respondents from older generations are more likely to smoke than those from younger generations. If we control for age, we can see whether graduates are less likely to smoke than non-graduates once age has been accounted for.

Source: SAGE Research Methods.

Copyright

Copyright is the exclusive and assignable legal right to control all use of an original work, such as a book, data etc., for a particular period of time.

Source: Cambridge Dictionary.

Cross-sectional data

Cross-sectional data are collected from a sample at a single point in time. It is often likened to taking a snapshot. Cross-sectional studies are quick and relatively simple, but they cannot provide information about the change in the same individuals or units over time. Repeated cross-sectional data can however be used to look at aggregate changes in the population as a whole.

Source: SAGE Research Methods.

Data archives

A data archive is a centralised database system that collects, manages, and stores datasets for later use. Similar to a data repository.

Data generalisation (data anonymisation)

The process of replacing detailed or specific data with broader categories (for example, replacing an exact age with an age range). This helps protect privacy by reducing the level of detail that could identify individuals.
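
The age example can be sketched as a small function. The ten-year band width and the function name `age_band` are assumptions made for illustration; the appropriate level of generalisation depends on the dataset.

```python
def age_band(age, width=10):
    """Replace an exact age with a band such as '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"

ages = [23, 37, 41, 68]
bands = [age_band(age) for age in ages]
print(bands)
```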

Data licensing

Data licensing is a legal arrangement between the creator of the data and the end-user specifying what users can do with the data.

Source: How to FAIR.

Data linkage

Data linkage is the process of joining together records from different sources that pertain to the same entity.

Source: ONS; Understanding Society.
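
A minimal sketch of linkage, assuming both sources share an exact person identifier. The two sources and their contents are hypothetical, and real linkage projects often have to match on less reliable information such as names and dates of birth.

```python
# Hypothetical records from two sources, keyed by a shared person identifier
survey = {
    101: {"age": 34},
    102: {"age": 51},
    103: {"age": 27},
}
admin = {
    101: {"in_employment": True},
    103: {"in_employment": False},
    104: {"in_employment": True},
}

# Keep only identifiers found in both sources and combine their fields
linked = {
    person_id: {**survey[person_id], **admin[person_id]}
    for person_id in survey.keys() & admin.keys()
}
print(linked)
```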

Data manipulation

Data manipulation is the process of arranging and organising data to make it easier to use, analyse and interpret.

Data masking (data anonymisation)

The process of protecting sensitive data by replacing it with altered or fictitious values. The masked data preserves the format and structure of the original data but cannot be easily traced back to the real information.
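
As a simple illustration of preserving format while altering values, the sketch below replaces each digit in a made-up telephone number with a random digit and leaves spacing untouched. The function name `mask_digits` is hypothetical, and this is not a complete anonymisation method.

```python
import random

def mask_digits(value, seed=None):
    """Replace every digit with a random digit, keeping all other characters."""
    rng = random.Random(seed)
    return "".join(
        str(rng.randint(0, 9)) if character.isdigit() else character
        for character in value
    )

original = "07700 900123"  # made-up number for illustration
masked = mask_digits(original, seed=0)
print(masked)
```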

Data mining

Data mining is defined as the process of extracting useful information from large data sets through the use of any relevant data analysis techniques developed to help people make better decisions.

Source: SAGE Research Methods.

Data repository

A data repository is a centralised database system that collects, manages, and stores datasets for later use, similar to a data archive.

Data science

A field that draws on statistics, programming, and subject area knowledge to collect, process, and analyse data in order to answer research questions and draw evidence-based conclusions.

Data suppression (data anonymisation)

The process of removing or hiding specific data values in a dataset to reduce the risk of identifying individuals. Suppressed data is typically replaced with blanks or placeholders to prevent sensitive information from being disclosed.
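
One common form of suppression is hiding small counts in a published table. In the sketch below the threshold of 5, the "*" placeholder and the area names are assumptions for illustration; actual thresholds are set by the data owner's disclosure control rules.

```python
def suppress_small_counts(table, threshold=5, placeholder="*"):
    """Replace any count below the threshold with a placeholder."""
    return {
        label: (count if count >= threshold else placeholder)
        for label, count in table.items()
    }

counts = {"Area A": 120, "Area B": 3, "Area C": 47}
published = suppress_small_counts(counts)
print(published)
```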

Dataset

Any computer file (or set of files) which is organised under a single title and is capable of being described as a coherent unit.

Derived variable

A variable that is created from one or more already existing variables by following some sort of calculation or other data processing technique. For example, each respondent’s estimated annual income from savings and investments could be derived from several reported income variables.
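
The income example might look like the sketch below, where a derived variable is calculated by summing several raw variables. The variable names and values are hypothetical.

```python
# Hypothetical raw variables reported by each respondent
RAW_INCOME_VARIABLES = ["savings_interest", "dividends", "bond_income"]

def derive_investment_income(record):
    """Derive total income from savings and investments from the raw variables."""
    return sum(record[name] for name in RAW_INCOME_VARIABLES)

respondents = [
    {"id": 1, "savings_interest": 320.0, "dividends": 150.0, "bond_income": 0.0},
    {"id": 2, "savings_interest": 45.5, "dividends": 0.0, "bond_income": 210.0},
]

# Add the derived variable alongside the raw variables
for record in respondents:
    record["investment_income"] = derive_investment_income(record)

print([record["investment_income"] for record in respondents])
```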

Descriptive statistic

Descriptive statistics are those that describe data. Examples include means, medians, variances, standard deviations, correlation coefficients, etc.

Source: SAGE Research Methods.

Disclosure risk (data anonymisation)

The likelihood that an individual or entity can be identified in a dataset that is intended to be anonymised. This can occur through direct identifiers (such as names) or indirect identifiers (such as combinations of demographic or sensitive attributes). Managing disclosure risk is important for protecting privacy and meeting data protection requirements.

Documentation

Accompanying files that enable users to understand a dataset, exactly how the research was carried out and what the data mean. Documentation usually consists of data-level documentation (i.e. about individual databases or data files) and study-level documentation (i.e. high-level information on the research context and design, the data collection methods used, any data preparations and manipulations, plus summaries of findings based on the data).

DOI (Digital Object Identifier)

A unique, persistent alpha-numeric string used to identify content like scholarly articles and datasets. It offers a permanent, reliable link to the content’s location online. Unlike URLs, which may break or become outdated, a DOI ensures long-term access, functioning like a “digital passport” or barcode for research.

Equal interval

This is a method of dividing the data displayed in a choropleth map. Equal interval simply divides the data into equal-sized subranges. For example, if your data ranged from 0 to 300 and you specified three classes, the ranges would be 0–100, 101–200, and 201–300. See also Natural breaks (or Jenks) and Quantile.
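
The 0 to 300 example can be reproduced with a short sketch. The function names are hypothetical, and placing a value that falls exactly on a break into the lower class is an assumption that mirrors the ranges given above.

```python
def equal_interval_breaks(minimum, maximum, n_classes):
    """Return the class boundaries for equal-width classes."""
    width = (maximum - minimum) / n_classes
    return [minimum + width * i for i in range(n_classes + 1)]

def classify(value, breaks):
    """Return the index of the class a value falls into (0 = lowest class)."""
    for index, upper in enumerate(breaks[1:]):
        if value <= upper:
            return index
    return len(breaks) - 2  # values above the maximum go in the top class

breaks = equal_interval_breaks(0, 300, 3)
print(breaks)
print([classify(value, breaks) for value in (0, 100, 101, 250)])
```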

Few-shot prompting

An AI technique that provides a model with a limited number of examples (shots) in the prompt to improve accuracy, guide output style, and set tone. See also Zero-shot prompting.

Fidelity (in synthetic data)

The extent to which synthetic data reflects the characteristics, patterns, and distributions of real-world data. See also Structural similarity and Substantive similarity.

Generative adversarial network (GAN)

A type of machine learning model that creates realistic data by learning from patterns found in existing datasets. It uses unsupervised learning and two machine learning models working together: one tries to create new data, while the other checks if that data looks real or not. See also Machine learning and Unsupervised learning.

Generative simulation methods

Computational techniques that generate artificial data or model complex systems by learning patterns from existing data rather than following fixed rules. Examples include agent-based models, which simulate the behaviour of individuals within a system, and generative AI models, which produce new, statistically plausible data. For limitations of some generative AI approaches, see AI hallucinations.

Geospatial analysis

A technique that uses location-based data, such as GPS data, satellite imagery, and IoT sensor data, to interpret geographic patterns, relationships, and trends.

Hardware context

A record of exactly what the CPU was doing at a specific point in time, including which instructions it was running and what data it was working with. This record allows the processor to pause one task, switch to another, and later return to the original task exactly where it left off.

Imputation of missing data

Imputation involves replacing missing values with estimated values. It is one of three options for handling missing data. The general principle is to delete when the data are expendable, impute when the data are precious, and segment for the less common situation in which a large data set has a large fissure.

Source: SAGE Research Methods.
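
Mean imputation is one of the simplest ways to produce an estimated value, and is shown below purely as an illustration; it is an assumption of this sketch rather than a recommended method, and more sophisticated approaches exist. Missing values are represented here by `None`.

```python
from statistics import fmean

def impute_mean(values):
    """Replace missing values (None) with the mean of the observed values."""
    observed = [value for value in values if value is not None]
    fill = fmean(observed)
    return [fill if value is None else value for value in values]

ages = [30, None, 40, 50, None]
completed = impute_mean(ages)
print(completed)
```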

Indication risk

The likelihood that a synthetic data point is unique enough that it may lead someone to go looking for an identifiable individual or entity in real-world data. This can occur when synthetic data is created with high fidelity across multiple features as it reproduces the combinations of characteristics that make records distinctive. Just as with disclosure risk, managing indication risk is important for ensuring responsible use of data and meeting data protection requirements.

Informed consent

Informed consent is a process used in research where individuals are provided with complete and adequate information about a study including its risks, benefits, and alternatives, based on which the individual decides whether to participate in the study or not.

Source: SAGE Research Methods.

Interactive documentation

Interactive documentation turns static guides into hands-on experiences, enabling users to learn through doing rather than merely reading. For example, rather than reading a step-by-step guide on how to run a regression in R, a learner can write and execute the code directly within the documentation and see the output immediately. See also Non-interactive documentation.

IoT (Internet of Things) devices

Types of hardware, such as sensors, actuators, or appliances, that connect to a network, whether the internet or a local network, to gather, share, and respond to data.

Iterative

Describes a process that involves repeating a cycle of actions, such as designing, testing, and refining, to gradually improve a product, formula, or idea.

Jupyter notebook

Jupyter Notebook is a free, web-based tool, part of Project Jupyter, where you can write and run code in different programming languages (e.g. Python, R, SQL), see the results instantly, and add notes or explanations, all in one place. It is widely used in data science and research to explore data, test ideas, and share work with others.

Labelled data (machine learning)

Data that has been assigned a label or category, typically as the result of a manual or human labelling process, to indicate the correct classification or category for each data point. Labelled data is used in supervised learning tasks where the machine learning model builds a correlation between the features of each data point and the assigned labels. Examples include an image of a pastry labelled as a ‘pain au chocolat’ or of an animal as a ‘sloth’, an email labelled as ‘spam’ or as ‘not spam’ or some piece of text labelled as ‘positive’, ‘negative’, ‘neutral’ or ‘mixed’. See also Machine learning, Supervised learning, Semi-supervised learning and Unlabelled data.

Large Language Models (LLMs)

These sophisticated AI systems are developed using extensive datasets, enabling them to understand, summarize, generate, and predict content in human language with remarkable accuracy. By recognizing intricate patterns and contextual cues in text, LLMs deliver highly relevant responses tailored to the user’s needs. For learners and researchers, LLMs serve as powerful tools for answering questions, drafting and refining text, translating languages, and analysing documents. Their use accelerates research, fosters independent learning, and encourages deeper exploration of complex topics. Notable examples include Gemini, Claude, and GPT.

Level of measurement (also Scale of measurement)

Refers to the kind of information that the values of numeric variables represent and the relationships between them. It determines what kind of statistical analyses are appropriate. There are traditionally four types: nominal, ordinal, interval, and ratio.

Library (programming)

A collection of pre-written code and functions that can be reused to perform specific tasks by a program. For example, in Python, the pandas library is commonly used to work with and analyse data tables.

Linear (in reproducible documentation)

A document or workflow that progresses in a fixed sequence from start to finish, where each step follows directly from the previous one. In practice, this means a script or notebook moves from data import through cleaning, analysis, and output in a single, unbroken order, making it easier for others to follow, audit, and reproduce the analysis.

Logic

The study of correct reasoning, including formal and informal types. Formal logic focuses on deductively valid inferences and how conclusions follow from premises based on argument structure, regardless of topic. Informal logic deals with fallacies, critical thinking, and argumentation theory.

Long-format data

In long format data, each row represents a single observation or measurement for a subject, often resulting in multiple rows for each subject. For example, in a longitudinal survey tracking student performance over several years, each student will have multiple rows corresponding to different years, with each row recording their performance for that particular year. Contrast with wide-format data.
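
The student performance example can be sketched by reshaping a small table from wide format (one row per student) to long format (one row per student per year). The column names and scores are hypothetical.

```python
# Hypothetical wide-format data: one row per student, one column per year
wide_rows = [
    {"student": "A", "score_2021": 61, "score_2022": 67},
    {"student": "B", "score_2021": 72, "score_2022": 70},
]

# Long format: one row per student per year
long_rows = []
for row in wide_rows:
    for column, value in row.items():
        if column.startswith("score_"):
            year = int(column.split("_")[1])
            long_rows.append({"student": row["student"], "year": year, "score": value})

for row in long_rows:
    print(row)
```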

Longitudinal data

A longitudinal design is one that measures the characteristics of the same individuals on at least two, but ideally more, occasions over time. Its purpose is to directly address the study of individual change and variation. Longitudinal studies are expensive in terms of both time and money, but they provide many significant advantages relative to cross-sectional studies.

Source: SAGE Research Methods.

Machine Learning (ML)

A field of artificial intelligence in which systems learn from data by identifying patterns and relationships, allowing them to make predictions or decisions without being explicitly programmed step by step.

Merge (version control)

The process of combining changes from one branch into another within a version control system (such as Git).

Metadata

Metadata is a set of data that describes and gives information about other data. It describes significant aspects of a resource (e.g. the content, context and structure of information) and is created for the purposes of resource discovery, managing access and ensuring efficient preservation of resources.

Microdata

Microdata are unit-level data obtained from sample surveys, censuses, and administrative systems. They provide information about characteristics of individual people or entities such as households, business enterprises, facilities, farms or even geographical areas such as villages or towns.

Source: The World Bank.

Missing value

Some variables have values that are recorded as missing. These values may be missing unintentionally (due to data entry errors) or may stem from the survey design (e.g. if only part of the sample were asked a particular question). Sometimes non-substantive responses (such as ‘don’t know’) are also recorded as missing values. To draw accurate inferences about the data, missing values need to be treated prior to the analyses, e.g. excluded.

Modelling and simulation (M&S)

M&S involves developing a computerised or mathematical model of a real-world system and conducting experiments (simulations) to analyse its behaviour, improve performance, or forecast outcomes.

Modular Script (or ModuleScript)

A reusable block of code that stores shared functions and data, written once and called upon whenever needed across a project. This keeps code organised and avoids repetition, similar to saving a template you reuse rather than rewriting it from scratch each time.

Multivariate analysis

Multivariate analysis encompasses all statistical techniques that are used to analyse more than two variables at once.

Sources: International Encyclopedia of the Social & Behavioral Sciences.

Multivariate modelling

This method attempts to model mathematically or statistically data from two or more variables measured on the same observations. Multivariate statistical modelling often involves a dependent variable and multiple independent variables. Examples of multivariate analyses are factor analysis, latent class analysis, and multivariate regressions. In contrast, a univariate method involves an analysis of a single variable.

Resources: Centre for Statistical Methodology; STATA; Science Direct; UCLA Institute for Digital Research & Education.

Natural breaks

This is a method of dividing the data displayed in a choropleth map. The Natural breaks (Jenks) method groups similar values together, and breaks are assigned where there are relatively large distances between the classes. This reduces variance within classes and maximises variance between classes. See also Equal interval and Quantile.

Natural language processing (NLP)

A branch of artificial intelligence (AI) that uses machine learning to enable computers to understand, interpret, and generate human language. It combines computational linguistics with statistical modelling and deep learning to analyse text and speech.

Network analysis

A methodology for studying relationships among entities (nodes), such as people or organizations, connected by links (edges) that represent interactions or relationships. This approach enables the mapping and measurement of complex structures like social networks, communication patterns, or collaborations.

Nominal variable

This is a type of categorical variable that represents categories that do not have a natural order. The values assigned to the categories can be presented in any order. For example, there is no natural order to a set of categories describing the religion a person might follow.

Non-interactive documentation

Static, one-way information provided without requiring user input, navigation, or active engagement to comprehend. Contrast with Interactive documentation.

Non-substantive

Non-substantive responses in surveys are responses such as: ‘Not sure/ Do not recall’, ‘Don’t know (DK)’. Non-substantive responses are generally not used in analysis.

Observed data

Data collected from real-world sources, including surveys, observations, interviews, administrative records, transactions, and other records of people, events, or phenomena. It is based on empirical evidence rather than being artificially generated. See also Synthetic data.

Ordinal variable

This is a type of categorical variable that contains values which represent categories which have a natural order. For example, a highest level of qualification variable might follow an order such as:

higher degree
first degree
further education below degree
GCSE or equivalent
no qualification

There is a logical order that the values assigned to the categories can be presented in.

OS (Operating System)

Software that manages a computer’s hardware and applications by allocating resources such as memory, CPU, input/output devices, and file storage.

Outlier

An outlier is an extreme value that differs greatly from other values in a set of values.

Source: Stat Trek Statistics Dictionary.

Panel

A panel refers to a survey sample in which the same units or respondents are surveyed or interviewed on two or more occasions (waves). See also Panel study.

Source: SAGE Research Methods.

Panel study

Panel studies follow the same individuals over time. Information is normally collected about the whole household at each wave. See also Wave.

Source: Closer Learning Hub (Panel studies).

Parameters (in computational environments)

Values passed into a script or function that control how it behaves when it runs. For example, a researcher might set a parameter to specify which dataset to load, which variables to include in a model, or how many iterations an analysis should run, allowing the same script to be reused with different inputs without rewriting the code.
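
A minimal sketch of the idea: the same function is reused with different parameter values instead of being rewritten. The function name `summarise`, its parameters and the data are hypothetical.

```python
from statistics import fmean

def summarise(rows, variable, min_age=0):
    """Mean of `variable` for respondents aged `min_age` or over."""
    values = [row[variable] for row in rows if row["age"] >= min_age]
    return fmean(values)

rows = [
    {"age": 17, "hours_online": 5.0, "hours_tv": 1.0},
    {"age": 25, "hours_online": 3.0, "hours_tv": 2.0},
    {"age": 40, "hours_online": 1.0, "hours_tv": 4.0},
]

# Same code, different parameters
all_online = summarise(rows, "hours_online")
adult_tv = summarise(rows, "hours_tv", min_age=18)
print(all_online, adult_tv)
```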

Parsing (Natural Language Processing)

The process of analyzing the grammatical and syntactic structure of text to determine how words and phrases relate to one another, enabling machines to understand sentence structure, meaning, and context.

Population

In survey design, a population is an entire collection of observation units, for example all ‘residents in England and Wales in 2020’, about which researchers seek to draw inferences.

Population estimate

Statistics produced using a sample of cases (sample statistics), which are designed to produce an estimate about the characteristics of the population (population parameter).

Source: SAGE Research Methods.

Precision

Precision refers to the size of deviations from a survey estimate (i.e. a survey statistic, such as a mean or percentage) that occurs over repeated application of the same probability-based sampling procedures using the same sampling frame and sample size. Standard errors and confidence intervals are two examples of commonly used measures of precision.

Source: SAGE Research Methods.

Primary data source

Primary data is data collected first-hand for a specific research purpose or project.

Source: SAGE Research Methods.

Probabilistic

Probabilistic refers to systems, models, or reasoning that use likelihoods and statistical probabilities instead of absolute certainty. It is commonly used to handle uncertainty in AI, engineering, and data analysis.

Probability sample

A sample based on random selection of elements. It should be possible for the sample designer to calculate the probability with which an element in the population is selected for inclusion in the sample.

Programming language

A set of instructions that allows us to communicate commands to a computer. Using a programming language, we can control a computer’s behaviour and automate processes.

Prompt engineering

Prompt engineering is the art and science of structuring, designing, and optimising inputs (prompts) to guide Large Language Models (LLMs) such as GPT, Claude, and Gemini to generate accurate, relevant, and high-quality outputs.

Pseudocode

A plain-language description of the logic or steps of a programme, written in a way that is easy for humans to read and understand without following the strict rules of any specific programming language. It is typically written in English but can use any natural language, and is used to plan and communicate the structure of code before writing it formally.

PSPP

PSPP is an open source statistics package which has a similar design and basic functionality to SPSS. Visit the PSPP website for more information.

Python virtual environment

An isolated folder with its own specific Python version and packages. It keeps project dependencies separate, avoiding conflicts with global packages. This prevents version clashes, maintains consistent environments, and does not require admin rights to install libraries.

Quantile

This is a method of dividing the data displayed in a choropleth map. Quantile classification arranges data so there is the same count of features in each class. This will result in an equal distribution of shading across the map. This can result in a misleading map, as similar features can be in different classes, and widely different features in the same class. See also Equal interval and Natural breaks (or Jenks).
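
A rough sketch of quantile classification, assuming classes are assigned by rank so that each class receives the same number of features. The function name and the values are hypothetical, and mapping software may treat ties and uneven counts differently.

```python
def quantile_classes(values, n_classes):
    """Assign each value a class index so every class holds the same count."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    classes = [0] * len(values)
    for rank, index in enumerate(order):
        classes[index] = rank * n_classes // len(values)
    return classes

# Hypothetical population densities for six areas
densities = [5, 7, 9, 11, 250, 900]
classes = quantile_classes(densities, 3)
print(classes)
```

In this example, 7 and 9 fall into different classes while 250 and 900 share one, which illustrates how the method can mislead.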

Queries (in computing)

A request for specific information from a database or system. In research, queries are used to extract, filter, and retrieve data that meets defined criteria, for example, asking a database to return all survey responses from participants aged 18 to 30 in a particular region. The most common programming language for writing queries is SQL.
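
The survey example can be sketched with Python's built-in `sqlite3` module and an in-memory database. The table name, column names and rows are hypothetical.

```python
import sqlite3

# Build a small in-memory database of hypothetical survey responses
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE responses (id INTEGER, age INTEGER, region TEXT)")
connection.executemany(
    "INSERT INTO responses VALUES (?, ?, ?)",
    [(1, 17, "Wales"), (2, 24, "Wales"), (3, 29, "Scotland"),
     (4, 30, "Wales"), (5, 45, "Wales")],
)

# The query: participants aged 18 to 30 (inclusive) in one region
rows = connection.execute(
    "SELECT id, age FROM responses "
    "WHERE age BETWEEN 18 AND 30 AND region = ? ORDER BY id",
    ("Wales",),
).fetchall()
connection.close()
print(rows)
```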

Random state

In machine learning and data science, random state is a parameter that controls the randomness of algorithms, ensuring they are deterministic and reproducible. By assigning a specific integer to a random state, you guarantee that the same sequence of random numbers will be generated each time the code is executed.
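
The effect of fixing a random state can be seen with Python's built-in `random` module. The function name `draw_sample` and its `random_state` argument are hypothetical, chosen to mirror the parameter name used by some libraries such as scikit-learn.

```python
import random

def draw_sample(population, k, random_state):
    """Draw k items using a generator seeded with random_state."""
    rng = random.Random(random_state)
    return rng.sample(population, k)

respondent_ids = list(range(1, 101))

first = draw_sample(respondent_ids, 5, random_state=42)
second = draw_sample(respondent_ids, 5, random_state=42)
print(first == second)  # the same seed reproduces the same sample
```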

Raw variable

A variable that stores responses given to a question in the survey in their original form. Contrast with derived variables.

Readme file (GitHub repository)

A document placed at the top level of a project folder that explains what the project is, how it is organised, and how to use it. A typical GitHub Readme might include sections like an introduction, installation instructions, usage examples, and a list of contributors. For example, a Readme for a data analysis project could start with a summary of the purpose, followed by setup steps and code snippets showing how to run analyses. It is typically the first thing a collaborator or reviewer will read when they encounter the project.

Regression model

A statistical model used to estimate how changes in one or more variables are associated with changes in another variable.

Reinforcement learning

A method in which an agent (a system that can make decisions and take actions) learns by trial and error, receiving feedback in the form of rewards or penalties. Over time, the agent learns which actions lead to the best outcomes and uses this experience to improve its future decisions. For example, an AI writing assistant might suggest a synonym for a frequently used word based on the surrounding text, and then receive a score from 1 to 5 reflecting how much the author likes the suggestion.

Repository (or repo) (Version control)

A centralised folder or storage location used in version-controlled environments (such as GitHub, GitLab, or Bitbucket) for keeping code, data, and documentation related to a research project. A version-controlled repository records a history of every committed change, allowing users to track revisions, collaborate, and revert to previous versions if needed. These repositories can be stored locally or on cloud-based platforms, enabling multiple contributors to access and update the project efficiently.

Representative sample

A representative sample is one that replicates the characteristics of the population.

Reproducibility

In the context of scientific research, reproducibility refers to the ability of an independent researcher or team to recreate the results of a study using the same methods and data as the original study. This concept hinges on the provision of detailed methodology, context, and background information.

Reproducible research

Reproducible research involves making data, code, and computational steps publicly available, where possible, so that others can verify, replicate, and build upon findings. This practice enhances the reliability of research, promotes transparency, and allows others to check that the original inputs produce consistent results.

Research data management

Research data management refers to the systematic organisation, storage, preservation and sharing of data resulting from a research project. It involves practices that span the entire data lifecycle, from planning the collection to wider data sharing.

Source: CODATA RDM Terminology Working Group. (2024). CODATA RDM Terminology (2023, v0001): overview (2023, v0001). Zenodo. DOI: https://doi.org/10.5281/zenodo.10626170

Research methodology

Research methodology is a description of the approach followed to complete a research project; the ‘how’ that helps the researcher address the research aims, objectives and research questions.

Respondent

A person, or other entity, who responds to a survey.

Sample

A sample is a subset of a population.

Sampling

The process of selecting and examining a portion (a sample) of a larger group of potential participants (a population) in order to produce inferences that apply to the broader group of participants.

Source: Encyclopaedia of Quality of Life and Well-Being Research.

Sampling frame

A sampling frame is a comprehensive list of all the members of the population from which a probability sample will be selected.

Source: SAGE Research Methods.

Satellite imagery

Satellite images (also known as Earth observation imagery, spaceborne photography, or simply satellite photos) are images of Earth captured by satellites operated by governments and businesses worldwide.

Scalability (in computational or data analysis)

The ability of a system, method, or workflow to efficiently handle increasing amounts of data, users, or processing demands without significant loss of performance or reliability.

Semi-supervised learning (machine learning)

A method used to train systems using a combination of labelled and unlabelled data as an input. In semi-supervised learning, the machine learning model learns to correlate the features of each labelled data point with the label assigned to it while also learning from the unlabelled data. For example, a small set of labelled emails combined with many unlabelled emails allows the system to learn more effectively whether a new email is spam or not than using only labelled data. See also Machine learning, Supervised learning, Unsupervised learning, Labelled data and Unlabelled data.

Simulation (synthetic data)

Simulation in the context of synthetic data means creating artificial datasets by building virtual environments on a computer. These environments include agent-based models, digital twins, and more. Simulations allow researchers to generate data that is already labeled and ready for analysis. It is a popular way to create synthetic data when real-world data is unavailable.

Source code

A set of instructions written in a programming language (such as Python, R, JavaScript, or C++) that tells a computer how to perform specific tasks. Think of it as the blueprint for a piece of software; it tells the computer exactly what to do and how the program should work. In programming contexts, source code is usually referred to simply as ‘code’.

SPSS

SPSS is a commercial statistics package. Visit the SPSS website for more information.

Standard error

Standard error measures the uncertainty or variability associated with a sample estimate when compared to the true population parameter. The standard error of a statistic (like a mean or percentage) indicates how much that statistic is expected to vary from the true population value.

The standard error also shrinks as the sample size grows: for a sample mean it is inversely proportional to the square root of the sample size, so quadrupling the sample size halves the standard error.

Source: Stat Trek Statistics Dictionary.
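As an illustrative sketch, the standard error of a sample mean can be estimated as the sample standard deviation divided by the square root of the sample size. The function names and the ages below are hypothetical, and the formula assumes a simple random sample.

```python
import math
import statistics

def se_from_sd(sd, n):
    """Standard error of a mean, given a standard deviation and sample size."""
    return sd / math.sqrt(n)

def standard_error_of_mean(sample):
    """Estimate the standard error of the mean from the sample itself."""
    return se_from_sd(statistics.stdev(sample), len(sample))

# Hypothetical ages from a small survey sample.
ages = [23, 35, 41, 29, 52, 38, 47, 31]
se = standard_error_of_mean(ages)
print(round(se, 2))

# Holding the spread fixed at 10, a sample four times larger
# gives a standard error half the size.
print(se_from_sd(10, 100), se_from_sd(10, 400))  # 1.0 0.5
```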

Statistical modelling

A statistical model is a theoretical construction of the relationship of explanatory variables to variables of interest created to better understand these relationships.

Statistical models typically consist of a collection of probability distributions and are used to describe patterns of variability that data may display.

The statistical model is expressed as a function. For example, a researcher may model a linear relationship using the regression function below:

y = b₀ + b₁x₁ + b₂x₂ + … + bᵢxᵢ

In this model, y represents an outcome variable and each xᵢ represents a predictor variable. The term b₀ is the intercept of the model. Each bᵢ is a regression coefficient and represents the numerical relationship between the ith predictor variable and the outcome.

Statistical modelling is a major topic. Readers who want to know more will find extensive accounts of statistical models including linear regression and logistic regression in statistical texts and online.

Sources: Science Direct; Magoosh Statistics Blog.
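For the simplest case of the regression function in this entry, with a single predictor, the intercept and coefficient can be estimated by ordinary least squares. The sketch below is illustrative only: the function name is hypothetical and the data are invented to lie exactly on a straight line, which real data rarely do.

```python
def fit_simple_regression(x, y):
    """Least-squares estimates (b0, b1) for the model y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: co-variation of x and y divided by the variation in x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Invented data: years of education (x) and hourly pay (y).
education = [10, 12, 14, 16, 18]
pay = [11.0, 13.0, 15.0, 17.0, 19.0]

b0, b1 = fit_simple_regression(education, pay)
print(b0, b1)  # 1.0 1.0
```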

Strata

Stratified random sampling refers to a sampling method in which the total population is divided into non-overlapping subgroups. Each of the subgroups is called a stratum, and two or more subgroups are called strata.

Sources: An Introduction to Statistical Methods and Data Analysis; Stat Trek Statistics Dictionary.
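A rough sketch of stratified random sampling follows: the population is split into strata and a simple random sample is drawn within each. The function name, the 10% sampling fraction, and the urban/rural population are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, fraction, seed):
    """Draw the same fraction at random from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        # Each unit belongs to exactly one stratum (non-overlapping groups).
        strata[stratum_of(unit)].append(unit)
    sample = []
    for name in sorted(strata):
        members = strata[name]
        k = round(len(members) * fraction)
        sample.extend(rng.sample(members, k))
    return sample

# Invented population: 60 urban and 40 rural residents.
population = [(i, "urban") for i in range(60)] + [(i, "rural") for i in range(60, 100)]
chosen = stratified_sample(population, lambda unit: unit[1], 0.1, seed=1)

urban = sum(1 for _, area in chosen if area == "urban")
rural = sum(1 for _, area in chosen if area == "rural")
print(urban, rural)  # 6 4
```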

Structural similarity

How well a synthetic dataset preserves the structure or shape of the original dataset. This includes the number of columns or rows, the name of each column, the type of data in each, or even the way variables are coded. See also Fidelity and Substantive similarity.

Structured interview

A structured interview follows a strict protocol using a set of defined questions administered in the same order to all interviewees. It allows for a quick collection of focused data; however, there are limited opportunities for probing and further exploration of topics. The interviews are usually conducted face to face or over the phone.

Substantive similarity

How well a synthetic dataset preserves the statistical properties, patterns, and relationships found in the original real-world data. This includes features like distribution, central tendency, correlations and more. See also Fidelity and Structural similarity.

Supervised learning (machine learning)

A method that uses labelled data as input to train systems to make predictions or decisions. In supervised learning, the machine learning model learns to correlate the features of each data point with its assigned label. The trained model can then predict or categorize new data points. For example, emails labelled as “spam” or “not spam” allow the system to learn which features are likely to indicate whether a new email is spam or not. See also Machine learning, Unsupervised learning, Semi-supervised learning and Labelled data.

Survey design

Survey design involves a series of methodological steps to create an effective survey, such as defining an objective, determining a target population, designing the questionnaire etc.

Survey design can also refer to the structure or format of the survey, such as a cross-sectional survey, longitudinal survey etc.

Survey non-response

Survey non-response can occur at both an item and unit level.

Item non-response occurs when a sample member responds to the survey but fails to provide a valid response to a particular item (e.g. a question they refuse to answer).

Unit non-response occurs when eligible sample members either cannot be contacted, refuse to participate in the survey or do not provide sufficient information for their responses to be valid. Unit non-response can be a source of bias in survey estimates, and reducing it is an important objective of good survey practice.

Source: SAGE Research Methods.

Synthetic data

Data that is generated or created rather than directly collected from real-world observations. It may be designed to resemble real data, either by preserving the structure of the original data set or also preserving patterns and relationships within it. Alternatively, synthetic data may be generated without any intention to resemble real world data. See also Observed data.

System configuration

The hardware, software, and settings that determine how a computer system or project runs. For example, the operating system, installed software, and network settings.

Target population

The target population, or simply the population, represents the specific group we are interested in studying.

Time-series analysis

A statistical method for analyzing data points collected at regular intervals over time. Time-series analysis is used to identify and model trends, cycles, seasonal effects, and other temporal patterns within the data. This approach is fundamental in fields such as economics, finance, environmental science, and social science for understanding historical dynamics and forecasting future values.

Timestamp

A record of the precise date and time of an event, such as file creation, message sending, or a transaction. It is often used to verify when the activity occurred.
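A short sketch of creating and storing a timestamp with Python's standard datetime module. Using UTC and the ISO 8601 text format is one common convention, assumed here for illustration; the variable names are arbitrary.

```python
from datetime import datetime, timezone

# Capture the current moment in UTC so the value means the same
# thing regardless of the time zone of the computer recording it.
created_at = datetime.now(timezone.utc)

# ISO 8601 text is a widely used way to store a timestamp.
stamp = created_at.isoformat()
print(stamp)

# Parsing the stored text recovers exactly the same moment.
restored = datetime.fromisoformat(stamp)
print(restored == created_at)  # True
```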

Tokens (AI, NLP)

Tokens in AI are the basic data units, such as character segments, words, or parts of words, used by large language models (LLMs) like GPT and Claude to process, comprehend, and produce text.

Top-p sampling

A setting that controls how varied or predictable an LLM’s responses are. Rather than considering all possible words when generating a response, the model restricts its choices to the most probable options, adding them in order of probability until their combined probability reaches a set threshold. A lower threshold produces more focused and predictable outputs; a higher threshold allows for more varied and creative responses.
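The selection step can be sketched as follows. This is a simplified illustration rather than the implementation used by any particular model: the function name top_p_filter and the next-word probabilities are invented, and real models work with tokens and much larger vocabularies.

```python
def top_p_filter(probs, p):
    """Keep the most probable options until their combined probability
    reaches p, then rescale the kept probabilities to sum to 1."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = []
    cumulative = 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {word: prob / total for word, prob in kept}

# Invented probabilities for the next word in a sentence.
next_word = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "axolotl": 0.05}

print(sorted(top_p_filter(next_word, 0.5)))  # ['cat']
print(sorted(top_p_filter(next_word, 0.9)))  # ['cat', 'dog', 'fish']
```

With the lower threshold only the single most likely word remains; with the higher one the model may choose among three.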

Unit of analysis

The unit which is being analysed. This is synonymous with Case.

Univariate

Univariate analysis involves analysis of a single variable. Examples of univariate analyses include descriptive statistics (mean, standard deviation, kurtosis), goodness-of-fit tests and the Student’s t-test.

Unlabelled data (machine learning)

Data that has no labels or tags attached. It is used in unsupervised learning tasks, where the machine learning model is given the data and asked to identify possible patterns or clusters within it. See also Machine learning, Unsupervised learning, Semi-supervised learning and Labelled data.

Unsupervised learning (machine learning)

A method used to train systems using unlabelled data. In unsupervised learning, the machine learning model learns to find patterns or groupings within the data, for example by comparing data points with one another. The output is a set of possible patterns or groups that the model has identified. For example, customer data might be grouped into suggested customer segments based on the behaviour captured in the data rather than predetermined categories. See also Machine learning, Supervised learning, Semi-supervised learning and Unlabelled data.

Value

A representation of a characteristic for one case. For one variable, values may vary from one case to another. E.g. for the variable ‘gender’ the values may be ‘male’, ‘female’ or ‘other’.

Value label

A description of the values a variable can take on. Sometimes nominal values are coded as numbers and the label helps to describe what each of these numbers means. E.g. for the variable ‘gender’ the values may be:

1 = female
2 = male
3 = other
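In code, value labels are often kept as a lookup from numeric code to description. The codes 1, 2 and 3 below are an assumed coding scheme used only for illustration; the actual codes for any survey are given in its codebook.

```python
# Assumed coding scheme: numeric code -> value label.
gender_labels = {1: "female", 2: "male", 3: "other"}

# Coded responses as they might be stored in the data file.
coded = [2, 1, 1, 3]

# Applying the labels makes the values readable.
labelled = [gender_labels[code] for code in coded]
print(labelled)  # ['male', 'female', 'female', 'other']
```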

Variable

A variable is an attribute that describes a person, place, thing, or idea. The value of the variable can “vary” from one entity to another. In surveys, this is usually a characteristic that varies between cases.

Source: Stat Trek Statistics Dictionary.

Version control

Version control (or source control) is a system that tracks changes to files or code over time. It enables individuals and teams to monitor history, revert to previous versions, and collaborate without overwriting each other’s work.

Virtual Reality (VR)

A three-dimensional synthetic digital environment that provides users with multiple degrees of freedom to interact with it and engage in immersive experiences.

Wave

A wave is a round of data collection in a particular longitudinal survey (for example, the age 7 wave of the National Child Development Study refers to the data collection that took place in 1965 when the participants were aged 7). Note that the term sweep often has the same meaning.

Source: Closer Learning Hub (Panel studies).

Wearable devices

Body-worn computers connected to the internet, such as smartwatches, fitness trackers, smart rings, and smart glasses. They monitor health metrics, display notifications, and facilitate touchless interactions using AI.

Web scraping

An automated method of collecting large amounts of data from websites using software called web scrapers, which analyse HTML code to convert unstructured web content into structured data, such as JSON or CSV files or tables in an SQL database.
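The conversion step can be sketched with Python's standard html.parser module. The example parses an HTML snippet held in a string instead of downloading a live page, and the class name TableScraper and the table contents are invented for illustration.

```python
import json
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collect the text of every table cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # start a new row
        elif tag in ("td", "th"):
            self._in_cell = True
            self.rows[-1].append("")      # start a new cell

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()

# A made-up page fragment standing in for downloaded HTML.
page = """
<table>
  <tr><th>region</th><th>population</th></tr>
  <tr><td>North</td><td>1200</td></tr>
  <tr><td>South</td><td>950</td></tr>
</table>
"""

scraper = TableScraper()
scraper.feed(page)

# First row is the header; the rest become structured records.
header, *body = scraper.rows
records = [dict(zip(header, row)) for row in body]
print(json.dumps(records))
```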

Weighting

Weighting is a statistical adjustment made to survey data to improve the accuracy of survey estimates. Weighting can correct for unequal probabilities of selection and survey non-response.

Source: SAGE Research Methods.
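A toy illustration of how weights change an estimate: each respondent's value is multiplied by their weight before averaging. The ages, the weights, and the function name weighted_mean are hypothetical; real survey weights are supplied with the dataset.

```python
def weighted_mean(values, weights):
    """Mean in which each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Three hypothetical respondents; the first represents a group that
# was under-sampled, so they carry a larger weight.
ages = [20, 40, 60]
weights = [3.0, 1.0, 1.0]

print(sum(ages) / len(ages))          # 40.0 (unweighted)
print(weighted_mean(ages, weights))   # 32.0 (weighted)
```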

Wide-format data

In wide-format data, each subject’s responses are listed in a single row, with different variables spread across multiple columns. Longitudinal data in wide format would contain one row of information per person, and measurements of the same variable at different time points would be contained in different variables. Contrast with Long-format data.
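The difference between the two layouts can be seen by reshaping a small wide-format table into long format. The column names (person_id, income_2020, income_2021) and the values are invented for illustration.

```python
# Wide format: one row per person, one column per time point.
wide = [
    {"person_id": 1, "income_2020": 21000, "income_2021": 22500},
    {"person_id": 2, "income_2020": 30500, "income_2021": 31000},
]

# Long format: one row per person per time point.
long = []
for row in wide:
    for column, value in row.items():
        if column.startswith("income_"):
            year = int(column.split("_")[1])
            long.append({"person_id": row["person_id"], "year": year, "income": value})

print(len(wide), len(long))  # 2 4
```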

Zero-shot prompting

Zero-shot prompting is an AI interaction technique in which a prompt requests that a model perform a task without prior examples or specific training. See also Few-shot prompting.
