Computational social science

Computational social science: Reflecting the changing data research landscape

New technologies, resources and methods are constantly changing how researchers interact with and use data. Many innovations, including those in modelling, simulation, big data, web-scraping, social media and more, have already made a huge impact on how researchers can and should access, manage, and explore data. The social sciences are no different, with many new forms of social data or methods of analysis that are more computationally intensive than most social scientists may be used to. The use of these computational tools, data and methods requires new skills as well as some new perspectives and expectations. Computational social science as a discipline applies these new tools and perspectives to data analysis.

The UK Data Service supports researchers and data analysts by providing free learning resources relating to several innovative aspects of data-intensive social science research.

Administrative data

The UK Data Service provides access to the country’s largest collection of economic, population and social research data, including administrative datasets. Within this collection, the Service offers secure access to newly linked datasets from the Centre for Longitudinal Studies (CLS) and NHS Digital, the National Pupil Database, the National Referral Mechanism and Duty to Notify Statistics, UCAS admissions service data and Ofsted data.

ADR UK offers training resources for researchers learning to use administrative data.

Computational social science – an introductory workshop

Taking place every spring and autumn, this workshop examines what computational social science is, what researchers can do to become computational social scientists, and our recommended 8-step process behind computational social science research projects.

Resources

Includes recordings, slide decks and GitHub repositories with interactive code notebooks.

Text-mining

Text and other semi-unstructured data are becoming more readily available and more intriguing to social scientists. This training series introduces core text-mining concepts and demonstrates some methods that researchers can learn for working with text data.
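As a rough illustration of one core text-mining idea, the sketch below counts how often each term appears across a handful of documents after lower-casing and removing common "stop words". It is not taken from the training series: the example sentences, the tiny stop-word list and the function name term_frequencies are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word list, kept tiny for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are"}


def term_frequencies(documents):
    """Count non-stop-word tokens across a list of strings."""
    counts = Counter()
    for text in documents:
        # Lower-case the text and keep runs of letters (and apostrophes).
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts


# Made-up example documents.
docs = [
    "The housing survey asked about rent and housing costs.",
    "Rent increases are a concern in the survey responses.",
]
freqs = term_frequencies(docs)
print(freqs.most_common(3))
```

Real projects usually rely on a fuller stop-word list and on stemming or lemmatisation, but the same tokenise, filter and count pattern sits underneath many text-mining methods.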

Resources

Our text-mining training series goes into different levels of detail, demonstrating different programming languages, and using different data sets. The resources here link to the recordings, slide decks and GitHub repositories (with interactive code notebooks) for all versions that are currently available.

Web-scraping and APIs for social science research

Discover how to acquire and use data through web-scraping tools and application programming interfaces (APIs). The resources here link to the recordings, slide decks and GitHub repositories (with interactive code notebooks) for all versions that are currently available.

Twitter data for social scientists

Although “Working with Twitter” is included in the web-scraping and APIs for social science research materials, there is more to learn about the specific techniques in acquiring and using data from the Twitter API. The resources here link to the recordings, slide decks and GitHub repositories (with interactive code notebooks) for all versions that are currently available.

Machine learning for social scientists

An increasingly popular way to deal with large-volume, high-complexity or rapidly changing data is to use machine learning algorithms. The resources here link to the recordings, slide decks and GitHub repositories (with interactive code notebooks) for all the various workshops, webinars and code demonstrations that are currently available, as well as to a machine learning research project, which serves as a case study and step-by-step example.

Agent-based modelling

Agent-based modelling (ABM) refers to a class of computational models for simulating the actions and interactions of autonomous agents that can represent individuals or collective entities such as households, organisations or political entities. ABM allows researchers to explore how emergent patterns might change under experimental, or even counter-factual, conditions in rapid and cost-effective simulations.
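To make the idea concrete, here is a minimal sketch of an agent-based model, assuming a deliberately simple set-up that is not drawn from the training series: agents sit on a ring, one agent starts as an adopter of some behaviour, and at each step a non-adopter with at least one adopting neighbour adopts with probability p_adopt. The function name, parameters and default values are all hypothetical.

```python
import random


def run_adoption_model(n_agents=50, n_steps=20, p_adopt=0.3, seed=1):
    """Simulate adoption spreading between neighbours on a ring.

    Returns the number of adopters at the start and after each step.
    """
    rng = random.Random(seed)  # seeded so that runs are repeatable
    adopted = [False] * n_agents
    adopted[0] = True  # a single initial adopter
    history = [sum(adopted)]

    for _ in range(n_steps):
        new_state = adopted[:]  # update all agents simultaneously
        for i in range(n_agents):
            if adopted[i]:
                continue
            left = adopted[(i - 1) % n_agents]
            right = adopted[(i + 1) % n_agents]
            if (left or right) and rng.random() < p_adopt:
                new_state[i] = True
        adopted = new_state
        history.append(sum(adopted))
    return history


# Compare two hypothetical conditions: low and high willingness to adopt.
low = run_adoption_model(p_adopt=0.1)
high = run_adoption_model(p_adopt=0.5)
print(low[-1], high[-1])
```

Re-running the same model under different parameter values, as in the last lines, is the simulated equivalent of comparing experimental or counter-factual conditions.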

Resources

The ABM training series has been run multiple times, with each edition featuring different examples, answering different questions, and demonstrating different points. The resources here link to the recordings, slide decks and GitHub repositories for all versions that are currently available.

Synthetic data

There are many kinds of synthetic data and in turn synthetic data is put to many kinds of uses. This training series explores the advantages and disadvantages of synthetic data, what it is good for, and how it is generated.

Resources

The resources within the synthetic data training series feature different foci, examples, questions, and issues. The resources here link to the recordings, slide decks and GitHub repositories for all versions that are currently available.

  • GitHub repository for synthetic data (covering basic concepts, adding real-world data, experimentation and a guest researcher presenting their published ABM work).
  • Synthetic Data YouTube playlist (recordings of various webinars, workshops, code demonstrations and guest presentations).
  • Workshop slide decks (available within the relevant GitHub repositories).

Social network analysis

Analysing interactions through social network analysis is not just for social media research! Any kind of interaction between entities (people, businesses, countries, etc.) can be explored through social network graphs to discover the properties of that network.

This training series introduces core social network analysis concepts and demonstrates some methods that researchers can learn for working with interaction data.
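As a small, hedged example of what "properties of a network" can mean, the sketch below builds an undirected graph from a list of interactions and computes each node's degree (its number of connections) and the overall density (the share of possible ties that actually exist). The edge list is made up, and the helper names build_graph, degree and density are illustrative assumptions rather than part of any particular library.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected graph as a dict mapping node -> set of neighbours."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def degree(graph):
    """Number of distinct neighbours for each node."""
    return {node: len(neighbours) for node, neighbours in graph.items()}


def density(graph):
    """Existing ties divided by the number of possible ties."""
    n = len(graph)
    if n < 2:
        return 0.0
    # Each undirected tie appears in two neighbour sets, so halve the total.
    m = sum(len(neighbours) for neighbours in graph.values()) / 2
    return 2 * m / (n * (n - 1))


# Made-up interactions between countries (for example, trade agreements).
edges = [
    ("UK", "France"),
    ("UK", "Germany"),
    ("France", "Germany"),
    ("Germany", "Poland"),
]
graph = build_graph(edges)
print(degree(graph))
print(round(density(graph), 3))
```

Dedicated network analysis libraries offer many more measures, but degree and density are often the first properties examined.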

Resources

The resources here link to the recordings, slide decks and GitHub repositories (with interactive code notebooks) for all versions that are currently available.

Mapping crime data in R

This training series provides an overview of available crime data from the UK Data Service, UK police API, and “crimedata” R package, plus methods to access, analyse, and present results. It is aimed at students, researchers or anyone interested in crime data, from beginners with basic knowledge of quantitative analysis and R, through to intermediate users looking for inspiration and new resources.

Resources

The mapping crime data in R training series has been run multiple times, with each edition featuring different foci, examples, questions, and issues. In particular, the 2021 edition is a 3-day interactive workshop with in-depth detail on many relevant aspects of R, mapping, interactive visualisations and crime data. The resources here link to the recordings, slide decks and GitHub repositories for all versions that are currently available.

Time series and forecasting

Time series analysis and forecasting are among the most common quantitative techniques employed by businesses and researchers. Frequently applied to big data, these methods are used to identify trends, clean data, and even predict the future. This workshop begins by exploring the underlying concepts and components of time series analysis, then moves on to a code demonstration that uses open-source police recorded crime statistics to visually explore these components.
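One common way to separate a series into components is to estimate the trend with a centred moving average and treat what is left over as the remainder. The sketch below illustrates that idea only; it is not the workshop's own code, the monthly counts are invented, and the function names are hypothetical.

```python
def centred_moving_average(values, window=3):
    """Smooth a series with a centred moving average (odd window only).

    The result is shorter than the input by window - 1 points, because the
    first and last few observations lack a full window around them.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    return [
        sum(values[i - half : i + half + 1]) / window
        for i in range(half, len(values) - half)
    ]


def detrend(values, window=3):
    """Subtract the moving-average trend from the matching observations."""
    half = window // 2
    trend = centred_moving_average(values, window)
    return [values[i + half] - t for i, t in enumerate(trend)]


# Invented monthly counts, for illustration only.
monthly_counts = [120, 135, 128, 140, 155, 149, 160, 172]
trend = centred_moving_average(monthly_counts)
residual = detrend(monthly_counts)
print(trend)
print(residual)
```

A wider window gives a smoother trend but discards more points at each end of the series; with monthly data, the window is often chosen to span a full seasonal cycle.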

Resources

The resources here link to the recordings, slide decks and GitHub repositories (with interactive code notebooks) for all versions that are currently available.

Computational methods for processing and working with data

All projects need data – you can generate it yourself via surveys or you can access secondary data from the UK Data Service.

With secondary data, you may need to adapt it to the needs of your research project. You can adapt the data by a variety of means: cleaning the data, extracting parts of the data or joining data from different sources.

Webinar: Data pre-processing

Data pre-processing is a data mining technique that involves transforming raw data into an understandable format. Although the amount of data available for research and analysis is increasing, real-world data is often incomplete or inconsistent and therefore not ready for immediate use. Typical problems include data spread across multiple spreadsheets, missing values, typos, numbers stored as text and unnecessary columns. Data without adequate preparation will deliver poor or misleading findings.
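A short sketch of what such preparation can look like in practice is given below. It assumes a small, made-up CSV extract with numbers stored as text, missing values, a typo and an unnecessary column; the column names and the helpers to_number and clean_rows are hypothetical and not taken from the webinar.

```python
import csv
import io

# Made-up raw extract: padded numbers, a thousands separator, blanks,
# a typo ("forty"), a missing-value code ("NA") and an unneeded column.
RAW = """id,age,income,notes
1, 34 ,"21,000",ok
2,,18500,
3,forty,NA,check
"""


def to_number(text):
    """Convert text to a float, returning None when it is not a number."""
    cleaned = text.strip().replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None  # blanks, typos and codes such as "NA" become missing


def clean_rows(raw_csv, keep=("id", "age", "income")):
    """Keep only the wanted columns and convert their values to numbers."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    return [{col: to_number(row[col]) for col in keep} for row in reader]


rows = clean_rows(RAW)
for row in rows:
    print(row)
```

Turning unparseable values into explicit missing values, rather than silently dropping rows, keeps the decision about how to handle them visible for the analysis stage.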

Resources

Webinar: Introduction to SQL and databases

For large datasets, e.g. millions of records, a dedicated database environment is the storage medium of choice. A database environment is designed to store data efficiently while making it straightforward to retrieve the data you want using simple-to-learn SQL queries.
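As a minimal illustration of storing data in a database and retrieving a summary with SQL, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The table name, columns and values are invented for the example, and SQLite is simply a convenient assumption here; the webinar may use a different database system.

```python
import sqlite3

# An in-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE respondents ("
    "id INTEGER PRIMARY KEY, region TEXT, income REAL)"
)

# Made-up records, inserted with parameter placeholders.
conn.executemany(
    "INSERT INTO respondents (region, income) VALUES (?, ?)",
    [("North", 20000), ("North", 30000), ("South", 25000)],
)

# Retrieve the number of respondents and mean income in each region.
summary = conn.execute(
    "SELECT region, COUNT(*), AVG(income) "
    "FROM respondents GROUP BY region ORDER BY region"
).fetchall()
print(summary)
conn.close()
```

The same SELECT ... GROUP BY pattern applies whether the table holds three rows or several million; the database engine does the work of finding and aggregating the records.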

Resources

Webinar: Power Pivot and dynamic arrays in Excel

Versions of Excel now include Power Pivot – an extension that allows users to load datasets with millions of rows and join datasets together in a more intuitive way. Although integrated into Excel, it is a distinct processing system with its own set of commands for processing data.

Resources

Software resources and links for researchers and data analysts working with big data

The UK Data Service has produced a number of guides for researchers and analysts who want to run large-scale data analytics across clustered computers, i.e. many computers working together.

Below, we have listed links to these guides, which are in PDF format, together with a brief description of each document.

Installing Spark on a Windows PC

Apache Spark is an open source parallel processing framework that enables users to run large-scale data analytics applications across clustered computers. Installing Spark on a Windows PC guide (PDF).

Obtaining and downloading the HDP Sandbox

Hortonworks is a commercial company which specialises in data platforms based on open source software for big data, in particular Hadoop. HDP is an acronym for the Hortonworks Data Platform, which is an implementation of a Hadoop cluster and a range of associated big data products which run in the Hadoop environment. Obtaining and downloading the HDP Sandbox guide (PDF).

Loading data into HDFS

This short guide provides detailed instructions on how to load a dataset from a PC into a Hadoop system. Loading data into HDFS guide (PDF).