Computational social science

The changing data research landscape

New technologies, resources and methods are constantly changing how researchers interact with and use data. Many innovations, including those in modelling, simulation, big data, web-scraping, social media and more, have already made a huge impact on how researchers can and should access, manage and explore data.

The UK Data Service supports researchers and data analysts by providing learning resources relating to several innovative aspects of data-intensive social science research.

Being a computational social scientist

The growing importance of computing

Scientific research and teaching are increasingly influenced by computational tools, methods and paradigms.

The social sciences are no different, with many new forms of social data only available through computational means. While to some degree social science research has always been marked by technological approaches, the field of computational social science involves the use of tools, data and methods that require a different skillset and mindset.

Webinar: Being a Computational Social Scientist

This webinar examines five key domains of computational social science:

  • thinking computationally
  • writing code
  • computational environments
  • manipulating structured and unstructured data
  • reproducibility of the scientific workflow.

Includes sample Python code for core social science research tasks.

Coding Demonstrations: Text Mining in Python

The UK Data Service held a series of coding demonstrations that introduce core text-mining concepts and demonstrate some basic and advanced methods that can be customised to the needs of individual research projects. Using these techniques, researchers can improve how they work with text data, speed up and simplify their text analysis, and make their research methods more transparently documented and reproducible.


One of the tutorials focuses on setting up a computational environment. You can watch the recordings of four demonstrations.
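As a rough illustration of the kind of basic text-mining step such demonstrations build on, the sketch below tokenises a few short texts and counts word frequencies using only the Python standard library. It is not taken from the demonstrations themselves: the function names, the tiny stop-word list and the example responses are all made up for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; real projects use fuller ones.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are", "for", "on"}


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(documents):
    """Count tokens across all documents, skipping stop words."""
    counts = Counter()
    for doc in documents:
        counts.update(t for t in tokenize(doc) if t not in STOP_WORDS)
    return counts


# Made-up free-text responses, not real survey data.
responses = [
    "The bus service is unreliable and the fares are high.",
    "Fares are too high for students.",
    "Cycling lanes are safer than the bus.",
]

freqs = word_frequencies(responses)
print(freqs.most_common(3))
```

Keeping the tokenising and counting steps in separate functions makes it easier to document each choice (for example, which stop words were removed) when the analysis needs to be reproduced.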

Web-scraping for social science research

What is web-scraping?

Vast swathes of our social interactions and personal behaviours are now conducted online or captured digitally in some other way.

Websites and online databases contain rich information of relevance to social science research, including social media, network platforms, text corpora and more. Web-scraping refers to (semi-) automated computational methods for collecting such data from the web. Understanding and using web-scraping techniques can be an increasingly useful and important component of a social scientist’s toolkit.
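To make the idea concrete, here is a minimal sketch of the extraction step of web-scraping, using Python's built-in html.parser module. The HTML snippet, the file paths in it and the LinkExtractor class are hypothetical; a real scraper would first download the page and would need to adapt the parsing logic to that page's structure.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, link text) pairs from the anchor tags of a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        # Start recording when an <a> tag with an href attribute opens.
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        # Keep text only while we are inside a link.
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        # Store the finished link when the <a> tag closes.
        if tag == "a" and self._current_href is not None:
            text = "".join(self._text_parts).strip()
            self.links.append((self._current_href, text))
            self._current_href = None


# Hypothetical page content standing in for a downloaded web page.
SAMPLE_HTML = """
<html><body>
  <h1>Reports</h1>
  <ul>
    <li><a href="/reports/2019.csv">Annual report 2019</a></li>
    <li><a href="/reports/2020.csv">Annual report 2020</a></li>
  </ul>
</body></html>
"""

parser = LinkExtractor()
parser.feed(SAMPLE_HTML)
for href, text in parser.links:
    print(href, "|", text)
```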

Webinars on web-scraping

These three webinars cover how to collect data from the web using computational methods.

Webinars on Twitter data

Two webinars cover Twitter data.

Programming for social science research

The growing importance of programming

Computational methods for collecting, cleaning and analysing data are an increasingly important component of a social scientist’s toolkit.

Central to engaging in these methods is the ability to write readable and effective code using a programming language.

Materials associated with the Coding Demonstrations training series (GitHub).

The training materials include webinar recordings, slides, and sample Python code for core social science research tasks.

Agent-based modelling

What is agent-based modelling?

Agent-based modelling refers to a class of computational models for simulating the actions and interactions of autonomous agents that can represent individuals or collective entities such as households, organisations or political entities.

Social science seeks to understand and predict patterns involving human behaviour, many of which are large-scale and complex. But social science explanations or predictions can be difficult to test and refine because of the serious ethical and practical barriers to controlling, manipulating and replicating conditions within experiments. For example, there are many theories behind some of the complex patterns of urban mobility, but when traffic calming measures fail to produce the desired results it can be difficult to identify why or how the situation can be improved.

One possible solution is to run social science experiments using computer-simulated actors whose features, behaviours and actions are informed by real world data. This allows social scientists to explore the factors needed to replicate observations, and to test and refine their understanding of how robust those factors might be. Computational social science experiments also allow researchers to explore how emergent patterns might change under experimental, or even counter-factual, conditions.
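The sketch below gives a flavour of what such a simulation can look like in Python. It is a toy bounded-confidence opinion model chosen purely for illustration, not a model from the UK Data Service materials: agents hold an opinion between 0 and 1, and two randomly chosen agents move closer together only if their opinions already differ by less than a tolerance. The class, function names and parameter values are assumptions, and the starting opinions are random rather than informed by real-world data.

```python
import random


class Agent:
    """An individual holding an opinion between 0 and 1."""

    def __init__(self, opinion):
        self.opinion = opinion


def step(agents, rng, tolerance=0.3, convergence=0.25):
    """Pick two agents; if their opinions are close, move them closer."""
    a, b = rng.sample(agents, 2)
    gap = b.opinion - a.opinion
    if abs(gap) < tolerance:
        shift = convergence * gap
        a.opinion += shift
        b.opinion -= shift


def run_model(n_agents=50, n_steps=5000, tolerance=0.3, seed=42):
    """Run the model and return the opinions before and after."""
    rng = random.Random(seed)  # fixed seed so the run can be repeated
    agents = [Agent(rng.random()) for _ in range(n_agents)]
    initial = [agent.opinion for agent in agents]
    for _ in range(n_steps):
        step(agents, rng, tolerance)
    final = [agent.opinion for agent in agents]
    return initial, final


def spread(values):
    """Distance between the most extreme opinions."""
    return max(values) - min(values)


initial, final = run_model()
print(f"spread before: {spread(initial):.2f}, after: {spread(final):.2f}")
```

Re-running run_model with a different tolerance is a simple way to explore a counter-factual condition: no single agent is told to form opinion clusters, yet clusters can emerge from the pairwise rule.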

Webinars on agent-based modelling

In January 2020, Dr Julia Kasmire at the UK Data Service hosted a series of three webinars that introduce users to the concept of agent-based modelling and go on to discuss some of the key techniques involved.

The webinars are published on the UK Data Service YouTube channel. A link to each webinar is listed below with a brief explanation.

Webinar: Introduction to agent-based modelling for social scientists

This webinar introduces the important concepts of emergent patterns, bottom-up processes, and other theoretical ideas underpinning agent-based modelling; presents several examples of agent-based models; discusses the pros and cons of agent-based models; presents several software options for agent-based modelling and where to get more information.


Webinar: Adding real world GIS and census data to agent-based modelling for social scientists

This webinar introduces the important concepts of downloading, cleaning and preparing shapefiles and other data files for importing into an existing agent-based model; presents an extensive exploration of how a commuting model differs when based on random data or imported real world data; discusses some problems and limitations of using real world data in agent-based models; and presents links so that users can access and use the model data presented in the webinar.


Webinar: Conducting experiments, recording output and analysing results of agent-based modelling for social scientists

This webinar covers how to conduct parameter sweeps for model testing in NetLogo; how to automate the process of computational experiments (also in NetLogo); and two different methods of exporting experimental data to saved files for further analysis. It also briefly shows what exported experimental data looks like and how it might be analysed to support experimental conclusions.
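The webinar itself works in NetLogo. As a language-neutral illustration of the same idea, the Python sketch below runs a toy model over every combination of two parameters, repeats each combination with different random seeds, and exports the results as CSV. The toy adoption model, the parameter names and the values swept are all assumptions made for this example.

```python
import csv
import io
import itertools
import random


def run_once(n_agents, adoption_prob, seed, n_steps=20):
    """Toy model: at each step a non-adopter adopts with probability
    adoption_prob multiplied by the current share of adopters."""
    rng = random.Random(seed)
    adopted = [False] * n_agents
    adopted[0] = True  # a single initial adopter
    for _ in range(n_steps):
        share = sum(adopted) / n_agents
        adopted = [a or rng.random() < adoption_prob * share for a in adopted]
    return sum(adopted) / n_agents


def sweep(agent_counts, adoption_probs, repetitions):
    """Run every parameter combination several times, one row per run."""
    rows = []
    for n, p in itertools.product(agent_counts, adoption_probs):
        for rep in range(repetitions):
            rows.append({
                "n_agents": n,
                "adoption_prob": p,
                "repetition": rep,
                "final_share": run_once(n, p, seed=rep),
            })
    return rows


def to_csv(rows, handle):
    """Write the experiment rows to any file-like object as CSV."""
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


rows = sweep(agent_counts=[50, 100], adoption_probs=[0.2, 0.5, 0.8], repetitions=3)
buffer = io.StringIO()  # swap for open("results.csv", "w", newline="") to save a file
to_csv(rows, buffer)
print(buffer.getvalue().splitlines()[0])
```

Recording the seed (here the repetition number) alongside each parameter combination means any single run can be reproduced later from the exported file.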


Managing large-scale data

Getting data, storing data, manipulating data

All projects need data – you can generate it yourself via surveys or you can get some from the UK Data Service.

If you didn’t generate it yourself, the chances are it is not quite what you wanted, but you can adapt it to your needs. You can adapt the data by a variety of means: cleaning the data, extracting parts of the data or joining data from different sources.
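The three kinds of adaptation can be sketched in a few lines of Python. The records, region codes and function names below are invented for illustration and use only built-in data structures; larger datasets would normally call for a dedicated library or a database.

```python
# Made-up records for illustration only.
survey = [
    {"region_code": " r1", "respondent": "1", "income": "21000"},
    {"region_code": "R2", "respondent": "2", "income": ""},
    {"region_code": "R1 ", "respondent": "3", "income": "35000"},
]
regions = [
    {"region_code": "R1", "region_name": "North"},
    {"region_code": "R2", "region_name": "South"},
]


def clean(rows):
    """Cleaning: tidy up codes and convert text fields to numbers."""
    cleaned = []
    for row in rows:
        income = row["income"].strip()
        cleaned.append({
            "region_code": row["region_code"].strip().upper(),
            "respondent": int(row["respondent"]),
            "income": int(income) if income else None,  # None marks missing
        })
    return cleaned


def extract(rows):
    """Extracting: keep only the rows with a recorded income."""
    return [row for row in rows if row["income"] is not None]


def join(rows, lookup_rows, key):
    """Joining: attach fields from a second source via a shared key."""
    lookup = {row[key]: row for row in lookup_rows}
    return [{**row, **lookup.get(row[key], {})} for row in rows]


result = join(extract(clean(survey)), regions, key="region_code")
for row in result:
    print(row)
```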

The UK Data Service organised a series of three webinars hosted by Peter Smyth from the Cathie Marsh Institute, which covered ways of dealing with these data issues. Below is a brief description of each webinar together with a link to the webinar recording and slides.

Webinar: Introduction to SQL and databases

For large datasets, e.g. millions of records, a dedicated database environment is the storage medium of choice. A database environment is designed to store data efficiently while making it straightforward to retrieve the data you want using simple-to-learn SQL queries.
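As a small taste of what an SQL query looks like, the sketch below uses SQLite through Python's built-in sqlite3 module. The table name, columns and rows are invented for illustration, and the webinar may use a different database system; the point is only that one short query filters, groups and summarises the data.

```python
import sqlite3

# An in-memory database; pass a file name instead to keep the data on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE responses ("
    "id INTEGER PRIMARY KEY, region TEXT, age INTEGER, income REAL)"
)

# Made-up rows for illustration only.
conn.executemany(
    "INSERT INTO responses (region, age, income) VALUES (?, ?, ?)",
    [
        ("North", 34, 28000.0),
        ("North", 17, 0.0),
        ("South", 52, 41000.0),
        ("South", 45, 39000.0),
        ("North", 29, 32000.0),
    ],
)

# Count adult respondents and average their income within each region.
query = """
SELECT region, COUNT(*) AS n, AVG(income) AS mean_income
FROM responses
WHERE age >= 18
GROUP BY region
ORDER BY region
"""
results = conn.execute(query).fetchall()
for region, n, mean_income in results:
    print(region, n, mean_income)

conn.close()
```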


Webinar: Getting data from the Internet

This webinar illustrates techniques for automating the download of files from the Internet and for using APIs to download data, and demonstrates how you might store the data.

It also shows how you can systematically identify data on web pages, scrape it and store the data in a dataset.
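A typical API workflow has three parts: build a request URL from query parameters, download and decode the response, and store the records. The Python sketch below illustrates this with the standard library. The endpoint https://api.example.org/v1/observations, its parameters and the shape of the JSON response are hypothetical, so the download function is defined but not called and a canned response stands in for the real one.

```python
import csv
import io
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://api.example.org/v1/observations"  # hypothetical endpoint


def build_url(base_url, **params):
    """Append URL-encoded query parameters to the base address."""
    return f"{base_url}?{urlencode(params)}"


def fetch_json(url, timeout=30):
    """Download and decode a JSON response (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def save_records(records, handle, fields):
    """Store the chosen fields of each record as CSV rows."""
    writer = csv.DictWriter(handle, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)


url = build_url(BASE_URL, area="R1", year=2020)
print(url)

# payload = fetch_json(url)  # the real call; a canned response is used instead
payload = json.loads(
    '{"results": [{"area": "R1", "year": 2020, "value": 12.5, "note": "x"}]}'
)

buffer = io.StringIO()  # swap for open("data.csv", "w", newline="") to save a file
save_records(payload["results"], buffer, fields=["area", "year", "value"])
print(buffer.getvalue())
```

Passing extrasaction="ignore" lets the CSV keep only the fields of interest even when the API returns additional ones.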


Webinar: Power Pivot and dynamic arrays in Excel

Versions of Excel now include Power Pivot – an extension that allows users to load datasets with millions of rows and join datasets together in a more intuitive way. Although integrated into Excel, it is a distinct processing system with its own set of commands for processing data.


Online software for big data

Resources for researchers and data analysts

The UK Data Service has produced a number of guides for researchers and analysts who want to run large-scale data analytics across clustered computers, i.e. many computers working together.

Below, we have listed links to these guides, which are in PDF format, together with a brief description of each document.

Installing Spark on a Windows PC

Apache Spark is an open source parallel processing framework that enables users to run large-scale data analytics applications across clustered computers. Installing Spark on a Windows PC guide (PDF).

Obtaining and downloading the HDP Sandbox

Hortonworks is a commercial company which specialises in data platforms based on open source software for big data, in particular Hadoop. HDP is an acronym for the Hortonworks Data Platform, which is an implementation of a Hadoop cluster and a range of associated big data products which run in the Hadoop environment. Obtaining and downloading the HDP Sandbox guide (PDF).

Loading data into HDFS

This short guide provides detailed instructions on how to load a dataset from a PC into a Hadoop system. Loading data into HDFS guide (PDF).

HiveQL example queries

This workbook contains some practical exercises for researchers and/or data analysts who want to run simple queries using Apache Hive. HiveQL example queries guide (PDF).