Computational social science
Computational social science: Reflecting the changing data research landscape
New technologies, resources and methods are constantly changing how researchers interact with and use data. Many innovations, including those in modelling, simulation, big data, web-scraping, social media and more, have already made a huge impact on how researchers can and should access, manage, and explore data. The social sciences are no different, with many new forms of social data or methods of analysis that are more computationally intensive than most social scientists may be used to. The use of these computational tools, data and methods requires new skills as well as some new perspectives and expectations. Computational social science as a discipline applies these new tools and perspectives to data analysis.
The UK Data Service supports researchers and data analysts by providing free learning resources relating to several innovative aspects of data-intensive social science research.
Administrative data
The UK Data Service provides access to the country’s largest collection of economic, population and social research data, including administrative datasets. Amongst this collection, the Service offers secure access to newly linked datasets from the Centre for Longitudinal Studies (CLS) and NHS Digital, the National Pupil Database, the National Referral Mechanism and Duty to Notify Statistics, UCAS admissions service data and Ofsted data.
ADR UK offers training resources for researchers learning to use administrative data.
Computational social science – an introductory workshop
Taking place every spring and autumn, this workshop examines what computational social science is, what researchers can do to become computational social scientists, and our recommended 8-step process behind computational social science research projects.
Resources
Includes recordings, slide decks and GitHub repositories with interactive code notebooks.
- GitHub repository for Becoming a Computational Social Scientist workshop (a precursor to the current workshop).
- Recording: Computational Social Science – an introductory workshop.
- Workshop slide deck: Computational Social Science – an introductory workshop (PDF).
- Computational Social Science YouTube playlist.
Text-mining
Text and other semi-unstructured data are becoming more readily available and more intriguing to social scientists. This training series introduces core text-mining concepts and demonstrates some methods that researchers can learn for working with text data.
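As a rough illustration of one core text-mining step, the sketch below tokenises a few short documents and counts word frequencies using only the Python standard library. It is not taken from the training materials: the example sentences, the tiny stop-word list and the function names are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop-word list, kept tiny for illustration only.
STOP_WORDS = {"the", "a", "and", "of", "to", "in", "is", "are"}


def tokenise(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(documents):
    """Count non-stop-word tokens across a list of documents."""
    counts = Counter()
    for doc in documents:
        counts.update(t for t in tokenise(doc) if t not in STOP_WORDS)
    return counts


# Hypothetical example documents.
docs = [
    "The survey data are open to researchers.",
    "Researchers analyse survey data and text data.",
]
freqs = word_frequencies(docs)
print(freqs.most_common(3))
```

Real projects would normally use a fuller stop-word list and a dedicated natural language processing library, but the token-then-count pattern is the same.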
Resources
Our text-mining training series goes into different levels of detail, demonstrating different programming languages, and using different data sets. The resources here link to the recordings, slide decks and GitHub repositories (with interactive code notebooks) for all versions that are currently available.
- GitHub repository for Text-mining for Social Science (1st edition covering basic natural language processing, sentiment analysis, automatic classifiers, named entity recognition and social network extraction in python).
- GitHub repository for Text-mining for Digital Health (2nd edition covering basic natural language processing, named entity recognition, and automatic classifiers in python and R).
- Text-mining YouTube playlist (includes both editions).
- GitHub repository for Person-first / Identity-first language project (a research project in which conference abstracts are scraped, processed and analysed to show how “autistic people” and “people with autism” are used differently in scientific publications).
- Workshop slide decks (available within the relevant GitHub repositories).
Web-scraping and APIs for social science research
Discover how to acquire and use data through web-scraping tools and application programming interfaces (APIs). The resources here link to the recordings, slide decks and GitHub repositories (with interactive code notebooks) for all versions that are currently available.
- GitHub repository for Web-scraping for Social Science Research (first edition, covering web-scraping demonstrations and APIs as a source of data and a case study showing how scraping charity data from webpages contributed to a social science research project).
- GitHub repository for Working with Twitter Data workshop (a demonstration of how to acquire and work with data from the Twitter API).
- Recording: Web-scraping for Google Chrome (demonstrating how a Google Chrome extension can allow researchers to scrape webpages without using python or other programming languages).
- Web-scraping and API YouTube playlist (includes recordings of various webinars, workshops and code demonstrations showing how to use multiple web-scraping tools and APIs as well as how to work with the data produced from web-scraping tools and APIs).
- Workshop slide decks (available within the relevant GitHub repositories).
Twitter data for social scientists
Although “Working with Twitter” is included in the web-scraping and APIs for social science research materials, there is more to learn about the specific techniques for acquiring and using data from the Twitter API. The resources here link to the recordings, slide decks and GitHub repositories (with interactive code notebooks) for all versions that are currently available.
- GitHub repository for Working with Twitter Data workshop (a demonstration of how to acquire and work with data from the Twitter API).
- Recordings of the webinars for The Twitter Timeline by Peter Smyth and Working with Twitter Data by Joseph Allen.
- Twitter data for Social Scientists YouTube playlist (covering additional recordings of webinars, workshops and code demonstrations).
- Slide decks for The Twitter Timeline by Peter Smyth (PDF) and Working with Twitter Data by Joe Allen.
Machine learning for social scientists
An increasingly popular way to deal with large volume, high complexity or rapidly changing data is to use machine learning algorithms. The resources here link to the recordings, slide decks and GitHub repositories (with interactive code notebooks) for all the various workshops, webinars and code demonstrations that are currently available, as well as to a machine learning research project, which serves as a case study and step-by-step example.
- GitHub repository for Machine Learning (introducing clustering methods with code notebooks for both python and R).
- Machine learning YouTube playlist (recordings of various webinars, workshops and code demonstrations for both python and R).
- GitHub repository for Arrow of Time project (a research project exploring how non-time series machine learning algorithms are still influenced by time series data).
- Workshop slide decks (available within the relevant GitHub repositories).
Agent-based modelling
Agent-based modelling (ABM) refers to a class of computational models for simulating the actions and interactions of autonomous agents that can represent individuals or collective entities such as households, organisations or political entities. ABM allows researchers to explore how emergent patterns might change under experimental, or even counter-factual, conditions in rapid and cost-effective simulations.
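To make the idea concrete, here is a minimal sketch of an agent-based model in plain Python. It is not one of the models from the training series: the ring layout, the adoption rule and every parameter name (`n_agents`, `adopt_prob` and so on) are assumptions chosen for illustration. Each agent follows one local rule, and the overall adoption curve emerges from those interactions.

```python
import random


def run_adoption_model(n_agents=50, n_steps=20, adopt_prob=0.3, seed=1):
    """Simulate a behaviour spreading between neighbouring agents on a ring.

    Returns the number of adopters after each step, starting with step 0.
    """
    rng = random.Random(seed)  # seeded so that runs are repeatable
    adopted = [False] * n_agents
    adopted[0] = True  # a single initial adopter
    history = [sum(adopted)]
    for _ in range(n_steps):
        new_state = adopted[:]  # agents update simultaneously
        for i in range(n_agents):
            if adopted[i]:
                continue
            left = adopted[(i - 1) % n_agents]
            right = adopted[(i + 1) % n_agents]
            # Local rule: copy an adopting neighbour with probability adopt_prob.
            if (left or right) and rng.random() < adopt_prob:
                new_state[i] = True
        adopted = new_state
        history.append(sum(adopted))
    return history


# Compare a baseline scenario with a hypothetical counter-factual one.
baseline = run_adoption_model(adopt_prob=0.3)
counterfactual = run_adoption_model(adopt_prob=0.6)
print(baseline[-1], counterfactual[-1])
```

Changing a single parameter and re-running the simulation is the kind of quick experiment the paragraph above describes. Dedicated tools such as NetLogo add visualisation and richer agent behaviour on top of the same basic loop.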
Resources
The ABM training series has been run multiple times, with each edition featuring different examples, answering different questions, and demonstrating different points. The resources here link to the recordings, slide decks and GitHub repositories for all versions that are currently available.
- GitHub repository for agent-based modelling (covering basic concepts, adding real-world data, experimentation and a guest researcher presenting their published ABM work).
- ABM YouTube playlist (recordings of various webinars, workshops, code demonstrations and guest presentations).
- Download free NetLogo ABM software.
- Download Tram Commute model shown in webinars.
- Workshop slide decks (available within the relevant GitHub repositories).
Synthetic data
There are many kinds of synthetic data and in turn synthetic data is put to many kinds of uses. This training series explores the advantages and disadvantages of synthetic data, what it is good for, and how it is generated.
Resources
The resources within the synthetic data training series feature different foci, examples, questions, and issues. The resources here link to the recordings, slide decks and GitHub repositories for all versions that are currently available.
- GitHub repository for synthetic data (covering basic concepts, adding real-world data, experimentation and a guest researcher presenting their published ABM work).
- Synthetic Data YouTube playlist (recordings of various webinars, workshops, code demonstrations and guest presentations).
- Workshop slide decks (available within the relevant GitHub repositories).
Mapping crime data in R
This training series provides an overview of available crime data from the UK Data Service, UK police API, and “crimedata” R package, plus methods to access, analyse, and present results. It is aimed at students, researchers or anyone interested in crime data, from beginners with basic knowledge of quantitative analysis and R, through to intermediate users looking for inspiration and new resources.
Resources
The mapping crime data in R training series has been run multiple times, with each edition featuring different foci, examples, questions, and issues. In particular, the 2021 edition is a 3-day interactive workshop with in-depth detail on many relevant aspects of R, mapping, interactive visualisations and crime data. The resources here link to the recordings, slide decks and GitHub repositories for all versions that are currently available.
- GitHub repository for Mapping Crime Data in R (includes folders for each time the training series was run).
- Mapping Crime Data in R YouTube playlist (recordings of various webinars, workshops, code demonstrations and guest presentations).
- Workshop slide decks (available within the relevant GitHub repositories).
Time series and forecasting
Time series analysis and forecasting are among the most common quantitative techniques employed by businesses and researchers. Frequently used in big data, these methods are used to identify trends, clean data, and even predict the future. This workshop begins by exploring the underlying concepts and components of time series analysis, then moves on to a code demonstration that uses open-sourced police recorded crime statistics to visually explore these components.
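As a small illustration of separating a series into components, the sketch below estimates a trend with a centred moving average and subtracts it to leave the repeating pattern. The monthly counts are made up for this example and are not police recorded crime figures, and the function name is hypothetical.

```python
def moving_average(values, window):
    """Centred moving average; returns None where the window does not fit."""
    if window % 2 == 0:
        raise ValueError("use an odd window so the average is centred")
    half = window // 2
    out = []
    for i in range(len(values)):
        if i < half or i >= len(values) - half:
            out.append(None)  # not enough neighbours at the edges
        else:
            chunk = values[i - half : i + half + 1]
            out.append(sum(chunk) / window)
    return out


# Hypothetical monthly counts: an upward trend plus a repeating 3-month pattern.
counts = [10, 14, 9, 13, 17, 12, 16, 20, 15]
trend = moving_average(counts, window=3)

# Subtracting the trend leaves the seasonal pattern (plus any noise).
detrended = [c - t for c, t in zip(counts, trend) if t is not None]
print(trend)
print(detrended)
```

Choosing a window equal to the length of the seasonal cycle averages that cycle out, which is why the trend estimate here is smooth.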
Resources
The resources here link to the recordings, slide decks and GitHub repositories (with interactive code notebooks) for all versions that are currently available.
- GitHub repository for Time Series and Forecasting.
- Time Series and Forecasting YouTube playlist (includes the An introduction to time series analysis and forecasting workshop and other videos introducing time series data held by the Service).
- Workshop slide decks (available within the relevant GitHub repositories).
Computational methods for processing and working with data
All projects need data – you can generate it yourself via surveys or you can access secondary data from the UK Data Service.
With secondary data, you may need to adapt it to the needs of your research project. You can adapt the data by a variety of means: cleaning the data, extracting parts of the data or joining data from different sources.
Webinar: Data pre-processing
Data pre-processing is a data mining technique that involves transforming raw data into an understandable format. With the increasing amount of data available for research and analysis, real-world data is often incomplete or inconsistent and therefore not ready for immediate use. Common problems include multiple spreadsheets, missing values, typos, numbers shown as text and unnecessary columns. Data without adequate preparation will deliver poor or misleading findings.
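The sketch below shows how a few of these problems might be handled with Python's standard `csv` module. The raw extract, the column names and the `clean_rows` function are all hypothetical, and the cleaning rules are only one reasonable choice.

```python
import csv
import io

# Hypothetical raw extract showing several common data problems.
RAW = """id,region,income,notes
1, North ,"21,500",ok
2,north,,check
3,South,18000,ok
"""


def clean_rows(raw_text, drop_columns=("notes",)):
    """Return a list of cleaned row dictionaries from CSV text."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        for col in drop_columns:
            row.pop(col, None)  # remove unnecessary columns
        # Normalise inconsistent spacing and capitalisation.
        row["region"] = row["region"].strip().title()
        # Convert numbers stored as text; keep missing values as None.
        income = row["income"].replace(",", "").strip()
        row["income"] = float(income) if income else None
        row["id"] = int(row["id"])
        cleaned.append(row)
    return cleaned


rows = clean_rows(RAW)
for row in rows:
    print(row)
```

Keeping missing values as `None`, rather than replacing them with zero, avoids silently distorting later averages.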
Resources
- GitHub repository for data pre-processing.
- Computational methods for working with data YouTube playlist (recordings of webinars, workshops, and code demonstrations).
- Workshop slide decks (available within the relevant GitHub repositories).
Webinar: Introduction to SQL and databases
For large datasets, e.g. millions of records, a dedicated database environment is the storage medium of choice. A database environment is designed to store data efficiently while at the same time making the retrieval of the data you want straightforward using simple-to-learn SQL queries.
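As a minimal sketch of what such a query looks like, the example below uses Python's built-in `sqlite3` module with an in-memory database. The `respondents` table and its contents are invented for illustration and do not come from the webinar.

```python
import sqlite3

# An in-memory SQLite database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE respondents (id INTEGER PRIMARY KEY, region TEXT, age INTEGER)"
)
# Hypothetical records; a real table could hold millions of rows.
conn.executemany(
    "INSERT INTO respondents (region, age) VALUES (?, ?)",
    [("North", 34), ("North", 51), ("South", 28), ("South", 45), ("South", 62)],
)

# Filter, group and summarise in a single declarative statement.
query = """
    SELECT region, COUNT(*) AS n, AVG(age) AS mean_age
    FROM respondents
    WHERE age >= 30
    GROUP BY region
    ORDER BY region
"""
results = conn.execute(query).fetchall()
for region, n, mean_age in results:
    print(region, n, mean_age)
```

The query states what is wanted (respondents aged 30 or over, summarised by region) and leaves the database to work out how to retrieve it.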
Resources
- Webinar recording: Introduction to SQL and Databases.
- Presentation slides: Introduction to SQL and Databases, Peter Smyth (PDF).
- Introduction to SQL Queries (Text).
- Introduction to SQL Queries (Python) (Text).
Webinar: Power Pivot and dynamic arrays in Excel
Versions of Excel now include Power Pivot – an extension that allows users to load datasets with millions of rows and join datasets together in a more intuitive way. Although integrated into Excel, it is a distinct processing system with its own set of commands for processing data.
Resources
Software resources and links for researchers and data analysts working with big data
The UK Data Service has produced a number of guides for researchers and analysts who want to run large-scale data analytics across clustered computers, i.e. many computers working together.
Below, we have listed links to these guides, which are in PDF format, together with a brief description of each document.
Installing Spark on a Windows PC
Apache Spark is an open source parallel processing framework that enables users to run large-scale data analytics applications across clustered computers. Installing Spark on a Windows PC guide (PDF).
Obtaining and downloading the HDP Sandbox
Hortonworks is a commercial company which specialises in data platforms based on open source software for big data, in particular Hadoop. HDP is an acronym for the Hortonworks Data Platform, which is an implementation of a Hadoop cluster and a range of associated big data products which run in the Hadoop environment. Obtaining and downloading the HDP Sandbox guide (PDF).
Loading data into HDFS
This short guide provides detailed instructions on how to load a dataset from a PC into a Hadoop system. Loading data into HDFS guide (PDF).
Social network analysis
Analysing interactions through social network analysis is not just for social media research! Any kind of interaction between entities (people, businesses, countries, etc.) can be explored through social network graphs to discover the properties of that network.
This training series introduces core social network analysis concepts and demonstrates some methods that researchers can learn for working with interaction data.
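As a small illustration, the sketch below builds an undirected network from a list of interactions and computes two basic properties: each node's degree and the density of the whole network. The edge list is invented, the function names are hypothetical, and the code assumes there are no self-loops.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected adjacency map from (source, target) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def degree(graph):
    """Number of distinct neighbours for each node."""
    return {node: len(nbrs) for node, nbrs in graph.items()}


def density(graph):
    """Share of possible ties that are present (assumes no self-loops)."""
    n = len(graph)
    if n < 2:
        return 0.0
    n_edges = sum(len(nbrs) for nbrs in graph.values()) / 2
    return n_edges / (n * (n - 1) / 2)


# Hypothetical interactions between countries.
edges = [
    ("UK", "France"),
    ("UK", "Germany"),
    ("France", "Germany"),
    ("Germany", "Poland"),
]
graph = build_graph(edges)
print(degree(graph))
print(round(density(graph), 3))
```

The same structure works for any kind of entity, whether people, businesses or countries, because only the pairs of interacting nodes matter.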
Resources
The resources here link to the recordings, slide decks and GitHub repositories (with interactive code notebooks) for all versions that are currently available.