GESIS Training Courses

Scientific Coordination

Marlene Mauk
Tel: +49 221 47694-579

Administrative Coordination

Noemi Hartung
Tel: +49 621 1246-211


Big Data and Computation for Social Data Science

Mannheim B6, 4-5
Course duration: 09:00-16:00 CEST
Students: 500 €
Academics: 750 €
Commercial: 1500 €
Lecturer(s): Akitaka Matsuo, David (Yen-Chieh) Liao

Lecturer information - Akitaka Matsuo

Lecturer information - David (Yen-Chieh) Liao


This course is intended for social science researchers and practitioners who wish to gain insight by analyzing large data sets ("big data"). It teaches the infrastructure needed for data manipulation and analysis at scale, and how to use that infrastructure from statistical programming languages.
The amount of data available to social scientists grows every year, and such large data sets have the potential to provide novel insights that were previously out of reach. However, as the volume of data increases, it becomes infeasible to load and process everything on a personal computer. What is needed in such cases are databases for data storage and distributed computing systems for parallel data processing and computation. Learning these technologies is the objective of this course.
With regard to database systems, after learning the basic concepts, participants will learn SQL, the most widely used relational database language, and its management systems. As a more advanced topic, we will survey databases beyond SQL, especially MongoDB, an excellent non-relational option for storing large amounts of unstructured data (e.g., text). For data processing and computation, participants will learn how to parallelize data processing and analysis and how to use distributed computing systems such as Apache Spark.
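As a small taste of the relational side of the course, the sketch below runs a basic SQL query from Python using the standard-library sqlite3 module. The table and column names are invented for illustration; the course itself works with full database management systems from R and Python.

```python
# Minimal sketch: querying a relational database with SQL from Python,
# using the built-in sqlite3 module. Table and column names are made up.
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE respondents (id INTEGER, country TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO respondents VALUES (?, ?, ?)",
    [(1, "DE", 34), (2, "FR", 51), (3, "DE", 29)],
)

# A basic SQL query: filter and aggregate on the database side,
# so only the summary travels back to the client.
rows = conn.execute(
    "SELECT country, COUNT(*), AVG(age) FROM respondents "
    "WHERE age < 50 GROUP BY country ORDER BY country"
).fetchall()
print(rows)  # [('DE', 2, 31.5)] -- the FR row (age 51) is filtered out
```

The point generalizes: pushing filtering and aggregation into the database is what makes working with data too large for memory feasible.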
To learn these technologies, both theory and practice are essential, so the course pairs lectures and labs as a unit. The primary programming language will be R, as it is familiar to most quantitative social scientists, but given the growing importance of Python in social data science, the course will show, where appropriate, how to accomplish the same tasks in Python.
For additional details on the course and a day-to-day schedule, please download the full-length syllabus.


Participants will find the course useful if:
  • they want to work with large datasets and need to perform complex computations and data analysis tasks
  • they are interested in using relational databases with SQL, non-relational (NoSQL) databases, distributed computing systems such as Apache Spark, and cloud computing for their data analysis
  • they have a background in R programming

  • Learning objectives

    By the end of the course, students should have a good understanding of how to work with SQL and NoSQL databases from R, and of how to leverage distributed computing systems like Spark for large-scale data processing. They should also be able to work with databases and compute clusters in the cloud. More concretely, the objectives are:
  • Understanding the basics of SQL and NoSQL databases
  • Writing SQL queries to retrieve data from a database
  • Importing and exporting data from databases using R
  • Working with non-relational databases, such as MongoDB, and understanding their data structures and query languages
  • Understanding the concept of parallel computing and its advantages in data processing and analysis, including faster processing times and increased scalability
  • Working with distributed computing systems such as Apache Spark
  • Using R to perform data manipulation and analysis with the tidyverse packages
  • Learning how to carry out the same processes in Python, thereby understanding the advantages and disadvantages of R and Python in their respective ecosystems
  • Understanding the importance and practice of benchmarking in data processing and analysis
  • Profiling code in R and Python to identify the pieces that cause performance problems
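To illustrate the benchmarking objective, here is a minimal Python sketch using the standard timeit module (in R, packages such as microbenchmark play a similar role; the functions compared here are toy examples, not course material).

```python
# Minimal benchmarking sketch with the standard timeit module:
# compare a hand-written loop against the built-in sum().
import timeit

def loop_sum(xs):
    total = 0
    for x in xs:
        total += x
    return total

data = list(range(10_000))
t_loop = timeit.timeit(lambda: loop_sum(data), number=200)
t_builtin = timeit.timeit(lambda: sum(data), number=200)
print(f"loop: {t_loop:.4f}s  builtin: {t_builtin:.4f}s")
# The built-in is typically several times faster; a profiler (cProfile in
# Python, Rprof in R) would then pinpoint *where* the slow version spends time.
```

Benchmarking tells you which variant is faster; profiling tells you why, which is the workflow the course objectives above describe.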
    Organisational Structure of the Course
    Each day of the course will have two 3-hour units. Each unit will include both lectures and labs.
    In the labs, students will receive exercise problems to work on. The exercises are given primarily in R, and students answer them within the time allotted by the instructor. Students work together to solve the problems on their own, and both instructors will be present in the classroom, so questions can be asked at any time. The instructor will then present the answer and, where possible, demonstrate how to do the same thing in Python.
    The instructors will also hold office hours after class, where students can not only ask questions about the lectures and labs but also consult the instructors on methodological issues in their own research projects.


  • Experience with data analysis using R, including:
  • Manipulation of objects (e.g., scalars, vectors, data frames)
  • Opening/writing data files
  • Running and interpreting basic statistical models (e.g., OLS regressions, logit/probit models)
  • Working with packages
  • Experience in Python is not required but would help in following the Python examples.
  • For those who would like a primer or refresher in R or Python, we recommend taking the online workshop "Introduction to R" that takes place from 05-07 September 2023 or the online workshop “Introduction to Python” that takes place from 04-06 September 2023.
    Software and hardware requirements
    In this course, we will access various cloud data and computational environments, mainly using R and RStudio as well as Python and JupyterLab as a client. Participants should bring their own laptops with the following software installed:
  • R (preferably the latest version; minimum 4.1.0)
  • RStudio (latest)
  • Miniconda (latest)
  • Git environment (for Windows users who do not have a Bash environment)
    R and RStudio should be installed beforehand, and Windows users should also install Git for Windows. For Python, please install Miniconda; building a conda environment will be done in the lab. Detailed instructions for installing packages and additional software (e.g., VS Code, a MongoDB client) will be provided during the lectures and labs.
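For orientation, a conda environment is typically declared in a YAML file like the sketch below. This is purely illustrative — the environment name and package list are assumptions, not the official course environment, which will be built together in the lab.

```yaml
# environment.yml -- illustrative only; the actual environment is built in the lab
name: bigdata-course
channels:
  - conda-forge
dependencies:
  - python=3.11
  - jupyterlab
  - pymongo
  - pyspark
```

Such a file would be activated with `conda env create -f environment.yml` followed by `conda activate bigdata-course`.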
    Monday, 11.09.
    Morning Session
  • Introduction to big data management and computation
  • R: Introduction
  • Python: Introduction
    Afternoon Session
  • R infrastructures
  • Python
  • Anaconda/miniconda and conda environments
  • Lab time
    Tuesday, 12.09.
    Morning Session
  • Introduction to parallel processing (hardware, memory, and performance)
  • Parallel strategies and tools
  • Benchmarking and code optimization
  • Lab time
    Afternoon Session
  • Handling NLP tasks in parallel (part-of-speech tagging and named-entity recognition)
  • Data storage
  • Lab time
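Tuesday's parallelization topics can be previewed with Python's standard multiprocessing module (the course also covers R-side tools; the word-count function here is a toy stand-in for an expensive per-document task such as part-of-speech tagging).

```python
# Minimal data-parallel sketch with the standard multiprocessing module:
# apply an independent per-item task across a pool of worker processes.
from multiprocessing import Pool

def word_count(doc):
    # toy per-document computation; imagine POS tagging or NER here
    return len(doc.split())

if __name__ == "__main__":
    docs = ["big data needs parallel tools", "spark scales out", "hello"]
    with Pool(processes=2) as pool:
        counts = pool.map(word_count, docs)  # work split across workers
    print(counts)  # [5, 3, 1]
```

Because each document is processed independently, this is "embarrassingly parallel" — the pattern that makes per-document NLP tasks a natural fit for parallelization.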
    Wednesday, 13.09.
    Morning Session
  • Introduction to databases and SQL
  • Relational database model
  • Creating and managing databases
  • Basic SQL queries
  • Lab time
    Afternoon Session
  • More on SQL queries
  • How to use dbplyr
  • Lab time
    Thursday, 14.09.
    Morning Session
  • Advanced SQL topics
  • NoSQL databases overview: MongoDB
  • Lab time
    Afternoon Session
  • NoSQL databases and MongoDB basics
  • Schema and relation in MongoDB
  • MongoDB queries
  • Lab time
    Friday, 15.09.
    Morning Session
  • Introduction to distributed computation systems and Apache Spark
  • sparklyr and SparkR
  • Spark data wrangling
  • Lab time
    Afternoon Session
  • Data analysis with Apache Spark
  • PySpark
  • Lab time
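Friday's Spark sessions build on the map/reduce pattern, which can be previewed without a Spark installation. The plain-Python sketch below runs the classic word count locally; in PySpark the same pipeline is roughly `rdd.flatMap(lambda l: l.split()).map(lambda w: (w, 1)).reduceByKey(add)`, with each stage distributed across a cluster.

```python
# Plain-Python sketch of the map/reduce pattern that Spark distributes.
from functools import reduce
from collections import Counter

lines = ["big data big tools", "spark and big data"]

# "flatMap": split every line into words
words = [w for line in lines for w in line.split()]

# "map": pair each word with a count of 1
pairs = [(w, 1) for w in words]

# "reduceByKey": sum counts per word
# (Spark does this per partition first, then merges across the cluster)
def merge(acc, pair):
    w, n = pair
    acc[w] += n
    return acc

counts = reduce(merge, pairs, Counter())
print(counts["big"])  # 3
```

Because every stage operates on independent chunks, Spark can spread the same computation over many machines — which is exactly what makes it suitable for data too large for one computer.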