GESIS Training Courses

Scientific Coordination

Verena Kunz

Administrative Coordination

Noemi Hartung
Tel: +49 621 1246-211

Synthetic Data: Generation and Evaluation

Online via Zoom
Software used: R
Students: 200 €
Academics: 300 €
Commercial: 600 €
Lecturer(s): Thom Volker

Course description

In the current age of open science, sharing research code and data is often required when publishing a scientific paper. Moreover, openly disseminated research data is a potential gold mine for answering many research questions. However, privacy and confidentiality constraints often impede the open dissemination of research data. Synthetic data can be an excellent solution to this problem: the real data is kept secret, but a "fake" version of the data is made available. This synthetic dataset can serve many purposes. For example, it allows those in the process of obtaining access to the real dataset to get familiar with the structure of the data, and it allows reviewers (or other researchers) to rerun scripts and assess whether the original analysis code is reproducible and runs as intended. Additionally, the synthetic data itself can be used to run completely different analyses, unrelated to the original research problem. In this course, you will learn what synthetic data is, how to generate it, how to evaluate its quality in terms of utility and remaining privacy risks, and how to obtain statistically valid results from analyses of this data.
Over three half days, we will cover the origins of synthetic data (including its relation to multiple imputation of missing data), practice generating our own synthetic version of a realistic scientific dataset, and evaluate its quality and disclosure risks. We will discuss how to make inferences from synthetic data and work on improving the quality of the synthetic data, either through advanced modeling or by solving practical problems that arise when working with complex data structures (for example, how to deal with deterministic systems/composite variables or logical constraints). On the final day, there will be room for individual consultation.
The course has a hands-on format, with more time scheduled for practicals (plus discussion) than for lectures (approximately a 60/40 division). In principle, (social) scientific datasets are provided for all practicals, but participants can also bring their own data. Bringing your own data might not be ideal if (1) the data is so privacy-sensitive that the instructors cannot look at it, or (2) the dataset is so large that running code takes too long. All practical exercises are in R, but only a little programming experience is required (a recent 'introduction to R' course or some working experience with R or another scripting language suffices). A good understanding of basic statistics (e.g., working experience with regression analysis) will definitely be beneficial.

Organizational structure of the course

Each day will consist of two two-hour blocks, each containing a live lecture of approximately 45 minutes, followed by a hands-on practical (which can be completed individually or in small groups) and discussion of approximately 60 minutes. The lecturer will be available for questions during the practicals. On the last day, there will be some time for individual consultation (but project-related questions can also be asked during breaks or before/after class).

Target group

Participants will find the course useful if:
  • they want to understand the idea of synthetic data
  • they want to be able to generate high-quality synthetic data
  • they want to evaluate utility and disclosure risks of generated data
  • they want to share a secure version of their privacy-sensitive data with collaborators and in replication archives
  • they want to adhere to open science principles (including open data) but are restricted by privacy issues

Learning objectives

By the end of the course, participants will:
  • have a good understanding of the concept of synthetic data
  • know the advantages and disadvantages of synthetic data
  • be able to independently generate high-quality synthetic data
  • be able to independently evaluate the quality of synthetic data and the remaining disclosure risks


Prerequisites

  • Experience with R or another scripting/programming language, e.g., a basic understanding of data structures in R (numeric, factor, and character variable types), basic data wrangling, and running regression analyses.
  • Understanding of basic statistics (working experience with regression modeling).

Software requirements

Make sure you have a recent installation of R (and RStudio). Required packages will be announced in due time.


Recommended readings