GESIS Training Courses

Scientific Coordination

Marlene Mauk
Tel: +49 221 47694-579

Administrative Coordination

Claudia O'Donovan-Bellante
Tel: +49 621 1246-221

Automated Web Data Collection with Python

Mannheim B6, 4-5
Students: 500 €
Academics: 750 €
Commercial: 1500 €
Lecturer(s): Felix Soldner, Dr. Jun Sun, Leon Froehling

Course description

The continuously growing importance of the internet for everyday life and the correspondingly increasing volume of digital behavioral data on the Web allow us to study human behavior from new perspectives. However, accessing or collecting such data is not always straightforward. Moreover, the heterogeneity of the collected data makes pre-processing a challenge in its own right, as the data must be prepared before they can be used effectively in further analyses. This course therefore introduces participants to data collection from online platforms and to the pre-processing necessary to make the data usable for their research. Beyond these essential technical foundations, we will also discuss basic methods to enrich raw textual data with additional features. Lastly, we will present a framework for critically reflecting on data collection processes and for documenting the resulting data.
The course teaches participants how content, comment, and interaction data can be collected automatically from social media platforms (e.g., Twitter, YouTube, Reddit) and other online platforms (e.g., eBay, Amazon). We will cover the main aspects of collecting data with the programming language Python, including APIs and custom scrapers for static and dynamic webpages. We will also show how collected data can be cleaned, pre-processed, and curated to enable further statistical analyses.
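As a rough illustration of what a custom scraper for a static webpage involves, the following minimal sketch extracts comments from an HTML fragment using only the Python standard library. It is not taken from the course material: the HTML fragment, its class names, and the names CommentParser and parse_comments are hypothetical, and a real scraper would first download the page and respect the platform's terms of use.

```python
from html.parser import HTMLParser

# Hypothetical static page fragment (assumption); a real scraper would
# download the HTML from a website before parsing it.
SAMPLE_HTML = """
<html><body>
  <div class="comment"><span class="author">alice</span><p>Great   video!</p></div>
  <div class="comment"><span class="author">bob</span><p>Thanks for sharing.</p></div>
</body></html>
"""


class CommentParser(HTMLParser):
    """Collect the author and text of each <div class="comment"> block."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self._field = None  # field that the current text node belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "comment":
            self.comments.append({"author": "", "text": ""})
        elif tag == "span" and attrs.get("class") == "author":
            self._field = "author"
        elif tag == "p" and self.comments:
            self._field = "text"

    def handle_endtag(self, tag):
        if tag in ("span", "p"):
            self._field = None

    def handle_data(self, data):
        if self._field and self.comments:
            self.comments[-1][self._field] += data


def parse_comments(html):
    """Return a list of comment records with whitespace collapsed."""
    parser = CommentParser()
    parser.feed(html)
    return [{k: " ".join(v.split()) for k, v in c.items()} for c in parser.comments]


comments = parse_comments(SAMPLE_HTML)
```

Here comments is a list of dictionaries, one per comment, which can be handed on to the cleaning and analysis steps. Dynamic webpages, which build their content with JavaScript, usually need a different approach than this static example.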
The course will include lectures on each topic, introducing the basic theoretical concepts necessary for understanding the practical implementations, which are then practiced in exercises. The exercises will be conducted in small groups and supported by the instructors, who will help with questions and problems. In mini-projects, participants have the chance to discuss how they can apply and integrate the newly learned methods within their research.
For additional details on the course and a day-to-day schedule, please download the full-length syllabus.

Target group

Participants will find the course useful if:
  • they are interested in working with web data
  • they want to learn how to collect web data through APIs or webpages
  • they want to learn how to pre-process and augment the collected data for further analyses (basic NLP)
  • they want to learn about frameworks for the critical reflection on web-data collection processes

Learning objectives

By the end of the course, participants will:
  • be able to collect online data with APIs and custom scrapers for static and dynamic websites
  • be able to handle, (pre-)process and augment data for further statistical analyses
  • be able to integrate the learned methods into their research
  • be able to critically inspect and reflect on automatically collected data
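
To give a sense of what pre-processing and augmenting collected text can look like, here is a minimal sketch using only Python's re module. It is an illustration under simple assumptions, not the course's own pipeline: the function name preprocess, the regular expressions, and the chosen features (token count, URL count, hashtags) are hypothetical examples.

```python
import re

# Simple patterns (assumptions for illustration; real data may need more care)
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#(\w+)")
TOKEN_RE = re.compile(r"[a-z0-9']+")


def preprocess(text):
    """Clean a raw post and derive a few basic features from it."""
    urls = URL_RE.findall(text)
    hashtags = [h.lower() for h in HASHTAG_RE.findall(text)]
    # Remove URLs and lowercase the text before tokenizing
    cleaned = URL_RE.sub(" ", text).lower()
    tokens = TOKEN_RE.findall(cleaned)
    return {
        "tokens": tokens,
        "n_tokens": len(tokens),
        "n_urls": len(urls),
        "hashtags": hashtags,
    }


record = preprocess("Check this out: https://example.org/post #DataScience is fun!")
```

The resulting dictionary keeps the cleaned tokens next to the derived features, so each post becomes one row of a table for later statistical analysis.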

Organizational structure of the course

The course is structured around lectures, in which we explain the material and methods, and exercises, in which participants can explore and practice what they have learned. Lectures are scheduled in the morning and exercises in the afternoon, separated by a lunch break. Both the morning and afternoon sessions will include short coffee breaks.
Exercises consist of two parts: small coding assignments, prepared by the instructors in advance and designed to be solved directly in notebooks using Google Colab, and “mini-projects” in which participants can apply the newly learned methods to their own projects. Participants can work alone or in small groups during that time. Throughout the exercises, instructors will support participants in their individual or group work (e.g., with conceptual or coding questions).


Prerequisites

  • Basic knowledge of the programming language Python
  • Motivation to work with various web-data sources
  • Willingness to engage in hands-on coding exercises to learn how to collect web-data

Software and hardware requirements

The course will use Google Colab (https://colab.research.google.com/), so there is no need to have Python installed on your machine. However, you will need a Google account and an up-to-date version of Google's Chrome web browser.
Participants should bring their own laptops and have the following ready before the course starts:
  • Access to Google Colab
  • A Google account
  • An up-to-date version of Google Chrome

Recommended related courses

  • Linking Twitter & Survey Data (Workshop, online, 27.06.2022)
  • Python 101 (Workshop, online, 31.08. - 01.09.2022)
  • Introduction to Computational Social Science with Python (Fall Seminar, Mannheim, Week 1)
  • Introduction to Machine Learning for Text Analysis with Python (Fall Seminar, Mannheim, Week 3)
  • Automated Image and Video Data Analysis with Python (Fall Seminar, Mannheim, Week 3)