GESIS Training Courses

Scientific Coordination

Marlene Mauk
Tel: +49 221 47694-579

Administrative Coordination

Loretta Langendörfer M.A.
Tel: +49 221 47694-143

Web Data Collection and Natural Language Processing in Python

Online via Zoom
Students: 400 €
Academics: 600 €
Commercial: 1200 €
Lecturer(s): Indira Sen, Dr. Arnim Bleier, Julian Kohne, Dr. Fabian Flöck

Course description

Data Science is the interdisciplinary science of extracting interpretable and useful knowledge from potentially large datasets. In contrast to empirical social science, data science methods often serve purposes of exploration and inductive inference. In this course, we aim to provide an introduction to tapping into the vast amount of digital behavioral data available on Web platforms and to processing it so that it is useful for social science research.
To this end, participants will first learn how to collect data with Web Application Programming Interfaces (APIs) and Web scraping, employing common Python tools and methods, and how to incorporate the collected data into workable data structures. The APIs covered will likely include those offered by major Web platforms such as Reddit, Wikipedia, and YouTube (list not final).
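As a rough illustration of what "incorporating collected data into workable data structures" can look like, the following minimal sketch flattens a nested JSON response into a list of dictionaries and extracts links from an HTML snippet. It uses only the Python standard library and works on inline sample data rather than a live request. The payload layout, the field names, and the helper names `flatten_posts`, `LinkCollector`, and `extract_links` are assumptions made for this example and do not correspond to any specific platform's API or to the course materials.

```python
import json
from html.parser import HTMLParser

# Hypothetical JSON payload, shaped like a typical Web API response.
# The field names ("data", "children", "stats", ...) are illustrative only.
RAW_RESPONSE = """
{
  "data": {
    "children": [
      {"author": "alice", "title": "First post",
       "stats": {"score": 12, "comments": 3}},
      {"author": "bob", "title": "Second post",
       "stats": {"score": 7, "comments": 0}}
    ]
  }
}
"""


def flatten_posts(raw_json):
    """Turn a nested API response into a flat list of dicts (one per post)."""
    payload = json.loads(raw_json)
    rows = []
    for post in payload["data"]["children"]:
        rows.append({
            "author": post["author"],
            "title": post["title"],
            "score": post["stats"]["score"],
            "n_comments": post["stats"]["comments"],
        })
    return rows


class LinkCollector(HTMLParser):
    """Minimal scraper: collect the href of every <a> tag in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def extract_links(html):
    """Return all link targets found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


# Inline sample page standing in for a downloaded HTML document.
SAMPLE_HTML = (
    '<ul><li><a href="/wiki/Python">Python</a></li>'
    '<li><a href="/wiki/Pandas">Pandas</a></li></ul>'
)

posts = flatten_posts(RAW_RESPONSE)
links = extract_links(SAMPLE_HTML)
print(posts)
print(links)
```

A flat list of dictionaries like `posts` can be passed directly to `pandas.DataFrame(posts)` to obtain a data frame with one row per post.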
Participants will subsequently be introduced to the basics of Natural Language Processing (NLP) for the analysis of these corpora. As much of the work in NLP is based on Machine Learning (ML), we will begin this section with a basic introduction to ML, followed by an introduction to pre-processing, e.g., data cleaning and feature extraction. We will then cover the application of popular NLP toolkits, some based on simple heuristics and dictionaries, others introducing more advanced ML methods.
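To make the pre-processing and dictionary-based steps more concrete, here is a minimal sketch in plain Python: it cleans a text, tokenizes it, builds a bag-of-words feature count, and computes a simple dictionary-based sentiment score. The tiny lexicon, the stopword list, and the function names are assumptions made for illustration. Established toolkits ship far larger resources and more careful tokenizers.

```python
import re
from collections import Counter

# Tiny hypothetical sentiment dictionary; real lexicons are far larger.
SENTIMENT_LEXICON = {"great": 1, "love": 1, "helpful": 1, "bad": -1, "boring": -1}

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "a", "is", "this", "i", "and", "was", "but"}


def preprocess(text):
    """Clean a raw post: lowercase, drop URLs, tokenize, remove stopwords."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # strip URLs
    tokens = re.findall(r"[a-z']+", text)      # keep alphabetic tokens only
    return [tok for tok in tokens if tok not in STOPWORDS]


def bag_of_words(tokens):
    """Feature extraction: count how often each token occurs."""
    return Counter(tokens)


def dictionary_sentiment(tokens):
    """Sum the lexicon values of all tokens; unknown tokens count as 0."""
    return sum(SENTIMENT_LEXICON.get(tok, 0) for tok in tokens)


post = ("I love this course, the notebooks are GREAT "
        "but the intro was boring: https://example.org")
tokens = preprocess(post)
features = bag_of_words(tokens)
score = dictionary_sentiment(tokens)
print(tokens)
print(score)
```

The token counts in `features` are the kind of input a supervised ML classifier would typically be trained on, whereas the dictionary score requires no training data at all.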
All course materials will be provided as Python-based Jupyter Notebooks.
For additional details on the course and a day-to-day schedule, please download the full-length syllabus.

Target group

Participants will find the course useful if:
  • They are interested in obtaining digital behavioral data from Web platforms through different APIs and Web scraping.
  • They need to structure textual user contributions to study social phenomena.
  • They are interested in learning the basics of applying Natural Language Processing methods, including basic Machine Learning applications.

We expect this course to be of interest to participants from a variety of disciplinary backgrounds (e.g., economics, linguistics, sociology, psychology, political science, demography), particularly those who are interested in leveraging novel forms of digital traces for drawing inferences.

Learning objectives

Participants will obtain a working knowledge of how Web and social media data are collected, through a detailed introduction to Web APIs, Web scraping, and the corresponding tools. They will learn about typical data types and structures encountered when dealing with digital behavioral data from the Web, how to apply selected NLP methods and tools in Python to structure natural language texts, and how this approach differs from those typically encountered in survey-based or experimental research. This will enable them to identify benefits and pitfalls of these data types and methods in their field of interest and will thus allow them to select and appropriately apply the covered NLP methods to large datasets in their own research. The knowledge obtained in this course provides a starting point for participants to investigate specialized methods for their individual research projects.

Organisational Structure of the Course

The course is structured around the subthemes of Web data collection, working with digital human traces from the Web, and processing them for analysis. Lectures will be interactive, and the use of Jupyter Notebooks allows participants to reproduce the steps along the research pipeline while we introduce the topics. Each lecture will be a combination of conceptual sections and hands-on programming examples. Additionally, participants will have the opportunity to cement and test their understanding of different concepts in regular exercise and feedback rounds, where instructors provide support, advice, and troubleshooting.


Prerequisites

  • The requirements for attending this course are a functional knowledge of Python and Pandas. We expect participants to have a working knowledge of Python data structures such as lists, dictionaries, and Pandas data frames, and to know how to use these for basic data wrangling and processing. Participants who are unfamiliar with these basic concepts of Python programming are advised to attend the course "Introduction to Computational Social Science with Python" in preparation. Additionally, we will publish a set of external online materials and courses on basic Python and Pandas that participants can use to prepare. The course will include a brief refresher on the basics of Python and Pandas at the beginning; this does not, however, replace a proper introduction to Python.
  • Some previous knowledge of statistics would be beneficial, although not mandatory.
  • Participants should preferably have worked in a Jupyter Notebook environment before. Detailed instructions on how to access the Jupyter Notebook cloud environment will be provided before the start of the course.
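As a rough self-check of the expected Python level, participants should be comfortable reading a snippet like the following minimal sketch, which filters and groups a list of dictionaries. The records and variable names are made up for illustration. In Pandas, the same grouping would typically be written as `df.groupby("platform")["n_words"].mean()`.

```python
from collections import defaultdict

# Hypothetical records, e.g. user comments collected from Web platforms.
comments = [
    {"user": "alice", "platform": "reddit", "n_words": 40},
    {"user": "bob", "platform": "youtube", "n_words": 12},
    {"user": "alice", "platform": "reddit", "n_words": 25},
    {"user": "carol", "platform": "youtube", "n_words": 8},
]

# Filtering with a list comprehension: keep comments with at least 20 words.
long_comments = [c for c in comments if c["n_words"] >= 20]

# Grouping with a dictionary: collect word counts per platform ...
words_by_platform = defaultdict(list)
for c in comments:
    words_by_platform[c["platform"]].append(c["n_words"])

# ... and aggregate each group to its mean.
mean_words = {
    platform: sum(counts) / len(counts)
    for platform, counts in words_by_platform.items()
}

print(len(long_comments))
print(mean_words)
```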

Software requirements

Participants should have Anaconda installed, along with Python 3.7 or later and Jupyter Notebooks. Anaconda is an open data science platform powered by Python, which can be downloaded from the Anaconda website. It comes with many Python libraries and packages pre-installed, as well as Jupyter Notebooks. We will be working with Python 3.7 and will use Jupyter Notebooks for the exercises. While we plan to work in Jupyter Notebooks in our ready-to-go cloud-based environment, a local installation of Anaconda is needed as a fallback in the rare event that this service is down.