GESIS Training Courses

Scientific Coordination

Verena Kunz

Administrative Coordination

Claudia O'Donovan-Bellante
Tel: +49 621 1246-221

Automatic Sampling and Analysis of YouTube Data

Online via Zoom
Students: 200 €
Academics: 300 €
Commercial: 600 €
Lecturer(s): Annika Deubel, Johannes Breuer, Rohangis Mohseni

Course description

YouTube is the largest and most popular video platform on the internet. The producers and users of YouTube content generate huge amounts of data. These data are also of interest to researchers (in the social sciences as well as other disciplines) for studying different aspects of online media use and communication. Accessing and working with these data, however, can be challenging. In this workshop, we will first discuss the potential of YouTube data for research in the social sciences, and then introduce participants to the YouTube API as well as different tools for collecting YouTube data. Our focus for the main part of the workshop will be on using R (with various packages) for collecting, processing, and analyzing data from YouTube. Regarding the type of data, we will focus on user comments but will also (briefly) look into other YouTube data, such as video statistics and subtitles. For the comments, we will show how to clean and process them in R, how to deal with emojis, and how to do some basic forms of automated text analysis (e.g., word frequencies, sentiment analysis). While we believe that YouTube data have great potential for research in the social sciences (and other disciplines), we will also discuss the unique challenges and limitations of using these data.

Target group

Participants will find the course useful if:
  • They want to work with YouTube data (esp. user comments) in their research.

Learning objectives

By the end of the course participants will:
  • know different tools and methods for collecting YouTube data,
  • be able to automatically collect YouTube data,
  • be able to process and clean these data,
  • and be able to do some basic (exploratory) analyses of user comments.

Organizational structure of the course

The workshop is structured into segments of instructive lectures and interactive hands-on sessions. The lecturers will be available for support during hands-on segments and can also consult on participants' own (planned) research projects with YouTube data.


Prerequisites

Participants should have at least some basic knowledge of R and, ideally, also of the tidyverse. Basic R knowledge can, for example, be acquired through the swirl (Learn R, in R) course “R Programming” or the RStudio Primer “Programming basics”, both of which are available for free. There are also many brief online introductions to the tidyverse, such as the blog posts by Martin Frigaard or Dominic Royé.

Software requirements

R (at least version 4.0.0), RStudio, and the following R packages: remotes, tidyverse, tuber, vosonSML, quanteda, tm, qdapRegex, syuzhet, lexicon, subtools, stm, youtubecaption (optional)


Recommended readings