GESIS Training Courses

Scientific Coordination

Dr. Marlene Mauk
Tel: +49 221 47694-579

Administrative Coordination

Noemi Hartung
Tel: +49 621 1246-211

Automated Web Data Collection with Python

About
Location:
Mannheim B6, 4-5
 
Course duration:
18-22 September 2023, 10:00-17:00 CEST
Fees:
Students: 500 €
Academics: 750 €
Commercial: 1500 €
 
Lecturer(s): Felix Soldner, Jun Sun, Leon Fröhling


Course description

The continuously growing importance of the internet in everyday life, and the correspondingly increasing volume of digital behavioral data on the web, allow us to study human behavior from new perspectives. However, accessing or collecting such data is not always straightforward. Moreover, the heterogeneity of the collected data poses the challenge of pre-processing, i.e., ensuring that the data can be effectively used in further analyses. This course therefore aims to introduce participants to data collection from online platforms and to the pre-processing necessary to make the collected data usable for their research. Beyond these essential technical foundations, we will also discuss basic methods to enrich raw textual data with additional features. Lastly, we will present participants with a framework for critically reflecting on their data collection processes and for documenting their data.
 
This course will teach participants how content, comment, and interaction data can be automatically collected from social media platforms (e.g., YouTube, Reddit) and other online platforms (e.g., eBay, Amazon). We will cover the main aspects of collecting data with the programming language Python, including APIs and custom scrapers for static and dynamic web pages. We will also show how the collected data can be cleaned, pre-processed, and curated to enable further statistical analyses.
 
The course will include lectures on each topic, introducing the basic theoretical concepts necessary for understanding the practical implementations, which are then presented in live-coding sessions and practiced during exercises. The exercises will be conducted individually or in small groups and assisted by the instructors, who will help with questions and problems. In mini-projects, participants will have the chance to discuss how they can apply and integrate the newly learned methods in their own research or personal projects.
 
For additional details on the course and a day-to-day schedule, please download the full-length syllabus.


Target group

Participants will find the course useful if:
  • they are interested in working with web data
  • they want to learn how to collect web data through APIs or web pages
  • they want to learn how to pre-process and augment the collected data for further analyses (basic NLP)
  • they want to learn about frameworks for critically reflecting on web-data collection processes


Learning objectives

By the end of the course, participants will:
  • be able to collect online data with APIs and custom scrapers for static and dynamic websites
  • be able to handle, (pre-)process and augment data for further statistical analyses
  • be able to integrate the learned methods into their research
  • be able to critically inspect and reflect on automatically collected data

Organisational Structure of the Course

The course will be structured around lectures, in which we explain the material and methods, and exercises, in which participants can explore and practice what they have learned. Lectures are scheduled in the morning and exercises in the afternoon, separated by a lunch break. The morning and afternoon sessions will include short coffee breaks.
     
Exercises will consist of small coding assignments, prepared by the instructors in advance and designed to be solved by participants directly in notebooks on Google Colab, and of “mini-projects” in which participants can apply the newly learned methods to their own projects. Participants can work alone or in small groups during that time. Throughout the exercises, instructors will support participants in their individual or group work (e.g., with conceptual or coding issues). After the coding assignments, instructors will provide walk-through solutions.


Prerequisites

  • basic knowledge of the programming language Python, including how to write loops, functions, and if-statements
  • motivation to work with various web data sources
  • willingness to engage in hands-on coding exercises to learn how to collect web data
For those who would like a primer or refresher in Python, we recommend taking the online workshop “Introduction to Python”, which takes place from 04-06 September 2023.


Software and hardware requirements

Participants should bring their own laptop for use in the course.
The course will use Google Colab (https://colab.research.google.com/), so there is no need to have Python installed on your machine. However, you will need a Google account and an up-to-date version of Google's Chrome web browser.
     
Agenda

Monday, 18.09.

Morning Session
We will start with a short discussion about participants' expectations for the course and how they envision using web scraping in their work. We will then follow up with an introduction to web scraping and its basic concepts, such as APIs and custom scrapers.

Afternoon Session
After the lunch break, we will give an interactive introduction to the basic commands of the Reddit API and how to work with JSON files (a common output format of APIs). Participants will have time to practice using the Reddit API, including how to save and work with the Reddit data. Lastly, we will guide participants through the process of obtaining the individual YouTube API keys needed for the following sessions.
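To give a flavor of this exercise, here is a minimal sketch of fetching posts from Reddit's public JSON listing endpoint with the requests library; the subreddit, User-Agent string, and output file name are illustrative assumptions, not course material.

import json
import requests

# Reddit serves most listing pages as JSON when ".json" is appended to the URL.
# A descriptive User-Agent helps avoid the rate limits applied to default clients.
url = "https://www.reddit.com/r/python/top.json"
headers = {"User-Agent": "web-data-course-demo/0.1"}

response = requests.get(url, headers=headers, params={"limit": 5, "t": "week"})
response.raise_for_status()
data = response.json()

# Listings nest posts under data -> children; each child wraps one submission.
for child in data["data"]["children"]:
    post = child["data"]
    print(post["score"], post["title"])

# Store the raw JSON so the collection step stays reproducible.
with open("reddit_top_python.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)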
Tuesday, 19.09.

Morning Session
In the first two lecture sessions of the morning, we will use the Pushshift API for Reddit as an example and show how to query Reddit data with Pushshift and the wrapper package PSAW.
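As a rough sketch of what such a query could look like with PSAW (the subreddit and fields are illustrative assumptions, and Pushshift's availability has varied over time):

from psaw import PushshiftAPI

api = PushshiftAPI()

# search_submissions returns a generator; limit caps how many items are fetched,
# and filter restricts the fields returned for each submission.
submissions = api.search_submissions(
    subreddit="AskScience",
    filter=["author", "title", "created_utc"],
    limit=10,
)

for submission in submissions:
    print(submission.created_utc, submission.title)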

Afternoon Session
In the afternoon lecture session, we will use the YouTube Data API as an example of how to work with APIs that require credentials. All lecture sessions are followed by exercises.
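A minimal sketch of such a credentialed request with the google-api-python-client package, assuming an API key obtained from the Google Cloud Console (the search query and parameters are illustrative):

from googleapiclient.discovery import build

# Placeholder: replace with your own key from the Google Cloud Console.
API_KEY = "YOUR_YOUTUBE_API_KEY"

# Build a client for version 3 of the YouTube Data API.
youtube = build("youtube", "v3", developerKey=API_KEY)

# Search for the five most relevant videos matching a query.
request = youtube.search().list(
    part="snippet",
    q="web scraping",
    type="video",
    maxResults=5,
)
response = request.execute()

for item in response["items"]:
    print(item["id"]["videoId"], item["snippet"]["title"])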
Wednesday, 20.09.

Morning Session
In the first lecture of the morning, we will cover general knowledge of HTTP requests, HTML, and CSS. In the exercise that follows, we will cover how to systematically extract web data from HTML pages with BeautifulSoup.
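As a minimal sketch of this kind of extraction, using the practice site books.toscrape.com that also appears later in the course (the CSS selectors reflect that site's markup and are otherwise assumptions):

import requests
from bs4 import BeautifulSoup

# books.toscrape.com is a sandbox site built specifically for scraping practice.
response = requests.get("https://books.toscrape.com/")
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Each book on the page sits inside an <article class="product_pod"> element.
for article in soup.select("article.product_pod"):
    title = article.h3.a["title"]                     # full title is in the link's title attribute
    price = article.select_one("p.price_color").text  # e.g., "£51.77"
    print(title, price)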

Afternoon Session
In the afternoon lecture session, we will cover how to use requests, regular expressions, and selectorlib to scrape static web pages. In the exercise that follows, we will practice systematically extracting web data from HTML pages with these tools.
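For a taste of selectorlib, here is a hedged sketch that pulls the same book titles via a small YAML specification of CSS selectors (the spec and URL are illustrative assumptions, not the course's exercise material):

import requests
from selectorlib import Extractor

# selectorlib reads the fields to extract from a YAML specification.
yaml_spec = """
titles:
    css: "article.product_pod h3 a"
    multiple: true
    type: Attribute
    attribute: title
"""

extractor = Extractor.from_yaml_string(yaml_spec)

response = requests.get("https://books.toscrape.com/")
data = extractor.extract(response.text)
print(data["titles"])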
Thursday, 21.09.

Morning Session
In the morning, we will discuss what dynamic web pages are and how to obtain information from them using the Python package Selenium. Participants will learn how to navigate (click, scroll, etc.) through web pages and retrieve information automatically.
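A minimal sketch of such navigation with Selenium, again using the practice site from the exercises (the selectors are assumptions based on that site's markup, and a locally available Chrome browser is assumed):

from selenium import webdriver
from selenium.webdriver.common.by import By

# Selenium drives a real browser, so JavaScript-rendered content is accessible.
driver = webdriver.Chrome()
driver.get("https://books.toscrape.com/")

# Click the "next" pagination link, just as a user would.
driver.find_element(By.CSS_SELECTOR, "li.next a").click()

# After navigating, extract the book titles on the second page.
for link in driver.find_elements(By.CSS_SELECTOR, "article.product_pod h3 a"):
    print(link.get_attribute("title"))

driver.quit()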

Afternoon Session
In the afternoon, participants will have time to practice the presented skills with prepared exercises on a training website for scraping (books.toscrape.com) and on Quora. We will finish with a short overview of best practices for web scraping, common scraping problems, and how to overcome them.
Friday, 22.09.

Morning Session
On the last day, we will first learn about the steps necessary to turn the raw, mostly textual data into formats suitable for subsequent analysis, before discussing the importance of properly documenting datasets collected from the web.
In the first lecture of the day, we will introduce the basics of Natural Language Processing (NLP), explain why textual data needs to be preprocessed, and look into the most popular preprocessing methods. In the exercise, we will try out available Python libraries (e.g., spaCy and NLTK) to prepare some of the data collected during the previous days for further analysis.
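As a hedged sketch of a typical preprocessing pipeline with spaCy (the example sentence and model choice are illustrative; the small English model needs to be downloaded once beforehand):

import spacy

# One-time setup outside this script: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = "The scrapers collected 1,000 comments from Reddit and YouTube!"
doc = nlp(text)

# Typical preprocessing: lowercased lemmas, with stop words and punctuation removed.
tokens = [
    token.lemma_.lower()
    for token in doc
    if not token.is_stop and not token.is_punct
]
print(tokens)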

Afternoon Session
In the second lecture of the day, we will demonstrate the importance of comprehensively documenting datasets collected from the web, especially when they are used to study human behavior and interactions. We will introduce different approaches for documenting datasets and for critically reflecting on potential sources of bias and error in the data collection process. In the exercise, we will use these frameworks to examine the data collected during the previous days for systematic errors, and we will document one of the collection processes as well as the resulting dataset.