Scientific Coordination
Dr. Marlene Mauk
Tel: +49 221 47694-579
Administrative Coordination
Noemi Hartung
Tel: +49 621 1246-211
Automated Web Data Collection with Python
About
Location:
Mannheim B6, 4-5
Course duration:
10:00-17:00 CEST
Fees:
Students: 500 €
Academics: 750 €
Commercial: 1500 €
Lecturer(s): Felix Soldner, Jun Sun, Leon Fröhling
Course description
The continuously growing importance of the internet for everyday life and the correspondingly increasing volume of digital behavioral data on the web allow us to study human behavior from new perspectives. However, accessing or collecting such data is not always straightforward. Moreover, the heterogeneity of collected data poses the challenge of data pre-processing, ensuring that they can be effectively used in further analyses. Thus, this course aims to introduce participants to data collection from online platforms and the pre-processing necessary to make it usable for their research. Apart from these essential, technical foundations, we will also discuss basic methods to enrich raw, textual data with additional features. Lastly, we will present participants with a framework for the critical reflection on their data collection processes and documentation of their data.
This course will show and teach participants how content, comment, and interaction data can be automatically collected from social media platforms (e.g., YouTube, Reddit) or other online platforms (e.g., eBay, Amazon). We will cover the main aspects of collecting data using the programming language Python, including APIs and custom scrapers for static and dynamic web pages. We will also show how collected data can be cleaned, pre-processed, and curated to enable further statistical analyses.
The course will include lectures on each topic, introducing the basic theoretical concepts necessary for understanding the practical implementations, which are then presented in live-coding sessions and practiced during exercises. The exercises will be conducted individually or in small groups and assisted by the instructors, who help with questions and problems. In mini-projects, participants have the chance to discuss how they can apply and integrate the newly learned methods within their research or personal projects.
For additional details on the course and a day-to-day schedule, please download the full-length syllabus.
Target group
Participants will find the course useful if:
Learning objectives
By the end of the course participants will:
Organisational Structure of the Course
The course will be structured around lectures, in which we explain the material and methods, and exercises, in which participants can explore and practice what they have learned. Lectures are scheduled in the morning and exercises in the afternoon, separated by a lunch break. The morning and afternoon sessions will each include short coffee breaks.
Exercises will be made up of small coding assignments, prepared by the instructors in advance and designed to be solved by participants directly in Notebooks using Google Colab, and “mini-projects” in which participants can apply the newly learned methods on their own projects. Participants can work alone or in small groups during that time. Throughout the exercises, instructors will support participants in their individual or group work (e.g., with conceptual or coding issues). After coding assignments, instructors will provide walk-through solutions.
Prerequisites
Software and hardware requirements
Participants should bring their own laptop for use in the course.
The course will use Google Colab (https://colab.research.google.com/), so there is no need to have Python installed on your machine. However, you will need a Google account and an up-to-date version of Google's Chrome web browser.
Agenda
Monday, 18.09.
Morning Session: We will start with a short discussion about expectations for the course and how participants envision using web scraping in their work. We then follow up with an introduction to web scraping and its basic concepts, such as APIs and custom scrapers.
Afternoon Session: After the lunch break, we will give an interactive introduction to the basic commands of the Reddit API and to working with JSON files (a common output format of APIs). Participants will have time to practice using the Reddit API, including how to save and work with the Reddit data. Lastly, we will guide participants through the process of obtaining the individual YouTube API keys needed for the following sessions.
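To give a flavor of the JSON handling covered in this session, the sketch below parses a trimmed, hypothetical response in the nested shape the Reddit API uses (real responses contain many more fields) and flattens it into records suitable for analysis:

```python
import json

# Trimmed, hypothetical example of the nested JSON structure the Reddit API
# returns; real responses contain many more fields per post.
raw = """
{
  "data": {
    "children": [
      {"data": {"title": "First post", "score": 42, "num_comments": 7}},
      {"data": {"title": "Second post", "score": 17, "num_comments": 3}}
    ]
  }
}
"""

response = json.loads(raw)

# Flatten the nested structure into a list of simple records.
posts = [
    {"title": c["data"]["title"], "score": c["data"]["score"]}
    for c in response["data"]["children"]
]
print(posts)
```

The same pattern (load, then walk the nested keys) applies to saved JSON files via `json.load`.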
Tuesday, 19.09.
Morning Session: In the two morning lecture sessions, we will show how to query Reddit data using the Pushshift API and its wrapper package PSAW.
Afternoon Session: In the afternoon lecture session, we will show how to use APIs that require credentials, using the YouTube Data API as an example. All lecture sessions are followed by exercises.
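As a minimal sketch of what a credentialed API call looks like, the snippet below assembles (but does not send) a request URL for the YouTube Data API v3 `commentThreads` endpoint; `YOUR_API_KEY` and the video id are placeholders you would replace with your own values:

```python
from urllib.parse import urlencode

# Placeholder credential -- obtain a real key via the Google Cloud console,
# as covered in Monday's session.
API_KEY = "YOUR_API_KEY"

BASE_URL = "https://www.googleapis.com/youtube/v3/commentThreads"
params = {
    "part": "snippet",         # which parts of each resource to return
    "videoId": "dQw4w9WgXcQ",  # example video id (placeholder)
    "maxResults": 50,
    "key": API_KEY,            # the credential travels as a query parameter
}
url = f"{BASE_URL}?{urlencode(params)}"
print(url)

# Sending the request (e.g. with requests.get(url)) returns JSON that can be
# parsed just like the Reddit responses from Monday.
```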
Wednesday, 20.09.
Morning Session: In the first morning lecture, we will cover general knowledge of HTTP requests, HTML, and CSS. In the exercise that follows, we will then cover how to systematically extract web data from HTML pages with BeautifulSoup.
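The extraction pattern taught here can be sketched as follows. To keep the example self-contained, it parses an inline HTML snippet (with made-up product markup) instead of a live page:

```python
from bs4 import BeautifulSoup

# Inline HTML stand-in for a downloaded page, so the example runs offline.
html = """
<html><body>
  <div class="product"><h2>Book A</h2><p class="price">12.99</p></div>
  <div class="product"><h2>Book B</h2><p class="price">8.50</p></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS selectors pick out repeating page elements; get_text() strips the tags.
items = [
    (div.h2.get_text(), float(div.select_one("p.price").get_text()))
    for div in soup.select("div.product")
]
print(items)
```

On a real page, the `html` string would come from an HTTP request, and the selectors would be chosen by inspecting the page's markup in the browser.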
Afternoon Session: In the afternoon lecture session, we will cover how to use requests, regular expressions, and selectorlib to scrape static web pages. In the exercise that follows, participants will practice systematically extracting data from HTML pages with these tools.
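As a quick illustration of the regular-expression approach from this session (regex treats HTML as plain text, which is fragile on messy real-world pages but handy for quick extractions), consider:

```python
import re

# Inline HTML stand-in for a fetched page.
html = (
    '<div class="product"><h2>Book A</h2></div>'
    '<div class="product"><h2>Book B</h2></div>'
)

# Non-greedy capture of everything between each pair of <h2> tags.
titles = re.findall(r"<h2>(.*?)</h2>", html)
print(titles)  # ['Book A', 'Book B']
```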
Thursday, 21.09.
Morning Session: In the morning, we will discuss what dynamic web pages are and how to obtain information from them using the Python package Selenium. Participants will learn how to navigate (click, scroll, etc.) through web pages and retrieve information automatically.
Afternoon Session: In the afternoon, participants will have time to practice the presented skills with prepared exercises on a training website for scraping (books.toscrape.com) and Quora. We will finish with a short overview of best practices for web scraping, common scraping problems, and how to overcome them.
Friday, 22.09.
Morning Session:
Afternoon Session: In the second lecture of the day, we will demonstrate the importance of comprehensively documenting web-data datasets, especially if used to do research on human behavior and interactions. We will introduce different approaches for the documentation of datasets and for the critical reflection on potential sources of bias and error in the data collection process. In the exercise, we will use these frameworks to examine the data that we collected during the previous days for systematic errors and document one of the collection processes as well as the resulting dataset.
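A minimal, machine-readable form of such documentation can be sketched as a "datasheet" stored alongside the data itself. The field names and values below are our own hypothetical choices, loosely inspired by the datasheets-for-datasets idea:

```python
import json

# Hypothetical minimal datasheet for a scraped dataset; all field names
# and values here are illustrative, not a prescribed standard.
datasheet = {
    "name": "youtube_comments_sample",
    "collected_on": "2023-09-22",  # assumed example date
    "source": "YouTube Data API v3, commentThreads endpoint",
    "collection_method": "API query by video id",
    "known_limitations": [
        "Only comments public at collection time are included.",
        "API quotas may truncate long comment threads.",
    ],
}

# Serialize the documentation so it can be shipped with the dataset.
doc = json.dumps(datasheet, indent=2)
print(doc)
```

Writing the datasheet at collection time, rather than reconstructing it later, is what makes the critical reflection on bias and error tractable.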