Scientific Coordination
Dr. Marlene Mauk
Tel: +49 221 47694-579
Administrative Coordination
Noemi Hartung
Tel: +49 621 1246-211
Automated Web Data Collection with R
About
Location: Mannheim B6, 4-5
Course duration: 9:30-16:30 CEST
Software used: R
Fees:
Students: 500 €
Academics: 750 €
Commercial: 1500 €
Lecturer(s): Hauke Licht, Allison Koh
Course description
The increasing availability of large amounts of online data enables new lines of research in the social sciences. Over the past decades, a wide variety of information - whether election results, press releases, parliamentary speeches, or social media content - has become available online. Although such information has become increasingly easy to find, extracting it and reshaping it into data formats ready for downstream analyses can be challenging. This makes web data collection and cleaning skills essential for researchers. The goal of this course is to equip participants to gather online data and process it in R for their own research.
During the course, participants will learn about the characteristics of web data and their use in social science research. The main learning objective is for participants to acquire the skills to collect (“scrape”) content from different types of web pages as well as from application programming interfaces (APIs) such as those hosted by governments, international organizations, and popular newspapers. Beyond this, the course will demonstrate programming strategies for sustainable and robust social media data extraction - a skill that has become all the more important since major social media platforms like Facebook and Twitter have discontinued API access to their data in recent years.
The course is hands-on, with daily lectures followed by exercises in which participants apply and practice these methods in R. While we introduce tools and techniques that help with data collection more generally, the focus will be on three common scenarios: collecting data from static websites, from dynamic websites, and from APIs.
For additional details on the course and a day-to-day schedule, please download the full-length syllabus.
Target group
Participants will find the course useful if:
Participants will be asked to indicate their prior experience with web scraping, their research interests, and potential web scraping-related project ideas in a pre-course survey. Based on this survey, the instructors will try to include examples in the afternoon tutorial sessions that match participants' research interests and project ideas.
Learning objectives
By the end of the course participants will:
Organizational Structure of the Course
The course will be organized as a mixture of lectures and exercise sessions, alternating throughout the morning and afternoon. The lecture sessions focus on explaining core concepts and methods in web scraping; in the exercise sessions, participants apply their newly acquired knowledge. Both instructors will be available to answer questions and provide guidance throughout the course.
Prerequisites
Software and hardware requirements
Agenda
Monday, 18.09. | Introduction
We will cover what web scraping is and how it can be used in social science and digital humanities research. Participants will be asked to share their expectations of the course and how they plan to use web scraping in their research. We will then introduce the most fundamental concepts, including APIs, the XML and HTML formats, and how websites are commonly organized. In the afternoon tutorial session, we will first ensure that all participants have a working setup. We will then have a series of coding exercises designed to ensure that all participants are comfortable with basic R programming concepts and techniques (see Prerequisites section above).
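As a minimal sketch of these fundamentals (the HTML snippet is invented for illustration, and the xml2 package is an assumption based on the R tooling the course uses later), a small HTML document can be parsed and its tag structure inspected like this:

```r
# Minimal illustration: parse an invented HTML snippet and inspect
# its nested tag structure (xml2 is the parser underlying rvest).
library(xml2)

html_string <- '
<html>
  <body>
    <h1>Election results</h1>
    <p class="summary">Turnout was <b>76%</b>.</p>
    <ul id="parties">
      <li>Party A</li>
      <li>Party B</li>
    </ul>
  </body>
</html>'

doc <- read_html(html_string)
xml_structure(doc)  # prints the element tree: html > body > h1, p, ul, ...
```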
Tuesday, 19.09. | Scraping static websites
On day 2, we will introduce how to scrape static websites. Building on our general discussion of HTML (day 1), we will cover how to systematically extract web data by introducing the CSS selector and XPath methods. In practical applications, we will use the rvest R package to show (i) how to extract data (text, hyperlinks, tables, images, and other media, as well as metadata) from web pages and (ii) how to automatically navigate between and scrape multiple pages of a website. In the afternoon tutorial session, participants will learn how to apply this knowledge to different web pages.
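A hedged sketch of this rvest workflow is shown below; the URL and the CSS selectors are placeholders, not actual course materials:

```r
# Sketch of the day-2 workflow with rvest; the URL and CSS selectors
# below are placeholders for whatever page one wants to scrape.
library(rvest)

url  <- "https://example.org/press-releases"  # placeholder URL
page <- read_html(url)

titles   <- page |> html_elements("h2.title")   |> html_text2()      # text
links    <- page |> html_elements("h2.title a") |> html_attr("href") # hyperlinks
table_df <- page |> html_element("table")       |> html_table()      # tables

# Follow a "next page" link to scrape multiple pages
# (assuming the href attribute holds an absolute URL)
next_url <- page |> html_element("a.next") |> html_attr("href")
if (!is.na(next_url)) next_page <- read_html(next_url)
```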
Wednesday, 20.09. | Scraping dynamic websites
On the third day of the course, we will go one step further and discuss how to scrape dynamic websites. We will first explain what makes a page “dynamic” and show how to recognize dynamic web elements in the wild. We will then introduce the RSelenium package and show how it enables systematic interaction with dynamic web elements. This will include how to set up a web driver in R (Google Chrome), how to click on web elements (e.g., to unfold/collapse drop-down elements) in an automated way, how to navigate dynamic elements (e.g., accordion elements), how to switch between windows (e.g., a main page and a pop-up), and how to automatically download files. In the afternoon, participants will have the opportunity to practice these skills.
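A hedged sketch of this RSelenium workflow, assuming a local Chrome setup (the URL and the CSS selector are placeholders, and driver setup details vary across systems):

```r
# Sketch of the day-3 RSelenium workflow; the URL and CSS selector are
# placeholders, and driver setup details vary across systems.
library(RSelenium)

driver <- rsDriver(browser = "chrome")  # starts a Selenium server + Chrome
client <- driver$client

client$navigate("https://example.org/dynamic-page")  # placeholder URL

# Click a drop-down toggle to unfold dynamically loaded content
toggle <- client$findElement(using = "css selector", value = ".dropdown-toggle")
toggle$clickElement()

# Once the content has loaded, hand the rendered HTML to rvest
html <- client$getPageSource()[[1]]
page <- rvest::read_html(html)

client$close()
driver$server$stop()
```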
Thursday, 21.09. | APIs & collecting social media data
Building on the content discussed during the previous days, we will deepen participants' understanding of APIs, discussing common APIs for data sharing. Using the Mastodon API as an example, we will then show how to use the rtoot package to query social media data. This part of the session will also include a primer on authentication, pagination, API rate limits, and ethics. To enable participants to interact with APIs for which no R package exists (yet), we will show how to send requests to APIs with the httr R package, using the example of the Dad Jokes API (https://dadjokes.io). In the context of this example, we will also explain the JSON format - the data format commonly returned by APIs. In the afternoon tutorial session, participants will learn how to apply this knowledge in a small project using the News API (https://newsapi.org).
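As a hedged sketch of the httr workflow (using the News API mentioned above; the query parameters follow its public documentation, and the API key is a placeholder):

```r
# Sketch of querying a JSON API with httr, using the News API
# (https://newsapi.org); "YOUR_API_KEY" is a placeholder credential.
library(httr)

resp <- GET(
  "https://newsapi.org/v2/everything",
  query = list(q = "parliament", language = "en", apiKey = "YOUR_API_KEY")
)
stop_for_status(resp)  # raises an error for non-200 responses

# The API returns JSON, which content() parses into nested R lists
parsed   <- content(resp, as = "parsed")
articles <- parsed$articles
titles   <- vapply(articles, function(a) a$title, character(1))
```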
Friday, 22.09. | Recap & advanced topics
On the last day, we will begin with a recap of what we have learned during the previous four days. Specifically, we will provide a condensed, systematic overview of the common programming techniques used to automate web data collection from static websites, dynamic websites, and APIs, respectively. We will then walk through some advanced topics in web scraping, including web sessions, user agents, proxies, logins, and other topics participants might be interested in. We will also discuss tools for advanced parsing of webpage content, including regular expressions.
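As a brief, invented illustration of regular expressions for parsing scraped text (base R only; the example string is made up):

```r
# Invented example: extracting structured pieces from scraped text
# with base R regular expressions.
text <- "Press release, 18.09.2023: turnout reached 76.4% in District 9."

# Extract a date in DD.MM.YYYY format
regmatches(text, regexpr("\\d{2}\\.\\d{2}\\.\\d{4}", text))
#> [1] "18.09.2023"

# Extract all percentages
regmatches(text, gregexpr("\\d+(\\.\\d+)?%", text))[[1]]
#> [1] "76.4%"
```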