GESIS Training Courses

Scientific Coordination

Dr. Marlene Mauk
Tel: +49 221 47694-579

Administrative Coordination

Noemi Hartung
Tel: +49 621 1246-211

Automated Web Data Collection with R

About
Location: Mannheim B6, 4-5
Course duration: 9:30-16:30 CEST
Fees:
Students: 500 €
Academics: 750 €
Commercial: 1500 €
 
Lecturer(s): Hauke Licht, Allison Koh

Course description

The increasing availability of large amounts of online data enables new lines of research in the social sciences. Over the past decades, a variety of information - whether election results, press releases, parliamentary speeches or social media content - has become available online. Although it has become easier and easier to find such information online, its extraction and reshaping into data formats ready for downstream analyses can be challenging. This makes web data collection and cleaning skills essential for researchers. The goal of this course is to equip participants to gather online data and process it in R for their own research.
During the course, participants will learn about the characteristics of web data and their use in social science research. The main learning objective is that participants acquire the skills to collect (“scrape”) content from different types of web pages as well as from application programming interfaces (APIs) such as those hosted by governments, international organizations, and popular newspapers. The course will also demonstrate programming strategies for sustainable and robust social media data extraction - a skill that has become all the more important since major social media platforms like Facebook and Twitter have discontinued API access to their data in recent years.
The course is hands-on, with daily lectures followed by exercises. In the exercises, participants will apply and practice these methods in R. While we introduce tools and techniques that help with data collection more generally, the focus will be on three common scenarios:
  • scraping data from static and dynamic web pages
  • automating the collection of information spread over multiple pages, including by navigating dynamic websites (through simulation of clicking and scrolling behavior)
  • interacting with APIs to, for example, collect data from government institutions, news publishing companies, or international organizations
    For additional details on the course and a day-to-day schedule, please download the full-length syllabus.


    Target group

    Participants will find the course useful if:
  • they want to collect larger amounts of web data from web pages or APIs
  • they want to learn about best practices in automated web data collection
  • they want to improve pre-existing web scraping skills by deepening their understanding of common web technologies and learning more about the process of developing robust web scrapers
    Participants will be asked to indicate their prior experience with web scraping, their research interests and potential web scraping-related project ideas in a pre-course survey. Based on this survey, the instructors will attempt to include examples in the afternoon tutorial sessions that match participants' research interests and project ideas.


    Learning objectives

    By the end of the course participants will:
  • know the most important characteristics of web data, including web page content and social media data
  • understand a variety of scraping scenarios: static pages, dynamic pages, APIs, and social media data
  • be able to parse, clean and process data collected from the web
  • be able to write reproducible and robust code for web scraping tasks
    Organizational Structure of the Course
    The course will be organized as a mixture of lectures and exercise sessions. We will switch between lectures and exercises throughout the morning and afternoon sessions of the course. In the lecture sessions, we will focus on explaining core concepts and methods in web scraping. In the exercise sessions, participants will apply their newly acquired knowledge. Both instructors will be available to answer questions and provide guidance during the entire course.


    Prerequisites

  • willingness to engage with different web technologies
  • basic knowledge of the R programming language (incl. the use of loops and writing custom functions): Participants should make sure before the course that they are familiar with the following R programming concepts and techniques (illustrated in the short code sketch after this list):
  • primary data object classes (vectors, lists and data frames)
  • data wrangling (manipulating vectors, lists and data frames; reshaping/pivoting data frames)
  • for loops and (ideally) functions in the apply/map families (map_* in the purrr package)
  • writing simple functions
  • knowledge of tidyverse R packages (recommended)
  • We will briefly recap these topics in the afternoon session of the first day of the course. However, if participants are unfamiliar with these topics, we recommend taking the corresponding free online short tutorials in the SICSS R Bootcamp: https://sicss.io/boot_camp. For those who would like a primer or refresher in R, we recommend taking the online workshop “Introduction to R” which takes place from 05-07 September 2023.
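
    A minimal sketch of the R constructs listed above (all data and object names below are invented for illustration):

    # primary data object classes: vectors, lists, data frames
    parties <- c("A", "B", "C")                       # character vector
    results <- data.frame(party = parties,
                          vote_share = c(0.32, 0.28, 0.40))

    # data wrangling with dplyr (tidyverse)
    library(dplyr)
    results |> filter(vote_share > 0.30) |> arrange(desc(vote_share))

    # a simple custom function
    to_percent <- function(x) round(100 * x, 1)

    # a for loop ...
    for (p in parties) print(p)

    # ... and the purrr map_* equivalent
    library(purrr)
    map_dbl(results$vote_share, to_percent)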
    Software and hardware requirements
  • Participants should bring their own laptops for use in the course.
  • a recent installation of R and RStudio (or a comparable R IDE)
  • the Google Chrome web browser
  • required R packages (a complete list of packages will be provided before the course)
  • for web scraping: rvest, RSelenium, httr
  • for data processing: dplyr, tidyr, purrr, stringr
    Agenda
    Monday, 18.09.
    Introduction
    We will cover what web scraping is and how it can be used in social science and digital humanities research. Participants will be asked to share their expectations of the course and how they plan to use web scraping in their research. We will then introduce the most fundamental concepts, including APIs, the XML and HTML formats, and how websites are commonly organized.
     
    In the afternoon tutorial session, we will first ensure that all participants have a working setup. We will then have a series of coding exercises designed to ensure that all participants are comfortable with basic R programming concepts and techniques (see Prerequisites section above).
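     
    As a small preview of what HTML looks like and how R can represent it, a hand-made example (the HTML snippet below is invented; the rvest functions shown here are covered in detail on day 2):

    library(rvest)

    # a tiny, invented HTML document: nested tags with attributes
    html <- '<html><body>
      <h1>Press releases</h1>
      <div class="release"><a href="/release/1">First release</a></div>
      <div class="release"><a href="/release/2">Second release</a></div>
    </body></html>'

    page <- read_html(html)                                       # parse HTML into R
    page |> html_elements("div.release a") |> html_text2()        # link texts
    page |> html_elements("div.release a") |> html_attr("href")   # link targets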
    Tuesday, 19.09.
    Scraping static websites
    On day 2, we will introduce how to scrape static websites. Building on our general discussion of HTML (Day 1), we will cover how to systematically extract web data using CSS selectors and XPath.
     
    In practical applications, we will use the rvest R package to show how to (i) extract data (text, hyperlinks, tables, images, and other media, as well as metadata) from web pages and (ii) automatically navigate between and scrape multiple pages of a website.
     
    In the afternoon tutorial session, participants will learn how to apply this knowledge to different web pages.
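     
    A sketch of the kind of rvest workflow practiced on this day; the URL, page structure, and CSS selectors below are placeholders rather than a real website:

    library(rvest)
    library(purrr)

    # hypothetical paginated listing page
    base_url <- "https://www.example.org/press-releases?page="

    scrape_page <- function(page_no) {
      Sys.sleep(1)  # be polite: pause between requests
      page <- read_html(paste0(base_url, page_no))
      data.frame(
        title = page |> html_elements("article h2")   |> html_text2(),
        link  = page |> html_elements("article h2 a") |> html_attr("href"),
        date  = page |> html_elements("article time") |> html_text2()
      )
    }

    # iterate over several result pages and combine the output
    releases <- map(1:5, scrape_page) |> list_rbind()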
    Wednesday, 20.09.
    Scraping dynamic websites
    On the third day of the course, we will go one step further and discuss how to scrape dynamic websites. We will first explain what makes a page “dynamic” and show how to recognize dynamic web elements in the wild.
     
    We will then introduce the RSelenium package and show how it enables systematic interaction with dynamic web elements. This will include how to set up a web driver in R (Google Chrome), how to click on web elements (e.g., to unfold/collapse drop-down elements) in an automated way, how to navigate dynamic elements (e.g., accordion elements), how to switch between windows (e.g., a main page and a pop-up), and how to automatically download files. In the afternoon, participants will have the opportunity to practice these skills.
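     
    A minimal RSelenium sketch along these lines; the URL and CSS selectors are invented, and driver setup details (e.g., matching Chrome versions) are covered in the session:

    library(RSelenium)
    library(rvest)

    # start a Selenium server and a Chrome browser session
    driver <- rsDriver(browser = "chrome", verbose = FALSE)
    remDr  <- driver$client

    remDr$navigate("https://www.example.org/database")   # hypothetical dynamic page

    # click a (hypothetical) "load more" button to reveal additional content
    btn <- remDr$findElement(using = "css selector", value = "button.load-more")
    btn$clickElement()
    Sys.sleep(2)  # give the page time to render the new elements

    # hand the rendered page source over to rvest for parsing
    page  <- read_html(remDr$getPageSource()[[1]])
    items <- page |> html_elements("div.result") |> html_text2()

    # clean up: close the browser and stop the Selenium server
    remDr$close()
    driver$server$stop()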
    Thursday, 21.09.
    APIs & collecting social media data
    Building on the content discussed during the previous days, we will deepen participants' understanding of APIs, discussing common APIs for data sharing. Using the Mastodon API as an example, we will then show how to use the rtoot package to query social media data. This part of the session will also include a primer on authentication, pagination, API rate limits, and ethics.
     
    To enable participants to also interact with APIs for which no R package exists (yet), we will show how to send requests to APIs with the httr R package, using the Dad Jokes API (https://dadjokes.io) as an example. In the context of this example, we will also explain the JSON format - the data format commonly returned by APIs.
     
    In the afternoon tutorial session, participants will learn how to apply this knowledge with a small project using the News API (https://newsapi.org).
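     
    A generic sketch of the request-and-parse pattern with httr; the endpoint, query parameters, and API key variable below are placeholders, not the actual News API or Dad Jokes API interface:

    library(httr)
    library(jsonlite)

    # hypothetical endpoint, query, and API key (stored as an environment variable)
    resp <- GET(
      "https://api.example.org/v1/articles",
      query = list(q = "climate", page = 1),
      add_headers(Authorization = paste("Bearer", Sys.getenv("EXAMPLE_API_KEY")))
    )

    stop_for_status(resp)   # fail loudly on HTTP errors
    parsed <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))  # parse JSON
    str(parsed)             # inspect the returned data structure

    Sys.sleep(1)            # respect rate limits when sending many requests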
    Friday, 22.09.
    On the last day, we will begin with a recap of what we have learned during the previous four days. Specifically, we will provide a condensed, systematic overview of the common programming techniques applied to automate web data collection from static websites, dynamic websites, and APIs, respectively.
     
    We will then walk through some advanced topics in web scraping, including web sessions, user agents, proxies, and logins, as well as other topics participants might be interested in. We will also discuss tools for the advanced parsing of webpage content, including regular expressions.
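     
    Two small illustrations of the tools mentioned here, a custom user agent for a scraping session and regular expressions for parsing scraped text; all values below are made up:

    library(rvest)
    library(httr)
    library(stringr)

    # a web session that identifies itself via a custom user agent
    sess <- session("https://www.example.org",
                    user_agent("research-scraper; contact: me@example.org"))

    # regular expressions for extracting information from scraped text
    text <- "Published on 2023-09-18 by press-office@example.org"
    str_extract(text, "\\d{4}-\\d{2}-\\d{2}")              # the date
    str_extract(text, "[\\w.-]+@[\\w.-]+\\.[A-Za-z]{2,}")  # the e-mail address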
     


    Recommended readings