GESIS Training Courses

Scientific Coordination

Dr. Marlene Mauk
Tel: +49 221 47694-579

Administrative Coordination

Claudia O'Donovan-Bellante
Tel: +49 621 1246-221

Automated Web Data Collection with R

About
Location:
Mannheim B6, 4-5
 
General Topics
Course Level
Format
Software used
Duration
Language
Fees
Students: 500 €
Academics: 750 €
Commercial: 1500 €
 
Keywords
Additional links
Lecturer(s): Dr. Theresa Gessler, Dr. Hauke Licht

About the lecturer - Dr. Theresa Gessler

About the lecturer - Dr. Hauke Licht

Course description

The increasing availability of large amounts of data on the internet enables new lines of research in the social sciences. Although it has become easier to find data online that are relevant to social science research, such as social media content, election results, or organizations' press statements, extracting these data and bringing them into formats ready for downstream analyses can be challenging. Web data collection is thus an essential skill for researchers.
 
The goal of this course is to enable participants to collect web data and process it in R for their research. Course participants will learn about the characteristics of web data and their use in social science research, how to harvest content from different types of webpages, and how to collect social media data from application programming interfaces (APIs), such as the Twitter API.
 
We will cover tools and techniques that enable participants to collect web data relevant to their research and focus on two common scenarios in particular: (i) automating the collection of data spread across multiple pages of both static and dynamic websites (with RSelenium), and (ii) interacting with APIs to, for example, collect social media data or datasets from institutions, companies, and organizations. In addition, we will cover advanced topics such as using web sessions, interacting with HTML forms (e.g., login), managing user agents, error handling, and headless browsing.
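
To give a flavor of the first scenario, the following is a minimal sketch of extracting structured content from a static page with rvest. It is an illustration only, not course material: the HTML snippet, the CSS class `press`, and the column names are hypothetical, and the page is inlined so that the example runs without network access.

```r
library(rvest)

# Hypothetical page content (an assumption for illustration), inlined so the
# example does not depend on any real website.
page <- minimal_html('
  <ul class="press">
    <li><a href="/press/1">Statement on budget</a></li>
    <li><a href="/press/2">Statement on elections</a></li>
  </ul>
')

# Select all links inside the list with a CSS selector.
links <- html_elements(page, "ul.press a")

# Collect link text and link target into a data frame,
# one row per press statement.
statements <- data.frame(
  title = html_text2(links),
  url   = html_attr(links, "href"),
  stringsAsFactors = FALSE
)

print(statements)
```

For a live website, `minimal_html()` would be replaced by `read_html()` with the page's URL, and the same selection and extraction steps could be repeated over several pages.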
 
The course is hands-on, with daily lectures followed by exercises where participants can practice their newly learned skills.
 
For additional details on the course and a day-to-day schedule, please download the full-length syllabus.


Target group

Participants will find the course useful if they want to:
  • collect larger amounts of web data from APIs or webpages
  • learn about best practices in automated web data collection
  • improve their existing web scraping skills by deepening their understanding of common web technologies and learning more about the process of developing robust web scrapers


Learning objectives

By the end of the course, participants will:
  • Know the most important characteristics of web data, including webpage content and social media data
  • Gain an understanding of a variety of scraping scenarios: APIs, static pages, dynamic pages
  • Be able to write reproducible and robust code for web scraping tasks
  • Be able to parse, clean, and process data collected from the web

Organizational structure of the course

The course will be organized as a mixture of lectures (morning sessions) and exercises (afternoon sessions). In the lectures, we will focus on explaining core concepts and methods in web scraping. In the exercise sessions, participants will apply their newly acquired knowledge, while the instructors will be available for individual consultations and to support work on assignments.


Prerequisites

  • Basic knowledge of the R programming language
  • Willingness to engage with different web technologies
  • Knowledge of tidyverse R packages (recommended)

Software and hardware requirements

Participants should bring their own laptops and pre-install the following software/packages:
  • RStudio (or a comparable R interface/IDE)
  • the Google Chrome web browser
  • suggested R packages (an updated list of packages will be provided before the course):
      • for web scraping: rvest, httr, RSelenium, rtweet
      • for data processing: dplyr, tidyr, purrr

Recommended related courses

  • Linking Twitter & Survey Data (Workshop, online, 27.06.2022)
  • R 101 (Workshop, online, 31.08. - 01.09.2022)
  • Introduction to Computational Social Science with R (Fall Seminar, Mannheim, Week 1)
  • Tools for Efficient Workflows, Smooth Collaboration and Optimized Research Outputs (Fall Seminar, Mannheim, Week 1)
  • Network Analysis in R (Fall Seminar, Mannheim, Week 3)

