GESIS Training Courses

Scientific Coordination

Dr. Marlene Mauk
Tel: +49 221 47694-579

Administrative Coordination

Loretta Langendörfer M.A.
Tel: +49 221 47694-143

Automated Web Data Collection with R

Lecturer(s):
Dr. Theresa Gessler, Hauke Licht

Date: 20.09. - 24.09.2021

Location: Online via Zoom


Course description

The increasing availability of large amounts of online data enables new lines of research in the social sciences. Over the past years, a variety of data - whether election results, press releases, parliamentary speeches or social media content - has become available online. Although data has become easier to find, its extraction and reshaping into formats ready for downstream analyses can be challenging. This makes data collection and cleaning skills essential for researchers. The goal of this course is to equip participants to gather online data and process it in R for their own research.
During the course, participants will learn about the characteristics of web data and their use in social research, as well as how to harvest content from different types of webpages, gather information from web interfaces and collect social media data. The course also covers the most important techniques for cleaning and reshaping web and social media data for analysis.
While we introduce tools and techniques that help with data collection more generally, the focus will be on two common scenarios:
  • automating the collection of data spread over multiple pages, including by navigating dynamic websites
  • interacting with APIs to, for example, collect social media data or datasets from institutions, companies and organizations
The course is hands-on, with lectures followed by exercises where participants will apply and practice these methods in R.
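To give a flavour of the static-page scenario, the rvest package (listed under software requirements) can parse HTML and pull out elements with CSS selectors. The snippet below is a minimal sketch, not course material: the HTML fragment and the `.party`/`.votes` class names are invented for illustration, and a real scraper would call `read_html()` on a URL instead of `minimal_html()`.

```r
library(rvest)

# Tiny self-contained HTML fragment standing in for a downloaded page
# (invented for illustration; in practice: page <- read_html("https://..."))
page <- minimal_html('
  <table>
    <tr><td class="party">Party A</td><td class="votes">42.1</td></tr>
    <tr><td class="party">Party B</td><td class="votes">31.7</td></tr>
  </table>
')

# Extract each column with a CSS selector, then combine into a data frame
parties <- page |> html_elements(".party") |> html_text2()
votes   <- page |> html_elements(".votes") |> html_text2() |> as.numeric()
results <- data.frame(party = parties, vote_share = votes)
```

The same pattern - select with CSS, extract text, assemble a rectangular dataset - carries over to real pages; only the source of the HTML changes.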


    Target group

    Participants will find the course useful if:
  • they want to collect larger amounts of web data from APIs or webpages
  • they want to learn about best practices in automated web data collection
  • they want to improve pre-existing web scraping skills by deepening their understanding of common web technologies and learning more about the process of developing robust web scrapers


    Learning objectives

    By the end of the course participants will:
  • Know the most important characteristics of web data, including webpage content and social media data
  • Gain an understanding of a variety of scraping scenarios: APIs, static pages, dynamic pages, web crawling
  • Be able to write reproducible and robust code for web scraping tasks
  • Be able to parse, clean and process data collected from the web
    Organisational Structure of the Course
    The course will be organized as a mixture of lectures in the morning sessions and exercises and lab sessions in the afternoon. The lecture sessions focus on explaining core concepts and methods in web scraping. In the lab and exercise sessions, participants apply their newly acquired knowledge, with the instructors available for individual consultations and for support with the assignments.


    Prerequisites

  • Basic knowledge of the R programming language
  • Willingness to engage with different web technologies
  • Knowledge of tidyverse (recommended)
    Software requirements
    RStudio or a similar R interface/IDE
    Google Chrome web browser
    Suggested R packages include (an updated list will be provided before the course):
  • for web scraping: rvest, httr, RSelenium
  • for collecting social media data: rtweet
  • for data processing: dplyr, tidyr, purrr
  • for automation: boilerpipeR, taskscheduleR, cronR
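As a small sketch of how httr fits the API scenario, the snippet below builds one request URL per page of a paginated API. The endpoint and the `page`/`per_page` query parameters are hypothetical, invented purely for illustration; `modify_url()` takes care of query-string encoding.

```r
library(httr)

# Hypothetical paginated API endpoint (invented for illustration)
base_url <- "https://api.example.org/v1/press-releases"

# Build one URL per results page; modify_url() appends an encoded query string
urls <- vapply(
  1:3,
  function(page) modify_url(base_url, query = list(page = page, per_page = 100)),
  character(1)
)

# A real collection loop would then fetch and parse each page, e.g.:
# resp <- GET(urls[1])
# data <- content(resp, as = "parsed")
```

Separating URL construction from fetching keeps the scraper easy to test and to rerun if individual requests fail.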

