GESIS Training Courses

Scientific Coordination

Marlene Mauk
Tel: +49 221 47694-579

Administrative Coordination

Noemi Hartung

Web Data Collection with Python and R

About
Location:
Mannheim, B6 4-5
 
Course duration:
9:00-16:00 CEST
General Topics:
Course Level:
Format:
Software used:
Duration:
Language:
Fees:
Students: 550 €
Academics: 825 €
Commercial: 1650 €
Keywords
Additional links
Lecturer(s): Iulia Cioroianu

About the lecturer - Iulia Cioroianu

Course description

The exponential increase in online and social media data offers unprecedented opportunities for advancing research across a variety of fields, both within academia and outside of it. This course provides researchers with the tools needed to collect and pre-process large-scale data from a range of online sources. The course will be offered in both R and Python. Students can attend taught sessions in both programming languages in the morning and choose their preferred language for individual/group work and exercises in the afternoon. The content and examples used in the lecturer-led tutorials are similar across the two languages, so participants who want to build skills in a language they are less proficient in can do so by drawing parallels between the two sessions.
 
Through a combination of lectures, hands-on tutorials and individual/group exercises, participants will develop a theoretical understanding of the challenges associated with online data collection and the best methods and tools for addressing them in R and in Python, as well as the practical skills needed to collect data through Application Programming Interfaces (APIs), navigate dynamic websites and scrape data from both static and dynamic web pages. The sources used in the examples provided include social media websites, online media outlets and news aggregators, government data portals and other large-scale online data repositories.
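To make the idea of scraping a static page concrete, the following is a minimal Python sketch rather than course material. It relies only on the standard library's html.parser so that it runs without extra installs (the course itself lists Beautiful Soup and lxml for this kind of task). The HTML snippet, the "headline" class name and the HeadlineParser helper are all invented for illustration.

```python
from html.parser import HTMLParser

# Hypothetical static page; in a real project the HTML would be downloaded first.
SAMPLE_HTML = """
<html><body>
  <h2 class="headline"><a href="/news/1">First story</a></h2>
  <h2 class="headline"><a href="/news/2">Second story</a></h2>
  <p class="footer">Not a headline</p>
</body></html>
"""


class HeadlineParser(HTMLParser):
    """Collect (title, link) pairs from <h2 class="headline"> elements."""

    def __init__(self):
        super().__init__()
        self.in_headline = False
        self.current_link = None
        self.current_text = []
        self.results = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h2" and attrs.get("class") == "headline":
            # Entering a headline: reset the per-headline state.
            self.in_headline = True
            self.current_link = None
            self.current_text = []
        elif tag == "a" and self.in_headline:
            self.current_link = attrs.get("href")

    def handle_data(self, data):
        # Only keep text that appears inside a headline element.
        if self.in_headline:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self.in_headline:
            title = "".join(self.current_text).strip()
            self.results.append((title, self.current_link))
            self.in_headline = False


def extract_headlines(html):
    """Parse an HTML string and return the headlines it contains."""
    parser = HeadlineParser()
    parser.feed(html)
    return parser.results


headlines = extract_headlines(SAMPLE_HTML)
print(headlines)
```

Dynamic pages, which build their content with JavaScript, generally need a browser-automation tool such as Selenium instead, because the content is not present in the HTML that is first downloaded.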
 
Acknowledging that the most difficult part of a computational project involving the collection of complex and heterogeneous data is often the pre-processing needed to prepare the data for subsequent analysis and link it across a variety of sources, the course also covers text-based methods for data cleaning and pre-processing. By the end of the week, participants should be able to apply the methods studied to extract and process data for their own research projects.
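As a rough illustration of what text-based cleaning can involve, here is a short Python sketch using only the standard library. The cleaning steps, their order and the clean_text name are assumptions chosen for the example, not the procedure taught in the course.

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")          # leftover HTML tags
URL_RE = re.compile(r"https?://\S+")     # bare links
WHITESPACE_RE = re.compile(r"\s+")       # runs of spaces, tabs, newlines


def clean_text(raw):
    """Return a normalized, lower-cased version of a scraped text snippet."""
    text = html.unescape(raw)                   # decode entities such as &amp;
    text = TAG_RE.sub(" ", text)                # drop markup
    text = URL_RE.sub(" ", text)                # drop links
    text = unicodedata.normalize("NFKC", text)  # unify look-alike characters
    text = WHITESPACE_RE.sub(" ", text).strip() # collapse whitespace
    return text.lower()


raw = "<p>Breaking&nbsp;news:   prices UP 5% &amp; rising</p> https://example.org/a"
print(clean_text(raw))
```

Applying the same normalization to every source before matching records is one simple way to make text fields comparable across datasets.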
 
For additional details on the course and a day-to-day schedule, please download the full-length syllabus.
 
Organizational Structure of the Course
The course consists of taught morning sessions in R and Python and ample opportunities for independent and group work in the afternoon. In each programming language, the morning session consists of a short lecture laying out the main concepts and providing an overview of the language-specific tools used (approximately 30 minutes), followed by a hands-on lecturer-led tutorial (1 hour). In the afternoon, students will have the opportunity to solve exercises in their chosen programming language, working independently or in small groups, with the support of the lecturer and one teaching assistant (2 hours). Solutions to the exercises will be provided and discussed in the final part of the day (30 minutes). The daily schedule by programming language group is presented in the table below.
 
                                                     R              Python
Theoretical overview and lecturer-led tutorial       9:00-10:30     11:00-12:30
Individual or small-group exercises and solutions    13:00-15:30    13:30-16:00
 
 


Target group

You will find the course useful if:
  • you want to learn how to quickly collect and process large amounts of data from online sources.
  • you aim to improve your existing web scraping skills or have so far encountered difficulties trying to scrape data from online sources.
  • you have a research idea for which online data might be suitable, but you are not sure of the practical implications.


Learning objectives

By the end of the course you will:
  • understand the structure and basic features of different forms of online data.
  • be able to collect data from static and dynamic websites.
  • be able to interact with APIs to access and collect data.
  • be able to parse, clean and process the data collected.
  • be able to apply the methods studied to your own research projects.
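The objective of interacting with APIs can be pictured with a small Python sketch of paginated collection. Everything specific here is hypothetical: the endpoint URL, the q, page and per_page parameters, and the items and next_page fields in the response. A fake fetch function stands in for the real HTTP call so that the example runs offline.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://api.example.org/v1/posts"  # hypothetical endpoint


def build_url(query, page, per_page=2):
    """Build a request URL with properly encoded query parameters."""
    params = {"q": query, "page": page, "per_page": per_page}
    return f"{BASE_URL}?{urlencode(params)}"


def collect_all(query, fetch, max_pages=10):
    """Request page after page until the API reports no further page."""
    records = []
    page = 1
    while page is not None and page <= max_pages:
        payload = json.loads(fetch(build_url(query, page)))
        records.extend(payload["items"])
        page = payload.get("next_page")
    return records


# Canned responses standing in for a real service.
FAKE_RESPONSES = {
    1: {"items": [{"id": 1}, {"id": 2}], "next_page": 2},
    2: {"items": [{"id": 3}], "next_page": None},
}


def fake_fetch(url):
    """Stand-in for an HTTP call such as requests.get(url).text."""
    page = int(parse_qs(urlsplit(url).query)["page"][0])
    return json.dumps(FAKE_RESPONSES[page])


records = collect_all("climate policy", fake_fetch)
print(records)
```

Passing the fetch function in as an argument keeps the collection logic testable without network access. Real APIs typically also require authentication and enforce rate limits, which this sketch leaves out.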


Prerequisites

 
Software and Hardware Requirements
You should bring your own laptop for use in the course and pre-install the following software:
 
R:
  • RStudio
  • Required packages (final list to be provided before the course): httr, rvest, RSelenium, dplyr, tidyr, stringr, quanteda
 
Python:
  • Python 3
  • Required packages (final list to be provided before the course): requests, lxml, Beautiful Soup, Selenium, pandas, re, stringr, NLTK
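One possible way to confirm that the Python packages are available before the course starts is sketched below. The import names are assumptions based on common usage: Beautiful Soup is usually imported as bs4 and NLTK as nltk, while re ships with Python itself. stringr is left out of the sketch because it is normally known as an R package. The check_modules helper is hypothetical.

```python
from importlib.util import find_spec

# Import names, which differ from the display names in two cases:
# Beautiful Soup -> bs4, NLTK -> nltk. re is part of the standard library.
PYTHON_MODULES = ["requests", "lxml", "bs4", "selenium", "pandas", "re", "nltk"]


def check_modules(names):
    """Map each module name to True if it can be imported, else False."""
    return {name: find_spec(name) is not None for name in names}


status = check_modules(PYTHON_MODULES)
for name, ok in status.items():
    print(f"{name:10s} {'ok' if ok else 'MISSING'}")
```

Any module reported as MISSING can usually be installed with pip, using the package's distribution name (for example beautifulsoup4 for bs4).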


Schedule

Recommended readings