Scientific Coordination

Marlene Mauk
Tel: +49 221 47694-579

Administrative Coordination

Noemi Hartung
Tel: +49 621 1246-211

Web Data Collection with Python and R

About

Date:
09.09 - 13.09.2024

Location:
Mannheim, B6 4-5

Course duration:

9:00-16:00 CEST

General Topics:

Computational Social Science, Research Data Management

Course Level:

Beginner

Format:

Fall Seminar

Software used:

R, Python

Duration:

5 days

Language:

English

Fees:

Students: 550 €

Academics: 825 €

Commercial: 1650 €

Keywords

Web scraping, APIs, automated data collection, R, Python

Lecturer(s): Iulia Cioroianu

Course description

The exponential increase in online and social media data offers unprecedented opportunities for advancing research across a variety of fields, both within academia and outside of it. This course provides researchers with the tools needed to collect and pre-process large-scale data from a range of online sources. The course will be offered in both R and Python. Students can attend taught sessions in both programming languages in the morning and choose their preferred language for individual/group work and exercises in the afternoon. The content and examples used in the lecturer-led tutorials are similar across the two languages, so participants who want to build skills in a language they are less proficient in can do so by drawing parallels between the two sessions.

Through a combination of lectures, hands-on tutorials and individual/group exercises, participants will develop a theoretical understanding of the challenges associated with online data collection and the best methods and tools for addressing them in R and in Python, as well as the practical skills needed to collect data through Application Programming Interfaces (APIs), navigate dynamic websites and scrape data from both static and dynamic web pages. The sources used in the examples provided include social media websites, online media outlets and news aggregators, government data portals and other large-scale online data repositories.
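As a rough illustration of what scraping a static page involves, the sketch below pulls link targets and link texts out of an HTML fragment. It is not taken from the course materials: the HTML string, the `LinkCollector` class and the `extract_links` helper are hypothetical, and the example uses Python's standard-library `html.parser` instead of the Beautiful Soup and lxml packages listed for the course, so that it runs without any installation or network access.

```python
from html.parser import HTMLParser

# Hypothetical page content. In practice this string would come from an HTTP
# response, for example one fetched with the requests package.
SAMPLE_HTML = """
<html><body>
  <h1>News</h1>
  <ul>
    <li><a href="/story/1">First headline</a></li>
    <li><a href="/story/2">Second headline</a></li>
  </ul>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> element currently open, if any
        self._text = []    # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return a list of (href, text) tuples found in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    for href, text in extract_links(SAMPLE_HTML):
        print(href, text)
```

Dynamic pages, whose content is built by JavaScript after loading, generally cannot be handled this way and need a browser-automation tool such as Selenium.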

The most difficult part of a computational project involving complex and heterogeneous data is often the pre-processing needed to prepare the data for subsequent analysis and to link it across a variety of sources. The course therefore also covers text-based methods for data cleaning and pre-processing. By the end of the week, participants should be able to apply the methods studied to extract and process data for their own research projects.
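To give a sense of what text-based cleaning can look like, here is a minimal sketch using Python's `re` and `unicodedata` modules. The `clean_text` function and the particular steps it applies are assumptions chosen for illustration, not the course's own pipeline. Stripping tags with a regular expression is a crude shortcut that a real project would usually replace with an HTML parser.

```python
import re
import unicodedata


def clean_text(raw):
    """Normalize a scraped text fragment for later analysis (illustrative only)."""
    # Unify Unicode forms; this also turns non-breaking spaces into plain spaces.
    text = unicodedata.normalize("NFKC", raw)
    # Replace leftover HTML tags with a space.
    text = re.sub(r"<[^>]+>", " ", text)
    # Replace URLs with a space.
    text = re.sub(r"https?://\S+", " ", text)
    # Collapse runs of whitespace (including newlines) into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip().lower()


if __name__ == "__main__":
    sample = "<p>Breaking\u00a0NEWS:\n read more at https://example.org/x </p>"
    print(clean_text(sample))
```

Which steps are appropriate depends on the analysis: lowercasing, for instance, is convenient for word counts but discards information that named-entity recognition would need.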

Organizational Structure of the Course

The course will consist of taught morning sessions in R and Python, with ample opportunity for independent and group work in the afternoon. In each programming language, the morning session consists of a short lecture that lays out the main concepts and gives an overview of the language-specific tools used (approximately 30 minutes), followed by a hands-on lecturer-led tutorial (1 hour). In the afternoon, students will solve exercises in their chosen programming language, working independently or in small groups, with the support of the lecturer and one teaching assistant (2 hours). Solutions to the exercises will be provided and discussed in the final part of the day (30 minutes). The daily schedule by programming language group is presented in the table below.

Session	R	Python
Theoretical overview and lecturer-led tutorial	9:00-10:30	11:00-12:30
Individual or small-group exercises and solutions	13:00-15:30	13:30-16:00

Target group

You will find the course useful if:

you want to learn how to collect and process large amounts of data from online sources fast.
you aim to improve your existing web scraping skills or have so far encountered difficulties trying to scrape data from online sources.
you have a research idea for which online data might be suitable, but you are not sure of the practical implications.

Learning objectives

By the end of the course you will:

understand the structure and basic features of different forms of online data.
be able to collect data from static and dynamic websites.
be able to interact with APIs to access and collect data.
be able to parse, clean and process the data collected.
be able to apply the methods studied to your own research projects.

Prerequisites

Working knowledge of R and/or Python, including data structures and control structures.
Students attending both the R and the Python sessions should have a basic level of knowledge in each of the two programming languages.
Students who lack basic knowledge of these programming languages are encouraged to take the Introduction to Computational Social Science with R or Introduction to Computational Social Science with Python course in week 1 and/or the introductory workshops (Intro to R, Intro to Python).

Software and Hardware Requirements

You should bring your own laptop for use in the course and pre-install the following software:

R:

RStudio
Required packages (final list to be provided before the course): httr, rvest, RSelenium, dplyr, tidyr, stringr, quanteda

Python:

Python 3
Required packages (final list to be provided before the course): requests, lxml, Beautiful Soup, Selenium, pandas, re (part of the Python standard library), NLTK