Scientific Coordination
Dr. Marlene Mauk
Tel: +49 221 47694-579
Administrative Coordination
Loretta Langendörfer M.A.
Tel: +49 221 47694-143
Web Data Collection and Natural Language Processing in Python
Lecturer(s):
Indira Sen, Dr. Arnim Bleier, Julian Kohne, Dr. Fabian Flöck
Date: 20.09. - 24.09.2021
Location: Online via Zoom
Course description
Data Science is the interdisciplinary science of extracting interpretable and useful knowledge from potentially large datasets. In contrast to empirical social science, data science methods often serve purposes of exploration and inductive inference. In this course, we aim to provide an introduction to tapping into the vast amount of digital behavioral data available on Web platforms and to processing it so that it becomes useful for social science research.
To this end, participants will first learn how to collect data with Web Application Programming Interfaces (APIs) and Web scraping using common Python tools and methods, and how to organize the collected data into workable data structures. The APIs covered will likely include those offered by major Web platforms such as Reddit, Wikipedia, and YouTube (list not final).
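To give a flavor of the API part of this workflow, below is a minimal, illustrative sketch (not taken from the course materials) that queries Wikipedia's public MediaWiki API with the requests library; the article title and parameters are chosen for demonstration only.

import requests

API_URL = "https://en.wikipedia.org/w/api.php"

params = {
    "action": "query",          # MediaWiki query module
    "prop": "extracts",         # request page extracts
    "exintro": 1,               # only the lead section
    "explaintext": 1,           # plain text instead of HTML
    "titles": "Computational social science",
    "format": "json",
}

response = requests.get(API_URL, params=params, timeout=10)
response.raise_for_status()

# The result is nested JSON keyed by internal page IDs.
for page in response.json()["query"]["pages"].values():
    print(page["title"])
    print(page["extract"][:300])

The same requests-based pattern carries over to other Web APIs; only the endpoint, authentication, and parameters change.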
Participants will subsequently be introduced to the basics of Natural Language Processing (NLP) for the analysis of these corpora. As much of the work in NLP is based on Machine Learning (ML), we will begin this section with a basic introduction to ML, followed by an introduction to pre-processing, e.g., data cleaning and feature extraction. We will then cover the application of popular NLP toolkits, some based on simple heuristics and dictionaries, others introducing more advanced ML methods.
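As a pointer to what the ML-based part of the NLP section looks like in practice, here is a small, self-contained sketch of a bag-of-words pipeline using scikit-learn; the toy texts, labels, and the choice of CountVectorizer plus logistic regression are illustrative assumptions, not the specific toolkits taught in the course.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus with invented sentiment labels (1 = positive, 0 = negative).
texts = [
    "I love this new policy",
    "This is a terrible decision",
    "Great news for everyone",
    "I strongly dislike the outcome",
]
labels = [1, 0, 1, 0]

# CountVectorizer handles basic pre-processing (lowercasing, tokenization)
# and feature extraction; the classifier is then trained on word counts.
model = make_pipeline(CountVectorizer(lowercase=True), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["What a great outcome"]))

Dictionary-based tools follow the same overall logic but replace the learned classifier with pre-compiled word lists and scoring rules.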
All course materials will be provided as Python-based Jupyter Notebooks.
Keywords
Web Scraping, APIs, Natural Language Processing, Text as Data, Social Media, Python, Computational Social Science, Fall Seminar, 1 Week, Online, English, Beginner
Target group
Participants will find the course useful if they are interested in leveraging novel forms of digital trace data from the Web for drawing inferences. We expect the course to be of interest to participants from a variety of disciplinary backgrounds (e.g., economics, linguistics, sociology, psychology, political science, demography).
Learning objectives
Participants will obtain a working knowledge of how Web and social media data are collected, through a detailed introduction to Web APIs, Web scraping, and the corresponding tools. They will learn about typical data types and structures encountered when dealing with digital behavioral data from the Web, how to apply selected NLP methods and tools in Python to structure natural language texts, and how this approach differs from those typically encountered in survey-based or experimental research. This will enable them to identify benefits and pitfalls of these data types and methods in their field of interest, and thus to select and appropriately apply the covered NLP methods to large datasets in their own research. The knowledge obtained in this course provides a starting point for participants to investigate specialized methods for their individual research projects.
Organisational Structure of the Course
The course is structured around subthemes of Web data collection, working with digital behavioral traces from the Web, and processing them for analysis. Lectures will be interactive, and the use of Jupyter Notebooks allows participants to reproduce the steps of the research pipeline while we introduce the topics. Each lecture will combine conceptual sections with hands-on programming examples. Additionally, participants will have the opportunity to cement and test their understanding of the different concepts in regular exercise and feedback rounds, in which the instructors provide support, advice, and troubleshooting.
Prerequisites
Software requirements
Participants should have a working installation of Anaconda with Python 3.7+ and Jupyter Notebooks. Anaconda is an open data science platform powered by Python, which can be downloaded here: https://www.anaconda.com/products/individual . It comes with many Python libraries/packages preinstalled, including Jupyter Notebooks. We will be working with Python 3.7 and will use Jupyter Notebooks for the exercises. While we plan to work in Jupyter Notebooks in our ready-to-use cloud-based environment notebooks.gesis.org, a local installation of Anaconda is needed as a fallback in the rare case that this service is down.
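To check that a local installation is ready, a short sanity check of the kind below can be run in a Jupyter Notebook cell; the specific packages imported (requests, pandas) are common choices for this type of course and are assumptions here, not an official requirements list.

import sys

# Python 3.7 or newer is expected for the course materials.
assert sys.version_info >= (3, 7), "Python 3.7+ is required"

import requests   # HTTP client for Web APIs and scraping
import pandas     # tabular data structures

print("Python", sys.version.split()[0])
print("requests", requests.__version__)
print("pandas", pandas.__version__)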