GESIS Training Courses

Scientific Coordination

Dr. Marlene Mauk
Tel: +49 221 47694-579

Administrative Coordination

Noemi Hartung
Tel: +49 621 1246-211

Introduction to Machine Learning for Text Analysis with Python

About
Location:
Mannheim B6, 4-5
 
Course duration:
10:00-17:00 CEST
General Topics:
Course Level:
Format:
Software used:
Duration:
Language:
Fees:
Students: 500 €
Academics: 750 €
Commercial: 1500 €
 
Lecturer(s): Damian Trilling, Anne Kroon

About the lecturer - Damian Trilling

About the lecturer - Anne Kroon

Course description

The course provides insights into the concepts, challenges and opportunities associated with data so large that traditional research methods (such as manual content analysis) can no longer be applied and traditional inferential statistics start to lose their meaning. Participants are introduced to strategies and techniques for capturing and analyzing digital data in communication contexts using Python. The course offers hands-on instruction in the several stages of computer-aided content analysis. More specifically, participants will become familiar with pre-processing methods, analysis strategies, and the visualization and presentation of findings. The focus is on machine learning techniques for analyzing quantitative textual data; both deductive (e.g., supervised machine learning) and inductive (e.g., unsupervised machine learning) approaches will be discussed.
This is a beginner's course. Participants who are looking to learn about the latest developments in machine learning for textual data (such as transformer models) should consider taking a different course, e.g. “From Embeddings to Transformers: Advanced Text Analysis for Social Scientists”. These techniques will be (briefly) discussed towards the end of the course, but the focus lies on the basics of natural language processing and classical machine learning in Python.
 
For additional details on the course and a day-to-day schedule, please download the full-length syllabus.


Target group

Participants will find the course useful if:
  • they are social scientists who aim to model quantitative textual data. Specifically, those who want to describe, explain or predict the content of large-scale textual data using computational techniques are likely to benefit from participating in this course.
  • Note that non-textual data, such as images or networks, are not at the center of this course. The techniques we cover are partly generalizable to such types of data, but the course is not tailored towards them.
    Participants interested in working with images or networks might be interested in one of the following two courses: Automated Image and Video Data Analysis with Python in Week 2 (18-22 September) or Social Network Analysis with R in Week 3 (25-29 September).


    Learning objectives

    By the end of the course participants will:
  • be able to identify research methods from computer science and computational linguistics which can be used for research in the domain of social science
  • have an understanding of the principles of supervised and unsupervised machine learning
  • be able to explain the principles of these methods and describe their value
  • know how to analyze textual data
  • have basic knowledge of the programming language Python and know how to use Python modules for questions relevant to the social sciences
  • be able to independently analyze quantitative textual data using machine learning techniques
    Organisational Structure of the Course
    In the morning, we will have lectures, in which we will explain the topic of the day both from a theoretical-conceptual point of view as well as from a practical point of view (i.e., walking you through code examples). We may have small in-class exercises in between, if necessary.
     
    In the afternoon, participants work on larger exercises in which they implement the techniques we covered. We provide example datasets, but it is also possible (and encouraged) to apply the techniques to your own datasets. Participants can work on their own or solve problems together with one or more classmates. Lecturers will give feedback on participants' (attempted) solutions and also provide example solutions.


    Prerequisites

  • Knowledge of basic statistics (linear and logistic regression)
  • Some experience with computational methods, programming in general, and/or statistical languages (but not necessarily Python) is highly recommended. During the first day of the course, we will discuss some fundamental aspects of coding in Python at a fast pace. To be able to follow along, we recommend that those with little previous experience with computational methods or statistical languages take part in the course Introduction to CSS with Python (week 1).
  • Participants are expected to have a working Python environment installed (see below), and we strongly recommend that participants spend a couple of hours with one of the many free online resources to familiarize themselves with the very basics of Python to have an easier start. For a basic introduction or refresher to Python programming, participants may also consider taking the online workshop “Introduction to Python” that takes place from 04-06 September 2023.
    Software and hardware requirements
    Participants need to have a current Python environment installed and need to be able to install and update packages on their own. All relatively recent versions of Python (in general, 3.8 or higher) should be fine. If you still have an older version, you may not be able to run the example code 1:1 but need to adapt it. Make sure you have recent versions of crucial packages such as pandas, numpy, scipy, scikit-learn, gensim, and keras installed. If in doubt, check how to update them. One option to achieve all of this is to simply install the newest version of the so-called Anaconda distribution, even though this is by no means necessary (in fact, both of us usually install our packages by hand instead of using Anaconda). Additionally, it is advisable to have access to Google Colab. Therefore, please ensure that you have a Google account and can execute code through Google Colab.
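One way to verify the setup described above is a short script along the following lines. This is only a sketch, not part of the course materials: `check_environment` is a hypothetical helper, and the package names are assumed to be the PyPI names of the packages listed above.

```python
import sys
from importlib import metadata

# PyPI names of the packages mentioned in the requirements (assumed spelling).
REQUIRED = ["pandas", "numpy", "scipy", "scikit-learn", "gensim", "keras"]


def check_environment(min_python=(3, 8), packages=REQUIRED):
    """Return (python_ok, versions).

    versions maps each package name to its installed version string,
    or to None if the package is not installed.
    """
    python_ok = sys.version_info >= min_python
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return python_ok, versions


if __name__ == "__main__":
    python_ok, versions = check_environment()
    status = "OK" if python_ok else "older than 3.8"
    print(f"Python {sys.version.split()[0]}: {status}")
    for name, version in versions.items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Missing or outdated packages can then be installed or updated with the package manager of your choice (for example `pip` or `conda`).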
     
    Agenda
    Monday, 25.09.
  • Introduction and overview of the course
  • Principles of quantitative textual analysis for social scientists
  • Getting started with programming in Python: Introduction to the main concepts (such as data types, functions, and methods)
  • Practical discussion of benefits and drawbacks of working with different IDEs, as well as working with specific modules (such as pandas) versus native Python data structures.
  • Conducting an exercise that focuses on setting up our first simple machine learning classifier.
    Tuesday, 26.09.
  • Introduction to the toolkit accessible to social scientists working with 'big' textual datasets
  • Inductive and deductive approaches to computer-aided content analysis
  • Techniques for exploring your data
  • When, why, and how do we pre-process?
  • Regular expressions
  • Natural Language Processing with NLTK and spaCy
  • From text to features: count vectorizers and tf-idf vectorizers
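To give an impression of the pre-processing and feature-extraction steps listed for this day, here is a minimal sketch that cleans a few texts with regular expressions and turns them into count and tf-idf features using scikit-learn. The example sentences, the `clean` helper, and the choice of cleaning rules are assumptions made for illustration; they are not taken from the course materials.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Invented example documents.
docs = [
    "The minister visited Berlin on 12 May. Read more: https://example.org/a",
    "Berlin's mayor and the minister disagreed!",
    "Football: the local team won 3-1.",
]


def clean(text):
    """Lowercase, drop URLs, and keep only the letters a-z and spaces."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # remove digits and punctuation
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace


cleaned = [clean(d) for d in docs]

# Raw term counts per document (rows: documents, columns: vocabulary terms).
count_vec = CountVectorizer(stop_words="english")
counts = count_vec.fit_transform(cleaned)

# The same representation, but weighted by tf-idf.
tfidf_vec = TfidfVectorizer(stop_words="english")
tfidf = tfidf_vec.fit_transform(cleaned)

print(sorted(count_vec.vocabulary_))  # vocabulary in column order
print(counts.toarray())
print(tfidf.toarray().round(2))
```

Whether to remove digits, punctuation, or stop words depends on the research question; the rules above are just one possible choice.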
    Wednesday, 27.09.
  • Principles and techniques of Unsupervised Machine Learning
  • Topic modeling with Latent Dirichlet Allocation (LDA)
  • Hands-on instructions to apply these techniques, using modules such as scikit-learn and gensim
  • Comparing techniques of unsupervised learning with supervised learning
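As a rough illustration of topic modeling with LDA, the sketch below fits a two-topic model with scikit-learn on a handful of invented sentences. The corpus, the number of topics, and the random seed are assumptions chosen for the example; with so little data the resulting topics are not meaningful, and gensim offers a comparable implementation.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy corpus with two rough themes (politics and sports).
docs = [
    "The parliament passed the new tax law after a long debate",
    "The minister defended the tax reform in parliament",
    "Voters expect the government to change the election law",
    "The striker scored two goals in the football match",
    "The football team won the match after a late goal",
    "The coach praised the team and the striker after the game",
]

# LDA works on raw term counts, so we use a count vectorizer here.
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)

# n_components is the number of topics; it has to be chosen by the researcher.
lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topics = lda.fit_transform(dtm)  # rows: documents, columns: topic shares

# Vocabulary terms ordered by their column index in the document-term matrix.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

for k, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"Topic {k}: {', '.join(top_terms)}")

print(doc_topics.round(2))
```

Unlike in supervised learning, no labels are involved: the model only sees the texts, and interpreting the topics is left to the researcher.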
    Thursday, 28.09.
  • Principles and techniques of Supervised Machine Learning
  • Discussion of how logistic regression and Naive Bayes classifiers can be used to predict, for instance, movie ratings or topics of news articles.
  • Evaluation metrics (accuracy, precision, recall, ...)
  • Hands-on instructions to apply these techniques, using modules such as scikit-learn
  • Alternative models (e.g., Random Forests)
  • Advanced Supervised Machine Learning (e.g., cross-validation, grid search, model selection, and tuning)
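The supervised workflow outlined for this day could look roughly like the following sketch, which trains a Naive Bayes and a logistic regression classifier on tf-idf features, reports evaluation metrics, and tunes one hyperparameter with cross-validated grid search. The toy reviews, the train/test split, and the parameter grid are assumptions made for illustration; real applications need far more labeled data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy movie reviews.
pos = [
    "a wonderful and moving film", "great acting and a great story",
    "I loved this movie", "brilliant, funny and touching",
    "an excellent film with a wonderful cast", "a great and enjoyable movie",
    "loved the story and the acting", "funny, brilliant and enjoyable",
    "excellent story, wonderful film", "a touching and excellent movie",
]
neg = [
    "a boring and terrible film", "bad acting and a dull story",
    "I hated this movie", "awful, boring and far too long",
    "a terrible film with a bad cast", "a dull and disappointing movie",
    "hated the story and the acting", "boring, awful and disappointing",
    "terrible story, bad film", "a dull and awful movie",
]
texts = pos + neg
labels = ["pos"] * len(pos) + ["neg"] * len(neg)

# Hold out part of the data to evaluate the models on unseen texts.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, stratify=labels, random_state=42
)

models = {
    "Naive Bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "Logistic regression": make_pipeline(
        TfidfVectorizer(), LogisticRegression(max_iter=1000)
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    scores[name] = accuracy_score(y_test, y_pred)
    print(name, "accuracy:", round(scores[name], 2))
    print(classification_report(y_test, y_pred, zero_division=0))

# Grid search with 3-fold cross-validation over the regularization strength C.
grid = GridSearchCV(
    models["Logistic regression"],
    param_grid={"logisticregression__C": [0.1, 1, 10]},
    cv=3,
    scoring="accuracy",
)
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_)
print("Mean cross-validated accuracy:", round(grid.best_score_, 2))
```

The step name `logisticregression` in the parameter grid is the name that `make_pipeline` assigns automatically from the class name.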
    Friday, 29.09.
  • Visualization and presentation of findings
  • Outlook: Recent developments that are out of the scope of this course (e.g., embedding models, Transformer models, deep learning with keras)


    Recommended readings