GESIS Training Courses

Scientific Coordination

Marlene Mauk
Tel: +49 221 47694-579

Administrative Coordination

Noemi Hartung
Tel: +49 621 1246-211

From Embeddings to LLMs: Advanced Text Analysis with Python

About
Location:
Mannheim, B6 4-5
 
Course duration:
9:30-12:30 and 13:30-16:30 CEST
Course Level:
Advanced
Software used:
Python
Fees:
Students: 550 €
Academics: 825 €
Commercial: 1650 €
Lecturer(s): Lisa Maria Lechner, Hauke Licht

Course description

Basic “bag-of-words” methods of text analysis that rely on counting words or n-grams are limited in their ability to account for the complexity of natural language, which restricts how well they can measure social science concepts in textual data. Deep learning methods for text embedding and neural language modeling help overcome these limitations and are therefore an essential addition to the toolkit of computational social science researchers.
This course thus introduces social scientists to advanced, deep learning-based text analysis methods such as word embeddings and large language models (LLMs). Participants will learn about the conceptual motivation and methodological foundations of text embedding methods and large neural language models. Moreover, they will gain plenty of practical experience with applying these methods in social science research using the Python programming language. In addition to conveying a solid conceptual understanding and hands-on experience with these methods, the course puts a strong emphasis on introducing and discussing potential social science use cases as well as ethical considerations.
We will start by introducing classical word embedding models like GloVe and word2vec, and participants will learn how to use word embeddings in social science research. We will then introduce state-of-the-art Transformer models like BERT and GPT. We will first cover their methodological foundations: the attention mechanism, masked and autoregressive language modeling, and the neural network architectures that characterize BERT and GPT. Participants will then apply these models in exercises covering various supervised learning tasks (single- and multilabel sentence classification, token classification, and pairwise comparison) as well as topic modeling with BERTopic. Finally, we will introduce strategies and techniques for prompting pre-trained generative language models to code texts based on no or only a few labelled examples (i.e., zero-shot prompting and few-shot in-context learning).
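To give a flavour of the workflows covered in the first part of the course, here is a minimal sketch (not taken from the course materials) that loads pre-trained GloVe vectors with gensim, one of the libraries listed under the software requirements below, and queries them for semantic similarity; the specific model name and query terms are illustrative choices.

```python
# Minimal, illustrative sketch: exploring pre-trained GloVe word vectors
# with gensim. Model name and query words are example choices only.
import gensim.downloader as api

# Downloads the 100-dimensional GloVe vectors on first use.
vectors = api.load("glove-wiki-gigaword-100")

# Nearest neighbours of a politically relevant term
print(vectors.most_similar("parliament", topn=5))

# Similarity between two concepts of interest
print(vectors.similarity("immigration", "migration"))

# The classic analogy test: king - man + woman ≈ queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```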
This is an advanced-level course. Participants should have prior knowledge of basic text analysis techniques; specifically, they should have experience with standard bag-of-words pre-processing techniques and text representation approaches, such as word count-based document-feature matrices. Those looking for a more introductory-level course should consider taking “Introduction to Machine Learning for Text Analysis with Python” (16-20 September). Moreover, participants should have experience with programming in Python; due to time constraints, the instructors cannot provide an introduction to or recap of Python programming basics during the course.
 
Organizational Structure of the Course
The course will be organized as a mixture of lectures and exercise sessions. We will switch between lectures and exercises throughout the morning and afternoon sessions of the course. In the lecture sessions, we will focus on explaining core concepts and methods. In the exercise sessions, participants will apply their newly acquired knowledge. Both instructors will be available to answer questions and provide guidance during the entire course.


Target group

You will find the course useful if:
  • you have a background in the social sciences or humanities (e.g., communication science, economics, political science, sociology, or related fields)
  • you have a solid understanding of basic text analysis methods,
  • you want to advance your knowledge, skills, and practical experience, and
  • you want to get up to speed with applying state-of-the-art NLP methods to text analysis problems in social science research


Learning objectives

By the end of the course you will:
  • know the methodological foundations of text embedding methods, transfer learning, Transformers, and large language models (LLMs)
  • be able to apply these methods to analyze social scientific text data
  • be able to reflect critically about the application of the techniques in social science research, including relevant ethical considerations


Prerequisites

  • Prior knowledge of basic quantitative text analysis methods
    • bag-of-words text pre-processing (“tokenization”) and representation (i.e., how to represent documents with word count vectors)
    • (conceptual) knowledge of dictionary analysis, topic modeling, and supervised text classification methods is strongly recommended
  • Basic knowledge of Python (a short self-check sketch follows below)
    • creating and manipulating strings, lists, and dictionaries
    • creating and interacting with objects, classes, and methods
    • reading and manipulating data frames with pandas
    • using loops
    • defining new functions
  • Basic knowledge of quantitative research methods
    • understanding of linear and logistic regression analysis
    • a basic understanding of matrix algebra might be helpful but is not required

For those who would like a primer or refresher in Python, we recommend taking the online workshop “Introduction to Python” (26-29 August) and/or the online blended learning course “Introduction to Computational Social Science with Python” (30 August-06 September).
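If you are unsure whether your Python skills match these prerequisites, the short, self-contained sketch below (made-up data, not course material) touches on the points listed above: strings, lists, dictionaries, a small class with a method, a function, a loop, and a pandas data frame. If each step feels routine, you should be well prepared.

```python
# Illustrative self-check for the Python prerequisites. All data are made up.
import pandas as pd

# strings, lists, and dictionaries
texts = ["The parliament passed the bill.", "Markets reacted calmly."]
word_counts = {t: len(t.split()) for t in texts}
print(word_counts)

# a small class with a method (objects, classes, methods)
class Document:
    def __init__(self, text: str):
        self.text = text

    def n_tokens(self) -> int:
        return len(self.text.split())

# a function and a loop, building a pandas data frame
def summarize(docs: list[Document]) -> pd.DataFrame:
    rows = []
    for doc in docs:
        rows.append({"text": doc.text, "n_tokens": doc.n_tokens()})
    return pd.DataFrame(rows)

print(summarize([Document(t) for t in texts]))
```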
     
Software and Hardware Requirements

  • You should bring your own laptop.
  • You should have Python (≥ 3.12), miniconda, and Jupyter Notebook installed.
  • Required Python libraries:
    • text processing: nltk, scikit-learn, gensim, tokenizers, datasets, transformers, openai, sentence-transformers, BERTopic, setfit
    • others: numpy, scipy, pandas
  • The instructors will distribute concrete instructions for the Python setup and a comprehensive list of required libraries before the course and will assist with any remaining setup problems on the first day of the course.
  • Parts of the exercises focusing on LLM prompting techniques will require participants to (i) sign up for an account with a commercial provider (OpenAI) and (ii) add credit to that account. The instructors will ensure that the costs for using commercial providers' models remain below US$10. Moreover, the instructors will present open-source alternatives that participants can use free of charge through Google Colab or locally on their own computers; an example of such an alternative is sketched below. The relevant information and setup instructions will be shared with registered participants four weeks before the course, so that the instructors can adapt to the rapidly evolving landscape of open-source models and software solutions.
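To illustrate what such a free, open-source alternative can look like, here is a minimal sketch (not the instructors' official setup) that uses the Hugging Face transformers pipeline, already on the library list above, to run zero-shot classification with an openly available model; the model choice and the example texts and labels are purely illustrative.

```python
# Illustrative sketch: zero-shot text classification with an open-source model
# via the Hugging Face transformers pipeline. Runs free of charge locally or on
# Google Colab; model, texts, and labels are example choices, not course setup.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

texts = [
    "The government announced stricter border controls.",
    "The central bank raised interest rates to curb inflation.",
]
labels = ["immigration", "economy", "environment"]

for text in texts:
    result = classifier(text, candidate_labels=labels)
    # The pipeline returns labels sorted by score; report the top one.
    print(f"{result['labels'][0]:<12} ({result['scores'][0]:.2f})  {text}")
```

The model runs on a CPU, though inference is considerably faster with a GPU, for example on Google Colab.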