Tel: +49 221 47694-579
Tel: +49 621 1246-211
From Embeddings to Transformers: Advanced Text Analysis with Python
25.09 - 29.09.2023
Mannheim B6, 4-5
Students: 500 €
Academics: 750 €
Commercial: 1500 €
Lecturer(s): Hauke Licht, Jennifer Victoria Scurrell
This course introduces social scientists to advanced, deep learning-based text analysis methods, including word embeddings and large neural language models such as the Transformer. Basic text analysis methods like counting words or n-grams have limitations in handling the complexity of natural language. By capturing the semantic relationships between words in the contexts in which they appear, text embedding methods and neural language modeling techniques help overcome these limitations.
Participants will learn about the conceptual motivation and methodological foundations of text embedding methods and large neural language models. Moreover, they will gain extensive practical experience applying these methods in social science research using the Python programming language. Beyond conveying a solid conceptual understanding and hands-on experience with these methods, the course puts a strong emphasis on introducing and discussing potential social science use cases as well as ethical considerations.
We will start by introducing classical word embedding models like GloVe and word2vec, and participants will learn how to use word embeddings in social science research. Specifically, participants will apply word embeddings, for example, to identify relevant keywords when expanding a dictionary or to identify semantic dimensions in their corpus, such as an emotion-reason dimension.

In the second part of the course, we will introduce state-of-the-art Transformer models like BERT and GPT. We will first cover their methodological foundations: the attention mechanism, (masked) language modeling, and the encoder-decoder architecture. Participants will then apply these models in exercises covering supervised learning and topic modeling with BERTopic.

This is an advanced-level course. Participants should have prior knowledge of basic text analysis techniques. Specifically, they should have experience with standard bag-of-words pre-processing techniques and text representation approaches, such as word count-based document-feature matrices.
For additional details on the course and a day-to-day schedule, please download the full-length syllabus.
Organisational Structure of the Course
The course will be organized as a mixture of lectures and exercise sessions. We will switch between lectures and exercises throughout the morning and afternoon sessions of the course. In the lecture sessions, we will focus on explaining core concepts and methods. In the exercise sessions, participants will apply their newly acquired knowledge. Both instructors will be available to answer questions and provide guidance during the entire course.
Prior knowledge of basic quantitative text analysis methods
Basic knowledge of Python
For those who would like a primer or refresher in Python, we recommend taking the online workshop “Introduction to Python”, which takes place from 4 to 6 September 2023.
Basic knowledge of quantitative research methods
Morning Session: We will begin the first day by getting to know each other and use this as an opportunity to learn about everyone's motivations for participating in the course. We will then outline the day-by-day schedule of the course.
In the second half of the morning session, we will begin the first of two thematic blocks of the course covering classic word embedding methods. Through a mixture of lectures and practical exercises, participants will review the limitations of count-based bag-of-words document representations (insensitivity to words' context, high dimensionality, and sparsity) and the methodological intuition that motivates embedding-based alternatives.
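To make the limitations discussed above concrete, here is a minimal sketch of a count-based document-feature matrix. The toy corpus is invented for illustration; it shows how the feature space grows with the vocabulary and how most cells end up zero.

```python
# A minimal illustration of why count-based bag-of-words representations
# become high-dimensional and sparse. The toy corpus is invented for
# illustration, not taken from the course materials.

corpus = [
    "the economy is growing",
    "the economy is shrinking",
    "parliament debates the new budget",
]

# Build the vocabulary (the feature space): one dimension per unique word.
vocabulary = sorted({word for doc in corpus for word in doc.split()})

# Document-feature matrix: rows = documents, columns = word counts.
dfm = [[doc.split().count(word) for word in vocabulary] for doc in corpus]

n_cells = len(dfm) * len(vocabulary)
n_zero = sum(cell == 0 for row in dfm for cell in row)
print(f"vocabulary size: {len(vocabulary)}")
print(f"sparsity: {n_zero / n_cells:.0%} of cells are zero")
```

Even with three short sentences, over half the matrix is zeros, and the count vectors carry no notion that "growing" and "shrinking" relate to the same semantic field; embeddings address exactly these gaps.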
Afternoon Session: In the afternoon session, we will introduce GloVe and word2vec - two popular word embedding models - and illustrate their commonalities and differences. The last part of the afternoon session will be reserved for our course-internal Help Café: we will ensure that everyone's Python environment and Jupyter Notebook setup is working and troubleshoot any remaining technical issues.
Morning Session: On the second day of the course, we will focus on using word embedding models. We will begin by demonstrating how to work with pre-trained word embedding models. Participants will then learn and practice how to compute with embeddings. For example, we will learn how to assess the similarity between two words by computing the cosine similarity between their embeddings. We will build on these examples to learn techniques for assessing the quality and validity of embeddings.
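As a preview of the kind of computation practiced in this session, here is a minimal sketch of cosine similarity between word embeddings. The 3-dimensional vectors below are made up for illustration; real pre-trained embeddings (e.g., GloVe) typically have 50-300 dimensions.

```python
# Cosine similarity between two word vectors, implemented from scratch.
# The toy embeddings are invented for illustration.
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

embeddings = {
    "king":  [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

# Semantically related words should score higher than unrelated ones.
print(cosine_similarity(embeddings["king"], embeddings["queen"]))
print(cosine_similarity(embeddings["king"], embeddings["apple"]))
```

With real embeddings, the same three lines of arithmetic underpin analogy tests, dictionary expansion, and the validity checks covered later in the day.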
Afternoon Session: In the first part of the afternoon session, we will show how to train an embedding model “from scratch” (i.e., on a new text corpus), using a number of pre-selected text corpora from the domains of politics and the media. We will take this exercise as an opportunity to cover the foundations of deep learning (backpropagation and stochastic gradient descent).
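The optimization machinery behind embedding training can be sketched in a few lines. Below is a toy example of stochastic gradient descent fitting a single parameter w in the model y_hat = w * x; the data and learning rate are invented for illustration. Real embedding models apply the same idea, with gradients obtained via backpropagation, to millions of parameters.

```python
# Stochastic gradient descent on a squared-error loss for a one-parameter
# toy model. Illustrative only: real embedding training optimizes a
# language-modeling objective over large corpora.
import random

random.seed(0)
true_w = 3.0
data = [(x, true_w * x) for x in [0.5, 1.0, 1.5, 2.0]]

w = 0.0            # initial parameter value
lr = 0.1           # learning rate
for _ in range(100):
    x, y = random.choice(data)   # "stochastic": one example at a time
    y_hat = w * x                # forward pass
    grad = 2 * (y_hat - y) * x   # d/dw of (w*x - y)^2, via the chain rule
    w -= lr * grad               # gradient descent step
print(round(w, 3))  # converges toward the true value 3.0
```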
In the second part of the afternoon session, we will then illustrate how one can employ word embeddings in social science research. We will focus our attention on two particular techniques: using word embeddings for dictionary expansion/keyword discovery and for constructing/extracting semantic dimensions.
Morning Session: In the morning session, we will continue the block from the previous day by introducing another important use case of word embeddings: creating document representations that can be used in downstream analyses such as supervised classification.
We will continue the morning session by pointing participants to two important advanced uses and extensions of standard word embedding models in social science research: (i) measuring over-time shifts in word meaning using dynamic embedding methods and (ii) computing embeddings for documents. Last but not least, we will discuss the limitations of classic word embedding models, focusing on the issues that arise for words with multiple senses and words whose meaning depends on sentence context.
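One common simple approach to the document-embedding use case mentioned above is to represent a document as the average of its words' vectors. The sketch below uses invented 2-dimensional toy vectors; with real embeddings, the resulting document vectors can feed directly into a supervised classifier.

```python
# Averaging word embeddings to obtain a document representation.
# Toy 2-dimensional vectors, invented for illustration.

embeddings = {
    "taxes":   [0.9, 0.1],
    "budget":  [0.8, 0.2],
    "climate": [0.1, 0.9],
    "energy":  [0.2, 0.8],
}

def document_embedding(doc):
    """Average the embeddings of all in-vocabulary words in `doc`."""
    vectors = [embeddings[w] for w in doc.split() if w in embeddings]
    dim = len(next(iter(embeddings.values())))
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

doc_vec = document_embedding("taxes budget")
print(doc_vec)
```

Averaging discards word order entirely, which is one reason the course moves on to contextualized Transformer representations in the next block.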
Afternoon Session: In the afternoon of day 3, we will move on to the second thematic block, which focuses on transformer models. The instructors will provide a brief introduction to transformer models and their advantages over traditional NLP models. We will delve directly into the subject matter through practical exercises illustrating how transformer models overcome the multiple-word-senses problem of “traditional” word embeddings by generating contextualized word embeddings.
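The mechanism that makes embeddings contextual can be sketched in miniature: in self-attention, each word's new representation is a weighted average of all words' vectors, with weights derived from scaled dot products. The sketch below omits the learned query/key/value projections and multi-head structure of real Transformers, and the toy vectors are invented.

```python
# A bare-bones self-attention sketch: each output vector is a convex
# combination of all input vectors, so the same word gets a different
# representation in a different sentence context.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(vectors):
    d = len(vectors[0])
    out = []
    for q in vectors:
        # Scaled dot-product scores of this word against every word.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in vectors]
        weights = softmax(scores)
        # Weighted average of all word vectors.
        out.append([sum(w * v[i] for w, v in zip(weights, vectors))
                    for i in range(d)])
    return out

# Toy vectors standing in for the words "river" and "bank": after
# attention, "bank" is pulled toward its neighbor's meaning.
sent = [[1.0, 0.0], [0.5, 0.5]]
contextualized = self_attention(sent)
print(contextualized)
```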
Morning Session: During the morning session, we will explore how transformer models can be used in social science research. Sentiment analysis, fake news detection, and topic identification are all high-level NLP tasks that can be accomplished with cutting-edge transformer models. We will discuss how model pre-training and fine-tuning work and deepen participants' understanding of neural language modeling through exercises on masked language models.
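The masked language modeling objective discussed above can be illustrated in a heavily simplified form: predict a held-out word from its context. Models like BERT learn this with a neural network over huge corpora; the sketch below just counts, in an invented toy corpus, which word most often fills the blank.

```python
# A count-based caricature of masked language modeling: predict the
# masked word in "the parliament [MASK] the bill" from a toy corpus.
# Real masked language models learn this prediction with a Transformer.
from collections import Counter

corpus = [
    "the parliament passed the bill",
    "the parliament rejected the bill",
    "the parliament passed the budget",
]

# Count every word observed between "parliament" and "the" ...
pattern_fillers = Counter()
for sent in corpus:
    words = sent.split()
    for i in range(1, len(words) - 1):
        if words[i - 1] == "parliament" and words[i + 1] == "the":
            pattern_fillers[words[i]] += 1

# ... and "predict" the most frequent filler for the masked position.
prediction = pattern_fillers.most_common(1)[0][0]
print(prediction)
```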
Afternoon Session: In the afternoon, participants will learn in a series of hands-on exercises how to fine-tune transformer models for different NLP tasks, such as supervised text classification, with Hugging Face's transformers library.
Morning Session: In the morning of day 5, the instructors will introduce BERTopic - a BERT-based approach to topic modeling. Participants will have time to experiment with this method through hands-on exercises.
We will then shift our focus to large language models like GPT. We will introduce participants to ChatGPT and discuss the ethical considerations surrounding large language models.
Afternoon Session: In the afternoon of day 5, we will recapitulate the course material of the previous days and answer open questions. At the end of the session, there will be time for 1-on-1 meetings where participants can consult the instructors about questions and problems they face in their research projects.