GESIS Training Courses

Scientific Coordination

Dr. Marlene Mauk
Tel: +49 221 47694-579

Administrative Coordination

Noemi Hartung
Tel: +49 621 1246-211

From Embeddings to Transformers: Advanced Text Analysis with Python

About
Location: Mannheim B6, 4-5
Course duration: 09:30-16:30 CEST
Fees:
Students: 500 €
Academics: 750 €
Commercial: 1500 €
 
Lecturer(s): Hauke Licht, Jennifer Victoria Scurrell


Course description

This course introduces social scientists to advanced, deep-learning-based text analysis methods such as word embeddings and large neural language models like the Transformer. Basic text analysis methods like counting words or n-grams are limited in their ability to handle the complexity of natural language. By capturing the semantic relationships between words in the contexts in which they appear, text embedding methods and neural language modeling techniques help overcome these limitations.
Participants will learn about the conceptual motivation and methodological foundations of text embedding methods and large neural language models. They will also gain extensive practical experience applying these methods in social science research using the Python programming language. Beyond conveying a solid conceptual understanding and hands-on experience, the course puts a strong emphasis on introducing and discussing potential social science use cases as well as ethical considerations.
We will start by introducing classical word embedding models like GloVe and word2vec, and participants will learn how to use word embeddings in social science research. Specifically, participants will apply word embeddings, for example, to identify relevant keywords when expanding a dictionary or to identify semantic dimensions in their corpus, such as an emotion-reason dimension. In the second part of the course, we will introduce state-of-the-art Transformer models like BERT and GPT. We will first cover their methodological foundations: the attention mechanism, (masked) language modeling, and the encoder-decoder architecture. Participants will then apply these models in exercises covering supervised learning and topic modeling with BERTopic. This is an advanced-level course. Participants should have prior knowledge of basic text analysis techniques. Specifically, they should have experience with standard bag-of-words pre-processing techniques and text representation approaches, such as word count-based document-feature matrices.
For additional details on the course and a day-to-day schedule, please download the full-length syllabus.


Target group

Participants will find the course useful if:
  • they have a background in the social sciences or humanities (e.g., communication science, economics, political science, sociology, or related fields)
  • they have a solid understanding of basic text analysis methods and want to advance their knowledge, skills, and practical experience


Learning objectives

By the end of the course, participants will:
  • know the methodological foundations of text embedding methods and large neural language models (at a conceptual level)
  • be able to apply these methods to analyze social science text data
  • be able to reflect critically on the application of these techniques in social science research, including relevant ethical considerations


Organisational Structure of the Course

The course will be organized as a mixture of lectures and exercise sessions. We will switch between lectures and exercises throughout the morning and afternoon sessions. In the lecture sessions, we will focus on explaining core concepts and methods. In the exercise sessions, participants will apply their newly acquired knowledge. Both instructors will be available to answer questions and provide guidance throughout the course.


Prerequisites

Prior knowledge of basic quantitative text analysis methods:
  • bag-of-words text pre-processing ("tokenization") and representation (i.e., how to represent documents with word count vectors)
  • (conceptual) knowledge of dictionary analysis, topic modeling, and supervised text classification methods is strongly recommended

Basic knowledge of Python:
  • creating and manipulating strings, lists, and dictionaries
  • creating and interacting with objects, classes, and methods
  • using loops and defining new functions

For those who would like a primer or refresher in Python, we recommend taking the online workshop "Introduction to Python", which takes place from 04-06 September 2023.

Basic knowledge of quantitative research methods:
  • knowledge of basic statistics (distributions, correlation)
  • understanding of linear and logistic regression analysis
  • a basic understanding of matrix algebra might be helpful but is not required

Software and hardware requirements
  • Participants should bring their own laptops.
  • They should have Python (≥ 3.10), conda, and Jupyter Notebook installed (see this link).
  • Required Python libraries:
      • text processing: nltk, scikit-learn, gensim, transformers
      • others: numpy, scipy, tqdm
  • The instructors will distribute concrete instructions for the Python setup and a comprehensive list of required libraries before the course and will assist with any remaining setup problems on the first day of the course.
    Agenda
    Monday, 25.09.
    Morning Session
    We will begin the first day by getting to know each other and use this as an opportunity to learn about everyone's motivations for participating in the course. We will then outline the day-by-day schedule of the course.
    In the second half of the morning session, we will begin the first of two thematic blocks of the course, covering classic word embedding methods. Through a mixture of lectures and practical exercises, participants will review the limitations of count-based bag-of-words document representations (insensitivity to word context, high dimensionality, and sparsity) and the methodological intuition that motivates embedding-based alternatives.
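As a small preview of these ideas (an illustrative sketch, not course material), a count-based document-feature matrix can be built with scikit-learn's CountVectorizer; even for a made-up three-document corpus, most entries are zero:

```python
from sklearn.feature_extraction.text import CountVectorizer

# A tiny toy corpus (illustrative, not one of the course corpora)
docs = [
    "the economy is growing",
    "the economy is shrinking",
    "voters worry about the economy",
]

vectorizer = CountVectorizer()
dfm = vectorizer.fit_transform(docs)  # sparse document-feature matrix

print(dfm.shape)  # → (3, 8): 3 documents, 8 unique words
# Sparsity: each document uses only a few of the 8 vocabulary words
print(f"{dfm.nnz} of {dfm.shape[0] * dfm.shape[1]} entries are non-zero")  # 13 of 24
```

With a realistic corpus the vocabulary grows into the tens of thousands, which makes the dimensionality and sparsity problems noted above far more severe.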
    Afternoon Session
    In the afternoon session, we will introduce GloVe and word2vec, two popular word embedding models, and illustrate their commonalities and differences. The last part of the afternoon session will be reserved for our course-internal Help Café: we will ensure that everyone's Python environment and Jupyter Notebook setup is working and troubleshoot any remaining technical issues.
    Tuesday, 26.09.
    Morning Session
    On the second day of the course, we will focus on using word embedding models. We will begin by demonstrating how to work with pre-trained word embedding models. Participants will then learn and practice how to compute with embeddings. For example, we will assess the similarity between two words by computing the cosine similarity between their embeddings. We will build on these examples to learn techniques for assessing the quality and validity of embeddings.
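The cosine similarity mentioned here is a one-liner with numpy. The vectors below are made-up stand-ins for trained word embeddings, chosen only to illustrate the computation:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 4-dimensional "embeddings" (illustrative values only)
king = np.array([0.8, 0.3, 0.1, 0.5])
queen = np.array([0.7, 0.4, 0.2, 0.5])
apple = np.array([0.1, 0.9, 0.8, 0.0])

print(cosine_similarity(king, queen))  # close to 1: similar words
print(cosine_similarity(king, apple))  # much lower: unrelated words
```

In practice one would query these similarities directly from a trained model (e.g., via gensim's KeyedVectors), but the underlying arithmetic is exactly this.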
    Afternoon Session
    In the first part of the afternoon session, we will show how to train an embedding model "from scratch" (i.e., on a new text corpus), using a number of pre-selected text corpora from the domains of politics and the media. We will take this exercise as an opportunity to cover the foundations of deep learning (backpropagation and stochastic gradient descent).
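As a minimal sketch of stochastic gradient descent (our own toy example, unrelated to the course corpora), a single parameter can be learned by updating it one randomly chosen example at a time:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 2x + noise; we recover the slope 2.0 with SGD
x = rng.normal(size=200)
y = 2.0 * x + 0.1 * rng.normal(size=200)

w = 0.0    # the single parameter we learn
lr = 0.05  # learning rate
for epoch in range(20):
    for i in rng.permutation(len(x)):        # one random example at a time
        grad = 2 * (w * x[i] - y[i]) * x[i]  # gradient of the squared error
        w -= lr * grad                       # gradient descent step

print(round(w, 2))  # close to the true slope 2.0
```

Training a word embedding model applies the same update logic, only to millions of parameters, with the gradients delivered by backpropagation.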
     
    In the second part of the afternoon session, we will then illustrate how one can employ word embeddings in social science research. We will focus our attention on two particular techniques: using word embeddings for dictionary expansion/keyword discovery and for constructing/extracting semantic dimensions.
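Both techniques reduce to simple vector arithmetic. The sketch below uses a made-up toy embedding table (not a trained model): nearest neighbours of a seed word illustrate dictionary expansion, and projection onto a difference vector illustrates a semantic dimension:

```python
import numpy as np

# Made-up 3-dimensional embeddings (illustrative values only)
emb = {
    "angry":   np.array([0.9, 0.1, 0.1]),
    "furious": np.array([0.8, 0.2, 0.1]),
    "calm":    np.array([0.1, 0.9, 0.2]),
    "budget":  np.array([0.1, 0.1, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Dictionary expansion: rank candidate words by similarity to a seed word
seed = "angry"
candidates = sorted(
    (w for w in emb if w != seed),
    key=lambda w: cosine(emb[seed], emb[w]),
    reverse=True,
)
print(candidates)  # "furious" comes first: the best expansion candidate

# Semantic dimension: project words onto an emotion axis (angry minus calm)
axis = emb["angry"] - emb["calm"]
for word in emb:
    print(word, round(cosine(emb[word], axis), 2))
```

With real embeddings the seed and axis words would come from theory (e.g., an emotion-reason dimension), but the ranking and projection steps are the same.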
    Wednesday, 27.09.
    Morning Session
    In the morning session, we will continue the block from the previous day by introducing another important use case of word embeddings: creating document representations that can be used in downstream analyses such as supervised classification.
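A common simple baseline for such document representations is to average the embeddings of the words a document contains; the toy vectors below are illustrative only:

```python
import numpy as np

# Toy word embeddings (illustrative values, not from a trained model)
emb = {
    "taxes": np.array([0.9, 0.1]),
    "rise":  np.array([0.7, 0.3]),
    "cats":  np.array([0.1, 0.9]),
}

def doc_embedding(tokens):
    """Average the embeddings of the known tokens in a document."""
    vectors = [emb[t] for t in tokens if t in emb]
    return np.mean(vectors, axis=0)

print(doc_embedding(["taxes", "rise"]))  # the mean of the two word vectors
```

The resulting dense vectors can then serve as features for a classifier in place of sparse word counts.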
     
    We will continue the morning session by pointing participants to two important advanced uses and extensions of standard word embedding models in social science research: (i) measuring over-time shifts in word meaning using dynamic embedding methods and (ii) computing embeddings for documents. Last but not least, we will discuss the limitations of classic word embedding models, focusing on the issues that arise for words with multiple senses and words whose meaning depends on sentence context.
    Afternoon Session
    In the afternoon of day 3, we will move on to the second thematic block, focusing on transformer models. The instructors will provide a brief introduction to transformer models and their advantages over traditional NLP models. We will then delve directly into the subject matter through practical exercises illustrating how transformer models overcome the multiple-word-senses problem of "traditional" word embeddings by generating contextualized word embeddings.
    Thursday, 28.09.
    Morning Session
    During the morning session, we will explore how transformer models can be used in social science research. Sentiment analysis, fake news detection, and topic identification are all high-level NLP tasks that can be accomplished with cutting-edge transformer models. We will discuss how model pre-training and fine-tuning work, and we will deepen participants' understanding of neural language modeling through exercises on masked language models.
    Afternoon Session
    In the afternoon, participants will learn, in a series of hands-on exercises, how to fine-tune transformer models for different NLP tasks, such as supervised text classification, using Hugging Face's transformers library.
    Friday, 29.09.
    Morning Session
    In the morning of day 5, the instructors will introduce BERTopic, a BERT-based approach to topic modeling. Participants will have time to experiment with this method through hands-on exercises.
    We will then shift our focus to large language models like GPT. We will introduce participants to ChatGPT and discuss the ethical considerations surrounding large language models.
    Afternoon Session
    In the afternoon of day 5, we will recapitulate the course material of the previous days and answer open questions. At the end of the session, there will be time for 1-on-1 meetings where participants can consult the instructors about questions and problems they face in their research projects.