GESIS Training Courses

Scientific Coordination

Verena Kunz

Administrative Coordination

Janina Götsche

Going Cross-Lingual: Computational Methods for Multilingual Text Analysis

About
Location:
Cologne / Unter Sachsenhausen 6-8
 
Fees:
Students: 300 €
Academics: 450 €
Commercial: 900 €
 
Lecturer(s): Hauke Licht, Fabienne Lind

About the lecturer - Hauke Licht

About the lecturer - Fabienne Lind

Course description

The wide-reaching and still growing digitalization of communication in the form of text data raises the demand for international, cross-lingual comparative research. For example, large multilingual text collections of political parties' campaign materials or politicians' parliamentary speeches invite cross-country comparative analyses of political behavior. Likewise, the availability of large collections of national news outlets' coverage of internationally relevant topics such as economic inequality, climate change, or immigration allows the comparative analysis of national perspectives.
 
Fortunately, a growing number of contributions to the (computational) social science literature present approaches for analyzing multilingual text collections with text-as-data methods. In this workshop, participants will learn about these approaches and about strategies for studying social science concepts in multilingual text collections with automated content analysis methods. Specifically, we will focus on (machine) translation, multilingual embedding, and transfer learning approaches.
 
We will focus on aspects relevant for applying these methods to compare concepts across socio-political contexts. Through a combination of theoretical discussions and practical exercises, participants will learn how to effectively apply (neural) machine translation and multilingual embedding techniques to analyze texts quantitatively across languages. Additionally, we will delve into the underlying assumptions that motivate these approaches and practice validating cross-lingual measurements.
 
By the end of the workshop, participants will have a strong understanding of key concepts and approaches in the existing multilingual text analysis literature, as well as the ability to implement them in R and/or Python through hands-on exercises.


Target group

Participants will find the course useful if they
  • have a background in the social sciences or humanities (e.g., communication science, economics, political science, sociology, or related fields)
  • are interested in applying quantitative text analysis methods in comparative, cross-lingual research
  • have an understanding of basic text analysis methods and want to advance their knowledge, skills, and practical experience
  • have some experience with quantitative text analysis but confront the challenge of multilinguality in their research


Learning objectives

Throughout the course, participants will
  • develop a comprehensive understanding of current methods, resources, and tools available for analyzing multilingual text.
  • gain practical experience in implementing state-of-the-art methods for cross-lingual text analysis.
  • acquire practical skills to identify and address challenges encountered while performing cross-lingual text analysis.
  • foster critical thinking skills to evaluate the applications and validation of various multilingual text analysis methods.
  • generate innovative ideas to apply the learned methods to their research, leading to new insights and discoveries.

Organizational structure of the course

The workshop introduces each topic in a lecture, followed by practical examples that illustrate the concepts. In the lectures, the instructors give a thorough overview of each topic, highlight the latest research methods, and introduce relevant resources. The lectures also include shorter interactive parts in which participants reflect on and discuss different methods in plenary and small-group settings. The practical part of the course consists of hands-on exercises in the lab, where participants work together on preselected data to apply the concepts learned. In addition, the instructors offer small coding challenges that participants can complete on their own or in groups after class hours on a voluntary basis. Solutions for both the in-class exercises and the voluntary take-home assignments are provided.


Prerequisites

  • Prior knowledge of basic quantitative text analysis methods:
      • bag-of-words text pre-processing (“tokenization”) and representation (i.e., how to represent documents with word count vectors)
      • conceptual knowledge of dictionary analysis, topic modeling, and supervised text classification methods (strongly recommended)
  • Basic knowledge of R and/or Python:
      • R: When we apply bag-of-words methods (e.g., dictionaries or topic modeling) in the course, we will provide R code. To be able to comfortably follow the course, you should have the following R programming skills:
          • create and manipulate vectors, data frames, and list objects
          • load tabular data files (e.g., CSVs) into R
          • perform simple operations (subsetting/filtering, indexing, creating/changing columns) on data frames
          • write simple for loops and custom functions
          • pre-process and tokenize text data with the quanteda or tm package (optional)
      • Python: Some of the methods we will cover are (mainly/only) available in Python (e.g., open-source neural machine translation or multilingual sentence embedding). Below, we list the tasks you should be able to perform in Python to comfortably follow the course. If you have never worked with Python before, you can contact the instructors, and they will provide you with a collection of links to learning materials that will allow you to attain the Python skills required for the course.
          • create and manipulate strings, lists, and dictionaries
          • load tabular data files (e.g., CSVs) with pandas
          • perform simple operations (subsetting/filtering, indexing, creating/changing columns) on pandas data frames
          • write simple for loops and custom functions
          • pre-process and tokenize text data with re or regex and nltk (optional)
          • train and evaluate a bag-of-words classifier with scikit-learn (optional)
  • Basic understanding of key ideas and concepts of quantitative research and measurement, especially quantification (i.e., turning qualitative, unstructured data/symbols into numbers), validity, and reliability.
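
As a rough self-check for the Python items in the list above, the following minimal sketch combines several of them: strings, lists, and dictionaries, a simple for loop, custom functions, regex tokenization, and word count vectors. It is not course material. The two toy documents and the function names `tokenize` and `count_vector` are made up for illustration, and only the Python standard library is used.

```python
import re
from collections import Counter

# Hypothetical toy documents in two languages (made up for illustration).
docs = {
    "en": "Climate change is a global challenge. Climate policy matters.",
    "de": "Der Klimawandel ist eine globale Herausforderung.",
}


def tokenize(text):
    """Lowercase a string and split it into word tokens with a regex."""
    return re.findall(r"\w+", text.lower())


def count_vector(tokens, vocabulary):
    """Return the counts of `tokens` in the fixed order of `vocabulary`."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# Tokenize every document with a simple for loop.
tokens_by_doc = {}
for doc_id, text in docs.items():
    tokens_by_doc[doc_id] = tokenize(text)

# Build one shared, sorted vocabulary and a count vector per document.
vocabulary = sorted({tok for toks in tokens_by_doc.values() for tok in toks})
vectors = {
    doc_id: count_vector(toks, vocabulary)
    for doc_id, toks in tokens_by_doc.items()
}
```

In this toy example the English and German documents share no vocabulary entries, so their count vectors never have a non-zero value in the same position. This is one simple way to see why bag-of-words representations do not compare directly across languages.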

Software and hardware requirements

  • Participants should bring their own laptops.
  • R setup
      • R (≥ 4.0.0) and RStudio installed
      • required packages
          • text processing: stringr, quanteda, topicmodels, stm, deeplr
          • others: readr, dplyr, tidyr, purrr, ggplot2
  • Python setup
      • Python (≥ 3.10), conda, and Jupyter Notebook installed (see this link)
      • required Python libraries
          • text processing: nltk, scikit-learn, gensim, easyNMT, transformers, sentence-transformers
          • others: numpy, scipy, pandas, matplotlib
  • The instructors will distribute concrete instructions for the Python setup and a comprehensive list of required libraries before the course and assist with any remaining setup problems on the first day of the course.
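
One way to check a Python setup against the list above before the course is sketched below. It is not the instructors' official setup script. The import names (for example `sklearn` for scikit-learn and `easynmt` for easyNMT) are assumptions about how these packages are usually imported, and the helper name `missing_libraries` is made up for illustration.

```python
import sys
from importlib.util import find_spec

# Package name -> import name. The import names are assumptions based on
# how these packages are commonly imported; adjust them if your setup differs.
REQUIRED = {
    "nltk": "nltk",
    "scikit-learn": "sklearn",
    "gensim": "gensim",
    "easyNMT": "easynmt",
    "transformers": "transformers",
    "sentence-transformers": "sentence_transformers",
    "numpy": "numpy",
    "scipy": "scipy",
    "pandas": "pandas",
    "matplotlib": "matplotlib",
}


def missing_libraries(required):
    """Return the package names whose import name cannot be found."""
    return [pkg for pkg, module in required.items() if find_spec(module) is None]


if __name__ == "__main__":
    # The course page asks for Python 3.10 or newer.
    if sys.version_info < (3, 10):
        print("Python version is older than 3.10")
    missing = missing_libraries(REQUIRED)
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All listed libraries were found")
```

The check only tests whether each library can be located, not whether its version matches what the course needs, so the setup instructions distributed by the instructors take precedence.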

Agenda

Wednesday, 06.12.
10:00 - 11:30  Introduction to the topic, overview of applications and main problems (input by instructors)
11:30 - 11:45  Coffee break
11:45 - 13:00  Introduction to the main solution approaches (input by instructors + group discussion)
13:00 - 14:00  Lunch break
14:00 - 14:30  Valid data selection in multilingual & multi-context scenarios (input by instructors)
14:30 - 15:30  Data source selection (group exercise)
15:30 - 15:45  Coffee break
15:45 - 17:00  Search string/keyword selection and testing (hands-on exercise in the lab with preselected data)

Thursday, 07.12.
09:30 - 11:00  Machine translation, multilingual embeddings, large language models (input by instructors)
11:00 - 11:15  Coffee break
11:15 - 12:30  Implementing the main solutions for supervised machine learning (hands-on exercise in the lab with preselected data, code with solutions is prepared)
12:30 - 13:30  Lunch break
13:30 - 15:00  Implementing the main solutions for unsupervised machine learning (hands-on exercise in the lab with preselected data, code with solutions is prepared)
15:00 - 15:15  Coffee break
15:15 - 15:45  Valid outputs in multilingual & multi-context scenarios (input by instructors)
15:45 - 16:30  Creation of a validation benchmark (group exercise)

Friday, 08.12.
09:30 - 10:15  Valid inputs and processes in multilingual & multi-context scenarios (input by instructors)
10:15 - 11:00  Pre-processing of multilingual data (hands-on exercise in the lab with preselected data, code with solutions is prepared)
11:00 - 11:15  Coffee break
11:15 - 12:30  Process monitoring of multilingual data (hands-on exercise in the lab with preselected data, code with solutions is prepared)
12:30 - 13:30  Lunch break
13:30 - 15:00  Lecturers are available for individual consultations on participants' projects. Participants can also use the time to work on their projects. We also prepare case studies for participants who prefer to work on prepared datasets and questions.
15:00 - 15:15  Coffee break
15:15 - 16:30  Lecturers are available for individual consultations on participants' projects. Participants can also use the time to work on their projects or on the prepared examples the instructors provide.


Recommended readings