GESIS Training Courses

Scientific Coordination

Dr. Marlene Mauk
Tel: +49 221 47694-579

Administrative Coordination

Claudia O'Donovan-Bellante
Tel: +49 621 1246-221

Big Data Management and Analytics

About
Location: Mannheim B6, 4-5
 
General Topics
Course Level
Format
Software used
Duration
Language
Fees
Students: 500 €
Academics: 750 €
Commercial: 1500 €
Keywords
Additional links
Lecturer(s): Prof. Dr. Rainer Gemulla, Adrian Kochsiek

About the lecturer - Prof. Dr. Rainer Gemulla

About the lecturer - Adrian Kochsiek

Course description

This course introduces systems and techniques for storing, querying, and working with datasets that are too large, too complex, or simply too inconvenient to handle on a single machine or in a single programming language. Participants learn the foundations needed to work with available “Big Data systems” on their own, whether in a local installation or via cloud computing. The course is organized in a workshop format: morning sessions introduce and discuss key concepts and techniques, followed by practical sessions in which participants gain hands-on experience with selected systems and applications. Python is the main programming language of the course, as it is one of the most suitable languages for data science with large, complex datasets.
 
We start with an introduction to (or refresher on, depending on the participant's background) processing structured data (e.g., data frames), first directly in Python, then using a relational database system and the SQL query language for data access. Building on these foundations, the course introduces the large-scale computation engine Apache Spark for pre-processing and analysing data in a scalable fashion. We subsequently introduce and discuss non-relational data representation formats that are suitable for more complex data, most notably JSON (JavaScript Object Notation, for semi-structured data and documents) and, if time permits, RDF (Resource Description Framework, for graph data and knowledge graphs). The course concludes with an introduction to selected NoSQL databases that are useful for managing such data.
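
As a rough illustration of the first step described above (structured data handled directly in Python, then the same question asked in SQL), the following sketch uses only Python's standard library. The table `measurements`, its columns, and the sample rows are invented for illustration, and SQLite (via the built-in `sqlite3` module) is an assumption standing in for the relational database system used in the course.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (city, year, value) rows.
ROWS = [
    ("Mannheim", 2021, 10.0),
    ("Mannheim", 2022, 14.0),
    ("Cologne", 2021, 8.0),
    ("Cologne", 2022, 12.0),
]


def mean_by_city_python(rows):
    """Average value per city, computed directly in Python."""
    groups = defaultdict(list)
    for city, _year, value in rows:
        groups[city].append(value)
    return {city: sum(vals) / len(vals) for city, vals in groups.items()}


def mean_by_city_sql(rows):
    """The same aggregation expressed as an SQL query (in-memory SQLite)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE measurements (city TEXT, year INTEGER, value REAL)"
        )
        conn.executemany("INSERT INTO measurements VALUES (?, ?, ?)", rows)
        cur = conn.execute(
            "SELECT city, AVG(value) FROM measurements GROUP BY city ORDER BY city"
        )
        return dict(cur.fetchall())
    finally:
        conn.close()


if __name__ == "__main__":
    print(mean_by_city_python(ROWS))
    print(mean_by_city_sql(ROWS))
```

Both functions answer the same question. The SQL version leaves grouping and averaging to the database engine, which is what allows the approach to carry over to data that no longer fits comfortably in a Python process.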
 
For additional details on the course and a day-to-day schedule, please download the full-length syllabus.


Target group

Participants will find the course useful if:
▪ They want to work with large and/or complex datasets.
▪ They want to leverage available data management and processing solutions (either locally or in the cloud) for improved efficiency and ease of use.


Learning objectives

By the end of the course participants will:
  • Understand different data representations (including relational data, semi-structured data, and graph data) and their advantages/disadvantages.
  • Be able to process structured data in Python (using Pandas).
  • Know how to insert, update, and query structured data in a relational database system using the SQL query language (using MySQL).
  • Be familiar with the Apache Spark framework for performing computations on large datasets.
  • Be able to perform basic parallel data processing with Apache Spark.
  • Know basic types of NoSQL systems as well as their properties.
  • Be able to store, query, and process semi-structured data in a NoSQL database (e.g., Apache HBase or MongoDB).
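
To give a sense of what "semi-structured" means in the objectives above, here is a small sketch using Python's built-in `json` module. The documents and field names are invented for illustration; a document-oriented NoSQL store would hold such records natively, whereas a relational table would require a fixed set of columns.

```python
import json

# Hypothetical documents: records need not share the same fields.
RAW = """
[
  {"name": "Ada", "affiliation": {"city": "Mannheim"}, "tags": ["spark", "sql"]},
  {"name": "Ben", "tags": ["json"]},
  {"name": "Cleo", "affiliation": {"city": "Cologne", "dept": "CSS"}}
]
"""


def flatten(doc, prefix=""):
    """Flatten nested objects into dotted keys; lists are kept as values."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=path + "."))
        else:
            flat[path] = value
    return flat


def people_with_tag(docs, tag):
    """Names of documents whose optional 'tags' list contains the tag."""
    return [d["name"] for d in docs if tag in d.get("tags", [])]


docs = json.loads(RAW)
flat_docs = [flatten(d) for d in docs]
# The union of all keys shows which "columns" a flat table would need.
columns = sorted({key for d in flat_docs for key in d})

if __name__ == "__main__":
    print(columns)
    print(people_with_tag(docs, "sql"))
```

Flattening makes the trade-off visible: forcing these documents into one table yields columns that are empty for many rows, which is one reason formats like JSON are treated separately from relational data.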

Organizational structure of the course

The course is organized in a workshop format with 6 hours per day. Each day, we introduce and discuss key concepts and techniques in the morning, followed by practical sessions in the afternoon. In the afternoon sessions, participants gain hands-on experience with selected systems and applications through exercises and practical assignments. Lecturers will be available throughout to provide guidance and for individual consultations.


Prerequisites

  • We assume that participants already have experience with programming (e.g., in Python or R).
  • Although we will discuss fundamental aspects of working with data in Python, we strongly recommend that those with no Python experience either take part in the course Introduction to Computational Social Science with Python (week 1) or familiarize themselves with the very basics of Python beforehand.
  • Participants are expected to have a working Python environment installed (see below).

Software and hardware requirements

Participants need to bring a laptop with a Python 3 environment installed. Further installation instructions (i.e., which additional Python packages are needed) will be provided later on. The required data management software (such as MariaDB, Apache Spark or MongoDB) will be hosted on our servers and does not need to be installed locally.

Recommended related courses

  • Introduction to Machine Learning for Text Analysis with Python (Fall Seminar, Mannheim, Week 3)
  • Automated Image and Video Data Analysis with Python (Fall Seminar, Mannheim, Week 3)
  • Python 101 (Workshop, online, 31.08. - 01.09.2022)
  • Introduction to Computational Social Science with Python (Fall Seminar, Mannheim, Week 1)

