Scientific Coordination
Dr. Marlene Mauk
Tel: +49 221 47694-579
Administrative Coordination
Claudia O'Donovan-Bellante
Tel: +49 621 1246-221
Big Data Management and Analytics
About
Location:
Mannheim B6, 4-5
General Topics
Course Level
Format
Software used
Duration
Language
Fees
Students: 500 €
Academics: 750 €
Commercial: 1500 €
Keywords
Additional links
Lecturer(s): Prof. Dr. Rainer Gemulla, Adrian Kochsiek
Course description
This course introduces systems and techniques for storing, querying, and working with datasets that are too large, too complex, or simply too inconvenient to handle on a single machine or in a single programming language. Participants learn the foundations necessary to work with available “Big Data systems” on their own, whether in a local installation or via cloud computing. The course is organized in a workshop format, i.e., morning sessions that introduce and discuss key concepts and techniques, followed by practical sessions in which participants gain hands-on experience with selected systems and applications. Python is the main programming language of the course; it is one of the most suitable languages for data science with large, complex datasets.
We start with an introduction to (or refresher on, depending on the participants' background) processing structured data (e.g., data frames), first directly within Python, then using a relational database system and the SQL query language for data access. Building on these foundations, the course introduces the large-scale computation engine Apache Spark for pre-processing and analysing data in a scalable fashion. We subsequently introduce and discuss non-relational data representation formats that are suitable for more complex data, most notably JSON (JavaScript Object Notation, for semi-structured data and documents) and, if time permits, RDF (Resource Description Framework, for graph data and knowledge graphs). The course concludes with an introduction to selected NoSQL databases that are useful for managing such data.
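To give a flavour of the first steps, the following is a minimal sketch and not taken from the course materials. It computes the same grouped average once directly in Python and once with SQL, and then reads a small JSON document. The table name, column names, and all data values are hypothetical, and SQLite (shipped with Python) is used here only as a convenient stand-in for a relational database server.

```python
import json
import sqlite3
from collections import defaultdict

# Hypothetical toy data: (respondent_id, country, age)
rows = [
    (1, "DE", 34),
    (2, "DE", 51),
    (3, "FR", 28),
    (4, "FR", 45),
    (5, "IT", 39),
]


def mean_age_python(records):
    """Group by country and average the ages directly in Python."""
    groups = defaultdict(list)
    for _id, country, age in records:
        groups[country].append(age)
    return {country: sum(ages) / len(ages) for country, ages in groups.items()}


def mean_age_sql(records):
    """Express the same aggregation as a SQL query (in-memory SQLite)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE respondents (id INTEGER, country TEXT, age INTEGER)")
    con.executemany("INSERT INTO respondents VALUES (?, ?, ?)", records)
    cursor = con.execute(
        "SELECT country, AVG(age) FROM respondents GROUP BY country"
    )
    result = dict(cursor)  # each result row is a (country, mean_age) pair
    con.close()
    return result


# Semi-structured data: a JSON document may nest lists and objects,
# which does not map directly onto a flat table row.
doc = json.loads(
    '{"id": 6, "country": "ES", "languages": ["es", "ca"],'
    ' "address": {"city": "Madrid"}}'
)

if __name__ == "__main__":
    print(mean_age_python(rows))
    print(mean_age_sql(rows))
    print(doc["address"]["city"], doc["languages"])
```

Both functions return the same mapping from country to mean age. The SQL version delegates grouping and aggregation to the database, which is the idea that systems for larger datasets build on.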
For additional details on the course and a day-to-day schedule, please download the full-length syllabus.
Target group
Participants will find the course useful if:
▪ They want to work with large and/or complex datasets.
▪ They want to leverage available data management and processing solutions (either locally or in the cloud) for improved efficiency and ease of use.
Learning objectives
By the end of the course participants will:
Organizational structure of the course
The course is organized in a workshop format with 6 hours per day. Each day, we introduce and discuss key concepts and techniques in the morning, followed by practical sessions in the afternoon. In the afternoon sessions, participants gain hands-on experience with selected systems and applications through exercises and practical assignments. Lecturers will be available throughout to provide guidance and for individual consultations.
Prerequisites
Software and hardware requirements
Participants need to bring a laptop with a Python 3 environment installed. Additional installation instructions (e.g., for additional Python packages) will be provided later on. The required data management software (such as MariaDB, Apache Spark, or MongoDB) will be hosted on our servers and does not need to be installed locally.
Recommended related courses