Scientific Coordination
Dr. Marlene Mauk
Tel: +49 221 47694-579
Administrative Coordination
Noemi Hartung
Tel: +49 621 1246-211
Big Data and Computation for Social Data Science
About
Location:
Mannheim B6, 4-5
Course duration:
09:00-16:00 CEST
General Topics:
Course Level:
Format:
Software used:
Duration:
Language:
Fees:
Students: 500 €
Academics: 750 €
Commercial: 1500 €
Keywords
Additional links
Lecturer(s): Akitaka Matsuo, David (Yen-Chieh) Liao
Course description
This course is intended for social science researchers and practitioners who wish to gain insight by analyzing large data sets (“big data”). It teaches the infrastructure for data manipulation and analysis and how to use that infrastructure from statistical and programming languages.
The amount of data available to social scientists is increasing every year, and such large amounts of data have the potential to provide novel insights that were previously out of reach. However, as the volume of data increases, it becomes less feasible to load and process it on a personal computer. What is needed in such cases are databases for data storage, and parallel processing and distributed computing systems for data processing and computation. Learning about these tools is the objective of this course.
With regard to database systems, after covering the basic concepts, participants will learn SQL, the most widely used language for relational databases, and the database management systems built around it. As a more advanced topic, we will give an overview of non-relational (NoSQL) databases, especially MongoDB, which is well suited to storing large amounts of unstructured data (e.g. text data). For data processing and computation, students will learn how to parallelize data processing and analysis and how to use distributed computation systems, such as Apache Spark.
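As a rough illustration of the kind of SQL workflow described above, the following minimal sketch uses Python's built-in sqlite3 module. SQLite is only a stand-in here, since the course description does not name a specific relational database system, and the table name, column names, and sample rows are hypothetical. The idea is that the aggregation is expressed in SQL and executed by the database, rather than by loading all rows into the analysis session.

```python
import sqlite3


def mean_score_by_country(rows):
    """Load (country, score) rows into an in-memory SQLite table
    and compute the mean score per country with a SQL query."""
    conn = sqlite3.connect(":memory:")  # temporary database, nothing written to disk
    try:
        # Hypothetical table layout, for illustration only
        conn.execute("CREATE TABLE responses (country TEXT, score REAL)")
        conn.executemany("INSERT INTO responses VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT country, AVG(score) "
            "FROM responses "
            "GROUP BY country "
            "ORDER BY country"
        )
        return cursor.fetchall()
    finally:
        conn.close()


# Made-up sample data
survey_rows = [("DE", 4.0), ("DE", 2.0), ("FR", 5.0)]
result = mean_score_by_country(survey_rows)
print(result)
```

The same query would run largely unchanged against a server-based relational database; only the connection step would differ.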
To learn these technologies, both theory and practice are essential, so the course pairs each lecture with a lab. The primary programming language will be R, as it is familiar to most quantitative social scientists. Given the increasing importance of Python in social data science, the course will also show, where appropriate, how to do in Python what we have learned in R.
For additional details on the course and a day-to-day schedule, please download the full-length syllabus.
Target group
Participants will find the course useful if:
Learning objectives
By the end of the course, students should have a good understanding of how to work with both SQL and NoSQL databases in R and how to leverage distributed computation systems like Spark for large-scale data processing. They should also be able to work with databases and compute clusters in the cloud. To be more concrete:
Organisational Structure of the Course
Each day of the course will have two 3-hour units. Each unit will include both lectures and labs.
In the lab, students will receive exercise problems, posed mainly in R, and work on them in the time allotted by the instructor. Students will work on the questions together with other students, and both instructors will be present in the classroom to answer any questions that come up. The instructor will then present the solution and, where possible, demonstrate how to do the same thing in Python.
The instructors will also hold office hours after class, where students can ask questions about the lectures and labs and consult the instructors on methodological issues in their own research projects.
Prerequisites
For those who would like a primer or refresher in R or Python, we recommend taking the online workshop “Introduction to R”, which takes place from 05-07 September 2023, or the online workshop “Introduction to Python”, which takes place from 04-06 September 2023.
Software and hardware requirements
In this course, we will access various cloud data and computational environments, mainly using R with RStudio and Python with JupyterLab as clients. Participants should bring their own laptops with the following software installed:
- R (preferably latest, minimum 4.1.0)
- RStudio (latest)
- Miniconda (latest)
- Git environment (for Windows users who do not have a Bash environment)
R and RStudio should be installed beforehand, and Windows users should install Git for Windows. For Python, please install Miniconda in advance; the conda environment itself will be built in the lab. Detailed instructions for installing packages and additional software (e.g., VS Code, MongoDB client) will be provided during the lectures and labs.
Agenda
Monday, 11.09. | |
Morning Session | |
Afternoon Session | |
Tuesday, 12.09. | |
Morning Session | |
Afternoon Session | |
Wednesday, 13.09. | |
Morning Session | |
Afternoon Session | |
Thursday, 14.09. | |
Morning Session | |
Afternoon Session | |
Friday, 15.09. | |
Morning Session | |
Afternoon Session |