A course of Statistics and Data Science master, Leiden University.
- Szymon M. Kiełbasa [LUMC/BDS], coordinator
- Ramin Monajemi [LUMC/BDS]
- Mo Arkani [LUMC/BDS]
The course offers a practical introduction to a few programming languages and tools currently used in data science:
- Python is a general-purpose, high-level, easy-to-learn programming language. It provides a large number of data science libraries (e.g. for machine learning, neural networks, data manipulation, data visualization).
- SQL is a standard language used to create, query, update and manage relational databases. For example, such databases are used in data science to store large tables with results of experiments.
- Git is a tool for tracking changes in files during program development. It is the current standard for collaborative code development.
During the course the students will write Python programs of growing complexity (from basic coding examples to fitting a machine learning model). After the course the students shall be able to program simple reproducible data analyses (consisting of data reading, cleaning, simple modelling, and reporting steps). The state-of-the-art Python-specific data manipulation/visualization (pandas, Matplotlib) and data science libraries will be discussed.
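As a rough illustration of the read, clean, model and report steps named above, the sketch below uses only the Python standard library; the course itself relies on pandas and Matplotlib for this kind of work. The CSV content, the column names (`dose`, `response`) and the function names are invented for the example.

```python
import csv
import io
from statistics import mean

# Hypothetical measurements; one row has a missing value to exercise cleaning.
RAW_CSV = """dose,response
1,2.1
2,3.9
3,
4,8.1
"""


def read_rows(text):
    """Read CSV text into a list of dictionaries (one per row)."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Drop rows that cannot be parsed and convert the rest to floats."""
    cleaned = []
    for row in rows:
        try:
            cleaned.append((float(row["dose"]), float(row["response"])))
        except (TypeError, ValueError):
            continue  # missing or malformed value: skip the row
    return cleaned


def fit_line(points):
    """Fit response = slope * dose + intercept by ordinary least squares."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_mean, y_mean = mean(xs), mean(ys)
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in points)
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    slope = s_xy / s_xx
    return slope, y_mean - slope * x_mean


# Reporting step: a one-line textual summary of the fitted model.
points = clean(read_rows(RAW_CSV))
slope, intercept = fit_line(points)
print(f"{len(points)} usable rows; response = {slope:.2f} * dose + {intercept:.2f}")
```

Keeping each step in its own function makes the analysis easy to rerun on new data, which is the sense of "reproducible" intended here.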
Fundamentals of relational databases and of the SQL language will be presented in the context of an example database (SQLite). The database will be accessed through direct SQL statements and through a high-level, object-oriented Python library (SQLAlchemy).
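To give a flavour of the direct SQL access mentioned above, here is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database. The table and column names (`experiment`, `measurement`, `value`) are invented for illustration; the example database used in the course will differ, and SQLAlchemy access is not shown.

```python
import sqlite3

# In-memory database, so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")

# Data definition: two hypothetical tables linked by a foreign key.
conn.executescript("""
CREATE TABLE experiment (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE measurement (
    id            INTEGER PRIMARY KEY,
    experiment_id INTEGER NOT NULL REFERENCES experiment(id),
    value         REAL NOT NULL
);
""")

# Modification: parameterised INSERT statements.
conn.executemany(
    "INSERT INTO experiment (id, name) VALUES (?, ?)",
    [(1, "control"), (2, "treated")],
)
conn.executemany(
    "INSERT INTO measurement (experiment_id, value) VALUES (?, ?)",
    [(1, 1.0), (1, 3.0), (2, 5.0)],
)
conn.commit()

# Query: join the tables, then group and summarise per experiment.
summary = conn.execute("""
SELECT e.name, COUNT(*) AS n, AVG(m.value) AS mean_value
FROM experiment AS e
JOIN measurement AS m ON m.experiment_id = e.id
GROUP BY e.name
ORDER BY e.name
""").fetchall()

for name, n, mean_value in summary:
    print(name, n, mean_value)

conn.close()
```

The `?` placeholders let the database driver handle quoting, which is generally safer than building SQL strings by hand.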
First, the students will work alone and practice individual code development. Later, shared code development will be practiced in groups. The students will be requested to use git to track changes in their code and to share their code with other students through GitHub.
Finally, the relevance of data stewardship and FAIR principles (Findable, Accessible, Interoperable, Reusable) will be discussed.
During the course you will practice writing Python code. After the course you will be able to:
- ✍️ use Python collections
- ✍️ use Python flow control statements (including exceptions), context managers (with) and define user functions
- 🚫 understand Python classes (instance variables, methods, inheritance)
- ✍️ use Python standard libraries (e.g. reading/writing files in different formats)
- ✍️ use common data science libraries (NumPy, pandas, Matplotlib)
- 🚫 understand relational databases and use SQL to create, query, update a database
- 🚫 understand basics of SQLAlchemy for Python object-oriented database access
- ✍️ understand how to execute several machine learning algorithms
- 🚫 use git and GitHub for individual and collaborative code development
- 🚫 explain the relevance of data stewardship and FAIR principles for scientific research
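To make the first few objectives concrete, the short sketch below combines a collection (a dict), flow control, exception handling, a context manager and user-defined functions. The function names and the sample text are invented for illustration and are not taken from the course materials.

```python
import tempfile
from pathlib import Path


def count_words(path):
    """Return a dict mapping each lower-cased word in the file to its count."""
    counts = {}
    # The context manager closes the file even if an error occurs while reading.
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            for word in line.lower().split():
                counts[word] = counts.get(word, 0) + 1
    return counts


def safe_count_words(path):
    """Like count_words, but return an empty dict when the file is missing."""
    try:
        return count_words(path)
    except FileNotFoundError:
        return {}


# Hypothetical sample file, written to a temporary folder that is removed afterwards.
with tempfile.TemporaryDirectory() as folder:
    sample = Path(folder) / "sample.txt"
    sample.write_text("Data science\ndata stewardship\n", encoding="utf-8")
    counts = safe_count_words(sample)
    missing = safe_count_words(Path(folder) / "no_such_file.txt")

# Sort by descending count, then alphabetically.
top_words = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
print(top_words)
```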
Check the Essentials for Data Science course (4433EDASCY) at https://rooster.universiteitleiden.nl/schedule.
Feb. 6th, 2023:
- General course introduction
- Git/GitHub introduction
- Python notebooks
- Python basics
- Memory organization
- Python lists and tuples
Feb. 13th: (03)
Feb. 20th:
- Python flow control and user functions
- 📙 Assignment A: start
Feb. 27th: (05)
Mar. 6th:
- Python standard libraries and scripts
- 📗 Assignment B: start
Mar. 13th:
- Data manipulation: NumPy [Exercises] [Solutions]
- 📙 Assignment A: primary deadline, 17:00
Mar. 20th: (08)
Apr. 3rd:
- Data visualisation [Exercises]
- 📗 Assignment B: primary deadline, 23:59
- 📘 Assignment C: start
Apr. 17th:
- Relational databases:
  - SQL language:
    - Downloading and connecting to the example database
    - Querying and selecting data [Exercises]
    - Grouping and summarising [Exercises]
Apr. 24th:
- Relational databases:
  - SQL language:
    - Modification statements [Exercises]
    - Data definition language
    - Joining tables 1 [Exercises]
    - Joining tables 2 (self joins, CROSS JOIN, subqueries, EXISTS) [Exercises]
- 📚 Group Assignment: start
May 1st:
- Python SQL Toolkit and Object Relational Mapper (SQLAlchemy)
- 📘 Assignment C: primary deadline, 23:59
May 8th:
- Git branching and merging
- General Q&A and group assignment Q&A, programming practice
May 15th:
- Machine learning libraries (examples)
May 22nd:
- FAIR & data stewardship
- 📝 Data stewardship quiz: start
June 5th:
- 🏢 Exam
June 12th:
- 📙 📗 📘 Assignments A, B, C: secondary deadline, 23:59
- 📝 Data stewardship quiz: deadline, 23:59
- 📚 Group Assignment: deadline, 23:59
June 26th:
- 🏢 Retake
- Components of the final grade:
  - Assignments A, B, C (each of weight 1; total weight 3):
    - Assignments A, B and C are graded separately.
    - The grade range is 1-10; if the primary deadline is missed, the maximum grade is 8.
    - The mean grade of Assignments A, B and C is calculated and then rounded to 0.2 steps (e.g. ...7.6, 7.8, 8.0...).
    - To pass the course, the rounded mean grade of Assignments A, B, C must be greater than 5.5.
    - The rounded mean grade of Assignments A, B, C has weight=3 in the final grade.
  - Group Assignment (weight 3):
    - The grade range is 1-10, rounded to 0.2 steps.
    - To pass the course, the rounded group assignment grade must be greater than 5.5.
    - The rounded group assignment grade has weight=3 in the final grade.
  - Data stewardship quiz:
    - To pass the course, the quiz must be completed with a PASS result.
    - The quiz grade is not part of the final grade formula.
  - Exam/Retake (weight 4):
    - The grade range is 1-10, rounded to 0.2 steps.
    - To pass the course, the exam/retake grade must be greater than 5.5.
    - The exam/retake grade has weight=4 in the final grade.
    - The exam will cover the course objectives marked with ✍️.
    - The exam will not cover the course objectives marked with 🚫; these objectives are evaluated in the group assignment and the quiz.
- Final grade:
  - The final grade is calculated as a weighted mean of the component grades.
  - To pass the course, the final grade must be greater than or equal to 6.0.
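The grading rules above can be sketched in a few lines of Python. This is an unofficial illustration, not the tool used by the teachers: the function names are invented, the way ties are rounded to 0.2 steps is an assumption, and the final grade is left unrounded because the rules do not say otherwise.

```python
from statistics import mean


def round_to_fifths(grade):
    """Round a grade to the nearest 0.2 step (tie-breaking is an assumption)."""
    return round(grade * 5) / 5


def assignment_grade(raw_grade, met_primary_deadline):
    """Cap an assignment grade at 8 when the primary deadline was missed."""
    return raw_grade if met_primary_deadline else min(raw_grade, 8.0)


def component_grades(assignment_grades, group_grade, exam_grade):
    """Return the rounded (assignments mean, group, exam) component grades."""
    return (
        round_to_fifths(mean(assignment_grades)),
        round_to_fifths(group_grade),
        round_to_fifths(exam_grade),
    )


def final_grade(assignment_grades, group_grade, exam_grade):
    """Weighted mean with weights 3 (assignments), 3 (group) and 4 (exam)."""
    assignments, group, exam = component_grades(
        assignment_grades, group_grade, exam_grade
    )
    return (3 * assignments + 3 * group + 4 * exam) / 10


def passes(assignment_grades, group_grade, exam_grade, quiz_passed):
    """Apply all pass conditions: quiz, each component > 5.5, final >= 6.0."""
    components = component_grades(assignment_grades, group_grade, exam_grade)
    return (
        quiz_passed
        and all(grade > 5.5 for grade in components)
        and final_grade(assignment_grades, group_grade, exam_grade) >= 6.0
    )


# Hypothetical student: Assignment A was handed in after the primary deadline.
grades = [assignment_grade(9.0, False), 7.0, 9.0]
print(final_grade(grades, 7.4, 6.6), passes(grades, 7.4, 6.6, True))
```

Note that a high weighted mean is not sufficient on its own: a failed quiz or any component at or below 5.5 fails the course.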
For the course you will need to bring a laptop with properly installed Python and a development environment.
Install (in the order listed below):
- Python (version >= 3.?.?): Follow the download instructions at https://www.python.org/.
- pip: The Python Package Installer. It should already be installed during Python installation. If that is not the case, follow https://pip.pypa.io/en/stable/installation/.
- Visual Studio Code: A free source-code editor made by Microsoft for Windows, Linux and macOS. Follow the instructions at https://code.visualstudio.com/.
- Jupyter Notebook (optional): A (web) application for creating and sharing computational documents. Follow the instructions at https://jupyter.org/.
Moreover, you will need:
- git: A free and open-source distributed version control system. Follow the download instructions provided at https://git-scm.com/. Additional graphical (GUI) clients will not be used during the course but might be useful.
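After installation, a quick check along the following lines may help confirm that the tools are reachable. It is only a sketch: the function name is invented, and it assumes that Visual Studio Code and Jupyter expose the `code` and `jupyter` commands on your PATH, which depends on how they were installed. It reports the Python version without judging it, since the minimum version is not fixed above.

```python
import importlib.util
import shutil
import sys


def check_environment():
    """Report which of the tools listed above can be found on this machine."""
    report = {
        # Version of the interpreter running this script, e.g. "3.x.y".
        "python": ".".join(str(part) for part in sys.version_info[:3]),
        # pip is checked as an importable module of this interpreter.
        "pip": importlib.util.find_spec("pip") is not None,
    }
    # Assumed command names; a tool may be installed without being on PATH.
    for tool in ("git", "code", "jupyter"):
        report[tool] = shutil.which(tool) is not None
    return report


for name, status in check_environment().items():
    print(f"{name}: {status}")
```

A `False` entry does not necessarily mean the tool is missing; it may simply not be on the PATH of the shell that started Python.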