CON-3189 Fix the piscine structure in AI branch #2748

Open: wants to merge 6 commits into base: master (changes from 2 commits)
46 changes: 28 additions & 18 deletions subjects/ai/classification/README.md
@@ -1,7 +1,15 @@
# Classification
## Classification

### Overview

The goal of this day is to understand practical classification with Scikit Learn.

### Role play

Imagine you're a data scientist working for a cutting-edge medical research company. Your team has been tasked with developing a machine learning model to assist doctors in diagnosing breast cancer. You'll be using logistic regression to classify tumors as benign or malignant based on various features.

### Learning Objectives

Today we will learn a different approach in Machine Learning: classification, which is a large domain in the field of statistics and machine learning. Generally, it can be broken down into two areas:

- **Binary classification**, where we wish to group an outcome into one of two groups.
@@ -45,23 +53,23 @@ The **logloss** or **cross entropy** is the loss used for classification. Simila

_Version of Scikit Learn I used to do the exercises: 0.22_. I suggest using the most recent one. Scikit Learn 1.0 is finally available after ... 14 years.

### **Resources**
### Resources

### Logistic regression
#### Logistic regression

- https://towardsdatascience.com/understanding-logistic-regression-9b02c2aec102

### Logloss
#### Logloss

- https://towardsdatascience.com/cross-entropy-for-classification-d98e7f974451
- https://www.datacamp.com/tutorial/the-cross-entropy-loss-function-in-machine-learning

- https://medium.com/swlh/what-is-logistic-regression-62807de62efa
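
As a quick illustration of the resources above, here is a minimal sketch of how the log loss can be computed by hand with NumPy and checked against Scikit-learn's `log_loss`; the labels and probabilities below are made up for the example.

```python
import numpy as np
from sklearn.metrics import log_loss

# True labels and predicted probabilities for the positive class (illustrative values)
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.3, 0.8, 0.9])

# Cross entropy / log loss: -mean(y*log(p) + (1-y)*log(1-p))
manual = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

# The two values should match (about 0.198 here)
print(manual)
print(log_loss(y_true, y_prob))
```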

---

---

# Exercise 0: Environment and libraries
### Exercise 0: Environment and libraries

The goal of this exercise is to set up the Python work environment with the required libraries.

@@ -73,13 +81,13 @@ I recommend to use:
- the virtual environment you're most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the libraries required

1. Create a virtual environment named `ex00`, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy`, `jupyter`, `matplotlib` and `scikit-learn`.
1. Create a virtual environment named `ex00`, with a version of Python >= `3.9`, with the following libraries: `pandas`, `numpy`, `jupyter`, `matplotlib` and `scikit-learn`.

---

---

# Exercise 1: Logistic regression in Scikit-learn
### Exercise 1: Logistic regression in Scikit-learn

The goal of this exercise is to learn to use Scikit-learn to classify data.

@@ -98,7 +106,7 @@ y = [0,0,0,1,1,1,0]

---

# Exercise 2: Sigmoid
### Exercise 2: Sigmoid

The goal of this exercise is to learn to compute and plot the sigmoid function.
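
A minimal sketch of one way to compute and plot the sigmoid; the plotting range and styling are assumptions, not the exercise's required settings.

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    # Sigmoid: 1 / (1 + e^(-x)), squashes any real number into (0, 1)
    return 1 / (1 + np.exp(-x))

x = np.linspace(-10, 10, 200)
plt.plot(x, sigmoid(x))
plt.title("Sigmoid function")
plt.xlabel("x")
plt.ylabel("sigmoid(x)")
plt.show()
```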

@@ -120,11 +128,11 @@ The plot should look like this:

---

# Exercise 3: Decision boundary
### Exercise 3: Decision boundary

The goal of this exercise is to learn to fit a logistic regression on simple examples and to understand how the algorithm separates the data from the different classes.

## 1 dimension
#### 1 dimension

First, we will start as usual with features data in 1 dimension. Use `make_classification` from Scikit-learn to generate 100 data points:
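
A minimal sketch of what the data generation and fit might look like; the exact `make_classification` parameters used by the exercise are not shown here, so the ones below are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# 100 points with a single informative feature (illustrative parameters)
X, y = make_classification(
    n_samples=100,
    n_features=1,
    n_informative=1,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=42,
)

clf = LogisticRegression()
clf.fit(X, y)
print(clf.intercept_, clf.coef_)
```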

@@ -191,7 +199,7 @@ def predict_probability(coefs, X):

[ex3q6]: ./w2_day2_ex3_q5.png "Scatter plot + Logistic regression + predictions"

## 2 dimensions
#### 2 dimensions

Now, let us repeat this process on 2-dimensional data. The goal is to focus on the decision boundary and to understand how the Logistic Regression creates a line that separates the data. The code to plot the decision boundary is provided; however, it is important to understand how it works.
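
A hedged sketch of the usual way such a boundary is drawn (the exercise provides its own plotting code, which may differ): evaluate the model on a grid of points and draw the contour where the predicted probability equals 0.5.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Illustrative 2-D data; the exercise's own generator parameters may differ
X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)
clf = LogisticRegression().fit(X, y)

# Evaluate the model on a dense grid covering the feature space
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
proba = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1].reshape(xx.shape)

# The 0.5 probability contour is the decision boundary
plt.contour(xx, yy, proba, levels=[0.5])
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()
```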

@@ -247,7 +255,7 @@ The plot should look like this:

---

# Exercise 4: Train test split
### Exercise 4: Train test split

The goal of this exercise is to learn to split a classification data set. The idea is the same as splitting a regression data set, but there's one important detail specific to classification: the proportion of each class in the train set and test set.
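
A minimal sketch of a stratified split with Scikit-learn; the toy labels mirror the exercise's setup, but the test size and random seed below are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 70% of class 0, 30% of class 1 (illustrative data)
X = np.arange(100).reshape(-1, 1)
y = np.zeros(100)
y[70:] = 1

# stratify=y keeps (roughly) the same class proportions in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(y_train.mean(), y_test.mean())  # proportion of class 1 in each split
```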

@@ -271,7 +279,7 @@ y[70:] = 1

---

# Exercise 5: Breast Cancer prediction
### Exercise 5: Breast Cancer prediction

The goal of this exercise is to use Logistic Regression to predict breast cancer. It is always important to understand the data before training any Machine Learning algorithm. The data is described in **breast-cancer-wisconsin.names**. I suggest adding the column names to the DataFrame manually.
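
A minimal sketch of how the file could be loaded with manual column names; the names below are reproduced from memory and should be checked against the .names file.

```python
import pandas as pd

# Column names based on breast-cancer-wisconsin.names (verify against the file)
columns = [
    "id", "clump_thickness", "uniformity_cell_size", "uniformity_cell_shape",
    "marginal_adhesion", "single_epithelial_cell_size", "bare_nuclei",
    "bland_chromatin", "normal_nucleoli", "mitoses", "class",
]

df = pd.read_csv("data/breast-cancer-wisconsin.data", names=columns)
print(df.head())
print(df["class"].value_counts())  # in this data set, 2 = benign and 4 = malignant
```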
@@ -296,10 +304,10 @@ Preliminary:
- [Database](data/breast-cancer-wisconsin.data) and [database information](data/breast-cancer-wisconsin.names)

---
---

---

# Exercise 6: Multi-class (Optional)
### Exercise 6: Multi-class (Optional)

The goal of this exercise is to learn to train a classification algorithm on multi-class labelled data.
Some algorithms, such as SVM or Logistic Regression, do not natively support multi-class (more than 2 classes). There are some approaches that allow these algorithms to be used on multi-class data.
@@ -310,7 +318,7 @@ Let's assume we work with 3 classes: A, B and C.

More details:

- https://machinelearningmastery.com/one-vs-rest-and-one-vs-one-for-multi-class-classification/
- https://medium.com/@agrawalsam1997/multiclass-classification-onevsrest-and-onevsone-classification-strategy-2c293a91571a

Let's implement the One-vs-Rest approach from `LogisticRegression`.
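
A standalone sketch of the one-vs-rest idea; the exercise's own skeleton (`predict_one_vs_all` taking three fitted classifiers) differs, and the Iris data below is only an illustrative 3-class example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Illustrative data set with exactly 3 classes
X, y = load_iris(return_X_y=True)

# One binary classifier per class: "is it class k?" vs "is it any other class?"
classifiers = []
for k in range(3):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, (y == k).astype(int))
    classifiers.append(clf)

# For each sample, pick the class whose classifier outputs the highest probability
probas = np.column_stack([clf.predict_proba(X)[:, 1] for clf in classifiers])
predictions = probas.argmax(axis=1)
print((predictions == y).mean())  # accuracy of the one-vs-rest scheme
```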

@@ -354,6 +362,8 @@ def predict_one_vs_all(X, clf0, clf1, clf2 ):
return classes
```

- https://randerson112358.medium.com/python-logistic-regression-program-5e1b32f964db
Resources:

- https://www.kaggle.com/code/rahulrajpandey31/logistic-regression-from-scratch-iris-data-set

- https://towardsdatascience.com/logistic-regression-using-python-sklearn-numpy-mnist-handwriting-recognition-matplotlib-a6b31e2b166a
8 changes: 4 additions & 4 deletions subjects/ai/classification/audit/README.md
@@ -6,7 +6,7 @@

##### Run `python --version`

###### Does it print `Python 3.x`? x >= 8?
###### Does it print `Python 3.x`? x >= 9?

###### Does `import jupyter`, `import numpy`, `import pandas`, `import matplotlib` and `import sklearn` run without any error?

@@ -31,7 +31,6 @@ Score:
0.7142857142857143
```


---

---
@@ -73,9 +72,9 @@ Coefficient: [[1.18866075]]

###### For question 4, does `predict_probability` output the same probabilities as `predict_proba`? Note that the values have to match one of the class probabilities, not both. To do so, compare the output with: `clf.predict_proba(X)[:,1]`. The shape of the arrays is not important.

###### Does `predict_class` output the same classes as `clf.predict(X)` for question 5? The shape of the arrays is not important.

###### Does the plot for question 6 look like the plot below? As mentioned, it is not required to shift the class prediction to make the plot easier to understand.

![alt text][ex3q6]

@@ -193,6 +192,7 @@ As said, for some reasons, the results may be slightly different from mine becau
---

#### Bonus

#### Exercise 6: Multi-class (Optional)

##### The exercise is validated if all questions of the exercise are validated
30 changes: 19 additions & 11 deletions subjects/ai/data-wrangling/README.md
@@ -1,13 +1,21 @@
# Data wrangling
## Data wrangling

Data wrangling is one of the crucial tasks in data science and analysis which includes operations like:
### Overview

Data wrangling is one of the crucial tasks in data science and analysis.

### Role Play

You are a newly hired data analyst at a major e-commerce company. Your first assignment is to clean and prepare various datasets for analysis. The company's data comes from multiple sources and in different formats. Your manager has tasked you with combining these datasets, dealing with missing or inconsistent data, and preparing summary reports. You'll need to use your data wrangling skills to transform raw data into a format suitable for analysis and visualization.

### Learning Objectives

- Data Sorting: To rearrange values in ascending or descending order.
- Data Filtration: To create a subset of available data.
- Data Reduction: To eliminate or replace unwanted values.
- Data Access: To read or write data files.
- Data Processing: To perform aggregation, statistical, and similar operations on specific values.
Ax explained before, Pandas is an open source library, specifically developed for data science and analysis. It is built upon the Numpy (to handle numeric data in tabular form) package and has inbuilt data structures to ease-up the process of data manipulation, aka data munging/wrangling.
As explained before, Pandas is an open source library, specifically developed for data science and analysis. It is built upon the Numpy (to handle numeric data in tabular form) package and has inbuilt data structures to ease-up the process of data manipulation, aka data munging/wrangling.

### Exercises of the day

@@ -45,7 +53,7 @@ I suggest to use the most recent one.

---

# Exercise 0: Environment and libraries
### Exercise 0: Environment and libraries

The goal of this exercise is to set up the Python work environment with the required libraries.

@@ -57,13 +65,13 @@ I recommend to use:
- the virtual environment you're most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the libraries required

1. Create a virtual environment named `ex00`, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy` ,`tabulate` and `jupyter`.
1. Create a virtual environment named `ex00`, with a version of Python >= `3.9`, with the following libraries: `pandas`, `numpy` ,`tabulate` and `jupyter`.

---

---

# Exercise 1: Concatenate
### Exercise 1: Concatenate

The goal of this exercise is to learn to concatenate DataFrames. The logic is the same for the Series.
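
A minimal sketch of row-wise concatenation; `df1` here is an assumption, the exercise provides its own DataFrames.

```python
import pandas as pd

# Illustrative frames; the exercise provides its own df1 and df2
df1 = pd.DataFrame([['a', 1], ['b', 2]], columns=['letter', 'number'])
df2 = pd.DataFrame([['c', 1], ['d', 2]], columns=['letter', 'number'])

# Stack the rows of df2 below df1 and rebuild a clean 0..n-1 index
result = pd.concat([df1, df2], ignore_index=True)
print(result)
```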

@@ -82,7 +90,7 @@ df2 = pd.DataFrame([['c', 1], ['d', 2]],

---

# Exercise 2: Merge
### Exercise 2: Merge

The goal of this exercise is to learn to merge DataFrames.
The logic of merging DataFrames in Pandas is quite similar to the one used in SQL.
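
A minimal sketch of the SQL analogy with made-up frames (the exercise generates its own `df1` and `df2`):

```python
import pandas as pd

# Illustrative frames; the exercise generates its own df1 and df2
df1 = pd.DataFrame({'id': [1, 2, 3], 'Feature1': ['A', 'B', 'C']})
df2 = pd.DataFrame({'id': [2, 3, 4], 'Feature2': ['X', 'Y', 'Z']})

# Equivalent of a SQL INNER JOIN on id: only ids present in both frames survive
inner = pd.merge(df1, df2, on='id', how='inner')

# Equivalent of a SQL LEFT JOIN: every row of df1 is kept, missing values become NaN
left = pd.merge(df1, df2, on='id', how='left')

print(inner)
print(left)
```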
@@ -132,7 +140,7 @@ df2 = pd.DataFrame(df2_dict, columns = ['id', 'Feature1', 'Feature2'])

---

# Exercise 3: Merge MultiIndex
### Exercise 3: Merge MultiIndex

The goal of this exercise is to learn to merge DataFrames with MultiIndex.
Use the code below to generate the DataFrames. `market_data` contains fake market data. In finance, the market is available during the trading days (business days). `alternative_data` contains fake alternative data from social media. This data is available every day. But, for some reason, the Data Engineer lost the last 15 days of alternative data.
@@ -171,7 +179,7 @@ Use the code below to generate the DataFrames. `market_data` contains fake marke

---

# Exercise 4: Groupby Apply
### Exercise 4: Groupby Apply

The goal of this exercise is to learn to group the data and apply a function to the groups.
The use case we will work on is computing
@@ -241,7 +249,7 @@ Here is what the function should output:

---

# Exercise 5: Groupby Agg
### Exercise 5: Groupby Agg

The goal of this exercise is to learn to compute different types of aggregations on the groups. This small DataFrame contains products and prices.
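
A minimal sketch of several aggregations at once; the products and prices below are made up, the exercise supplies its own DataFrame.

```python
import pandas as pd

# Illustrative products/prices frame; the exercise supplies its own data
df = pd.DataFrame({
    'product': ['table', 'chair', 'chair', 'table', 'mobile phone'],
    'value': [99.5, 20.0, 22.5, 120.0, 350.0],
})

# One row per product, several aggregations of the price column at once
summary = df.groupby('product')['value'].agg(['min', 'max', 'mean'])
print(summary)
```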

@@ -269,7 +277,7 @@ Note: The columns don't have to be MultiIndex

---

# Exercise 6: Unstack
### Exercise 6: Unstack

The goal of this exercise is to learn to unstack a MultiIndex.
Let's assume we trained a machine learning model that predicts a daily score on the companies (tickers) below. It may be very useful to unstack the MultiIndex: plot the time series, vectorize the backtest, ...
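
A minimal sketch of the idea with a made-up (date, ticker) index; the exercise provides its own tickers and scores.

```python
import numpy as np
import pandas as pd

# Illustrative MultiIndex (date, ticker) series of daily model scores
dates = pd.date_range('2021-01-01', periods=5, freq='B')
tickers = ['AAPL', 'MSFT']
index = pd.MultiIndex.from_product([dates, tickers], names=['date', 'ticker'])
scores = pd.Series(np.random.randn(len(index)), index=index, name='score')

# Unstack the inner level: one column per ticker, one row per date,
# which makes it easy to plot each time series or vectorize a backtest
wide = scores.unstack(level='ticker')
print(wide)
```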
4 changes: 2 additions & 2 deletions subjects/ai/data-wrangling/audit/README.md
@@ -6,7 +6,7 @@

##### Run `python --version`.

###### Does it print `Python 3.x`? x >= 8
###### Does it print `Python 3.x`? x >= 9

###### Does `import jupyter`, `import numpy` and `import pandas` run without any error?

@@ -52,7 +52,7 @@
| 5 | 6 | nan | nan | O | P |
| 6 | 7 | nan | nan | Q | R |
| 7 | 8 | nan | nan | S | T |

Note: Check that the suffixes are set using the `suffixes` parameter rather than by manually changing the column names.

---
28 changes: 19 additions & 9 deletions subjects/ai/keras-2/README.md
@@ -1,4 +1,14 @@
# Keras 2
## Keras 2

### Overview

This exercise set focuses on advanced applications of Keras for building and training neural networks. You'll work on both regression and multi-class classification problems, using real-world datasets like the Auto MPG and Iris datasets.

### Role Play

You're a data scientist at a biotech company developing AI-powered systems for various applications. Your current project involves creating neural networks for both regression and multi-class classification tasks. You'll be working on predicting car fuel efficiency and classifying flower species, showcasing the versatility of neural networks in different domains.

### Learning Objectives

The goal of this day is to learn to use Keras to build Neural Networks and train them on small data sets. This helps to understand the specifics of networks for classification and regression.

@@ -28,15 +38,15 @@ The audit will provide the code and output because it is not straightforward to
_Version of Keras I used to do the exercises: 2.4.3_.
I suggest using the most recent one.

### **Resources**
### Resources

- https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/

---

---

# Exercise 0: Environment and libraries
### Exercise 0: Environment and libraries

The goal of this exercise is to set up the Python work environment with the required libraries.

@@ -48,13 +58,13 @@ I recommend to use:
- the virtual environment you're the most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the libraries required

1. Create a virtual environment named with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy`, `jupyter` and `keras`.
1. Create a virtual environment named with a version of Python >= `3.9`, with the following libraries: `pandas`, `numpy`, `jupyter` and `keras`.

---

---

# Exercise 1: Regression - Optimize
### Exercise 1: Regression - Optimize

The goal of this exercise is to learn to set up the optimization for a regression neural network. There's no code to run in that exercise. In W2D2E3, we implemented a neural network designed for regression. We will be using this neural network:

@@ -88,7 +98,7 @@ https://keras.io/api/metrics/regression_metrics/
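
A hedged sketch of what a regression compilation step can look like; the architecture below is illustrative (the exercise uses the network from W2D2E3) and MSE/MAE are typical choices, not necessarily the ones the exercise requires.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Small regression network; architecture is illustrative only
model = keras.Sequential([
    layers.Dense(8, activation='relu', input_shape=(5,)),
    layers.Dense(4, activation='relu'),
    layers.Dense(1, activation='linear'),   # single linear output for regression
])

# Typical regression setup: MSE as the loss, MAE reported as a metric
model.compile(optimizer='adam', loss='mse', metrics=['mae'])
model.summary()
```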

---

# Exercise 2: Regression example
### Exercise 2: Regression example

The goal of this exercise is to learn to train a neural network to perform a regression on a data set.
The data set is [Auto MPG Dataset](auto-mpg.csv) and the goal is to build a model to predict the fuel efficiency of late-1970s and early 1980s automobiles. To do this, provide the model with a description of many automobiles from that time period. This description includes attributes like: cylinders, displacement, horsepower, and weight.
@@ -109,7 +119,7 @@ https://www.tensorflow.org/tutorials/keras/regression

---

# Exercise 3: Multi classification - Softmax
### Exercise 3: Multi classification - Softmax

The goal of this exercise is to learn to design a neural network architecture for multi-class data. This is an important type of problem on which to practice with neural networks because the three class values require specialized handling. A multi-classification neural network uses as output layer a **softmax** layer. The **softmax** activation function is an extension of the sigmoid as it is designed to output the probabilities to belong to each class in a multi-class problem. This output layer has to contain as many neurons as there are classes in the multi-classification problem. This article explains in detail how it works. https://developers.google.com/machine-learning/crash-course/multi-class-neural-networks/softmax
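
A minimal sketch of such an output layer for a 3-class problem; the hidden layer size and input shape below are assumptions, not the exercise's required architecture.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Illustrative 3-class network: the output layer has one neuron per class and
# a softmax activation, so the three outputs sum to 1 and can be read as
# class probabilities.
model = keras.Sequential([
    layers.Dense(8, activation='relu', input_shape=(4,)),
    layers.Dense(3, activation='softmax'),
])
model.summary()
```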

@@ -126,7 +136,7 @@ Let us assume we want to classify images and we know they contain either apples,

---

# Exercise 4: Multi classification - Optimize
### Exercise 4: Multi classification - Optimize

The goal of this exercise is to learn to optimize a multi-classification neural network. As learnt previously, the loss function used in binary classification is the log loss - also called in Keras `binary_crossentropy`. This function is defined for binary classification and can be extended to multi-classification. In Keras, the extended loss that supports multi-classification is `categorical_crossentropy`. There's no code to run in this exercise.

@@ -142,7 +152,7 @@ model.compile(loss='',#TODO1

---

# Exercise 5 Multi classification example
### Exercise 5 Multi classification example

The goal of this exercise is to learn to use a neural network to classify a multi-class data set. The data set used is the Iris data set, which allows classifying flowers given basic features such as the flower's measurements.
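
A minimal end-to-end sketch, assuming the Iris data is loaded from Scikit-learn (the exercise may instead read it from a local file) and using an illustrative architecture and number of epochs.

```python
from sklearn.datasets import load_iris
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.utils import to_categorical

# Load Iris (4 features, 3 classes) and one-hot encode the labels
X, y = load_iris(return_X_y=True)
y_onehot = to_categorical(y, num_classes=3)

# Small multi-class network with a softmax output layer
model = keras.Sequential([
    layers.Dense(8, activation='relu', input_shape=(4,)),
    layers.Dense(3, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

model.fit(X, y_onehot, epochs=50, verbose=0)
print(model.evaluate(X, y_onehot, verbose=0))  # [loss, accuracy] on the training data
```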

3 changes: 1 addition & 2 deletions subjects/ai/keras-2/audit/README.md
@@ -6,7 +6,7 @@

##### Run `python --version`.

###### Does it print `Python 3.x`? x >= 8
###### Does it print `Python 3.x`? x >= 9

###### Do `import jupyter`, `import numpy`, `import pandas` and `import keras` run without any error?

@@ -131,7 +131,6 @@ model.compile(loss='categorical_crossentropy',

---


#### Exercise 5: Multi classification example

##### The exercise is validated if all questions of the exercise are validated