
Project: Module 17 Practical Application Example 3

Author: Vijay Chaganti

Dataset information

The provided dataset contains information on 41,188 customers: 41,188 entries across 21 columns.

Data Source: https://github.com/chagantvj/PracticalApplicationM17/blob/main/bank-additional-full.csv

Python Code: https://github.com/chagantvj/PracticalApplicationM10/blob/main/VijayChaganti_Module17_Practical_Example3.ipynb

Data Understanding and Cleaning

Each column of the data frame has 41,188 non-null entries before cleaning.

From the given data:
1. The columns named 'default', 'housing' and 'loan' are likely to play an active role in whether customers commit to a term plan, since such a commitment puts financial strain on an individual.
2. Given the priority of these three columns, rows with 'unknown' values in them are removed. The number of unknowns is insignificant compared to the rows with valid values such as yes or no (see the sketch below this list).
3. The column named 'job' has many distinct values, which can lead to overfitting of a model, and was therefore dropped.
4. Columns such as cpi, cci and employed are given lower priority, as that data does not play an active role in an individual's decision to commit to a term loan or not.
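
As a quick check on point 2, the share of 'unknown' values in the three priority columns can be inspected directly. A minimal sketch, assuming the CSV uses ';' as its separator (as the UCI bank-additional format does):

import pandas as pd

# Load the raw data (the ';' separator is an assumption based on the UCI bank-additional format)
df = pd.read_csv('bank-additional-full.csv', sep=';')

# Fraction of 'unknown' entries in each priority column
for col in ['default', 'housing', 'loan']:
    print(col, round((df[col] == 'unknown').mean(), 4))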

 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   age          41188 non-null  int64  
 1   job          41188 non-null  object 
 2   marital      41188 non-null  object 
 3   education    41188 non-null  object 
 4   default      41188 non-null  object 
 5   housing      41188 non-null  object 
 6   loan         41188 non-null  object 
 7   contact      41188 non-null  object 
 8   month        41188 non-null  object 
 9   day_of_week  41188 non-null  object 
 10  duration     41188 non-null  int64  
 11  campaign     41188 non-null  int64  
 12  pdays        41188 non-null  int64  
 13  previous     41188 non-null  int64  
 14  poutcome     41188 non-null  object 
 15  evr          41188 non-null  float64
 16  cpi          41188 non-null  float64
 17  cci          41188 non-null  float64
 18  e3m          41188 non-null  float64
 19  employed     41188 non-null  float64
 20  y            41188 non-null  object 
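
The column summary above is the output of pandas' DataFrame.info() on the loaded data frame:

df.info()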

Removing 'unknown' rows and converting categorical columns to numerical values

# Count rows that contain 'unknown' in any column
countUnknown = (dfm == 'unknown').any(axis=1).sum()
print(countUnknown)
# 9359

The actual removal and the yes/no-to-1/0 mapping are applied once in the data-processing code below.

Histograms of the given dataset

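These histograms can be reproduced with pandas' built-in plotting; a minimal sketch (the figure size and bin count are assumptions):

import matplotlib.pyplot as plt

dfm.hist(figsize=(12, 8), bins=30)
plt.tight_layout()
plt.show()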

Line graph of average calls per campaign

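The exact grouping behind the original figure is not recoverable from this README; one plausible, purely hypothetical reconstruction averages the 'campaign' column (number of contacts) by client age:

import matplotlib.pyplot as plt

# Hypothetical reconstruction: average number of contacts per campaign, grouped by age
dfm.groupby('age')['campaign'].mean().plot(kind='line')
plt.xlabel('age')
plt.ylabel('average contacts per campaign')
plt.show()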

Code for data processing

# Drop the lower-priority columns
dfm = df.drop(['marital','job','education','month','day_of_week','pdays','previous','poutcome','cpi','cci','evr','e3m','contact'], axis=1)
# Drop rows with 'unknown' in the three priority columns
dfm = dfm[~dfm[['loan', 'housing', 'default']].isin(['unknown']).any(axis=1)]
# Map yes/no categories to 1/0 (the target 'y' must be mapped before taking its mean)
dfm['loan'] = dfm['loan'].replace({'yes': 1, 'no': 0})
dfm['default'] = dfm['default'].replace({'yes': 1, 'no': 0})
dfm['housing'] = dfm['housing'].replace({'yes': 1, 'no': 0})
dfm['y'] = dfm['y'].replace({'yes': 1, 'no': 0})
dfm['y'].mean()  # subscription (positive-class) rate after cleaning
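
A quick sanity check (a minimal sketch) confirms that no 'unknown' values remain and shows how many rows survive the filter:

# No 'unknown' entries should be left after the filter above
assert not (dfm == 'unknown').any(axis=1).any()
print(dfm.shape)  # rows and columns remaining after cleaning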

Heatmap without the columns 'loan', 'housing' and 'default'

Heatmap with the columns 'loan', 'housing' and 'default'
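
Correlation heatmaps like these are commonly drawn with seaborn; a minimal sketch (the colormap and annotation settings are assumptions):

import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(dfm.corr(numeric_only=True), annot=True, fmt='.2f', cmap='coolwarm')
plt.show()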

Code for modeling

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Separate features and target, then hold out 20% for testing
X = dfm.drop(columns='y')
y = dfm['y']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

# Scale the numeric columns and fit a logistic regression baseline
num_columns = X_train.select_dtypes(["int", "float"]).columns
num_transformer = StandardScaler()
preprocessor = ColumnTransformer(transformers=[('num', num_transformer, num_columns)])
pipeline = Pipeline(steps=[("preprocessor", preprocessor), ("classifier", LogisticRegression())])
pipeline.fit(X_train, y_train)
print(f"Train data accuracy: {pipeline.score(X_train, y_train):.2f}")
print(f"Test data accuracy: {pipeline.score(X_test, y_test):.2f}")

# Train data accuracy: 0.89
# Test data accuracy: 0.89
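
For context, an accuracy of 0.89 should be read against the majority-class baseline, since far more clients decline than subscribe. A quick check using the 0/1-mapped target:

# Accuracy of always predicting the majority class
baseline = max(y.mean(), 1 - y.mean())
print(f"Majority-class baseline accuracy: {baseline:.2f}")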

Confusion Matrix

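One way to produce such a confusion matrix with scikit-learn, a minimal sketch using the fitted pipeline from above:

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Plot the confusion matrix for the held-out test set
ConfusionMatrixDisplay.from_estimator(pipeline, X_test, y_test)
plt.show()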

Code for Model Comparison

import time
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

models = {
    'KNN': KNeighborsClassifier(),
    'Logistic Regression': LogisticRegression(),
    'Decision Tree': DecisionTreeClassifier(),
    'SVM': SVC(probability=True)  # SVM with probability estimates
}

results = {}
for model_name, model in models.items():
    print(f"Training {model_name}...")

    # Fit the model and record the training time
    start_time = time.time()
    model.fit(X_train, y_train)
    runtime = time.time() - start_time

    # Predict on the test and train sets
    y_pred_test = model.predict(X_test)
    y_pred_prob_test = model.predict_proba(X_test)[:, 1]  # probabilities for the positive class
    y_pred_train = model.predict(X_train)
    y_pred_prob_train = model.predict_proba(X_train)[:, 1]

    # Calculate evaluation metrics
    test_accuracy = accuracy_score(y_test, y_pred_test)
    train_accuracy = accuracy_score(y_train, y_pred_train)

    # Store the results
    results[model_name] = {
        'Test Accuracy': test_accuracy,
        'Train Accuracy': train_accuracy,
        'Runtime': runtime
    }
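
The comparison table below can then be assembled from the results dictionary (a minimal sketch):

# Turn the per-model metrics into a readable table
results_df = pd.DataFrame(results).T
print(results_df)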

Model Comparison

                     Test Accuracy  Train Accuracy  Runtime (s)
KNN                       0.886706        0.916474     0.035695
Logistic Regression       0.880578        0.877696     0.119656
Decision Tree             0.858265        0.999175     0.078799
SVM                       0.877121        0.875850   102.151783

Improving Model

from sklearn.model_selection import GridSearchCV

# Tune KNN inside the same scaling pipeline
pipeline = Pipeline(steps=[("preprocessor", preprocessor), ("classifier", KNeighborsClassifier())])

param_grid_knn = {
    'classifier__n_neighbors': [3, 5, 7, 9, 11],
    'classifier__weights': ['uniform', 'distance'],
    'classifier__p': [1, 2]
}

grid_search_knn = GridSearchCV(pipeline, param_grid_knn, cv=5, scoring='accuracy', verbose=1)
grid_search_knn.fit(X_train, y_train)

print(f"Best Parameters for KNN: {grid_search_knn.best_params_}")
print(f"Best Score for KNN: {grid_search_knn.best_score_}")

# Tune the Decision Tree the same way
pipeline = Pipeline(steps=[("preprocessor", preprocessor), ("classifier", DecisionTreeClassifier())])

param_grid_dt = {
    'classifier__max_depth': [None, 10, 20, 30],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4]
}

grid_search_dt = GridSearchCV(pipeline, param_grid_dt, cv=5, scoring='accuracy', verbose=1)
grid_search_dt.fit(X_train, y_train)

print(f"Best Parameters for Decision Tree: {grid_search_dt.best_params_}")
print(f"Best Score for Decision Tree: {grid_search_dt.best_score_}")

>>> Fitting 5 folds for each of 20 candidates, totalling 100 fits
    Best Parameters for KNN: {'classifier__n_neighbors': 11, 'classifier__p': 2, 'classifier__weights': 'uniform'}
    Best Score for KNN: 0.8896003232330717

>>> Fitting 5 folds for each of 36 candidates, totalling 180 fits
    Best Parameters for Decision Tree: {'classifier__max_depth': 10, 'classifier__min_samples_leaf': 2, 'classifier__min_samples_split': 2}
    Best Score for Decision Tree: 0.8885789283372676
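
As a natural follow-up (not shown in the original code), the tuned estimators can be evaluated on the held-out test set:

# Score the best estimators found by each grid search
print(f"Tuned KNN test accuracy: {grid_search_knn.best_estimator_.score(X_test, y_test):.3f}")
print(f"Tuned Decision Tree test accuracy: {grid_search_dt.best_estimator_.score(X_test, y_test):.3f}")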

Conclusion & Recommendation

Across the different models and their results, the test accuracy lies between 0.85 and 0.89, which is also close to the best cross-validation scores given by the tuned Decision Tree and KNN. From this I conclude that these models can be used to predict whether a client will subscribe to the term products offered by the bank; since subscribers are a small fraction of the data, the accuracy figures are best read alongside the confusion matrix above.
